List of web archiving initiatives


This article contains a list of web archiving initiatives worldwide. For easier reading, the information is divided into three tables: web archiving initiatives, archived data, and access methods.
These initiatives may use one or more standard web archiving file formats, their own proprietary file formats, or both.
This Wikipedia page was originally generated from the results obtained for the research paper A survey on web archiving initiatives, published by the team at the time.

Archived data

Name | Archived Contents | Disk Space Occupied | Archive Format | TLD/Broad Crawls | Selective Crawls | Comments
EU Web ArchiveWARC.EUY.EU 250 websites in europa.eu domain and subdomains, crawled once per quarter + ad hoc crawls on request of website owners. Status Feb 2019.
Australia's Web Archive11000600WARC.AUY.AU crawls : 10.15 billion files. Selective crawls : 755 million files. AGWA : 525 million files.
Our digital island, a Tasmanian Web Archive0.336HTTrackYPreserves online content related to Tasmania. ODI has operated since its inception under the assumption that web sites fall within the definition of 'Book' in the Tasmanian Library Act 1984. Thus, no permission to capture from publishers is required.
Webarchive Austria4095164ARC.AT, .wien, .tirolYA copy of the data is stored in a high security data storage unit.
Deutsche NationalbibliothekWARC.DEYOnly one experimental TLD crawl.
DILIMAG 0.030.996ARCProject from 2007-03-01 until 2010-12-23. The DILIMAG project collected, described and archived digital German literary magazines.
Bibliothèque et Archives nationales du Québec 16731ARC/WARCYHarvesting began in 2009. Selective crawls of Quebec websites.
Government of Canada Web Archive 175070ARC/WARC.GC.CAYWeb archiving at Library and Archives Canada began in 2005 and concentrated on collecting the federal government web presence and capturing the federal elections, the Olympics, and Canadian commemorative events. Thematic web collections of Canadiana research interest have been curated as an ongoing program activity since 2009.
Web Information Collection and Preservation - WICP .GOV.CNYHarvests web pages about events that have a great influence on society, the economy and so on, as well as sites in the 'gov.cn' domain.
Croatian Web Archive 23113Mirror, WARC.HRYSince 2004, selective harvesting of over 5000 web resources. Since 2011, annual harvesting of the national .hr domain as well as thematic harvesting. All archived content is publicly available via the HAW website.
Webarchiv 9412350ARC/WARC.CZYHarvesting began in 2001.
Netarkivet.dk / The Danish web archive 36000634ARC/WARC.DKY+36 billion objects:
  • html : 19077101525
  • image : 5859756918
  • other : 4080719309
  • text : 757030275
  • pdf : 97318057
  • audio : 8166680
  • video : 7085143
  • word : 47510
  • powerpoint : 5660
  • excel : 4721
  • Snapshot harvesting
  • Selective harvesting
  • Event harvesting
  • Special harvesting

Estonian Web Archive87456ARC/WARC.EEYThe archive consists of selective, event and topical crawls since 2010. Whole national domain crawls have been done yearly since 2015. Besides the .ee TLD, Estonia-related web content is harvested from other TLDs such as .eu, .org and .com.
Finnish Web Archive4300300ARC/WARC / .json / .mp4.FI, .AXYAlso crawls content hosted on machines physically located in Finland, independently from their domain.
BnF - Web Legal Deposit48 0001 800ARC/WARC.FR + all sites hosted in FranceYBnF is making copies of all sites in the .FR TLD, as well as all sites hosted and produced in France, ignoring both the Robots exclusion standard and the licenses of the documents.
BnL Web-Archive54341WARC.LUYThe BnL conducts 2 domain crawls per year, as well as event-based and selective crawls.
Ina (Institut National de l'Audiovisuel)1058002359DAFFYAs of 2021-03-08
DAFF handles full content deduplication, so the size on disk takes into account compression and deduplication; the equivalent disk storage in compressed ARC format would be approximately 10 PB
E-diaspora (Télécom ParisTech, FMSH)103013DAFFYDAFF handles full content deduplication, so the size on disk takes into account compression and deduplication; the equivalent disk storage in compressed ARC format would be approximately 51 TB
Internet Memory Foundation180WARCCan be done by partnersYFormerly European Archive. Collaborates with Internet Memory Research, which provides the ArchiveTheNet Service. Selective crawls and domain crawls; expected to grow to 1PB in 2012. New datacenter and a new crawler in 2012.
Bibliotheksservice-Zentrum Baden-Württemberg9WARCYWebsites of about 20 cities, municipalities, districts + their associated corporations, and state libraries are collected by BSZ in commission within various Archive-It collections. Public access. Data storage: San Francisco as well as backup with Baden-Wuerttemberg storage infrastructure.
Web archive of the German BundestagYGerman Federal Parliament. Selective. Snapshots of www.bundestag.de and other web presences of the German Bundestag are made at regular intervals or at certain events. These are available in the web archive.
Iceland
Palestine Web ArchiveARC/WARC.PSY.PS crawls: pilot crawls. Selective crawls.
Web Archiving Project, The National Diet Library, Japan126701313WARCYas of March 2023
15 TB of selective crawls based on permission. Started the web archiving of official institution sites based on the legislation from April 2010.
National Library of Korea - OASIS 24YRequires consent before archiving. Targets 56,401 Websites. Web archiving is managed under Digital resource management systems. In 2011 web archiving system will be rebuilt.
Koninklijke Bibliotheek40736WARCYSelective crawls of ca. 20.400 sites
New Zealand Web Archive4300260ARC/WARC.NZY.NZ crawls : 4+ billion URLS. Selective crawls 33,500 websites. Legal deposit covers born digital material.
The National Library of Norway
Arquivo.pt21 1181 455ARC/WARCFocused on .PT but also other domainsY.PT domain crawls and integration of external collections since 2007, and daily crawls of a selection of online publications since 2010. Selective crawls related to national events such as elections or international content related to science such as websites about Research & Development projects funded by the European Union.
Web archive of Cacak0.2550.013HTTrackYSelective crawls of 130 sites related to the city of Cacak. Collaboration with the Webarchiv team from the National Library of the Czech Republic.
Web Archive SingaporeWARC.SGYSelective crawls of Singapore-related sites and .SG domain archiving.
Digital Resources 1 92189WARC.SK + other TLDs with Slovak contentYHarvesting of the Slovak web started in 2015. Since then ULB has performed six full-domain harvests, multiple selective crawls and thematic crawls.
Slovenian Web Archive30WARCSelective crawls since 2007, national domain crawls since 2014.
Archivo de la Web Española2539117WARC.ESYDomain .ES crawls: 2,421 million files in collaboration with Internet Archive. Selective crawls: 119 million files. About 30 news media sites crawled every day. Not launched publicly yet.
PADICAT: The Web Archive of Catalonia62032,5ARC/WARC.CATYIn accordance with the general trend, the archive model is a hybrid system consisting of: mass compilation of open-access digital resources published on the Internet; systematic archiving of the web site output of Catalan organizations; and fostering of lines of research through themed integration of the digital resources pertaining to specific events in Catalan public life.
Basque Digital Heritage Archive210.8ARCY
Sweden 5700360Multipart MIME.se, Swedish.nu and geolocation for other tld'sYBulk crawls approximately twice a year.
Selective crawls of about 140 newspapers every day.
Aleph Archives>10000000>25Native HTML, WARC, WARC2, ARC and HTTrack to WARC migration toolsYEnterprise-grade automatic web archiving platform for online capture and preservation. Support eDiscovery with powerful and qualitative technology.
Aimed at corporations, institutions and agencies seeking to capture, preserve and leverage their Web content; dynamic websites, wikis, social media, forums, comments, disclaimers, and ads, for compliance, marketing or pure preservation purposes.
Web Archive Switzerland80ARC, WARCYMainly selected .ch crawls
NTU Web Archiving System, NTUWAS20014Y
Web Archive Taiwan
The UK Web Archive20.6WARCYSelective crawls with previous permission. Now also conducting wholesale UK domain-scale crawls under Non-Print Legal Deposit legislation, enacted April 2013. This content will only be available on premises controlled by one of the six legal deposit libraries. The UKWA is a spin-off from the UK Web Archiving Consortium that ended in 2007.
Hanzo Archives7WARCYCommercial web archiving services and appliances, for government and corporations whose compliance or legal obligations / needs extend to their websites, intranet, and social media. Many 'dark' archives across Europe and USA.
UK Government Web Archive1000 +150ARC
WARC post July 2017
Between 2003 - 2005 the Internet Archive undertook the technical side of web archiving on behalf of The UK Government Web Archive. Between 2005 - July 2017 the technical side of the web archiving service was contracted out to the Internet Memory Foundation. From July 2017 MirrorWeb took over the contract and moved the entire archive to the cloud. The UK Government Web Archive was part of the UK Web Archiving Consortium from 2004 - 2009.
Internet Archive 69000021000WorldwideYProvides the Archive-It service and leads the Archive-access project. Collection is mirrored at Bibliotheca Alexandrina in Egypt.
Columbia University Libraries Web Resources Collection Program72350.4ARC/WARCYSelective crawls with permission or notification. Thematic collections in: Human rights; New York City built environment; New York City religions; Resistance. Also capture Columbia University web domain.
North Carolina State Government Web Site Archives51.53.8WARCY
Latin American Web Archiving ProjectY
Web Archiving Project for the Pacific Islands5.5ARC/WARCYIncludes sites of 18 countries.
Library of Congress Web Archives7741420ARC/WARCYFormerly MINERVA. Selective crawls with notification and permission; primarily event and thematic collections.
Harvard University Library: the Web Archive Collection Service 190.661ARCYSelective crawls with no previous authorization.
Web Archiving Service from California Digital Library 21625.2ARC/WARCCan be done by partnersYProvides Web Archiving Service to partners worldwide. Was developed at the California Digital Library.
Bentley Historical Library Web Archives34.52.6ARC/WARCYWAS service since 2010.
University of Texas at San Antonio Web Archives261.135ARC/WARCYUniversity administration, faculty and student sites; as well as selective captures on San Antonio and South Texas subject areas, including San Antonio organizations; San Antonio Online Journals and Blogs; Tejano and Conjunto music; Gay, Lesbian, Bisexual, Transgender and Queer Related Web sites in Texas, San Antonio and the Rio Grande Valley; Immigration/Borderlands; Mexican Cooking Blogs; San Antonio Restaurants; Renewable Energy in Texas; Rio Grande Valley Organizations; and Rio Grande Watershed and Texas Water Issues.
AUEB Web Archive3WARCaueb.grNThe amount of data crawled from the domain aueb.gr ranges between 10GB and 14.9GB. The data is stored on disk compressed and requires between 8.8GB and 9.7GB, resulting in space savings between 12% and 35%. In the case of new crawl, we can only store on disk the Web pages that change since the previous crawl. Consequently, we crawled 13.1GB from the domain aueb.gr, but we only stored on disk 1.6GB, resulting in space savings of 88%.
World Bank Web Archives0.143HTTrackno, so farY450 sites with historical or research value have been harvested since 2007, each archived before being taken offline or before a major upgrade.
University of North Texas CyberCemetery0.887WARC.govY
Bibliotheca Alexandrina's Internet Archive800001000ARC/WARCEgyptian news and politicsY
York University Digital Library0.435WARCyorku.ca + faculty requestsY
Netherlands Institute for Sound and Vision web archiveARC/WARCYAmong other av-heritage, Sound and Vision is tasked with archiving programmes broadcast by Dutch Public Broadcasters. Therefore, an important part of the web archive consists of websites of public broadcaster related to these programmes. Furthermore, websites are archived that do not have a direct link to the collection, but that are of interest in a broader, media-historical way. Examples are websites of commercial broadcasters.
Kentucky Department for Libraries and Archives30.3007WARCY
University of California, San Francisco Library12.50.587ARC/WARCYWebsites requested by staff and faculty, and growing list attempting to capture all UCSF websites as comprehensively as possible.
Ivy Plus Libraries Confederation34716ARC/WARCYSelective crawls with notification. Thematic collections in politics and political protests, architecture, composers, design, gaming, geology, webcomics, documentary films, art, religion, sexuality, climate change, and more.
Malaysian Government Web Archive 10WARC.GOV.MYYCrawls only Malaysian public sector websites. View is by subject, i.e. administration, economy, security, and social.
National Library of Medicine 1229.1WARCY-
Smithsonian Libraries and Archives 10WARCY
Common Crawl300 00010 000ARC/WARCworldwideYAdditional data products such as a graph of the web, and parquet indexes of urls and hosts.
67001120WARC, FFV1, FLAC, JSONLMultiple YGlobal archive spanning user-generated content, obsolete web platforms, and interface artifacts. Indexes include defunct CMS exports, blog comment trees, forum structures, and visual UI states. Selective crawls emphasize digital ephemera recovery and platform shutdown captures. Data verified across five mirrored nodes. Status: Active.

Access methods

Name | URL history | Meta-data search | Full-text search | Memento Compliance | Comments
EU Web ArchiveYYYFreely accessible to all via
Australia's Web ArchiveYYYNoSelected sites are publicly available through a directory structure. Domain harvests are not. The PANDORA Archive is indexed and searchable through the NLA's single search service Trove.
The Australian Domain Harvests are full-text indexed but are not currently publicly available. The Australian Government Web Archive is searchable by URL and full-text indexes through its portal.
Our digital island, a Tasmanian Web ArchiveYYNNoPresents thumbnails generated through Html To Image supplemented in HTTrack. Information is organized in directory: A-Z Subject listing, A-Z Title listing.
Webarchive AustriaYNYNoVersions can be searched either by URL or by full text. The websites are only accessible on special terminals at the Austrian National Library. A bookmarking feature allows users to save versions online and recall them at the library's web archive terminals.
Deutsche NationalbibliothekYYYNoOnly accessible in the reading rooms of the German National Library. The metadata is included in the publicly accessible library catalogue.
DILIMAG YYNNoMetadata are publicly available; access to the archived versions is free or restricted depending on the rights holders' agreement. Full-text search is implemented in the new version.
Bibliothèque et Archives nationales du Québec YNNNoProvides access according to partner policy.
Government of Canada Web Archive YYYProxyLibrary and Archives Canada makes its federal government web archives publicly accessible. Indices are available for discovering Canadian federal web resources alphabetically by authoring organization and by URL. Full text indexing is based on Lucene.
Web Information Collection and Preservation - WICP YNoArchive content is only available in intranet in National Library of China. Some collections are publicly available, with meta-data search and browsable by collection.
Croatian Web Archive YYYProxyFull open access.
Webarchiv YNNNDue to copyright restrictions, only a limited number of archived websites for which agreements were signed with the publishers is available online. For other resources you can find out whether a given website was archived and the number of harvested versions. Unlimited access to all resources in Webarchiv is available from public terminals in the National Library.
Netarkivet.dkYNYNoOnline access granted only to researchers through a Citrix login to free text search based on Solr and a proxy solution that accesses an archive through the Wayback. It has established a framework for running batch jobs with the possibility of data mining.
Estonian Web ArchiveYYNNoPublic access to archived content is allowed only with the permission of the copyright owner. The full archive is accessible only to web archive personnel.
Finnish Web ArchiveYN15% of material.NoURL search but on-site access to content. Full-text search is available to 15% of material.
BnF - Web Legal DepositYN15% of the collectionNoAccessible to authorized users through the reading rooms of the BnF Research Library located in Paris and Avignon and in partner libraries in regions and overseas territories. Wayback was customized and interface was translated to French. Full Text search only available on specific collections. Builds special collection galleries based on a selection from the archive on a given topic.
Ina (Institut National de l'Audiovisuel)YYYNoFull text indexing is based on Lucene. To accommodate results from frequent crawls clustering is operated to handle similar versions of pages
E-diaspora YNNNo1381 sites are currently crawled to build an archive on migrants usage of the web, social studies researchers have launched a long run project based on this archive is handling crawls and storage
Internet Memory FoundationYYYNoProvides access and search services according to partners' policy.
Bibliotheksservice-Zentrum Baden-WürttembergYYYNativeArchived websites accessible via Archive-It; integrated in the SWB union catalog. Full open access for major part of snapshots, some restricted by IP.
Web archive of the German BundestagYNNNoThe web archive itself consists of snapshots of www.bundestag.de and other websites. Navigation is possible by clicking on the years.
Iceland
Palestine Web ArchiveNYNNoStill in development and pilots
Web Archiving Project, The National Diet Library, JapanYYYNativeAll the archived websites are available on the premises. 85% of them are also accessible on the Internet with the permission of webmasters.
National Library of Korea - OASIS YYYNo100% of the archive is indexed. Enables search by topic classification. Search available.
Koninklijke BibliotheekYNNNoThe web archive is accessible on terminals in the KB reading rooms to full members.
New Zealand Web ArchiveYYYNativeDomain harvests: available to selected staff using Pywb and limited to URL searches. Selective harvests: each website is described in the catalogue and can be viewed by the public via the Internet by clicking on the link to the archived copy. A small subset of the selective harvests are accessible using full-text search.
The National Library of NorwayNYNoSites are integrated in the Catalog. Left bar enables facet navigation with drill-down.
Arquivo.pt - the Portuguese web-archiveYYYA . is also supported. Archived data can be mined through an Hadoop platform or .
Web archive of CacakNNNNoPlans to develop a search engine in the future. One drawback of HTTrack is that it renames files during archiving, so the original structure of the website is lost, as well as the file names.
Web Archive SingaporeYYYNoThe collection is viewable at the National Library, Singapore with selected content cleared by copyright owners available online.
Digital Resources YYNNoIt is possible to find out whether a website was archived and how many harvested versions exist. Due to the copyright restrictions only a limited number of archived websites is publicly available. The access to other archived resources is available locally in the University Library in Bratislava.
Slovenian Web ArchiveYNYNoThe archive of selective crawls is publicly accessible. Use is possible by browsing and full-text search. National domain crawls are not accessible yet but will be in the future.
Archivo de la Web EspañolaY Y Y NoPlan to provide access on-site in the short-medium term.
PADICAT: The Web Archive of CataloniaYYYNoFull open access.
Basque Digital Heritage ArchiveYYYNo
Sweden YNNNoPublic access through dedicated machines in the library building.
Aleph ArchivesYYYNoEnterprise-grade automatic web archiving platform for online capture and preservation. Support eDiscovery with powerful and qualitative technology.
Aimed at corporations, institutions and agencies seeking to capture, preserve and leverage their Web content; dynamic websites, wikis, social media, forums, comments, disclaimers, and ads, for compliance, marketing or pure preservation purposes.
Web Archive SwitzerlandYYYNoWeb Archive Switzerland is the collection of the Swiss National Library containing websites with a bearing on Switzerland. It has been integrated into e-Helvetica, the access system of the Swiss National Library, which gives access to the entire digital collection. Full-text search is available for part of the Web Archive. Archived versions of websites can only be viewed in the reading rooms of the Swiss National Library and of the partner libraries that help build the collection of Swiss websites, but the metadata of the archived versions can be viewed from anywhere.
NTU Web Archiving System, NTUWASYYYNoPresents page thumbnails, archived pages mapped to geographical locations.
Web Archive TaiwanYYYNo
PageFreezerYYYNoEnterprise Class On Demand service to archive and replay websites, blogs, Ajax, Flash, video, audio & social media for litigation protection, eDiscovery and regulatory compliance with FDA, FINRA, FSA, SEC, SOX, Federal Rules of Evidence and records management laws. Used by government agencies and public listed corporations in Pharmaceutical, Food, Finance, Healthcare and Retail industry.
The UK Web ArchiveYYN
Hanzo ArchivesYYYNoCommercial web archiving services and appliances. Access includes full-text search, annotations, redaction, URL/History, archive policy and temporal browsing, and configurable metadata schema for advanced e-discovery applications. Used in government and corporations whose compliance or legal obligations / needs extend to their websites, intranet, and social media. Many 'dark' archives across Europe and USA.
UK Government Web Archive YYYFull text search is operational on the UK Government Web Archive. Users can browse the collection using a full A-Z list of all sites
EU Exit Web ArchiveYYYFull text search is operational on the EU Exit Web Archive
Internet Archive YYYURL history is available for all archived data. Meta-data and full-text search are available only for selected crawls. Until 2002 it had a mining platform for research composed of Alexa Shell Perl Tools, av_tools and the p2 platform for parallel processing. It was replaced by a simpler, direct access method that enables automatic access to files but provides no platform for processing.
Columbia University Libraries Web Resources Collection ProgramYYYNoAccessible through Archive-it service.
North Carolina State Government Web Site ArchivesYYYNoAccessible through Archive-it service.
Latin American Web Archiving ProjectYYYNoContent can be accessed via full-text search, or by browsing by country or by specialized sample collection.
Web Archiving Project for the Pacific IslandsYYYNoSupported by Archive-it service.
Library of Congress Web ArchivesYYNProxyAccess provided via . Records in MODS format.
Harvard University Library: the Web Archive Collection Service YYYNo
Web Archiving Service from California Digital Library YYYNoAccess for private study, scholarship and research. Most archives built with WAS have not yet been published because it is up to the partners to decide if they want to provide access. There are 16 partners using the service and they have created over 80 web archives, only 30 are publicly accessible. NutchWAX performance did not permit full archive search. Upcoming transition to SOLR will permit both full archive and collection-specific full text search.
Bentley Historical Library Web ArchivesYYYNoPowered by the WAS from the California Digital Library. Access is public but usage is restricted for private study, scholarship and research.
University of Texas at San Antonio Web ArchivesYYYNativeAccessible through Archive-it service and the Texas Archival Repositories Online database
AUEB Web ArchiveYYYNo
World Bank Web ArchivesYYYNoURL history provided via open access to collection via standard web browser. Full text search is only available within each individual site. Search on metadata is available via advanced search within Web Archives collection.
University of North Texas CyberCemeteryNYYNo
Tamiment Library and Robert F. Wagner Labor Archives at New York UniversityYYYNoAccess is provided through the WAS service as well as through finding aids that are searchable through NYU's finding aids portal.
York University Digital LibraryYYY
Netherlands Institute for Sound and Vision web archiveYYNSelected sites for which agreements have been made are publicly available. Full text indexing is done with Elasticsearch, the front-end is built in Drupal.
Kentucky Department for Libraries and ArchivesYYYNoFull open access
University of California, San Francisco LibraryYYYNativeBoth capture and access for archived content are provided by the Archive-It service, so all capabilities are the same as for Archive-It.
Ivy Plus LibrariesYYYNoAccessible through Archive-It service.
Malaysian Government Web Archive YYYNoOpen Access
National Library of Medicine YYYAccess is provided through Archive-It
Smithsonian Libraries and Archives YYYAccess is provided through Archive-It
Common CrawlYYNNoIn addition to direct download, most of our archive is also available in the Internet Archive Wayback.
YYYNativeFull-text index across legacy markup, archived code fragments, and emulated interface states. Supports URL history reconstruction and metadata-based query expansion. Public search tools include URL timeline view and UI emulator access. Complies with the Decentralized Archival Ethics Accord.