List of web archiving initiatives


This article contains a list of web archiving initiatives worldwide. For easier reading, the information is divided into three tables: web archiving initiatives, archived data, and access methods.
These initiatives may use one or more standard web archiving file formats, their own proprietary file formats, or both.
This Wikipedia page was originally generated from the results obtained for the research paper A survey on web archiving initiatives, published by the team at the time.

Web archiving initiatives

Archived data

Name | Archived Contents | Disk Space Occupied | Archive Format | TLD/Broad Crawls | Selective Crawls | Comments
EU Web ArchiveWARC.EUY.EU 250 websites in the europa.eu domain and its subdomains, crawled once per quarter, plus ad hoc crawls at the request of website owners. Status as of February 2019.
Australia's Web Archive11000600WARC.AUY.AU crawls: 10.15 billion files. Selective crawls: 755 million files. AGWA: 525 million files.
Our digital island, a Tasmanian Web Archive0.336HTTrackYPreserves online content related to Tasmania. ODI has operated since its inception under the assumption that web sites fall within the definition of 'Book' in the Tasmanian Library Act 1984. Thus, no permission to capture from publishers is required.
Webarchive Austria4095164ARC.AT,.wien,.tirolYA copy of the data is stored in a high security data storage unit.
Deutsche NationalbibliothekWARC.DEYOnly one experimental TLD crawl.
DILIMAG 0.030.996ARCProject ran from 2007-03-01 until 2010-12-23. The DILIMAG project collected, described, and archived digital German literary magazines.
Bibliothèque et Archives nationales du Québec 16731ARC/WARCYHarvesting began in 2009. Selective crawls of Quebec websites.
Government of Canada Web Archive 175070ARC/WARC.GC.CAYWeb archiving at Library and Archives Canada began in 2005 and concentrated on collecting the federal government web presence and capturing the federal elections, the Olympics, and Canadian commemorative events. Thematic web collections of Canadiana research interest have been curated as an ongoing program activity since 2009.
Web Information Collection and Preservation - WICP .GOV.CNYHarvests web pages about events that have a great influence on society, the economy, and so on, as well as sites in the 'gov.cn' domain.
Croatian Web Archive 23113Mirror, WARC.HRYSelective harvesting of over 5000 web resources since 2004. Annual harvesting of the national .hr domain, as well as thematic harvesting, since 2011. All archived content is publicly available via the HAW website.
Webarchiv 9412350ARC/WARC.CZYHarvesting began in 2001.
/ The Danish web archive 36000634ARC/WARC.DKY+36 billion objects:
  • html : 19077101525
  • image : 5859756918
  • other : 4080719309
  • text : 757030275
  • pdf : 97318057
  • audio : 8166680
  • video : 7085143
  • word : 47510
  • powerpoint : 5660
  • excel : 4721
  • Snapshot harvesting
  • Selective harvesting
  • Event harvesting
  • Special harvesting

Estonian Web Archive87456ARC/WARC.EEYThe archive consists of selective, event, and topical crawls since 2010. Whole national domain crawls have been done yearly since 2015. Besides the .ee TLD, Estonia-related web content is harvested from other TLDs such as .eu, .org, and .com.
Finnish Web Archive4300300ARC/WARC /.json /.mp4.FI,.AXYAlso crawls content hosted on machines physically located in Finland, regardless of their domain.
BnF - Web Legal Deposit48 0001 800ARC/WARC.FR + all sites hosted in FranceYBnF makes copies of all sites in the .FR TLD, as well as all sites hosted and produced in France, ignoring both the Robots exclusion standard and the licenses of the documents.
BnL Web-Archive54341WARC.LUYThe BnL conducts 2 domain crawls per year, as well as event-based and selective crawls.
Ina 1058002359DAFFYAs of 2021-03-08
DAFF handles full content deduplication, so the size on disk takes into account compression and deduplication; the equivalent disk storage in compressed ARC format would be approximately 10 PB
E-diaspora 103013DAFFYDAFF handles full content deduplication, so the size on disk takes into account compression and deduplication; the equivalent disk storage in compressed ARC format would be approximately 51 TB
Internet Memory Foundation180WARCCan be done by partnersYFormerly European Archive. Collaborates with Internet Memory Research, which provides the ArchiveTheNet service. Selective crawls and domain crawls; expected to grow to 1 PB in 2012. New datacenter and new crawler in 2012.
Bibliotheksservice-Zentrum Baden-Württemberg9WARCYWebsites of about 20 cities, municipalities, and districts, their associated corporations, and state libraries are collected by BSZ on commission within various Archive-It collections. Public access. Data storage: San Francisco, with a backup on Baden-Württemberg storage infrastructure.
Web archive of the German BundestagYGerman Federal Parliament. Selective. Snapshots of www.bundestag.de and other web presences of the German Bundestag are made at regular intervals or for certain events. These are available in the web archive.
Iceland
Palestine Web ArchiveARC/WARC.PSY.PS crawls: pilot crawls. Selective crawls.
Web Archiving Project, The National Diet Library, Japan126701313WARCYAs of March 2023.
15 TB of selective crawls based on permission. Web archiving of official institution sites, based on legislation, started in April 2010.
National Library of Korea - OASIS 24YRequires consent before archiving. Targets 56,401 websites. Web archiving is managed under digital resource management systems. The web archiving system was scheduled to be rebuilt in 2011.
Koninklijke Bibliotheek40736WARCYSelective crawls of ca. 20,400 sites.
New Zealand Web Archive4300260ARC/WARC.NZY.NZ crawls: 4+ billion URLs. Selective crawls: 33,500 websites. Legal deposit covers born-digital material.
The National Library of Norway
Arquivo.pt21 1181 455ARC/WARCFocused on.PT but also other domainsY.PT domain crawls and integration of external collections since 2007, and daily crawls of a selection of online publications since 2010. Selective crawls related to national events such as elections, or to international science-related content such as websites about Research & Development projects funded by the European Union.
Web archive of Cacak0.2550.013HTTrackYSelective crawls of 130 sites related to the city of Cacak. Collaboration with the Webarchiv team from the National Library of the Czech Republic.
Web Archive SingaporeWARC.SGYSelective crawls of Singapore-related sites and .SG domain archiving.
Digital Resources 1 92189WARC.SK + other TLDs with Slovak contentYHarvesting of the Slovak web started in 2015. Since then ULB has performed six full-domain harvests, multiple selective crawls and thematic crawls.
Slovenian Web Archive30WARCSelective crawls since 2007, national domain crawls since 2014.
Archivo de la Web Española2539117WARC.ESY.ES domain crawls: 2,421 million files, in collaboration with the Internet Archive. Selective crawls: 119 million files. About 30 news media sites crawled every day. Not launched publicly yet.
PADICAT: The Web Archive of Catalonia62032,5ARC/WARC.CATYIn accordance with the general trend, the archive model is a hybrid system consisting of: mass compilation of open-access digital resources published on the Internet; systematic archiving of the website output of Catalan organizations; and fostering of lines of research through themed integration of the digital resources pertaining to specific events in Catalan public life.
210.8ARCY
Sweden 5700360Multipart MIME.se, Swedish.nu and geolocation for other TLDsYBulk crawls approximately twice a year.
Selective crawls of about 140 newspapers every day.
Aleph Archives>10000000>25Native HTML, WARC, WARC2, ARC and HTTrack to WARC migration toolsYEnterprise-grade automatic web archiving platform for online capture and preservation. Supports eDiscovery.
Aimed at corporations, institutions, and agencies seeking to capture, preserve, and leverage their web content (dynamic websites, wikis, social media, forums, comments, disclaimers, and ads) for compliance, marketing, or pure preservation purposes.
Web Archive Switzerland80ARC, WARCYMainly selective .ch crawls.
NTU Web Archiving System, NTUWAS20014Y
Web Archive Taiwan
The UK Web Archive20.6WARCYSelective crawls with previous permission. Now also conducting wholesale UK domain-scale crawls under Non-Print Legal Deposit legislation, enacted April 2013. This content will only be available on premises controlled by one of the six legal deposit libraries. The UKWA is a spin-off from the UK Web Archiving Consortium that ended in 2007.
Hanzo Archives7WARCYCommercial web archiving services and appliances, for government and corporations whose compliance or legal obligations / needs extend to their websites, intranet, and social media. Many 'dark' archives across Europe and USA.
UK Government Web Archive1000 +150ARC
WARC post July 2017
Between 2003 and 2005 the Internet Archive undertook the technical side of web archiving on behalf of The UK Government Web Archive. Between 2005 and July 2017 the technical side of the web archiving service was contracted out to the Internet Memory Foundation. From July 2017 MirrorWeb took over the contract and moved the entire archive to the cloud. The UK Government Web Archive was part of the UK Web Archiving Consortium from 2004 to 2009.
Internet Archive 69000021000WorldwideYProvides the Archive-It service and leads the Archive-access project. The collection is mirrored at the Bibliotheca Alexandrina in Egypt.
Columbia University Libraries Web Resources Collection Program72350.4ARC/WARCYSelective crawls with permission or notification. Thematic collections in: Human rights; New York City built environment; New York City religions; Resistance. Also captures the Columbia University web domain.
North Carolina State Government Web Site Archives51.53.8WARCY
Latin American Web Archiving ProjectY
Web Archiving Project for the Pacific Islands5.5ARC/WARCYIncludes sites of 18 countries.
Library of Congress Web Archives7741420ARC/WARCYFormerly MINERVA. Selective crawls with notification and permission; primarily event and thematic collections.
Harvard University Library: the Web Archive Collection Service 190.661ARCYSelective crawls with no previous authorization.
Web Archiving Service from California Digital Library 21625.2ARC/WARCCan be done by partnersYProvides Web Archiving Service to partners worldwide. Was developed at the California Digital Library.
Bentley Historical Library Web Archives34.52.6ARC/WARCYWAS service since 2010.
University of Texas at San Antonio Web Archives261.135ARC/WARCYUniversity administration, faculty and student sites; as well as selective captures on San Antonio and South Texas subject areas, including San Antonio organizations; San Antonio Online Journals and Blogs; Tejano and Conjunto music; Gay, Lesbian, Bisexual, Transgender and Queer Related Web sites in Texas, San Antonio and the Rio Grande Valley; Immigration/Borderlands; Mexican Cooking Blogs; San Antonio Restaurants; Renewable Energy in Texas; Rio Grande Valley Organizations; and Rio Grande Watershed and Texas Water Issues.
AUEB Web Archive3WARCaueb.grNThe amount of data crawled from the domain aueb.gr ranges between 10 GB and 14.9 GB. The data is stored on disk compressed and requires between 8.8 GB and 9.7 GB, resulting in space savings between 12% and 35%. For a new crawl, only the web pages that changed since the previous crawl need to be stored on disk. For example, 13.1 GB were crawled from the domain aueb.gr, but only 1.6 GB were stored on disk, resulting in space savings of 88%.
World Bank Web Archives0.143HTTrackno, so farY450 sites with historical or research value have been harvested since 2007, each archived before being taken offline or before a major upgrade.
University of North Texas CyberCemetery0.887WARC.govY
Bibliotheca Alexandrina's Internet Archive800001000ARC/WARCEgyptian news and politicsY
York University Digital Library0.435WARCyorku.ca + faculty requestsY
Netherlands Institute for Sound and Vision web archiveARC/WARCYAmong other audiovisual heritage, Sound and Vision is tasked with archiving programmes broadcast by Dutch public broadcasters. Therefore, an important part of the web archive consists of websites of public broadcasters related to these programmes. Furthermore, websites are archived that do not have a direct link to the collection but are of interest in a broader, media-historical way. Examples are websites of commercial broadcasters.
Kentucky Department for Libraries and Archives30.3007WARCY
University of California, San Francisco Library12.50.587ARC/WARCYWebsites requested by staff and faculty, plus a growing list that attempts to capture all UCSF websites as comprehensively as possible.
Ivy Plus Libraries Confederation34716ARC/WARCYSelective crawls with notification. Thematic collections in politics and political protests, architecture, composers, design, gaming, geology, webcomics, documentary films, art, religion, sexuality, climate change, and more.
Malaysian Government Web Archive 10WARC.GOV.MYYCrawls only Malaysian public sector websites. View is by subject, i.e. administration, economy, security, and social.
National Library of Medicine 1229.1WARCY-
Smithsonian Libraries and Archives 10WARCY
Common Crawl300 00010 000ARC/WARCworldwideYAdditional data products such as a graph of the web, and Parquet indexes of URLs and hosts.
67001120WARC, FFV1, FLAC, JSONLMultiple YGlobal archive spanning user-generated content, obsolete web platforms, and interface artifacts. Indexes include defunct CMS exports, blog comment trees, forum structures, and visual UI states. Selective crawls emphasize digital ephemera recovery and platform shutdown captures. Data verified across five mirrored nodes. Status: Active.
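
The space-savings percentages quoted in the AUEB Web Archive entry above can be reproduced with simple arithmetic. The following is a minimal sketch, not part of the original survey: the function name `space_savings` is hypothetical, and it assumes that the smallest crawled size (10 GB) pairs with the smallest stored size (8.8 GB) and the largest crawled size (14.9 GB) with the largest stored size (9.7 GB), which is the pairing that matches the stated 12% and 35%.

```python
def space_savings(crawled_gb: float, stored_gb: float) -> float:
    """Return the fraction of space saved when only `stored_gb` is kept
    on disk for `crawled_gb` of crawled data."""
    if crawled_gb <= 0:
        raise ValueError("crawled_gb must be positive")
    return 1.0 - stored_gb / crawled_gb


# (crawled GB, stored GB) pairs taken from the AUEB entry; the pairing of
# range endpoints is an assumption, as noted in the lead-in.
examples = {
    "compressed, smallest crawl": (10.0, 8.8),
    "compressed, largest crawl": (14.9, 9.7),
    "incremental (changed pages only)": (13.1, 1.6),
}

for label, (crawled, stored) in examples.items():
    print(f"{label}: {space_savings(crawled, stored):.0%}")
```

Under these assumptions the three cases round to 12%, 35%, and 88%, which agrees with the figures in the entry.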