List of web archiving initiatives
This article contains a list of web archiving initiatives worldwide. For easier reading, the information is divided into three tables: web archiving initiatives, archived data, and access methods.
Some of these initiatives use one or more standard web archiving file formats, their own proprietary file formats, or both.
This Wikipedia page was originally generated from the results obtained for the research paper A survey on web archiving initiatives, published by the team at the time.
Web archiving initiatives
Archived data
| Name | Archived Contents (million files) | Disk Space Occupied (TB) | Archive Format | TLD/Broad Crawls | Selective Crawls | Comments |
| EU Web Archive | | | WARC | .EU | Y | .EU: 250 websites in the europa.eu domain and subdomains, crawled once per quarter, plus ad hoc crawls at the request of website owners. Status as of February 2019. |
| Australia's Web Archive | 11000 | 600 | WARC | .AU | Y | .AU crawls: 10.15 billion files. Selective crawls: 755 million files. AGWA: 525 million files. |
| Our digital island, a Tasmanian Web Archive | 0.336 | HTTrack | Y | Preserves online content related to Tasmania. ODI has operated since its inception under the assumption that web sites fall within the definition of 'Book' in the Tasmanian Library Act 1984. Thus, no permission to capture from publishers is required. | ||
| Webarchive Austria | 4095 | 164 | ARC | .AT, .wien, .tirol | Y | A copy of the data is stored in a high-security data storage unit. |
| Deutsche Nationalbibliothek | | | WARC | .DE | Y | Only one experimental TLD crawl. |
| DILIMAG | 0.03 | 0.996 | ARC | | | Project ran from 2007-03-01 until 2010-12-23. The DILIMAG project collected, described and archived digital German literary magazines. |
| Bibliothèque et Archives nationales du Québec | 167 | 31 | ARC/WARC | | Y | Harvesting began in 2009. Selective crawls of Quebec websites. |
| Government of Canada Web Archive | 1750 | 70 | ARC/WARC | .GC.CA | Y | Web archiving at Library and Archives Canada began in 2005 and concentrated on collecting the federal government web presence and capturing the federal elections, the Olympics, and Canadian commemorative events. Thematic web collections of Canadiana research interest have been curated as an ongoing program activity since 2009. |
| Web Information Collection and Preservation - WICP | | | | .GOV.CN | Y | Harvests web pages about events with a great influence on society, the economy and so on, as well as the sites in the 'gov.cn' domain. |
| Croatian Web Archive | 231 | 13 | Mirror, WARC | .HR | Y | Selective harvesting of over 5000 web resources since 2004. Annual harvesting of the national .hr domain, as well as thematic harvesting, since 2011. All archived content is publicly available via the HAW website. |
| Webarchiv | 9412 | 350 | ARC/WARC | .CZ | Y | Harvesting began in 2001. |
| The Danish web archive | 36000 | 634 | ARC/WARC | .DK | Y | More than 36 billion objects. |
| Estonian Web Archive | 874 | 56 | ARC/WARC | .EE | Y | The archive consists of selective, event and topical crawls since 2010. Whole national domain crawls have been done yearly since 2015. Besides the .ee TLD, Estonia-related web content is harvested from other TLDs such as .eu, .org and .com. |
| Finnish Web Archive | 4300 | 300 | ARC/WARC, .json, .mp4 | .FI, .AX | Y | Also crawls content hosted on machines physically located in Finland, regardless of their domain. |
| BnF - Web Legal Deposit | 48 000 | 1 800 | ARC/WARC | .FR + all sites hosted in France | Y | BnF is making copies of all sites in the .FR TLD, as well as all sites hosted and produced in France, ignoring both the Robots exclusion standard and the licenses of the documents. |
| BnL Web-Archive | 543 | 41 | WARC | .LU | Y | The BnL conducts 2 domain crawls per year, as well as event-based and selective crawls. |
| Ina | 105800 | 2359 | DAFF | | Y | As of 2021-03-08 DAFF handles full content deduplication, so the size on disk takes into account compression and deduplication; the equivalent disk storage in compressed ARC format would be approximately 10 PB. |
| E-diaspora | 1030 | 13 | DAFF | | Y | DAFF handles full content deduplication, so the size on disk takes into account compression and deduplication; the equivalent disk storage in compressed ARC format would be approximately 51 TB. |
| Internet Memory Foundation | 180 | WARC | Can be done by partners | Y | Formerly European Archive. Collaborates with Internet Memory Research, which provides the ArchiveTheNet service. Selective crawls and domain crawls; expected to grow to 1 PB in 2012. New datacenter and new crawler in 2012. | |
| Bibliotheksservice-Zentrum Baden-Württemberg | 9 | WARC | Y | Websites of about 20 cities, municipalities and districts, their associated corporations, and state libraries are collected by BSZ on commission within various Archive-It collections. Public access. Data is stored in San Francisco, with a backup on Baden-Württemberg storage infrastructure. | ||
| Web archive of the German Bundestag | | | | | Y | German Federal Parliament. Selective. Snapshots of www.bundestag.de and other web presences of the German Bundestag are made at regular intervals or for certain events, and are available in the web archive. |
| Iceland | ||||||
| Palestine Web Archive | | | ARC/WARC | .PS | Y | .PS crawls: pilot crawls. Selective crawls. |
| Web Archiving Project, The National Diet Library, Japan | 12670 | 1313 | WARC | | Y | As of March 2023, 15 TB of selective crawls based on permission. Web archiving of official institution sites, based on legislation, started in April 2010. |
| National Library of Korea - OASIS | 24 | Y | Requires consent before archiving. Targets 56,401 Websites. Web archiving is managed under Digital resource management systems. In 2011 web archiving system will be rebuilt. | |||
| Koninklijke Bibliotheek | 407 | 36 | WARC | | Y | Selective crawls of ca. 20,400 sites. |
| New Zealand Web Archive | 4300 | 260 | ARC/WARC | .NZ | Y | .NZ crawls: 4+ billion URLs. Selective crawls: 33,500 websites. Legal deposit covers born-digital material. |
| The National Library of Norway | ||||||
| Arquivo.pt | 21 118 | 1 455 | ARC/WARC | Focused on .PT but also other domains | Y | .PT domain crawls and integration of external collections since 2007, and daily crawls of a selection of online publications since 2010. Selective crawls related to national events such as elections, or to international content related to science, such as websites about Research & Development projects funded by the European Union. |
| Web archive of Cacak | 0.255 | 0.013 | HTTrack | | Y | Selective crawls of 130 sites related to the city of Cacak. Collaboration with the Webarchiv team from the National Library of the Czech Republic. |
| Web Archive Singapore | | | WARC | .SG | Y | Selective crawls of Singapore-related sites and .SG domain archiving. |
| Digital Resources | 1 921 | 89 | WARC | .SK + other TLDs with Slovak-related content | Y | Harvesting of the Slovak web started in 2015. Since then ULB has performed six full-domain harvests, multiple selective crawls and thematic crawls. |
| Slovenian Web Archive | 30 | WARC | Selective crawls since 2007, national domain crawls since 2014. | |||
| Archivo de la Web Española | 2539 | 117 | WARC | .ES | Y | .ES domain crawls: 2,421 million files, in collaboration with Internet Archive. Selective crawls: 119 million files. About 30 news media sites crawled every day. Not launched publicly yet. |
| PADICAT: The Web Archive of Catalonia | 620 | 32.5 | ARC/WARC | .CAT | Y | In accordance with the general trend, the archive model is a hybrid system consisting of: mass compilation of open-access digital resources published on the Internet; systematic archiving of the website output of Catalan organizations; and fostering of lines of research through themed integration of the digital resources pertaining to specific events in Catalan public life. |
| 21 | 0.8 | ARC | Y | |||
| Sweden | 5700 | 360 | Multipart MIME | .se, Swedish .nu and geolocation for other TLDs | Y | Bulk crawls approximately twice a year. Selective crawls of about 140 newspapers every day. |
| Aleph Archives | >10000000 | >25 | Native HTML, WARC, WARC2, ARC and HTTrack to WARC migration tools | Y | Enterprise-grade automatic web archiving platform for online capture and preservation. Supports eDiscovery. Aimed at corporations, institutions and agencies seeking to capture, preserve and leverage their web content (dynamic websites, wikis, social media, forums, comments, disclaimers, and ads) for compliance, marketing or pure preservation purposes. | |
| Web Archive Switzerland | 80 | ARC, WARC | Y | Mainly selected .ch crawls | ||
| NTU Web Archiving System, NTUWAS | 200 | 14 | | | Y | |
| Web Archive Taiwan | ||||||
| The UK Web Archive | 20.6 | WARC | Y | Selective crawls with previous permission. Now also conducting wholesale UK domain-scale crawls under Non-Print Legal Deposit legislation, enacted April 2013. This content will only be available on premises controlled by one of the six legal deposit libraries. The UKWA is a spin-off from the UK Web Archiving Consortium that ended in 2007. | ||
| Hanzo Archives | 7 | WARC | Y | Commercial web archiving services and appliances, for government and corporations whose compliance or legal obligations / needs extend to their websites, intranet, and social media. Many 'dark' archives across Europe and USA. | ||
| UK Government Web Archive | 1000+ | 150 | ARC; WARC post July 2017 | | | Between 2003 and 2005 the Internet Archive undertook the technical side of web archiving on behalf of the UK Government Web Archive. Between 2005 and July 2017 the technical side of the web archiving service was contracted out to the Internet Memory Foundation. From July 2017 MirrorWeb took over the contract and moved the entire archive to the cloud. The UK Government Web Archive was part of the UK Web Archiving Consortium from 2004 to 2009. |
| Internet Archive | 690000 | 21000 | | Worldwide | Y | Provides the Archive-It service and leads the Archive-access project. The collection is mirrored at Bibliotheca Alexandrina in Egypt. |
| Columbia University Libraries Web Resources Collection Program | 723 | 50.4 | ARC/WARC | | Y | Selective crawls with permission or notification. Thematic collections in: Human rights; New York City built environment; New York City religions; Resistance. Also captures the Columbia University web domain. |
| North Carolina State Government Web Site Archives | 51.5 | 3.8 | WARC | | Y | |
| Latin American Web Archiving Project | | | | | Y | |
| Web Archiving Project for the Pacific Islands | 5.5 | ARC/WARC | Y | Includes sites of 18 countries. | ||
| Library of Congress Web Archives | 7741 | 420 | ARC/WARC | | Y | Formerly MINERVA. Selective crawls with notification and permission; primarily event and thematic collections. |
| Harvard University Library: the Web Archive Collection Service | 19 | 0.661 | ARC | | Y | Selective crawls with no previous authorization. |
| Web Archiving Service from California Digital Library | 216 | 25.2 | ARC/WARC | Can be done by partners | Y | Provides Web Archiving Service to partners worldwide. Was developed at the California Digital Library. |
| Bentley Historical Library Web Archives | 34.5 | 2.6 | ARC/WARC | | Y | WAS service since 2010. |
| University of Texas at San Antonio Web Archives | 26 | 1.135 | ARC/WARC | Y | University administration, faculty and student sites; as well as selective captures on San Antonio and South Texas subject areas, including San Antonio organizations; San Antonio Online Journals and Blogs; Tejano and Conjunto music; Gay, Lesbian, Bisexual, Transgender and Queer Related Web sites in Texas, San Antonio and the Rio Grande Valley; Immigration/Borderlands; Mexican Cooking Blogs; San Antonio Restaurants; Renewable Energy in Texas; Rio Grande Valley Organizations; and Rio Grande Watershed and Texas Water Issues. | |
| AUEB Web Archive | 3 | WARC | aueb.gr | N | The amount of data crawled from the domain aueb.gr ranges between 10GB and 14.9GB. The data is stored on disk compressed and requires between 8.8GB and 9.7GB, a space saving of between 12% and 35%. For a new crawl, only the web pages that changed since the previous crawl need to be stored on disk: of 13.1GB crawled from aueb.gr, only 1.6GB was stored, a space saving of 88%. | |
| World Bank Web Archives | 0.143 | HTTrack | no, so far | Y | 450 sites with historical or research value have been harvested since 2007, each archived before being taken offline or before a major upgrade. | |
| University of North Texas CyberCemetery | 0.887 | WARC | .gov | Y | ||
| Bibliotheca Alexandrina's Internet Archive | 80000 | 1000 | ARC/WARC | Egyptian news and politics | Y | |
| York University Digital Library | 0.435 | WARC | yorku.ca + faculty requests | Y | ||
| Netherlands Institute for Sound and Vision web archive | | | ARC/WARC | | Y | Among other audiovisual heritage, Sound and Vision is tasked with archiving programmes broadcast by Dutch public broadcasters. Therefore, an important part of the web archive consists of websites of public broadcasters related to these programmes. Websites that have no direct link to the collection but are of broader media-historical interest, such as those of commercial broadcasters, are also archived. |
| Kentucky Department for Libraries and Archives | 3 | 0.3007 | WARC | | Y | |
| University of California, San Francisco Library | 12.5 | 0.587 | ARC/WARC | | Y | Websites requested by staff and faculty, and a growing list that attempts to capture all UCSF websites as comprehensively as possible. |
| Ivy Plus Libraries Confederation | 347 | 16 | ARC/WARC | | Y | Selective crawls with notification. Thematic collections in politics and political protests, architecture, composers, design, gaming, geology, webcomics, documentary films, art, religion, sexuality, climate change, and more. |
| Malaysian Government Web Archive | 10 | WARC | .GOV.MY | Y | Crawls only Malaysian public sector websites. Content can be viewed by subject, i.e. administration, economy, security, and social. | |
| National Library of Medicine | 122 | 9.1 | WARC | Y | - | |
| Smithsonian Libraries and Archives | 10 | WARC | Y | |||
| Common Crawl | 300 000 | 10 000 | ARC/WARC | Worldwide | Y | Additional data products such as a graph of the web, and Parquet indexes of URLs and hosts. |
| 6700 | 1120 | WARC, FFV1, FLAC, JSONL | Multiple | Y | Global archive spanning user-generated content, obsolete web platforms, and interface artifacts. Indexes include defunct CMS exports, blog comment trees, forum structures, and visual UI states. Selective crawls emphasize digital ephemera recovery and platform shutdown captures. Data verified across five mirrored nodes. Status: Active. |