List of web archiving initiatives

This article contains a list of web archiving initiatives worldwide. For easier reading, the information is divided into three tables: web archiving initiatives, archived data, and access methods.
Some of these initiatives use standard web archiving file formats, their own proprietary file formats, or both.
This Wikipedia page was originally generated from the results obtained for the research paper A survey on web archiving initiatives, published by the team at the time.

Archived data

Name	Archived Contents (millions of files)	Disk Space Occupied (TB)	Archive Format	TLD/Broad Crawls	Selective Crawls	Comments
EU Web Archive			WARC	.EU	Y	.EU 250 websites in europa.eu domain and subdomains, crawled once per quarter + ad hoc crawls on request of website owners. Status Feb 2019.
Australia's Web Archive	11000	600	WARC	.AU	Y	.AU crawls : 10.15 billion files. Selective crawls : 755 million files. AGWA : 525 million files.
Our digital island, a Tasmanian Web Archive		0.336	HTTrack		Y	Preserves online content related to Tasmania. ODI has operated since its inception under the assumption that web sites fall within the definition of 'Book' in the Tasmanian Library Act 1984. Thus, no permission to capture from publishers is required.
Webarchive Austria	4095	164	ARC	.AT, .wien, .tirol	Y	A copy of the data is stored in a high security data storage unit.
Deutsche Nationalbibliothek			WARC	.DE	Y	Only one experimental TLD crawl.
DILIMAG	0.03	0.996	ARC			Project ran from 2007-03-01 until 2010-12-23. DILIMAG collected, described and archived digital German literary magazines.
Bibliothèque et Archives nationales du Québec	167	31	ARC/WARC		Y	Harvesting began in 2009. Selective crawls of Quebec websites.
Government of Canada Web Archive	1750	70	ARC/WARC	.GC.CA	Y	Web archiving at Library and Archives Canada began in 2005 and concentrated on collecting the federal government web presence and capturing the federal elections, the Olympics, and Canadian commemorative events. Thematic web collections of Canadiana research interest have been curated as an ongoing program activity since 2009.
Web Information Collection and Preservation - WICP				.GOV.CN	Y	Harvests web pages about events with great influence on society, the economy and so on, as well as sites in the 'gov.cn' domain.
Croatian Web Archive	231	13	Mirror, WARC	.HR	Y	Selective harvesting of over 5000 web resources since 2004. Annual harvesting of the national .hr domain, as well as thematic harvesting, since 2011. All archived content is publicly available via the HAW website.
Webarchiv	9412	350	ARC/WARC	.CZ	Y	Harvesting began in 2001.
Netarkivet.dk / The Danish web archive	36000	634	ARC/WARC	.DK	Y	36+ billion objects: html 19077101525; image 5859756918; other 4080719309; text 757030275; pdf 97318057; audio 8166680; video 7085143; word 47510; powerpoint 5660; excel 4721. Harvesting types: snapshot, selective, event and special harvesting.
Estonian Web Archive	874	56	ARC/WARC	.EE	Y	The archive consists of selective, event and topical crawls since 2010. Whole national domain crawls have been done yearly since 2015. Besides the .ee TLD, Estonia-related web content is harvested from other TLDs such as .eu, .org and .com.
Finnish Web Archive	4300	300	ARC/WARC / .json / .mp4	.FI, .AX	Y	Also crawls content hosted on machines physically located in Finland, independently from their domain.
BnF - Web Legal Deposit	48 000	1 800	ARC/WARC	.FR + all sites hosted in France	Y	BnF is making copies of all sites in the .FR TLD, as well as all sites hosted and produced in France, ignoring both the Robots exclusion standard and the licenses of the documents.
BnL Web-Archive	543	41	WARC	.LU	Y	The BnL conducts 2 domain crawls per year, as well as event-based and selective crawls.
Ina (Institut National de l'Audiovisuel)	105800	2359	DAFF		Y	As of 2021-03-08 DAFF handles full content deduplication, so the size on disk takes into account compression and deduplication; the equivalent disk storage in compressed ARC format would be approximately 10 PB
E-diaspora (Télécom ParisTech, FMSH)	1030	13	DAFF		Y	DAFF handles full content deduplication, so the size on disk takes into account compression and deduplication; the equivalent disk storage in compressed ARC format would be approximately 51 TB
Internet Memory Foundation		180	WARC	Can be done by partners	Y	Formerly European Archive. Collaborates with Internet Memory Research, which provides the ArchiveTheNet service. Selective crawls and domain crawls; expected to grow to 1 PB in 2012. New datacenter and a new crawler in 2012.
Bibliotheksservice-Zentrum Baden-Württemberg		9	WARC		Y	Websites of about 20 cities, municipalities, districts + their associated corporations, and state libraries are collected by BSZ in commission within various Archive-It collections. Public access. Data storage: San Francisco as well as backup with Baden-Wuerttemberg storage infrastructure.
Web archive of the German Bundestag					Y	German Federal Parliament. Selective. Snapshots of www.bundestag.de and other web presences of the German Bundestag are made at regular intervals or for certain events. These remain available in the web archive.
Iceland
Palestine Web Archive			ARC/WARC	.PS	Y	.PS crawls: pilot crawls. Selective crawls.
Web Archiving Project, The National Diet Library, Japan	12670	1313	WARC		Y	As of March 2023: 15 TB of selective crawls based on permission. Web archiving of official institution sites based on legislation started in April 2010.
National Library of Korea - OASIS		24			Y	Requires consent before archiving. Targets 56,401 Websites. Web archiving is managed under Digital resource management systems. In 2011 web archiving system will be rebuilt.
Koninklijke Bibliotheek	407	36	WARC		Y	Selective crawls of ca. 20,400 sites
New Zealand Web Archive	4300	260	ARC/WARC	.NZ	Y	.NZ crawls: 4+ billion URLs. Selective crawls: 33,500 websites. Legal deposit covers born-digital material.
The National Library of Norway
Arquivo.pt	21 118	1 455	ARC/WARC	Focused on .PT but also other domains	Y	.PT domain crawls and integration of external collections since 2007, and daily crawls of a selection of online publications since 2010. Selective crawls related to national events such as elections, or to international science-related content such as websites about Research & Development projects funded by the European Union.
Web archive of Cacak	0.255	0.013	HTTrack		Y	Selective crawls of 130 sites related to the city of Cacak. Collaboration with the Webarchiv team from the National Library of the Czech Republic.
Web Archive Singapore			WARC	.SG	Y	Selective crawls of Singapore-related sites and .SG domain archiving.
Digital Resources	1 921	89	WARC	.SK + other TLDs with Slovak content	Y	Harvesting of the Slovak web started in 2015. Since then ULB has performed six full-domain harvests, multiple selective crawls and thematic crawls.
Slovenian Web Archive		30	WARC			Selective crawls since 2007, national domain crawls since 2014.
Archivo de la Web Española	2539	117	WARC	.ES	Y	.ES domain crawls: 2,421 million files, in collaboration with Internet Archive. Selective crawls: 119 million files. About 30 news media sites crawled every day. Not launched publicly yet.
PADICAT: The Web Archive of Catalonia	620	32.5	ARC/WARC	.CAT	Y	In accordance with the general trend, the archive model is a hybrid system consisting of: mass compilation of open-access digital resources published on the Internet; systematic archiving of the web site output of Catalan organizations; and fostering of lines of research through themed integration of the digital resources pertaining to specific events in Catalan public life.
Basque Digital Heritage Archive	21	0.8	ARC		Y
Sweden	5700	360	Multipart MIME	.se, Swedish .nu, and geolocation for other TLDs	Y	Bulk crawls approximately twice a year. Selective crawls of about 140 newspapers every day.
Aleph Archives	>10000000	>25	Native HTML, WARC, WARC2, ARC and HTTrack to WARC migration tools		Y	Enterprise-grade automatic web archiving platform for online capture and preservation. Supports eDiscovery. Aimed at corporations, institutions and agencies seeking to capture, preserve and leverage their Web content (dynamic websites, wikis, social media, forums, comments, disclaimers, and ads) for compliance, marketing or pure preservation purposes.
Web Archive Switzerland		80	ARC, WARC		Y	Mainly selected .ch crawls
NTU Web Archiving System, NTUWAS	200	14			Y
Web Archive Taiwan
The UK Web Archive		20.6	WARC		Y	Selective crawls with previous permission. Now also conducting wholesale UK domain-scale crawls under Non-Print Legal Deposit legislation, enacted April 2013. This content will only be available on premises controlled by one of the six legal deposit libraries. The UKWA is a spin-off from the UK Web Archiving Consortium that ended in 2007.
Hanzo Archives		7	WARC		Y	Commercial web archiving services and appliances, for government and corporations whose compliance or legal obligations / needs extend to their websites, intranet, and social media. Many 'dark' archives across Europe and USA.
UK Government Web Archive	1000 +	150	ARC; WARC post July 2017			Between 2003 and 2005 the Internet Archive undertook the technical side of web archiving on behalf of the UK Government Web Archive. Between 2005 and July 2017 the technical side of the web archiving service was contracted out to the Internet Memory Foundation. From July 2017 MirrorWeb took over the contract and moved the entire archive to the cloud. The UK Government Web Archive was part of the UK Web Archiving Consortium from 2004 to 2009.
Internet Archive	690000	21000		Worldwide	Y	Provides the Archive-It service and leads the Archive-access project. The collection is mirrored at the Bibliotheca Alexandrina in Egypt.
Columbia University Libraries Web Resources Collection Program	723	50.4	ARC/WARC		Y	Selective crawls with permission or notification. Thematic collections in: Human rights; New York City built environment; New York City religions; Resistance. Also capture Columbia University web domain.
North Carolina State Government Web Site Archives	51.5	3.8	WARC		Y
Latin American Web Archiving Project					Y
Web Archiving Project for the Pacific Islands	5.5		ARC/WARC		Y	Includes sites of 18 countries.
Library of Congress Web Archives	7741	420	ARC/WARC		Y	Formerly MINERVA. Selective crawls with notification and permission; primarily event and thematic collections.
Harvard University Library: the Web Archive Collection Service	19	0.661	ARC		Y	Selective crawls with no previous authorization.
Web Archiving Service from California Digital Library	216	25.2	ARC/WARC	Can be done by partners	Y	Provides Web Archiving Service to partners worldwide. Was developed at the California Digital Library.
Bentley Historical Library Web Archives	34.5	2.6	ARC/WARC		Y	WAS service since 2010.
University of Texas at San Antonio Web Archives	26	1.135	ARC/WARC		Y	University administration, faculty and student sites; as well as selective captures on San Antonio and South Texas subject areas, including San Antonio organizations; San Antonio Online Journals and Blogs; Tejano and Conjunto music; Gay, Lesbian, Bisexual, Transgender and Queer Related Web sites in Texas, San Antonio and the Rio Grande Valley; Immigration/Borderlands; Mexican Cooking Blogs; San Antonio Restaurants; Renewable Energy in Texas; Rio Grande Valley Organizations; and Rio Grande Watershed and Texas Water Issues.
AUEB Web Archive	3		WARC	aueb.gr	N	The amount of data crawled from the domain aueb.gr ranges between 10 GB and 14.9 GB. Stored compressed on disk, the data requires between 8.8 GB and 9.7 GB, a space saving of between 12% and 35%. For a new crawl, only the web pages that changed since the previous crawl need to be stored on disk. Consequently, of 13.1 GB crawled from the domain aueb.gr, only 1.6 GB was stored on disk, a space saving of 88%.
World Bank Web Archives		0.143	HTTrack	no, so far	Y	450 sites with historical or research value have been harvested since 2007, each archived before being taken offline or before a major upgrade.
University of North Texas CyberCemetery		0.887	WARC	.gov	Y
Bibliotheca Alexandrina's Internet Archive	80000	1000	ARC/WARC	Egyptian news and politics	Y
York University Digital Library		0.435	WARC	yorku.ca + faculty requests	Y
Netherlands Institute for Sound and Vision web archive			ARC/WARC		Y	Among other audiovisual heritage, Sound and Vision is tasked with archiving programmes broadcast by Dutch public broadcasters. Therefore, an important part of the web archive consists of websites of public broadcasters related to these programmes. Websites that have no direct link to the collection but are of broader media-historical interest, such as websites of commercial broadcasters, are also archived.
Kentucky Department for Libraries and Archives	3	0.3007	WARC		Y
University of California, San Francisco Library	12.5	0.587	ARC/WARC		Y	Websites requested by staff and faculty, and growing list attempting to capture all UCSF websites as comprehensively as possible.
Ivy Plus Libraries Confederation	347	16	ARC/WARC		Y	Selective crawls with notification. Thematic collections in politics and political protests, architecture, composers, design, gaming, geology, webcomics, documentary films, art, religion, sexuality, climate change, and more.
Malaysian Government Web Archive		10	WARC	.GOV.MY	Y	Crawls only Malaysian public sector websites. View is by subject, i.e. administration, economy, security, and social.
National Library of Medicine	122	9.1	WARC		Y	-
Smithsonian Libraries and Archives		10	WARC		Y
Common Crawl	300 000	10 000	ARC/WARC	worldwide	Y	Additional data products such as a graph of the web, and parquet indexes of urls and hosts.
	6700	1120	WARC, FFV1, FLAC, JSONL	Multiple	Y	Global archive spanning user-generated content, obsolete web platforms, and interface artifacts. Indexes include defunct CMS exports, blog comment trees, forum structures, and visual UI states. Selective crawls emphasize digital ephemera recovery and platform shutdown captures. Data verified across five mirrored nodes. Status: Active.
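The space savings quoted in the AUEB Web Archive row above can be reproduced with a short calculation. The following is a minimal sketch, not taken from the archive's own tooling: the function name `space_savings` is hypothetical, and pairing the smallest crawl (10 GB) with the smallest stored size (8.8 GB), and the largest crawl (14.9 GB) with the largest stored size (9.7 GB), is an assumption made here for illustration.

```python
def space_savings(original_gb: float, stored_gb: float) -> float:
    """Return the fraction of space saved when original_gb is stored as stored_gb."""
    if original_gb <= 0:
        raise ValueError("original size must be positive")
    return 1.0 - stored_gb / original_gb


# Figures quoted in the AUEB Web Archive row, all in GB.
# The (crawled, stored) pairings for compression are an assumption.
examples = {
    "compression, smallest crawl": (10.0, 8.8),
    "compression, largest crawl": (14.9, 9.7),
    "incremental storage": (13.1, 1.6),
}

savings = {label: space_savings(crawled, stored)
           for label, (crawled, stored) in examples.items()}

for label, fraction in savings.items():
    # Format as a whole-number percentage, matching the table's precision.
    print(f"{label}: {fraction:.0%}")
```

Under these assumptions the three results round to 12%, 35% and 88%, which agrees with the figures given in the row.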

Access methods

Name	URL history	Meta-data search	Full-text search	Memento Compliance	Comments
EU Web Archive		Y	Y	Y	Freely accessible to all via
Australia's Web Archive	Y	Y	Y	No	Selected sites are publicly available through a directory structure. Domain harvests are not. The PANDORA Archive is indexed and searchable through the NLA's single search service Trove. The Australian Domain Harvests are full-text indexed but are not currently publicly available. The Australian Government Web Archive is searchable by URL and full-text indexes through its portal.
Our digital island, a Tasmanian Web Archive	Y	Y	N	No	Presents thumbnails generated through Html To Image supplemented in HTTrack. Information is organized in directory: A-Z Subject listing, A-Z Title listing.
Webarchive Austria	Y	N	Y	No	Versions can be searched either by URL or in full text. The websites are only accessible on special terminals at the Austrian National Library. A bookmarking feature allows users to save versions online and recall them at the library's web archive terminals.
Deutsche Nationalbibliothek	Y	Y	Y	No	Only accessible in the reading rooms of the German National Library. The metadata is included in the publicly accessible library catalogue.
DILIMAG	Y	Y	N	No	Metadata are publicly available; access to the archived versions is free or restricted depending on the rights holders' agreement. Full-text search is implemented in the new version.
Bibliothèque et Archives nationales du Québec	Y	N	N	No	Provides access according to partner policy.
Government of Canada Web Archive	Y	Y	Y	Proxy	Library and Archives Canada makes its federal government web archives publicly accessible. Indices are available for discovering Canadian federal web resources alphabetically by authoring organization and by URL. Full text indexing is based on Lucene.
Web Information Collection and Preservation - WICP		Y		No	Archive content is only available on the intranet of the National Library of China. Some collections are publicly available, with meta-data search and browsable by collection.
Croatian Web Archive	Y	Y	Y	Proxy	Full open access.
Webarchiv	Y	N	N	No	Due to copyright restrictions, only a limited number of archived websites for which agreements were signed with the publishers is available online. For other resources you can find out whether a given website was archived and the number of harvested versions. Unlimited access to all resources in Webarchiv is available from public terminals in the National Library.
Netarkivet.dk	Y	N	Y	No	Online access granted only to researchers through a Citrix login to free text search based on Solr and a proxy solution that accesses an archive through the Wayback. It has established a framework for running batch jobs with the possibility of data mining.
Estonian Web Archive	Y	Y	N	No	Public access to archived content is allowed only with the permission of the copyright owner. The full archive is accessible only to the web archive personnel.
Finnish Web Archive	Y	N	15% of material	No	URL search is available, but content is accessible on-site only. Full-text search is available for 15% of the material.
BnF - Web Legal Deposit	Y	N	15% of the collection	No	Accessible to authorized users through the reading rooms of the BnF Research Library located in Paris and Avignon and in partner libraries in regions and overseas territories. Wayback was customized and interface was translated to French. Full Text search only available on specific collections. Builds special collection galleries based on a selection from the archive on a given topic.
Ina (Institut National de l'Audiovisuel)	Y	Y	Y	No	Full text indexing is based on Lucene. To accommodate results from frequent crawls, clustering is used to group similar versions of pages.
E-diaspora	Y	N	N	No	1381 sites are currently crawled to build an archive on migrants' usage of the web. Social studies researchers have launched a long-running project based on this archive. is handling crawls and storage
Internet Memory Foundation	Y	Y	Y	No	Provides access and search services according to partners' policy.
Bibliotheksservice-Zentrum Baden-Württemberg	Y	Y	Y	Native	Archived websites accessible via Archive-It; integrated in the SWB union catalog. Full open access for major part of snapshots, some restricted by IP.
Web archive of the German Bundestag	Y	N	N	No	The web archive itself consists of snapshots of www.bundestag.de and other websites. Navigation is possible by clicking on the years.
Iceland
Palestine Web Archive	N	Y	N	No	Still in development and pilots
Web Archiving Project, The National Diet Library, Japan	Y	Y	Y	Native	All the archived websites are available on the premises. 85% of them are also accessible on the Internet with the permission of webmasters.
National Library of Korea - OASIS	Y	Y	Y	No	100% of the archive is indexed. Enables search by topic classification. Search available.
Koninklijke Bibliotheek	Y	N	N	No	The web archive is accessible on terminals in the KB reading rooms to full members.
New Zealand Web Archive	Y	Y	Y	Native	Domain harvests: available to selected staff using Pywb and limited to URL searches. Selective harvests: each website is described in the catalogue and can be viewed by the public via the Internet by clicking on the link to the archived copy. A small subset of the selective harvests are accessible using full-text search.
The National Library of Norway	N	Y		No	Sites are integrated in the Catalog. Left bar enables facet navigation with drill-down.
Arquivo.pt - the Portuguese web-archive	Y	Y	Y		A . is also supported. Archived data can be mined through an Hadoop platform or .
Web archive of Cacak	N	N	N	No	Plans to develop a search engine in the future. One drawback of HTTrack is that it renames files during archiving, so the original structure of the website is lost, as well as the file names.
Web Archive Singapore	Y	Y	Y	No	The collection is viewable at the National Library, Singapore, with selected content cleared by copyright owners available online.
Digital Resources	Y	Y	N	No	It is possible to find out whether a website was archived and how many harvested versions exist. Due to the copyright restrictions only a limited number of archived websites is publicly available. The access to other archived resources is available locally in the University Library in Bratislava.
Slovenian Web Archive	Y	N	Y	No	The archive of selective crawls is publicly accessible. Use is possible by browsing and full-text search. National domain crawls are not accessible yet but will be in the future.
Archivo de la Web Española	Y	Y	Y	No	Plan to provide access on-site in the short-medium term.
PADICAT: The Web Archive of Catalonia	Y	Y	Y	No	Full open access.
Basque Digital Heritage Archive	Y	Y	Y	No
Sweden	Y	N	N	No	Public access through dedicated machines in the library building.
Aleph Archives	Y	Y	Y	No	Enterprise-grade automatic web archiving platform for online capture and preservation. Supports eDiscovery. Aimed at corporations, institutions and agencies seeking to capture, preserve and leverage their Web content (dynamic websites, wikis, social media, forums, comments, disclaimers, and ads) for compliance, marketing or pure preservation purposes.
Web Archive Switzerland	Y	Y	Y	No	Web Archive Switzerland is the collection of the Swiss National Library containing websites with a bearing on Switzerland. It has been integrated into e-Helvetica, the access system of the Swiss National Library, which gives access to the entire digital collection. Full-text search is available for part of the Web Archive. Archived versions of websites can only be viewed in the reading rooms of the Swiss National Library and of the partner libraries that help build the collection of Swiss websites, but the metadata of the archived versions can be viewed from anywhere.
NTU Web Archiving System, NTUWAS	Y	Y	Y	No	Presents page thumbnails, archived pages mapped to geographical locations.
Web Archive Taiwan	Y	Y	Y	No
PageFreezer	Y	Y	Y	No	Enterprise Class On Demand service to archive and replay websites, blogs, Ajax, Flash, video, audio & social media for litigation protection, eDiscovery and regulatory compliance with FDA, FINRA, FSA, SEC, SOX, Federal Rules of Evidence and records management laws. Used by government agencies and public listed corporations in Pharmaceutical, Food, Finance, Healthcare and Retail industry.
The UK Web Archive	Y	Y	N
Hanzo Archives	Y	Y	Y	No	Commercial web archiving services and appliances. Access includes full-text search, annotations, redaction, URL/History, archive policy and temporal browsing, and configurable metadata schema for advanced e-discovery applications. Used in government and corporations whose compliance or legal obligations / needs extend to their websites, intranet, and social media. Many 'dark' archives across Europe and USA.
UK Government Web Archive	Y	Y	Y		Full text search is operational on the UK Government Web Archive. Users can browse the collection using a full A-Z list of all sites
EU Exit Web Archive	Y	Y	Y		Full text search is operational on the EU Exit Web Archive
Internet Archive	Y	Y	Y		URL history is available for all archived data. Meta-data and full-text search are available only for selected crawls. Until 2002 it had a mining platform for research composed of Alexa Shell Perl Tools, av_tools and the p2 platform for parallel processing. It was replaced by a simpler, direct access method that enables automatic access to files but provides no platform for processing.
Columbia University Libraries Web Resources Collection Program	Y	Y	Y	No	Accessible through the Archive-It service.
North Carolina State Government Web Site Archives	Y	Y	Y	No	Accessible through the Archive-It service.
Latin American Web Archiving Project	Y	Y	Y	No	Content can be accessed via full-text search, or by browsing by country or by specialized sample collection.
Web Archiving Project for the Pacific Islands	Y	Y	Y	No	Supported by the Archive-It service.
Library of Congress Web Archives	Y	Y	N	Proxy	Access provided via . Records in MODS format.
Harvard University Library: the Web Archive Collection Service	Y	Y	Y	No
Web Archiving Service from California Digital Library	Y	Y	Y	No	Access for private study, scholarship and research. Most archives built with WAS have not yet been published because it is up to the partners to decide if they want to provide access. There are 16 partners using the service and they have created over 80 web archives, only 30 are publicly accessible. NutchWAX performance did not permit full archive search. Upcoming transition to SOLR will permit both full archive and collection-specific full text search.
Bentley Historical Library Web Archives	Y	Y	Y	No	Powered by the WAS from the California Digital Library. Access is public but usage is restricted for private study, scholarship and research.
University of Texas at San Antonio Web Archives	Y	Y	Y	Native	Accessible through the Archive-It service and the Texas Archival Repositories Online database.
AUEB Web Archive	Y	Y	Y	No
World Bank Web Archives	Y	Y	Y	No	URL history provided via open access to collection via standard web browser. Full text search is only available within each individual site. Search on metadata is available via advanced search within Web Archives collection.
University of North Texas CyberCemetery	N	Y	Y	No
Tamiment Library and Robert F. Wagner Labor Archives at New York University	Y	Y	Y	No	Access is provided through the WAS service as well as through finding aids that are searchable through NYU's finding aids portal.
York University Digital Library	Y	Y	Y
Netherlands Institute for Sound and Vision web archive		Y	Y	No	Selected sites for which agreements have been made are publicly available. Full text indexing is done with Elasticsearch; the front-end is built in Drupal.
Kentucky Department for Libraries and Archives	Y	Y	Y	No	Full open access
University of California, San Francisco Library	Y	Y	Y	Native	Both capture and access for archived content are provided by the Archive-It service, so all capabilities are the same as for Archive-It.
Ivy Plus Libraries	Y	Y	Y	No	Accessible through Archive-It service.
Malaysian Government Web Archive	Y	Y	Y	No	Open Access
National Library of Medicine	Y	Y	Y		Access is provided through Archive-It
Smithsonian Libraries and Archives	Y	Y	Y		Access is provided through Archive-It
Common Crawl	Y	Y	N	No	In addition to direct download, most of the archive is also available in the Internet Archive Wayback Machine.
	Y	Y	Y	Native	Full-text index across legacy markup, archived code fragments, and emulated interface states. Supports URL history reconstruction and metadata-based query expansion. Public search tools include URL timeline view and UI emulator access. Complies with the Decentralized Archival Ethics Accord.