Internet Archive
The Internet Archive is an American non-profit library founded in 1996 by Brewster Kahle that runs a digital library website, archive.org. It provides free access to collections of digitized media including websites, software applications, music, audiovisual, and print materials. The Archive also advocates a free and open Internet. Its stated mission is to provide "universal access to all knowledge".
The Internet Archive allows the public to upload and download digital material to its data cluster, but the bulk of its data is collected automatically by its web crawlers, which work to preserve as much of the public web as possible. The Wayback Machine, its web archive, contains more than 1 trillion web captures. The Archive also oversees numerous book digitization projects, collectively one of the world's largest book digitization efforts.
History
Brewster Kahle founded the Archive in May 1996, around the same time that he began the for-profit web crawling company Alexa Internet. The earliest known archived page on the site, the download page for Internet Explorer, was saved on May 10, 1996, at 14:42 UTC. By October of that year, the Internet Archive had begun to archive and preserve the World Wide Web at large scale. The archived content became more easily available to the general public in 2001, through the Wayback Machine.
In late 1999, the Archive expanded its collections beyond the web archive, beginning with the Prelinger Archives. The Internet Archive now includes texts, audio, moving images, and software. It hosts a number of other projects: the NASA Images Archive, the contract crawling service Archive-It, and the wiki-editable library catalog and book information site Open Library. Soon after that, the Archive began working to provide specialized services relating to the information access needs of the print-disabled; publicly accessible books were made available in a protected Digital Accessible Information System format.
In August 2012, the Archive began adding BitTorrent to its file download options.
On November 6, 2013, the Internet Archive's headquarters in San Francisco's Richmond District caught fire, destroying equipment and damaging some nearby apartments. According to the Archive, it lost a side-building housing one of its 30 scanning centers; cameras, lights, and scanning equipment worth hundreds of thousands of dollars; and "maybe 20 boxes of books and film, some irreplaceable, most already digitized, and some replaceable". The nonprofit Archive sought donations to cover the estimated $600,000 in damage. An overhaul of the site was launched as a beta in November 2014, and the legacy layout was removed in March 2016.
In November 2016, Kahle announced that the Internet Archive was building the Internet Archive of Canada, a copy of the Archive to be based somewhere in Canada. The announcement received widespread coverage due to the implication that the decision to build a backup archive in a foreign country was prompted by the upcoming presidency of Donald Trump. Since 2017, OCLC and the Internet Archive have collaborated to make the Archive's records of digitized books available in WorldCat.
Since 2018, the Internet Archive visual arts residency, organized by Amir Saber Esfahani and Andrew McClintock, has helped connect artists with the Archive's more than 48 petabytes of digitized materials. Over the course of the yearlong residency, visual artists create a body of work which culminates in an exhibition. The hope is to connect digital history with the arts and create something for future generations to appreciate online or off. Previous artists in residence include Taravat Talepasand, Whitney Lynn, and Jenny Odell.
The Internet Archive acquires most materials from donations, such as hundreds of thousands of 78 rpm discs from Boston Public Library in 2017, a donation of 250,000 books from Trent University in 2018, and the entire collection of Marygrove College's library after it closed in 2020. All material is then digitized and retained in digital storage, while a digital copy is returned to the original holder and the Internet Archive's copy, if not in the public domain, is lent to patrons worldwide one at a time under the controlled digital lending theory of the first-sale doctrine.
On June 1, 2020, four large publishing houses – Hachette Book Group, Penguin Random House, HarperCollins, and John Wiley – filed a lawsuit against the Internet Archive before the United States District Court for the Southern District of New York, claiming that the Internet Archive's practice of controlled digital lending constituted copyright infringement. On March 25, 2023, the court found in favor of the publishers. The negotiated judgment of August 11, 2023, barred the Internet Archive from digitally lending books for which electronic copies are on sale.
Also on August 11, 2023, the music industry giants Universal Music Group, Sony Music and Concord sued the Internet Archive before the same United States District Court for the Southern District of New York over the Internet Archive's Great 78 Project for $621 million in damages from alleged copyright infringement. The lawsuit was settled in September 2025.
In September 2024, Google and the Internet Archive announced a collaboration in which links to the Wayback Machine would be included in the 'more about this page' menu in Google Search. This collaboration effectively replaced Google's own Google Cache service, which it had retired earlier that year. On July 24, 2025, the Internet Archive was designated as a Federal Depository Library by the U.S. Senate, allowing it to store public access government records. It opened a new headquarters for its European branch on September 19, 2025.
2024 cyberattacks
During the week of May 27, 2024, the Internet Archive suffered a series of distributed denial of service attacks that made its services unavailable intermittently, sometimes for hours at a time, over a period of several days. The attack was claimed on May 28 by a hacker group called SN_BLACKMETA, with possible links to Anonymous Sudan. The incident drew a comparison with the 2023 British Library cyberattack, which affected the UK Web Archive.
Beginning October 9, 2024, the Internet Archive's team, including archivist Jason Scott and security researcher Scott Helme, confirmed DDoS attacks, site defacement, and a data breach. The purported hacktivist group SN_BLACKMETA again claimed responsibility. A pop-up on the defaced site claimed that there was a "catastrophic" security breach, stating "Have you ever felt like the Internet Archive runs on sticks and is constantly on the verge of suffering a catastrophic security breach? It just happened. See 31 million of you on HIBP!" It was reported that about 31 million user accounts were affected, with the compromised data contained in a file called "ia_users.sql", dated September 28, 2024. The attackers stole users' email addresses and bcrypt-hashed passwords.
On October 11, Kahle said that the data was safe and that the service would be brought back to normal "in days, not weeks." On October 13, the Wayback Machine was restored in a read-only format, while archiving web pages was temporarily disabled. On October 14, Kahle said "volume is back to normal: 1,500 requests per second". On October 15, 2024, the website was still mostly offline, with the Archive "prioritizing keeping data safe at the expense of service availability."
On October 20, threat actors used stolen, unrotated API tokens to breach the Internet Archive's Zendesk email support platform; they also claimed responsibility for the other breaches, yet stated that SN_BLACKMETA was behind only the DDoS attacks. Having been told that the threat actors had leaked some stolen data to others in the data-trafficking community, Bleeping Computer posited that they breached the "well-known and extremely popular" Internet Archive not to extort money but to "gain cyber street cred," thus "increasing their reputation."
On October 21, the Internet Archive went back online in a read-only manner. On October 22, all Internet Archive services temporarily went offline; later that same day, only the Wayback Machine, Archive-It, and blog.archive.org resumed. On October 23, archive.org, the Wayback Machine, Archive-It, and the Open Library services all resumed, but some features, such as logging in, were still unavailable; staff announced that they would return within the next day or two. On October 25, the login feature was made available, and the site has remained active since.
Operations
The Archive is a 501(c)(3) nonprofit operating in the United States. In 2019, it had an annual budget of $37 million, derived from revenue from its Web crawling services, various partnerships, grants, donations, and the Kahle-Austin Foundation. The Internet Archive also manages periodic funding campaigns. For instance, a December 2019 campaign had a goal of reaching $6 million in donations. Its website servers run the Ubuntu operating system.
The Archive is headquartered in San Francisco, California. From 1996 to 2009, its headquarters were in the Presidio of San Francisco, a former U.S. military base. Since 2009, its headquarters have been at 300 Funston Avenue in San Francisco, a former Christian Science Church. At one time, most of its staff worked in its book-scanning centers; as of 2019, scanning is performed by 100 paid operators worldwide. The Archive also has data centers in three Californian cities: San Francisco, Redwood City, and Richmond. To reduce the risk of data loss, the Archive creates copies of parts of its collection at more distant locations, including the Bibliotheca Alexandrina in Egypt and a facility in Amsterdam.
As of 2025, the Internet Archive reportedly operates six data centers, mainly in California, with smaller ones in other U.S. states, Canada and Europe. They have controlled access and fire protection systems, and are monitored for security. All Internet Archive data centers adhere to the ISO/IEC 27001 standard, and some of them meet additional certifications.
Also in 2025, it was reported that copies of the archive are kept in locations around the world, as a protection against possible disasters. In 2016, all redundancy was provided by RAID-like paired storage, with the two copies usually stored at different data centers, while backups were not a regular practice at the time.
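The paired arrangement described above, in which each item has two copies that usually sit in different data centers, can be illustrated with a minimal sketch. This is not the Archive's actual placement logic, which is not described here; the data center names, the hash-based selection, and the functions place_pair and readable_copy are all assumptions made for illustration.

```python
import hashlib
from typing import Collection, Optional, Tuple

# Hypothetical data center names, used only for illustration.
DATA_CENTERS: Tuple[str, ...] = ("dc-a", "dc-b", "dc-c")


def place_pair(item_id: str,
               centers: Tuple[str, ...] = DATA_CENTERS) -> Tuple[str, str]:
    """Choose two distinct data centers to hold the two copies of an item."""
    if len(centers) < 2:
        raise ValueError("paired storage needs at least two data centers")
    # A stable hash of the item identifier makes the placement repeatable.
    digest = int(hashlib.sha256(item_id.encode("utf-8")).hexdigest(), 16)
    primary = digest % len(centers)
    # An offset between 1 and len(centers) - 1 guarantees that the
    # mirror never lands in the same data center as the primary.
    offset = 1 + (digest // len(centers)) % (len(centers) - 1)
    mirror = (primary + offset) % len(centers)
    return centers[primary], centers[mirror]


def readable_copy(pair: Tuple[str, str],
                  offline: Collection[str]) -> Optional[str]:
    """Return the first data center in the pair that is online, else None."""
    for center in pair:
        if center not in offline:
            return center
    return None
```

With this layout, losing either data center in a pair still leaves one readable copy, but losing both leaves none, which is why mirroring of this kind is usually distinguished from a separate backup.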
In 2016, the Internet Archive began working on a decentralized prototype of the digital library. In 2020, content from the Internet Archive began to be stored on Filecoin. By October 2023, one petabyte of data had been uploaded to the Filecoin network. The Archive is a member of the International Internet Preservation Consortium and was officially designated as a library by the state of California in 2007.