News publishers are restricting access to the Internet Archive’s Wayback Machine.
The New York Times, CNN, USA Today, and The Guardian are among at least 241 news organizations across nine countries that have begun to restrict the Internet Archive's crawlers, a move the Archive's director has described as “collateral damage” in a conflict that isn't primarily about the Archive at all.
Since 1996, the Internet Archive has preserved more than one trillion web pages. It is frequently cited by courts, used by journalists to document edits made after publication, and regarded by historians as a vital primary source. By most standards, it is one of the most important public information infrastructure projects of the internet age.
Now it faces systematic blocking by the very publishers whose work it has archived, driven by a legitimate concern: AI firms are using archived news content to train their models without authorization or compensation.
An analysis by the AI-detection company Originality AI indicates that 23 major news outlets are blocking ia_archiver, the primary web crawler the Internet Archive uses for the Wayback Machine. In total, at least 241 news sites in nine countries have expressly prohibited access to at least one of the Archive's four crawling bots. USA Today Co., the largest newspaper publisher in the US, accounts for a substantial share of the blocked sites, effectively erasing hundreds of local publications from the historical record.
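Blocking of this kind is typically declared in a site's robots.txt file, which tells compliant crawlers which user agents may fetch pages. A minimal sketch of what such a rule might look like follows; ia_archiver and archive.org_bot are user agents the Archive is known to use, though any individual publisher's actual rules may differ:

    # robots.txt: refuse the Internet Archive's crawlers site-wide
    User-agent: ia_archiver
    Disallow: /

    User-agent: archive.org_bot
    Disallow: /

    # All other crawlers keep their normal access
    User-agent: *
    Disallow:

Because robots.txt is purely advisory, rules like these only stop cooperative crawlers such as the Archive's; scrapers that ignore the convention are unaffected.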
The New York Times implemented a so-called “hard block” around late 2025, according to Mark Graham, the director of the Wayback Machine. The news organizations' argument is coherent, but its repercussions are concerning. Companies that develop AI language models require large amounts of high-quality text, and archived news content fits that description perfectly: structured, dated, professionally edited writing gathered over many years. The Wayback Machine serves vast quantities of this content through an API and a URL interface, making it an attractive source for training AI models.
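To illustrate how accessible that content is, the Wayback Machine's public CDX API returns a structured, timestamped index of every capture of a given URL. Below is a minimal sketch in Python using only the standard library; the article URL is a placeholder, not a real page:

    import json
    import urllib.parse
    import urllib.request

    # Ask the Wayback Machine's CDX API for recent captures of a page.
    # The endpoint and output format are part of the Archive's public
    # interface; the target URL below is a placeholder.
    params = urllib.parse.urlencode({
        "url": "example.com/2024/some-article",
        "output": "json",
        "limit": "5",
    })
    endpoint = "https://web.archive.org/cdx/search/cdx?" + params

    with urllib.request.urlopen(endpoint) as resp:
        rows = json.load(resp)

    # The first row is a header; each later row describes one capture,
    # including its timestamp and original URL. This is the structured,
    # dated index that makes bulk collection straightforward.
    if rows:
        header, captures = rows[0], rows[1:]
        for capture in captures:
            record = dict(zip(header, capture))
            print(record["timestamp"], record["original"])

A few lines like these are all it takes to enumerate archived copies at scale, which is precisely the access pattern publishers are worried about.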
A 2023 analysis by The Washington Post revealed that data from the Internet Archive was included in major AI training datasets. For publishers already involved in copyright litigation against OpenAI, Perplexity, and others, the Archive constitutes a vulnerability in their defenses. “The issue is that Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with us,” stated Graham James, a spokesperson for the Times. “The Times invests a significant amount of resources in creating original journalism, and this work should not be utilized without our consent.”
The Guardian has taken a more cautious approach, limiting rather than completely blocking the Archive's access after discovering it was frequently crawled. Robert Hahn, head of business affairs at The Guardian, voiced particular concerns regarding the Archive’s APIs. “Many of these AI businesses are seeking easily accessible, structured databases of content,” he noted. “The Internet Archive’s API would have been a logical target for them to connect their machines to and extract the IP.”
Mark Graham has been blunt about the Archive's position: “We are collateral damage.” The Archive has also taken countermeasures, limiting bulk downloads and maintaining controls that restrict large-scale automated extraction of any single site's material. Graham argues that these safeguards undercut the publishers' rationale for blocking its crawlers: the real risk, in his telling, is AI companies pulling archived content out through the Archive's interfaces at scale, which those controls are designed to prevent, not the Archive crawling and storing the content in the first place.
The Archive is actively engaging with publishers to devise workable solutions. The Guardian noted it has been “collaborating directly with the Internet Archive” to establish these access limits instead of imposing a unilateral hard stop. However, the Archive's assertion that it is a neutral preservation institution, rather than an AI training resource, does not fully alleviate the publishers’ concerns regarding third parties accessing its data, irrespective of the Archive’s intentions.
The repercussions of blocking the Archive's crawlers reach far beyond AI companies. Once a news article is no longer archived, it can be altered without accountability. Publishers can and do make quiet revisions to stories after publication, correcting mistakes, softening assertions, or removing quotes, and journalists often rely on the Wayback Machine as the primary tool for documenting such changes. Joe Mullin of the Electronic Frontier Foundation emphasized the stakes: “The Internet Archive frequently serves as the sole source for observing those changes. There are genuine disputes regarding AI training that need to be settled in court. However, jeopardizing the public record to resolve those issues could prove to be a significant, and potentially irreversible, error.”
Wikipedia links to over 2.6 million news articles preserved by the Wayback Machine across 249 languages. Courts have referenced archived pages as evidence, and journalists have used them to demonstrate modifications that government agencies made to official statements after initial publication. USA Today Co.'s decision to block access has effectively erased hundreds of local newspapers from the historical record at a time when local journalism is already struggling, and every archived copy may be the only version of a story that survives.
