Judge William Alsup's decision in Bartz v Anthropic PBC, handed down in the United States District Court for the Northern District of California (the Court), now stands as arguably the most significant U.S. ruling yet on whether artificial intelligence developers may lawfully use copyright-protected works to train their models.
In a judgment that draws a definitive line between permissible transformative training use and prohibited unauthorised acquisition, the Court dissected the mechanics of AI training with rare granularity, treating each stage of use on its own merits.
The Dispute and Anthropic's Data Pipeline
At the heart of the dispute was Anthropic's method of sourcing books to train its Claude model: scanned copies of legitimately purchased physical books were combined with millions of others allegedly downloaded from notorious pirate sources such as Books3 and LibGen, all funnelled into what the company called a “generalised data area” before subsets were selected for training.

When three authors claimed that their works had been swept up without permission, Anthropic moved for summary judgment, asserting that every element of its data pipeline qualified as fair use under Section 107 of the U.S. Copyright Act (a doctrine we do not share on this side of the Atlantic). The Court delivered a split verdict that drew sharp lines: training a language model on lawfully obtained books was “spectacularly transformative” and therefore protected; scanning purchased print books for internal search was also fair use; but acquiring and hoarding pirated copies could not be excused under any fair use defence.
Transformative Use and the Human Analogy
In reaching this view, the Court embraced the principle that using copyright works for training is fundamentally transformative when the material is lawfully sourced, likening the statistical modelling process to a human's ability to read widely and write anew. That analogy has attracted criticism from those who argue that comparing a neural network's token predictions to human cognition oversimplifies the technical realities. Although the authors argued that the Claude model memorised creative elements wholesale, the Court held that the mere potential for fragmentary reproduction did not disqualify the training process from fair use protection absent evidence that infringing output had actually been generated.
The Compressed Copies Controversy
Of note is the Court's acceptance of the claim that a model might contain “compressed copies” of its training data, a characterisation that has been heavily criticised and that many experts reject outright, on the basis that a large language model stores statistical relationships between tokens, not miniature libraries of digital books. While the ruling's fair use outcome did not hinge on this technical characterisation, its persistence in the record risks sowing confusion in future disputes about how models actually handle source material.
Pirated Material and the Provenance Principle
Where the Court was unyielding was in its condemnation of pirated training inputs. Having downloaded millions of books from well-known pirate sites and stored them as part of its training corpus, Anthropic could not shield its conduct behind a transformative purpose. Fair use, the Court made plain, does not wipe away the consequences of unlawful acquisition, nor does it entitle an AI developer to sidestep legitimate markets in the name of innovation. For the industry, the message is clear: no matter how defensible the output, an AI system's legitimacy can be fatally undermined if its inputs are tainted at the source. This may have deeper implications for Article 53 of the EU AI Act, which requires general-purpose AI model providers to put in place a policy to comply with EU copyright law, including the text and data mining opt-outs reserved under the Copyright in the Digital Single Market Directive, when training their models.
If an AI model provider ignores EU text and data mining opt-outs, would a U.S. court treat the resulting data as unlawfully obtained for the purposes of this judgment where the material was acquired from EU servers but the model was trained in the U.S.?
Scanning Legitimate Copies for Internal Use
Anthropic did manage to salvage its defence with respect to the scanning of its own lawfully purchased books, a practice the Court accepted as fair use because it served purely internal, practical functions and did not create new market substitutes. In this regard, the judgment reinforces that format-shifting for internal efficiencies, when carefully circumscribed, can remain lawful under U.S. copyright doctrine.
Cross-Border Rights Reservations and Future Tensions
Yet the ruling also throws an emerging tension with European law into sharper relief. Under the EU's Copyright in the Digital Single Market Directive, rightsholders can reserve their rights against text and data mining for commercial purposes. If a U.S.-based company circumvents such reservations by scraping European works or sidestepping opt-outs under Article 4, American courts could treat that conduct as functionally equivalent to piracy. The judgment leaves open the possibility that acquiring material in breach of non-U.S. rights reservations, even where the end use is technically transformative, could taint the entire process under U.S. law, because the decisive test turns on how the input was sourced, not just how it is processed.
A Clear Marker for the AI Industry
With that, the decision plants a marker for what U.S. courts may and may not permit in an era of expansive AI development. While far from the final word on AI and copyright, Bartz v Anthropic offers the clearest sign yet that companies must be prepared not only to defend what their models can do but also to prove precisely how and from where their training data was acquired, an evidential burden likely to shape how datasets are licensed, documented and litigated in the years to come.
The content of this article is intended to provide a general guide to the subject matter. Specialist advice should be sought about your specific circumstances.