Can artificial intelligence (AI) developers train their models on copyrighted works without obtaining prior permission? This question has weighed heavily on creative industries and AI developers alike since the rise of generative AI.
In recent decisions, two judges of the U.S. District Court for the Northern District of California offered significant guidance on the application of the Fair Use Doctrine to large language model (LLM) training. The courts in Bartz v. Anthropic and Kadrey v. Meta both held that using copyrighted books to train LLMs may constitute fair use, provided the works are lawfully acquired.
However, as highlighted by the Meta court, concerns remain that the use of creative works in LLM training could undermine cumulative human creativity and erode the very intellectual property rights the Fair Use Doctrine was meant to protect.
An Overview of the Fair Use Doctrine
Creators of original works of authorship generally hold copyright protection in those works, including the exclusive rights to reproduce the original work, to prepare derivative works based on it, and to sell copies of it. These rights are enjoyed by authors of books, such as those at issue in Anthropic and Meta, as well as by the creators of the software behind the large language models that analyzed them.
These rights, however, are not absolute. The Fair Use Doctrine, codified in 17 U.S.C. § 107, aims to balance the interests of the creators of original works with those of other creators who naturally rely on existing works for creative inspiration in making works of their own. There are four factors to be considered when determining whether a use of a copyrighted work is a fair use:
- The purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
- The nature of the copyrighted work;
- The amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
- The effect of the use upon the potential market for or value of the copyrighted work.
Courts may also consider whether the challenged use is "transformative." While the concept has been extensively debated, the core idea is that for a use to be transformative "the use must be productive and must employ the quoted matter in a different manner or for a different purpose from the original." Pierre N. Leval, Toward a Fair Use Standard, 103 Harv. L. Rev. 1105 (1990).
Creative Authors Take on AI
The Bartz v. Anthropic Decision
In August 2024, Anthropic, which developed the Claude AI model, was sued by three authors who alleged the company used their copyrighted books without permission. The case centered on two distinct methods Anthropic allegedly used to acquire books for training its model and building a digital library: (1) purchasing physical copies and scanning them, and (2) downloading millions of books from "pirate" sites such as Library Genesis (LibGen) and Pirate Library Mirror (PiLiMi). Bartz v. Anthropic PBC, 3:24-cv-05417 at 3:6, (N.D. Cal.).
Anthropic argued that its actions were protected by fair use because they were "transformative." In particular, the books were being used to teach an AI system how to generate new, original content, as opposed to being redistributed. The plaintiffs argued this was simply large-scale copyright infringement disguised as innovation.
On June 23, 2025, the court granted summary judgment for Anthropic, ruling that training Claude on legally purchased books constituted fair use. In the decision, Judge Alsup stressed that AI training is "quintessentially transformative," noting Claude does not reproduce the books but instead learns linguistic patterns and styles to generate new content, much like human writers learn from reading:
Anthropic's LLMs have not reproduced to the public a given work's creative elements, nor even one author's identifiable expressive style (assuming arguendo that these are even copyrightable). Yes, Claude has outputted grammar, composition, and style that the underlying LLM distilled from thousands of works. But if someone were to read all the modern-day classics because of their exceptional expression, memorize them, and then emulate a blend of their best writing, would that violate the Copyright Act? Of course not.
(id., 12:24-13:2).
With regard to Anthropic's practice of digitizing purchased print books, including fiction and non-fiction, Judge Alsup found such use permissible because it merely facilitated training and organization, and not distribution or creation of new copies: "the digitization of the books purchased in print form by Anthropic was... a fair use because all Anthropic did was replace the print copies it had purchased for its central library with more convenient space-saving and searchable digital copies for its central library — without adding new copies, creating new works, or redistributing existing copies." (id., 9:25-19).
The court further held that the market impact would be negligible, comparing the process to teaching students how to write. Claude, he noted, does not compete with or replace the original books; instead, users engage with it for tasks unlike reading the authors' works: "Authors' complaint is no different than it would be if they complained that training schoolchildren to write well would result in an explosion of competing works. This is not the kind of competitive or creative displacement that concerns the Copyright Act." (id., 28:11-14). In his decision, Judge Alsup compares the use of the books to a person learning from them:
...Authors cannot rightly exclude anyone from using their works for training or learning as such. Everyone reads texts, too, then writes new texts. They may need to pay for getting their hands on a text in the first instance. But to make anyone pay specifically for the use of a book each time they read it, each time they recall it from memory, each time they later draw upon it when writing new things in new ways would be unthinkable. For centuries, we have read and re-read books. We have admired, memorized, and internalized their sweeping themes, their substantive points, and their stylistic solutions to recurring writing problems. (id., 12:14-21).
When addressing the books acquired from pirate sites, the court took a sharply different view and refused to extend fair use to that content. In denying summary judgment, the court held "[t]his order doubts that any accused infringer could ever meet its burden of explaining why downloading source copies from pirate sites that it could have purchased or otherwise accessed lawfully was itself reasonably necessary to any subsequent fair use." (id., 18:25-27).
The Meta Decision: Market Harm
In July 2023, thirteen published authors sued Meta Platforms over its alleged use of their copyrighted books. Similarly to Anthropic, Meta downloaded the materials from "shadow libraries" and used some portion of the torrented data to train its LLM, Llama. Kadrey v. Meta Platforms, Inc., 3:23-cv-03417 at 11-13, (N.D. Cal.)
Just two days after the ruling in Anthropic, on June 25, 2025, the Meta court reached the same ultimate conclusion, largely aligning with Anthropic's fair use analysis with respect to LLMs. However, Judge Chhabria emphasized different concerns. While he ruled in Meta's favor regarding its Llama training, he disagreed with the reasoning used by the court in Anthropic and highlighted market harm as the most important factor going forward.
Judge Chhabria agreed that Meta's use of copyrighted books to train Llama was transformative and therefore fair use. However, he took issue with Anthropic's comparison of AI training to human learning, stating "[u]sing books to teach children to write is not remotely like using books to create a product that a single individual could employ to generate countless competing works with a miniscule fraction of the time and creativity it would otherwise take."
Judge Chhabria also introduced what could become a central issue in future AI copyright cases: market dilution. He expressed concern that AI systems could create an "endless stream of competitors" that might "obliterate" demand for original creative works. Judge Chhabria explained that while the plaintiffs in this case failed to provide adequate evidence of market harm, future cases with better documentation of economic impact could reach different conclusions. He specifically noted that "a better record of dilution or market harm could prevail in other cases."
Other Important Cases to Monitor
Several other pending cases could further impact the legal landscape around AI training and copyright.
NYTimes v. OpenAI, 1:23-cv-11195, (S.D.N.Y.)
In NYTimes v. OpenAI, the New York Times (NYT) alleges copyright infringement and trademark dilution arising from the use of copyright-protected works to train OpenAI's large language models, which power its ChatGPT engine. The NYT further alleges that ChatGPT can reproduce its articles nearly verbatim and sometimes generates false information attributed to the newspaper.
If courts find that AI systems can infringe copyright through their responses to users, it could establish important limits on what AI models can generate, regardless of whether their training was fair use. The case also raises serious reputational harm concerns when AI systems generate misleading content attributed to real news organizations.
Getty Images v. Stability AI, 1:23-cv-00135, (D. Del.)
Getty alleges that millions of stock photos (bearing the Getty watermark) were used to train the Stable Diffusion system without permission.
A notable element of this case is that the AI-generated images at issue directly compete with Getty's core business of licensing stock photography, making the market harm argument potentially more significant.
California's AI Legislation
While courts are still working to define the boundaries of fair use for AI training, California is addressing AI-training transparency issues through legislation. Assembly Bill 2013 (AB 2013)—signed into law in September 2024 and taking effect on January 1, 2026—requires developers of generative AI systems to disclose basic information about the datasets used to train their models.
Any company that makes a generative AI system available to users in California must publicly post the following information on its website:
- Data Source Summary: A high-level overview of the datasets used in training, including how and where they were obtained.
- Licensing Status: Whether the data was licensed, purchased, scraped from public sources, or otherwise acquired.
- IP Content: Whether the datasets include material protected by copyright, trademark, or patent law.
- Date of Use: When each dataset was first used during the model's development.
- Synthetic Data Disclosure: Whether synthetic (AI-generated) data was used during training.
The law applies to any generative AI system released or made available since January 1, 2022, including both paid and free products. That means developers will need to disclose training data practices even for legacy models—many of which were built before clear documentation was common.
Conclusion: A New Age of Intellectual Property
The development of LLMs and deep neural networks in recent years has given rise to questions regarding the nature of authorship and inventorship with respect to intellectual property. While the Meta court may have disagreed with the Anthropic court's analogy comparing LLM training to children learning to write, it ultimately ruled that the training usage was fair use.
Despite the detailed and as-yet-unchallenged legal analysis, the decisions may, from a broader philosophical vantage point, overlook the larger picture: the purpose of the Fair Use Doctrine.
The Fair Use Doctrine exists for "balancing the interests of pioneering authors and those who use their work as an input for cumulative creativity, and as a safety valve for freedom of expression." Neil Weinstock Netanel, Locating Copyright within the First Amendment Skein, 54 Stan. L. Rev. 1 (2001). That is, the Fair Use Doctrine protects authors who rely on existing works for inspiration in creating their own works, an act naturally done by the human mind.
The courts' decisions in Anthropic and Meta imply an equivalence between a human creator and AI tools. Authors and inventors have used tools to create their work since time immemorial. Never before, however, has there been a tool that can independently create original works on the scale demonstrated by artificial intelligence. It appears, then, that the Meta court's insistence on evidence of actual market dilution was apt: after all, in the time it takes to read this article, an AI model like Claude or Llama could generate an illustrated children's book that mimics the writing style of a well-known literary author or even a songwriter.