ARTICLE
20 April 2026

Verbatim Recall In LLMs: A New Study Raises Important Questions For Trade Secret Protection

Beck Reed Riden

Contributor


By Sarah Tishler

For years, AI companies have told courts, regulators, and the public the same thing: their models don’t store copies of training data. OpenAI put it plainly to the U.S. Copyright Office in 2023: “the models do not store copies of the information that they learn from.” Google said much the same, as did numerous commentators. This prevailing wisdom has been the basis of numerous significant legal decisions in the copyright space. See, e.g., Tremblay v. OpenAI, Inc., 716 F. Supp. 3d 772, 778 (N.D. Cal. 2024) (dismissing vicarious copyright infringement claim, holding, “Distinctly, Plaintiffs here have not alleged that the ChatGPT outputs contain direct copies of the copyrighted books.”).

A new preprint from researchers at Stony Brook University, Carnegie Mellon, and Columbia Law School suggests those assurances were wrong: LLMs can reproduce verbatim content from books they previously ingested, which leads to the conclusion that copies of the information have in fact been stored.

The Experiment

The researchers designed a finetuning task with a deceptively simple setup: take a copyrighted book, break it into 300-500 word excerpts, generate a plot summary of each excerpt, and train a model to expand those summaries back into full text. The task looks completely legitimate, and is the sort of thing a commercial writing assistant might do. No actual book text appears at inference time. The model receives only a semantic description of what happens in a passage, and is asked to write it out.
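For readers who want a concrete picture of that pipeline, it can be sketched in a few lines of Python. This is our illustration rather than the researchers' code, and `summarize()` is a placeholder for the LLM call that produces each plot summary:

```python
# Sketch of the study's finetuning data pipeline (our reconstruction,
# not the authors' code). summarize() stands in for an LLM call.

def chunk_words(text, max_words=500):
    """Split a book into consecutive excerpts of up to max_words words
    (the paper uses excerpts of roughly 300-500 words)."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

def summarize(excerpt):
    """Placeholder: in the study, a model writes a plot summary here."""
    return "Plot summary of: " + excerpt[:60] + "..."

def build_training_pairs(book_text):
    """Each training example maps a semantic summary (the input) back to
    the full excerpt (the target) -- a seemingly benign expansion task."""
    return [
        {"prompt": summarize(excerpt), "completion": excerpt}
        for excerpt in chunk_words(book_text)
    ]
```

At inference time, only the summary side of each pair is shown to the model; the question the paper asks is whether the completion comes back verbatim.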

The output was verbatim reproduction of the source text.

Across 81 copyrighted works by 47 contemporary authors, ranging from The Handmaid’s Tale to Sapiens to Twilight, finetuned versions of GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 reproduced up to 85-90% of a held-out book’s content, with single verbatim spans exceeding 460 words. Before finetuning, aligned models produced almost no verbatim content from the same prompts.
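Overlap figures like these can be quantified with a simple word-level longest-common-substring measure. The sketch below is a generic way to compute the longest verbatim span shared between a source passage and a model's output; it is our illustration, not the paper's exact metric:

```python
def longest_verbatim_span(source, output):
    """Length, in words, of the longest run of consecutive words that
    appears verbatim in both texts (word-level longest common substring
    computed by dynamic programming)."""
    a, b = source.split(), output.split()
    best = 0
    prev = [0] * (len(b) + 1)  # match lengths ending at previous a-word
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the matching run
                best = max(best, cur[j])
        prev = cur
    return best
```

Applied to a model completion and the corresponding book excerpt, a return value in the hundreds would correspond to the multi-hundred-word verbatim spans the paper reports.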

The most striking finding is the cross-author result. The researchers finetuned GPT-4o exclusively on Haruki Murakami’s novels, then tested it on Cormac McCarthy, Ta-Nehisi Coates, Suzanne Collins, and dozens of others. Finetuning on Murakami unlocked memorized content from authors he has nothing to do with. In some cases, the model reproduced over 80% of a completely unrelated book it had never seen during finetuning. The same result held when the researchers used Virginia Woolf’s public-domain novels as training data, but not when they used synthetic text. The conclusion is difficult to avoid: the books were already encoded in the weights from pretraining, and finetuning reactivated the retrieval pathway.

Why This Matters Beyond Copyright

The copyright implications are significant, and the paper’s legal section, co-authored by Columbia Law’s Jane Ginsburg, is worth reading carefully for practitioners in that space. But the trade secret implications deserve attention as well.

As noted above, courts evaluating fair use have looked at whether AI models can reproduce copies of the ingested works. For example, the Bartz v. Anthropic and Kadrey v. Meta decisions conditioned favorable fair use outcomes partly on the absence of evidence that models reproduce source works. See Bartz v. Anthropic PBC, 787 F. Supp. 3d 1007, 1018 (N.D. Cal. 2025) (“Authors do not allege that any infringing copy of their works was or would ever be provided to users by the Claude service . . . But Claude created no exact copy, nor any substantial knock-off. Nothing traceable to Authors’ works.”); Kadrey v. Meta Platforms, Inc., 788 F. Supp. 3d 1026, 1036 (N.D. Cal. 2025) (“They contend that Llama is capable of reproducing small snippets of text from their books. . . . As explained below, both of these arguments are clear losers. Llama is not capable of generating enough text from the plaintiffs’ books to matter . . .”). This paper provides exactly that evidence, at scale and across multiple providers.

For trade secret practitioners, there are also significant implications. If an LLM ingested your client’s confidential documents, whether through a training pipeline, through employees using consumer AI tools, or through any of the many other ways proprietary information flows into these systems, this paper suggests that the information may not just be “learned from.” It may be stored in a form that anyone else can retrieve.

While the paper shows that aligned models do not surface stored content under ordinary prompting, finetuning on a completely benign task, with no adversarial intent whatsoever, reactivated the models’ latent memorization at an alarming scale. And the kind of finetuning the researchers used is commercially available and accessible through a standard API.

This creates at least two problems for trade secret owners. First, companies that rely on vendor assurances that “models don’t store data” as part of their reasonable-measures argument may be resting on a factual premise this paper directly challenges. Second, what happens if a company finetunes a commercial model on its own proprietary data to build a specialized tool, and the finetuning reactivates memorized content from someone else’s confidential information that happened to be in the pretraining corpus? The researchers found that finetuning on one author’s work could unlock content from over thirty unrelated authors. There is no reason to believe the same mechanism would not apply to confidential business information.

The Bottom Line

This paper raises more questions than it answers, with hugely important implications for both copyright and trade secret law. We will continue to monitor the dockets for new developments in this area as the research progresses.

Footnotes

1. The academic literature on AI-generated trade secrets is still developing, but has advanced significantly in the past two years. For the most comprehensive recent treatments, see (for example) Camilla A. Hrdy, Trade Secrecy Meets Generative AI, 100 Chi.-Kent L. Rev. 317 (2025); John G. Sprankling, Trade Secrets in the Artificial Intelligence Era, 76 S.C. L. Rev. 181 (2024); John Villasenor, Artificial Intelligence, Trade Secrecy, and the Challenge of Transparency, 25 N.C. J.L. & Tech. 495 (2024).

2. This is a related issue to the “black box” problem, as described in my previous piece: “if the people who build these systems cannot fully explain how they work or how specific inputs influence specific outputs, how is a plaintiff supposed to plead that a specific stolen file contributed to a specific capability in a deployed model?”

3. To read the cautionary tale of the Samsung incident, see Mark Gurman, Samsung Bans Staff’s AI Use After Spotting ChatGPT Data Leak, Bloomberg (May 1, 2023), available at https://www.bloomberg.com/news/articles/2023-05-02/samsung-bans-chatgpt-and-other-generative-ai-use-by-staff-after-leak.

The content of this article is intended to provide a general guide to the subject matter. Specialist advice should be sought about your specific circumstances.

