ARTICLE
7 April 2025

The Future Of "Open Source" In The Age Of AI

SJ
Steptoe LLP

Contributor

In more than 100 years of practice, Steptoe has earned an international reputation for vigorous representation of clients before governmental agencies, successful advocacy in litigation and arbitration, and creative and practical advice in structuring business transactions. Steptoe has more than 500 lawyers and professional staff across the US, Europe and Asia.

As artificial intelligence (AI) proliferates, more and more legal practitioners and technologists are talking about "open source" AI. But just as AI is not new (and has been around since at least the 1950s), open source software has been around for decades, and principles have been established in both the ownership and licensing of that software. As "open source AI" proliferates, the lessons learned from the development and licensing of open source software can help us understand and anticipate new legal risks in the age of AI.

In particular, while open source AI might be a boon to industrial and nonprofit applications and creations of AI-based tools and technology, the use of open source technologies can create a thicket of licensing and other challenges that might discourage the use of open source AI, and some organizations might prefer to develop their software within a "closed source" or proprietary AI framework.

Free and Open Source Software

A computer program is a set of instructions for a computer to follow. "Source code" is the plain text form of those instructions that a human can read and understand. For a computer to follow those instructions, however, source code must be translated into an "executable" form often called "object code" (or alternatively "machine code" or "binary code") that can control a computer's particular processing hardware, through processes such as "compiling" the code. With limited exceptions, once a program is translated into an executable form, it cannot be translated back into its source code—just as a baked cake cannot be transformed back into eggs, flour, and sugar.
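The one-way nature of this translation can be seen in miniature in any language that compiles source text into an executable form. The sketch below (in Python, chosen only for brevity; the same principle applies to compiled languages like C) is a hypothetical illustration, not part of any real product:

```python
# Human-readable source code: a person can read exactly what it instructs.
source = "result = 2 + 3"

# "Compiling" translates the source into an executable form (bytecode here).
code_obj = compile(source, "<example>", "exec")

# The computer can run the executable form...
namespace = {}
exec(code_obj, namespace)
print(namespace["result"])  # -> 5

# ...but the executable form itself is opaque bytes, not readable
# instructions, and recovering the original source from it is hard.
print(code_obj.co_code)  # raw bytecode (exact bytes vary by Python version)
```

Distributing only `code_obj` (or its compiled-language equivalent, a binary) is the "closed source" model the next section describes; distributing `source` as well is what makes software "open source" in the simplest sense.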

Both source code and object code can be protected by copyright law.1 At least since the 1980s, many publishers of software have distributed only the object code to users, without making the corresponding source code available, often to protect against copying or adapting the source code in competing projects, which can be difficult to detect. That sort of software is deemed "closed source," and the underlying source code is not available for the public or purchasers of the software to see. By contrast, "open source" software is, most simply, software that has accessible and readable source code, although the term usually connotes that the owner of the copyright has made the source code available under a license that permits users to copy, learn from, and modify the source code, subject to certain conditions.

The term "open source software" is actually an outgrowth from and reaction to the term "free software," the latter of which was pioneered in the 1980s by Richard Stallman. Stallman became frustrated that companies had begun to distribute software only in executable form, preventing users from learning about and improving the software they worked with. He formed the Free Software Foundation (FSF) and began advocating that software users should have the freedom to run, study, change, and redistribute software, with or without modifications, without impediment.2 Such software would be "free" in the sense that those freedoms would be preserved, as in the phrase "free speech," but not necessarily free from cost, like "free beer."3

Indeed, even if open source software does not cost money, it may impose other costs, such as requirements that software developed using open source be distributed to the public.

In 1998, a group of engineers formed the Open Source Initiative (OSI) with the goal of pivoting from free software and Stallman's philosophical approach to it and avoiding the misunderstanding that "free software" was necessarily "no cost" software, instead focusing on the pragmatic benefits of making source code available to users.4 OSI published the "Open Source Definition," which, in OSI's view, provides an objective test for confirming that a license will deliver appropriate software freedom.5 OSI maintains a list of licenses—templates with terms drafted by various organizations for developers to apply to their code—that it certifies as meeting that definition.6

Open Source Licenses from Permissive to Copyleft

There are various open source licenses that software developers can apply to their code and AI models. Some licenses are "permissive," allowing licensees to copy, modify, and distribute code with only limited restrictions, such as requiring that certain information about the licensed code be conveyed. One example is the three-clause BSD license, which requires any licensees to reproduce the copyright notice, license terms, and disclaimer in any redistribution of source code or object code.7 Another well-known permissive license with similar requirements is the MIT license, which requires that copies of the software include the license's copyright notice and explain that the software is provided "as is" without any warranties.8
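In practice, complying with a permissive license is largely a matter of carrying the required notices along with the code. The fragment below is a hypothetical example (the author name and module are invented) of how a redistributor of MIT-licensed code might satisfy the notice requirement:

```python
# Hypothetical module reusing MIT-licensed code. The MIT license requires
# that the copyright notice and permission notice accompany all copies,
# so the redistributor keeps them in this header:
#
#   Copyright (c) 2024 Original Author
#
#   Permission is hereby granted, free of charge, to any person obtaining
#   a copy of this software ... (full MIT permission notice)
#
#   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND ...

def greet(name):
    """Function reused from the hypothetical MIT-licensed project."""
    return f"Hello, {name}!"

print(greet("world"))  # -> Hello, world!
```

Omitting that header in a redistribution is the kind of seemingly small lapse that, as discussed later, underlies attribution-based claims against GenAI code generators.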

The MIT license has been used in the AI context. In January 2025, a Hangzhou, China–based company called DeepSeek released an AI chatbot that promptly became the top smartphone app in the United States. DeepSeek released a research paper detailing its training methodology (although it has not released the actual implementation of its training process).9 And DeepSeek released the source code for its model under the terms of the permissive MIT license.10

DeepSeek's release notes further state that its first release (R1) is "MIT license[d]: Distill & commercialize freely!"11 DeepSeek's choice to publicly release its model and offer it under a permissive open source license also highlights an interesting risk (and potential trade-offs) companies face when offering open source technology.

In contrast to licenses like the BSD and MIT licenses, other licenses that might be used in the AI context are deemed "copyleft"—so called because the license seeks to use copyright law to expand users' freedoms in software rather than limit them. An example of a copyleft license is the GNU General Public License version 3 (GPLv3).12 That license requires licensees who elect to redistribute software built from licensed code to make that code and any modifications to it available to recipients of the distribution on the same terms—potentially requiring the open source release and licensing of a much broader program the code was incorporated into. A related variant, the GNU Lesser General Public License (LGPL), imposes similar share-alike obligations on modifications to the licensed library itself but makes exceptions for larger works that merely use or link to the library.13

Copyleft licenses are often colorfully called "viral" because the inclusion of copyleft-licensed materials in a broader project can "infect" the rest of the project with new obligations to make the broader project available under open source conditions.14 Developers of proprietary software frequently avoid open source software for this reason, but there are instances when open source software is necessary or useful for their projects, leaving the developers to navigate a thicket of rights.

There are also licenses that apply to source code that is open to the public but that OSI has not certified as meeting its Open Source Definition. The Creative Commons Zero (CC0) license, for example, dedicates the work to the public domain without restriction for copyright purposes, but OSI has not certified it because it disclaims licensing any patent rights.15 The Creative Commons NonCommercial (CC-NC) license was not certified because it does not permit commercial uses.16 And the Server Side Public License (SSPL) was not certified because OSI determined it had copyleft-type restrictions that could not be met in all fields of use.17 In addition to the broad menu of template licenses, some developers also have their own custom license models that apply to publicly released source code, such as permitting use for a trial and evaluation period but requiring a commercial license before a licensee sells a product using the licensed code.

Open Source and AI

The freedom afforded to users of free and open source software is frequently seen as fostering a collaborative approach to creativity and innovation, often outside the control of large and dominant corporations, and often at low or no cost to the developers. It should be no surprise, then, that the explosion of interest in and development of generative AI (GenAI) models significantly intersects with the world of free and open source software.

GenAI models are software systems that can be used to generate new content, like text, images, music, and source code, based on analysis of and extrapolation from collections of information called "training data." The models include numerical parameters, known as "model weights," that can be used to fine-tune the models' algorithmic decision-making. Such systems implicate open source issues in at least three ways: First, many developers elect to offer some or all aspects of their GenAI systems on open source or similar license terms. Second, open source licenses that were framed to leverage the property interest conferred by the copyright in code may be an ill fit for training datasets, which carry limited, if any, copyright protections. Third, GenAI models may output content that reproduces open source–licensed content from its training data, in whole or in part. Each of these issues raises complex and important business and compliance considerations for developers and users of GenAI.

Open Source AI Models and the OSI Definition

At a high level, a GenAI model consists of the software that runs it, training data that the model analyzes to identify patterns among elements of the data called "tokens" (such as words or visual elements), and parameters (such as the model weights referenced earlier) that numerically represent the patterns the model has learned during training. Some companies offer GenAI-driven services to consumers, without providing users access to any of their proprietary software, training data, or parameters. That "closed source" approach insulates the service provider from competition from others who would reuse or modify its own proprietary technology, but it also means that users and the public at large are prevented from studying and improving on that technology.
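The interplay of these three components can be seen in a deliberately tiny sketch. The toy "model" below (hypothetical, and nothing like a real GenAI system in scale) shows why code alone is not enough: the same source code produces different behavior depending on the parameter learned from the training data.

```python
# Toy "training data": pairs of inputs and desired outputs.
training_data = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]

# "Training": fit a single parameter w so that output ≈ w * input
# (least-squares fit for a one-weight linear model).
num = sum(x * y for x, y in training_data)
den = sum(x * x for x, _ in training_data)
weight = num / den  # the learned parameter, a one-number "model weight"

def model(x):
    # The model's "source code" is trivial; its behavior comes from the weight.
    return weight * x

print(weight)      # -> 2.0
print(model(3.0))  # -> 6.0
```

Releasing only `model` (the code) without `weight` or `training_data` would leave others unable to reproduce the system's behavior, which is the gap the hybrid-release approaches discussed below exploit.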

Other companies make some or all of their GenAI systems available under open source licenses. Based on a sampling of AI models listed on Hugging Face, around 60% of models bearing a license fell under some sort of "open source" license.18 A key benefit of open source offerings is that they encourage and enable the community of open source software developers to contribute their own innovations to the project, such as improved features or even developing an ecosystem of complementary services.

Some companies take hybrid approaches, making only certain components (like code) available under open source licenses while maintaining confidentiality over others, such as the specific parameters the code generated when analyzing the training data. This approach can garner many of the benefits of participating in the open source community while enabling the company to maintain a competitive advantage over others who use the open source components. Of course, when a company takes this hybrid approach, it also may not deliver all the freedoms that organizations like OSI and FSF advocate for in free and open source software. Moreover, entities that use or want to work with AI models licensed under a hybrid approach will have to parse which portions of the relevant software are open source, which are proprietary, and what that means for use of the AI model or its code.

Recently, OSI published a definition of "Open Source AI" systems, setting forth the specific criteria for an AI system that OSI believes will deliver the appropriate freedoms to use, study, modify, and share the system.19 OSI is expected to publish and maintain a list of AI models that meet its definition.

According to OSI, an Open Source AI system should make available under OSI-approved license terms like those offered by OSI-certified licenses: (1) the complete source code used to train and run the system, (2) information about the data used to train the system, and (3) the parameters used to configure the model, which should be "freely available to all."20 Therefore, according to OSI, open source AI extends not just to the software, but also to some extent to training data and model weights. To be clear, OSI's definition does not require companies to outright publish their training data, but they must provide a "complete description of all data used for training" and other sufficient information to enable someone to build a "substantially equivalent system."21

Since the publication of OSI's definition, many commentators have noted that some of the most well-known examples of self-described "open source" AI models do not meet this definition. Stability AI, for example, offers a no-cost license for "researchers, creators, developers and designers" provided they earn less than $1 million in annual revenue.22 But a separate license is required for commercial uses by companies with $1 million or more in revenue, which is at odds with OSI's general Open Source Definition. Meta's open source license for its Llama AI model also has restrictions for certain commercial uses, and Meta also does not disclose its training data, which OSI's Open Source AI Definition requires.

OSI's definition does not have any legal impact, but we can expect that the publication of this definition will spark clashes between advocates of open source software and the companies that dub themselves "open source" at odds with OSI's definition. It will be interesting to see how software and algorithms pertaining to "open source AI" will be licensed. Will they be subject to the more permissive MIT and BSD licenses, or to the copyleft licenses such as GPLv3? Moreover, although we know what "open source" software is and can draw from past discussions of open source licensing, there are questions about what "open source" disclosures and licensing of training data and model weights will require.

Open Source Training Data and Model Parameters

When it comes to AI, data is king. To understand and be able to replicate an AI model, one needs not only the underlying source code but also to understand the nature and type of data that has been used to train the AI model, as well as the model parameters.

Open source issues will arise in terms of the required disclosure of the compilation of the training data the model analyzes in order to learn how to respond to prompts. A related issue relates to the model parameters (or weights) used to train the AI model.

Because every AI model depends on its source code, training data, and parameters to function, one cannot fully replicate a particular model without all three of these elements. At the threshold, there is a question of what "open source" licensing might mean for AI models and weights. Does this mean that all of the details of the data and weights should be disclosed, or just high-level details about what datasets are included in the model? Would an AI model be "open source" if the sources of data used to train the model were disclosed, but the data is not available for free in the public domain?23 Just as "open source" software takes many forms, "open source" training data in the context of AI will likely take many forms as well, with various degrees of disclosure and licensing rights.

Copyright law does not protect collections of raw data, absent some original expression in that collection, such as through independent and creative selection and arrangement of the data.24 The scope of that protection, however, has been described as "thin."25

Where a compilation incorporates information that is, itself, protectable by copyright, that may implicate the copyright protection in the individual elements of the compilation, and certain compilations may be deemed "derivative works" of the underlying copyrighted components.26 This means that the existing corpus of open source software license templates may prove an ill fit for training data. Given the "thin" copyright protection for compilations of raw data, a user of training data may not need a license at all and thus may not be bound to, for example, share a derivative dataset under similarly free and open conditions.27

Absent legislation, this limitation of copyright law could prove to be a stumbling block to the level of openness that OSI contends is required in AI. However, even if the compilations of data are not copyrightable, many collections of data will necessarily include copyrighted data (such as images, audio, video, and authored text) and other content.

In light of the above, data licensing is a complex and fascinating issue for data used in the context of AI. Even without copyright protection, companies can hold trade secret rights in datasets they maintain in confidence, and even publicly available data can be protected from use or disclosure through privacy laws, rights of publicity, and website terms and conditions. Still other laws or regulations might place limitations on web scraping. Reddit, for example, recently agreed to license its content and data to train Google's AI models, reportedly for a payment of $60 million per year.28

The question is whether other entities might choose to make data available under an "open source" model that permits broad use of the data (subject to privacy and other laws). Other disclosures of the data used to train an AI model might be compelled by law in the public interest.

For example, California recently passed AB 2013, which requires developers of public-facing GenAI systems to post on their website documentation regarding the data that is used to train GenAI.29 Other laws, like the EU AI Act,30 require transparency around the data used to train models. In such circumstances, organizations or entities might choose to designate aspects of their AI models to be open source.

Open Source Code in the Outputs from AI Models

Another aspect of GenAI that intersects with open source issues pertains to the outputs from a GenAI model. Those outputs can include text, including source code that might itself be an amalgamation of source code derived from open source licenses.

Service providers often license their models' outputs under open source–type licenses, though there may be no copyright protection in the outputs to license anyway.31 But the rights granted by the service provider may not necessarily account for the copyrights of the authors of works that were included in the training data.32

Amid the flurry of litigation surrounding the use of unlicensed material in training data, the complaint in a pending putative class action alleges that several GenAI models designed to generate source code, such as GitHub's Copilot, reproduced the plaintiffs' open source–licensed code in response to various prompts without complying with applicable license terms, such as attribution or reproducing the copyright notice.33

Allegations like these carry risk not only for the providers of GenAI tools but also for others who use the outputs from GenAI models in their own products.

Service providers that reproduce copyrighted materials in their outputs could face risk for copyright infringement or for breach of the terms of an open source license, similar to the allegations in numerous cases against GenAI providers grounded in more traditional media. But in the context of open source software, GenAI providers may also face exposure for violations of § 1202(b) of the Digital Millennium Copyright Act (DMCA), which prohibits the distribution of works "knowing that copyright management information has been removed . . . without authority of the copyright owner."34 Each such violation carries the potential for an award of statutory damages between $2,500 and $25,000.35

These allegations also highlight the risk that a person using a GenAI tool to generate source code could incorporate copyrighted code into some other project, without following or even knowing of the license terms—or what combination of licensing terms might apply if, for example, generated source code includes components that are licensed under both the BSD license and the LGPL license. Although that risk exists for any use of GenAI model, at least where the model includes unlicensed, copyrighted content in its training data, the risk is particularly heightened where the materials in question are distributed under a copyleft license.

This is because the incorporation of copyleft-licensed code into a broader project could cause the entire project to fall under that copyleft regime, potentially requiring the public disclosure of the source code for an entire proprietary software system. Although there has been very limited U.S. litigation applying the terms of copyleft licenses, at least one complaint concerning a copyleft license sought specific performance of the obligation to produce the complete source code for the affected product under that license; that is, it asked the court to compel the open source release of the allegedly offending code.36

Navigating the Open Source AI Thicket

Open source software and related licensing issues were already complex before the recent rise of GenAI, with a thicket of open source licensing regimes ranging from permissive to restrictive and copyleft. The advent of "open source AI" has raised a litany of new issues, and organizations will continue to hammer out what open source means in the context of AI systems and models.

Because AI models comprise not just source code but also training data and model parameters, the existing open source software licensing regimes do not fully account for what "open source" and licensing will look like in the age of AI. Nor do these regimes account for source code output by GenAI tools, or for what licenses might apply to that output.

Uncertainty about the scope and meaning of "open source AI" will remain for the foreseeable future. Therefore, legal teams and practitioners are advised to seek clarity in any scenario where someone asserts that an AI model or output is "open source," and to consider—but also think beyond—traditional principles of open source software licensing when evaluating legal risks and rights in such scenarios.

As a best practice, lawyers should not make assumptions about open source AI or software: They should read the relevant licensing provisions in detail, work to appreciate their nuances, and attempt to determine their interrelationships within the relevant context of a licensed piece of software or AI model.

At the end of the day, this is not an area where there will be a hard and fast playbook because the number of open source AI and licensing scenarios that might arise in the future will be boundless. But by building familiarity and experience with the various licensing provisions, it is possible to wade one's way through the thicket.

Footnotes

1 See U.S. COPYRIGHT OFF., CIRCULAR 61, COPYRIGHT REGISTRATION OF COMPUTER PROGRAMS (2021), https://www.copyright.gov/circs/circ61.pdf.

2 What Is Free Software?, GNU OPERATING SYS., https://www.gnu.org/philosophy/free-sw.en.html (last updated Jan. 1, 2024); The GNU Manifesto, GNU OPERATING SYS., https://www.gnu.org/gnu/manifesto.en.html (last updated Nov. 2, 2021).

3 What Is Free Software?, supra note 2.

4 History of the OSI, OPEN SOURCE INITIATIVE (Sept. 19, 2006), https://opensource.org/history; Christine Peterson, How I Coined the Term "Open Source," OPENSOURCE.COM (Feb. 1, 2018), https://opensource.com/article/18/2/coining-term-open-source-software.

5 The Open Source Definition, OPEN SOURCE INITIATIVE (Feb. 16, 2024), https://opensource.org/osd; Frequently Answered Questions, OPEN SOURCE INITIATIVE (Jan. 21, 2025), https://opensource.org/faq.

6 OSI Approved Licenses, OPEN SOURCE INITIATIVE, https://opensource.org/licenses (last visited Mar. 6, 2025). Other organizations maintain their own lists of approved licenses, including Stallman's FSF. See Licenses, GNU OPERATING SYS., https://www.gnu.org/licenses/licenses.en.html (last updated Apr. 12, 2022).

7 The 3-Clause BSD License, OPEN SOURCE INITIATIVE, https://opensource.org/license/bsd-3-clause (last visited Mar. 6, 2025).

8 The MIT License, OPEN SOURCE INITIATIVE, https://opensource.org/license/mit (last visited Mar. 6, 2025).

9 DeepSeek, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, arXiv:2501.12948 [cs.CL] (2025), https://arxiv.org/abs/2501.12948.

10 DeepSeek-R1, GITHUB, https://github.com/deepseek-ai/DeepSeek-R1?tab=MIT-1-ov-file#readme (last visited Mar. 6, 2025).

11 DeepSeek-R1 Release 2025/01/20, DEEPSEEK (Jan. 20, 2025), https://api-docs.deepseek.com/news/news250120. "Distill" in this context refers to a process where a new AI model is trained to match the outputs of an existing pretrained AI model, often analogized to a student learning from a teacher.

12 GNU General Public License, GNU OPERATING SYS. (June 29, 2007), https://www.gnu.org/licenses/gpl-3.0.en.html.

13 GNU Lesser General Public License, GNU OPERATING SYS. (June 29, 2007), https://www.gnu.org/licenses/lgpl-3.0.en.html.

14 David Kappos & Asa Kling, Ground-Level Pressing Issues at the Intersection of AI and IP, 22 COLUM. SCI. & TECH. L. REV. 263, 277–78 & n. 116 (2021), https://journals.library.columbia.edu/index.php/stlr/article/view/8665.

15 CC0 1.0 Universal, CREATIVE COMMONS, https://creativecommons.org/publicdomain/zero/1.0/ (last visited Mar. 6, 2025); Frequently Answered Questions, supra note 5.

16 Attribution-NonCommercial 4.0 International, CREATIVE COMMONS, https://creativecommons.org/licenses/by-nc/4.0/deed.en (last visited Mar. 6, 2025).

17 Press Release, Open Source Initiative, The SSPL Is Not an Open Source License (Jan. 19, 2021), https://opensource.org/blog/the-sspl-is-not-an-open-source-license.

18 Aurora Starita, Top Open Source Licenses Explained, MEND.IO (Nov. 30, 2023), https://www.mend.io/blog/top-open-source-licenses-explained/.

19 The Open Source AI Definition—1.0, OPEN SOURCE INITIATIVE, https://opensource.org/ai/open-source-ai-definition (last visited Mar. 6, 2025). Other organizations are engaging in similar exercises. The Linux Foundation recently defined "open source AI models." Ibrahim Haddad, Embracing the Future of AI with Open Source and Open Science Models, LF AI & DATA (Oct. 25, 2024), https://lfaidata.foundation/blog/2024/10/25/embracing-the-future-of-ai-with-open-source-and-open-science-models/. Richard Stallman's FSF has also announced that it intends to publish its own criteria for software freedom in the context of machine learning applications and their inputs. FSF Is Working on Freedom in Machine Learning Applications, FREE SOFTWARE FOUND. (Oct. 22, 2024), https://www.fsf.org/news/fsf-is-working-on-freedom-in-machine-learning-applications.

20 The Open Source AI Definition—1.0, supra note 19.

21 Id.

22 Self-Hosted Licenses, STABILITY AI, https://stability.ai/license (last visited Mar. 6, 2025).

23 In fact, some have criticized OSI's AI definition as too narrow, with respect to training data. While OSI requires AI developers to disclose the sources of their training data, the definition permits developers to use information in their training sets that is not available to the public (referred to as "unshareable data" in the definition), provided that it is disclosed. Critics have explained that means models can call themselves "open source" under OSI's definition even if someone would be unable to recreate and modify the entire system themselves (such as by tweaking the parameters and code but using the same training set). OSI, for its part, has explained that decision was made to promote the use of open source AI in fields with restrictions on sharing data, such as in the medical field. See Answers to Frequently Asked Questions, HACKMD (Oct. 29, 2024), https://hackmd.io/@opensourceinitiative/osaid-faq.

24 Feist Publ'ns, Inc. v. Rural Tel. Serv. Co., 499 U.S. 340, 346–48 (1991).

25 Id. at 349.

26 See, e.g., Castle Rock Ent., Inc. v. Carol Publ'g Grp., Inc., 150 F.3d 132, 139 (2d Cir. 1998).

27 See Steven J. Vaughan-Nichols, Open Source Licenses Need to Leave the 1980s and Evolve to Deal with AI, REGISTER (June 23, 2023), https://www.theregister.com/2023/06/23/open_source_licenses_ai/.

28 Anna Tong et al., Exclusive: Reddit in AI Content Licensing Deal with Google, REUTERS (Feb. 21, 2024), https://www.reuters.com/technology/reddit-ai-content-licensing-deal-with-google-sources-say-2024-02-22/.

29 AB 2013, 2023–2024 Leg. (Cal. 2024), https://leginfo.legislature.ca.gov/faces/billNavClient.xhtml?bill_id=202320240AB2013 (codified at CAL. CIV. CODE §§ 3110–3111).

30 EU ARTIFICIAL INTELLIGENCE ACT, https://artificialintelligenceact.eu/ (last visited Mar. 6, 2025).

31 See Thaler v. Perlmutter, 687 F. Supp. 3d 140 (D.D.C. 2023).

32 In addition to publishing a definition of "open source AI," the Linux Foundation has defined the term "open science AI models," which are open source models that offer licenses to that data and any other component of the model. See Haddad, supra note 19.

33 E.g., Second Amended Complaint ¶¶ 112–16, Doe 1 v. GitHub, Inc., No. 4:22-cv-06823-JST (N.D. Cal. Jan. 24, 2024), ECF No. 200.

34 17 U.S.C. § 1202(b)(3). Violations of this provision require knowledge of the removal of copyright management information, but courts have held that knowledge may be established via "willful blindness" for purposes of the DMCA. Viacom Int'l, Inc. v. YouTube, Inc., 676 F.3d 19, 35 (2d Cir. 2012). The district court in Doe 1 v. GitHub held that a violation of § 1202(b) only occurs where copyright management information has been removed from an identical copy of the asserted work. No. 22-cv-06823-JST, 2024 WL 235217, at *9 (N.D. Cal. Jan. 22, 2024).

35 17 U.S.C. § 1203(c)(3)(B).

36 First Amended Complaint, Prayer for Relief ¶ a, Software Freedom Conservancy, Inc. v. Vizio, Inc., No. 30-2021-01226723 (Cal. Super. Ct. Jan. 10, 2024), ECF No. 165.

The content of this article is intended to provide a general guide to the subject matter. Specialist advice should be sought about your specific circumstances.
