GitHub Inc. offers a cloud-based platform, popular among software programmers, for hosting and sharing source code and collaborating on its development. GitHub's artificial intelligence (AI)-based Copilot tool has become a valuable resource for software developers, offering real-time code autocompletion suggestions across various programming languages. However, a class action lawsuit filed against GitHub, OpenAI, and Microsoft (GitHub's parent) alleges violations related to open-source licensing and copyright law, raising complex legal questions about the tool's usage and the reproduction of code.

This article delves into the controversy surrounding Copilot by examining allegations of Digital Millennium Copyright Act (DMCA) violations and breaches of open-source licenses. It also discusses the implications for AI-generated code and offers recommendations for navigating the legal challenges in the evolving landscape of AI-assisted coding.

Copilot

Copilot is a programming assistance tool developed by GitHub, a software hosting service and open-source version control system, in collaboration with AI research lab OpenAI. It is powered by OpenAI's Codex machine learning model, a variant of the GPT family of models that also power OpenAI's ChatGPT, and uses AI to provide software developers with real-time suggestions for completing lines of source code. The tool offers code snippet suggestions of various lengths, ranging from short auto-fills to longer passages of code complete with variable names, function definitions, and algorithms.
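By way of illustration, a developer might type a descriptive comment or a function signature and receive a suggested body in return. The snippet below is a hypothetical sketch of that interaction in Python; it is not actual Copilot output, and the function name and regular expression are invented for illustration.

```python
# Hypothetical illustration of AI-assisted code completion (not actual Copilot output).
# The developer writes the comment and signature; an assistant suggests the body.
import re

def is_valid_email(address: str) -> bool:
    """Return True if the string looks like a syntactically valid email address."""
    # --- a suggested completion might begin here ---
    pattern = r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"
    return re.match(pattern, address) is not None
```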

Microsoft, which acquired GitHub in 2018, hosts Copilot on its cloud servers. Copilot supports a wide range of programming languages, making it versatile and accessible for programmers working in different software development environments. GitHub also positions Copilot as an AI pair programmer that collaborates with developers on coding tasks.

Copilot utilizes machine learning, a process by which a computer is trained to find patterns or make predictions by ingesting and analyzing large amounts of sample data. Copilot was trained on a massive dataset sourced primarily from publicly available code repositories on GitHub, the majority of which were subject, at least in part, to open-source licenses.

Open-Source Licenses: A Primer

As a general matter, source code is protected by copyright law as a literary work. By default, the creator of a work holds the copyright and exclusive distribution rights. Without a license, others cannot use copyrighted code without infringing that copyright. To promote collaboration and expedite progress in code development, many developers choose to publish their code under cost-free licenses that permit third-party use, distribution, and modification of that code subject to specified terms. These terms comprise an open-source license.

The Open Source Initiative (OSI) is a non-profit organization that provides a commonly accepted definition of what constitutes "Open Source": access to source code, free redistribution, permission to create derivative works, no limits on who may use the work or for what purpose, no requirement for additional licenses, and no restrictions on distribution format. Open-Source Software (OSS) can be used for commercial purposes as long as the legal terms of the licenses are adhered to.

Most OSS licenses require attribution—meaning that developers incorporating OSS must credit its original authors—and some licenses require derivatives to be distributed under the same or comparable terms—meaning that if a user incorporates code subject to this term into a new program, the new program must likewise be distributed to the public as OSS.
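For context, these attribution obligations are commonly satisfied by keeping a notice with the code whenever it is copied or redistributed. The header below is a generic, hypothetical example modeled on permissive licenses such as MIT; the author name and function are invented for illustration.

```python
# Hypothetical example of the attribution notice a permissive OSS license
# typically requires to travel with the code when it is reused or redistributed.
#
# Copyright (c) 2021 Jane Developer
# Licensed under the MIT License. The above copyright notice and this
# permission notice shall be included in all copies or substantial portions
# of the software.

def clamp(value: float, low: float, high: float) -> float:
    """Constrain a value to the inclusive range [low, high]."""
    return max(low, min(high, value))
```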

Doe v. GitHub

GitHub, OpenAI, and Microsoft made Copilot available to the public in mid-2021, charging recurring subscription fees for their code assistance tools and services.

On Nov. 3, 2022, several anonymous coders claiming ownership in software stored on GitHub filed a class action lawsuit in the US District Court for the Northern District of California against GitHub, OpenAI, and Microsoft. The class action lawsuit alleges several causes of action arising from the use of the plaintiffs' OSS that was stored on GitHub and used to train Copilot, and the reproduction of that source code in Copilot's real-time suggestions without proper attribution. Doe et al. v. GitHub, Inc. et al., No. 4:22-cv-06823 (N.D. Cal. Nov. 3, 2022).

GitHub, OpenAI, and Microsoft moved to dismiss the complaint on several grounds, including that, because the plaintiffs were anonymous, plaintiffs failed to specify instances where Copilot reproduced their licensed code. They also moved to dismiss on the ground that plaintiffs failed to state a claim because GitHub's Terms of Service ("TOS") grant broad rights to use, display, perform, and reproduce code, and the TOS preempt the breach of license claims.

Although GitHub, OpenAI, and Microsoft succeeded in dismissing certain claims, the court allowed the claims of DMCA violation, breach of OSS licenses, unjust enrichment, and unfair competition to proceed, some with amendments by the plaintiffs. Specifically, the court granted the motion to dismiss in part on May 11, 2023, dismissing plaintiffs' claims for violation of Sections 1202(a) and 1202(b)(2) of the DMCA, tortious interference in a contractual relationship, fraud, false designation of origin, unjust enrichment, unfair competition, breach of the GitHub Privacy Policy and TOS, violation of the CCPA, and negligence.

However, the court granted plaintiffs leave to amend to correct those deficiencies. The court also dismissed with prejudice plaintiffs' claims for civil conspiracy and declaratory relief. On June 8, 2023, plaintiffs filed an amended complaint.

This analysis is focused on plaintiffs' claims of violation of the DMCA and breach of the open-source licenses governing their OSS. These two claims remained substantively unchanged in the amended complaint.

Violation of DMCA §1202

"Copyright law restricts the removal or alteration of copyright management information ('CMI') – information such as the title, the author, the copyright owner, the terms and conditions for use of the work, and other identifying information set forth in a copyright notice or conveyed in connection with the work."Stevens v. Corelogic, Inc.,899 F.3d 666,671(9th Cir. 2018). Section 1202(b) of theDMCAprovides that one cannot, without authorization, (1) "intentionally remove or alter any" CMI, (2) "distribute . . . [CMI] knowing that the [CMI] has been removed or altered," or (3) "distribute . . . copies of works . . . knowing that [CMI] has been removed or altered" while "knowing, or . . . having reasonable grounds to know, that it will induce, enable, facilitate, or conceal" infringement. 17 U.S.C. §1202(b).

Plaintiffs alleged that their OSS contains CMI including copyright notices, titles, authors' names, copyright owners' names, terms and conditions for use of the code, and identifying numbers or symbols. Plaintiffs further alleged that GitHub and OpenAI knowingly failed to program Copilot to review attribution, copyright notices, and license terms, and that when Copilot makes suggestions that reproduce code subject to open-source licenses, the suggestions would omit attribution, copyright notices, or license terms. They conclude that Copilot removes or alters that CMI, and that GitHub, OpenAI, and Microsoft distributed Copilot knowing it would alter or remove CMI.

Breach of OSS Licenses

As to the second claim, California breach of contract law requires plaintiffs to "identify with specificity the contractual obligations allegedly breached by the defendant." Williams v. Apple, Inc., 449 F. Supp. 3d 892, 908 (N.D. Cal. 2020). On the same factual grounds, plaintiffs alleged that Copilot's outputs failed to provide (1) attribution to the owner, (2) a copyright notice, and (3) the license terms, despite express OSS licensing terms that condition permission to create derivative works on this information. Plaintiffs alleged that use of licensed code thus violated the relevant provisions of each OSS license. While several different types of OSS licenses may be applicable to the code on which Copilot was trained, most (if not all) of these OSS licenses likely require proper attribution when code or code sections are used.

Current Status

GitHub, OpenAI, and Microsoft filed motions to dismiss the Amended Complaint on June 29, 2023. As relevant to this discussion, the companies challenge the DMCA claim on the ground that plaintiffs failed to identify specific examples of works that had been copied or distributed in identical form after removal of CMI. However, GitHub, OpenAI, and Microsoft did not challenge plaintiffs' breach of OSS licenses claim.

We expect the companies will raise fair use as a potential affirmative defense, on the theory that, if GitHub's use of the code qualifies as fair use, it does not require a license and therefore is not subject to the OSS license terms.

Implications

The Copilot case highlights the legal complexities surrounding the use of AI-generated code from tools like Copilot that have been trained on copyrighted materials. Code subject to open-source licenses is still copyright-protected, and the terms and limitations set forth in those licenses govern the code's use. As discussed, OSS licenses carry diverse obligations, usually including attribution requirements that differ from license to license. For AI companies and companies using AI, determining the content of training sets and whether AI tools directly reproduce code or independently create it remains challenging, especially when companies are dealing with millions of lines of code, if not more.

For software developers, refraining from using tools like Copilot until the lawsuit is resolved is the safest way to avoid an action for breach of OSS terms. But in the current competitive software development market, this recommendation may be impractical. Companies that choose to proceed with using AI-assisted tools should therefore exercise caution and avoid unnecessary risks. On the front end, companies can ask their AI tool vendors whether the AI training model included source code subject to OSS licenses. If so, companies can ask whether their tool can exclude training data subject to OSS licenses. On the back end, companies can use code scanners to audit code for potential matches with code subject to OSS licenses.
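On that last point, even a lightweight audit pass can surface files carrying common OSS license markers or copyright notices for human review. The sketch below is a minimal, hypothetical Python example of such a scan; the marker list is an assumption, the scan only catches notices that are still present in the text, and it is no substitute for a dedicated code scanner or legal review.

```python
"""Minimal sketch of an OSS-license audit pass over a codebase (hypothetical example)."""
from pathlib import Path

# Common phrases that suggest a file carries OSS license text or copyright notices.
LICENSE_MARKERS = (
    "SPDX-License-Identifier",
    "GNU General Public License",
    "Apache License",
    "MIT License",
    "Copyright (c)",
)

def flag_license_markers(root: str) -> list[tuple[str, str]]:
    """Return (file, marker) pairs for every Python source file containing a known marker."""
    hits = []
    for path in Path(root).rglob("*.py"):
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # skip unreadable files
        for marker in LICENSE_MARKERS:
            if marker in text:
                hits.append((str(path), marker))
    return hits

if __name__ == "__main__":
    for file_name, marker in flag_license_markers("."):
        print(f"{file_name}: found '{marker}' - review applicable license terms")
```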

Companies interested in protecting their own source code under copyright law also need to be aware that reducing overall reliance on AI and OSS will increase the strength and scope of protection. Recent guidance out of the US Copyright Office requires applicants seeking to register their copyrights to "disclose the inclusion of AI-generated content" and "to provide a brief explanation of the human author's contributions to the work." Non-human contributions, such as AI-generated code, are not eligible for copyright registration, and pre-existing materials, such as code derived from OSS, are excluded from the scope of protection.

Finally, even if a company does not expressly provide or permit its employees to use AI tools, it is safest to assume that developers are already using AI tools to assist in their programming. As a result, companies should adopt internal policies concerning the use of AI and provide training to educate employees about the issues and risks involved.

Conclusion

Copilot represents a significant advancement in software development but comes with complex legal considerations. Companies and their lawyers should be aware of these issues, stay updated on the ongoing dispute, and take precautions to mitigate risks associated with AI-generated code. In this evolving landscape, careful consideration and proactive measures are essential to navigate the legal challenges surrounding AI in coding.

The content of this article is intended to provide a general guide to the subject matter. Specialist advice should be sought about your specific circumstances.