As generative AI tools such as OpenAI's ChatGPT and DALL-E, Meta's Llama, and Anthropic's Claude become more capable and more mainstream, companies seeking to benefit from this technology will likely need to adopt policies on its use and to revise existing policies in light of developments in AI-related litigation and determinations made or pending by the U.S. Copyright Office.
This article provides an overview of several classes of existing intellectual property risks in employing generative AI — particularly those relating to confidential/trade secret information and copyright — and proposes a set of best practices (and lower-risk use cases) for companies that seek to benefit from generative AI while reducing their own risk exposure.
Classes of Intellectual Property Risks From AI: Confidentiality Risks and Copyright Liability
Immediate and direct intellectual property risks posed by the use of generative AI include:
- The potential compromise or unlicensed disclosure of proprietary information (e.g., confidential trade secrets) when provided in prompts to remotely-hosted AIs;
- Copyright infringement liability related to an AI model's training data, the model itself, or its outputs; and
- The potential for competitors or others to use AI outputs that are publicly disseminated (without the company having recourse to copyright protection to limit that use).
Each of these is elaborated on below. See the bottom of this article for a list of potential "Do's" and "Don'ts" for mitigating the IP risks implicated by generative AI.
While these classes of risk are not exhaustive — for example, AI outputs may also implicate trademark risk, as reflected in the suit filed by Getty Images against Stability AI (D. Del. 1:23-cv-00135) — they represent the major classes of risk at issue in ongoing litigation at the time of writing.
Proprietary Information and Confidentiality Risks for Information Provided in Prompts to Externally-Hosted AI Models
The terms of use of several mainstream AI models grant the AI vendor a license to information provided in prompts (e.g., for use in training the model). Information provided in prompts to remotely-hosted AI models can therefore pose a risk to a company's control of its internal IP.
For example, information provided in a prompt may be used in training later model iterations, and this information may then be incidentally replicated in response to prompts by others.
Alternatively, information provided in a prompt might be directly viewed by potentially-competitive human personnel at the AI vendor, or the disclosure may itself constitute a per se violation of contractual or ethical confidentiality obligations (e.g., providing information that must be kept legally or ethically confidential to a remotely hosted AI model without a confidentiality guarantee from the vendor).
Disclosure to third parties without adequate safeguards may also compromise the capacity of a company to either seek patent protections (at least on the company's intended timeline) or to retain trade secret protections.
Companies drafting generative AI guidelines relating to the use of internal, confidential information (for example, an employee might wish to use a generative AI to generate a three-page summary of a forty-page internal analysis) should therefore emphasize that such information should:
- Be marked and treated in accordance with existing confidentiality protocols to avoid inadvertent disclosure of sensitive information to AI models that lack confidentiality guarantees, and
- Not be shared with externally-hosted generative AI services that do not provide clear confidentiality guarantees, absent either express clearance or clear guidance that a particular class of information may be shared and under what circumstances.
From a practical business perspective, the tangible risks to organizations may not warrant precluding the usage of nonconfidential, externally-hosted AI models (like ChatGPT) under all circumstances, even for nominally confidential but low-sensitivity business information. The likelihood of specific human review of information sent to the AI vendor and such information's capacity to cause competitive harms will often — perhaps usually — be low, and the risks posed by using such data for model training are often ambiguous and time-sensitive.
However, given both the risk that uncontrolled dissemination weakens trade secret protections (information must generally be kept confidential to retain trade secret protection) and the parlous state of copyright protection for AI-generated works (see below), it is essential that organizations and general counsel contend with the possibility that information shared with ChatGPT or similar generative AI systems — absent a guarantee of confidentiality — may become effectively public domain if it appears in an AI output. Establishing clear guidelines and clearance mechanisms ahead of time can significantly reduce the risk that information disclosed in prompts creates a crisis later on.
Copyright Infringement Liability Risks From the Use or Deployment of AI
In the wake of a putative class action recently filed against OpenAI by the Authors Guild, the copyright liability risks of generative AI have again been brought to the fore, following similar suits by entities such as Getty Images (against Stability AI), Sarah Silverman (against OpenAI and Meta), and a group of visual artists led by Sarah Andersen (against Stability AI and others).
Against this backdrop, companies should be aware of the specific copyright liability risks posed by generative AI when crafting internal usage policies.
Three Classes of Liability Risk: Training Data, Models Themselves, and Model Outputs
Ingestion of Training Data Containing Infringing Content
The allegations of the recent suit against OpenAI by the Authors Guild (a putative class action including named plaintiff authors such as Jonathan Franzen and George R.R. Martin) revolve primarily around the ingestion of datasets — referred to as "Books1" and "Books2" in the complaint — used to train the GPT model. These datasets allegedly included pirated copies of copyrighted works (of which ChatGPT was able to provide accurate summaries and, for a time period before the filing of the complaint, verbatim or near-verbatim excerpts) and thus, the complaint alleges, by making copies of these datasets for training purposes, OpenAI committed acts of infringement.
For most companies — that is, those not training their own AI models — liability attributable to copying and ingesting allegedly-infringing training data may not be an acute concern. Companies that are performing model training, or that are "fine-tuning" open-source models for better domain-specific performance, should ensure that they permit the use only of materials that are in the public domain or for which such use is authorized.
Models Themselves
While the Authors Guild complaint focused primarily on the allegedly infringing copying of data for model training, the Jan. 13, 2023, putative class action complaint by Andersen et al. against Stability AI, Midjourney, and DeviantArt argued that AI model weights themselves were infringing, on the grounds that they stored encoded or compressed versions (or encoded or compressed derivative works) of the works used to train them. (Andersen Compl. ¶¶ 65-100, 95, 160). This is also partially suggested by the facts of the Getty Images complaint, which notes that the Stable Diffusion AI was outputting the "Getty Images" watermark — sometimes distorted — on AI-generated sports pictures (see, e.g., Getty Compl. ¶ 52). Similar risks are reflected in statements by AI companies that their models may have a capacity for near-verbatim "recall" of certain copyrighted works.
The Andersen Complaint's allegations that model weights contained "compressed" versions of training data were recently dismissed with leave to amend (see Order at 8-14), and the pleadings implicate issues such as fair use that pose thorny legal as well as factual determinations (e.g., the degree of transformativeness of converting training data into model weights and the requirement of substantial similarity to establish infringement). Nonetheless, companies that seek to run local instances of AI models (e.g., open-source models that may have been trained using infringing works and that may be adjudicated to themselves be infringing derivative works of that training data) should be aware of potential risks in the event those models are found to be infringing works — in which case, copying them locally might itself be an act of infringement.
Pending full resolution of the Andersen complaint and issues such as fair use, counsel drafting generative AI guidelines may wish to advise or require the use of remotely-hosted AI models rather than locally-run ones, and in turn mandate that only nonconfidential/nonsensitive information be provided to such models and/or that the models used provide a contractual confidentiality guarantee (as appears to be in the works from Microsoft).
Model Outputs
The Authors Guild Complaint against OpenAI averred that summaries produced by ChatGPT were infringing derivative works of the allegedly pirated works used to train it – for example, ChatGPT was alleged to have "generated an infringing, unauthorized, and detailed outline for a prequel book to 'A Game of Thrones,' one of the Martin Infringed Works." (Compl. ¶¶ 238-248) Similar allegations are made in the Andersen complaint [¶ 95 ("Every output image from the system is derived exclusively from the latent images, which are copies of copyrighted images. For these reasons, every hybrid image is necessarily a derivative work")].
Pending resolution of these complaints, companies crafting generative AI policies should, at minimum, caution employees about using prompts that are likely to generate derivative works of information that is copyrighted (for example, employees should be advised not to ask for outlines for sequels or prequels to George R.R. Martin's A Song of Ice and Fire novels).
A separate issue is whether every output from a model that is allegedly itself infringing is an infringing derivative work of that model — while this is an allegation of the Andersen complaint, it is less explicitly alleged in other suits. The Authors Guild complaint, for example, points to an outline for a prequel work as an infringement of George R.R. Martin's copyrights, but not as an infringement of the copyrights of other members of the putative plaintiff class (e.g., Jonathan Franzen).
At present, there is some reason to believe that not every output of an AI — even one trained on allegedly infringing data — is necessarily infringing or derivative of the inputs used to train it. In particular, copyright infringement generally requires establishing substantial similarity between the accused and original works. Likewise, the Copyright Office's recent Request for Comments on AI-related regulation suggests that not every output is necessarily derivative of training data, noting, for example, that copying an artist's "style" but not their specific works is (at present) generally not a form of copyright infringement, even though "style" is presumably learned through exposure to the artist's works. See, e.g., Notice of Inquiry and Request for Comments re: Artificial Intelligence and Copyright, Docket No. 2023-6, at 10 ("the Office heard from artists and performers concerned about generative AI systems' ability to mimic their voices, likenesses, or styles. Although these personal attributes are not generally protected by copyright law...."). (Note that the Office sought comment on potential protections for artistic style in the same RFC.) The recent dismissal (with leave to amend) of various counts of the Andersen complaint also suggests that substantial similarity to a copyrighted training work is still required to establish infringement (Order at 10-13) ("Even if that clarity is provided and even if plaintiffs narrow their allegations to limit them to Output Images that draw upon Training Images based upon copyrighted images, I am not convinced that copyright claims based [on] a derivative theory can survive absent 'substantial similarity' type allegations.").
While the issue remains legally unsettled, certain AI vendors may provide indemnification and liability protection for the use of generative AI outputs for users who sign contracts — for example, via Microsoft's "CoPilot Protection Program" or Getty Images' own just-announced generative AI tool. Similar indemnification may be available from providers such as Google and Adobe.
Companies seeking to minimize their liability risk should therefore consider, and possibly mandate, the use of generative AI models that provide such liability protections.
A remotely-hosted model that provides liability protection and that is prompted only with nonsensitive information that employees have a right to use (e.g., public-domain information, licensed information, or internally-sourced nonsensitive information) likely presents the lowest combined liability and confidentiality risk for companies seeking to use generative AI at this time.
AI Outputs May Not Be Copyrightable — There May Be No Right to Exclude Competitors From Copying Them
Companies should also be aware that the output of generative AI may not be eligible for copyright protection, and thus any such outputs made public (for example, used on a public-facing website) risk being freely available for competitors, analysts, and the public at large to reproduce.
The Copyright Office has determined that works of AI-generated visual art are not eligible for copyright, based on reasoning that suggests that the outputs of generative AI systems do not reflect human authorship, even when generated in response to human-authored prompts.
Accordingly, the output of generative AI (particularly if it is not obviously a derivative work of a copyright-eligible work, such as an abridged version or summary of an internal memorandum) may not represent a protectible company asset if publicly disseminated.
For information that is not confidential or sensitive and may be publicly disseminated, but that reflects generically-useful output the company would prefer to keep others from using (for example, certain types of ad copy or generic product descriptions that do not implicate company-proprietary trademarks), companies are best advised to rely on human authorship in order to retain the capacity to limit others' right to copy the work, which otherwise would be at risk of falling into the public domain.
If generative AI output reflects company-confidential information, companies should continue to preserve confidentiality, but should also be aware that trade secret protections and the prevention of public dissemination are now potentially the only legal avenues available for protecting such information, rather than merely (as they have always been) the best practical ones.
This also suggests new potential risks where information is publicly disseminated by a judgment-proof entity following a confidentiality breach: if trade secret protections are lost following widespread public disclosure, preventing subsequent copying and dissemination of works not independently protectible by copyright will likely be more difficult.
Takeaways and "Do's and Don'ts"
Accordingly, companies developing or revising generative AI policies should take the following best practices into account:
Do
- Have clear policies in place for what information may and may not be used in prompts and under what circumstances; maintain clear guidelines about confidentiality expectations and document sensitivity; and appropriately mark documents as confidential where necessary.
- Use remotely-hosted AI instances that provide confidentiality guarantees akin to those used by existing discovery vendors, where possible. Be aware that many publicly-hosted AI models require prompts to be licensed for use in training.
- Consider using privately-hosted instances (e.g., based on the open-source tunable "Llama" model from Meta) as an alternative to publicly-hosted services that don't provide confidentiality guarantees.
- Be aware, however, of potential liability risks stemming from allegations that the model weights (or potentially even any outputs they produce) are a derivative work of copyrighted training inputs.
Don't
- Publish AI-generated materials that would be potentially useful to competitors (generic product descriptions, ad copy, business plans), whether or not based on internal human-generated descriptions.
- These materials may not be eligible for copyright protection, and so you will have limited recourse to stop competitors from taking advantage of them without an element of human authorship.
- Provide internal, confidential information to publicly available, remotely hosted AI that may use it for training or other purposes (such as currently-available ChatGPT).
- The compromise of this information may be a per se breach of confidentiality obligations, or the information may find itself replicated in response to future prompts (and/or viewed by humans who may have competitive interests).
- The disclosure and/or dissemination of this information may also compromise the capacity to seek patents and/or preserve trade secret rights.
- Deliberately request and/or duplicate information that may result in an output that is colorably a derivative work of a copyrighted creative work — for example, requesting that an AI author a work of fan-fiction or propose a sequel to an unlicensed copyrighted work.
- Because facts are generally not copyrightable, this is less of a concern when asking purely factual questions, although the accuracy of the model outputs should be double-checked where possible — AIs may "hallucinate" and produce answers that sound convincing but are factually false.
In sum, a remotely hosted model that provides liability protection and that is prompted only with nonsensitive information that employees have a right to use (e.g., public-domain information, licensed information, or internally sourced nonsensitive information) likely presents the lowest combined liability and confidentiality risk for companies seeking to use generative AI at this time.
The content of this article is intended to provide a general guide to the subject matter. Specialist advice should be sought about your specific circumstances.