Introduction
The rapid advancement of Artificial Intelligence (AI) presents significant legal and ethical challenges, among them the intersection between AI and copyright. Generative AI (Gen AI) models are 'self-learning tools' which digest material and produce new content based on training data comprising extensive datasets of text, images, and video. Large Language Models (LLMs), such as ChatGPT, are a specific type of Gen AI that specialize in understanding and generating human-like text through natural language processing (NLP). These models produce content ('output') in response to user prompts ('input'). There is generally no concern if the LLMs are trained on public-domain sources, i.e., creative material that is not protected by intellectual property laws. However, a problem arises when the data used for training includes copyrighted material. Since AI models may 'learn' from the copyrighted data, there is an inherent risk that the output produced by these AI models is a literal or substantial reproduction of the copyrighted work.
Countries across the world are grappling with the challenges posed by AI to intellectual property laws, and a series of lawsuits alleging copyright infringement have been filed. In India, the recent cases of ANI Media Pvt. Ltd. v. Open AI Inc & Anr., 2024, and Kanchan Nagar & Ors. v. Union of India & Ors., 2024, filed before the High Court of Delhi, are expected to determine the Indian position.
This article explores the intersection between AI and copyright, focusing primarily upon (i) the risk of infringement; (ii) Fair Use and other defences; (iii) consent requirements; (iv) ownership of AI-generated work; and (v) the global position.
Training an AI model: Risk of Infringement
Copyright law protects the original works of an author and grants the owner, amongst other rights, the exclusive right of reproduction. Per Section 51 of the Indian Copyright Act, 1957, using someone's copyrighted work without obtaining the requisite permissions/licenses constitutes copyright infringement. With regard to AI models, copyright infringement can take place at two stages:
Input: The inclusion of copyrighted material in the training datasets of AI models can result in unauthorised copies being created during the training process, and in training data remaining embedded and stored within the model after training.
Output: If the content produced by the AI model, whether due to manipulative user prompts or to copyrighted data used in the training process, is substantially similar to the copyright-protected work of a third party, it may invite infringement claims.
Thus, it becomes essential to understand how proprietary data is extracted for AI models. Generally speaking, methods like data scraping or web crawling are used to build training datasets. Data scraping refers to the automated process of collecting information from websites and online platforms, often without the owner's explicit consent. This technique is frequently employed by developers of LLMs, such as ChatGPT, to extract information from publicly available sources. The use of web-scraped data in violation of copyright may invite legal action, and interesting precedents exist on the issue.
In the case of Ryanair Ltd. v. P.R. Aviation BV (2015),1 the Court of Justice of the European Union considered website terms of use which provided that "the use of automated systems or software to extract data from this website for commercial purposes, ('screen scraping') is prohibited unless the third party has directly concluded a written licence agreement." The Court held that the EU Database Directive does not prevent the owner of a database that is protected neither by copyright nor by the sui generis database right from imposing such contractual restrictions on its use. Closer to home, in the case of OLX BV and Ors. v. Padawan Ltd., 2016, UK-based Padawan Ltd. allegedly copied listings and data from the Indian online sale portal OLX and posted them on its own platform without authorisation. The High Court of Delhi, vide order dated December 15, 2016, restrained Padawan Ltd. from using automated or manual means to scrape data, including commercial data, from OLX's website. The reasoning employed here could well be extended to the domain of AI model training.
To cope with the growing threat of data scraping or web crawling, certain websites have included clauses in their terms that strictly disallow data scraping, and it thus becomes critical for developers to comply with the terms of use/user agreements. A few examples are:
- YouTube: You are not allowed to access the Service using any automated means (such as robots, botnets or scrapers) except (a) in the case of public search engines, in accordance with YouTube's robots.txt file; or (b) with YouTube's prior written permission;2
- X (Twitter): You must abide by the Services' acceptable use terms: ...you cannot scrape the Services, try to work around any technical limitations we impose, or otherwise attempt to disrupt the operation of the Services.3
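By way of illustration only, the robots.txt file referred to in such terms can be checked programmatically before any automated access is attempted. The sketch below uses Python's standard urllib.robotparser module; the sample robots.txt content, the crawler names ("ExampleTrainingBot", "OtherBot") and the example.com URLs are hypothetical and are not taken from any real website. A permissive robots.txt is not a licence: the website's terms of use and copyright law continue to apply regardless of what the file allows.

```python
from urllib import robotparser

# Hypothetical robots.txt content, used here instead of fetching a live file.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = robotparser.RobotFileParser()
    # parse() accepts the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("ExampleTrainingBot", "https://example.com/articles/1"),
        ("OtherBot", "https://example.com/articles/1"),
        ("OtherBot", "https://example.com/private/report"),
    ]
    for agent, url in checks:
        print(agent, url, may_fetch(SAMPLE_ROBOTS_TXT, agent, url))
```

In this sketch the hypothetical training crawler is refused everywhere, while other crawlers are refused only under /private/. A developer would still need to read the applicable terms of use, since clauses such as those quoted above may prohibit scraping even where robots.txt is silent.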
Defences: Fair Use and AI
The 'Fair Use' doctrine attempts to strike a balance between the legitimate rights of copyright owners to control and benefit from their work and the societal interest in using these works for purposes such as criticism, review, research, and personal use. In copyright law, the doctrine depends upon the purpose of the use, the nature of the work, the amount copied, and the effect on the potential market for the work. On the question of whether the use of copyrighted works to train AI models falls under the exception of Fair Use, AI companies often argue in favour of transformative use.
In the Indian context, the doctrine of 'Fair Use' (termed 'fair dealing' in the statute) envisaged under Section 52 of the Copyright Act extends to private or personal use including research, translations, criticism etc.; however, the use of copyrighted material as training data for AI models does not find explicit mention. It is perceived that the use of datasets to train AI models may fall under personal use. However, if the datasets are commercially applied, such use will constitute copyright infringement. Some AI models offer an advanced version which is generally paid for, such as ChatGPT Plus; it is thus difficult to say that these AI models are not commercially applied.
The other defences often advanced by AI companies to escape liability are non-expressive use and transformative use. It is a settled principle of law that copyright subsists in the expression of a work, and not in mere ideas or facts. In the case of Wiley Eastern Ltd. v. Indian Institute of Management, 1995, the High Court of Delhi emphasized that facts and knowledge embedded in expressions cannot be monopolised. Further, if the use of the work is of a "transformative character", i.e., the purpose served by the use is different from the one for which the work was created, this operates as a limitation on the owner's copyright protection. Parallels may be drawn from the case of Authors Guild v. Google, wherein the Authors Guild sued Google for copyright infringement over its Book Search project, which scanned books from major libraries without permission. Google created a searchable database offering limited excerpts, but not the full text. The court ruled that Google's actions were transformative, benefiting the public by expanding access to books while not substituting for the originals in the market. The US Court of Appeals for the Second Circuit held that "mass digitization of a large volume of in-copyright books in order to distil and reveal new information about the books amounts to fair use." Thus, if the AI companies are able to prove Fair Use, they may be able to escape liability.
The concept of 'Consent'
The concept of consent is embedded in the idea of 'licenses/authorisations' obtained by AI companies to train their models on copyright-protected data. Companies like OpenAI are entering into licensing deals with content owners, including publishing and media houses such as Vox Media, Financial Times, News Corp, and The Atlantic, to use their copyrighted material to train AI models. Licensing agreements help AI developers avoid copyright infringement claims. Obtaining licenses is considered to be the best method to build an accurate AI model while ensuring economic benefits for copyright owners. Recently, the Government of India, in response to a question in Parliament, clarified that appropriate permissions need to be obtained from the intellectual property rights holder before using copyright-protected work for training AI models.4
Who owns the AI generated work?
The ability of AI models to generate content raises an important question: who owns the copyright in the work generated by AI? Is it the AI company, the system developer of the AI model who collated the training data, or the user whose prompts produced the output? The issue becomes more complex where the AI system is trained on existing copyrighted material. It is difficult to answer such questions in isolation, and the answer may depend upon the terms and policies of each AI platform. For instance, OpenAI's terms of use state that, "As between you and OpenAI, and to the extent permitted by applicable law, you (a) retain your ownership rights in Input and (b) own the Output. We hereby assign to you all our right, title, and interest, if any, in and to Output."5 Interestingly, the terms of use also mention that "Due to the nature of our Services and artificial intelligence generally, output may not be unique and other users may receive similar output from our Services." Thus, it is possible for two different individuals to obtain the same or similar output in response to their prompts.
Another important point of consideration in defining ownership is whether an AI model, being a non-natural person, can be granted copyright at all. Copyright law in some jurisdictions, such as the USA and India, mandates that a human author must create a work, which may exclude AI-generated content from copyright protection altogether. In the Indian context, Section 2(d)(vi) of the Copyright Act defines an author "in relation to any literary, dramatic, musical or artistic work which is computer-generated, the person who causes the work to be created." An 'author' has to be a natural person. Moreover, in India, copyright generally subsists for the lifetime of the author and a maximum of sixty years after the author's death. Thus, if copyright is granted to an AI model having perpetual existence, the entire purpose of the Act would be defeated.
Since AI cannot qualify as a natural person, it remains to be seen how the definition of 'author' will be interpreted in the coming times.
That said, some re-working of laws and their interpretation seems to be in order. A cue may be taken from the US Copyright Office (USCO). Up until January 2025, USCO had rejected all AI-generated works on the basis that they lacked sufficient human authorship, reiterating that providing only a text prompt or command to an AI model, without additional human creative involvement, fails to meet the required standard and will not result in a copyright. Invoke, a generative artificial intelligence platform, submitted a claim for an image titled 'A Single Piece of American Cheese' on August 5, 2024, and on January 30, 2025, that claim was approved. USCO ruled that A Single Piece of American Cheese met the threshold for copyright because the creator actively selected, coordinated, and arranged numerous AI-generated image fragments into a unified composition. This creative decision-making process mirrored that of a collage artist, where individual components are curated and structured into a cohesive whole. Interestingly, Invoke has developed a tool it calls "Provenance Records" that tracks the changes an artist has made to an image and embeds that information into the metadata.
Global Position: Existing laws and regulations
While there is no specific legislation in the US on AI, courts have interpreted the extant copyright laws to adjudicate the issue. The US Copyright Office has also launched an initiative to examine the use of copyrighted materials in AI training, which should clarify the position in the coming times.6
Similarly, in the case of Japan, though there is no AI-specific statute as yet, the government's report "General Understanding on AI and Copyright in Japan",7 released in May 2024, takes the position that copyrighted works may be exploited for non-enjoyment purposes to train AI models.
The Canadian Parliament has proposed the Artificial Intelligence and Data Act (AIDA), which aims to focus upon the responsible management and development of AI.8
The European Union is the first jurisdiction to have introduced a comprehensive AI Act. The EU AI Act, 2024 requires general disclosure of training data and specific authorization from copyright owners whose works are used to train a GenAI system. The Act also relies on an 'opt-out' model, which explicitly restricts the use of copyrighted material if the owner has expressly reserved it. The opt-out model is a mechanism that allows content creators to prevent their data from being used for training AI by signalling their withdrawal.
As part of the first effort at the global level, on March 21, 2024 the United Nations General Assembly passed the first global resolution on AI, which highlighted the respect, protection and promotion of human rights in the design, development, deployment and use of AI.
Though there are no specific AI laws in India, the government is taking steps to shape the AI regulatory landscape with initiatives and guidelines for the ethical and responsible deployment of artificial intelligence technologies. Initiatives such as the National AI Strategy by NITI Aayog aim to promote AI research, development, and adoption in various sectors like healthcare, agriculture, and education. In 2024, the government allocated INR 103 billion towards the IndiaAI Mission to bolster the development of India's AI ecosystem.9 Recently, the government has also announced the development of India's first foundational AI model with a new compute facility housing 18,000+ GPUs. The establishment of an Artificial Intelligence Safety Institute (AISI) is also planned, with the aim of positioning India as a leader in responsible AI development and ensuring that artificial intelligence benefits people without compromising ethical standards.
On 10 and 11 February 2025, France and India co-chaired the Artificial Intelligence Action Summit, gathering Heads of State and industry, representatives of academia, non-governmental organizations, artists and members of civil society, in order to build on the previous summits at Bletchley Park (November 2023) and Seoul (May 2024). The summit underlined the commitment to take concrete actions to ensure that the global AI sector can drive beneficial social, economic and environmental outcomes in the public interest, and it has been announced that India will host the next AI Summit.
Conclusion
It is undisputed that 'artificial' intelligence is today's 'real(ity)'!
There is a pressing need to govern the use of copyrighted works for AI training in a progressive manner that will stand the test of time. It is essential to go beyond the traditional concept of copyright infringement, including by revisiting how the definition of 'author' is interpreted. In navigating the evolving landscape of AI and copyright, it is crucial to strike a balance between fostering innovation and safeguarding the rights of creators.
Footnotes
1. https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:62014CJ0030&from=EN
2. https://kids.youtube.com/t/terms
4. https://sansad.in/getFile/annex/263/AU845.pdf?source=pqars
5. https://openai.com/policies/row-terms-of-use/
6. https://www.copyright.gov/ai/
7. https://www.bunka.go.jp/english/policy/copyright/pdf/94055801_01.pdf
8. The Artificial Intelligence and Data Act (AIDA) – Companion document
9. https://pib.gov.in/PressReleasePage.aspx?PRID=2012375