ARTICLE
9 September 2025

Generative AI And Copyright Law In India: Who Owns The Training Data?

MAHESHWARI & CO. Advocates & Legal Consultants

Generative artificial intelligence has opened new frontiers in creativity, law, and commerce. From drafting text to producing artwork, these systems depend on massive volumes of pre-existing material. That raises a pressing question: when AI learns from books, music, or images created by others, what does Indian copyright law say about it? The debate is not just academic. It cuts to the core of whether using copyrighted works for machine training is permissible, whether such use falls under fair dealing, and what safeguards companies should put in place before deploying AI models.

The framework around AI copyright in India remains unsettled. Unlike some jurisdictions that have explicitly carved out exceptions for text and data mining, India continues to operate within the traditional boundaries of its Copyright Act, 1957. This makes the treatment of training data copyright in India a legal grey zone, where businesses, creators, and policymakers are all searching for clarity.

Fair Dealing and AI Training under Indian Copyright Law

At the heart of the debate lies the doctrine of fair dealing. Indian copyright law, unlike the more flexible "fair use" standard in the United States, provides a closed list of exceptions, set out in Section 52 of the Copyright Act, 1957, under which copyrighted works may be used without permission. These include private or personal use, criticism, review, reporting of current events, judicial proceedings, and certain educational and research purposes.

When applied to AI, the question becomes: does feeding copyrighted works into machine learning models for training qualify as one of these exceptions? Developers often argue that training is a form of non-expressive use—it does not copy or distribute the work in its original form but uses it to detect patterns. Yet, Indian law has not explicitly recognised such use. Courts have historically interpreted fair dealing narrowly, meaning reliance on this exception for AI training remains legally uncertain.

For companies and researchers, this creates significant risk. Unless Parliament introduces a specific carve-out for AI-related learning, the use of copyrighted materials for training datasets could be challenged as infringement. In this context, the treatment of AI copyright in India becomes less about innovation policy and more about navigating an untested legal landscape.

Text and Data Mining in India

Text and data mining, or TDM, refers to the process of analysing large volumes of textual, visual, or audio material to extract patterns, relationships, and insights. For AI models, this step is indispensable. Yet, under Indian copyright law, the legality of TDM depends on whether the underlying works are protected and how they are used.
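
By way of illustration, the short sketch below (a simplified, hypothetical Python example, not drawn from any actual AI system) shows what "extracting patterns" can mean in practice: the program keeps only aggregate word statistics from a small invented corpus, rather than reproducing any underlying text. This is the kind of non-expressive processing that developers point to when defending machine training.

# Minimal illustration of text and data mining: the output is aggregate
# statistics about the corpus, not a reproduction of any single work.
import re
from collections import Counter
from itertools import combinations

# Hypothetical corpus stand-in; a real pipeline would read licensed or
# public-domain documents from storage.
corpus = [
    "Copyright law protects original literary and artistic works.",
    "Generative models learn statistical patterns from large text corpora.",
    "Fair dealing permits limited uses of protected works in India.",
]

word_counts = Counter()
pair_counts = Counter()
for document in corpus:
    words = re.findall(r"[a-z]+", document.lower())
    word_counts.update(words)
    # Record which words appear together in the same document.
    pair_counts.update(combinations(sorted(set(words)), 2))

print(word_counts.most_common(5))
print(pair_counts.most_common(5))

Whether this kind of statistical retention nonetheless amounts to "reproduction" under Indian law is precisely the open question discussed below.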

Unlike the European Union, which has carved out explicit TDM exceptions for research and certain commercial uses, India has not adopted a similar framework. The Copyright Act does allow use of works for research or private study, but this is typically understood as human research, not automated mining by machines. This creates ambiguity. If a company scrapes newspapers, journals, or image archives to build an AI model, that activity could easily fall outside recognised exceptions.

Developers seeking to conduct text and data mining in India are therefore left with two choices: either rely on open-access material such as works in the public domain or Creative Commons-licensed content, or obtain licences from rights holders. Without such diligence, the use of copyrighted works in TDM could expose developers to infringement claims. As courts begin to grapple with disputes involving AI, the lack of statutory guidance makes contracting and compliance even more critical.
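
For developers taking the first route, the practical step is a licence gate at the point of data ingestion. The sketch below is a minimal, hypothetical Python illustration; the metadata fields and the allow-list of licence labels are assumptions made for the example rather than any standard taxonomy.

# Sketch of a licence gate applied before material enters a training set.
# The "licence" field values and the allow-list are illustrative only.
ALLOWED_LICENCES = {
    "public-domain",
    "cc0-1.0",
    "cc-by-4.0",
    "licensed-by-contract",  # rights obtained directly from the holder
}

candidate_items = [
    {"title": "1920s novel (term expired)", "licence": "public-domain"},
    {"title": "Open image-caption dataset", "licence": "cc-by-4.0"},
    {"title": "Scraped news archive", "licence": "unknown"},
]

training_set = [item for item in candidate_items
                if item["licence"] in ALLOWED_LICENCES]
held_back = [item for item in candidate_items
             if item["licence"] not in ALLOWED_LICENCES]

print(f"Accepted for training: {len(training_set)} item(s)")
print(f"Held back pending clearance: {len(held_back)} item(s)")

In practice, the allow-list would be settled with counsel and recorded alongside the dataset, so that the basis for including each item can be demonstrated if a dispute arises.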

Training Data Copyright in India

Training data is the lifeblood of generative AI systems. Whether it is novels, journal articles, song recordings, or digital art, these inputs shape how an AI learns to generate new outputs. But under Indian law, such material is often subject to copyright. This raises a fundamental concern: does the act of using copyrighted works as training data constitute infringement, even if the AI does not reproduce them verbatim?

Indian copyright law grants authors exclusive rights to reproduce, adapt, and communicate their works. Training data arguably implicates the reproduction right, because the AI must make copies—however temporary or partial—to process information. Courts in India have not yet ruled on whether such technical copies fall within infringement, but the absence of explicit exemptions leaves developers in a vulnerable position.

Another layer of complexity arises when training datasets mix public domain works with copyrighted materials. While public domain content is safe to use, copyrighted works demand licensing unless they clearly fall within fair dealing. Companies building AI tools in India are therefore urged to carry out copyright audits of their training data. This ensures that reliance on disputed material does not lead to future litigation.
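
Such an audit can be as simple as maintaining a manifest that records the rights basis for every item in the training set and flags anything undocumented. The outline below is an illustrative Python sketch; the field names and sources are invented for the example.

# Sketch of a training-data copyright audit: summarise each item's
# rights basis and flag anything that lacks a documented clearance.
from collections import defaultdict

dataset_manifest = [
    {"source": "archive.org scan", "rights_basis": "public-domain"},
    {"source": "stock image vendor", "rights_basis": "licence-agreement"},
    {"source": "web crawl 2023", "rights_basis": None},
]

audit = defaultdict(list)
for record in dataset_manifest:
    basis = record["rights_basis"] or "undocumented"
    audit[basis].append(record["source"])

for basis, sources in audit.items():
    print(f"{basis}: {len(sources)} item(s)")

# Items without a documented rights basis are candidates for removal
# or for licensing negotiations before the model is trained.
print("Needs clearance review:", audit.get("undocumented", []))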

The unsettled state of training data copyright in India highlights the urgent need for legislative or judicial guidance. Without it, innovators operate under legal uncertainty, and creators fear their works are being used without consent or compensation.

Contractual Safeguards in AI Development

Given the uncertainties around copyright exceptions, contracts are becoming the most reliable tool to manage risk. For businesses developing or deploying AI systems in India, well-drafted agreements can clarify rights, allocate liabilities, and ensure compliance with copyright law.

Licensing contracts with data providers are the first line of defence. By negotiating express rights to use specific datasets for training, companies reduce the chances of infringement claims. These licences should address not only access and use but also the scope—whether the data may be used for research, commercial development, or resale.

Equally important are warranties and indemnities in AI development contracts. Service providers may warrant that the training data they use is lawfully obtained and free of third-party claims. Indemnities can then shift the financial risk of infringement disputes to the party best positioned to control data sourcing.

On the user side, businesses integrating generative AI tools into their workflows should insist on clear contractual terms regarding ownership of outputs, responsibility for copyright compliance, and liability if generated works infringe existing copyrights. Without these safeguards, end-users may unknowingly expose themselves to litigation.

Conclusion

India stands at a crossroads in regulating the relationship between generative AI and copyright. While the current law does not explicitly accommodate text and data mining or AI training, the pace of innovation makes it urgent for lawmakers and courts to clarify the rules. A balanced approach could involve introducing limited statutory exceptions for non-commercial TDM and research, while preserving licensing rights for commercial exploitation. Such clarity would encourage innovation without undermining the interests of creators.

Until then, businesses and developers must rely on practical safeguards—sourcing training data responsibly, negotiating licences, and embedding strong contractual protections. The debate over AI copyright in India is not only about compliance but also about building trust in an ecosystem where human creativity and machine learning can coexist without conflict.

The content of this article is intended to provide a general guide to the subject matter. Specialist advice should be sought about your specific circumstances.
