This post explores whether generative AI tools such as ChatGPT can
collect our data from the Internet without our consent on the
ground that the data is in the public domain. It also explores whether
ChatGPT's recent ban in Italy over data protection norms holds
any lessons for India.
This post was first published on Medianama.
ChatGPT can write sonnets, code websites, and even pass the bar exam. It learned how to do this by training on huge amounts of data. A lot of this data is personal information about individuals scraped from the Internet, often without their knowledge.
Catching on to this, Italy's data protection regulator last month stopped ChatGPT's operations over a breach of its data protection norms.
India is still finalising its data protection law. Against the backdrop of Italy's action, we discuss how ChatGPT would fare under India's proposed law, and whether there are lessons for us to draw from this episode.
ChatGPT under the scanner across the EU
Italy's ban on ChatGPT was prompted by a few reasons:
- There was no legal basis to justify the massive collection of data to train ChatGPT's algorithms.
- OpenAI did not have appropriate age-gating mechanisms to ensure that children's data was not collected to train algorithms.
- The company didn't give people adequate notice before collecting their data.
- ChatGPT gave out factually incorrect information.
Italy has now asked OpenAI to abide by certain norms for the ban to be lifted. OpenAI must publish information about its data processing and clarify the legal basis for processing personal data to train its AI. It must also allow users to seek correction or deletion of inaccurate data, and to object to OpenAI's use of their personal data to train its algorithms.
While Italy's approach raises several interesting questions, we focus on one key issue – training AI models by using data that's available freely and publicly. Think public social media profiles, news pieces, Reddit posts, and so on.
Is data from public sources 'private'?
ChatGPT's technical paper says its training data includes "publicly available personal information". Under EU law, any data that can identify an individual is 'personal information'. To collect and use such data, a business must meet privacy norms, regardless of whether the data is collected from the individual directly or is freely and publicly available.
Interestingly, under India's current data protection law (rules under the Information Technology Act), data that is "freely available" or "accessible in public domain" is not considered sensitive data. So you need not abide by data protection rules when collecting and using such publicly available information.
But the draft Digital Personal Data Protection Bill 2022 (India's current draft data protection law) takes a different position, one that's similar to the EU approach. Even if you collect data from public sources, if it relates to an identifiable individual, it is 'personal'. And all the do's and don'ts that attach to the collection and use of personal data apply to it (with one exception, around deemed consent).
How can data be collected and used to train AI models?
In the EU, even if a business is collecting or scraping personal information off the Internet, it must still justify its collection and use under one of six legal 'bases' set out in the GDPR. User consent is one basis. Another is fulfilling a contract. But the one that is often used for training AI algorithms or for improving a product is the "legitimate interests" of a business.
India's draft law does not require the data collector to establish a legal basis as such. However, to collect and use personal data, a platform must have users' consent or deemed consent. That is, either you get actual consent from individuals, or your collection or use of data falls within one of the 'deemed consent' grounds recognised in law. These include processing data to comply with a court order, to respond to a medical emergency or for a public health response, and processing data for 'reasonable purposes' recognised by the Indian government.
'Deemed consent' may help in training AI
Taking repeated consent to collect data for training AI models is cumbersome. So developers are likely to consider two "deemed consent" grounds that could be relevant here.
One, under the draft law, consent can be assumed when you are processing "publicly available personal data" in "public interest". Say a platform scoops up a public Reddit thread where users discuss their worst dating encounters, to train its algorithm. Does the AI developer then not need to take users' consent separately to process this data, since it is publicly available?
Interestingly, platforms like Reddit are going to start charging AI developers for accessing their content. But the question of consent or deemed consent would remain.
Using data to train AI models: a reasonable purpose?
As India seeks to establish itself as an AI powerhouse, it would be worth exploring whether the use of data to train AI models should be a 'reasonable purpose' under India's data protection law. This should be subject, of course, to appropriate checks and balances. For instance, similar to Italy's guidance, individuals could be allowed the right to object to the use of their personal data for training AI models, that is, an opt-out rather than an opt-in.