The rise of artificial intelligence (AI) and its widespread availability offers significant growth opportunities for businesses. However, it necessitates a robust governance framework to ensure compliance with regulatory requirements, especially under the EU Artificial Intelligence Act (AI Act; see our Guide to the AI Act) and the EU General Data Protection Regulation (GDPR). The reason GDPR compliance is so important is that (personal) data is a key pillar of AI. For AI to function effectively, it requires good-quality and abundant data so that it can be trained to identify patterns and relationships. Additional personal data is often gathered during deployment and incorporated into AI to assist with individual decision-making.
In this series of five blog posts, we discuss GDPR compliance throughout the AI development life cycle and when using AI.
Data Protection by Design
GDPR compliance plays a key role throughout the AI development life cycle, starting from the very first stages. This reflects one of the key requirements and guiding principles of the GDPR, called data protection by design (Article 25 GDPR). Businesses are required to implement appropriate technical and organizational measures, such as pseudonymization, both at the determination stage of processing methods and during the processing itself. These measures should aim to implement data protection principles, such as data minimization, and integrate necessary safeguards into the processing to ensure GDPR compliance and protect individuals' data protection rights.
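To make this concrete, the sketch below illustrates one common pseudonymization technique: keyed hashing of a direct identifier, with the secret kept separately from the dataset. The field names and key handling are hypothetical; this is a minimal sketch, not a complete implementation.

```python
# Illustrative only: minimal pseudonymization via keyed hashing (HMAC-SHA256).
# Field names and key storage are hypothetical; real deployments need proper
# key management and a re-identification risk assessment.
import hashlib
import hmac

# Under Article 4(5) GDPR, the "additional information" (here, the key) must
# be kept separately and protected by technical and organizational measures.
SECRET_KEY = b"store-me-separately-from-the-dataset"

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a keyed hash."""
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"email": "jane.doe@example.com", "purchase_total": 42.50}
record["email"] = pseudonymize(record["email"])  # direct identifier replaced
print(record)
```

Note that pseudonymization, unlike anonymization, does not take data outside the scope of the GDPR; it is a risk-reduction safeguard, not an exemption.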
AI Development Life Cycle
The AI development life cycle encompasses four distinct phases: planning, design, development, and deployment. In this context, in accordance with the terminology of the EU AI Act, we will refer to both AI models and AI systems.
- AI models are a component of an AI system and are the engines that drive the functionality of AI systems. AI models require the addition of further components, such as a user interface, to become AI systems.
- AI systems present two characteristics: (1) they operate with varying levels of autonomy and (2) they infer from the input they receive how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments.
In this blog post, we focus on the first phase of the AI development life cycle: planning.
The Planning Phase
The first phase of the AI development life cycle involves understanding the business problem and defining objectives, requirements, and a solid AI governance structure to ensure regulatory compliance. During this phase, it is essential to determine the scope of (personal) data needed and identify any constraints related to such data, with a focus on the availability of the relevant datasets.
In this context, key GDPR compliance considerations involve evaluating whether the data is personal data, ensuring the processing has a valid legal basis, and verifying that the processing respects the principle of purpose limitation, along with the other key principles of the GDPR.
Personal Data
The GDPR only applies to personal data, i.e., any information relating to a natural person who is identified or can be identified, directly or indirectly. A key question, therefore, is whether AI input or output data constitutes personal data.
- Input data is information provided to or directly obtained by an AI system, based on which the system generates an output.
- Output data varies depending on the type of AI model and its intended use. There are three major types of output: predictions, recommendations, and classifications.
The European Data Protection Board (EDPB), the umbrella group of the EU's data protection authorities, issued a nonbinding opinion in December 2024 on the processing of personal data in the context of AI models (EDPB Opinion on AI Models). In the opinion, the EDPB considered whether and how AI models trained with personal data can be deemed anonymous. The EDPB identified two scenarios.
- The AI model is designed to provide personal data. When an AI model is specifically designed to provide personal data regarding individuals whose personal data was used to train the model, or in some way to make such data available, it cannot be regarded as anonymous and the GDPR necessarily applies. According to the EDPB, examples of such AI models include a generative model fine-tuned on the voice recordings of an individual to mimic their voice, or a model designed to reply with personal data from the training when prompted for information regarding a specific person.
- The AI model is not designed to provide personal data. The EDPB considers that, even when an AI model has not been designed to produce personal data from the training data, it is still possible that personal data from the training dataset remains absorbed in the parameters of the model and can be extracted from that model. Whether the outputs of such AI models can be considered anonymous should be determined on a case-by-case basis. The EDPB appears to agree that an AI model may be anonymous, although it considers such a scenario highly unlikely. According to the EDPB, an AI model can only be anonymous provided that it meets the following conditions:
- The likelihood that individuals whose data was used to build the model may be identified (directly or indirectly) is insignificant; and
- The likelihood of obtaining, intentionally or not, such personal data from queries is insignificant too.
The EDPB considers that examining whether these conditions are met must take into account the EDPB's Guidance on Anonymization and whether the risk of identification has been assessed, considering all the means reasonably likely to be used to identify individuals (Recital 26 GDPR). According to the EDPB, the determination of those means should be based on objective factors, such as:
- The characteristics of the training data (e.g., the uniqueness of the records, the precision of the information, any aggregation and randomization, and how these affect vulnerability to identification), the AI model, and the training procedure;
- The context in which the AI model is released and/or processed, with contextual elements including measures such as limiting access only to some persons and legal safeguards;
- The additional information that would allow the identification and may be available to the given person;
- The costs and amount of time that the person would need to obtain such additional information; and
- The available technology at the time of the processing and technological developments.
The EDPB Opinion on AI Models provides a non-exhaustive and non-prescriptive list of possible elements that may be considered when assessing an AI model's anonymity. These include the steps controllers take at the design stage to prevent or limit the collection of personal data for training and to reduce its identifiability, the model's testing and resistance to attacks, and documentation of the processing operations, including anonymization. Pending cases before the Court of Justice of the EU may affect the EDPB's analysis.
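For teams that need to document this case-by-case assessment, a hedged sketch of how the EDPB's two conditions might be recorded is shown below. The structure, field names, and example values are our own illustration, not an EDPB-endorsed test.

```python
# Illustrative only: a hypothetical structure for documenting the EDPB's two
# anonymity conditions. Field names and the notion of an "insignificant"
# likelihood are our assumptions, not regulatory definitions.
from dataclasses import dataclass, field

@dataclass
class AnonymityAssessment:
    model_name: str
    # Condition 1: likelihood of (directly or indirectly) identifying the
    # individuals whose data was used to build the model.
    identification_likelihood_insignificant: bool
    # Condition 2: likelihood of obtaining such personal data, intentionally
    # or not, through queries to the model.
    query_extraction_likelihood_insignificant: bool
    # Supporting evidence, e.g., attack-resistance tests, access controls.
    evidence: list[str] = field(default_factory=list)

    def may_be_anonymous(self) -> bool:
        """Both EDPB conditions must hold; the conclusion remains case-by-case."""
        return (self.identification_likelihood_insignificant
                and self.query_extraction_likelihood_insignificant)

assessment = AnonymityAssessment(
    model_name="support-chatbot-v2",
    identification_likelihood_insignificant=True,
    query_extraction_likelihood_insignificant=False,  # e.g., red-teaming extracted training data
    evidence=["membership-inference audit", "API-only access, rate-limited"],
)
print(assessment.may_be_anonymous())  # False: the GDPR likely applies
```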
Legal Basis
Under the GDPR, the processing of personal data is only lawful if the controller can demonstrate a valid legal basis. The most relevant legal bases for AI under the GDPR are consent and legitimate interests. According to the EDPB, the development and deployment phases entail different processing activities that call for different legal bases and should be evaluated individually.
- Consent. Valid consent is often difficult to obtain because it must be freely given, specific, informed, and unambiguous, and provided by a clear affirmative action. These conditions are generally interpreted restrictively. In addition, consent can be withdrawn at any time, and it should be as easy to withdraw consent as to give it.
- Legitimate interests. Personal data may be processed if the processing is necessary to pursue a legitimate interest and such interest is not overridden by the interests or fundamental rights and freedoms of the individuals concerned. Legitimate interests may only be relied on provided the following three-step test is satisfied, and this test must be assessed on a case-by-case basis.
- Legitimate interest. The processing must pursue a legitimate interest. An interest is considered legitimate if it is lawful, clearly and precisely articulated, and real and present (i.e., not hypothetical). For example, the EDPB considers that the use of a chatbot to assist users and the use of AI to improve cyber threat detection may be legitimate interests.
- Necessity. The processing must be necessary to pursue the legitimate interest in question. The EDPB sets a very high bar here: the assessment must consider not only whether the volume of personal data involved is proportionate to the legitimate interest pursued, but also whether less intrusive alternatives could achieve it, in accordance with the data minimization principle. In other words, the processing of personal data is not necessary if the legitimate interest can be pursued through an AI model that does not entail such processing. This is obviously a very restrictive approach.
- Balancing test. The legitimate interest must not be overridden by the interests or fundamental rights and freedoms of the individuals concerned. This step consists of identifying and describing the different opposing rights and interests at stake. The interests of the individuals concerned may include, for example, their interest in retaining control over their personal data, financial interests (e.g., where an AI model is used by an individual to generate revenues), personal benefits (e.g., where the individual is using AI to improve accessibility to services), or socioeconomic interests (e.g., AI that improves access to healthcare or education). Opposing interests would typically include the AI developer's fundamental right to conduct business.
The impact of the processing on individuals may be influenced by the nature of the data processed by the models (e.g., financial or location data may be particularly sensitive), the context of the processing (e.g., whether personal data is combined with other datasets, the overall volume of data and the number of individuals affected, and whether those individuals are vulnerable), and its consequences (e.g., violation of fundamental rights, damage, or discrimination). Importantly, the analysis of such possible consequences must take into account the likelihood of these consequences materializing, especially considering the measures in place and the circumstances of the case.
Individuals' reasonable expectations also play a key role in the balancing test. The assessment of such expectations must take into account various criteria, such as the information provided to the individuals concerned and the wider context of the processing, including whether or not the personal information was accessible to the public, the type of relationship with the company processing personal data, the type of service, the context and source of the data collection, the possible future applications of the model, and whether people are genuinely aware that their personal data is online.
If the balancing exercise indicates that the processing has negative impacts on individuals, mitigation measures may tip the balance in favor of the AI developer. These measures may be technical in character (e.g., data minimization, pseudonymization, or using synthetic data), facilitate the exercise of individuals' rights (e.g., offering an unconditional opt-out or a right to erasure that is more generous than the one enshrined in the GDPR), or improve transparency (e.g., providing extensive information to individuals, including through email campaigns, FAQs, graphic visualizations, and transparency labels).
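As a purely illustrative example of one such mitigation, the sketch below honors an unconditional opt-out and applies data minimization before training. The record layout and opt-out registry are hypothetical assumptions.

```python
# Illustrative only: honoring an unconditional opt-out and applying data
# minimization before training. Record layout and registry are hypothetical.
opt_out_registry = {"user-1042", "user-2311"}  # IDs of opted-out individuals

training_records = [
    {"user_id": "user-0007", "text": "sample text", "email": "a@example.com"},
    {"user_id": "user-1042", "text": "sample text", "email": "b@example.com"},
]

# Exclude opted-out individuals, then keep only the fields actually needed
# for the stated purpose (data minimization).
training_set = [
    {"text": r["text"]}
    for r in training_records
    if r["user_id"] not in opt_out_registry
]
print(len(training_set))  # 1
```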
Purpose Limitation
As discussed above, the planning phase involves understanding the business problem and defining the objectives of the AI model or system to be developed. This is key for GDPR compliance: the GDPR requires that personal data only be collected for specified, explicit, and legitimate purposes, and that it not be further processed in a manner that is incompatible with those purposes. Moreover, compliance with other core GDPR principles requires a solid understanding of the purpose of AI development.
- Transparency. The purpose of the processing must be communicated to the individuals concerned.
- Data minimization. The processing must be limited to what is necessary in relation to the purpose of the processing.
- Accuracy. Every reasonable step must be taken to ensure that personal data that is inaccurate, with regard to the purpose for which it is processed, is erased or rectified without delay.
- Storage limitation. Personal data must be kept for no longer than is necessary for the purpose for which it is processed. This entails setting precise retention periods (carefully determined based on the specific needs of the AI model), documenting the need for data retention, and laying down protocols for the safe disposal of data, as sketched below.
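The sketch below illustrates how retention periods might be encoded and enforced in practice. The purposes and periods shown are hypothetical examples, not recommendations for any specific processing.

```python
# Illustrative only: a hypothetical retention schedule keyed by processing
# purpose, with a purge helper. The purposes and periods are examples.
from datetime import datetime, timedelta, timezone

RETENTION_PERIODS = {
    "model_training": timedelta(days=365),
    "support_chat_logs": timedelta(days=90),
}

def is_expired(purpose: str, collected_at: datetime) -> bool:
    """True once the retention period set for this purpose has elapsed."""
    return datetime.now(timezone.utc) - collected_at > RETENTION_PERIODS[purpose]

def purge_expired(records: list[dict]) -> list[dict]:
    """Keep records still within their retention period; expired records
    would be securely deleted under the disposal protocol."""
    return [r for r in records if not is_expired(r["purpose"], r["collected_at"])]

records = [
    {"purpose": "support_chat_logs",
     "collected_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
]
print(len(purge_expired(records)))  # 0: the 90-day period has elapsed
```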
Data Protection Impact Assessment
The GDPR requires a Data Protection Impact Assessment (DPIA) prior to the processing when the processing is likely to result in a high risk to the rights and freedoms of individuals. In this context, the nature, scope, context, and purposes of the processing must be taken into account.
According to a recent report commissioned by the EDPB on large language models, examples of common scenarios that may require a DPIA include:
- The use of new technologies that could introduce privacy risks;
- Large-scale monitoring of publicly accessible spaces (e.g., video surveillance);
- Processing sensitive data categories such as racial or ethnic origin, political opinions, religious beliefs, genetic data, biometric data, or health information;
- Automated decision-making that has legal or similarly significant effects on individuals; and
- Processing children's data or any data where a breach could lead to physical harm.
Even when a DPIA is not legally required, conducting one is a prudent best practice for AI projects. It allows organizations to preemptively address potential data protection risks, assess the impact of their solutions, and demonstrate accountability.
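To give a flavor of how such screening is often operationalized, the sketch below encodes the scenarios listed above as a simple checklist. The trigger names and the any-trigger rule are our own illustration and no substitute for a full, documented DPIA.

```python
# Illustrative only: a simple DPIA screening checklist based on the common
# scenarios listed above. A "True" answer flags the project for a full DPIA.
DPIA_TRIGGERS = {
    "uses_novel_technology": "New technologies that could introduce privacy risks",
    "monitors_public_spaces_at_scale": "Large-scale monitoring of publicly accessible spaces",
    "processes_sensitive_data": "Special categories such as health or biometric data",
    "automated_decisions_with_legal_effect": "Automated decision-making with significant effects",
    "processes_childrens_data": "Children's data, or data where a breach could cause physical harm",
}

def dpia_recommended(answers: dict[str, bool]) -> bool:
    """Recommend a DPIA if any trigger applies (conservative screening)."""
    return any(answers.get(trigger, False) for trigger in DPIA_TRIGGERS)

answers = {"uses_novel_technology": True, "processes_sensitive_data": False}
print(dpia_recommended(answers))  # True -> conduct a DPIA
```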
The content of this article is intended to provide a general guide to the subject matter. Specialist advice should be sought about your specific circumstances.