In this Insight article, Marguerite Brac de La Perrière, Partner at Fieldfisher, examines the French data protection authority's (CNIL) recommendations on artificial intelligence (AI), offering guidance for applying data protection principles in AI development.
There can be no artificial intelligence (AI) system without huge
amounts of data, but the relationship between the creation of
training databases containing personal data and the development of
the system, on the one hand, and data protection, on the other, can be complex.
Concerns have been raised that regulation could stifle innovation
and, more specifically, that the GDPR could inhibit AI innovation
in Europe. Indeed, designers and developers of AI systems face
significant challenges in applying the requirements of the GDPR,
particularly when training models.
With the main objective of helping organizations reconcile
innovation and respect for human rights in the development of their
AI systems, the French data protection authority (CNIL) has
published recommendations on the development of AI
systems that were adopted following public consultation.
Taking into account the adopted EU Artificial Intelligence Act
(the EU AI Act), the recommendations suggest ways to apply
fundamental data protection principles to the different stages of
development of all types of AI systems when personal data is
involved.
The recommendations are not binding: data controllers may deviate
from them under their own responsibility, provided they can justify
their decisions.
The first steps for any developer are to identify the applicable
regime and the purpose for the data processing carried out during
the development phase, which are closely related.
Applicable legal regime
Some provisions of the French Act No. 78-17 of 6 January 1978
on Data Processing, Data Files and Individual Liberties (as amended
to implement the GDPR) (the Act) may apply to the development and
deployment phases of an AI system, such as those regarding the
processing of health data, the law enforcement sector, or national
defense.
When the operational use of the AI system in the deployment phase is
already defined during the development phase, the development-phase
processing is usually considered to be covered by the same legal
regime as the one that will apply in the deployment phase.
When the purpose of the processing in the deployment phase cannot be
clearly identified during the development phase, the legal regime of
the development phase may differ from the one that will apply in the
deployment phase.
Purpose of the processing
Any AI system based on the exploitation of personal data must be
developed with a 'purpose,' meaning a well-defined
objective, which makes it possible to ensure transparency, limit
the personal data that can be used for training, and avoid storing
and processing unnecessary data.
As for any other processing of personal data, the purpose must be
specified (i.e., established as soon as the project is defined),
explicit, and legitimate.
When an AI system is developed for a single operational use, the
purpose in the development phase is considered directly related to
the one pursued by the processing in the deployment phase.
Therefore, if the purpose in the deployment phase is itself
specified, explicit, and legitimate, the purpose in the development
phase will also be considered validly defined.
However, this is more complex when developing a general-purpose AI
system that can be used in various contexts and applications, or
when the system is developed for scientific research purposes. For
general-purpose AI systems, organizations may not foresee any
specific operational use when developing the model. In this
situation, the organization should not define the purpose too
broadly, for example as 'the development and improvement of an AI
system.' The purpose needs to be more precise and refer to the type
of system developed (such as a large language model, a computer
vision system, or a generative AI system for images, videos, sounds,
computer code, etc.) and the technically feasible functionalities
and capabilities. Furthermore, it is good practice to give even
more detail, such as the foreseeable capabilities most at risk,
the functionalities excluded by design, and the conditions of use of
the AI system (open source, SaaS, or via an API, etc.).
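To make this level of detail concrete, such purpose documentation could be captured in a structured record accompanying the model. The following Python sketch is purely illustrative; the field names and values are assumptions, not a format prescribed by the CNIL:

```python
# Hypothetical purpose documentation for a general-purpose AI system.
# All field names and values are illustrative assumptions, not a
# CNIL-prescribed schema.
purpose_record = {
    "system_type": "large language model",
    "feasible_functionalities": ["text generation", "summarization", "translation"],
    "capabilities_most_at_risk": ["generating content about identifiable persons"],
    "functionalities_excluded_by_design": ["biometric identification"],
    "conditions_of_use": ["open source", "SaaS", "API"],
}
```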
In some cases, when the creation of training datasets for AI
pursues a scientific research purpose, it may be difficult to
identify the objective in full at the outset of the work. The
purpose may then be described less precisely, or not specified in
its entirety, and information clarifying the objective can be
provided as the project progresses.
Responsibilities under the GDPR
Taking into account the identified purpose, any developer of an AI
system must identify its responsibilities with regard to the data
processing implemented during the development phase.
When a provider takes the initiative to develop an AI system and
builds a training dataset from data selected on its own account, it
can be qualified as a controller. If the provider builds a training
dataset for an AI system together with other controllers, for a
purpose that they have defined together, they can be qualified as
joint controllers.
When an AI system provider develops a system on behalf of one of
its customers, it can be a processor. In this situation, the
customer may be the controller.
If the customer only sets the goal to achieve and the provider
designs the AI system, the provider may be the controller.
Legal basis
Like any personal data processing, the creation and use of a
training dataset containing personal data can only be implemented
if it corresponds to one of the 'legal bases' provided for
in the GDPR, which is what gives an organization the right to
process personal data. The choice of legal basis is an essential
first step in ensuring the compliance of the processing, as the
organization's obligations and the rights of individuals vary
depending on it.
Consent is rarely the appropriate legal basis, either because
direct contact with data subjects is not possible or because it is
difficult to obtain freely given consent and to guarantee the right
of withdrawal. The legal bases of contract and legal obligation may
be used only more exceptionally, where the processing is necessary
for the performance of a contract or pre-contractual measures, or
to comply with a sufficiently precise legal obligation to which the
controller is subject. Private actors will therefore most often
have to rely on legitimate interest, provided that they meet its
three conditions: the interest pursued is legitimate, the
processing of personal data is necessary, and there is no
disproportionate interference with individuals' privacy.
In the case of re-use of data, where the processing is not based on
the data subject's consent or on the law, the controller must
determine whether the further processing is compatible with the
purpose for which the data were originally collected, except when
that further processing was foreseen and brought to the attention
of data subjects at the time of collection.
DPIA
A Data Protection Impact Assessment (DPIA) is often required for
the development of an AI system, since a DPIA is necessary whenever
at least two of the nine criteria defined by the Article 29 Working
Party (WP29) are met, in particular the following: sensitive data
are collected; personal data are collected on a large scale; data
of vulnerable persons (minors, persons with disabilities, etc.) are
collected; datasets are matched or combined; new technological
solutions are implemented; or innovative uses are introduced.
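As an illustration only (whether a DPIA is required is a legal assessment, not a mechanical count), the WP29 'two of nine criteria' rule of thumb can be sketched as follows; the criterion labels are paraphrases of the WP29 guidelines:

```python
# Illustrative sketch of the WP29 rule of thumb: a DPIA is likely required
# when at least two of the nine criteria are met. Labels paraphrase WP29
# guidance; this is not a substitute for a legal assessment.
WP29_CRITERIA = {
    "evaluation_or_scoring",
    "automated_decision_with_legal_or_similar_effect",
    "systematic_monitoring",
    "sensitive_or_highly_personal_data",
    "large_scale_processing",
    "matching_or_combining_datasets",
    "vulnerable_data_subjects",
    "innovative_use_or_new_technology",
    "processing_preventing_exercise_of_rights",
}

def dpia_likely_required(criteria_met: set[str]) -> bool:
    unknown = criteria_met - WP29_CRITERIA
    if unknown:
        raise ValueError(f"unknown criteria: {unknown}")
    return len(criteria_met) >= 2

# Example: a large-scale training dataset built by combining several sources.
print(dpia_likely_required({"large_scale_processing",
                            "matching_or_combining_datasets"}))  # True
```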
The processing of personal data based on AI systems presents
specific risks that have to be taken into account, such as:
- risks related to the confidentiality of data that can be extracted from the AI system;
- risks related to the misuse of the training dataset;
- automated discrimination caused by a bias in the AI system;
- the risk of producing false or fictitious content about a real person;
- the risk of automated decision-making;
- the risk of users losing control over their data;
- the risk of data poisoning attacks; or
- systemic and serious ethical risks related to the deployment of the system.
For the development of high-risk systems covered by the EU AI
Act and involving personal data, developers will be required to
carry out a DPIA.
When the developer knows what the operational use of the AI system
will be, it is recommended to carry out a general DPIA for the
whole life cycle, which includes the development and deployment
phases, even if, in the end, it is the deployer of the AI system
that will be responsible for the final DPIA of the deployment
phase.
When developing a general-purpose AI system, the developer will
only be able to carry out a DPIA covering the development phase.
This DPIA should be provided to the deployer of the AI system to
enable them to conduct their own DPIA.
Principles of minimization and limitation
These principles may be among the most difficult to implement when
developing an AI system.
As a reminder, the minimization principle requires that the data
collected be relevant to the defined purpose.
Several steps are strongly recommended:
- data cleaning, which allows you to build a quality dataset and strengthen the relevance of the data by reducing inconsistencies, as well as the cost of learning;
- identification of the relevant data, which aims to optimize system performance while avoiding under- and over-fitting;
- implementation of measures such as generalization, randomization, and/or data anonymization to limit the impact on individuals (a minimal sketch follows this list);
- monitoring and updating of data, based on a regular analysis to ensure the follow-up of the constituted dataset; and
- documentation of the data used for the development of an AI system, which guarantees the traceability of the datasets used.
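As a minimal sketch of what the generalization and pseudonymization measures mentioned above might look like on a tabular dataset (assuming pandas; the column names, age bands, and salt handling are illustrative assumptions), consider the following. Note that pseudonymized data remains personal data under the GDPR; this is not anonymization:

```python
import hashlib
import pandas as pd

# Hypothetical raw dataset; column names and values are illustrative.
raw = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "full_name": ["A. Martin", "B. Dupont", "C. Bernard"],
    "age": [23, 47, 68],
    "feature": [0.1, 0.7, 0.4],
})

# Minimization: drop fields irrelevant to the training purpose.
df = raw.drop(columns=["full_name"])

# Generalization: replace exact ages with coarse bands.
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 60, 120],
                        labels=["<30", "30-60", ">60"])
df = df.drop(columns=["age"])

# Pseudonymization: replace identifiers with salted hashes. The salt must
# be stored separately under access control; pseudonymized data remains
# personal data under the GDPR.
SALT = b"salt-stored-elsewhere"
df["user_id"] = df["user_id"].map(
    lambda v: hashlib.sha256(SALT + v.encode()).hexdigest()[:16])

print(df)
```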
Data retention periods should be assessed on a phase-by-phase basis
(for example, development, deployment, maintenance), and data should
be retained only where strictly necessary for the phase justifying
its processing. According to the CNIL, the development phase may
require more data than the product maintenance phase.
For the development phase, data retention needs to be
pre-planned and monitored over time, in accordance with the
information provided to data subjects (information notices). Where
the data no longer needs to be accessible for the day-to-day tasks
of the persons in charge of developing the AI system, it should in
principle be deleted. However, it can be kept for the maintenance
or improvement of the product if safeguards are implemented
(partitioned storage, access restricted to authorized persons only,
etc.).
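A phase-based retention policy of this kind could, for instance, be encoded and monitored programmatically. The sketch below is a simplified assumption (phases, durations, and actions are illustrative; actual retention periods must match the information notices given to data subjects):

```python
from datetime import date, timedelta

# Illustrative per-phase retention schedule; durations are assumptions and
# must reflect the information notices actually provided to data subjects.
RETENTION = {
    "development": timedelta(days=365),
    "maintenance": timedelta(days=2 * 365),
}

def retention_action(phase: str, collected_on: date, today: date) -> str:
    """Decide what to do with a record under the phase's retention policy."""
    if today - collected_on <= RETENTION[phase]:
        return "keep"
    # Past the development retention period, data may only be kept for
    # maintenance or improvement if moved to partitioned storage with
    # access restricted to authorized persons.
    if phase == "development":
        return "archive_with_restricted_access"
    return "delete"

print(retention_action("development", date(2023, 1, 1), date(2025, 1, 1)))
# -> "archive_with_restricted_access"
```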
The conditions for reconciling the imperatives of AI innovation on
the one hand, and data protection on the other, are becoming
clearer for developers and deployers while waiting for the
delegated acts that the European Commission will adopt in the
coming months to establish the precise conditions for compliance
with the EU AI Act.
Article also published on: Dataguidance.com
The content of this article is intended to provide a general guide to the subject matter. Specialist advice should be sought about your specific circumstances.