In this Insight article, Marguerite Brac de La Perrière, Partner at Fieldfisher, examines the French data protection authority's (CNIL) recommendations on artificial intelligence (AI), offering guidance for applying data protection principles in AI development.
There can be no artificial intelligence (AI) system without huge
amounts of data, but the relationship between the creation of
training databases containing personal data and the development of
the system, on the one hand, and data protection, on the other, can be complex.
Concerns have been raised that regulation could stifle innovation
and, more specifically, that the GDPR could inhibit AI innovation
in Europe. Indeed, designers and developers of AI systems face
significant challenges in applying the requirements of the GDPR,
particularly when training models.
With the main objective of helping organizations reconcile
innovation and respect for human rights in the development of their
AI systems, the French data protection authority (CNIL) has
published recommendations on the development of AI
systems that were adopted following public consultation.
Taking into account the adopted EU Artificial Intelligence Act
(the EU AI Act), the recommendations suggest ways to apply
fundamental data protection principles to the different stages of
development of all types of AI systems when personal data is
involved.
The recommendations are not binding: data controllers may deviate
from them under their own responsibility, provided they can justify
their decisions.
The first steps for any developer are to identify the applicable
regime and the purpose for the data processing carried out during
the development phase, which are closely related.
Applicable legal regime
Some provisions of the French Act No. 78-17 of 6 January 1978
on Data Processing, Data Files and Individual Liberties (as amended
to implement the GDPR) (the Act) may apply to the development and
deployment phases of an AI system, such as those regarding the
processing of health data, the law enforcement sector, or national
defense.
When the operational use of the AI system in the deployment phase is
already defined during the development phase, the development-phase
processing is usually considered to be covered by the same legal
regime as the one that will apply in the deployment phase.
When the purpose of the processing in the deployment phase cannot be
clearly identified during the development phase, the legal regime of
the development phase may differ from the one that will apply in the
deployment phase.
Purpose of the processing
Any AI system based on the exploitation of personal data must be
developed with a 'purpose,' meaning a well-defined
objective, which makes it possible to ensure transparency, limit
the personal data that can be used for training, and avoid storing
and processing unnecessary data.
As for any other processing of personal data, the purpose must be
specified (i.e., established as soon as the project is defined),
explicit, and legitimate.
When an AI system is developed for a single operational use, the
purpose in the development phase is considered directly related to
the one pursued by the processing in the deployment phase.
Therefore, if the purpose in the deployment phase is itself
specified, explicit, and legitimate, the purpose in the development
phase will also be considered validly defined.
However, this is more complex when developing a general-purpose AI
system that can be used in various contexts and applications, or
when the system is developed for scientific research purposes. For
general-purpose AI systems, organizations may not foresee any
specific operational use when developing the model. In this
situation, the organization should not define the purpose too
broadly, for example as 'the development and improvement of an AI
system.' The purpose needs to be more precise and refer to the type
of system developed (such as a large language model, a computer
vision system, or a generative AI system for images, videos, sounds,
computer code, etc.) and the technically feasible functionalities
and capabilities. Furthermore, it is good practice to give even
more detail, such as the foreseeable capabilities most at risk,
the functionalities excluded by design, and the conditions of use of
the AI system (open source, SaaS, or via an API, etc.).
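To make this level of detail concrete, such purpose documentation could be captured in a structured record accompanying the model. The following Python sketch is purely illustrative; the field names and values are assumptions, not a format prescribed by the CNIL:

```python
# Hypothetical purpose documentation for a general-purpose AI system.
# All field names and values are illustrative assumptions, not a
# CNIL-prescribed schema.
purpose_record = {
    "system_type": "large language model",
    "feasible_functionalities": ["text generation", "summarization", "translation"],
    "capabilities_most_at_risk": ["generating content about identifiable persons"],
    "functionalities_excluded_by_design": ["biometric identification"],
    "conditions_of_use": ["open source", "SaaS", "API"],
}
```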
In some cases, when the creation of training datasets for AI
pursues a scientific research purpose, it may be difficult to
identify the objective in full at the outset of the work. The
purpose may then be described less precisely, or not specified in
its entirety, and information clarifying the objective can be
provided as the project progresses.
Responsibilities under the GDPR
Taking into account the identified purpose, any developer of an AI
system must identify its responsibilities with regard to the data
processing implemented during the development phase.
When a provider takes the initiative to develop an AI system and
builds a training dataset from data selected on its own account, it
can be qualified as a controller. If the provider builds a training
dataset for an AI system together with other controllers, for a
purpose that they have defined together, they can be qualified as
joint controllers.
When an AI system provider develops a system on behalf of one of
its customers, it can be a processor. In this situation, the
customer may be the controller.
If the customer only sets the goal to achieve and the provider
designs the AI system, the provider may be the controller.
Legal basis
Like any personal data processing, the creation and use of a
training dataset containing personal data can only be implemented
if it corresponds to one of the 'legal bases' provided for
in the GDPR, which is what gives an organization the right to
process personal data. The choice of legal basis is an essential
first step in ensuring the compliance of the processing, as the
organization's obligations and the rights of individuals vary
depending on it.
Consent is rarely the appropriate legal basis, either because
direct contact with data subjects is not possible or because it is
difficult to obtain freely given consent and to guarantee the right
of withdrawal. The legal bases of contract and legal obligation may
be used only more exceptionally, where the processing is necessary
for the performance of a contract or pre-contractual measures, or
to comply with a sufficiently precise legal obligation to which the
controller is subject. Private actors will therefore most often
have to rely on legitimate interest, provided that they meet its
three conditions: the interest pursued is legitimate, the
processing of personal data is necessary, and there is no
disproportionate interference with individuals' privacy.
In the case of re-use of data, where the processing is not based on
the data subject's consent or on the law, the controller must
determine whether the further processing is compatible with the
purpose for which the data were originally collected, except when
that further processing was foreseen and brought to the attention
of data subjects at the time of collection.
DPIA
A Data Protection Impact Assessment (DPIA) is often required for
the development of an AI system, since a DPIA is necessary whenever
at least two of the nine criteria defined by the Article 29 Working
Party (WP29) are met, in particular the following: sensitive data
are collected; personal data are collected on a large scale; data
of vulnerable persons (minors, persons with disabilities, etc.) are
collected; datasets are matched or combined; new technological
solutions are implemented; or innovative uses are introduced.
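As an illustration only (whether a DPIA is required is a legal assessment, not a mechanical count), the WP29 'two of nine criteria' rule of thumb can be sketched as follows; the criterion labels are paraphrases of the WP29 guidelines:

```python
# Illustrative sketch of the WP29 rule of thumb: a DPIA is likely required
# when at least two of the nine criteria are met. Labels paraphrase WP29
# guidance; this is not a substitute for a legal assessment.
WP29_CRITERIA = {
    "evaluation_or_scoring",
    "automated_decision_with_legal_or_similar_effect",
    "systematic_monitoring",
    "sensitive_or_highly_personal_data",
    "large_scale_processing",
    "matching_or_combining_datasets",
    "vulnerable_data_subjects",
    "innovative_use_or_new_technology",
    "processing_preventing_exercise_of_rights",
}

def dpia_likely_required(criteria_met: set[str]) -> bool:
    unknown = criteria_met - WP29_CRITERIA
    if unknown:
        raise ValueError(f"unknown criteria: {unknown}")
    return len(criteria_met) >= 2

# Example: a large-scale training dataset built by combining several sources.
print(dpia_likely_required({"large_scale_processing",
                            "matching_or_combining_datasets"}))  # True
```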
The processing of personal data based on AI systems presents
specific risks that have to be taken into account, such as:
- risks related to the confidentiality of data that can be extracted from the AI system;
- risks related to the misuse of the training dataset;
- automated discrimination caused by a bias in the AI system;
- the risk of producing false or fictitious content about a real person;
- the risk of automated decision-making;
- the risk of users losing control over their data;
- the risk of data poisoning attacks; or
- systemic and serious ethical risks related to the deployment of the system.
For the development of high-risk systems covered by the EU AI
Act and involving personal data, developers will be required to
carry out a DPIA.
When the developer knows what the operational use of the AI system
will be, it is recommended to carry out a general DPIA for the
whole life cycle, which includes the development and deployment
phases, even if, in the end, it is the deployer of the AI system
that will be responsible for the final DPIA of the deployment
phase.
When developing a general-purpose AI system, the developer will
only be able to carry out a DPIA covering the development phase.
This DPIA should be provided to the deployer of the AI system to
enable them to conduct their own DPIA.
Principles of minimization and limitation
These principles may be among the most difficult to implement when
developing an AI system.
As a reminder, the minimization principle requires that the data
collected be relevant to the defined purpose.
Several steps are strongly recommended:
- data cleaning, which allows you to build a quality dataset and strengthen the relevance of the data by reducing inconsistencies, as well as the cost of learning;
- identification of the relevant data, which aims to optimize system performance while avoiding under- and over-fitting;
- implementation of measures such as generalization, randomization, and/or data anonymization to limit the impact on individuals (a minimal sketch follows this list);
- monitoring and updating of data, based on a regular analysis to ensure the follow-up of the constituted dataset; and
- documentation of the data used for the development of an AI system, which guarantees the traceability of the datasets used.
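As a minimal sketch of what the generalization and pseudonymization measures mentioned above might look like on a tabular dataset (assuming pandas; the column names, age bands, and salt handling are illustrative assumptions), consider the following. Note that pseudonymized data remains personal data under the GDPR; this is not anonymization:

```python
import hashlib
import pandas as pd

# Hypothetical raw dataset; column names and values are illustrative.
raw = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "full_name": ["A. Martin", "B. Dupont", "C. Bernard"],
    "age": [23, 47, 68],
    "feature": [0.1, 0.7, 0.4],
})

# Minimization: drop fields irrelevant to the training purpose.
df = raw.drop(columns=["full_name"])

# Generalization: replace exact ages with coarse bands.
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 60, 120],
                        labels=["<30", "30-60", ">60"])
df = df.drop(columns=["age"])

# Pseudonymization: replace identifiers with salted hashes. The salt must
# be stored separately under access control; pseudonymized data remains
# personal data under the GDPR.
SALT = b"salt-stored-elsewhere"
df["user_id"] = df["user_id"].map(
    lambda v: hashlib.sha256(SALT + v.encode()).hexdigest()[:16])

print(df)
```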
Data retention periods should be assessed on a phase-by-phase basis
(for example, development, deployment, maintenance), and data should
be retained only where strictly necessary for the phase justifying
its processing. According to the CNIL, the development phase may
require more data than the product maintenance phase.
For the development phase, data retention needs to be
pre-planned and monitored over time, in accordance with the
information provided to data subjects (information notices). Where
the data no longer needs to be accessible for the day-to-day tasks
of the persons in charge of developing the AI system, it should in
principle be deleted. However, it can be kept for the maintenance
or improvement of the product if safeguards are implemented
(partitioned storage, access restricted to authorized persons only,
etc.).
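A phase-based retention policy of this kind could, for instance, be encoded and monitored programmatically. The sketch below is a simplified assumption (phases, durations, and actions are illustrative; actual retention periods must match the information notices given to data subjects):

```python
from datetime import date, timedelta

# Illustrative per-phase retention schedule; durations are assumptions and
# must reflect the information notices actually provided to data subjects.
RETENTION = {
    "development": timedelta(days=365),
    "maintenance": timedelta(days=2 * 365),
}

def retention_action(phase: str, collected_on: date, today: date) -> str:
    """Decide what to do with a record under the phase's retention policy."""
    if today - collected_on <= RETENTION[phase]:
        return "keep"
    # Past the development retention period, data may only be kept for
    # maintenance or improvement if moved to partitioned storage with
    # access restricted to authorized persons.
    if phase == "development":
        return "archive_with_restricted_access"
    return "delete"

print(retention_action("development", date(2023, 1, 1), date(2025, 1, 1)))
# -> "archive_with_restricted_access"
```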
The conditions for reconciling the imperatives of AI innovation on
the one hand, and data protection on the other, are becoming
clearer for developers and deployers while waiting for the
delegated acts that the European Commission will adopt in the
coming months to establish the precise conditions for compliance
with the EU AI Act.
Article also published on: Dataguidance.com
The content of this article is intended to provide a general guide to the subject matter. Specialist advice should be sought about your specific circumstances.