In previous blog posts, we have explored the challenges of processing genetic data in compliance with the GDPR, and in particular in relying on anonymisation, consent and/or the scientific research exemption to do so. We have seen that each of these avenues comes with its difficulties. However, it is clear that genetic data – and sensitive health data more generally – is an invaluable resource for life sciences companies whose business it is to better understand disease, to better diagnose it and to better treat it. Studying and sharing this data can lead to new insights and a huge potential payoff for patients.

In this context, "synthetic health data" has been touted as a practical way for the life sciences industry to make fuller use of the data it holds.

Synthetic data is artificial data generated from an original (real) dataset. A machine learning model is trained to capture the patterns in the real data, and new data is then generated from that model. The synthetic dataset is designed to mirror the statistical properties of the original real data but, because it consists of new data points, it is intended to break the link to the original patients. Each synthetic patient looks like a real patient, except that the variables associated with that patient are derived from the relationships found in the original dataset rather than copied from any one individual. The idea is that if the synthetic data is analysed, the same statistical conclusions would be drawn as would be drawn from the original real data. At the same time, the data should be "non-identifiable" and therefore should not be considered "personal data" subject to the GDPR. Of course, processing the original real data to generate the synthetic dataset at the outset, including sharing the original data with any third parties you work with in this process, must be carried out in accordance with the GDPR and any other applicable privacy laws.
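To make the "fit a model, then generate from it" idea concrete, here is a deliberately simple sketch in Python. It assumes a made-up cohort with two variables (age and systolic blood pressure) and uses a basic bivariate Gaussian as the "model". The function names and the toy data are illustrative assumptions only; generative models used in practice are far more sophisticated, and this sketch says nothing about whether the output would be non-identifiable.

```python
import random
import statistics

def fit_bivariate_gaussian(rows):
    """Estimate means, standard deviations and correlation from (x, y) pairs."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    return {"mean": (mean_x, mean_y), "sd": (sd_x, sd_y), "corr": cov / (sd_x * sd_y)}

def sample_synthetic(model, n, rng):
    """Draw n new (x, y) pairs from the fitted model; no real row is reused."""
    mean_x, mean_y = model["mean"]
    sd_x, sd_y = model["sd"]
    rho = model["corr"]
    residual = max(0.0, 1.0 - rho ** 2) ** 0.5  # guard against rounding error
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + sd_x * z1
        y = mean_y + sd_y * (rho * z1 + residual * z2)  # keeps the x-y correlation
        out.append((x, y))
    return out

rng = random.Random(0)

# Hypothetical "real" cohort: (age, systolic blood pressure). Entirely made up.
real = []
for _ in range(500):
    age = rng.gauss(55, 12)
    real.append((age, 90 + 0.6 * age + rng.gauss(0, 8)))

model = fit_bivariate_gaussian(real)
synthetic = sample_synthetic(model, 500, rng)
synthetic_model = fit_bivariate_gaussian(synthetic)

print("real corr:", round(model["corr"], 3))
print("synthetic corr:", round(synthetic_model["corr"], 3))
```

The synthetic rows are new draws rather than copies, yet an analyst fitting the same summary statistics to them should reach broadly similar conclusions about how age and blood pressure relate.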

Synthetic data may sound far-fetched, but the use of generative models to create synthetic data is already a reality. The website "This Person Does Not Exist" produces shockingly realistic portraits of people who do not in fact exist in the real world, based on a 'training' dataset of thousands of real photos.

Of course, to be useful for scientific research and reliable analysis, synthetic data cannot just "look" real. It needs to have validated statistical fidelity, i.e. its statistical properties must match those of the underlying data as closely as possible. Ideally, synthetic data demonstrates this statistical fidelity at a "patient" level rather than only at an "aggregate" level, so that it remains powerful enough for work on precision medicine and rare diseases. Therefore, as well as developing methods to build synthetic data, it is also important to develop a package of validation metrics to assess statistical fidelity and give users a high level of confidence in the results.
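As an illustration of what one such validation metric might look like, the sketch below computes the two-sample Kolmogorov-Smirnov statistic for each variable: the largest gap between the real and synthetic cumulative distributions, where 0 means the two samples are distributed identically and 1 means they do not overlap at all. This is an aggregate-level check only, chosen here as an example rather than taken from any particular validation package; the function names and the column names in the usage line are hypothetical.

```python
import bisect

def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two samples (0 to 1)."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for v in a_sorted + b_sorted:
        cdf_a = bisect.bisect_right(a_sorted, v) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, v) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def marginal_fidelity(real_rows, synthetic_rows, names):
    """KS statistic per variable; rows are tuples ordered as in `names`."""
    return {
        name: ks_statistic([r[i] for r in real_rows], [s[i] for s in synthetic_rows])
        for i, name in enumerate(names)
    }

# Toy usage with made-up values for two hypothetical variables.
report = marginal_fidelity(
    [(40, 118), (55, 130), (63, 141), (71, 150)],
    [(42, 120), (54, 128), (65, 143), (70, 149)],
    ["age", "systolic_bp"],
)
print(report)
```

A fuller validation package would also need to look at relationships between variables and at rare subgroups, since two datasets can agree on every individual variable while disagreeing on how those variables combine in a given patient.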

Synthetic data should also have validated levels of privacy. Although the synthetic data is not that of real patients, if safeguards are not put in place it can in certain circumstances be possible to recover original data from the training dataset, or to infer that a particular individual was included in it (for example, if properties of the generative model are made available, if the model is 'overfitted' to the training data and/or through a 'membership inference attack').
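One simple heuristic for spotting the overfitting risk described above is to measure, for each synthetic record, the distance to its closest real record and flag any synthetic record that is an exact or near copy. The Python sketch below assumes numeric variables that have already been put on comparable scales; the function names and the threshold are hypothetical choices for illustration. It is not a membership inference attack, and passing it would not by itself establish that data is non-identifiable.

```python
import math

def distance_to_closest_record(synthetic, real):
    """For each synthetic row, the Euclidean distance to the nearest real row."""
    return [min(math.dist(s, r) for r in real) for s in synthetic]

def flag_near_copies(synthetic, real, threshold):
    """Indices of synthetic rows lying within `threshold` of some real row."""
    distances = distance_to_closest_record(synthetic, real)
    return [i for i, d in enumerate(distances) if d <= threshold]

# Toy usage with made-up (age, systolic blood pressure) rows.
real = [(50.0, 120.0), (60.0, 140.0)]
synthetic = [(50.0, 120.0), (55.0, 131.0)]  # the first row duplicates a real row

print(flag_near_copies(synthetic, real, threshold=0.5))  # prints [0]
```

A synthetic dataset in which many records sit almost on top of real ones suggests that the model has memorised its training data rather than learned general patterns from it.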

Overall, synthetic data offers great promise in the life sciences field. It can be generated fairly rapidly compared to anonymised data (especially with voluminous datasets), and should in theory carry a much lower risk of re-identification. Further, it can be used to smooth some of the interoperability challenges that often hamper the processing of health data. On the other hand, it can be vulnerable to sample-selection bias or a lack of diversity in the training data, it must have high levels of statistical fidelity to be useful, and it is not necessarily immune to privacy concerns. This is an area to watch closely, as the use of synthetic data in healthcare is likely to grow as more evidence of its utility emerges across a range of use cases.

The content of this article is intended to provide a general guide to the subject matter. Specialist advice should be sought about your specific circumstances.