We can all agree that training data is essential to any machine learning model, but what happens when that data runs out? According to a recent article in New Scientist, the high-quality language data used to train models such as ChatGPT could run out as soon as 2026.

High-quality language data includes books and scientific papers, which are slow and costly to produce. Lower-quality data, such as posts on blogs, forums and social media, is plentiful, but models trained on it may struggle to deliver the paradigm-shifting advances seen in machine learning recently. Not only is this data shortage likely to slow development, it could also send the cost of training data rocketing.

But all is not lost. These predictions are based on human-created data, whereas synthetic data can also be generated, offering a potentially unlimited supply. The effectiveness of synthetic data for training machine learning models still has to be evaluated, but it certainly opens up new training opportunities. More efficient learning algorithms are also being developed all the time, enabling models to extract more knowledge from existing data sets, learn from smaller data sets and even transfer learning from one task to another.
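As a toy illustration of one way synthetic text can be produced programmatically, the sketch below fills invented templates with invented vocabulary. Everything in it (the templates, slot names and word lists) is made up for this example; real synthetic-data pipelines typically use a generative model rather than hand-written templates, but the principle of creating new training examples on demand is the same.

```python
import random

# Hypothetical templates and vocabulary, invented purely for illustration.
TEMPLATES = [
    "The {subject} improves the {object}.",
    "A {subject} rarely changes the {object}.",
]
SLOTS = {
    "subject": ["model", "researcher", "dataset"],
    "object": ["benchmark", "result", "paper"],
}

def generate_synthetic_sentences(n, seed=0):
    """Return n synthetic sentences built from the templates above.

    A fixed seed makes the output reproducible, which matters when the
    generated data is used to train or evaluate a model.
    """
    rng = random.Random(seed)
    sentences = []
    for _ in range(n):
        template = rng.choice(TEMPLATES)
        fills = {slot: rng.choice(words) for slot, words in SLOTS.items()}
        sentences.append(template.format(**fills))
    return sentences

# Example: generate a small synthetic corpus.
for sentence in generate_synthetic_sentences(3):
    print(sentence)
```

Because the generator is just code, the supply of examples is limited only by compute, which is exactly what makes synthetic data attractive when human-created data is scarce.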

I look forward to reading about innovations in these areas over the coming years and I am sure we will continue to see huge leaps in AI development into 2026 and beyond.

The content of this article is intended to provide a general guide to the subject matter. Specialist advice should be sought about your specific circumstances.