Regurgitative Training: The Value of Real Data in Training Large Language Models
41 Pages, Posted: 29 Jun 2024
Date Written: July 03, 2024
Abstract
What happens if we train a new Large Language Model (LLM) using data that are at least partially generated by other LLMs? The explosive success of LLMs, such as ChatGPT and Llama, means that a substantial amount of content online will be generated by LLMs rather than humans, and this content will inevitably enter the training datasets of next-generation LLMs. In this paper, we evaluate the implications of such "regurgitative training" on LLM performance. By fine-tuning GPT-3.5 on data generated either by itself or by other LLMs in a machine translation task, we find strong evidence that regurgitative training handicaps LLM performance. The ease of obtaining large quantities of LLM-generated data cannot compensate for this loss: even training with a fraction of real data is enough to outperform regurgitative training. The same performance loss is observed in transformer models that we train from scratch. We carry out textual analyses to compare LLM-generated data with real, human-generated data, and find suggestive evidence that the performance disadvantage of regurgitative training can be attributed to at least two mechanisms: (1) higher error rates and (2) lower lexical diversity in LLM-generated data relative to real data. Based on these mechanisms, we propose and evaluate three strategies to mitigate the performance loss of regurgitative training. In the first strategy, we devise data-driven metrics to gauge the quality of each LLM-generated data instance, and then carry out an ordered regurgitative training process in which high-quality data are added before low-quality ones. In the second strategy, we combine data generated by multiple different LLMs, as an attempt to increase lexical diversity. In the third strategy, we train an AI detection classifier to differentiate between LLM- and human-generated data, and include LLM-generated data in order of its resemblance to human-generated data. All three strategies can improve the performance of regurgitative training to some extent, but they are not always able to fully close the gap relative to training with real data. Our results highlight the value of real, human-generated data in training LLMs, which cannot be easily substituted by synthetic, LLM-generated data. Given the inevitability of having some LLM-generated data in the training sets of future LLMs, our work serves both as a cautionary tale about its performance implications and as a call to action for developing effective mitigation strategies.
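To make the first mitigation strategy concrete, the sketch below illustrates one way an "ordered regurgitative training" pool could be assembled. The abstract does not specify the paper's actual quality metrics, so this example uses type-token ratio (a simple lexical-diversity proxy, one of the two mechanisms named above) as an assumed stand-in; the function names and the `synthetic_budget` parameter are hypothetical, not taken from the paper.

```python
from typing import List


def lexical_diversity(text: str) -> float:
    """Type-token ratio: unique tokens / total tokens (whitespace tokenization)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def ordered_training_pool(real_data: List[str],
                          synthetic_data: List[str],
                          synthetic_budget: int) -> List[str]:
    """Keep all real examples, then append synthetic examples in descending
    order of the quality proxy, up to a fixed budget (assumed design)."""
    ranked = sorted(synthetic_data, key=lexical_diversity, reverse=True)
    return list(real_data) + ranked[:synthetic_budget]


# Toy usage with made-up examples
real = ["the quick brown fox jumps over the lazy dog"]
synthetic = [
    "the the the cat cat sat sat",                            # low diversity
    "a translation model maps source text to target text",    # higher diversity
]
print(ordered_training_pool(real, synthetic, synthetic_budget=1))
```

In the paper's setting the ranked pool would then be used for fine-tuning (e.g., GPT-3.5 on translation pairs), with high-quality synthetic instances introduced before low-quality ones; the ranking criterion itself is what the paper's data-driven metrics would replace in this sketch.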
Keywords: Generative AI, Large Language Model, AI-Generated Data, Synthetic Data, Machine Learning