Regurgitative Training: The Value of Real Data in Training Large Language Models

41 Pages Posted: 29 Jun 2024

Jinghui Zhang

Tsinghua University - School of Economics & Management

Dandan Qiao

National University of Singapore (NUS)

Mochen Yang

University of Minnesota - Twin Cities - Carlson School of Management

Qiang Wei

Tsinghua University - School of Economics & Management

Date Written: July 03, 2024

Abstract

What happens if we train a new Large Language Model (LLM) using data that are at least partially generated by other LLMs? The explosive success of LLMs, such as ChatGPT and LLAMA, means that a substantial amount of content online will be generated by LLMs rather than humans, which will inevitably enter the training datasets of next-generation LLMs. In this paper, we evaluate the implications of such "regurgitative training" on LLM performance. Through fine-tuning GPT-3.5 with data generated either by itself or by other LLMs in a machine translation task, we find strong evidence that regurgitative training clearly handicaps the performance of LLMs. The ease of obtaining large quantities of LLM-generated data cannot compensate for this performance loss; even training with a fraction of real data is enough to outperform regurgitative training. The same performance loss of regurgitative training is observed in transformer models that we train from scratch. We carry out textual analyses to compare LLM-generated data with real human-generated data, and find suggestive evidence that the performance disadvantage of regurgitative training can be attributed to at least two mechanisms: (1) higher error rates and (2) lower lexical diversity in LLM-generated data as compared to real data. Based on these mechanisms, we propose and evaluate three different strategies to mitigate the performance loss of regurgitative training. In the first strategy, we devise data-driven metrics to gauge the quality of each LLM-generated data instance, and then carry out an ordered regurgitative training process where high-quality data are added before low-quality ones. In the second strategy, we combine data generated by multiple different LLMs (as an attempt to increase lexical diversity). In the third strategy, we train an AI detection classifier to differentiate between LLM- and human-generated data, and include LLM-generated data in order of their resemblance to human-generated data.
All three strategies can improve the performance of regurgitative training to some extent but are not always able to fully close the gap from training with real data. Our results highlight the value of real, human-generated data in training LLMs, which cannot be easily substituted by synthetic, LLM-generated data. Given the inevitability of having some LLM-generated data in the training sets of future LLMs, our work serves both as a cautionary tale of its performance implications and as a call to action for developing effective mitigation strategies.
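The first mitigation strategy described above (ordered regurgitative training) can be illustrated with a minimal sketch. This is not the authors' code: the quality function, the toy lexical-diversity metric, and all data below are hypothetical stand-ins for the paper's data-driven quality metrics, used only to show the ordering idea of adding real data first and then synthetic data from highest to lowest estimated quality.

```python
# Illustrative sketch of ordered regurgitative training (hypothetical, not the
# authors' implementation). Real data are placed first; LLM-generated data are
# appended in descending order of an estimated quality score.

def order_training_data(real_data, synthetic_data, quality_fn):
    """Return a training curriculum: real instances first, then synthetic
    instances sorted from highest to lowest estimated quality."""
    ranked = sorted(synthetic_data, key=quality_fn, reverse=True)
    return list(real_data) + ranked

def lexical_diversity(sentence):
    """Toy quality proxy: fraction of unique tokens in the sentence.
    (The paper motivates this by finding lower lexical diversity in
    LLM-generated data; a real pipeline would use richer metrics.)"""
    tokens = sentence.split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Hypothetical example data.
real = ["the cat sat on the mat"]
synthetic = ["the the the cat cat", "a dog runs in the park"]

curriculum = order_training_data(real, synthetic, lexical_diversity)
# Real data come first; the more lexically diverse synthetic sentence
# precedes the repetitive one.
```

The same scaffold covers the third strategy by swapping `quality_fn` for an AI-detector score measuring resemblance to human-written text.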

Keywords: Generative AI, Large Language Model, AI-Generated Data, Synthetic Data, Machine Learning

Suggested Citation

Zhang, Jinghui and Qiao, Dandan and Yang, Mochen and Wei, Qiang, Regurgitative Training: The Value of Real Data in Training Large Language Models (July 03, 2024). Available at SSRN: https://ssrn.com/abstract=4870843

Jinghui Zhang

Tsinghua University - School of Economics & Management ( email )

Beijing, 100084
China

Dandan Qiao

National University of Singapore (NUS) ( email )

13 Computing Drive
Singapore, 117591
Singapore

Mochen Yang (Contact Author)

University of Minnesota - Twin Cities - Carlson School of Management ( email )

19th Avenue South
Minneapolis, MN 55455
United States

Qiang Wei

Tsinghua University - School of Economics & Management ( email )
