Synthetic Data and the Future of AI

52 Pages. Posted: 19 Feb 2024. Last revised: 26 Mar 2024.


Peter Lee

University of California, Davis - School of Law

Date Written: February 10, 2024


The future of artificial intelligence (AI) is synthetic. Several of the most prominent technical and legal challenges of AI derive from the need to amass huge amounts of real-world data to train machine learning (ML) models. Collecting such data can be highly difficult and can threaten privacy, introduce bias into automated decision making, and infringe copyrights on a massive scale. This Article explores the emergence of a seemingly paradoxical technical creation that can mitigate, though not completely eliminate, these concerns: synthetic data. Increasingly, data scientists are using simulated driving environments, fabricated medical records, fake images, and other forms of synthetic data to train ML models. Artificial data, in other words, is being used to train artificial intelligence. Synthetic data offers a host of technical and legal benefits: it promises to radically decrease the cost of obtaining data, sidestep privacy issues, reduce automated discrimination, and avoid copyright infringement. Alongside such promise, however, synthetic data poses perils as well. Deficiencies in the development and deployment of synthetic data can exacerbate the dangers of AI and cause significant social harm.

In light of the enormous value and importance of synthetic data, this Article sketches the contours of an innovation ecosystem to promote its robust and responsible development. It identifies three objectives that should guide legal and policy measures shaping the creation of synthetic data: provisioning, disclosure, and democratization. Ideally, such an ecosystem should incentivize the generation of high-quality synthetic data, encourage disclosure of both synthetic data and processes for generating it, and promote multiple sources of innovation. This Article then examines a suite of “innovation mechanisms” that can advance these objectives, ranging from open source production to proprietary approaches based on patents, trade secrets, and copyrights. Throughout, it suggests policy and doctrinal reforms to enhance innovation, transparency, and democratic access to synthetic data. Just as AI will have enormous legal implications, law and policy can play a central role in shaping the future of AI.

Keywords: artificial intelligence, machine learning, data, training data, synthetic data, labeling, industry concentration, antitrust, privacy, algorithmic bias, automated discrimination, copyright infringement, open source, intellectual property, patents, trade secrets, copyrights

JEL Classification: D43, H40, H44, I14, J6, K21, L13, L17, L5, L86, O25, O3, O31, O32, O34, O36

Suggested Citation

Lee, Peter, Synthetic Data and the Future of AI (February 10, 2024). 110 Cornell Law Review (Forthcoming), Available at SSRN:

Peter Lee (Contact Author)

University of California, Davis - School of Law

Martin Luther King, Jr. Hall
Davis, CA 95616-5201
United States
