Data Dysphoria: The Governance Challenge Posed by Large Learning Models

29 Pages Posted: 18 Sep 2023

Susan Ariel Aaronson

George Washington University - Elliott School of International Affairs

Date Written: August 28, 2023


Only eight months have passed since ChatGPT and the large learning model underpinning it took the world by storm. This article focuses on the data supply chain (the data collected and then used to train large language models) and the governance challenges it presents to policymakers. These challenges include:

• How web scraping may affect individuals and firms that hold copyrights.
• How web scraping may affect individuals and groups who are supposed to be protected under privacy and personal data protection laws.
• How web scraping has revealed the lack of protections for content creators and content providers on open-access websites.
• How the debate over open- and closed-source LLMs reveals the lack of clear and universal rules to ensure the quality and validity of datasets. As the US National Institute of Standards and Technology explained, many LLMs depend on “large-scale datasets, which can lead to data quality and validity concerns. The difficulty of finding the ‘right’ data may lead AI actors to select datasets based more on accessibility and availability than on suitability… Such decisions could contribute to an environment where the data used in processes is not fully representative of the populations or phenomena that are being modeled, introducing downstream risks”: in short, problems of quality and validity (NIST: 2023, 80).

The author uses qualitative methods to examine these data governance challenges. In general, this report discusses only those governments that have taken specific steps (actions, policies, new regulations, etc.) to address web scraping, LLMs, or generative AI. The author acknowledges that these examples do not constitute a representative sample based on income, LLM expertise, and geographic diversity. However, the author uses these examples to show that while some policymakers are responsive to rising concerns, they do not seem to be looking at these issues systemically. A systemic approach has two components. First, policymakers recognize that these AI chatbots are a complex system with different sources of data that are linked to other systems designed, developed, owned, and controlled by different people and organizations. Data and algorithm production, deployment, and use are distributed among a wide range of actors who together produce the system’s outcomes and functionality. Hence, accountability is diffused and opaque (Cobbe et al: 2023). Second, as a report for the US National Academy of Sciences notes, the only way to govern such complex systems is to create “a governance ecosystem that cuts across sectors and disciplinary silos and solicits and addresses the concerns of many stakeholders.” This assessment is particularly true for LLMs, a global product with a global supply chain and numerous interdependencies among those who supply data, those who control data, and those who are data subjects or content creators (Cobbe et al: 2023).

In many countries, policymakers are trying to address these complex systems with policies designed to promote accountability and transparency and to mitigate risk. Some governments have proposed one-size-fits-all AI regulation to address the risks, business practices, and the technology. For example, the EU AI Act has been approved by the EU Parliament, but many people want to update it to meet the challenges of generative AI. They are calling for provisions to encourage transparency in the data supply chain and algorithms that could complement the regulation of digital services in the Digital Services Act. In short, they are pushing for a more systemic and coherent approach. In contrast, in 2019, Canada adopted procurement regulations, the Directive on Automated Decision-Making, to govern a wide range of AI systems procured by the Canadian government. The Directive requires that the data be relevant, accurate, up to date, and traceable; protected and accessed appropriately; and lawfully collected, used, retained, and disposed of. However, thus far Canadian policymakers have not linked learning from this directive to their approach to governing AI risk. As of August 2023, Canadian Parliamentarians are still reviewing the AI and Data Act (which says very little about the data supply chain and data governance and nothing about LLMs). It is, in short, disconnected from the governance of data.

Keywords: data, data governance, generative AI, LLMs, systemic approach

JEL Classification: O33, O34, O38, O36, P51

Suggested Citation

Aaronson, Susan, Data Dysphoria: The Governance Challenge Posed by Large Learning Models (August 28, 2023). Available at SSRN.

Susan Aaronson (Contact Author)

George Washington University - Elliott School of International Affairs

1957 E Street
Washington, DC 20052
United States

