Off-the-Shelf Large Language Models Are Unreliable Judges

135 Pages · Posted: 8 Apr 2025 · Last revised: 7 May 2025

Jonathan H. Choi

University of Southern California; University of Southern California Gould School of Law

Date Written: February 28, 2025

Abstract

Can off-the-shelf large language models (LLMs) like ChatGPT or Claude serve as “AI judges” that provide answers to legal questions? I conduct the first series of empirical experiments to systematically test their reliability as legal interpreters. I find that LLM judgments are highly sensitive to prompt phrasing, output processing methods, and model training choices, undermining their credibility and creating opportunities for motivated judges to cherry-pick results. I also find that post-training procedures used to create the most popular models can cause LLM assessments to substantially deviate from empirical predictions of language use, casting doubt on claims that LLMs elucidate ordinary meaning.
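To make the prompt-sensitivity finding concrete, here is a minimal sketch of the kind of check the abstract describes: pose one legal-interpretation question under several meaning-preserving paraphrases and see whether the model's answers agree. It assumes the OpenAI Python SDK (openai >= 1.0) with an API key in the environment; the question, its paraphrases, and the model name are illustrative placeholders, not the paper's actual experimental materials.

```python
# Sketch of a prompt-sensitivity check: ask the same "ordinary meaning"
# question under several paraphrases and compare the answers. The prompts
# and model name below are hypothetical, not the paper's materials.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Three phrasings of the same underlying interpretive question.
PROMPTS = [
    "Is a burrito a 'sandwich' in the ordinary meaning of the word? Answer Yes or No.",
    "In ordinary usage, would a burrito count as a sandwich? Answer Yes or No.",
    "Does the ordinary meaning of 'sandwich' include a burrito? Answer Yes or No.",
]

def ask(prompt: str, model: str = "gpt-4o") -> str:
    """Query the model once at temperature 0 and return its short answer."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

answers = [ask(p) for p in PROMPTS]
print(Counter(answers))  # e.g. Counter({'Yes': 2, 'No': 1})
```

If the model were a reliable interpreter, rewordings that preserve meaning should not flip its answer; any disagreement in the tally above is the prompt sensitivity the paper documents. Running at temperature 0 keeps sampling noise to a minimum, so variation across paraphrases is attributable to the phrasing itself.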

Keywords: generative interpretation, large language models, LLMs, legal interpretation, artificial intelligence, prompt sensitivity, ordinary meaning, AI judges, legal reasoning, judicial decisionmaking, prompt engineering, model sensitivity, post-training, empirical legal studies, textualism, statutory interpretation, computational linguistics, legal technology, contract interpretation

Suggested Citation

Choi, Jonathan H., Off-the-Shelf Large Language Models Are Unreliable Judges (February 28, 2025). Available at SSRN: https://ssrn.com/abstract=5188865 or http://dx.doi.org/10.2139/ssrn.5188865

Jonathan H. Choi (Contact Author)

University of Southern California

2250 Alcazar Street
Los Angeles, CA 90089
United States

University of Southern California Gould School of Law

699 Exposition Blvd.
Los Angeles, CA 90089
United States

Paper statistics

Downloads: 333
Abstract Views: 1,275
Rank: 200,950