Large Language Models Are Unreliable Judges
131 Pages · Posted: 8 Apr 2025 · Last revised: 9 Apr 2025
Date Written: February 28, 2025
Abstract
Can large language models (LLMs) serve as "AI judges" that provide answers to legal questions? I conduct the first series of empirical experiments to systematically test the reliability of LLMs as legal interpreters. I find that LLM judgments are highly sensitive to prompt phrasing, output processing methods, and model training choices, undermining their credibility and creating opportunities for motivated judges to cherry-pick results. I also find that post-training procedures used to create the most popular models can cause LLM assessments to substantially deviate from empirical predictions of language use, casting doubt on claims that LLMs elucidate ordinary meaning.
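The prompt-sensitivity claim can be illustrated with a minimal sketch, assuming a generic chat-completion style model call. The paraphrases, the ask_model stub, and the agreement metric below are hypothetical illustrations, not the paper's actual materials or methodology: the idea is simply to pose the same interpretive question in several wordings and check whether the verdict is stable.

```python
# Hypothetical sketch: probing prompt sensitivity of an LLM "judge".
# `ask_model`, the paraphrases, and the toy stub are illustrative placeholders.
from collections import Counter

PARAPHRASES = [
    "Is a drone a 'vehicle' within the ordinary meaning of the statute? Answer yes or no.",
    "Under the statute's ordinary meaning, does 'vehicle' include a drone? Answer yes or no.",
    "Would an ordinary speaker call a drone a 'vehicle' as the statute uses the term? Answer yes or no.",
]

def ask_model(prompt: str) -> str:
    """Stub standing in for a real LLM API call (e.g., a chat-completion endpoint)."""
    # A real experiment would send `prompt` to the model and parse its reply.
    return "yes" if "include" in prompt else "no"

def prompt_stability(prompts: list[str], n_samples: int = 1) -> float:
    """Return the share of responses that agree with the modal answer.

    1.0 means the verdict is stable across phrasings; lower values mean the
    "judgment" depends on how the question happened to be worded.
    """
    answers = [ask_model(p).strip().lower() for p in prompts for _ in range(n_samples)]
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)

if __name__ == "__main__":
    print(f"Agreement with modal answer: {prompt_stability(PARAPHRASES):.2f}")
```

In practice, the same harness can be rerun across output-processing choices (e.g., parsing free text vs. constrained yes/no answers) and across model versions to compare how much each design choice moves the result.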
Keywords: generative interpretation, large language models, llms, legal interpretation, artificial intelligence, prompt sensitivity, ordinary meaning, AI judges, legal reasoning, judicial decisionmaking, prompt engineering, model sensitivity, post-training, empirical legal studies, textualism, statutory interpretation, computational linguistics, legal technology, contract interpretation