Prompting Science Report 2: The Decreasing Value of Chain of Thought in Prompting

19 Pages. Posted: 13 Jun 2025. Last revised: 8 Jun 2025.

Lennart Meincke

University of Pennsylvania; The Wharton School; WHU - Otto Beisheim School of Management

Ethan R. Mollick

University of Pennsylvania - Management Department

Lilach Mollick

University of Pennsylvania - Wharton School

Dan Shapiro

Glowforge, Inc; University of Pennsylvania - The Wharton School

Date Written: June 08, 2025

Abstract

This is the second in a series of short reports that seek to help business, education, and policy leaders understand the technical details of working with AI through rigorous testing. In this report, we investigate Chain-of-Thought (CoT) prompting, a technique that encourages a large language model (LLM) to "think step by step" (Wei et al., 2022). CoT is a widely adopted method for improving performance on reasoning tasks; however, our findings reveal a more nuanced picture of its effectiveness. We demonstrate two things:


  • The effectiveness of Chain-of-Thought prompting can vary greatly depending on the type of task and model. For non-reasoning models, CoT generally improves average performance by a small amount, particularly if the model does not engage in step-by-step processing by default. However, CoT can introduce more variability in answers, occasionally triggering errors on questions the model would otherwise get right. We also found that many recent models perform some form of CoT reasoning even when not asked; for these models, a request to perform CoT had little impact. Performing CoT generally requires far more tokens (increasing cost and time) than direct answers.

  • For models designed with explicit reasoning capabilities, CoT prompting often results in only marginal, if any, gains in answer accuracy. However, it significantly increases the time and tokens needed to generate a response.
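To make the comparison concrete, the two prompting conditions can be sketched as a pair of prompt variants built from the same question. This is a minimal illustration only: the function name `build_prompts` and the exact instruction wording are assumptions made for this example, not the prompts used in the report.

```python
def build_prompts(question: str) -> dict[str, str]:
    """Build a direct-answer and a Chain-of-Thought variant of one question.

    The instruction wording below is hypothetical; only the phrase
    "think step by step" is taken from the abstract's description of CoT.
    """
    direct = f"{question}\n\nGive only the final answer."
    cot = f"{question}\n\nThink step by step, then give the final answer."
    return {"direct": direct, "cot": cot}


prompts = build_prompts("What is 17 * 6?")
for name, text in prompts.items():
    print(f"--- {name} ---\n{text}\n")
```

Sending both variants to the same model and comparing accuracy, response time, and token counts would be one way to observe the trade-off described above.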


Taken together, these findings suggest that a simple CoT prompt is generally still a useful tool for boosting average performance in non-reasoning models, especially older or smaller models that may not engage in CoT reasoning by default. However, the gains must be weighed against increased response times and potential decreases in perfect accuracy due to more variability in answers. For dedicated reasoning models, the added benefits of explicit CoT prompting appear negligible and may not justify the substantial increase in processing time.
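The tension between average performance and perfect accuracy can be illustrated with a small sketch. It assumes each question is asked several times and that "perfect accuracy" means a question is answered correctly on every trial; that definition, the function name `summarize`, and the toy data are assumptions made for illustration and are not results from the report.

```python
def summarize(trials: dict[str, list[bool]]) -> dict[str, float]:
    """Summarize repeated trials, given question id -> per-trial correctness."""
    per_question = [sum(r) / len(r) for r in trials.values()]
    # Average accuracy: mean share of correct trials across questions.
    average = sum(per_question) / len(per_question)
    # Perfect accuracy (assumed definition): share of questions
    # answered correctly in every single trial.
    perfect = sum(all(r) for r in trials.values()) / len(trials)
    return {"average_accuracy": average, "perfect_accuracy": perfect}


# Invented toy data, for illustration only.
direct = {"q1": [True, True, True, True], "q2": [False, False, False, False]}
cot = {"q1": [True, True, True, False], "q2": [True, False, True, False]}

print(summarize(direct))  # {'average_accuracy': 0.5, 'perfect_accuracy': 0.5}
print(summarize(cot))     # {'average_accuracy': 0.625, 'perfect_accuracy': 0.0}
```

In this toy case the CoT condition raises average accuracy while no question is answered correctly on every trial, which is the kind of variability cost the paragraph above describes.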

Keywords: llm, large language models, benchmarking, chain-of-thought, cot, prompt engineering

Suggested Citation

Meincke, Lennart and Mollick, Ethan R. and Mollick, Lilach and Shapiro, Dan, Prompting Science Report 2: The Decreasing Value of Chain of Thought in Prompting (June 08, 2025). The Wharton School Research Paper, Available at SSRN: https://ssrn.com/abstract=5285532 or http://dx.doi.org/10.2139/ssrn.5285532

Lennart Meincke (Contact Author)

University of Pennsylvania

Philadelphia, PA 19104
United States

The Wharton School

3641 Locust Walk
Philadelphia, PA 19104-6365
United States

WHU - Otto Beisheim School of Management

Burgplatz 2
Vallendar, 56179
Germany

Ethan R. Mollick

University of Pennsylvania - Management Department

The Wharton School
Philadelphia, PA 19104-6370
United States

Lilach Mollick

University of Pennsylvania - Wharton School

3641 Locust Walk
Philadelphia, PA 19104-6365
United States

Dan Shapiro

Glowforge, Inc

1938 Occidental Ave S
Suite C
Seattle, WA 98134
United States

University of Pennsylvania - The Wharton School

3641 Locust Walk
Philadelphia, PA 19104-6365
United States

Paper statistics

Downloads: 3,916
Abstract Views: 24,961
Rank: 6,903