AutoThink: Efficient Inference for Reasoning LLMs
12 Pages · Posted: 14 May 2025
Date Written: May 13, 2025
Abstract
We explore several aspects of improving inference efficiency for reasoning LLMs. In particular, we study the impact of reasoning budgets, guided decoding, and controlled steering on the accuracy of reasoning LLMs. Our approach, called AutoThink, consists of two parts: a query complexity classifier that determines the number of reasoning tokens allowed during inference before the final response is generated, and a dataset of control vectors used to steer the model's generation during inference. The control vectors are derived from pivotal tokens for the LLM, discovered via a search procedure over the distribution of responses the LLM generates on a calibration dataset. Our findings show that AutoThink can reduce the average number of output tokens by 55% while improving accuracy by 43% on GPQA-Diamond. Our implementation is open source and available at https://github.com/codelion/optillm.
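As a rough illustration of the two-stage pipeline the abstract describes, the Python sketch below wires a query-complexity classifier to a per-class token budget and applies a control vector to a transformer layer's hidden states via a forward hook during generation. This is a minimal sketch under assumed names and values (classify_complexity, BUDGETS, steering_hook, the layer index, the budget sizes are all hypothetical), not the actual optillm API; see the repository for the real implementation.

```python
# Minimal, hypothetical sketch of the AutoThink two-stage idea. All names
# and values here are illustrative assumptions, not the optillm API.
import torch

# Assumed reasoning-token budgets per complexity class (illustrative values).
BUDGETS = {"LOW": 512, "HIGH": 4096}

def classify_complexity(query: str) -> str:
    """Stand-in for the learned query-complexity classifier.

    AutoThink uses a trained classifier; this crude length heuristic
    exists only so the sketch runs end to end.
    """
    return "HIGH" if len(query.split()) > 50 else "LOW"

def steering_hook(control_vector: torch.Tensor, strength: float = 1.0):
    """Forward hook that adds a control vector to a layer's hidden states."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * control_vector.to(
            device=hidden.device, dtype=hidden.dtype
        )
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

def autothink_generate(model, tokenizer, query, control_vectors, layer_idx=15):
    """Generate with a complexity-dependent token budget and steering.

    Assumes a HuggingFace Llama-style causal LM whose decoder layers live
    at `model.model.layers`; `control_vectors` maps a complexity class to
    a (hidden_size,) steering vector.
    """
    complexity = classify_complexity(query)
    budget = BUDGETS[complexity]
    handle = model.model.layers[layer_idx].register_forward_hook(
        steering_hook(control_vectors[complexity])
    )
    try:
        inputs = tokenizer(query, return_tensors="pt").to(model.device)
        # Here the budget simply caps generation length; AutoThink proper
        # budgets the reasoning tokens emitted before the final answer.
        out = model.generate(**inputs, max_new_tokens=budget)
    finally:
        handle.remove()  # always detach the steering hook
    return tokenizer.decode(out[0], skip_special_tokens=True)
```

Registering the hook inside a try/finally ensures the steering vector is detached even if generation fails, so it cannot leak into subsequent, unsteered calls.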
Keywords: Large Language Models, Reasoning Efficiency, Token Budgeting, Query Complexity Classification, Model Steering, Inference Optimization, Computational Efficiency, Pivotal Tokens, AI Reasoning, LLM Inference