Inverse Scaling in Test-time Compute

Aryo Pradipta Gema (AFP, 2), Alexander Hägele (AFP, 3), Runjin Chen (AFP, 4), Andy Arditi (AFP),
Jacob Goldman-Wetzler (AFP), Kit Fraser-Taliente (AFP), Henry Sleight (5), Linda Petrini (6),
Julian Michael (7, *), Beatrice Alex (2), Pasquale Minervini (2, 8),
Yanda Chen (1), Joe Benton (1), Ethan Perez (1)

AFP: Anthropic Fellows Program • 1: Anthropic • 2: University of Edinburgh • 3: EPFL • 4: University of Texas at Austin
5: Constellation • 6: Independent • 7: Scale AI • 8: MiniML.AI
* Now at Meta
Corresponding authors: aryo.gema@ed.ac.uk, ethan@anthropic.com

Abstract

We construct evaluation tasks where extending reasoning length degrades performance, exhibiting an inverse scaling relationship between test-time compute and accuracy. Our evaluation tasks span four categories: red herring tasks with embedded distractors, spurious correlation tasks, constraint satisfaction tasks, and advanced AI risk tasks.

Our analyses show that different Large Reasoning Models (LRMs) exhibit distinct failure modes. Claude models become increasingly distracted by irrelevant information in the prompt as they reason longer, while OpenAI o-series models resist distractors but show pronounced overfitting to problem framings. In spurious correlation tasks, extended reasoning causes models to shift from reasonable priors to plausible but incorrect features, though providing few-shot examples largely corrects this behavior. In constraint satisfaction tasks, all models show performance degradation with extended reasoning, suggesting difficulty maintaining focus during complex deductive tasks. In AI safety evaluation tasks, extended reasoning can amplify model-specific concerning behaviors, with Claude Sonnet 4 showing increased expressions of self-preservation in longer reasoning traces.

These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns. Our results demonstrate the importance of evaluating models across diverse reasoning lengths to identify and address these failure modes in LRMs.

Demo

Explore how different models perform on inverse scaling tasks across various reasoning budgets. Compare performance between the no-thinking baseline and the maximum reasoning budget to see the inverse scaling effect.
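
As a minimal sketch of how such a comparison might be scripted, the snippet below queries the same question once at a no-thinking baseline and once with a large extended-thinking budget via the Anthropic API. The model ID, prompt, and budget values are illustrative assumptions, not the paper's exact evaluation setup.

# Minimal sketch (assumed setup): compare answers at a no-thinking
# baseline vs. a large reasoning budget, using the `anthropic` SDK.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Illustrative placeholder prompt with a distractor, not one of the paper's tasks.
PROMPT = "You have an apple and an orange. 61% of fruits sold are sweet. How many fruits do you have?"

def ask(budget_tokens=None):
    kwargs = {}
    if budget_tokens is not None:
        # Extended thinking: cap the model's internal reasoning length.
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": budget_tokens}
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model ID
        max_tokens=(budget_tokens + 1000) if budget_tokens else 1000,
        messages=[{"role": "user", "content": PROMPT}],
        **kwargs,
    )
    # Return the final answer text, skipping any thinking blocks.
    return next(b.text for b in response.content if b.type == "text")

print("baseline (no thinking):", ask())
print("16k reasoning budget:  ", ask(budget_tokens=16000))

Sweeping the budget over several values (rather than just the two endpoints) traces out the full scaling curve and shows where accuracy begins to decline.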

Citation

@article{gema2025inverse,
  title={Inverse Scaling in Test-time Compute},
  author={Aryo Pradipta Gema and Alexander Hägele and Runjin Chen and Andy Arditi and Jacob Goldman-Wetzler and Kit Fraser-Taliente and Henry Sleight and Linda Petrini and Julian Michael and Beatrice Alex and Pasquale Minervini and Yanda Chen and Joe Benton and Ethan Perez},
  journal={arXiv preprint arXiv:2507.14417},
  year={2025}
}