Morning Edition · Monday, June 15, 2026 · Published at 3:00 AM EDT · New York

Study Finds LLM Judges Disagree With Themselves on Repeated Identical Runs

Re-running the same evaluation many times exposes run-to-run instability in the LLM-as-a-Judge method that underpins leaderboards and reward models.

A new paper posted to arXiv, "The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation," studies what happens when the same model is asked to judge the same outputs repeatedly. The authors run repeated identical evalu…
