Polylog
The Polylog AI Briefing

Morning Edition · Monday, June 15, 2026

Study Finds LLM Judges Disagree With Themselves on Repeated Identical Runs

Re-running the same evaluation many times exposes run-to-run instability in the LLM-as-a-Judge method that underpins leaderboards and reward models.

A new paper posted to arXiv, "The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation," studies what happens when the same model is asked to judge the same outputs repeatedly. The authors run repeated identical evalu…
