The Polylog AI Briefing
Morning Edition · Monday, June 15, 2026
Study Finds LLM Judges Disagree With Themselves on Repeated Identical Runs
Re-running the same evaluation many times exposes run-to-run instability in the LLM-as-a-Judge method that underpins leaderboards and reward models.
A new paper posted to arXiv, "The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation," studies what happens when the same model is asked to judge the same outputs repeatedly. The authors run repeated identical evalu…
More from this edition
- US Export Directive Suspends Access to Anthropic's Fable 5 and Mythos 5
- Liquid AI Ships an 8B Mixture-of-Experts Model Built for Laptops and Phones
- Anthropic Says Claude Matches Dedicated Software on NMR Spectrum Analysis
- DeepMind Researchers Map Possible Paths From AGI Toward Superintelligence
- Researchers Trace a Gemma 4 Repetition Bug to a Single Neuron
- Paper Targets Diffusion LLM Inference Bottlenecks on Mobile NPUs
- OpenAI Commits $150M to a New Enterprise Partner Network
- Anthropic Trains Claude to Translate Its Internal Representations Into Text
- A Wave of Benchmarks Probes How Easily AI Agents Are Manipulated
- Meta Pushes Segment Anything to Version 3 and Adds New Research Tooling
- Macron Frames Mistral as Europe's Only Frontier-Class Lab
- Anthropic Argues Policymaking Cannot Keep Pace With Exponential AI