GPT-5.2 scores 93.2% on GPQA Diamond. The benchmark is designed so that PhD-level domain experts in the relevant fields score about 65%. Skilled non-experts with full internet access and 30 minutes per question average 34% — barely above random chance.

The model is outperforming the people meant to check its work.

Charlie Guo documented the deeper problem in Artificial Ignorance earlier this month: OpenAI audited SWE-bench Verified and found that GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash had all seen the test answers during training. Given just a task ID and a brief hint, each model could reproduce the original code fix from memory — variable names, inline comments, implementation details that never appear in the problem description. The benchmark was testing recall, not coding ability. OpenAI’s response: declare SWE-bench “no longer suitable” and introduce SWE-bench Pro, a harder replacement from Scale AI.
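The probe described above can be sketched in a few lines. This is a hedged illustration, not OpenAI's actual audit methodology: `query_model` is a hypothetical stand-in for a real API call (here simulated by a memorizing model), and the task IDs and patches are invented. The idea is simply to prompt with a task ID plus a hint and measure how close the output lands to the held-out gold fix.

```python
import difflib

# Hypothetical gold patches keyed by benchmark task ID.
GOLD_PATCHES = {
    "task-001": "def area(r):\n    return 3.14159 * r * r  # use radius squared\n",
}

def query_model(task_id: str, hint: str) -> str:
    """Stand-in for a model call. A contaminated model reproduces the
    gold fix from memory given only the task ID and a brief hint --
    variable names, comments, and all."""
    return GOLD_PATCHES.get(task_id, "")

def contamination_score(generated: str, gold: str) -> float:
    """Similarity between model output and the gold patch.
    Scores near 1.0 across many tasks suggest recall, not solving."""
    return difflib.SequenceMatcher(None, generated, gold).ratio()

candidate = query_model("task-001", hint="fix the circle-area bug")
score = contamination_score(candidate, GOLD_PATCHES["task-001"])
assert score > 0.95  # verbatim recall of details absent from the problem text
```

A genuinely novel solve would still pass the benchmark's tests but diverge from the gold patch in naming and structure, which is what makes near-verbatim similarity a useful red flag.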

This is the benchmark treadmill: saturate, replace, repeat. Each replacement is harder, requires narrower expertise to grade, and produces results fewer people can verify. The institutional response to being outrun is to build a track that fewer humans can run at all.

The same thing is playing out inside enterprise software teams. Amazon tightened oversight on AI-assisted code changes after production outages. Saanya Ojha's writeup in The Change Constant identified the pattern across enterprises: companies don't know what good AI adoption looks like, so they borrow metrics from the AI labs themselves. Anthropic and OpenAI cite roughly 90% AI-generated code; the implicit logic is that if the number is good enough for the labs, it must be the right thing to track.

A metric that tracks output instead of value is gameable inside an afternoon. Tell engineers the goal is more AI-generated code and you get one-line changes, version bumps, split refactors, and cleanup commits. Everyone develops a sudden philosophical commitment to small, atomic diffs. The metric moves; the product doesn't.
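The gaming move is mechanical enough to sketch. A toy illustration, with an invented commit schema (the `ai_assisted` field and the numbers are hypothetical, not any real tool's telemetry): one substantive human change versus the same work padded with trivial AI-tagged diffs.

```python
def ai_commit_share(commits):
    """Naive adoption metric: fraction of commits tagged AI-assisted.
    Counts output, not value -- every commit weighs the same."""
    ai = sum(1 for c in commits if c["ai_assisted"])
    return ai / len(commits)

# One real 400-line human change...
before = [{"ai_assisted": False, "lines": 400}]

# ...plus five trivial AI-tagged diffs: version bumps, cleanup commits.
after = [{"ai_assisted": True, "lines": 4} for _ in range(5)] + before

print(ai_commit_share(before))  # 0.0
print(ai_commit_share(after))   # ~0.83 -- the metric moves; the product doesn't
```

Weighting by lines changed just shifts the gaming target; any proxy computable from commit metadata alone can be optimized without touching what ships.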

At both layers, the cause is the same: instruments chosen for legibility over validity. Benchmarks are legible: you can rank models, publish a leaderboard, issue a press release. Enterprise metrics are legible: you can put them in a board deck. The thing they're meant to proxy — whether the output is actually any good — is harder to evaluate and doesn't fit a dashboard, so the proxy replaces it.

What's new isn't this trade-off. It's the speed. Models are advancing faster than the benchmarks that track them. Code is being generated faster than the review processes meant to catch its failures can keep up. The gap between what's measurable and what's true is widening at both layers simultaneously.

The institutional response makes it worse. Harder benchmarks mean smaller grader pools and results that fewer people can sanity-check. More productivity metrics mean more proxies to optimize against. Amazon's outages weren't a failure of instrumentation; the metrics existed. They were a failure of judgment about what the metrics meant.

The bottleneck in both cases is human evaluation capacity: not grading against a rubric, but actually reading the output, understanding it, and knowing whether it’s right. That capacity doesn’t scale with model releases. It doesn’t improve because you hired a metrics analyst. It’s a human asset that has to be cultivated deliberately, and most organizations are treating it as an afterthought behind deployment velocity.

The answer to “who checks the work?” isn’t a better benchmark. It’s maintaining enough judgment capacity to tell a model that solved a problem from a model that produced something that looks like a solution. That distinction is growing harder to make — and more expensive to get wrong.