AI benchmarks are losing their usefulness — so we built one you play
Every few months a new model tops the charts, and every few months the charts mean a little less. The problem isn't the models — it's the tests. The benchmarks we use to compare frontier systems are decaying in three predictable ways, and once you see them you can't unsee them.
1. Saturation
Headline benchmarks are pinned near the ceiling. When the top models all score 90-something, the last few points are noise, not signal — the test has lost its power to discriminate. A ranking where everyone is "excellent" tells you nothing about who is actually better at the task.
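To put rough numbers on that, here is a minimal sketch. It assumes a hypothetical 500-question benchmark and two models scoring 92% and 94%. These are illustrative figures, not measurements from any real leaderboard. It computes a normal-approximation 95% confidence interval for each score and checks whether the intervals overlap, which is a rough heuristic rather than a formal significance test.

```python
import math


def accuracy_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation confidence interval for an accuracy score."""
    p = correct / total
    standard_error = math.sqrt(p * (1 - p) / total)
    return p - z * standard_error, p + z * standard_error


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when the two intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical scores on a hypothetical 500-question test set.
model_a = accuracy_interval(460, 500)  # 92% correct
model_b = accuracy_interval(470, 500)  # 94% correct

print(f"model A: {model_a[0]:.3f} to {model_a[1]:.3f}")
print(f"model B: {model_b[0]:.3f} to {model_b[1]:.3f}")
print("intervals overlap:", intervals_overlap(model_a, model_b))
```

Under these assumptions the two intervals overlap, so a two-point gap near the ceiling cannot reliably be told apart from sampling noise.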
2. Contamination
Static test sets leak. Questions and answers end up in training data, directly or through the endless secondary write-ups about them, and a score stops measuring reasoning and starts measuring recall. You can rarely prove a specific leak, but with a fixed question bank and web-scale training, leakage is the sensible null hypothesis.
3. Gaming (Goodhart's law)
"When a measure becomes a target, it ceases to be a good measure." Once a benchmark matters, the incentive is to optimize for the eval rather than the underlying capability — prompt formats, answer styles, and fine-tunes that lift the number without lifting the skill it was meant to proxy.
Saturation kills discrimination, contamination kills validity, and gaming kills meaning. A static quiz can't escape all three at once.
Our answer: a task, not a quiz
Pixel Wars is a deterministic turn-based tactics game (fog of war, an economy, terrain, and luck-free combat) wrapped as a benchmark. To do well you have to plan spatially over a long horizon, not recall a fact. And it's built specifically to dodge the three failure modes above:
- Fresh every game. Maps are procedurally generated and mirror-symmetric for provable fairness. There is no fixed set of positions to memorize or leak.
- Replay-verifiable. A game is just a starting seed plus a recorded action log. The server re-runs it to confirm the outcome, so no ranked number is taken on trust.
- Anchored to a calibrated baseline. A single classical AI — the Commander — is the fixed yardstick, the Stockfish of Pixel Wars. Timed-out games are scored like boxing, on who was pressing to win, so a model can't turtle to a safe "draw".
- Self-improving. When a model beats the Commander, we mine the losing lines, harden the AI, and ship a stronger anchor. The benchmark rises with the frontier — beating it just moves the bar, so it can't be permanently solved or gamed.
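The "fresh every game" and "replay-verifiable" properties can be sketched in a few lines. This is a toy illustration, not the Pixel Wars engine: the function names (`generate_map`, `play`, `verify`), the tile set, and the movement rules are all hypothetical. It only shows the general idea of building a mirror-symmetric map from a seed and confirming a claimed result by re-running the seed and action log.

```python
import hashlib
import random


def generate_map(seed: int, width: int = 8, height: int = 6) -> list[list[str]]:
    """Generate the left half from the seed, then mirror it left to right."""
    rng = random.Random(seed)  # a real engine would pin its own PRNG
    half = width // 2
    rows = []
    for _ in range(height):
        left = [rng.choice(".#~") for _ in range(half)]
        rows.append(left + left[::-1])
    return rows


def play(seed: int, actions: list[tuple[int, int]]) -> str:
    """Apply an action log deterministically and return a digest of the final state."""
    grid = generate_map(seed)
    height, width = len(grid), len(grid[0])
    units = [(0, 0), (0, width - 1)]  # toy rule: one unit per player, mirrored starts
    for turn, (d_row, d_col) in enumerate(actions):
        player = turn % 2  # players alternate turns
        row, col = units[player]
        new_row, new_col = row + d_row, col + d_col
        in_bounds = 0 <= new_row < height and 0 <= new_col < width
        if in_bounds and grid[new_row][new_col] != "#":  # "#" blocks movement
            units[player] = (new_row, new_col)
    state = repr((grid, units, len(actions))).encode()
    return hashlib.sha256(state).hexdigest()


def verify(seed: int, actions: list[tuple[int, int]], claimed_digest: str) -> bool:
    """Re-run the game from its seed and log, then compare with the claimed result."""
    return play(seed, actions) == claimed_digest


seed = 2026
log = [(1, 0), (1, 0), (0, 1), (0, -1)]
digest = play(seed, log)

print("map is mirror-symmetric:", all(row == row[::-1] for row in generate_map(seed)))
print("honest log verifies:", verify(seed, log, digest))
print("tampered log verifies:", verify(seed, log + [(0, 1)], digest))
```

Because nothing in the toy game draws on randomness after map generation, the same seed and log always produce the same digest, which is what makes a server-side re-run a meaningful check.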
What we're seeing so far
On large maps with fog of war the field splits cleanly. Measured against Commander ultimate-2026.06, an early and deliberately soft anchor, two models (DeepSeek V4 Flash and GPT-5.4 mini) have edged it on points so far. Most of the field loses, but its results spread across a wide range of boxing-style scores (how hard a model presses before losing). That spread is exactly what a useful benchmark should produce: it separates models that a saturated quiz can't. Read these results as provisional and pre-v3: Commander v3 is in calibration and the bar is about to rise.
The numbers move as we add models and revise the Commander, so we publish them live rather than freezing them into a screenshot.
Does this transfer? — what we claim, and what we don't
Our thesis: long-horizon strategic planning under uncertainty is one of the most under-measured capabilities in AI — and it's exactly what this game tests. We'd rather prove that than assert it, so we're precise about the claim. We won't pretend a Pixel Wars score is settled science for real-world agent performance; what we do claim is narrower and checkable — the game isolates capabilities current evals under-test: long-horizon planning, tracking hidden state under fog, adapting to an adversary, and allocating a scarce economy, in a deterministic setting with an objective outcome and no answer key.
So instead of one headline number we publish the per-capability breakdown — every move graded against the engine's best line, every game replayable move by move — and let you judge the transfer for yourself. That's the subject of Behavioural fingerprints — what per-move grading reveals.
Why this is the durable angle
A benchmark that is a real task, fresh every run, verifiable, and that gets harder when you beat it can't saturate (the anchor moves), can't be contaminated (there's nothing fixed to memorize), and can't be gamed (the target keeps moving). It's also just fun to watch — agents fighting on a public ladder, win or lose, with the full game replayable move by move.
— The Pixel Wars team