Pixel Wars as a regression test — wire it into your eval loop

Pixel Wars · for eval & post-training teams

A benchmark you run once is a snapshot. It tells you where a model sat on one afternoon against one opponent. That's useful for a launch slide and nearly useless for the thing you actually care about: did this checkpoint get worse at planning than the last one? Teams shipping agents don't need a number — they need a signal they can watch move over time.

Pixel Wars is built to be that signal. It's a deterministic, fog-of-war tactics game where a model plays a fixed, versioned opponent — the Commander — across procedurally generated, mirror-symmetric maps. Every game reduces to a seed plus an action log, so every result is re-runnable and the grading is mechanical. That's exactly the property a regression test needs.
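To see why a seed plus an action log is enough to re-run a game, here is a minimal Python sketch. The toy engine and the names `ToyGame` and `replay` are hypothetical stand-ins, not the Pixel Wars engine or its API. The only point is that when all randomness flows from a seeded generator, replaying the same log reproduces the same final state, and an illegal log fails on the way.

```python
import hashlib
import random
from dataclasses import dataclass, field


@dataclass
class ToyGame:
    """Hypothetical deterministic engine used for illustration only."""
    seed: int
    position: int = 0
    score: int = 0
    resources: list = field(default_factory=list)

    def __post_init__(self):
        # All randomness comes from one seeded generator, so the "map"
        # is a pure function of the seed.
        rng = random.Random(self.seed)
        self.resources = [rng.randint(0, 9) for _ in range(8)]

    def apply(self, action: str) -> None:
        # Anything that is not a legal move in the current state raises,
        # so a fabricated log cannot replay cleanly.
        if action == "left" and self.position > 0:
            self.position -= 1
        elif action == "right" and self.position < len(self.resources) - 1:
            self.position += 1
        elif action == "collect":
            self.score += self.resources[self.position]
            self.resources[self.position] = 0
        else:
            raise ValueError(f"illegal action: {action!r}")


def replay(seed: int, actions: list[str]) -> str:
    """Re-run a game from its seed and action log; return a digest of the final state."""
    game = ToyGame(seed)
    for action in actions:
        game.apply(action)
    state = f"{game.position}|{game.score}|{game.resources}"
    return hashlib.sha256(state.encode()).hexdigest()
```

A verifier built this way only needs to compare the digest it computes with the digest the client reported; no judgment call is involved.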

Run it now — BYOK, in the browser

The repeatable part starts today, with no integration work. The free web client has a bring-your-own-key Benchmark mode: pick a model, paste your API key, and score it against the full-strength Commander in a best-of-25 series with fog on and large maps, for a few dollars of API usage. The key lives in your browser session and talks straight to your vendor; it never touches our servers or logs.

A run produces four things, none of them a vibe:

pts% — a margin-weighted score across the series. Outcomes resolve to one of five buckets: win, win-on-points, draw, loss-on-points, loss. A timeout is scored by objective margin, so turtling to a draw isn't a safe 0.5.
win% — the blunt one. How often the model actually beat the Commander.
Per-move accuracy — every move is graded against the engine's best line, giving a move-by-move accuracy %. This is where a clean-but-overrun game and a five-blunder collapse stop looking the same on the scoreboard.
A replay-verified log — the server re-runs each game move-by-move from the seed and action log to confirm it; fabricated or illegal logs are rejected. You can step through any game in the replay viewer and watch the loss happen.
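As an illustration of how the headline numbers can fall out of a series mechanically, the sketch below aggregates per-game records in Python. Several things here are assumptions, not the published method: the bucket weights (1.0, 0.75, 0.5, 0.25, 0.0), counting both win buckets toward win%, treating accuracy as the share of moves that match the engine's best line, and every field name. The actual Pixel Wars weighting is not stated above and may differ.

```python
from dataclasses import dataclass

# ASSUMED weights for the five outcome buckets; illustrative only.
BUCKET_WEIGHTS = {
    "win": 1.0,
    "win_on_points": 0.75,
    "draw": 0.5,
    "loss_on_points": 0.25,
    "loss": 0.0,
}


@dataclass
class GameResult:
    outcome: str              # one of the BUCKET_WEIGHTS keys
    moves_played: int
    moves_matching_best: int  # assumed grading: move equals engine's best line


def summarize(series: list[GameResult]) -> dict[str, float]:
    """Aggregate a series into pts%, win%, and move accuracy, all in percent."""
    total_moves = sum(g.moves_played for g in series)
    if not series or total_moves == 0:
        raise ValueError("series must contain at least one game with moves")
    n = len(series)
    pts = sum(BUCKET_WEIGHTS[g.outcome] for g in series) / n
    # ASSUMPTION: a win on points still counts as beating the Commander.
    wins = sum(g.outcome in ("win", "win_on_points") for g in series) / n
    accuracy = sum(g.moves_matching_best for g in series) / total_moves
    return {
        "pts_pct": round(100 * pts, 1),
        "win_pct": round(100 * wins, 1),
        "move_accuracy_pct": round(100 * accuracy, 1),
    }
```

Keeping the three numbers side by side is the design point: two series with the same win% can differ sharply in move accuracy, which is the difference between being overrun and blundering.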

There's no LLM-as-judge and no rubric anywhere in that pipeline — the grade is the game engine computing an outcome. Same Commander, same methodology, your key. The number you get is the number you can quote.

Current public numbers were measured against Commander ultimate-2026.06 — an early, deliberately soft anchor. Read them as provisional: DeepSeek V4 Flash landed ~46% pts / 40% win, GPT-5.4 mini ~45% pts / 36% win, and two models cleared the Commander. Those figures are tagged by the Commander revision they ran against and will be re-measured when the harder v3 anchor ships.

The workflow we're building toward

Running it by hand is a snapshot generator. To make it a regression test, it has to live inside the loop you already run — and become a signal you track across checkpoints, not just within one. Two pieces are coming for that:

Inspect-compatible task wrapper (coming). Pixel Wars becomes one task in your existing Inspect suite: same harness, same scheduling, same artifacts as everything else you run against a checkpoint. No bespoke runner to babysit.
Checkpoint-over-checkpoint tracking (coming). Once the task is in your suite, the per-capability breakdown (strategic skill, economy allocation, scouting under fog) becomes a series you can chart from one checkpoint to the next. A regression shows up as a line that dips, not a number you have to remember to compare by hand.
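Until that tracking ships, "a line that dips" can be approximated by hand. The sketch below is one hypothetical way to do it in Python: walk each capability's series and flag any checkpoint that fell by more than a tolerance relative to the one before it. The record layout, the capability keys, the checkpoint names, and the 2-point default tolerance are all assumptions for illustration, not the product's API.

```python
from typing import NamedTuple


class Regression(NamedTuple):
    capability: str
    checkpoint: str
    previous: float
    current: float


def find_regressions(
    history: dict[str, list[tuple[str, float]]],
    tolerance: float = 2.0,
) -> list[Regression]:
    """Flag checkpoints whose score dropped by more than `tolerance` points.

    `history` maps a capability name to an ordered list of
    (checkpoint_name, score) pairs, oldest first.
    """
    flagged = []
    for capability, points in history.items():
        # Pair each checkpoint with its predecessor in the same series.
        for (_, prev), (checkpoint, cur) in zip(points, points[1:]):
            if prev - cur > tolerance:
                flagged.append(Regression(capability, checkpoint, prev, cur))
    return flagged
```

The tolerance is a placeholder for whatever run-to-run variation you observe in your own series; set it from repeated runs of the same checkpoint rather than from this default.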

The primitive underneath is already solid: a deterministic task with an objective, replay-verified outcome and a fixed, versioned opponent. CI integration, hosted dashboards, and automated drift-tracking are what we're building on top of it. The foundation is here now; the workflow layer is next.

That versioning is the part that keeps a regression test honest over time. The Commander is the fixed anchor — when models start beating it, we mine those games and ship a stronger revision, and old numbers are kept and tagged by the Commander version they ran against. So your series stays comparable within an anchor, and the bar rising to v3 is a feature, not a reset.
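One way to keep a series comparable within an anchor is to refuse to diff results tagged with different Commander revisions. The Python sketch below assumes a hypothetical `Run` record and helper names; only the idea of tagging each result with the Commander revision it ran against comes from the text.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    checkpoint: str
    commander: str  # revision tag the run was measured against
    pts_pct: float


def series_by_anchor(runs: list[Run]) -> dict[str, list[Run]]:
    """Group runs by Commander revision, preserving input order within each group."""
    grouped = defaultdict(list)
    for run in runs:
        grouped[run.commander].append(run)
    return dict(grouped)


def delta(a: Run, b: Run) -> float:
    """Score change from run `a` to run `b`; only defined within one anchor."""
    if a.commander != b.commander:
        raise ValueError(
            f"cannot compare across anchors: {a.commander} vs {b.commander}"
        )
    return b.pts_pct - a.pts_pct
```

Raising on a cross-anchor comparison, rather than silently returning a number, keeps a move to a harder Commander from being misread as a model regression.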

Why long-horizon planning is the thing that silently regresses

Static quizzes are good at catching the regressions that announce themselves — a fact recalled wrong, a format broken, a refusal where there shouldn't be one. They are bad at catching the regression that matters most for an agent: the model that still answers every question correctly but has quietly gotten worse at sequencing decisions toward a goal.

That skill is invisible to a single-turn eval because there's no horizon to plan over and no adversary to adapt to. A full game — dozens of sequential moves — has both. To beat the Commander a model has to track hidden state under fog, manage an economy across many turns, and adjust to an opponent that punishes a weak line — the capabilities current evals under-test, isolated in a setting with one objective outcome and no answer key to leak or memorize. Maps are fresh every game, so there's nothing to pattern-match; the model has to actually play.

A static test gets easier as models improve and tells you less as they do. A game against a versioned anchor stays a real question — and a regression in long-horizon planning shows up as a line that moves.

Our thesis is that long-horizon planning under uncertainty is one of the most under-measured capabilities in AI — and we'd rather prove it than assert it. We won't call a Pixel Wars score settled science for real-world agent performance; we publish the per-capability breakdown and the full replay for every game so you can judge it on a real task with no answer key. You can watch a model beat the Commander or watch one collapse and decide for yourself whether the failure mode is the one you care about.
