Reproducible, adversarial evaluation for long-horizon agents
A REPRODUCIBLE AGENTIC EVAL FOR FRONTIER MODELS
Models fight a calibrated, self-improving opponent: long-horizon, adversarial, every game replay-verified
Pixel Wars is an agentic evaluation wrapped as a war game: an LLM plans over dozens of moves under fog of war against the Commander — a calibrated AI anchor that hardens every time it's beaten. Fresh procedurally-generated maps (nothing to memorize), an objective win or loss, and a replay-verified result for every match. Watch the models fight, or run it in your own harness.
Benchmarks are losing their usefulness
Headline evals are pinned near the ceiling, leak into training data, and get gamed. A score stops measuring reasoning the moment it becomes a target. Pixel Wars is the opposite kind of test.
A task, not a quiz
Long-horizon spatial tactics in a deterministic, fair game — fog of war, economy, terrain, combat. There's no answer key to memorize.
Fresh every game
Maps are procedurally generated and mirror-symmetric for provable fairness. Every match is a new position, so a model can't overfit to the test set.
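As a rough illustration of how a seeded, mirror-symmetric map could be produced, here is a minimal sketch. The tile symbols, grid size, and function names are assumptions made for the example and are not taken from the Pixel Wars engine.

```python
import random

# Hypothetical terrain symbols; the real tile set is not described here.
TERRAIN = ".^~#"


def generate_mirrored_map(seed: int, width: int = 8, height: int = 6) -> list[str]:
    """Build a left-right mirror-symmetric grid from a seed (illustrative only)."""
    if width % 2:
        raise ValueError("width must be even so the two halves mirror exactly")
    rng = random.Random(seed)  # private RNG: the same seed always gives the same map
    rows = []
    for _ in range(height):
        left = "".join(rng.choice(TERRAIN) for _ in range(width // 2))
        rows.append(left + left[::-1])  # right half is the reflection of the left
    return rows


def is_mirror_symmetric(rows: list[str]) -> bool:
    """Check that every row reads the same from both sides."""
    return all(row == row[::-1] for row in rows)
```

Because each side sees a reflection of the other's terrain, neither starting position has a built-in advantage in this toy version, and a new seed yields a new position.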
Replay-verified
A game is a seed plus an action log; the server re-runs it to confirm the result. No trust-me scores — every ranked number is reproducible.
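The idea of "seed plus action log" can be sketched as follows. This is a toy stand-in and not the Pixel Wars server: `GameRecord`, `simulate`, and `verify` are hypothetical names, and the "engine" here just folds the actions into a digest so the re-run check is easy to see.

```python
import hashlib
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class GameRecord:
    seed: int
    actions: tuple[str, ...]
    claimed_result: str  # digest of the final state reported by the client


def simulate(seed: int, actions: tuple[str, ...]) -> str:
    """Toy deterministic engine: replay the actions and return a final-state digest."""
    rng = random.Random(seed)  # all randomness derives from the seed
    state = hashlib.sha256(str(seed).encode())
    for action in actions:
        roll = rng.randrange(6)  # stands in for any seeded game randomness
        state.update(f"{action}:{roll}".encode())
    return state.hexdigest()


def verify(record: GameRecord) -> bool:
    """Re-run the game from scratch and compare with the claimed result."""
    return simulate(record.seed, record.actions) == record.claimed_result
```

A record whose seed or action log has been altered no longer reproduces the claimed result, so `verify` rejects it.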
Four ways to run it
Free in your browser. Model seats are bring-your-own-key: your key goes straight to your vendor and never touches our servers or logs.
Your LLM vs the Commander
The benchmark matchup — can your model out-plan the calibrated anchor?
LLM vs LLM
Two models head-to-head — the arena, ranked on a ladder.
You vs the Commander
Play the eval yourself against the baseline — no key needed.
You vs your LLM
Spar with a model you key in — or coach it.
Wire it into your eval loop
A static score you run once is a snapshot — teams shipping agents need a repeatable signal they can watch move over time, and that's the product we're building. Run a full game today; the cards below are how Pixel Wars is becoming part of the loop you already run.
Benchmark any model now
In-browser BYOK: your model vs the Commander, best-of-25, fog on, large maps — a few dollars of API, key straight to your vendor. Every game is a seed plus an action log the server re-runs move-by-move, so any result is reproducible.
Current public numbers are measured against Commander ultimate-2026.06; v3 is in calibration and hardens the anchor, so read today's figures as provisional, pre-v3.
Drop it into your harness (coming)
An Inspect-compatible task wrapper so Pixel Wars drops into the eval suite you already run — same grading, same replay-verified result, no bespoke glue.
Track it across checkpoints (coming)
Point it at successive checkpoints or nightly builds and watch strategic skill, economy, and scouting move version-over-version — long-horizon regression testing, not a one-shot number.
The bigger arc: Pixel Wars is environment one. The same deterministic, replay-verified, self-improving engine is how we intend to measure long-horizon agents in the arenas that come next — logistics, negotiation, adversarial planning.
The Commander is the anchor
One calibrated classical AI is the fixed yardstick — think of it as the Stockfish of Pixel Wars. We score timed-out games like boxing (on who was pressing to win, not a flat draw), so turtling isn't safe and the metric actually discriminates between models.
Models that have beaten the Commander on large, fog-on maps so far: DeepSeek V4 Flash and GPT-5.4 mini. Provisional: measured against Commander ultimate-2026.06, the current anchor; v3 is in calibration and raises the ceiling.
Scoring: win / loss / win-by-points / loss-by-points / true draw.
One unified rating for humans and AI on the same ladder.
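The five outcomes can be pictured with a small classifier. This is a sketch under assumptions: the page does not say how "pressing to win" is measured, so the point totals and the `classify` function are hypothetical.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    WIN = "win"
    LOSS = "loss"
    WIN_BY_POINTS = "win-by-points"
    LOSS_BY_POINTS = "loss-by-points"
    TRUE_DRAW = "true draw"


def classify(decisive_winner: Optional[str], my_points: int, opp_points: int) -> Outcome:
    """Map a finished game to one of the five outcomes.

    decisive_winner is "me", "opponent", or None when the game timed out.
    The point totals are an assumed stand-in for the pressing-to-win judgment.
    """
    if decisive_winner == "me":
        return Outcome.WIN
    if decisive_winner == "opponent":
        return Outcome.LOSS
    # Timed out: decide on points instead of calling it a flat draw.
    if my_points > opp_points:
        return Outcome.WIN_BY_POINTS
    if my_points < opp_points:
        return Outcome.LOSS_BY_POINTS
    return Outcome.TRUE_DRAW
```

Under this scheme a side that sits back until the clock runs out still loses on points if the opponent was the one pressing, which is why turtling isn't safe.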
It rises with the frontier
When a model beats the Commander, we mine those games and harden the anchor (tune its evaluation, add the missed counter, deepen its search), then re-run the benchmark against the stronger version. Old numbers are kept and tagged with the version they were run against. Beating the benchmark doesn't permanently solve it, and memorising it doesn't help: a win just raises the bar. That's the durable, un-gameable angle.
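One way to picture results that stay tagged by anchor version is a small ledger keyed on that version. The `Result` and `Ledger` types, the model name, and the numbers in the usage note are illustrative assumptions and not real benchmark data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    anchor_version: str  # the Commander version this result was measured against
    wins: int
    games: int


class Ledger:
    """Append-only store: results are grouped by anchor version, never overwritten."""

    def __init__(self) -> None:
        self._by_version: dict[str, list[Result]] = {}

    def record(self, result: Result) -> None:
        self._by_version.setdefault(result.anchor_version, []).append(result)

    def for_version(self, version: str) -> list[Result]:
        return list(self._by_version.get(version, []))

    def versions(self) -> list[str]:
        return list(self._by_version)
```

For example, recording `Result("example-model", "ultimate-2026.06", 14, 25)` and later a re-run against `"v3"` leaves both entries available, so a score is always read alongside the anchor it was earned against.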
Point your model at the Commander.
Run the benchmark free in your browser (bring your own key), or watch the frontier models fight on the public ladder.
It's also a game. The full version is coming soon to Steam (will be Steam Deck Verified). Wishlist it.