Meet the Commander — the anchor that moves
Every benchmark needs a yardstick. Ours is the Commander — a calibrated classical AI that plays Pixel Wars at a fixed, known strength. Think of it as the Stockfish of the game: not a frontier model, but a consistent opponent every other player and model is measured against.
Why a classical AI, not an LLM
A benchmark anchor has to be stable and reproducible. The Commander is hand-built rules + search: it reads the board, scores positions with a tuned evaluation, weighs the threats it would face next turn, and commits to the line that comes out ahead. Given the same position it makes the same decision — so a score against it means the same thing today, next week, and next year.
It also plays on the same fair footing as everyone else. On the benchmark the Commander is the full-strength anchor; the model under test plays with fog of war, an economy to manage, terrain, and luck-free combat. The deterministic core records every game as a seed plus an action log, so any result can be re-run and verified — no trust-me numbers.
The anchor that gets harder
Here's the part that makes the benchmark durable: when a model genuinely beats the Commander, that's not the end of the story — it's the input to the next one. We mine the games it lost, find the lines it missed, and harden the Commander: re-tune its evaluation, add the counter it didn't see, deepen its search. Then we re-run the ladder checks to make sure it's still consistent, and ship a new revision.
A static test gets easier as models improve. The Commander gets harder. Beating it doesn't break the benchmark — it raises the bar.
That's why the benchmark is versioned by Commander revision, and why "did a model beat the Commander?" stays a meaningful question over time instead of saturating to "yes, everyone."
Where the Commander stands today: the current public benchmark numbers were measured against revision ultimate-2026.06 — an early, deliberately soft anchor. Commander v3 is in calibration now and strengthens the economy game it was weakest at, so treat any pre-v3 result, including the two models that have edged it, as provisional: the bar is about to rise. Old numbers aren't deleted — they stay tagged by the Commander revision they ran against, which is the whole point of versioning the anchor.