Benchmark

Deterministic — no LLM-as-judge Replay-verified Fresh maps — nothing to memorize Self-improving anchor BYOK — reproduce any row

GitHub · How it works · Methodology

Loading results…

How a match runs

Every match runs the same three steps — and it's graded by the rules, not a judge model.

1 · Generate

A fresh, mirror-symmetric map from a seed — procedural, so there's no fixed test set to leak into training data or memorize.

2 · Play

The model plans a whole turn under fog of war against the Commander, over dozens of moves, to a win, a loss, or the 200-turn cap.

3 · Grade & verify

A deterministic win/loss + margin score — no LLM-as-judge. The server re-runs the seed + action log to confirm every result.

Methodology

Official

Our models, our keys, server-side, vs the full-strength Commander — best-of-25 (25 games per battlefield), fog on, large maps. Replay-verified, versioned by Commander revision.

Community

Self-reported BYOK runs researchers submit from the in-browser benchmark. Game logs are replay-checked, but model identity is self-asserted — read it as crowd signal, not the official ranking.

Reading a row

Win% is decisive wins only. pts% is margin-weighted (win 1.0 · draw 0.5 · loss-on-points ~0.2 · loss 0): a true mutual deadlock scores ~0.5, but a one-sided turtle that runs down the clock scores below it — the pressing opponent gets the credit — so stalling isn't safe. Record is the full split — W win · Wp win-on-points · Lp loss-on-points · L loss · D draw. Aggregate spans all battlefields; each map is also shown alone.

See how a model actually plays — step through a real game with the engine's best line on every move: DeepSeek (leads) · GPT-5.4 mini (overrun) · Haiku (collapses) · Commander vs Commander (baseline).

FAQ

Why are the scores low?

It's a hard, honest eval against a calibrated opponent that punishes every mistake — low scores are signal, not noise. And the anchor hardens when it's beaten, so the bar keeps rising.

Can it be gamed?

No LLM-as-judge and no fixed answer key. The outcome is a deterministic win/loss + a margin score that punishes turtling, on a map the model has never seen — and every game is replay-verified.

Can I reproduce a result?

Yes — every result is a seed + an action log; re-run it and you get the same outcome. Bring your own key and benchmark any model in-browser (an Inspect-compatible task wrapper is coming).

What is the Commander?

A calibrated classical AI — the fixed, versioned anchor every model is measured against. When a model beats it, we mine those games and ship a stronger revision.

Why would this transfer?

We don't claim a score proves real-world agent capability — transfer is a hypothesis we validate in the open. What Pixel Wars isolates is four capabilities saturated quizzes under-test: long-horizon planning over ~40–100 moves, hidden-state tracking under fog, adapting to an adversary that punishes mistakes, and resource allocation under pressure. We commit to publishing the per-capability breakdown so you can judge transfer yourself.