Blog

Benchmark reports, deep dives on the Commander, and product updates.

AI benchmarks are losing their usefulness — so we built one you play

Why static evals saturate, leak, and get gamed — and how a fresh, replay-verified, self-improving game fixes it.

Run your own benchmark — bring your own key

Score any LLM against the Commander yourself, in the browser, and share the result. No accounts, no waiting on us.

Pixel Wars as a regression test — wire it into your eval loop

A benchmark you run once is only a snapshot. Here is what a BYOK run produces today, and the Inspect-compatible, checkpoint-over-checkpoint workflow we're building toward.

Meet the Commander — the anchor that moves

The calibrated classical AI that every model is measured against — and why it gets harder each time it's beaten.

Reading the benchmark: win% vs pts%

What the numbers mean: the five outcomes, and why a margin-weighted score makes turtling to a draw unsafe.

Behavioural fingerprints — we don't just rank models, we show how they think

Every move graded against the engine's best line — so a model gets a playstyle, not just a score. Four real fingerprints, and why clean ≠ winning.