Reading the benchmark: win% vs pts%
The benchmark table shows two headline numbers per model: win% and pts%. They answer different questions, and the gap between them is where the interesting signal lives.
win% — did you actually win?
Decisive win% is the simplest measure: the share of games the model won outright (captured the HQ or wiped the opponent) before the turn cap. It's binary and unforgiving. Against the full-strength Commander, most models sit at 0% here — winning outright is hard.
pts% — how the game ended
If a game reaches the turn cap without a decisive result, calling it a flat "draw" throws away real information: a model that spent the whole game attacking and ended well ahead on the board is not the same as one that hid in a corner. So every game is classified as one of five outcomes:
- win / loss — decisive, before the cap.
- win by points / loss by points — timed out, but one side was clearly ahead on the objective margin (army value + territory + HQ siege).
- draw — a genuine deadlock, neither side pressing.
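The five outcomes above can be sketched as a small classifier. This is only an illustration, not the benchmark's actual code: the `classify` function, the `decisive_winner` and `margin` inputs, and the `clear_margin` threshold that decides what counts as "clearly ahead" are all assumptions made for the example.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    WIN = "win"
    WIN_BY_POINTS = "win by points"
    DRAW = "draw"
    LOSS_BY_POINTS = "loss by points"
    LOSS = "loss"


def classify(decisive_winner: Optional[str], margin: float,
             clear_margin: float = 0.1) -> Outcome:
    """Map one finished game to one of the five outcomes.

    decisive_winner: "model", "opponent", or None if the turn cap was hit.
    margin: the model's objective margin minus the opponent's at the cap
            (hypothetical scale; positive means the model was ahead).
    clear_margin: assumed threshold for being "clearly ahead".
    """
    # Decisive results (HQ captured or opponent wiped out) come first.
    if decisive_winner == "model":
        return Outcome.WIN
    if decisive_winner == "opponent":
        return Outcome.LOSS
    # Timed out: decide on the objective margin.
    if margin >= clear_margin:
        return Outcome.WIN_BY_POINTS
    if margin <= -clear_margin:
        return Outcome.LOSS_BY_POINTS
    # Neither side clearly ahead: a genuine deadlock.
    return Outcome.DRAW
```

For instance, `classify(None, 0.3)` would be a win by points under these assumed values, while `classify(None, 0.05)` would count as a draw.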
pts% rolls those into a single margin-weighted score. The effect: a model that keeps pressing the attack and ends a timed-out game ahead is rewarded, while one that turtles for a "safe" draw is not. Turtling stops being a strategy for gaming the number.
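To make the difference concrete, here is a minimal sketch of how the two numbers could diverge. The exact weighting behind pts% is not given here, so the fixed per-outcome weights in `ASSUMED_WEIGHTS` are placeholders chosen for illustration (a real margin-weighted score would presumably scale with how far ahead a side finished), and the function names are hypothetical.

```python
# Placeholder weights, NOT the benchmark's real values.
ASSUMED_WEIGHTS = {
    "win": 1.0,
    "win by points": 0.5,
    "draw": 0.0,
    "loss by points": 0.0,
    "loss": 0.0,
}


def win_pct(outcomes):
    """Share of games won decisively, as a percentage."""
    if not outcomes:
        raise ValueError("need at least one game")
    return 100.0 * sum(o == "win" for o in outcomes) / len(outcomes)


def pts_pct(outcomes, weights=ASSUMED_WEIGHTS):
    """Average per-game credit under the assumed weights, as a percentage."""
    if not outcomes:
        raise ValueError("need at least one game")
    return 100.0 * sum(weights[o] for o in outcomes) / len(outcomes)


# Two made-up models, neither of which ever wins outright.
presser = ["win by points", "win by points", "loss by points", "draw"]
turtle = ["draw", "draw", "draw", "draw"]

print(win_pct(presser), pts_pct(presser))
print(win_pct(turtle), pts_pct(turtle))
```

Both made-up models have the same win%, but under the assumed weights only the one that pressed the attack earns any pts%.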
win% asks "did you win?"; pts% asks "were you winning?" On a hard anchor, the second question is what actually separates models.
Why per-battlefield matters
Map type changes the game. Water-heavy maps reward different play than open plains or dense mountains, and some are simply harder for a model to reason about. A number from a single map can flatter or punish a model, depending on which map it is. That's why the benchmark reports per-battlefield results and an aggregate, and why the in-browser tool lets you run all of them.
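A rough sketch of reporting per-battlefield results next to an aggregate follows. The `per_battlefield` helper, the battlefield names, and the scores are all invented for illustration, and the aggregate is assumed to be a plain average over all games rather than an average of per-map averages.

```python
from collections import defaultdict
from statistics import mean


def per_battlefield(results):
    """results: list of (battlefield, score) pairs, with score in [0, 1].

    Returns (per-map percentages, aggregate percentage over all games).
    """
    by_map = defaultdict(list)
    for battlefield, score in results:
        by_map[battlefield].append(score)
    table = {name: 100.0 * mean(scores) for name, scores in by_map.items()}
    aggregate = 100.0 * mean(score for _, score in results)
    return table, aggregate


# Made-up scores: strong on plains, nothing on water.
results = [("water", 0.0), ("water", 0.0), ("plains", 1.0), ("plains", 0.5)]
table, aggregate = per_battlefield(results)
print(table)
print(aggregate)
```

With these invented scores the aggregate lands between the two maps and matches neither, which is the kind of spread a single number hides.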