Here’s something that doesn’t get talked about enough: evaluating AI models has become absurdly expensive. Not training them — that’s the part everyone fixates on — but the boring, necessary work of figuring out whether they actually work.
A team recently ran the Holistic Agent Leaderboard (HAL) and dropped about $40,000 on 21,730 agent rollouts across 9 models and 9 benchmarks. That’s not a one-off. A single GAIA run on a frontier model? $2,829 before you even think about caching. Exgentic’s sweep across agent configurations came to $22,000, and they found a 33× cost spread on identical tasks just because of scaffold choice. UK-AISI has scaled agentic steps into the millions to study inference-time compute. This is not a niche problem.
The cost problem predates agents, by the way. When Stanford’s CRFM released HELM in 2022, they reported per-model API costs ranging from $85 for OpenAI’s code-cushman-001 to $10,926 for AI21’s J1-Jumbo (178B). Open models needed 540 to 4,200 GPU-hours, with BLOOM (176B) and OPT (175B) at the top end. Across HELM’s 30 models and 42 scenarios, the aggregate came to roughly $100,000. That was three years ago.
Perlitz et al. dug into EleutherAI’s Pythia checkpoints and found something worse: developers pay for evaluation repeatedly during model development. Pythia released 154 checkpoints for each of 16 models across 8 sizes — 2,464 checkpoints total. Running the LM Evaluation Harness across all of them turns eval into a multiplier on training. For small models, evaluation becomes the dominant compute line item across the whole development cycle. When you scale inference-time compute, you scale evaluation costs.
Here’s the thing that bothered me when I first read this: most of that compute was wasted. Perlitz et al. found that a 100× to 200× reduction in compute preserved nearly the same ranking. Flash-HELM turned that into a coarse-to-fine procedure: run cheap evaluations first, then spend high-resolution compute only on the top candidates. Much of HELM’s compute was confirming rankings the field could have inferred much more cheaply.
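To make the coarse-to-fine idea concrete, here is a minimal sketch of a generic successive-pruning loop. It is not Flash-HELM's actual procedure. The function name, the budgets, the keep fraction, and the simulated evaluator are all assumptions made up for illustration.

```python
import random

def coarse_to_fine(models, evaluate, budgets=(50, 500, 5000), keep=0.5):
    """Rank models by evaluating on growing sample sizes, pruning each round.

    models:   list of model identifiers
    evaluate: callable (model, n_examples) -> score, higher is better
    budgets:  examples per model at each round, cheapest first (hypothetical values)
    keep:     fraction of survivors carried into the next round
    Returns (ranking, total_examples_used). Models pruned earlier rank lower.
    """
    survivors = list(models)
    eliminated = []  # pruned models, best-first
    used = 0
    for i, n in enumerate(budgets):
        scored = sorted(survivors, key=lambda m: evaluate(m, n), reverse=True)
        used += n * len(survivors)
        if i == len(budgets) - 1:
            survivors = scored
            break
        cut = max(1, int(len(scored) * keep))
        # Later rounds prune stronger models, so they go ahead of earlier cuts.
        eliminated = scored[cut:] + eliminated
        survivors = scored[:cut]
    return survivors + eliminated, used

# Toy setup: eight made-up models with made-up "true" accuracies.
rng = random.Random(0)
true_acc = {f"model_{i}": 0.40 + 0.05 * i for i in range(8)}

def simulated_eval(model, n):
    """Stand-in for a real benchmark run: n simulated pass/fail examples."""
    return sum(rng.random() < true_acc[model] for _ in range(n)) / n

ranking, used = coarse_to_fine(list(true_acc), simulated_eval)
full = 5000 * len(true_acc)
print(ranking[:2], f"{used:,} examples instead of {full:,}")
```

With these toy settings the loop spends 12,400 examples where a full-resolution pass over all eight models would spend 40,000. The expensive round only ever touches the two finalists, which is the whole point: the bottom of the ranking gets decided on cheap, noisy evidence.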
Other work reached the same conclusion. tinyBenchmarks compressed MMLU from 14,000 items to 100 anchor items at about 2% error using Item Response Theory. The Open LLM Leaderboard suite was cut from 29,000 examples to 180. Anchor Points showed that as few as 1 to 30 examples could rank-order 87 language-model/prompt pairs on GLUE. Static benchmarks had a weakness you could exploit: model differences often concentrate in a small subset of items, so ranking can survive aggressive subsampling.
That trick weakens sharply once you move from static predictions to agents.
Agent evals are a different beast
Agent benchmarks are fundamentally messier. They don’t test “the model” in isolation. They test a model × scaffold × token-budget product, and small scaffold choices can multiply costs 10×. The HAL paper notes “a 9× difference in cost despite just a two-percentage-point difference in accuracy” between two configurations on Online Mind2Web. On GAIA, one agent cost $2,828 for 28.5% accuracy while another hit 57.6% for $1,686. CLEAR found across 6 SOTA agents on 300 enterprise tasks that “accuracy-optimal configurations cost 4.4 to 10.8× more than Pareto-efficient alternatives” with comparable real-world performance.
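The "Pareto-efficient" framing is easy to check by hand. Here is a small sketch that filters a list of configurations down to the ones no other configuration beats on both cost and accuracy. The two data points are the GAIA numbers quoted above; the `Config` type, the `pareto_front` helper, and the agent names are placeholders of mine, not anything from HAL or CLEAR.

```python
from typing import NamedTuple

class Config(NamedTuple):
    name: str
    cost_usd: float
    accuracy: float  # percent

def pareto_front(configs):
    """Return the configs that no other config dominates, cheapest first.

    A dominates B when A is no more expensive and no less accurate than B,
    and strictly better on at least one of the two.
    """
    front = []
    for c in configs:
        dominated = any(
            o.cost_usd <= c.cost_usd and o.accuracy >= c.accuracy
            and (o.cost_usd < c.cost_usd or o.accuracy > c.accuracy)
            for o in configs
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.cost_usd)

# The two GAIA data points quoted above; the names are placeholders.
gaia = [
    Config("agent_a", 2828.0, 28.5),
    Config("agent_b", 1686.0, 57.6),
]
front = pareto_front(gaia)
print([c.name for c in front])  # ['agent_b']
```

The pricier agent drops out entirely: it costs more and scores lower, so there is no budget at which it is the right choice. A leaderboard that reports accuracy alone hides that.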
Behind these numbers is a blunt pricing fact. Claude Opus 4.1 charges $15 per million input tokens and $75 per million output. Gemini 2.0 Flash charges $0.10 and $0.40 — a two-order-of-magnitude spread on input alone. When your eval involves thousands of agent rollouts, that spread becomes real money fast.
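A back-of-the-envelope calculation shows how quickly that adds up. The prices below are the ones quoted above. The workload (1,000 rollouts at 200k input and 10k output tokens each) is a number I made up for illustration, and the sketch ignores caching and batch discounts.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pricing:
    """Per-million-token prices in USD."""
    input_per_mtok: float
    output_per_mtok: float

# Prices quoted in the text above.
CLAUDE_OPUS_4_1 = Pricing(input_per_mtok=15.00, output_per_mtok=75.00)
GEMINI_2_0_FLASH = Pricing(input_per_mtok=0.10, output_per_mtok=0.40)

def sweep_cost(pricing, rollouts, input_tokens, output_tokens):
    """Total USD cost of `rollouts` runs, each using the given token counts."""
    per_rollout = (input_tokens * pricing.input_per_mtok
                   + output_tokens * pricing.output_per_mtok) / 1_000_000
    return rollouts * per_rollout

# Hypothetical workload, not taken from any of the papers cited here.
ROLLOUTS, IN_TOK, OUT_TOK = 1_000, 200_000, 10_000
opus = sweep_cost(CLAUDE_OPUS_4_1, ROLLOUTS, IN_TOK, OUT_TOK)
flash = sweep_cost(GEMINI_2_0_FLASH, ROLLOUTS, IN_TOK, OUT_TOK)
print(f"Opus 4.1: ${opus:,.2f}")
print(f"Flash:    ${flash:,.2f}")
print(f"Ratio:    {opus / flash:.0f}x")
```

Under those assumed token counts the same sweep costs about $3,750 on one model and about $24 on the other, a ratio of roughly 156×.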
The HAL paper’s own accounting is instructive. By April 2026, the leaderboard had grown to 26,597 rollouts. Ndzomga’s independent reproduction lands in the same range as the roughly $40,000 reported for the original run: $46,000 across 242 agent runs. Behind that aggregate, the cost of a single benchmark run varies by four orders of magnitude across HAL tasks, and by three orders within some individual benchmarks.
What this means for the field
The implication is uncomfortable. If evaluation costs continue to scale with inference-time compute, we’re heading toward a world where only well-funded labs can properly evaluate their models. The kind of iterative, open experimentation that drove progress in the early days of deep learning becomes prohibitively expensive.
There are proposed solutions. Compression techniques for static benchmarks have been around for a while. But agent benchmarks are noisy, scaffold-sensitive, and only partly compressible. Training-in-the-loop benchmarks are expensive by construction. And when you try to add reliability to these evals — running repeated trials to account for variance — costs multiply further.
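To get a feel for how much repeated trials multiply the bill, here is a rough sketch using the textbook normal-approximation confidence interval for a proportion. It treats every rollout as an independent pass/fail draw with the same success rate, which real agent benchmarks do not satisfy, so read the output as an order-of-magnitude guide. The function names and the example numbers (300 tasks, 40% accuracy, a two-point margin) are mine.

```python
import math

def rollouts_for_margin(accuracy, margin, z=1.96):
    """Rollouts needed so a normal-approximation CI on accuracy has
    half-width `margin`. Assumes independent pass/fail rollouts."""
    return math.ceil(z**2 * accuracy * (1 - accuracy) / margin**2)

def repeats_needed(n_tasks, accuracy, margin, z=1.96):
    """Full passes over an n_tasks benchmark to reach that many rollouts."""
    return math.ceil(rollouts_for_margin(accuracy, margin, z) / n_tasks)

# Hypothetical: 300 tasks, ~40% accuracy, want +/- 2 percentage points at 95%.
n = rollouts_for_margin(0.40, 0.02)
k = repeats_needed(300, 0.40, 0.02)
print(f"{n} rollouts, i.e. {k} full passes over the benchmark")
```

Under those assumptions you need 2,305 rollouts, or 8 full passes over a 300-task benchmark, to pin accuracy down to within two points. Whatever a single pass costs, the reliable version costs about eight times that.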
Some groups are working on this. Perlitz et al.’s Flash-HELM approach is a start. But the field needs something more systematic. We need standardized cost reporting, shared scaffolds, and maybe a tiered evaluation system where cheap initial passes filter candidates before expensive full evaluations.
Until then, the cost of knowing whether your model actually works might be the thing that slows the field down more than anything else. And that’s a bottleneck nobody saw coming.