arxiv:2510.25039

Automating Benchmark Design

Published on Oct 28 · Submitted by Amanda Dsouza on Oct 30

Abstract

The rapid progress and widespread deployment of LLMs and LLM-powered agents have outpaced our ability to evaluate them. Hand-crafted, static benchmarks are the primary tool for assessing model capabilities, but they quickly become saturated. In contrast, dynamic benchmarks evolve alongside the models they evaluate, but are expensive to create and continuously update. To address these challenges, we develop BeTaL (Benchmark Tuning with an LLM-in-the-loop), a framework that leverages environment design principles to automate dynamic benchmark design. BeTaL parameterizes key design choices in base benchmark templates and uses LLMs to reason through the resulting parameter space to obtain target properties (such as difficulty and realism) in a cost-efficient manner. We validate this approach on its ability to create benchmarks with desired difficulty levels. Using BeTaL, we create two new benchmarks and extend the popular agentic benchmark tau-bench. Extensive evaluation on these three tasks and multiple target difficulty levels shows that BeTaL produces benchmarks much closer to the desired difficulty, with average deviations ranging from 5.3% to 13.2% -- a 2-4x improvement over the baselines.

Community

Paper submitter

Static, human-curated benchmarks like GPQA and HLE are costly to develop and quickly become obsolete as model progress outpaces evaluation. We introduce BeTaL (Benchmark Tuning with an LLM-in-the-loop), a framework that automates dynamic benchmark design by parameterizing benchmark templates and iteratively using LLMs to reason over the design space. BeTaL generates benchmarks with target properties, such as specific difficulty levels or realism constraints, and aligns significantly more closely with desired difficulty targets than baseline methods. A minimal sketch of this tuning loop is shown below.
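To make the loop concrete, here is a minimal Python sketch of a BeTaL-style iteration under stated assumptions. Nothing in it is the paper's actual code: the function names (`tune_benchmark`, `propose`), the difficulty proxy, the stopping tolerance, and the toy proposer are all hypothetical, and the paper's real interfaces and LLM prompting strategy may differ.

```python
from typing import Callable

def tune_benchmark(
    instantiate: Callable[[dict], object],        # params -> concrete benchmark
    measure_difficulty: Callable[[object], float],  # benchmark -> difficulty in [0, 1]
    propose: Callable[[list, float], dict],       # (history, target) -> new params
    init_params: dict,
    target: float,
    tol: float = 0.05,
    max_iters: int = 10,
):
    """Iterate until the measured difficulty is within `tol` of `target`."""
    params, history, benchmark = init_params, [], None
    for _ in range(max_iters):
        benchmark = instantiate(params)             # build concrete tasks from the template
        difficulty = measure_difficulty(benchmark)  # e.g. 1 - solve rate of reference models
        history.append((params, difficulty))
        if abs(difficulty - target) <= tol:
            break
        # In BeTaL this step is an LLM reasoning over the past
        # (params, difficulty) pairs to propose new parameter values;
        # here it is any proposer with the same signature.
        params = propose(history, target)
    return benchmark, history

# Toy usage with a synthetic "benchmark" whose difficulty grows with
# one parameter; a deterministic proposer stands in for the LLM.
best, trace = tune_benchmark(
    instantiate=lambda p: p["n_distractors"],
    measure_difficulty=lambda b: min(1.0, 0.1 * b),
    propose=lambda hist, t: {
        "n_distractors": hist[-1][0]["n_distractors"] + (1 if hist[-1][1] < t else -1)
    },
    init_params={"n_distractors": 1},
    target=0.7,
)
```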
