Abstract
The rapid progress and widespread deployment of LLMs and LLM-powered agents have outpaced our ability to evaluate them. Hand-crafted, static benchmarks are the primary tool for assessing model capabilities, but they quickly become saturated. Dynamic benchmarks, in contrast, evolve alongside the models they evaluate, but are expensive to create and continuously update. To address these challenges, we develop BeTaL (Benchmark Tuning with an LLM-in-the-loop), a framework that leverages environment design principles to automate the process of dynamic benchmark design. BeTaL works by parameterizing key design choices in base benchmark templates and using LLMs to reason through the resulting parameter space to obtain target properties (such as difficulty and realism) in a cost-efficient manner. We validate this approach on its ability to create benchmarks with desired difficulty levels. Using BeTaL, we create two new benchmarks and extend a popular agentic benchmark, tau-bench. Extensive evaluation on these three tasks and multiple target difficulty levels shows that BeTaL produces benchmarks much closer to the desired difficulty, with average deviations ranging from 5.3% to 13.2% -- a 2-4x improvement over the baselines.
Community
Static, human-curated benchmarks such as GPQA and HLE are costly to develop and quickly become obsolete as models improve faster than evaluation can keep pace. We introduce BeTaL (Benchmark Tuning with an LLM-in-the-loop), a framework that automates dynamic benchmark design by parameterizing benchmark templates and iteratively using LLMs to reason over the design space. BeTaL generates benchmarks with target properties, such as specific difficulty levels or realism constraints, and achieves significantly better alignment with desired difficulty targets than baseline methods.
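To make the iterative tuning idea concrete, below is a minimal, hypothetical sketch of a BeTaL-style loop: an LLM (or, here, a stand-in heuristic) proposes parameter settings for a benchmark template, the resulting benchmark's difficulty is measured, and the loop repeats until the measured difficulty is close to the target. The function names (`propose_parameters`, `measure_difficulty`, `tune_benchmark`), the parameter names, and the difficulty model are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of an LLM-in-the-loop benchmark tuning cycle (not the paper's code).
# Assumptions: "difficulty" is an error rate in [0, 1]; propose_parameters stands in for
# the LLM reasoning step; measure_difficulty stands in for instantiating the benchmark
# template and running reference models on it.

from dataclasses import dataclass, field


@dataclass
class TuningState:
    # Past (params, measured_difficulty) observations shown to the proposer each round.
    history: list = field(default_factory=list)


def propose_parameters(state: TuningState, target: float) -> dict:
    """Placeholder for the LLM step: propose the next template parameters."""
    if not state.history:
        return {"num_constraints": 3, "distractor_items": 5}
    params, measured = state.history[-1]
    # Naive heuristic standing in for LLM reasoning: harden or soften the template
    # depending on whether the last run undershot or overshot the target difficulty.
    step = 1 if measured < target else -1
    return {
        "num_constraints": max(1, params["num_constraints"] + step),
        "distractor_items": max(0, params["distractor_items"] + 2 * step),
    }


def measure_difficulty(params: dict) -> float:
    """Placeholder: instantiate the benchmark and return the observed error rate."""
    return min(1.0, 0.1 * params["num_constraints"] + 0.02 * params["distractor_items"])


def tune_benchmark(target_difficulty: float, max_iters: int = 10, tol: float = 0.05) -> dict:
    """Iterate propose -> instantiate -> measure until close to the target difficulty."""
    state = TuningState()
    best_params, best_gap = None, float("inf")
    for _ in range(max_iters):
        params = propose_parameters(state, target_difficulty)
        measured = measure_difficulty(params)
        state.history.append((params, measured))
        gap = abs(measured - target_difficulty)
        if gap < best_gap:
            best_params, best_gap = params, gap
        if gap <= tol:
            break
    return best_params


if __name__ == "__main__":
    print(tune_benchmark(target_difficulty=0.6))
```

In the paper's setting, the proposer would be an LLM conditioned on the tuning history, and the difficulty measurement would come from running evaluation models on the instantiated benchmark; the loop structure is the same.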