To submit a new agent to the CORE leaderboard, follow these steps:
1. Run your agent on the CORE-Bench Harness. When developing your agent, ensure that it generates a file named `agent_trace.log` in the base directory in which it is invoked, for each run. The content of this file must be in JSON format and must include at least the keys `cost` and `agent_trace`:

   ```json
   {
     "cost": 0.59,
     "agent_trace": "The agent trace is a string that describes the intermediate steps the agent took to arrive at the final solution. This trace does not need to follow a specific format."
   }
   ```

   - `cost`: A float representing the total cost (USD) of API calls made by the agent. We recommend using Weave for easy cost logging.
   - `agent_trace`: A string describing the steps your agent took to arrive at its final solution. It should adhere to the following guidelines, inspired by SWE-Bench:
     - Human-readable.
     - Reflects the intermediate steps your system took that led to the final solution.
     - Generated with the inference process, not post-hoc.

   If you have any trouble implementing this, feel free to reach out to us for support.
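For concreteness, here is a minimal sketch of how an agent might write this file at the end of a run. The function name, run-directory argument, and example values are illustrative, not part of the harness API:

```python
import json
from pathlib import Path


def write_agent_trace(run_dir: str, cost: float, steps: list) -> None:
    """Write the agent_trace.log file the harness expects in the run's base directory."""
    payload = {
        "cost": cost,                     # total USD cost of API calls for this run
        "agent_trace": "\n".join(steps),  # human-readable record of intermediate steps
    }
    with open(Path(run_dir) / "agent_trace.log", "w") as f:
        json.dump(payload, f, indent=2)


# Example usage at the end of a run, invoked from the base directory:
write_agent_trace(".", 0.59, [
    "Read the task description",
    "Ran the analysis script",
    "Wrote the final report",
])
```

The trace here is a simple newline-joined list of step descriptions; any human-readable string produced during inference is acceptable.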
2. Run your agent on all tasks of the test set. You will almost certainly need to run your agent using our Azure VM harness (with the `--use_azure` flag) to avoid long experiment times. Set the `--experiment_name` flag to the name of your agent. You can submit results for any of the three levels of the benchmark: CORE-Bench-Easy, CORE-Bench-Medium, or CORE-Bench-Hard.

3. Submit the following two directories from the harness:
   - `benchmark/results/[experiment_name]`: Contains the results of your agent on each task.
   - `benchmark/logs/[experiment_name]`: Contains the logs of your agent's execution on each task (i.e., the `agent_trace.log` files your agent writes).
   - These files are automatically generated by the harness when you run your agent; you should not modify them manually.
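Before packaging a submission, it can help to sanity-check that both directories exist and that the logs contain the expected trace files. A small sketch, assuming a `logs/[experiment_name]/[task]/agent_trace.log` layout (the exact per-task layout inside the logs directory is an assumption, so adjust the check to what your harness run actually produced):

```python
from pathlib import Path


def check_submission_dirs(experiment_name: str) -> list:
    """Return a list of problems found with the results/logs directories; empty if none."""
    problems = []
    results = Path("benchmark/results") / experiment_name
    logs = Path("benchmark/logs") / experiment_name
    for d in (results, logs):
        if not d.is_dir():
            problems.append(f"missing directory: {d}")
    if logs.is_dir():
        # Assumed layout: each task gets its own subdirectory holding agent_trace.log.
        for task_dir in sorted(p for p in logs.iterdir() if p.is_dir()):
            if not (task_dir / "agent_trace.log").exists():
                problems.append(f"no agent_trace.log in {task_dir}")
    return problems
```

Running `check_submission_dirs("my_agent")` and confirming it returns an empty list gives some confidence before emailing the archives.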
4. Compress these directories into two `.tar.gz` or `.zip` files and email them to zss@princeton.edu. If the files are too large to email, please upload them to Google Drive, Dropbox, etc., and email the link. In the body of the email, please also include the name of your agent as you wish it to appear on the leaderboard.

5. [Optional] We highly encourage you to submit the files of your agent (i.e., `benchmark/agents/[agent_name]`) so we can verify its performance on the leaderboard. If you choose to do so, compress this directory into a `.tar.gz` file and include it in the email.
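The compression step can be done with any standard archiver; as a sketch in Python using the standard library (the experiment name and output filenames below are placeholders):

```python
import tarfile
from pathlib import Path


def make_tarball(src_dir: str, out_path: str) -> None:
    """Compress a directory into a .tar.gz archive, preserving its top-level name."""
    src = Path(src_dir)
    with tarfile.open(out_path, "w:gz") as tar:
        # arcname keeps the directory's own name as the archive root.
        tar.add(src, arcname=src.name)


# Example (assuming the harness has produced these directories):
# make_tarball("benchmark/results/my_agent", "results_my_agent.tar.gz")
# make_tarball("benchmark/logs/my_agent", "logs_my_agent.tar.gz")
```

Using `arcname=src.name` means the archive unpacks to a single top-level folder rather than the full `benchmark/...` path, which keeps the submission tidy.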