|
|
<!DOCTYPE html> |
|
|
<html lang="en"> |
|
|
<head> |
|
|
<meta charset="UTF-8"> |
|
|
<meta http-equiv="X-UA-Compatible" content="IE=edge"> |
|
|
<meta name="viewport" content="width=device-width, initial-scale=1.0"> |
|
|
<title>AgentBoard: An Analytical Evaluation Board of Multi-Turn LLM Agents</title> |
|
|
<link rel="icon" href="static/img/icon/logo.png" type="image/icon type"> |
|
|
<link rel="stylesheet" href="static/css/leaderboard.css"> |
|
|
<link rel="stylesheet" href="static/css/main.css"> |
|
|
<link rel="stylesheet" href="static/css/nav.css"> |
|
|
<script src="https://cdn.jsdelivr.net/npm/chart.js@4.4.0/dist/chart.umd.min.js"></script> |
|
|
<script src="https://cdn.jsdelivr.net/npm/chartjs-plugin-datalabels@2.0.0"></script> |
|
|
<script src="https://cdn.jsdelivr.net/npm/chartjs-plugin-annotation@3.0.1/dist/chartjs-plugin-annotation.min.js"></script> |
|
|
|
|
|
<script async src="https://www.googletagmanager.com/gtag/js?id=G-NV0BT2WTBM"></script> |
|
|
<script> |
|
|
window.dataLayer = window.dataLayer || []; |
|
|
|
|
|
function gtag() { |
|
|
dataLayer.push(arguments); |
|
|
} |
|
|
|
|
|
gtag('js', new Date()); |
|
|
|
|
|
gtag('config', 'G-NV0BT2WTBM'); |
|
|
</script> |
|
|
</head> |
|
|
|
|
|
<body> |
|
|
|
|
|
<div id="nav"> |
|
|
<div id="icon"> |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
</div> |
|
|
<div> |
|
|
<a class="nav-button" href="index.html#home"> |
|
|
<img src="static/img/icon/home_icon.png" alt="Home">Home</a> |
|
|
<a class="nav-button" href="https://hkust-nlp.github.io/agentboard/static/leaderboard.html"> |
|
|
<img src="static/img/icon/leaderboard_icon.png" alt="Result">Result</a> |
|
|
<a class="nav-button" href="https://hkust-nlp.github.io/agentboard/static/explore.html"> |
|
|
<img src="static/img/icon/explore_icon.png" alt="Explore">Explore</a> |
|
|
</div> |
|
|
</div> |
|
|
|
|
|
|
|
|
|
|
|
<div id="home" style="position: absolute; top: 0;"></div> |
|
|
|
|
|
|
|
|
<div id="title"> |
|
|
<div id="title-wrapper"> |
|
|
<h1 id="title-text"><img src="static/img/icon/logo.png" id="title-icon" alt="AgentBoard">AgentBoard: An Analytical Evaluation Board of <br> |
|
|
Multi-Turn LLM Agents<br> |
|
|
(2024) |
|
|
</h1> |
|
|
</div> |
|
|
|
|
|
</div> |
|
|
|
|
|
|
|
|
<div class="centered-nav"> |
|
|
<a class="nav-button" href="http://arxiv.org/abs/2401.13178"> |
|
|
<img src="static/img/icon/paper_icon.png" alt="Paper">Paper |
|
|
</a> |
|
|
<a class="nav-button" href="https://github.com/hkust-nlp/AgentBoard/blob/main" class="ext-link" target="_blank"> |
|
|
<img src="static/img/icon/code_icon.png" alt="Code">Code |
|
|
</a> |
|
|
<a class="nav-button" href="https://huggingface.co/datasets/hkust-nlp/agentboard" class="ext-link" target="_blank"> |
|
|
<img src="static/img/icon/data_icon.png" alt="Data">Data |
|
|
</a> |
|
|
</div> |
|
|
|
|
|
<div class="main-body"> |
|
|
<div class="section"> |
|
|
<h2>About AgentBoard</h2> |
|
|
<p style="line-height: 150%"> |
|
|
<b>AgentBoard</b> is a benchmark designed for multi-turn LLM agents, complemented by an analytical |
|
|
evaluation board for detailed model assessment beyond final success rates.<br> |
|
|
Main Performance of different LLMs across various environments are shown below, please check our <a |
|
|
class="ext-link" href="https://hkust-nlp.github.io/agentboard/static/leaderboard.html">Result</a> for more details. |
|
|
</p> |
|
|
</div> |
|
|
</div> |
|
|
<div class="section_panel"> |
|
|
<div class="text-center btn-group"> |
|
|
<div class="btn-group btn-switch task-filter-selector task-filter-selector-index" data-toggle="buttons"> |
|
|
<button type="button" class="btn btn-container" disabled>Filter by Task:</button> |
|
|
<button type="button" class="btn btn-container active" id="filter-by-Avg">Avg</button> |
|
|
<button type="button" class="btn btn-container" id="filter-by-Embodied">Embodied</button> |
|
|
<button type="button" class="btn btn-container" id="filter-by-Game">Game</button> |
|
|
<button type="button" class="btn btn-container" id="filter-by-Web">Web</button> |
|
|
<button type="button" class="btn btn-container" id="filter-by-Tools">Tools</button> |
|
|
</div> |
|
|
<div class="btn-group btn-switch metric-filter-selector metric-filter-selector-index" data-toggle="buttons"> |
|
|
<button type="button" class="btn btn-container" disabled>Sort |
|
|
by: |
|
|
</button> |
|
|
<button type="button" class="btn btn-container active" |
|
|
id="sort-by-reward-score"> |
|
|
Progress Rate |
|
|
</button> |
|
|
|
|
|
<button type="button" class="btn btn-container" |
|
|
id="sort-by-success-rate"> |
|
|
Success Rate |
|
|
</button> |
|
|
|
|
|
<button type="button" class="btn btn-container" |
|
|
id="sort-by-grounding-acc"> |
|
|
Grounding Accuracy |
|
|
</button> |
|
|
</div> |
|
|
</div> |
|
|
<div style="display: flex; justify-content: center; width: 100%;"> |
|
|
<div class="chart-container" style="width: 750px; height: 640px"> |
|
|
<canvas id="chart-success-reward-rate" width="520" height="500"></canvas> |
|
|
</div> |
|
|
</div> |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
</div> |
|
|
<script src="static/javascript/main_results_show.js" type="module"></script> |
|
|
<div class="main-body"> |
|
|
<div class="section"> |
|
|
<h2>Illustrative Overview</h2> |
|
|
<p style="line-height: 150%"> |
|
|
<b>AgentBoard</b> consists of 9 diverse tasks and 1013 exemplary environments, covering a range from |
|
|
embodied AI and game agents to web and tool agents. |
|
|
Our environment provides <b>well-annotated subgoals</b> and <b>fine-grained interactions</b>. Furthermore, |
|
|
it provides <b>detailed analyses</b> for agent evaluation, as shown below. |
|
|
You may explore our dataset examples at <a class="ext-link" href="https://hkust-nlp.github.io/agentboard/static/explore.html">Explore</a>, or check |
|
|
our <a |
|
|
class="ext-link" href="http://arxiv.org/abs/2401.13178" target="_blank">paper</a> for more details. |
|
|
</p> |
|
|
<div class="example-box"> |
|
|
<figure id="example-img"> |
|
|
<img src="static/img/Overview.png" alt="overview.png"> |
|
|
</figure> |
|
|
</div> |
|
|
</div> |
|
|
|
|
|
|
|
|
<div class="section"> |
|
|
<div id="data" class="anchor"></div> |
|
|
<h2>Data</h2> |
|
|
<p> |
|
|
Our data can be directly downloaded on <a class="ext-link" |
|
|
href="https://huggingface.co/datasets/hkust-nlp/agentboard/tree/main" |
|
|
target="_blank">Huggingface |
|
|
datasets</a>. Please refer to our <a class="ext-link" href="https://github.com/hkust-nlp/AgentBoard?tab=readme-ov-file#6-data" |
|
|
target="_blank">github |
|
|
instructions</a> for how to read and use the data. |
|
|
</p> |
|
|
</div> |
|
|
<div class="section"> |
|
|
<div id="citation" class="anchor"></div> |
|
|
<h2>Citation</h2> |
|
|
|
|
|
|
|
|
|
|
|
<pre class="bibtax"> |
|
|
@misc{ma2024agentboard, |
|
|
title={AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents}, |
|
|
author={Chang Ma and Junlei Zhang and Zhihao Zhu and Cheng Yang and Yujiu Yang and Yaohui Jin and Zhenzhong Lan and Lingpeng Kong and Junxian He}, |
|
|
year={2024}, |
|
|
eprint={2401.13178}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.CL} |
|
|
} |
|
|
</pre> |
|
|
</div> |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<div class="section" id="contact"> |
|
|
<h2>Contact Us</h2> |
|
|
<p>Have any questions about AgentBoard? Please contact us at <a href="mailto:llmagentboard@gmail.com">llmagentboard@gmail.com</a> |
|
|
or create an issue on <a |
|
|
href="https://github.com/hkust-nlp/AgentBoard" class="ext-link" target="_blank"> Github</a>. |
|
|
For potential collaboration, please contact <a href="mailto:junxianh2@gmail.com">junxianh2@gmail.com</a>. |
|
|
</p> |
|
|
</div> |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
</div> |
|
|
|
|
|
<div id="footer" style="font-size: 12pt; text-align: center"> |
|
|
© 2024 |
|
|
</div> |
|
|
</body> |
|
|
|
|
|
</html> |
|
|
|