<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>MCP-BENCH: Benchmarking Tool-Using LLM Agents</title>
<script src="https://cdn.tailwindcss.com"></script>
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&display=swap" rel="stylesheet">
<style>
body {
font-family: 'Inter', sans-serif;
}
.gradient-text {
background: linear-gradient(to right, #4f46e5, #ec4899);
-webkit-background-clip: text;
-webkit-text-fill-color: transparent;
}
.table-hover tr:hover {
background-color: #f9fafb;
}
</style>
</head>
<body class="bg-gray-50 text-gray-800">
<div class="container mx-auto px-4 py-8 md:py-16 max-w-5xl">
<!-- Header Section -->
<header class="text-center mb-12">
<h1 class="text-4xl md:text-5xl font-bold text-gray-900 mb-2">
MCP-BENCH
</h1>
<h2 class="text-lg md:text-xl text-gray-600">
Benchmarking Tool-Using LLM Agents with Real-World Tasks via MCP Servers
</h2>
</header>
<!-- Leaderboard Section -->
<section id="leaderboard" class="mb-16">
<h3 class="text-2xl md:text-3xl font-bold text-center mb-8 gradient-text">Leaderboard</h3>
<div class="overflow-x-auto bg-white rounded-lg shadow-lg">
<table class="min-w-full text-sm text-left text-gray-600">
<thead class="bg-gray-100 text-xs text-gray-700 uppercase tracking-wider">
<tr>
<th scope="col" class="px-6 py-3 font-semibold">Rank</th>
<th scope="col" class="px-6 py-3 font-semibold">Model</th>
<th scope="col" class="px-6 py-3 font-semibold text-center">Overall Score</th>
<th scope="col" class="px-6 py-3 font-semibold text-center">Task Fulfillment (LLM Judge, 0&ndash;1)</th>
<th scope="col" class="px-6 py-3 font-semibold text-center">Graph Exact Match (%)</th>
</tr>
</thead>
<!--
LEADERBOARD DATA
To update the leaderboard, edit the rows (<tr>...</tr>) below.
Each row represents one model. The columns are:
1. Rank
2. Model Name
3. Overall Score (0-1)
4. Task Fulfillment (LLM Judge, 0-1)
5. Graph Exact Match (%)
Keep the rows sorted by Overall Score in descending order.
-->
<tbody class="divide-y divide-gray-200 table-hover">
<!-- Rank 1 -->
<tr class="border-b border-gray-200">
<td class="px-6 py-4 font-bold text-lg text-gray-900">1</td>
<td class="px-6 py-4 font-semibold text-gray-900">GPT-4o-mini</td>
<td class="px-6 py-4 font-semibold text-center text-indigo-600">0.691</td>
<td class="px-6 py-4 text-center">0.77</td>
<td class="px-6 py-4 text-center">52.4%</td>
</tr>
<!-- Rank 2 -->
<tr class="border-b border-gray-200">
<td class="px-6 py-4 font-bold text-lg text-gray-900">2</td>
<td class="px-6 py-4 font-semibold text-gray-900">Qwen-3-32b</td>
<td class="px-6 py-4 font-semibold text-center text-indigo-600">0.631</td>
<td class="px-6 py-4 text-center">0.57</td>
<td class="px-6 py-4 text-center">47.8%</td>
</tr>
<!-- Rank 3 -->
<tr class="border-b border-gray-200">
<td class="px-6 py-4 font-bold text-lg text-gray-900">3</td>
<td class="px-6 py-4 font-semibold text-gray-900">DeepSeek-R1-Qwen-32b</td>
<td class="px-6 py-4 font-semibold text-center text-indigo-600">0.587</td>
<td class="px-6 py-4 text-center">0.52</td>
<td class="px-6 py-4 text-center">43.5%</td>
</tr>
<!-- Rank 4 -->
<tr class="border-b border-gray-200">
<td class="px-6 py-4 font-bold text-lg text-gray-900">4</td>
<td class="px-6 py-4 font-semibold text-gray-900">Mistral-small-2403</td>
<td class="px-6 py-4 font-semibold text-center text-indigo-600">0.552</td>
<td class="px-6 py-4 text-center">0.49</td>
<td class="px-6 py-4 text-center">30.4%</td>
</tr>
<!-- Rank 5 -->
<tr class="border-b border-gray-200">
<td class="px-6 py-4 font-bold text-lg text-gray-900">5</td>
<td class="px-6 py-4 font-semibold text-gray-900">LLaMA-3.1-70b</td>
<td class="px-6 py-4 font-semibold text-center text-indigo-600">0.542</td>
<td class="px-6 py-4 text-center">0.50</td>
<td class="px-6 py-4 text-center">21.7%</td>
</tr>
<!-- Rank 6 -->
<tr class="border-b border-gray-200">
<td class="px-6 py-4 font-bold text-lg text-gray-900">6</td>
<td class="px-6 py-4 font-semibold text-gray-900">LLaMA-3.1-8b</td>
<td class="px-6 py-4 font-semibold text-center text-indigo-600">0.483</td>
<td class="px-6 py-4 text-center">0.43</td>
<td class="px-6 py-4 text-center">26.1%</td>
</tr>
<!-- Rank 7 -->
<tr class="border-b border-gray-200">
<td class="px-6 py-4 font-bold text-lg text-gray-900">7</td>
<td class="px-6 py-4 font-semibold text-gray-900">Mistral-7b-v0.3</td>
<td class="px-6 py-4 font-semibold text-center text-indigo-600">0.423</td>
<td class="px-6 py-4 text-center">0.50</td>
<td class="px-6 py-4 text-center">0.0%</td>
</tr>
<!-- Rank 8 -->
<tr class="border-b border-gray-200">
<td class="px-6 py-4 font-bold text-lg text-gray-900">8</td>
<td class="px-6 py-4 font-semibold text-gray-900">LLaMA-3-8b</td>
<td class="px-6 py-4 font-semibold text-center text-indigo-600">0.395</td>
<td class="px-6 py-4 text-center">0.51</td>
<td class="px-6 py-4 text-center">4.5%</td>
</tr>
</tbody>
</table>
</div>
<p class="text-xs text-gray-500 text-center mt-4">Leaderboard data from Table 1 of the MCP-BENCH paper. Last updated: August 7, 2025.</p>
</section>
<!-- Abstract Section -->
<section id="abstract" class="mb-16">
<h3 class="text-2xl md:text-3xl font-bold text-center mb-8">Abstract</h3>
<div class="bg-white p-8 rounded-lg shadow-md">
<p class="text-gray-700 leading-relaxed">
We introduce MCP-Bench, a new benchmark designed to evaluate large language models (LLMs) on realistic, multi-step tasks that require tool use, cross-tool coordination, and precise parameter control. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 31 live MCP servers spanning diverse real-world domains such as weather forecasting, stock analysis, scientific computing, and academic search. Tasks are structured as layered dependency graphs involving tools from one or more servers, testing an agent's ability to interpret tool schemas, plan coherent execution traces, retrieve relevant tools, and fill parameters with high structural and semantic fidelity. Unlike existing benchmarks, MCP-Bench targets real-world tool-use scenarios with complex input-output dependencies, diverse tool schemas, and multi-step reasoning requirements. We develop a multi-faceted evaluation framework that measures task success, tool-level execution accuracy, and alignment with ground-truth execution graphs. This includes metrics for tool name validity, schema compliance, graph exact match, structure-aware move distance, and semantic quality assessed by LLM-as-a-Judge. Experiments across 13 advanced LLMs—including GPT-4o, Claude 3, and LLaMA 3.1—reveal persistent challenges in long-horizon planning, tool reuse, and multi-server coordination. We release MCP-Bench, along with its evaluation toolkit, baseline results, and data synthesis pipeline, to enable robust and reproducible evaluation of agentic LLMs and to support future research on structured and scalable tool-based reasoning.
</p>
</div>
</section>
<!-- Links Section -->
<section id="links" class="text-center mb-16">
<div class="flex justify-center items-center space-x-4">
<a href="#" onclick="alert('Paper download link not available yet.'); return false;" class="inline-block bg-indigo-600 text-white font-semibold px-6 py-3 rounded-lg shadow-md hover:bg-indigo-700 transition-colors duration-300">
Download Paper (PDF)
</a>
<a href="#" onclick="alert('Code repository link not available yet.'); return false;" class="inline-block bg-gray-700 text-white font-semibold px-6 py-3 rounded-lg shadow-md hover:bg-gray-800 transition-colors duration-300">
View Code on GitHub
</a>
</div>
</section>
<!-- Citation Section -->
<section id="citation">
<h3 class="text-2xl md:text-3xl font-bold text-center mb-8">Citation</h3>
<div class="bg-gray-200 p-6 rounded-lg shadow-inner">
<pre class="text-sm text-gray-800 whitespace-pre-wrap break-words"><code>@misc{mcpbench2025,
title={{MCP-BENCH: Benchmarking Tool-Using LLM Agents with Real-World Tasks via MCP Servers}},
author={Your Name and Co-authors},
year={2025},
eprint={},
archivePrefix={arXiv},
primaryClass={cs.CL}
}</code></pre>
</div>
</section>
</div>
<!-- Footer -->
<footer class="text-center py-6 bg-gray-100 border-t border-gray-200">
<p class="text-sm text-gray-500">© 2025 MCP-BENCH Project. All Rights Reserved.</p>
</footer>
</body>
</html>