eugenesiow committed on
Commit
7b6b43e
·
1 Parent(s): 5dad6cc

Add update to index.html with leaderboard.

Files changed (1)
  1. index.html +177 -18
index.html CHANGED
@@ -1,19 +1,178 @@
- <!doctype html>
- <html>
- <head>
- <meta charset="utf-8" />
- <meta name="viewport" content="width=device-width" />
- <title>My static Space</title>
- <link rel="stylesheet" href="style.css" />
- </head>
- <body>
- <div class="card">
- <h1>Welcome to your static Space!</h1>
- <p>You can modify this app directly by editing <i>index.html</i> in the Files and versions tab.</p>
- <p>
- Also don't forget to check the
- <a href="https://huggingface.co/docs/hub/spaces" target="_blank">Spaces documentation</a>.
- </p>
- </div>
- </body>
  </html>
 
+ <!DOCTYPE html>
+ <html lang="en">
+ <head>
+ <meta charset="UTF-8">
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
+ <title>MCP-BENCH: Benchmarking Tool-Using LLM Agents</title>
+ <script src="https://cdn.tailwindcss.com"></script>
+ <link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&display=swap" rel="stylesheet">
+ <style>
+ body {
+ font-family: 'Inter', sans-serif;
+ }
+ .gradient-text {
+ background: linear-gradient(to right, #4f46e5, #ec4899);
+ -webkit-background-clip: text;
+ -webkit-text-fill-color: transparent;
+ }
+ .table-hover tr:hover {
+ background-color: #f9fafb;
+ }
+ </style>
+ </head>
+ <body class="bg-gray-50 text-gray-800">
+
+ <div class="container mx-auto px-4 py-8 md:py-16 max-w-5xl">
+
+ <!-- Header Section -->
+ <header class="text-center mb-12">
+ <h1 class="text-4xl md:text-5xl font-bold text-gray-900 mb-2">
+ MCP-BENCH
+ </h1>
+ <h2 class="text-lg md:text-xl text-gray-600">
+ Benchmarking Tool-Using LLM Agents with Real-World Tasks via MCP Servers
+ </h2>
+ </header>
+
+ <!-- Leaderboard Section -->
+ <section id="leaderboard" class="mb-16">
+ <h3 class="text-2xl md:text-3xl font-bold text-center mb-8 gradient-text">Leaderboard</h3>
+ <div class="overflow-x-auto bg-white rounded-lg shadow-lg">
+ <table class="min-w-full text-sm text-left text-gray-600">
+ <thead class="bg-gray-100 text-xs text-gray-700 uppercase tracking-wider">
+ <tr>
+ <th scope="col" class="px-6 py-3 font-semibold">Rank</th>
+ <th scope="col" class="px-6 py-3 font-semibold">Model</th>
+ <th scope="col" class="px-6 py-3 font-semibold text-center">Overall Score</th>
+ <th scope="col" class="px-6 py-3 font-semibold text-center">Task Fulfillment</th>
+ <th scope="col" class="px-6 py-3 font-semibold text-center">Graph Exact Match</th>
+ </tr>
+ </thead>
+ <!--
+ LEADERBOARD DATA
+ To update the leaderboard, edit the rows (<tr>...</tr>) below.
+ Each row represents a model. The columns are:
+ 1. Rank (#)
+ 2. Model Name
+ 3. Overall Score
+ 4. Task Fulfillment (LLM Judge)
+ 5. Graph Exact Match
+ Make sure the data is sorted by the Overall Score in descending order.
+ -->
+ <tbody class="divide-y divide-gray-200 table-hover">
+ <!-- Rank 1 -->
+ <tr class="border-b border-gray-200">
+ <td class="px-6 py-4 font-bold text-lg text-gray-900">1</td>
+ <td class="px-6 py-4 font-semibold text-gray-900">GPT-4o-mini</td>
+ <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.691</td>
+ <td class="px-6 py-4 text-center">0.77</td>
+ <td class="px-6 py-4 text-center">52.4%</td>
+ </tr>
+ <!-- Rank 2 -->
+ <tr class="border-b border-gray-200">
+ <td class="px-6 py-4 font-bold text-lg text-gray-900">2</td>
+ <td class="px-6 py-4 font-semibold text-gray-900">Qwen-3-32b</td>
+ <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.631</td>
+ <td class="px-6 py-4 text-center">0.57</td>
+ <td class="px-6 py-4 text-center">47.8%</td>
+ </tr>
+ <!-- Rank 3 -->
+ <tr class="border-b border-gray-200">
+ <td class="px-6 py-4 font-bold text-lg text-gray-900">3</td>
+ <td class="px-6 py-4 font-semibold text-gray-900">DeepSeek-R1-Qwen-32b</td>
+ <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.587</td>
+ <td class="px-6 py-4 text-center">0.52</td>
+ <td class="px-6 py-4 text-center">43.5%</td>
+ </tr>
+ <!-- Rank 4 -->
+ <tr class="border-b border-gray-200">
+ <td class="px-6 py-4 font-bold text-lg text-gray-900">4</td>
+ <td class="px-6 py-4 font-semibold text-gray-900">Mistral-small-2403</td>
+ <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.552</td>
+ <td class="px-6 py-4 text-center">0.49</td>
+ <td class="px-6 py-4 text-center">30.4%</td>
+ </tr>
+ <!-- Rank 5 -->
+ <tr class="border-b border-gray-200">
+ <td class="px-6 py-4 font-bold text-lg text-gray-900">5</td>
+ <td class="px-6 py-4 font-semibold text-gray-900">LLaMA-3.1-70b</td>
+ <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.542</td>
+ <td class="px-6 py-4 text-center">0.50</td>
+ <td class="px-6 py-4 text-center">21.7%</td>
+ </tr>
+ <!-- Rank 6 -->
+ <tr class="border-b border-gray-200">
+ <td class="px-6 py-4 font-bold text-lg text-gray-900">6</td>
+ <td class="px-6 py-4 font-semibold text-gray-900">LLaMA-3.1-8b</td>
+ <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.483</td>
+ <td class="px-6 py-4 text-center">0.43</td>
+ <td class="px-6 py-4 text-center">26.1%</td>
+ </tr>
+ <!-- Rank 7 -->
+ <tr class="border-b border-gray-200">
+ <td class="px-6 py-4 font-bold text-lg text-gray-900">7</td>
+ <td class="px-6 py-4 font-semibold text-gray-900">Mistral-7b-v0.3</td>
+ <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.423</td>
+ <td class="px-6 py-4 text-center">0.50</td>
+ <td class="px-6 py-4 text-center">0.0%</td>
+ </tr>
+ <!-- Rank 8 -->
+ <tr class="border-b border-gray-200">
+ <td class="px-6 py-4 font-bold text-lg text-gray-900">8</td>
+ <td class="px-6 py-4 font-semibold text-gray-900">LLaMA-3-8b</td>
+ <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.395</td>
+ <td class="px-6 py-4 text-center">0.51</td>
+ <td class="px-6 py-4 text-center">4.5%</td>
+ </tr>
+ </tbody>
+ </table>
+ </div>
+ <p class="text-xs text-gray-500 text-center mt-4">Leaderboard data from Table 1 of the MCP-BENCH paper. Last updated: August 7, 2025.</p>
+ </section>
+
+ <!-- Abstract Section -->
+ <section id="abstract" class="mb-16">
+ <h3 class="text-2xl md:text-3xl font-bold text-center mb-8">Abstract</h3>
+ <div class="bg-white p-8 rounded-lg shadow-md">
+ <p class="text-gray-700 leading-relaxed">
+ We introduce MCP-Bench, a new benchmark designed to evaluate large language models (LLMs) on realistic, multi-step tasks that require tool use, cross-tool coordination, and precise parameter control. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 31 live MCP servers spanning diverse real-world domains such as weather forecasting, stock analysis, scientific computing, and academic search. Tasks are structured as layered dependency graphs involving tools from one or more servers, testing an agent's ability to interpret tool schemas, plan coherent execution traces, retrieve relevant tools, and fill parameters with high structural and semantic fidelity. Unlike existing benchmarks, MCP-Bench targets real-world tool-use scenarios with complex input-output dependencies, diverse tool schemas, and multi-step reasoning requirements. We develop a multi-faceted evaluation framework that measures task success, tool-level execution accuracy, and alignment with ground-truth execution graphs. This includes metrics for tool name validity, schema compliance, graph exact match, structure-aware move distance, and semantic quality assessed by LLM-as-a-Judge. Experiments across 13 advanced LLMs, including GPT-4o, Claude 3, and LLaMA 3.1, reveal persistent challenges in long-horizon planning, tool reuse, and multi-server coordination. We release MCP-Bench, along with its evaluation toolkit, baseline results, and data synthesis pipeline, to enable robust and reproducible evaluation of agentic LLMs and to support future research on structured and scalable tool-based reasoning.
+ </p>
+ </div>
+ </section>
+
+ <!-- Links Section -->
+ <section id="links" class="text-center mb-16">
+ <div class="flex justify-center items-center space-x-4">
+ <a href="#" onclick="alert('Paper download link not available yet.'); return false;" class="inline-block bg-indigo-600 text-white font-semibold px-6 py-3 rounded-lg shadow-md hover:bg-indigo-700 transition-colors duration-300">
+ Download Paper (PDF)
+ </a>
+ <a href="#" onclick="alert('Code repository link not available yet.'); return false;" class="inline-block bg-gray-700 text-white font-semibold px-6 py-3 rounded-lg shadow-md hover:bg-gray-800 transition-colors duration-300">
+ View Code on GitHub
+ </a>
+ </div>
+ </section>
+
+ <!-- Citation Section -->
+ <section id="citation">
+ <h3 class="text-2xl md:text-3xl font-bold text-center mb-8">Citation</h3>
+ <div class="bg-gray-200 p-6 rounded-lg shadow-inner">
+ <pre class="text-sm text-gray-800 whitespace-pre-wrap break-words"><code>@misc{mcpbench2025,
+ title={{MCP-BENCH: Benchmarking Tool-Using LLM Agents with Real-World Tasks via MCP Servers}},
+ author={Your Name and Co-authors},
+ year={2025},
+ eprint={},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
+ }</code></pre>
+ </div>
+ </section>
+
+ </div>
+
+ <!-- Footer -->
+ <footer class="text-center py-6 bg-gray-100 border-t border-gray-200">
+ <p class="text-sm text-gray-500">&copy; 2025 MCP-BENCH Project. All Rights Reserved.</p>
+ </footer>
+
+ </body>
  </html>
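
The LEADERBOARD DATA comment in index.html asks maintainers to edit the rows by hand and keep them sorted by Overall Score in descending order. One way to avoid rank and order drifting apart is to generate the rows from plain data. The following is a minimal sketch under stated assumptions: `rankRows`, `renderRow`, and the `entries` field names are hypothetical and not part of this commit; the sample scores are copied from the first three table rows, and the cell classes mirror the ones used in the diff.

```javascript
// Hypothetical helpers, not part of the commit: they rebuild leaderboard rows
// from plain data so that rank and sort order cannot drift apart.

// Sample data copied from the first three rows of the table, deliberately unsorted.
const entries = [
  { model: "Qwen-3-32b", overall: 0.631, taskFulfillment: 0.57, graphExactMatch: 47.8 },
  { model: "GPT-4o-mini", overall: 0.691, taskFulfillment: 0.77, graphExactMatch: 52.4 },
  { model: "DeepSeek-R1-Qwen-32b", overall: 0.587, taskFulfillment: 0.52, graphExactMatch: 43.5 },
];

// Sort a copy by Overall Score (descending) and attach a 1-based rank.
function rankRows(rows) {
  return [...rows]
    .sort((a, b) => b.overall - a.overall)
    .map((row, index) => ({ rank: index + 1, ...row }));
}

// Render one ranked entry with the same cell classes the page uses.
// Assumes model names contain no HTML special characters.
function renderRow(row) {
  const cell = "px-6 py-4";
  return [
    `<tr class="border-b border-gray-200">`,
    `<td class="${cell} font-bold text-lg text-gray-900">${row.rank}</td>`,
    `<td class="${cell} font-semibold text-gray-900">${row.model}</td>`,
    `<td class="${cell} font-semibold text-center text-indigo-600">${row.overall.toFixed(3)}</td>`,
    `<td class="${cell} text-center">${row.taskFulfillment.toFixed(2)}</td>`,
    `<td class="${cell} text-center">${row.graphExactMatch.toFixed(1)}%</td>`,
    `</tr>`,
  ].join("\n");
}

const ranked = rankRows(entries);
const tbodyHtml = ranked.map(renderRow).join("\n");
console.log(tbodyHtml);
```

In the browser, the generated string could be assigned to `document.querySelector("#leaderboard tbody").innerHTML`, since the table body sits inside the `<section id="leaderboard">` element. Pasting the printed rows into index.html by hand works equally well and keeps the page fully static.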