<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>MCP-BENCH: Benchmarking Tool-Using LLM Agents</title>
    <script src="https://cdn.tailwindcss.com"></script>
    <link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&display=swap" rel="stylesheet">
    <style>
        body {
            font-family: 'Inter', sans-serif;
        }
        .gradient-text {
            background: linear-gradient(to right, #4f46e5, #ec4899);
            -webkit-background-clip: text;
            -webkit-text-fill-color: transparent;
        }
        .table-hover tr:hover {
            background-color: #f9fafb;
        }
    </style>
</head>
<body class="bg-gray-50 text-gray-800">

    <div class="container mx-auto px-4 py-8 md:py-16 max-w-5xl">

        <!-- Header Section -->
        <header class="text-center mb-12">
            <h1 class="text-4xl md:text-5xl font-bold text-gray-900 mb-2">
                MCP-BENCH
            </h1>
            <h2 class="text-lg md:text-xl text-gray-600">
                Benchmarking Tool-Using LLM Agents with Real-World Tasks via MCP Servers
            </h2>
        </header>

        <!-- Leaderboard Section -->
        <section id="leaderboard" class="mb-16">
            <h3 class="text-2xl md:text-3xl font-bold text-center mb-8 gradient-text">Leaderboard</h3>
            <div class="overflow-x-auto bg-white rounded-lg shadow-lg">
                <table class="min-w-full text-sm text-left text-gray-600">
                    <thead class="bg-gray-100 text-xs text-gray-700 uppercase tracking-wider">
                        <tr>
                            <th scope="col" class="px-6 py-3 font-semibold">Rank</th>
                            <th scope="col" class="px-6 py-3 font-semibold">Model</th>
                            <th scope="col" class="px-6 py-3 font-semibold text-center">Overall Score</th>
                            <th scope="col" class="px-6 py-3 font-semibold text-center">Task Fulfillment</th>
                            <th scope="col" class="px-6 py-3 font-semibold text-center">Graph Exact Match</th>
                        </tr>
                    </thead>
                    <!-- 
                        LEADERBOARD DATA
                        To update the leaderboard, edit the rows (<tr>...</tr>) below.
                        Each row represents a model. The columns are:
                        1. Rank (#)
                        2. Model Name
                        3. Overall Score
                        4. Task Fulfillment (LLM Judge)
                        5. Graph Exact Match
                        Make sure the data is sorted by the Overall Score in descending order.
                    -->
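                    <!--
                        Example row template (all values below are placeholders, not real
                        results — replace them with the model's actual scores before use):
                        <tr class="border-b border-gray-200">
                            <td class="px-6 py-4 font-bold text-lg text-gray-900">9</td>
                            <td class="px-6 py-4 font-semibold text-gray-900">Model-Name</td>
                            <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.000</td>
                            <td class="px-6 py-4 text-center">0.00</td>
                            <td class="px-6 py-4 text-center">0.0%</td>
                        </tr>
                    -->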
                    <tbody class="divide-y divide-gray-200 table-hover">
                        <!-- Rank 1 -->
                        <tr class="border-b border-gray-200">
                            <td class="px-6 py-4 font-bold text-lg text-gray-900">1</td>
                            <td class="px-6 py-4 font-semibold text-gray-900">GPT-4o-mini</td>
                            <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.691</td>
                            <td class="px-6 py-4 text-center">0.77</td>
                            <td class="px-6 py-4 text-center">52.4%</td>
                        </tr>
                        <!-- Rank 2 -->
                        <tr class="border-b border-gray-200">
                            <td class="px-6 py-4 font-bold text-lg text-gray-900">2</td>
                            <td class="px-6 py-4 font-semibold text-gray-900">Qwen-3-32b</td>
                            <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.631</td>
                            <td class="px-6 py-4 text-center">0.57</td>
                            <td class="px-6 py-4 text-center">47.8%</td>
                        </tr>
                        <!-- Rank 3 -->
                        <tr class="border-b border-gray-200">
                            <td class="px-6 py-4 font-bold text-lg text-gray-900">3</td>
                            <td class="px-6 py-4 font-semibold text-gray-900">DeepSeek-R1-Qwen-32b</td>
                            <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.587</td>
                            <td class="px-6 py-4 text-center">0.52</td>
                            <td class="px-6 py-4 text-center">43.5%</td>
                        </tr>
                        <!-- Rank 4 -->
                        <tr class="border-b border-gray-200">
                            <td class="px-6 py-4 font-bold text-lg text-gray-900">4</td>
                            <td class="px-6 py-4 font-semibold text-gray-900">Mistral-small-2403</td>
                            <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.552</td>
                            <td class="px-6 py-4 text-center">0.49</td>
                            <td class="px-6 py-4 text-center">30.4%</td>
                        </tr>
                        <!-- Rank 5 -->
                        <tr class="border-b border-gray-200">
                            <td class="px-6 py-4 font-bold text-lg text-gray-900">5</td>
                            <td class="px-6 py-4 font-semibold text-gray-900">LLaMA-3.1-70b</td>
                            <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.542</td>
                            <td class="px-6 py-4 text-center">0.50</td>
                            <td class="px-6 py-4 text-center">21.7%</td>
                        </tr>
                         <!-- Rank 6 -->
                        <tr class="border-b border-gray-200">
                            <td class="px-6 py-4 font-bold text-lg text-gray-900">6</td>
                            <td class="px-6 py-4 font-semibold text-gray-900">LLaMA-3.1-8b</td>
                            <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.483</td>
                            <td class="px-6 py-4 text-center">0.43</td>
                            <td class="px-6 py-4 text-center">26.1%</td>
                        </tr>
                         <!-- Rank 7 -->
                        <tr class="border-b border-gray-200">
                            <td class="px-6 py-4 font-bold text-lg text-gray-900">7</td>
                            <td class="px-6 py-4 font-semibold text-gray-900">Mistral-7b-v0.3</td>
                            <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.423</td>
                            <td class="px-6 py-4 text-center">0.50</td>
                            <td class="px-6 py-4 text-center">0.0%</td>
                        </tr>
                         <!-- Rank 8 -->
                        <tr class="border-b border-gray-200">
                            <td class="px-6 py-4 font-bold text-lg text-gray-900">8</td>
                            <td class="px-6 py-4 font-semibold text-gray-900">LLaMA-3-8b</td>
                            <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.395</td>
                            <td class="px-6 py-4 text-center">0.51</td>
                            <td class="px-6 py-4 text-center">4.5%</td>
                        </tr>
                    </tbody>
                </table>
            </div>
            <p class="text-xs text-gray-500 text-center mt-4">Leaderboard data from Table 1 of the MCP-BENCH paper. Last updated: August 7, 2025.</p>
        </section>

        <!-- Abstract Section -->
        <section id="abstract" class="mb-16">
            <h3 class="text-2xl md:text-3xl font-bold text-center mb-8">Abstract</h3>
            <div class="bg-white p-8 rounded-lg shadow-md">
                <p class="text-gray-700 leading-relaxed">
                    We introduce MCP-Bench, a new benchmark designed to evaluate large language models (LLMs) on realistic, multi-step tasks that require tool use, cross-tool coordination, and precise parameter control. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 31 live MCP servers spanning diverse real-world domains such as weather forecasting, stock analysis, scientific computing, and academic search. Tasks are structured as layered dependency graphs involving tools from one or more servers, testing an agent's ability to interpret tool schemas, plan coherent execution traces, retrieve relevant tools, and fill parameters with high structural and semantic fidelity. Unlike existing benchmarks, MCP-Bench targets real-world tool-use scenarios with complex input-output dependencies, diverse tool schemas, and multi-step reasoning requirements. We develop a multi-faceted evaluation framework that measures task success, tool-level execution accuracy, and alignment with ground-truth execution graphs. This includes metrics for tool name validity, schema compliance, graph exact match, structure-aware move distance, and semantic quality assessed by LLM-as-a-Judge. Experiments across 13 advanced LLMs—including GPT-4o, Claude 3, and LLaMA 3.1—reveal persistent challenges in long-horizon planning, tool reuse, and multi-server coordination. We release MCP-Bench, along with its evaluation toolkit, baseline results, and data synthesis pipeline, to enable robust and reproducible evaluation of agentic LLMs and to support future research on structured and scalable tool-based reasoning.
                </p>
            </div>
        </section>
        
        <!-- Links Section -->
        <section id="links" class="text-center mb-16">
             <div class="flex justify-center items-center space-x-4">
                 <a href="#" onclick="alert('Paper download link not available yet.'); return false;" class="inline-block bg-indigo-600 text-white font-semibold px-6 py-3 rounded-lg shadow-md hover:bg-indigo-700 transition-colors duration-300">
                     Download Paper (PDF)
                 </a>
                 <a href="#" onclick="alert('Code repository link not available yet.'); return false;" class="inline-block bg-gray-700 text-white font-semibold px-6 py-3 rounded-lg shadow-md hover:bg-gray-800 transition-colors duration-300">
                     View Code on GitHub
                 </a>
             </div>
        </section>

        <!-- Citation Section -->
        <section id="citation">
            <h3 class="text-2xl md:text-3xl font-bold text-center mb-8">Citation</h3>
            <div class="bg-gray-200 p-6 rounded-lg shadow-inner">
                <pre class="text-sm text-gray-800 whitespace-pre-wrap break-words"><code>@misc{mcpbench2025,
    title={{MCP-BENCH: Benchmarking Tool-Using LLM Agents with Real-World Tasks via MCP Servers}},
    author={Your Name and Co-authors},
    year={2025},
    eprint={},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}</code></pre>
            </div>
        </section>

    </div>

    <!-- Footer -->
    <footer class="text-center py-6 bg-gray-100 border-t border-gray-200">
        <p class="text-sm text-gray-500">&copy; 2025 MCP-BENCH Project. All Rights Reserved.</p>
    </footer>

</body>
</html>