h1. MCP Capability Evaluation Report for Small Language Models (SLMs)
h2. Executive Summary
This report presents a comprehensive evaluation of 14 Small Language Models (SLMs) of up to 3B parameters for their MCP (Model Context Protocol) capabilities. The evaluation assesses how effectively these models convert natural-language RTS game commands into structured JSON tool calls.
*NEW*: This report has been updated with two additional code-specialized models (Qwen2.5-Coder-1.5B and Yi-Coder-1.5B), revealing a new champion with exceptional MCP performance.
h2. What is MCP?
MCP (Model Context Protocol) is a standardized protocol that enables AI models to interact with external tools and systems through structured JSON calls. In the context of RTS games, MCP allows:
* Conversion of natural language commands into executable actions
* Structured communication between AI and game engines
* Standardized tool calling interface
* Real-time command processing
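For illustration, a single tool call under this scheme could look like the sketch below. This is a minimal example using only the Python standard library; the @tool@/@arguments@ field names and the @move_units@ payload are assumptions for this report, not a published MCP schema:

bc.. import json

# Hypothetical shape of one MCP-style tool call for the RTS scenarios evaluated here.
# The "tool"/"arguments" field names are illustrative assumptions.
tool_call = {
    "tool": "move_units",
    "arguments": {"unit_type": "infantry", "x": 150, "y": 200},
}

# The game engine would receive this as a structured JSON message.
print(json.dumps(tool_call, indent=2))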
h2. Evaluation Methodology
h3. Test Scenarios
We evaluated each model on three realistic RTS game scenarios (the expected tool calls are sketched after this list):
# *State Command*: "show game state" → Expected: @get_game_state@
# *Movement Command*: "move infantry to 150,200" → Expected: @move_units@ with coordinates
# *Attack Command*: "attack enemy tank at 300,150" → Expected: @attack_unit@ with target coordinates
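A minimal sketch of the expected outputs for these scenarios follows. The tool names come from the scenarios above; the argument field names are illustrative assumptions:

bc.. # Expected tool calls per test prompt. Tool names (get_game_state, move_units,
# attack_unit) are from the scenarios above; argument field names are assumed.
EXPECTED_CALLS = {
    "show game state": {"tool": "get_game_state", "arguments": {}},
    "move infantry to 150,200": {
        "tool": "move_units",
        "arguments": {"unit_type": "infantry", "x": 150, "y": 200},
    },
    "attack enemy tank at 300,150": {
        "tool": "attack_unit",
        "arguments": {"target": "enemy tank", "x": 300, "y": 150},
    },
}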
h3. Scoring System (0-10 points per test)
* +4 points: Correct tool identification
* +3 points: Valid JSON structure
* +2 points: Proper tool/action terminology
* +1 point: Correct coordinate extraction
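The rubric can be read as a simple additive function. The sketch below is an approximation of that scoring logic under the assumed @tool@/@arguments@ schema; the actual evaluation scripts may credit partial matches differently:

bc.. import json

def score_response(raw: str, expected_tool: str, expected_coords=None) -> int:
    """Apply the 0-10 rubric above to one raw model response (simplified sketch)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return 0  # without parseable JSON, nothing further is credited in this sketch

    score = 3                                            # +3 valid JSON structure
    if call.get("tool") == expected_tool:
        score += 4                                       # +4 correct tool identification
    if "tool" in call and "arguments" in call:
        score += 2                                       # +2 proper tool/action terminology
    if expected_coords is not None:
        args = call.get("arguments", {})
        if (args.get("x"), args.get("y")) == expected_coords:
            score += 1                                   # +1 correct coordinate extraction
    return score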
h3. Models Evaluated
The evaluation includes 14 models across different categories:
* *General-purpose SLMs* (3 models)
* *MCP-specialized models* (7 models with various quantizations)
* *Code-specialized models* (3 models)
* *Code-specialized, non-functional* (1 model)
h2. Test Results
The comprehensive evaluation revealed significant differences in MCP capabilities across models:
h3. Performance Ranking Table
|_. Rank|_. Model|_. MCP Score|_. Avg Time|_. Size|_. Efficiency|_. Notes|
| *1* | *@Qwen2.5-Coder-1.5B-Q4@* | *9.7/10* | *4.12s* | *1017MB* | *2.34 pts/s* | 🏆 *Champion* |
| 2 | @Qwen2.5-Coder-0.5B@ | 4.3/10 | 2.08s | 409MB | 2.08 pts/s | Previous champion |
| 3 | @Qwen3-0.6B@ | 3.7/10 | 3.98s | 610MB | 0.92 pts/s | |
| 4 | @Gemma-3-270M@ | 3.7/10 | 2.29s | 428MB | 1.60 pts/s | |
| 5 | @MCPR-L-3B-Exa-Q8@ | 3.7/10 | 17.42s | 3133MB | 0.21 pts/s | |
| 6 | @Gemma-3n-E2B-it-Q8@ | 3.7/10 | 14.80s | 4566MB | 0.25 pts/s | |
| 7 | @Qwen3-1.7B@ | 3.7/10 | 6.24s | 1008MB | 0.59 pts/s | |
| 8 | @Qwen2.5-0.5B@ | 2.7/10 | 1.17s | 409MB | 2.28 pts/s | |
| 9 | @Gemma-3n-E2B-it-IQ2@ | 2.3/10 | 14.11s | 1958MB | 0.17 pts/s | |
| 10 | @Llama-Breeze2-3B-Q2@ | 1.3/10 | 11.39s | 1424MB | 0.12 pts/s | |
| 11 | @Yi-Coder-1.5B-Q4@ | 0.0/10 | 11.64s | 826MB | 0.00 pts/s | Prompt format issue |
| 12 | @MCP-Instruct-v1-Q4@ | 0.0/10 | 0.00s | 697MB | 0.00 pts/s | |
| 13 | @MCPR-L-3B-Exa-Q2@ | 0.0/10 | 10.63s | 1216MB | 0.00 pts/s | |
| 14 | @MCP-Instruct-v1-Q8@ | 0.0/10 | 0.00s | 1465MB | 0.00 pts/s | |
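The Efficiency column is the MCP score divided by the average response time (points per second). Recomputing it from the rounded figures shown above can differ in the last digit, since the published values were derived from unrounded scores and times:

bc.. # Efficiency (pts/s) = MCP score / average response time.
# Recomputed from the rounded table values, so the last digit may differ
# slightly from the published column.
rows = {
    "Qwen2.5-Coder-1.5B-Q4": (9.7, 4.12),
    "Qwen2.5-Coder-0.5B": (4.3, 2.08),
    "Gemma-3-270M": (3.7, 2.29),
}
for name, (score, avg_time) in rows.items():
    print(f"{name}: {score / avg_time:.2f} pts/s")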
h2. Key Findings
h3. Performance Insights
* *Code-specialized models dramatically outperform others*: Qwen2.5-Coder-1.5B achieved an exceptional 9.7/10 score, more than 2x better than any other model
* *Scaling works for code-specialized models*: Increasing from 0.5B to 1.5B parameters improved the score from 4.3/10 to 9.7/10
* *Near-perfect MCP capability exists in small models*: The 1.5B model achieved 10/10 on two of the three tests once its JSON output was extracted from the markdown code blocks
* *Smaller models can be more efficient*: The 270M parameter Gemma model performed as well as much larger 3B models
* *Quantization matters*: Q8 versions generally performed better than Q2/Q4 versions for MCP-specialized models
h3. Technical Observations
* *Markdown wrapping requires extraction*: Qwen2.5-Coder-1.5B wraps its JSON in markdown code blocks (@```json```@), so extraction logic is required (see the sketch after this list)
* *MCP-Instruct models failed completely* due to technical issues (@llama_decode returned -1@)
* *Yi-Coder has prompt format incompatibility*: Returns the prompt itself rather than generating responses
* *Larger models don't guarantee better performance*: The 3B models were significantly slower while scoring no better than much smaller models; only the code-specialized line benefited from added parameters
* *Response time varies dramatically*: From 1.17s (Qwen2.5-0.5B) to 17.42s (MCPR-L-3B-Exa-Q8)
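A minimal sketch of the extraction step referenced above, stripping an optional @```json@ fence before parsing; the production extraction logic may need to handle additional edge cases:

bc.. import json
import re

# Strip an optional ```json ... ``` fence (as emitted by Qwen2.5-Coder-1.5B)
# before parsing the tool call; fall back to the raw text if no fence is found.
FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)\s*```", re.DOTALL)

def extract_tool_call(raw: str) -> dict:
    match = FENCE_RE.search(raw)
    payload = match.group(1) if match else raw.strip()
    return json.loads(payload)

print(extract_tool_call('```json\n{"tool": "get_game_state", "arguments": {}}\n```'))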
h2. Recommendations
Based on the updated evaluation results, we recommend:
# *Primary Choice*: @Qwen2.5-Coder-1.5B-Q4@ - *Exceptional MCP performance* (9.7/10) with reasonable speed (4.12s) and size (1017MB)
# *Budget Alternative*: @Qwen2.5-Coder-0.5B@ - Best balance for resource-constrained environments (4.3/10, 2.08s, 409MB)
# *Ultra-lightweight*: @Gemma-3-270M@ - Excellent efficiency for its tiny size (3.7/10, 2.29s, 428MB)
# *Avoid*: MCP-Instruct models (technical incompatibility), Yi-Coder (prompt format issues)
h2. Conclusion
This comprehensive 14-model evaluation demonstrates critical insights for MCP capabilities in RTS games:
* *Code-specialized models are vastly superior*: The champion (Qwen2.5-Coder-1.5B) achieved 9.7/10, while the best MCP-specialized model only reached 3.7/10
* *Parameter scaling works for code models*: Tripling parameters (0.5B → 1.5B) more than doubled MCP performance (4.3 → 9.7)
* *Near-perfect MCP is achievable*: Small models under 2B parameters can achieve 10/10 on individual tests with proper implementation
* *JSON extraction is critical*: Modern code models wrap output in markdown, requiring extraction logic for production use
* *Efficiency varies dramatically*: The best model is 11.7x more effective than the worst functional model
The results provide valuable insights for developers implementing MCP-based AI assistants in gaming applications, demonstrating that code-specialized models offer the most reliable path to high-quality MCP capabilities.
*Report generated on: 2025-10-05*
*Updated on: 2025-10-05* (added Qwen2.5-Coder-1.5B and Yi-Coder-1.5B)
*Evaluation framework: llama.cpp with MCP protocol simulation*