
Qwen2.5 0.5B Model Capability for MCP Instruction Translation

Model Assessment

Strengths for This Task

  1. Instruction Following: Qwen2.5 is specifically designed for instruction following and has strong capabilities in understanding and executing complex instructions.

  2. Code Understanding: Its training data includes a substantial amount of code, giving it good comprehension of APIs, protocols, and structured data formats like JSON.

  3. Task-Specific Prompting: Your implementation can supply concrete examples and context in the prompt that guide the model toward correct translations.

  4. Context Awareness: The model can work with the detailed game state information provided via MCP to make informed decisions.

Limitations to Consider

  1. Size Constraint: At 0.5B parameters, it is far smaller than frontier models, which limits multi-step reasoning and long-horizon planning.

  2. Specialized Knowledge: It may not have specific training on the MCP protocol itself (though it can understand the concept from examples).

  3. Consistency: Smaller models can sometimes be less consistent in output quality.

Recommended Approach

Prompt Engineering Strategy

The key to success is providing the model with clear, structured prompts that guide it toward correct behavior:

import json

def create_translation_prompt(user_instruction: str, game_state: dict) -> str:
    return f"""
You are an RTS game command interpreter. Convert natural language instructions 
into specific MCP tool calls for an RTS game.

GAME CONTEXT:
- You are controlling the PLAYER (player_id: 0)
- Enemy is player_id: 1
- Game uses a grid coordinate system
- Units have specific capabilities and movement patterns

AVAILABLE MCP TOOLS:
1. get_game_state() - Retrieve current game situation
2. move_units(unit_ids: List[str], target_x: float, target_y: float)
3. attack_unit(attacker_ids: List[str], target_id: str)
4. build_building(building_type: str, position_x: float, position_y: float, player_id: int)
5. build_unit(unit_type: str, player_id: int, building_id: str)
6. get_ai_analysis(language: str) - Get tactical advice

CURRENT GAME STATE:
{json.dumps(game_state, indent=2)}

USER INSTRUCTION: "{user_instruction}"

TRANSLATION GUIDELINES:
1. ALWAYS verify that referenced units/buildings exist in the game state
2. Check that player has sufficient resources for construction actions
3. Ensure coordinates are valid (within map bounds, not in water)
4. Use appropriate unit types for actions (infantry for barracks, etc.)
5. Return ONLY a JSON array of tool calls in this exact format:
[
  {{"tool": "move_units", "arguments": {{"unit_ids": ["unit1"], "target_x": 100, "target_y": 200}}}}
]

EXAMPLE TRANSLATIONS:
User: "Move my tanks to position 200,300"
AI: [{{"tool": "move_units", "arguments": {{"unit_ids": ["tank1", "tank2"], "target_x": 200, "target_y": 300}}}}]

User: "Build a barracks near my HQ"
AI: [{{"tool": "build_building", "arguments": {{"building_type": "barracks", "position_x": 240, "position_y": 240, "player_id": 0}}}}]

Now translate the user instruction:
"""
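Small models often wrap their JSON in prose or markdown fences, so the raw response needs defensive parsing before any tool call is executed. A minimal sketch (the function name and tolerance rules are assumptions, not part of the existing code):

```python
import json
import re

def parse_model_response(response_text: str) -> list:
    """Extract the first JSON array of tool calls from raw model output.

    Searches for the outermost [...] span instead of parsing directly,
    since the model may surround the JSON with explanatory text.
    """
    match = re.search(r"\[.*\]", response_text, re.DOTALL)
    if not match:
        return []
    try:
        calls = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    # Keep only well-formed tool-call dicts
    return [c for c in calls
            if isinstance(c, dict) and "tool" in c and "arguments" in c]
```

Returning an empty list on any failure lets the caller fall back to re-prompting rather than crashing on malformed output.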

Few-Shot Learning Approach

Provide several examples in the prompt to guide the model:

EXAMPLES = [
    {
        "instruction": "Attack the enemy with my infantry",
        "game_state_context": "Player has infantry1, infantry2. Enemy has barracks at location barracks1",
        "translation": [
            {"tool": "attack_unit", "arguments": {"attacker_ids": ["infantry1", "infantry2"], "target_id": "barracks1"}}
        ]
    },
    {
        "instruction": "I need more power",
        "game_state_context": "Player has 500 credits, HQ at 100,100",
        "translation": [
            {"tool": "build_building", "arguments": {"building_type": "power_plant", "position_x": 140, "position_y": 100, "player_id": 0}}
        ]
    }
]
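These example pairs can be rendered into the User:/AI: format the prompt template expects. A small helper sketch, assuming each example carries the "instruction" and "translation" keys shown above:

```python
import json

def format_few_shot_block(examples: list) -> str:
    """Render few-shot pairs as User:/AI: lines for the prompt."""
    lines = []
    for ex in examples:
        lines.append(f'User: "{ex["instruction"]}"')
        lines.append(f"AI: {json.dumps(ex['translation'])}")
    return "\n".join(lines)
```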

Implementation Strategies

1. Validation Layer

Implement a validation system that checks AI-generated tool calls:

def validate_tool_call(tool_call: dict, game_state: dict) -> tuple:
    """Validate that an AI-generated tool call is reasonable.

    Returns (ok, reason) so callers can surface why a call was rejected.
    """
    tool_name = tool_call.get("tool")
    args = tool_call.get("arguments", {})
    
    if tool_name == "move_units":
        # Check that units exist
        unit_ids = args.get("unit_ids", [])
        for unit_id in unit_ids:
            if unit_id not in game_state.get("units", {}):
                return False, f"Unit {unit_id} not found"
        
        # Check coordinate bounds
        x, y = args.get("target_x", 0), args.get("target_y", 0)
        if not (0 <= x <= 3840 and 0 <= y <= 2880):  # Map bounds
            return False, "Target coordinates out of bounds"
    
    elif tool_name == "build_building":
        # Check resources
        building_type = args.get("building_type")
        cost = BUILDING_COSTS.get(building_type, 0)  # BUILDING_COSTS: project-wide cost table
        player_credits = game_state.get("players", {}).get("0", {}).get("credits", 0)
        if player_credits < cost:
            return False, "Insufficient credits"
    
    return True, "Valid"
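One way to wire the validator into execution is a gate that only dispatches calls that pass, collecting rejections for feedback to the user or the model. A sketch, where `validate` would be `validate_tool_call` above and `executor` is a hypothetical callable that actually runs a tool:

```python
def execute_plan(tool_calls: list, game_state: dict, validate, executor):
    """Dispatch only validated calls; keep (call, reason) for rejects.

    `validate` is (call, game_state) -> (ok, reason);
    `executor` is a hypothetical (tool_name, arguments) -> result callable.
    """
    results, rejected = [], []
    for call in tool_calls:
        ok, reason = validate(call, game_state)
        if ok:
            results.append(executor(call["tool"], call.get("arguments", {})))
        else:
            rejected.append((call, reason))
    return results, rejected
```

The rejected list can be fed back into the prompt on the next attempt, which dovetails with the iterative-refinement approach below.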

2. Iterative Refinement

Implement a feedback loop to improve translations:

from typing import List

class MCPTranslationEngine:
    def __init__(self):
        self.successful_translations = []
        self.failed_translations = []
    
    def translate_instruction(self, instruction: str, game_state: dict) -> List[dict]:
        """Translate instruction with learning from past examples"""
        # Include successful examples in prompt
        prompt = self.build_prompt_with_examples(instruction, game_state)
        response = self.query_model(prompt)
        return self.parse_response(response)
    
    def record_result(self, instruction: str, translation: List[dict], success: bool):
        """Record translation results for future learning"""
        if success:
            self.successful_translations.append((instruction, translation))
        else:
            self.failed_translations.append((instruction, translation))

3. Fallback Mechanisms

Implement fallback strategies for complex instructions:

from typing import List

def translate_with_fallback(instruction: str, game_state: dict) -> List[dict]:
    """Attempt translation with multiple strategies"""
    
    # Try direct translation first
    try:
        direct_result = attempt_direct_translation(instruction, game_state)
        if validate_translation(direct_result, game_state):
            return direct_result
    except Exception:
        pass
    
    # Try breaking into simpler steps
    try:
        steps = break_into_simple_steps(instruction)
        results = []
        for step in steps:
            step_result = attempt_direct_translation(step, game_state)
            if validate_translation(step_result, game_state):
                results.extend(step_result)
        return results
    except Exception:
        pass
    
    # Fallback to AI analysis request
    return [{"tool": "get_ai_analysis", "arguments": {"language": "en"}}]

Performance Expectations

Likely Success Cases

  1. Simple Commands: "Move tanks to position X,Y" - High accuracy
  2. Basic Strategy: "Build a power plant" - High accuracy
  3. Direct Attacks: "Attack enemy barracks" - High accuracy
  4. Resource Management: "Build more harvesters" - Moderate to high accuracy

Challenging Cases

  1. Complex Tactics: "Flank the enemy while defending our base" - Moderate accuracy
  2. Abstract Concepts: "Win the game" - Lower accuracy, needs breakdown
  3. Multi-step Plans: "Expand economy then build army" - Needs iterative approach
  4. Contextual Nuances: "Defend aggressively" - Interpretation challenges
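For the multi-step and abstract cases, the break_into_simple_steps helper referenced in the fallback code could start as a naive conjunction splitter. A heuristic sketch (real decomposition may itself need the LLM):

```python
import re

def break_into_simple_steps(instruction: str) -> list:
    """Naive decomposition: split a compound instruction on common
    sequencing words so each piece can be translated independently."""
    parts = re.split(r"\b(?:and then|after that|then)\b|;",
                     instruction, flags=re.IGNORECASE)
    return [p.strip(" ,.") for p in parts if p.strip(" ,.")]
```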

Enhancement Recommendations

1. Model Fine-Tuning

If possible, fine-tune the model on RTS command examples:

  • Collect successful translation examples
  • Create a dataset of instruction → tool call mappings
  • Fine-tune for better consistency

2. Hybrid Approach

Combine LLM with rule-based systems:

def smart_translate(instruction: str, game_state: dict):
    # Simple pattern matching for common commands
    if "move" in instruction.lower() and "to" in instruction.lower():
        return pattern_based_move_translation(instruction, game_state)
    
    # Complex reasoning for abstract commands
    elif "win" in instruction.lower() or "strategy" in instruction.lower():
        return ai_assisted_strategic_translation(instruction, game_state)
    
    # Default to LLM for everything else
    else:
        return llm_based_translation(instruction, game_state)
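The pattern_based_move_translation branch can be as simple as a regex that pulls a coordinate pair out of the text and targets the player's own units. A sketch under an assumed game_state layout (units keyed by id, each with a player_id field):

```python
import re

def pattern_based_move_translation(instruction: str, game_state: dict) -> list:
    """Rule-based shortcut for 'move ... to X,Y' commands.

    Extracts the target coordinates with a regex instead of calling the
    LLM; returns [] when no coordinate pair is found so the caller can
    route the instruction to the LLM instead.
    """
    match = re.search(r"(-?\d+)\s*,\s*(-?\d+)", instruction)
    if not match:
        return []
    x, y = float(match.group(1)), float(match.group(2))
    # Assumed layout: select all units belonging to player 0
    unit_ids = [uid for uid, u in game_state.get("units", {}).items()
                if u.get("player_id") == 0]
    return [{"tool": "move_units",
             "arguments": {"unit_ids": unit_ids,
                           "target_x": x, "target_y": y}}]
```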

3. Confidence Scoring

Implement confidence scoring for translations:

from typing import List, Tuple

def translate_with_confidence(instruction: str, game_state: dict) -> Tuple[List[dict], float]:
    """Return translation with confidence score (0.0 to 1.0)"""
    translation = generate_translation(instruction, game_state)
    confidence = calculate_confidence(translation, instruction, game_state)
    return translation, confidence

# Only execute high-confidence translations automatically
# Ask for confirmation on low-confidence ones
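Before any learned scorer exists, calculate_confidence can start as a cheap heuristic that penalizes signals of a bad translation. The penalty weights below are arbitrary placeholders:

```python
def calculate_confidence(translation: list, instruction: str,
                         game_state: dict) -> float:
    """Crude heuristic score in [0.0, 1.0]: start at 1.0 and subtract
    penalties for likely-wrong signals. Weights are placeholders."""
    if not translation:
        return 0.0
    score = 1.0
    known_units = set(game_state.get("units", {}))
    for call in translation:
        args = call.get("arguments", {})
        # Penalize references to units the game state doesn't contain
        for uid in args.get("unit_ids", []) + args.get("attacker_ids", []):
            if uid not in known_units:
                score -= 0.3
    # Long instructions are harder for a 0.5B model to translate reliably
    if len(instruction.split()) > 15:
        score -= 0.2
    return max(0.0, score)
```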

Testing Strategy

Unit Tests for Translation

def test_translation_accuracy():
    test_cases = [
        ("Move my tanks to 200,300", expected_tank_move_call),
        ("Build a barracks", expected_build_barracks_call),
        ("Attack enemy HQ", expected_attack_call),
    ]
    
    for instruction, expected in test_cases:
        result = translate_instruction(instruction, sample_game_state)
        assert result == expected, f"Failed for: {instruction}"

A/B Testing Framework

def compare_translation_strategies():
    instructions = load_test_instructions()
    
    strategy_a_results = []
    strategy_b_results = []
    
    for instruction in instructions:
        # Test different approaches
        result_a = strategy_a(instruction, game_state)
        result_b = strategy_b(instruction, game_state)
        
        # Measure success (manual or automated evaluation)
        success_a = evaluate_success(result_a)
        success_b = evaluate_success(result_b)
        
        strategy_a_results.append(success_a)
        strategy_b_results.append(success_b)
    
    # Compare effectiveness
    avg_a = sum(strategy_a_results) / len(strategy_a_results)
    avg_b = sum(strategy_b_results) / len(strategy_b_results)
    return avg_a, avg_b

Conclusion

While Qwen2.5 0.5B is far from the largest model available, it is capable of translating user instructions into MCP tool calls for your RTS game, especially when supported by:

  1. Structured prompting with clear examples
  2. Validation layers to catch errors
  3. Fallback mechanisms for complex cases
  4. Iterative improvement through learning

The key is not raw model size, but intelligent implementation that works with the model's strengths while compensating for its limitations. Your existing investment in the Qwen2.5 model, combined with the robust MCP interface, provides an excellent foundation for natural language game control.