
🎮 LLM Thread Management on 2 vCPU System

๐Ÿ› Problem Discovered

Symptom: During LLM inference, units struggle to execute mouse orders; controls lag or become unresponsive even though the async request system is in place.

Root Cause: llama-cpp-python has TWO thread parameters:

  1. n_threads - Threads used for token generation
  2. n_threads_batch - Threads used for batch/prompt processing (has its own default when not set!)

Previous Config:

Llama(
    n_threads=1,          # ✅ Set to 1
    # n_threads_batch=?   # ❌ NOT SET
)

BUT: when n_threads_batch is not set explicitly, llama.cpp falls back to its own internal default (in llama-cpp-python this is typically the full CPU count), so inference can still occupy both vCPUs.

🔧 Solution

Explicitly set BOTH parameters to 1:

Llama(
    n_threads=1,          # Token generation: 1 thread
    n_threads_batch=1,    # Batch/prompt processing: 1 thread (CRITICAL!)
    n_batch=128,          # Batch size
)

CPU Allocation:

  • vCPU 0: LLM inference (1 thread total)
  • vCPU 1: Game loop, websockets, async I/O

This ensures the game always has one full vCPU available! 🎯
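
As a small sketch (not project code), the thread count can also be derived from the vCPU count instead of being hard-coded, so the "one full vCPU stays free for the game" rule keeps holding if the Space is ever resized:

import multiprocessing

# Illustrative helper, not the project's config code:
# always leave at least one vCPU free for the game loop.
total_vcpus = multiprocessing.cpu_count()   # 2 on the current Space
llm_threads = max(1, total_vcpus - 1)       # -> 1 thread for the LLM

model_kwargs = {"n_threads": llm_threads, "n_threads_batch": llm_threads}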

📊 HuggingFace Spaces Constraints

Available Resources:

  • 2 vCPUs (shared, not dedicated)
  • 16GB RAM
  • No GPU (CPU-only inference)

Challenges:

  1. CPU-bound LLM: Qwen2.5-Coder-1.5B takes 10-15s per inference
  2. Real-time game: Needs a consistent 20 FPS (50ms per frame); see the sketch after this list
  3. WebSocket server: Needs to respond to user input instantly
  4. Shared system: Other processes may use CPU
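
To keep these constraints compatible, the blocking inference call must stay off the loop that runs game ticks and websocket handlers. A minimal sketch of that pattern (the project uses its own worker thread via _process_requests; this sketch uses asyncio.to_thread for brevity, and the function names are illustrative):

import asyncio
import time

def run_inference(prompt: str) -> str:
    # Stand-in for the blocking llama-cpp-python call (10-15s on CPU)
    time.sleep(0.5)
    return f"result for: {prompt}"

async def game_loop():
    while True:
        frame_start = time.perf_counter()
        # ... update units, handle input, broadcast state ...
        elapsed = time.perf_counter() - frame_start
        await asyncio.sleep(max(0.0, 0.05 - elapsed))  # 20 FPS = 50ms per frame

async def main():
    game = asyncio.create_task(game_loop())
    # Inference runs off the event loop; the game task keeps ticking meanwhile
    result = await asyncio.to_thread(run_inference, "move units to the bridge")
    print(result)
    game.cancel()

asyncio.run(main())

Even with this pattern in place, the worker thread still competes for CPU time, which is why the thread-count fix above matters.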

🎛️ Additional Optimizations

1. Reduce Context Window

n_ctx=4096,  # Current - high memory, slower
n_ctx=2048,  # Optimized - lower memory, faster ✅

Benefit: Faster prompt processing, less memory

2. Increase Batch Size

n_batch=128,   # Current - more frequent updates
n_batch=256,   # Optimized - fewer updates, faster overall ✅

Benefit: Faster generation, less overhead
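
Taken together with the thread settings, the tuned constructor could look like this sketch (the model path is illustrative; the values follow the recommendations above):

from llama_cpp import Llama

model = Llama(
    model_path="models/qwen2.5-coder-1.5b-instruct-q4_0.gguf",  # illustrative path
    n_ctx=2048,          # smaller context: faster prompt processing, less memory
    n_batch=256,         # larger batch: fewer passes over the prompt
    n_threads=1,         # token generation threads
    n_threads_batch=1,   # batch/prompt threads (must also be 1 on 2 vCPUs)
)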

3. Set Thread Priority (OS Level)

import os

# Lower the LLM worker thread's priority so the OS favors the game thread
def _process_requests(self):
    # Raise niceness by 10 (lower priority); Unix-only
    try:
        os.nice(10)
    except (AttributeError, OSError):
        pass  # os.nice unavailable (e.g. Windows) or not permitted

    while not self._stop_worker:
        # ... process requests

Benefit: OS scheduler favors game thread

4. CPU Affinity (Advanced)

import os

# Pin the LLM worker to CPU 0 only (call this from the worker thread)
try:
    os.sched_setaffinity(0, {0})  # 0 = caller, {0} = CPU 0; Linux-only
except (AttributeError, OSError):
    pass

Benefit: Game thread has exclusive access to CPU 1

5. Reduce Token Generation

max_tokens=128,  # Current for translations
max_tokens=64,   # Optimized - shorter responses ✅

max_tokens=200,  # Current for AI analysis
max_tokens=150,  # Optimized - more concise ✅

Benefit: Faster inference, less CPU time
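
A small sketch of how the lowered limits would be passed at the call sites (the messages are placeholders; create_chat_completion is the llama-cpp-python API the worker already uses):

# Translation: short structured output, capped at 64 tokens
translation = model.create_chat_completion(
    messages=[{"role": "user", "content": "Translate: 'send all tanks north'"}],
    max_tokens=64,
)

# Tactical analysis: longer, but still capped at 150 tokens
analysis = model.create_chat_completion(
    messages=[{"role": "user", "content": "Analyze the current battlefield state"}],
    max_tokens=150,
)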

🧪 Testing Strategy

Test 1: Idle Baseline

# No LLM inference
→ Game FPS: 20 ✅
→ Mouse response: Instant ✅

Test 2: During Translation

# User types NL command during inference
→ Game FPS: Should stay 20 ✅
→ Mouse clicks: Should respond immediately ✅
→ Unit movement: Should execute smoothly ✅

Test 3: During AI Analysis

# Game requests tactical analysis
→ Game FPS: Should stay 20 ✅
→ User input: Should respond immediately ✅
→ Combat: Should continue smoothly ✅

Test 4: Concurrent

# Translation + Analysis at same time
→ Game FPS: Should stay 18-20 (slight drop OK) ✅
→ Critical: Mouse/keyboard should keep working! ✅
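
These checks can also be automated with a small harness that times simulated 50ms game ticks while a blocking call runs in a worker thread (sketch only; fake_inference stands in for the real LLM call):

import statistics
import threading
import time

def fake_inference():
    # Stand-in for the blocking create_chat_completion call (10-15s on CPU)
    time.sleep(5)

def measure_tick_lateness(duration_s=5.0):
    # Simulate 20 FPS ticks and record how late each tick fires
    lateness = []
    deadline = time.perf_counter()
    end = deadline + duration_s
    while time.perf_counter() < end:
        deadline += 0.05
        time.sleep(max(0.0, deadline - time.perf_counter()))
        lateness.append(max(0.0, time.perf_counter() - deadline))
    return lateness

worker = threading.Thread(target=fake_inference, daemon=True)
worker.start()
samples = measure_tick_lateness()
print(f"mean tick lateness: {statistics.mean(samples) * 1000:.1f} ms, "
      f"worst: {max(samples) * 1000:.1f} ms")

For end-to-end mouse latency, the same idea applies on the client side: timestamp the click, timestamp the first resulting unit update, and compare against the latency targets further below.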

📈 Expected Improvements

Before Fix

During LLM Inference (n_threads_batch unset, potentially 2+):
├─ LLM uses both vCPUs
├─ Game thread starved
├─ Mouse clicks delayed/lost
└─ Units don't respond to orders ❌

After Fix

During LLM Inference (n_threads=1, n_threads_batch=1):
├─ LLM uses only 1 vCPU
├─ Game has 1 dedicated vCPU
├─ Mouse clicks instant
└─ Units respond immediately ✅

๐Ÿ” Monitoring

Add CPU usage logging:

import psutil
import time

def _process_requests(self):
    while not self._stop_worker:
        # Monitor CPU before inference
        cpu_before = psutil.cpu_percent(interval=0.1)
        
        # Process request
        start = time.time()
        response = self.model.create_chat_completion(...)
        elapsed = time.time() - start
        
        # Monitor CPU after
        cpu_after = psutil.cpu_percent(interval=0.1)
        
        print(f"⚙️ LLM: {elapsed:.1f}s, CPU: {cpu_before:.0f}%→{cpu_after:.0f}%")
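
A per-core view makes the split even clearer: with the fix applied, roughly one core should be busy during inference while the other stays mostly free for the game (sketch using psutil's percpu option):

import psutil

# Sample per-core load for one second while an inference is running
per_core = psutil.cpu_percent(interval=1.0, percpu=True)
print(" | ".join(f"cpu{i}: {load:.0f}%" for i, load in enumerate(per_core)))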

🎯 Recommendations

Immediate (Done ✅)

  • Set n_threads=1
  • Set n_threads_batch=1

High Priority

  • Reduce n_ctx to 2048
  • Increase n_batch to 256
  • Reduce max_tokens (64 for translation, 150 for analysis)

Medium Priority

  • Add CPU monitoring logs
  • Test on different command types
  • Benchmark inference times

Low Priority (Only if still laggy)

  • Set thread priority with os.nice()
  • CPU affinity with sched_setaffinity()
  • Consider even smaller model (0.5B variant)

📊 Performance Targets

| Metric | Target | Acceptable | Critical |
| --- | --- | --- | --- |
| Game FPS | 20 | 18-20 | < 15 ❌ |
| Mouse latency | < 50ms | < 100ms | > 200ms ❌ |
| LLM inference | 10-15s | < 20s | > 30s ❌ |
| Translation time | 5-10s | < 15s | > 20s ❌ |
| Analysis time | 10-15s | < 20s | > 30s ❌ |
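
The FPS and latency targets can be watched from inside the game loop with a tiny watchdog (a sketch; the thresholds mirror the table above and the logger name is illustrative):

import logging
import time

FRAME_BUDGET_S = 0.050   # 20 FPS target
ACCEPTABLE_S = 0.056     # roughly 18 FPS
log = logging.getLogger("rts.perf")

def check_frame_time(frame_start: float) -> None:
    # Call at the end of each game tick; warn when the frame blows its budget
    elapsed = time.perf_counter() - frame_start
    if elapsed > ACCEPTABLE_S:
        log.warning("Slow frame: %.1f ms (budget %.0f ms)",
                    elapsed * 1000, FRAME_BUDGET_S * 1000)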

🚨 If Still Laggy

Option 1: Smaller Model

  • Switch to Qwen2.5-0.5B (even faster)
  • Trade quality for speed

Option 2: Longer Batch

n_batch=512  # Process more at once

Option 3: Limit Concurrent Requests

# Don't allow translation + analysis simultaneously
if self._current_request_id is not None:
    return "Please wait for current inference to complete"

Option 4: CPU Pinning

# Force LLM to CPU 0 only
os.sched_setaffinity(os.getpid(), {0})

Option 5: Reduce Model Precision

# Use Q2_K instead of Q4_0
# Smaller, faster, slightly lower quality
model = "qwen2.5-coder-1.5b-instruct-q2_k.gguf"

๐Ÿ“ Summary

Problem: LLM was potentially using 2 threads (n_threads_batch unset)
Solution: Explicitly set both n_threads=1 and n_threads_batch=1
Result: LLM uses only 1 vCPU; the game gets a dedicated vCPU
Expected: Smooth mouse/unit controls during inference! 🎮


Commit: Added n_threads_batch=1 parameter
Status: Testing required to confirm improvement
Next: Monitor game responsiveness during inference