
🎮 LLM Thread Management on 2 vCPU System

๐Ÿ› Problem Discovered

Symptom: During LLM inference, units struggle to execute mouse orders; controls lag or become unresponsive even though the async request system is in place.

Root Cause: llama-cpp-python has TWO thread parameters:

  1. n_threads - Threads used for token generation
  2. n_threads_batch - Threads used for batch/prompt processing (has its own default when not set!)

Previous Config:

Llama(
    n_threads=1,          # ✅ Set to 1
    # n_threads_batch=?   # ❌ NOT SET
)

BUT: when n_threads_batch is not set explicitly, llama.cpp falls back to its own internal default (in llama-cpp-python this is typically the full CPU count), so inference can still occupy both vCPUs.

🔧 Solution

Explicitly set BOTH parameters to 1:

Llama(
    n_threads=1,          # Token generation: 1 thread
    n_threads_batch=1,    # Batch/prompt processing: 1 thread (CRITICAL!)
    n_batch=128,          # Batch size
)

CPU Allocation:

  • vCPU 0: LLM inference (1 thread total)
  • vCPU 1: Game loop, websockets, async I/O

This ensures the game always has one full vCPU available! 🎯
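
As a small sketch (not project code), the thread count can also be derived from the vCPU count instead of being hard-coded, so the "one full vCPU stays free for the game" rule keeps holding if the Space is ever resized:

import multiprocessing

# Illustrative helper, not the project's config code:
# always leave at least one vCPU free for the game loop.
total_vcpus = multiprocessing.cpu_count()   # 2 on the current Space
llm_threads = max(1, total_vcpus - 1)       # -> 1 thread for the LLM

model_kwargs = {"n_threads": llm_threads, "n_threads_batch": llm_threads}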

📊 HuggingFace Spaces Constraints

Available Resources:

  • 2 vCPUs (shared, not dedicated)
  • 16GB RAM
  • No GPU (CPU-only inference)

Challenges:

  1. CPU-bound LLM: Qwen2.5-Coder-1.5B takes 10-15s per inference
  2. Real-time game: Needs a consistent 20 FPS (50ms per frame); see the sketch after this list
  3. WebSocket server: Needs to respond to user input instantly
  4. Shared system: Other processes may use CPU
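
To keep these constraints compatible, the blocking inference call must stay off the loop that runs game ticks and websocket handlers. A minimal sketch of that pattern (the project uses its own worker thread via _process_requests; this sketch uses asyncio.to_thread for brevity, and the function names are illustrative):

import asyncio
import time

def run_inference(prompt: str) -> str:
    # Stand-in for the blocking llama-cpp-python call (10-15s on CPU)
    time.sleep(0.5)
    return f"result for: {prompt}"

async def game_loop():
    while True:
        frame_start = time.perf_counter()
        # ... update units, handle input, broadcast state ...
        elapsed = time.perf_counter() - frame_start
        await asyncio.sleep(max(0.0, 0.05 - elapsed))  # 20 FPS = 50ms per frame

async def main():
    game = asyncio.create_task(game_loop())
    # Inference runs off the event loop; the game task keeps ticking meanwhile
    result = await asyncio.to_thread(run_inference, "move units to the bridge")
    print(result)
    game.cancel()

asyncio.run(main())

Even with this pattern in place, the worker thread still competes for CPU time, which is why the thread-count fix above matters.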

🎛️ Additional Optimizations

1. Reduce Context Window

n_ctx=4096,  # Current - high memory, slower
n_ctx=2048,  # Optimized - lower memory, faster ✅

Benefit: Faster prompt processing, less memory

2. Increase Batch Size

n_batch=128,   # Current - more frequent updates
n_batch=256,   # Optimized - fewer updates, faster overall ✅

Benefit: Faster generation, less overhead
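
Taken together with the thread settings, the tuned constructor could look like this sketch (the model path is illustrative; the values follow the recommendations above):

from llama_cpp import Llama

model = Llama(
    model_path="models/qwen2.5-coder-1.5b-instruct-q4_0.gguf",  # illustrative path
    n_ctx=2048,          # smaller context: faster prompt processing, less memory
    n_batch=256,         # larger batch: fewer passes over the prompt
    n_threads=1,         # token generation threads
    n_threads_batch=1,   # batch/prompt threads (must also be 1 on 2 vCPUs)
)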

3. Set Thread Priority (OS Level)

import os

# Lower the LLM worker thread's priority so the OS favors the game thread
def _process_requests(self):
    # Raise niceness by 10 (lower priority); Unix-only
    try:
        os.nice(10)
    except (AttributeError, OSError):
        pass  # os.nice unavailable (e.g. Windows) or not permitted

    while not self._stop_worker:
        # ... process requests

Benefit: OS scheduler favors game thread

4. CPU Affinity (Advanced)

import os

# Pin the LLM worker to CPU 0 only (call this from the worker thread)
try:
    os.sched_setaffinity(0, {0})  # 0 = caller, {0} = CPU 0; Linux-only
except (AttributeError, OSError):
    pass

Benefit: Game thread has exclusive access to CPU 1

5. Reduce Token Generation

max_tokens=128,  # Current for translations
max_tokens=64,   # Optimized - shorter responses ✅

max_tokens=200,  # Current for AI analysis
max_tokens=150,  # Optimized - more concise ✅

Benefit: Faster inference, less CPU time
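
A small sketch of how the lowered limits would be passed at the call sites (the messages are placeholders; create_chat_completion is the llama-cpp-python API the worker already uses):

# Translation: short structured output, capped at 64 tokens
translation = model.create_chat_completion(
    messages=[{"role": "user", "content": "Translate: 'send all tanks north'"}],
    max_tokens=64,
)

# Tactical analysis: longer, but still capped at 150 tokens
analysis = model.create_chat_completion(
    messages=[{"role": "user", "content": "Analyze the current battlefield state"}],
    max_tokens=150,
)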

🧪 Testing Strategy

Test 1: Idle Baseline

# No LLM inference
→ Game FPS: 20 ✅
→ Mouse response: Instant ✅

Test 2: During Translation

# User types NL command during inference
→ Game FPS: Should stay 20 ✅
→ Mouse clicks: Should respond immediately ✅
→ Unit movement: Should execute smoothly ✅

Test 3: During AI Analysis

# Game requests tactical analysis
→ Game FPS: Should stay 20 ✅
→ User input: Should respond immediately ✅
→ Combat: Should continue smoothly ✅

Test 4: Concurrent

# Translation + Analysis at same time
→ Game FPS: Should stay 18-20 (slight drop OK) ✅
→ Critical: Mouse/keyboard should keep working! ✅
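
These checks can also be automated with a small harness that times simulated 50ms game ticks while a blocking call runs in a worker thread (sketch only; fake_inference stands in for the real LLM call):

import statistics
import threading
import time

def fake_inference():
    # Stand-in for the blocking create_chat_completion call (10-15s on CPU)
    time.sleep(5)

def measure_tick_lateness(duration_s=5.0):
    # Simulate 20 FPS ticks and record how late each tick fires
    lateness = []
    deadline = time.perf_counter()
    end = deadline + duration_s
    while time.perf_counter() < end:
        deadline += 0.05
        time.sleep(max(0.0, deadline - time.perf_counter()))
        lateness.append(max(0.0, time.perf_counter() - deadline))
    return lateness

worker = threading.Thread(target=fake_inference, daemon=True)
worker.start()
samples = measure_tick_lateness()
print(f"mean tick lateness: {statistics.mean(samples) * 1000:.1f} ms, "
      f"worst: {max(samples) * 1000:.1f} ms")

For end-to-end mouse latency, the same idea applies on the client side: timestamp the click, timestamp the first resulting unit update, and compare against the latency targets further below.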

📈 Expected Improvements

Before Fix

During LLM Inference (n_threads_batch unset, potentially 2+):
├─ LLM uses both vCPUs
├─ Game thread starved
├─ Mouse clicks delayed/lost
└─ Units don't respond to orders ❌

After Fix

During LLM Inference (n_threads=1, n_threads_batch=1):
├─ LLM uses only 1 vCPU
├─ Game has 1 dedicated vCPU
├─ Mouse clicks instant
└─ Units respond immediately ✅

๐Ÿ” Monitoring

Add CPU usage logging:

import psutil
import time

def _process_requests(self):
    while not self._stop_worker:
        # Monitor CPU before inference
        cpu_before = psutil.cpu_percent(interval=0.1)
        
        # Process request
        start = time.time()
        response = self.model.create_chat_completion(...)
        elapsed = time.time() - start
        
        # Monitor CPU after
        cpu_after = psutil.cpu_percent(interval=0.1)
        
        print(f"⚙️ LLM: {elapsed:.1f}s, CPU: {cpu_before:.0f}%→{cpu_after:.0f}%")
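
A per-core view makes the split even clearer: with the fix applied, roughly one core should be busy during inference while the other stays mostly free for the game (sketch using psutil's percpu option):

import psutil

# Sample per-core load for one second while an inference is running
per_core = psutil.cpu_percent(interval=1.0, percpu=True)
print(" | ".join(f"cpu{i}: {load:.0f}%" for i, load in enumerate(per_core)))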

🎯 Recommendations

Immediate (Done ✅)

  • Set n_threads=1
  • Set n_threads_batch=1

High Priority

  • Reduce n_ctx to 2048
  • Increase n_batch to 256
  • Reduce max_tokens (64 for translation, 150 for analysis)

Medium Priority

  • Add CPU monitoring logs
  • Test on different command types
  • Benchmark inference times

Low Priority (Only if still laggy)

  • Set thread priority with os.nice()
  • CPU affinity with sched_setaffinity()
  • Consider even smaller model (0.5B variant)

📊 Performance Targets

| Metric | Target | Acceptable | Critical |
| --- | --- | --- | --- |
| Game FPS | 20 | 18-20 | < 15 ❌ |
| Mouse latency | < 50ms | < 100ms | > 200ms ❌ |
| LLM inference | 10-15s | < 20s | > 30s ❌ |
| Translation time | 5-10s | < 15s | > 20s ❌ |
| Analysis time | 10-15s | < 20s | > 30s ❌ |
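
The FPS and latency targets can be watched from inside the game loop with a tiny watchdog (a sketch; the thresholds mirror the table above and the logger name is illustrative):

import logging
import time

FRAME_BUDGET_S = 0.050   # 20 FPS target
ACCEPTABLE_S = 0.056     # roughly 18 FPS
log = logging.getLogger("rts.perf")

def check_frame_time(frame_start: float) -> None:
    # Call at the end of each game tick; warn when the frame blows its budget
    elapsed = time.perf_counter() - frame_start
    if elapsed > ACCEPTABLE_S:
        log.warning("Slow frame: %.1f ms (budget %.0f ms)",
                    elapsed * 1000, FRAME_BUDGET_S * 1000)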

🚨 If Still Laggy

Option 1: Smaller Model

  • Switch to Qwen2.5-0.5B (even faster)
  • Trade quality for speed

Option 2: Longer Batch

n_batch=512  # Process more at once

Option 3: Limit Concurrent Requests

# Don't allow translation + analysis simultaneously
if self._current_request_id is not None:
    return "Please wait for current inference to complete"

Option 4: CPU Pinning

# Force LLM to CPU 0 only
os.sched_setaffinity(os.getpid(), {0})

Option 5: Reduce Model Precision

# Use Q2_K instead of Q4_0
# Smaller, faster, slightly lower quality
model = "qwen2.5-coder-1.5b-instruct-q2_k.gguf"

๐Ÿ“ Summary

Problem: LLM was potentially using 2 threads (n_threads_batch unset)
Solution: Explicitly set both n_threads=1 and n_threads_batch=1
Result: LLM uses only 1 vCPU; the game gets a dedicated vCPU
Expected: Smooth mouse/unit controls during inference! 🎮


Commit: Added n_threads_batch=1 parameter
Status: Testing required to confirm improvement
Next: Monitor game responsiveness during inference