LLM Thread Management on a 2 vCPU System
Problem Discovered
Symptom: During LLM inference, game units respond sluggishly to mouse orders; controls lag or become unresponsive even though the async inference system is in place.
Root Cause: llama-cpp-python has TWO thread parameters:
- `n_threads` - threads used for token generation
- `n_threads_batch` - threads used for prompt/batch processing (falls back to its own internal default when not set!)
Previous Config:
```python
Llama(
    n_threads=1,          # explicitly set to 1
    # n_threads_batch=?   # NOT SET -> falls back to an internal default, not 1
)
```
BUT - when `n_threads_batch` is not explicitly set, llama.cpp falls back to an internal default (derived from the machine's CPU count), which can be higher than `n_threads`!
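One way to confirm that extra compute threads are actually being spawned is to count the process's OS threads while an inference runs. This is a minimal sketch using psutil (which the monitoring section below already relies on); when and how the inference is triggered is up to the caller:

```python
import time
import psutil

def log_thread_count(tag: str) -> None:
    # Number of OS threads currently owned by this process
    print(f"{tag}: {psutil.Process().num_threads()} threads")

log_thread_count("idle")
# ...start an inference on the worker thread, then sample again while it runs...
time.sleep(1.0)
log_thread_count("during inference")
```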
Solution
Explicitly set BOTH parameters to 1:
```python
Llama(
    n_threads=1,        # Token generation: 1 thread
    n_threads_batch=1,  # Prompt/batch processing: 1 thread (CRITICAL!)
    n_batch=128,        # Batch size
)
```
CPU Allocation:
- vCPU 0: LLM inference (1 thread total)
- vCPU 1: Game loop, websockets, async I/O
This ensures the game always has one full vCPU available.
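For context, the async system mentioned above amounts to a single worker thread that owns the model and drains a request queue, so the game loop and websocket handlers never block on inference. The sketch below illustrates that pattern; the class name, queue layout, and callback signature are placeholders, not the project's actual code:

```python
import queue
import threading

class LLMWorker:
    def __init__(self, model):
        self.model = model                 # llama_cpp.Llama configured as above
        self._requests = queue.Queue()     # holds (messages, callback) pairs
        self._stop_worker = False
        self._thread = threading.Thread(target=self._process_requests, daemon=True)
        self._thread.start()

    def submit(self, messages, callback):
        # Called from the game/websocket side; returns immediately
        self._requests.put((messages, callback))

    def _process_requests(self):
        while not self._stop_worker:
            try:
                messages, callback = self._requests.get(timeout=0.5)
            except queue.Empty:
                continue
            # Inference happens here, on the single LLM thread (vCPU 0)
            response = self.model.create_chat_completion(messages=messages, max_tokens=64)
            callback(response)
```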
HuggingFace Spaces Constraints
Available Resources:
- 2 vCPUs (shared, not dedicated)
- 16GB RAM
- No GPU (CPU-only inference)
Challenges:
- CPU-bound LLM: Qwen2.5-Coder-1.5B takes 10-15s per inference
- Real-time game: Needs consistent 20 FPS (50ms per frame)
- WebSocket server: Needs to respond to user input instantly
- Shared system: Other processes may use CPU
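On the asyncio side, the key point is that inference must never run on the event loop thread, or the websocket server and game loop stall for the full 10-15s. If a queue-based worker is not used, a minimal alternative is a single-worker executor; the function and variable names below are illustrative only:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# A single worker keeps inference serialized and off the event loop thread
llm_executor = ThreadPoolExecutor(max_workers=1)

async def translate_command(llm, text: str) -> str:
    loop = asyncio.get_running_loop()
    result = await loop.run_in_executor(
        llm_executor,
        lambda: llm.create_chat_completion(
            messages=[{"role": "user", "content": text}],
            max_tokens=64,
        ),
    )
    return result["choices"][0]["message"]["content"]
```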
Additional Optimizations
1. Reduce Context Window
```python
n_ctx=4096,  # current: higher memory use, slower prompt processing
n_ctx=2048,  # optimized: lower memory use, faster
```
Benefit: Faster prompt processing, less memory
2. Increase Batch Size
```python
n_batch=128,  # current: smaller batches, more frequent updates
n_batch=256,  # optimized: fewer updates, faster overall
```
Benefit: Faster generation, less overhead
3. Set Thread Priority (OS Level)
```python
import os

def _process_requests(self):
    # Lower the LLM worker thread's priority (nice value 10-19)
    # so the OS scheduler favors the game thread.
    try:
        os.nice(10)  # applies to the calling thread on Linux
    except OSError:
        pass
    while not self._stop_worker:
        ...  # process queued requests
```
Benefit: OS scheduler favors game thread
4. CPU Affinity (Advanced)
```python
import os

# Pin the LLM worker thread to CPU 0 only.
# pid 0 refers to the calling thread on Linux, so call this from the worker.
try:
    os.sched_setaffinity(0, {0})
except OSError:
    pass
```
Benefit: Keeps the LLM off vCPU 1, leaving it free for the game thread
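The complementary step, not covered above, is pinning the game's main thread to the other core. A hedged sketch, assuming the game loop runs in the main thread on Linux:

```python
import os

# Call from the game's main thread at startup: keep it on vCPU 1.
# pid 0 means "the calling thread" on Linux.
try:
    os.sched_setaffinity(0, {1})
except OSError:
    pass
```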
5. Reduce Token Generation
```python
max_tokens=128,  # current for translations
max_tokens=64,   # optimized: shorter responses
max_tokens=200,  # current for AI analysis
max_tokens=150,  # optimized: more concise
```
Benefit: Faster inference, less CPU time
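These limits are per-request parameters rather than constructor arguments; a minimal sketch of how they would be passed, assuming the worker calls create_chat_completion directly:

```python
# Translation: short, structured output is enough
translation = self.model.create_chat_completion(messages=messages, max_tokens=64)

# Tactical analysis: allow a longer but still bounded response
analysis = self.model.create_chat_completion(messages=messages, max_tokens=150)
```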
Testing Strategy
Test 1: Idle Baseline
No LLM inference running.
- Game FPS: 20
- Mouse response: instant
Test 2: During Translation
User types a natural-language command during inference.
- Game FPS: should stay at 20
- Mouse clicks: should respond immediately
- Unit movement: should execute smoothly
Test 3: During AI Analysis
Game requests a tactical analysis.
- Game FPS: should stay at 20
- User input: should respond immediately
- Combat: should continue smoothly
Test 4: Concurrent
Translation and analysis running at the same time.
- Game FPS: should stay at 18-20 (a slight drop is acceptable)
- Critical: mouse and keyboard input must keep working
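To make the FPS checks above measurable rather than eyeballed, a simple frame-time probe inside the game loop is enough. This is a sketch, not the project's actual loop; the run_frame callable and 50 ms budget are assumptions:

```python
import time

def run_game_loop(run_frame, target_dt: float = 0.05):  # 20 FPS -> 50 ms frame budget
    worst = 0.0
    while True:
        start = time.perf_counter()
        run_frame()                      # one tick of game logic and network I/O
        dt = time.perf_counter() - start
        worst = max(worst, dt)
        if dt > target_dt:
            print(f"Slow frame: {dt * 1000:.0f} ms (worst so far: {worst * 1000:.0f} ms)")
        time.sleep(max(0.0, target_dt - dt))
```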
Expected Improvements
Before Fix
During LLM inference (`n_threads_batch` unset, potentially 2+ threads):
- LLM uses both vCPUs
- Game thread is starved
- Mouse clicks are delayed or lost
- Units don't respond to orders
After Fix
During LLM inference (`n_threads=1`, `n_threads_batch=1`):
- LLM uses only 1 vCPU
- Game has 1 dedicated vCPU
- Mouse clicks are instant
- Units respond immediately
Monitoring
Add CPU usage logging:
```python
import time
import psutil

def _process_requests(self):
    while not self._stop_worker:
        # Monitor CPU usage before inference
        cpu_before = psutil.cpu_percent(interval=0.1)

        # Process the request
        start = time.time()
        response = self.model.create_chat_completion(...)
        elapsed = time.time() - start

        # Monitor CPU usage after inference
        cpu_after = psutil.cpu_percent(interval=0.1)
        print(f"LLM: {elapsed:.1f}s, CPU: {cpu_before:.0f}% -> {cpu_after:.0f}%")
```
Recommendations
Immediate (Done)
- Set `n_threads=1`
- Set `n_threads_batch=1`
High Priority
- Reduce `n_ctx` to 2048
- Increase `n_batch` to 256
- Reduce `max_tokens` (64 for translation, 150 for analysis); see the combined sketch after this list
Medium Priority
- Add CPU monitoring logs
- Test on different command types
- Benchmark inference times
Low Priority (Only if still laggy)
- Set thread priority with `os.nice()`
- Set CPU affinity with `sched_setaffinity()`
- Consider an even smaller model (0.5B variant)
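Taken together, the immediate and high-priority items amount to a constructor call along these lines; the model path is a placeholder, and `max_tokens` is set per request rather than here:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/qwen2.5-coder-1.5b-instruct-q4_0.gguf",  # placeholder path
    n_ctx=2048,         # reduced context window
    n_batch=256,        # larger batch for faster prompt processing
    n_threads=1,        # token generation: 1 thread
    n_threads_batch=1,  # prompt/batch processing: 1 thread
)
```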
Performance Targets
| Metric | Target | Acceptable | Critical |
|---|---|---|---|
| Game FPS | 20 | 18-20 | < 15 |
| Mouse latency | < 50 ms | < 100 ms | > 200 ms |
| LLM inference | 10-15 s | < 20 s | > 30 s |
| Translation time | 5-10 s | < 15 s | > 20 s |
| Analysis time | 10-15 s | < 20 s | > 30 s |
If Still Laggy
Option 1: Smaller Model
- Switch to Qwen2.5-0.5B (even faster)
- Trade quality for speed
Option 2: Larger Batch
```python
n_batch=512  # process more tokens per batch
```
Option 3: Limit Concurrent Requests
```python
# Don't allow translation and analysis to run simultaneously
if self._current_request_id is not None:
    return "Please wait for the current inference to complete"
```
Option 4: CPU Pinning
```python
import os

# Pin the LLM worker thread to CPU 0 only (call from inside the worker thread;
# pid 0 refers to the calling thread on Linux)
os.sched_setaffinity(0, {0})
```
Option 5: Reduce Model Precision
```python
# Use a Q2_K quantization instead of Q4_0: smaller and faster, slightly lower quality
model = "qwen2.5-coder-1.5b-instruct-q2_k.gguf"
```
Summary
Problem: The LLM was potentially using 2+ threads because `n_threads_batch` was unset.
Solution: Explicitly set both `n_threads=1` and `n_threads_batch=1`.
Result: The LLM uses only 1 vCPU; the game gets a dedicated vCPU.
Expected: Smooth mouse and unit controls during inference.
Commit: Added the `n_threads_batch=1` parameter.
Status: Testing required to confirm the improvement.
Next: Monitor game responsiveness during inference.