# 🎮 LLM Thread Management on 2 vCPU System
## 🐛 Problem Discovered
**Symptom:**
During LLM inference, units are slow to execute mouse orders - controls lag or become unresponsive even though the async inference system is in place.
**Root Cause:**
llama-cpp-python has **TWO thread parameters**:
1. `n_threads` - Threads for token generation
2. `n_threads_batch` - Threads for prompt/batch processing (**defaults to all available cores if not set!**)
**Previous Config:**
```python
Llama(
    n_threads=1,        # ✅ Set to 1
    # n_threads_batch=? # ❌ NOT SET → defaults to all CPU cores (2 here)
)
```
**BUT** - when `n_threads_batch` is not explicitly set, llama-cpp-python falls back to the full CPU count, so batch/prompt processing can grab both vCPUs!
## ๐Ÿ”ง Solution
**Explicitly set BOTH parameters to 1:**
```python
Llama(
    n_threads=1,        # Token generation: 1 thread
    n_threads_batch=1,  # Prompt/batch processing: 1 thread (CRITICAL!)
    n_batch=128,        # Batch size
)
```
**CPU Allocation:**
- **vCPU 0**: LLM inference (1 thread total)
- **vCPU 1**: Game loop, websockets, async I/O
This ensures the game always has one full vCPU available! 🎯
## 📊 HuggingFace Spaces Constraints
**Available Resources:**
- **2 vCPUs** (shared, not dedicated)
- **16GB RAM**
- **No GPU** (CPU-only inference)
**Challenges:**
1. **CPU-bound LLM**: Qwen2.5-Coder-1.5B takes 10-15s per inference
2. **Real-time game**: Needs consistent 20 FPS (50ms per frame)
3. **WebSocket server**: Needs to respond to user input instantly
4. **Shared system**: Other processes may use CPU
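Because a single inference blocks for 10-15s, the async system only helps if the llama-cpp-python call never runs on the event loop itself. Below is a minimal sketch of that pattern, assuming an asyncio-based server; `translate_command` and `self.model` are illustrative names, not necessarily the project's actual API:
```python
import asyncio

async def translate_command(self, text: str) -> str:
    """Run the blocking llama-cpp-python call on a worker thread (sketch)."""
    def _infer() -> str:
        result = self.model.create_chat_completion(
            messages=[{"role": "user", "content": text}],
            max_tokens=64,
        )
        return result["choices"][0]["message"]["content"]

    # asyncio.to_thread keeps the 20 FPS game loop and websocket handlers
    # responsive while the single LLM thread churns away.
    return await asyncio.to_thread(_infer)
```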
## 🎛️ Additional Optimizations
### 1. Reduce Context Window
```python
n_ctx=4096, # Current - high memory, slower
n_ctx=2048, # Optimized - lower memory, faster ✅
```
**Benefit:** Faster prompt processing, less memory
### 2. Increase Batch Size
```python
n_batch=128, # Current - more frequent updates
n_batch=256, # Optimized - fewer updates, faster overall ✅
```
**Benefit:** Faster generation, less overhead
### 3. Set Thread Priority (OS Level)
```python
import os

# Lower the LLM worker thread's priority so the scheduler favors the game thread
def _process_requests(self):
    # Set a low priority (nice value 10-19)
    try:
        os.nice(10)  # Higher nice value = lower scheduling priority
    except OSError:
        pass  # Not permitted / not supported on this platform
    while not self._stop_worker:
        ...  # process requests
```
**Benefit:** OS scheduler favors game thread
### 4. CPU Affinity (Advanced)
```python
import os

# Pin the calling (LLM worker) thread to CPU 0 only
try:
    os.sched_setaffinity(0, {0})  # pid 0 = the calling thread; use only CPU 0
except (AttributeError, OSError):
    pass  # sched_setaffinity is only available on Linux
```
**Benefit:** Game thread has exclusive access to CPU 1
### 5. Reduce Token Generation
```python
max_tokens=128, # Current for translations
max_tokens=64, # Optimized - shorter responses ✅
max_tokens=200, # Current for AI analysis
max_tokens=150, # Optimized - more concise ✅
```
**Benefit:** Faster inference, less CPU time
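Note that `max_tokens` is a per-request argument rather than a `Llama()` constructor option, so translation and analysis can use different budgets. A hedged sketch (variable names are placeholders):
```python
# Shorter budget for NL → command translation
translation = self.model.create_chat_completion(
    messages=[{"role": "user", "content": user_command}],
    max_tokens=64,    # was 128
)

# Slightly larger budget for tactical analysis
analysis = self.model.create_chat_completion(
    messages=[{"role": "user", "content": analysis_prompt}],
    max_tokens=150,   # was 200
)
```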
## 🧪 Testing Strategy
### Test 1: Idle Baseline
```bash
# No LLM inference
→ Game FPS: 20 ✅
→ Mouse response: Instant ✅
```
### Test 2: During Translation
```bash
# User types NL command during inference
→ Game FPS: Should stay 20 ✅
→ Mouse clicks: Should respond immediately ✅
→ Unit movement: Should execute smoothly ✅
```
### Test 3: During AI Analysis
```bash
# Game requests tactical analysis
→ Game FPS: Should stay 20 ✅
→ User input: Should respond immediately ✅
→ Combat: Should continue smoothly ✅
```
### Test 4: Concurrent
```bash
# Translation + Analysis at same time
→ Game FPS: Should stay 18-20 (slight drop ok) ✅
→ Critical: Mouse/keyboard should work! ✅
```
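To turn these checks into numbers rather than eyeballed impressions, a rough probe like the sketch below can log frame times while one request is in flight; `game.tick()` and `llm.translate_command()` are placeholders for the project's actual loop and LLM client:
```python
import asyncio
import time

async def fps_probe(game, llm):
    """Log frame times while a single LLM request is in flight (sketch)."""
    task = asyncio.create_task(llm.translate_command("move all tanks north"))
    frame_times = []
    while not task.done():
        start = time.perf_counter()
        game.tick()                # one simulation step (placeholder)
        await asyncio.sleep(0.05)  # 20 FPS budget
        frame_times.append(time.perf_counter() - start)
    if frame_times:
        avg_ms = sum(frame_times) / len(frame_times) * 1000
        worst_ms = max(frame_times) * 1000
        print(f"frames: {len(frame_times)}, avg: {avg_ms:.0f} ms, worst: {worst_ms:.0f} ms")
```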
## 📈 Expected Improvements
### Before Fix
```
During LLM Inference (n_threads_batch unset, potentially 2+):
├─ LLM uses both vCPUs
├─ Game thread starved
├─ Mouse clicks delayed/lost
└─ Units don't respond to orders ❌
```
### After Fix
```
During LLM Inference (n_threads=1, n_threads_batch=1):
├─ LLM uses only 1 vCPU
├─ Game has 1 dedicated vCPU
├─ Mouse clicks instant
└─ Units respond immediately ✅
```
## 🔍 Monitoring
**Add CPU usage logging:**
```python
import psutil
import time
def _process_requests(self):
    while not self._stop_worker:
        # Monitor CPU before inference
        cpu_before = psutil.cpu_percent(interval=0.1)

        # Process request
        start = time.time()
        response = self.model.create_chat_completion(...)
        elapsed = time.time() - start

        # Monitor CPU after
        cpu_after = psutil.cpu_percent(interval=0.1)

        print(f"⚙️ LLM: {elapsed:.1f}s, CPU: {cpu_before:.0f}%→{cpu_after:.0f}%")
```
## 🎯 Recommendations
### Immediate (Done ✅)
- [x] Set `n_threads=1`
- [x] Set `n_threads_batch=1`
### High Priority
- [ ] Reduce `n_ctx` to 2048
- [ ] Increase `n_batch` to 256
- [ ] Reduce `max_tokens` (64 for translation, 150 for analysis)
### Medium Priority
- [ ] Add CPU monitoring logs
- [ ] Test on different command types
- [ ] Benchmark inference times
### Low Priority (Only if still laggy)
- [ ] Set thread priority with `os.nice()`
- [ ] CPU affinity with `sched_setaffinity()`
- [ ] Consider even smaller model (0.5B variant)
## 📊 Performance Targets
| Metric | Target | Acceptable | Critical |
|--------|--------|------------|----------|
| **Game FPS** | 20 | 18-20 | < 15 ❌ |
| **Mouse latency** | < 50ms | < 100ms | > 200ms ❌ |
| **LLM inference** | 10-15s | < 20s | > 30s ❌ |
| **Translation time** | 5-10s | < 15s | > 20s ❌ |
| **Analysis time** | 10-15s | < 20s | > 30s ❌ |
## 🚨 If Still Laggy
**Option 1: Smaller Model**
- Switch to Qwen2.5-0.5B (even faster)
- Trade quality for speed
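A sketch of the swap, assuming the 0.5B model is shipped as a GGUF and loaded the same way as the current model; the filename below is illustrative, not a confirmed asset:
```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/qwen2.5-coder-0.5b-instruct-q4_0.gguf",  # illustrative path
    n_ctx=2048,
    n_threads=1,
    n_threads_batch=1,
    n_batch=256,
)
```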
**Option 2: Longer Batch**
```python
n_batch=512 # Process more at once
```
**Option 3: Limit Concurrent Requests**
```python
# Don't allow translation + analysis simultaneously
if self._current_request_id is not None:
    return "Please wait for current inference to complete"
```
**Option 4: CPU Pinning**
```python
# Force the LLM worker onto CPU 0 only - call from inside the worker thread
os.sched_setaffinity(0, {0})  # pid 0 = the calling thread
```
**Option 5: Reduce Model Precision**
```python
# Use Q2_K instead of Q4_0
# Smaller, faster, slightly lower quality
model = "qwen2.5-coder-1.5b-instruct-q2_k.gguf"
```
## 📝 Summary
**Problem:** LLM was potentially using 2 threads (`n_threads_batch` unset)
**Solution:** Explicitly set both `n_threads=1` and `n_threads_batch=1`
**Result:** LLM uses only 1 vCPU, game gets dedicated 1 vCPU
**Expected:** Smooth mouse/unit controls during inference! 🎮
---
**Commit:** Added `n_threads_batch=1` parameter
**Status:** Testing required to confirm improvement
**Next:** Monitor game responsiveness during inference