# 🎮 LLM Thread Management on a 2 vCPU System

## 🐛 Problem Discovered

**Symptom:** During LLM inference, game units respond poorly to mouse orders - controls lag or go unresponsive even though the async system is in place.

**Root Cause:** llama-cpp-python has **TWO thread parameters**:

1. `n_threads` - Threads used for token generation
2. `n_threads_batch` - Threads used for batch/prompt processing (**falls back to the library's own default - typically all available cores - if not set!**)

**Previous Config:**

```python
Llama(
    n_threads=1,         # ✅ Set to 1
    # n_threads_batch=?  # ❌ NOT SET → falls back to the library default (can be > 1)
)
```

**BUT** - because `n_threads_batch` was never set explicitly, llama-cpp-python applied its own internal default, which can be higher than `n_threads` and spill onto the second vCPU during prompt processing.

## 🔧 Solution

**Explicitly set BOTH parameters to 1:**

```python
Llama(
    n_threads=1,        # Token generation: 1 thread
    n_threads_batch=1,  # Prompt/batch processing: 1 thread (CRITICAL!)
    n_batch=128,        # Batch size
)
```

**CPU Allocation:**

- **vCPU 0**: LLM inference (1 thread total)
- **vCPU 1**: Game loop, websockets, async I/O

This keeps one full vCPU available to the game at all times! 🎯

## 📊 HuggingFace Spaces Constraints

**Available Resources:**

- **2 vCPUs** (shared, not dedicated)
- **16GB RAM**
- **No GPU** (CPU-only inference)

**Challenges:**

1. **CPU-bound LLM**: Qwen2.5-Coder-1.5B takes 10-15s per inference
2. **Real-time game**: Needs a consistent 20 FPS (50ms per frame)
3. **WebSocket server**: Must respond to user input instantly
4. **Shared system**: Other processes may use the CPU

## 🎛️ Additional Optimizations

### 1. Reduce Context Window

```python
n_ctx=4096,  # Current - higher memory use, slower
n_ctx=2048,  # Optimized - lower memory use, faster ✅
```

**Benefit:** Faster prompt processing, less memory

### 2. Increase Batch Size

```python
n_batch=128,  # Current - smaller batches, more per-batch overhead
n_batch=256,  # Optimized - fewer, larger batches, faster overall ✅
```

**Benefit:** Faster prompt processing, less per-batch overhead

### 3. Set Thread Priority (OS Level)

```python
import os

# Lower the LLM worker thread's priority so the scheduler favors the game
def _process_requests(self):
    # Set low priority (nice value 10-19)
    try:
        os.nice(10)  # Lower priority for this worker
    except OSError:
        pass
    while not self._stop_worker:
        ...  # process queued requests
```

**Benefit:** OS scheduler favors the game thread

### 4. CPU Affinity (Advanced)

```python
import os

# Pin the LLM worker thread to CPU 0 only
# (call from inside the worker thread; on Linux, pid 0 means the calling thread)
try:
    os.sched_setaffinity(0, {0})  # Use only CPU 0
except (AttributeError, OSError):
    pass  # not available on all platforms
```

**Benefit:** The game thread gets effectively exclusive access to CPU 1

### 5. Reduce Token Generation

```python
max_tokens=128,  # Current for translations
max_tokens=64,   # Optimized - shorter responses ✅

max_tokens=200,  # Current for AI analysis
max_tokens=150,  # Optimized - more concise ✅
```

**Benefit:** Faster inference, less CPU time

## 🧪 Testing Strategy

### Test 1: Idle Baseline

```bash
# No LLM inference
→ Game FPS: 20 ✅
→ Mouse response: Instant ✅
```

### Test 2: During Translation

```bash
# User types an NL command during inference
→ Game FPS: Should stay at 20 ✅
→ Mouse clicks: Should respond immediately ✅
→ Unit movement: Should execute smoothly ✅
```

### Test 3: During AI Analysis

```bash
# Game requests a tactical analysis
→ Game FPS: Should stay at 20 ✅
→ User input: Should respond immediately ✅
→ Combat: Should continue smoothly ✅
```

### Test 4: Concurrent

```bash
# Translation + analysis at the same time
→ Game FPS: Should stay at 18-20 (slight drop OK) ✅
→ Critical: Mouse/keyboard should keep working! ✅
```
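All four tests exercise the single-worker queue that the snippets in this document (`_process_requests`, `_stop_worker`, `_current_request_id`) assume. For reference, a minimal sketch of that pattern - the class name, `submit`, and the callback signature are illustrative placeholders, not the project's actual code:

```python
import queue
import threading


class LLMWorker:
    """Minimal sketch of the single-worker queue assumed in this document.

    `_process_requests`, `_stop_worker` and `_current_request_id` match the
    other snippets here; `submit` and the callback signature are illustrative.
    """

    def __init__(self, model):
        self.model = model                 # llama_cpp.Llama instance
        self._requests = queue.Queue()     # holds (request_id, messages, callback)
        self._stop_worker = False
        self._current_request_id = None
        self._worker = threading.Thread(target=self._process_requests, daemon=True)
        self._worker.start()

    def submit(self, request_id, messages, callback):
        """Called from the game/websocket side; returns immediately."""
        self._requests.put((request_id, messages, callback))

    def _process_requests(self):
        # Single background thread: with n_threads=1 / n_threads_batch=1,
        # at most one vCPU is busy with inference at any time.
        while not self._stop_worker:
            try:
                request_id, messages, callback = self._requests.get(timeout=0.5)
            except queue.Empty:
                continue
            self._current_request_id = request_id
            response = self.model.create_chat_completion(messages=messages, max_tokens=64)
            callback(request_id, response)
            self._current_request_id = None

    def stop(self):
        self._stop_worker = True
        self._worker.join()
```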
## 📈 Expected Improvements

### Before Fix

```
During LLM inference (n_threads_batch unset, potentially 2+ threads):
├─ LLM uses both vCPUs
├─ Game thread starved
├─ Mouse clicks delayed/lost
└─ Units don't respond to orders ❌
```

### After Fix

```
During LLM inference (n_threads=1, n_threads_batch=1):
├─ LLM uses only 1 vCPU
├─ Game has 1 dedicated vCPU
├─ Mouse clicks instant
└─ Units respond immediately ✅
```

## 🔍 Monitoring

**Add CPU usage logging:**

```python
import time

import psutil


def _process_requests(self):
    while not self._stop_worker:
        # Sample CPU load before inference
        cpu_before = psutil.cpu_percent(interval=0.1)

        # Process the request
        start = time.time()
        response = self.model.create_chat_completion(...)
        elapsed = time.time() - start

        # Sample CPU load after inference
        cpu_after = psutil.cpu_percent(interval=0.1)

        print(f"⚙️ LLM: {elapsed:.1f}s, CPU: {cpu_before:.0f}%→{cpu_after:.0f}%")
```

## 🎯 Recommendations

### Immediate (Done ✅)

- [x] Set `n_threads=1`
- [x] Set `n_threads_batch=1`

### High Priority

- [ ] Reduce `n_ctx` to 2048
- [ ] Increase `n_batch` to 256
- [ ] Reduce `max_tokens` (64 for translation, 150 for analysis)

### Medium Priority

- [ ] Add CPU monitoring logs
- [ ] Test on different command types
- [ ] Benchmark inference times

### Low Priority (Only if still laggy)

- [ ] Set thread priority with `os.nice()`
- [ ] Set CPU affinity with `sched_setaffinity()`
- [ ] Consider an even smaller model (0.5B variant)

## 📊 Performance Targets

| Metric | Target | Acceptable | Critical |
|--------|--------|------------|----------|
| **Game FPS** | 20 | 18-20 | < 15 ❌ |
| **Mouse latency** | < 50ms | < 100ms | > 200ms ❌ |
| **LLM inference** | 10-15s | < 20s | > 30s ❌ |
| **Translation time** | 5-10s | < 15s | > 20s ❌ |
| **Analysis time** | 10-15s | < 20s | > 30s ❌ |

## 🚨 If Still Laggy

**Option 1: Smaller Model**

- Switch to Qwen2.5-0.5B (even faster)
- Trades quality for speed

**Option 2: Larger Batch**

```python
n_batch=512  # Process more tokens per batch
```

**Option 3: Limit Concurrent Requests**

```python
# Don't allow translation + analysis simultaneously
if self._current_request_id is not None:
    return "Please wait for the current inference to complete"
```

**Option 4: CPU Pinning**

```python
import os

# Pin the LLM worker to CPU 0 only
# (call from inside the worker thread; pid 0 = calling thread on Linux)
os.sched_setaffinity(0, {0})
```

**Option 5: Reduce Model Precision**

```python
# Use Q2_K instead of Q4_0
# Smaller and faster, at lower output quality
model = "qwen2.5-coder-1.5b-instruct-q2_k.gguf"
```

## 📝 Summary

**Problem:** With `n_threads_batch` unset, the LLM could use more than one thread during prompt processing.

**Solution:** Explicitly set both `n_threads=1` and `n_threads_batch=1`.

**Result:** The LLM uses only 1 vCPU, leaving the game a dedicated vCPU.

**Expected:** Smooth mouse/unit controls during inference! 🎮

---

**Commit:** Added `n_threads_batch=1` parameter

**Status:** Testing required to confirm improvement

**Next:** Monitor game responsiveness during inference
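For quick reference, a consolidated constructor call with the immediate fix plus the high-priority recommendations applied - a sketch only; the model path and the example prompt are placeholders:

```python
from llama_cpp import Llama

# Sketch of the recommended settings combined (placeholder model path).
llm = Llama(
    model_path="models/qwen2.5-coder-1.5b-instruct-q4_0.gguf",  # placeholder path
    n_ctx=2048,          # reduced context window
    n_batch=256,         # larger prompt-processing batch
    n_threads=1,         # token generation: 1 thread
    n_threads_batch=1,   # prompt/batch processing: 1 thread
)

# Shorter completions per the max_tokens recommendation (example prompt only).
result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Move the scouts to the north bridge"}],
    max_tokens=64,
)
```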