# LLM Thread Management on a 2 vCPU System

## Problem Discovered

**Symptom:**

During LLM inference, game units respond sluggishly to mouse orders: controls lag or drop input even though the async request system is in place.

**Root Cause:**

llama-cpp-python has **two thread parameters**:

1. `n_threads` - threads for prompt processing
2. `n_threads_batch` - threads for batch/token generation (**falls back to a library default if not set!**)

**Previous Config:**

```python
Llama(
    n_threads=1,          # set to 1
    # n_threads_batch=?   # NOT SET -> library default applies
)
```

**BUT** - when `n_threads_batch` is not explicitly set, llama.cpp uses an internal default that can be higher than 1 on a multi-core host.
## Solution

**Explicitly set BOTH parameters to 1:**

```python
Llama(
    n_threads=1,        # prompt processing: 1 thread
    n_threads_batch=1,  # token generation: 1 thread (CRITICAL!)
    n_batch=128,        # batch size
)
```

**CPU Allocation:**

- **vCPU 0**: LLM inference (1 thread total)
- **vCPU 1**: game loop, websockets, async I/O

This ensures the game always has one full vCPU available.
## HuggingFace Spaces Constraints

**Available Resources:**

- **2 vCPUs** (shared, not dedicated)
- **16GB RAM**
- **No GPU** (CPU-only inference)
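It is worth sanity-checking these limits from inside the running Space, since shared hosts sometimes advertise more logical CPUs than the container can actually use. A minimal check (`psutil` is assumed available, as in the monitoring section below):

```python
import os

import psutil

# Logical CPUs visible to the OS vs. the CPUs this process may actually be scheduled on.
print("os.cpu_count():  ", os.cpu_count())
print("usable CPUs:     ", len(os.sched_getaffinity(0)))

# Total memory visible to the container, in GiB.
print("total RAM (GiB): ", psutil.virtual_memory().total / 2**30)
```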
**Challenges:**

1. **CPU-bound LLM**: Qwen2.5-Coder-1.5B takes 10-15s per inference
2. **Real-time game**: needs a consistent 20 FPS (50ms per frame)
3. **WebSocket server**: must respond to user input instantly (see the worker sketch after this list)
4. **Shared system**: other processes may compete for CPU
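Challenges 2 and 3 are what the async system addresses: all inference runs on a single background worker thread fed by a queue, so the asyncio game loop and WebSocket handlers never block. Below is a minimal sketch of that pattern; `_process_requests` and `_stop_worker` match names used elsewhere in this document, while the class and the remaining names are assumptions.

```python
import queue
import threading
from concurrent.futures import Future

class LLMWorker:
    """Single background thread that serializes all LLM calls (illustrative sketch)."""

    def __init__(self, model):
        self.model = model                  # loaded llama_cpp.Llama instance
        self._requests = queue.Queue()
        self._stop_worker = False
        self._thread = threading.Thread(target=self._process_requests, daemon=True)
        self._thread.start()

    def submit(self, messages, **kwargs) -> Future:
        """Called from the asyncio side; returns immediately with a Future."""
        fut = Future()
        self._requests.put((messages, kwargs, fut))
        return fut

    def _process_requests(self):
        while not self._stop_worker:
            try:
                messages, kwargs, fut = self._requests.get(timeout=0.5)
            except queue.Empty:
                continue
            try:
                fut.set_result(self.model.create_chat_completion(messages=messages, **kwargs))
            except Exception as exc:
                fut.set_exception(exc)

    def stop(self):
        self._stop_worker = True
        self._thread.join()
```

From the asyncio side, `await asyncio.wrap_future(worker.submit(messages, max_tokens=64))` waits for the result without blocking the frame loop.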
## Additional Optimizations

### 1. Reduce Context Window

```python
n_ctx=4096,  # Current - higher memory use, slower prompt processing
n_ctx=2048,  # Optimized - lower memory use, faster
```

**Benefit:** Faster prompt processing, less memory
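A smaller window only works if prompts actually fit. A quick guard, assuming `llm` is the loaded `llama_cpp.Llama` instance (the helper name is hypothetical):

```python
def fits_in_context(llm, prompt: str, n_ctx: int = 2048, max_tokens: int = 150) -> bool:
    """Return True if the prompt plus the generation budget fits in n_ctx."""
    n_prompt_tokens = len(llm.tokenize(prompt.encode("utf-8")))
    return n_prompt_tokens + max_tokens <= n_ctx
```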
### 2. Increase Batch Size

`n_batch` controls how many prompt tokens are evaluated per batch, so larger values mean fewer evaluation passes:

```python
n_batch=128,  # Current - more passes, more per-call overhead
n_batch=256,  # Optimized - fewer passes, faster prompt processing
```

**Benefit:** Faster prompt processing, less overhead
### 3. Set Thread Priority (OS Level)

```python
import os

# Lower the LLM worker thread's priority. On Linux the nice value is
# effectively per-thread, so calling os.nice() from inside the worker
# deprioritizes only that thread.
def _process_requests(self):
    try:
        os.nice(10)  # positive nice value = lower priority
    except OSError:
        pass  # not permitted - run at normal priority
    while not self._stop_worker:
        # ... process requests
```

**Benefit:** The OS scheduler favors the game thread when both are runnable.
### 4. CPU Affinity (Advanced)

```python
import os

# Pin the LLM worker to CPU 0 only. With pid 0, sched_setaffinity
# applies to the calling thread, so run this inside the worker thread.
try:
    os.sched_setaffinity(0, {0})
except OSError:
    pass  # affinity not available - leave scheduling to the OS
```

**Benefit:** The game thread effectively gets exclusive access to CPU 1.
### 5. Reduce Token Generation

```python
max_tokens=128,  # Current for translations
max_tokens=64,   # Optimized - shorter responses

max_tokens=200,  # Current for AI analysis
max_tokens=150,  # Optimized - more concise
```

**Benefit:** Faster inference, less CPU time
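For reference, these budgets are applied per request. A hedged example with llama-cpp-python's chat API, where `worker_model` stands in for the loaded `Llama` instance and the prompts are placeholders:

```python
# Translation request: small output budget.
translation = worker_model.create_chat_completion(
    messages=[
        {"role": "system", "content": "Translate the player's order into a game command."},
        {"role": "user", "content": "send all tanks to the north bridge"},
    ],
    max_tokens=64,   # optimized translation budget
    temperature=0.2,
)

# Tactical analysis request: slightly larger budget.
analysis = worker_model.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the current battlefield situation."}],
    max_tokens=150,  # optimized analysis budget
)
```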
## Testing Strategy

### Test 1: Idle Baseline

```bash
# No LLM inference
✓ Game FPS: 20
✓ Mouse response: instant
```

### Test 2: During Translation

```bash
# User types NL command during inference
✓ Game FPS: should stay at 20
✓ Mouse clicks: should respond immediately
✓ Unit movement: should execute smoothly
```

### Test 3: During AI Analysis

```bash
# Game requests tactical analysis
✓ Game FPS: should stay at 20
✓ User input: should respond immediately
✓ Combat: should continue smoothly
```

### Test 4: Concurrent

```bash
# Translation + analysis at the same time
✓ Game FPS: should stay at 18-20 (slight drop acceptable)
✓ Critical: mouse/keyboard must keep working!
```
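For a rough, game-independent version of these tests, the sketch below runs a 20 FPS-style loop on the main thread while a CPU-bound background task stands in for inference, and reports the worst frame overshoot. All names are illustrative; hashing a large buffer is used as the stand-in because it releases the GIL, roughly like llama.cpp's native compute loop, but it is only a proxy for real inference.

```python
import hashlib
import threading
import time

def fake_inference(seconds: float = 10.0) -> None:
    """CPU-bound stand-in for an LLM call; sha256 on large buffers releases the GIL."""
    data = b"x" * (1 << 20)
    end = time.perf_counter() + seconds
    while time.perf_counter() < end:
        hashlib.sha256(data).digest()

def worst_frame_overshoot_ms(frame_time: float = 0.05, frames: int = 200) -> float:
    """Run a fixed-rate loop (20 FPS by default) and return the worst overshoot in ms."""
    worst = 0.0
    for _ in range(frames):
        start = time.perf_counter()
        time.sleep(frame_time)  # placeholder for one game tick
        overshoot = (time.perf_counter() - start - frame_time) * 1000
        worst = max(worst, overshoot)
    return worst

if __name__ == "__main__":
    threading.Thread(target=fake_inference, daemon=True).start()
    print(f"Worst frame overshoot under load: {worst_frame_overshoot_ms():.1f} ms")
```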
## Expected Improvements

### Before Fix

```
During LLM inference (n_threads_batch unset, potentially 2+ threads):
├─ LLM uses both vCPUs
├─ Game thread starved
├─ Mouse clicks delayed/lost
└─ Units don't respond to orders ❌
```

### After Fix

```
During LLM inference (n_threads=1, n_threads_batch=1):
├─ LLM uses only 1 vCPU
├─ Game has 1 dedicated vCPU
├─ Mouse clicks instant
└─ Units respond immediately ✅
```
## Monitoring

**Add CPU usage logging:**

```python
import time

import psutil

def _process_requests(self):
    while not self._stop_worker:
        # System-wide CPU usage sampled just before inference
        cpu_before = psutil.cpu_percent(interval=0.1)

        # Process the request
        start = time.time()
        response = self.model.create_chat_completion(...)
        elapsed = time.time() - start

        # ... and sampled again just after
        cpu_after = psutil.cpu_percent(interval=0.1)
        print(f"LLM: {elapsed:.1f}s, CPU: {cpu_before:.0f}% -> {cpu_after:.0f}%")
```
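Because `psutil.cpu_percent()` aggregates both vCPUs, a per-core breakdown makes it easier to confirm that inference really stays on one core; `percpu=True` is the standard psutil option for that:

```python
import psutil

# Per-core utilisation sampled over half a second. With the fix in place,
# one core should sit near 100% during inference while the other stays mostly free.
per_core = psutil.cpu_percent(interval=0.5, percpu=True)
print("per-core CPU%:", per_core)  # e.g. [97.0, 15.0] on a healthy 2 vCPU setup
```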
## Recommendations

### Immediate (Done)

- [x] Set `n_threads=1`
- [x] Set `n_threads_batch=1`

### High Priority

- [ ] Reduce `n_ctx` to 2048
- [ ] Increase `n_batch` to 256
- [ ] Reduce `max_tokens` (64 for translation, 150 for analysis)

### Medium Priority

- [ ] Add CPU monitoring logs
- [ ] Test on different command types
- [ ] Benchmark inference times

### Low Priority (Only if still laggy)

- [ ] Set thread priority with `os.nice()`
- [ ] Set CPU affinity with `sched_setaffinity()`
- [ ] Consider an even smaller model (0.5B variant)
## Performance Targets

| Metric | Target | Acceptable | Critical |
|--------|--------|------------|----------|
| **Game FPS** | 20 | 18-20 | < 15 |
| **Mouse latency** | < 50ms | < 100ms | > 200ms |
| **LLM inference** | 10-15s | < 20s | > 30s |
| **Translation time** | 5-10s | < 15s | > 20s |
| **Analysis time** | 10-15s | < 20s | > 30s |
## If Still Laggy

**Option 1: Smaller Model**

- Switch to Qwen2.5-0.5B (even faster)
- Trades quality for speed

**Option 2: Larger Batch**

```python
n_batch=512  # process more prompt tokens per batch
```

**Option 3: Limit Concurrent Requests**

```python
# Don't allow translation + analysis simultaneously
if self._current_request_id is not None:
    return "Please wait for the current inference to complete"
```

**Option 4: CPU Pinning**

```python
# Pin the LLM worker to CPU 0 only. Call this from inside the worker
# thread: with pid 0, sched_setaffinity applies to the calling thread,
# whereas passing os.getpid() would pin the main (game) thread instead.
os.sched_setaffinity(0, {0})
```

**Option 5: Reduce Model Precision**

```python
# Use Q2_K instead of Q4_0: smaller, faster, slightly lower quality
model = "qwen2.5-coder-1.5b-instruct-q2_k.gguf"
```
## Summary

**Problem:** LLM inference could use more than one thread because `n_threads_batch` was unset

**Solution:** Explicitly set both `n_threads=1` and `n_threads_batch=1`

**Result:** LLM uses only 1 vCPU; the game gets a dedicated vCPU

**Expected:** Smooth mouse/unit controls during inference

---

**Commit:** Added `n_threads_batch=1` parameter

**Status:** Testing required to confirm improvement

**Next:** Monitor game responsiveness during inference