# 🎮 LLM Thread Management on a 2 vCPU System

## 🐛 Problem Discovered

**Symptom:** During LLM inference, game units respond poorly to mouse orders - controls lag or go unresponsive even though the async system is in place.

**Root Cause:** llama-cpp-python has **TWO thread parameters**:

1. `n_threads` - Threads used for token generation
2. `n_threads_batch` - Threads used for batch/prompt processing (**falls back to the library's own default - typically all available cores - if not set!**)

**Previous Config:**

```python
Llama(
    n_threads=1,         # ✅ Set to 1
    # n_threads_batch=?  # ❌ NOT SET → falls back to the library default (can be > 1)
)
```

**BUT** - because `n_threads_batch` was never set explicitly, llama-cpp-python applied its own internal default, which can be higher than `n_threads` and spill onto the second vCPU during prompt processing.

## 🔧 Solution

**Explicitly set BOTH parameters to 1:**

```python
Llama(
    n_threads=1,        # Token generation: 1 thread
    n_threads_batch=1,  # Prompt/batch processing: 1 thread (CRITICAL!)
    n_batch=128,        # Batch size
)
```

**CPU Allocation:**

- **vCPU 0**: LLM inference (1 thread total)
- **vCPU 1**: Game loop, websockets, async I/O

This keeps one full vCPU available to the game at all times! 🎯

## 📊 HuggingFace Spaces Constraints

**Available Resources:**

- **2 vCPUs** (shared, not dedicated)
- **16GB RAM**
- **No GPU** (CPU-only inference)

**Challenges:**

1. **CPU-bound LLM**: Qwen2.5-Coder-1.5B takes 10-15s per inference
2. **Real-time game**: Needs a consistent 20 FPS (50ms per frame)
3. **WebSocket server**: Must respond to user input instantly
4. **Shared system**: Other processes may use the CPU

## 🎛️ Additional Optimizations

### 1. Reduce Context Window

```python
n_ctx=4096,  # Current - higher memory use, slower
n_ctx=2048,  # Optimized - lower memory use, faster ✅
```

**Benefit:** Faster prompt processing, less memory

### 2. Increase Batch Size

```python
n_batch=128,  # Current - smaller batches, more per-batch overhead
n_batch=256,  # Optimized - fewer, larger batches, faster overall ✅
```

**Benefit:** Faster prompt processing, less per-batch overhead

### 3. Set Thread Priority (OS Level)

```python
import os

# Lower the LLM worker thread's priority so the scheduler favors the game
def _process_requests(self):
    # Set low priority (nice value 10-19)
    try:
        os.nice(10)  # Lower priority for this worker
    except OSError:
        pass
    while not self._stop_worker:
        ...  # process queued requests
```

**Benefit:** OS scheduler favors the game thread

### 4. CPU Affinity (Advanced)

```python
import os

# Pin the LLM worker thread to CPU 0 only
# (call from inside the worker thread; on Linux, pid 0 means the calling thread)
try:
    os.sched_setaffinity(0, {0})  # Use only CPU 0
except (AttributeError, OSError):
    pass  # not available on all platforms
```

**Benefit:** The game thread gets effectively exclusive access to CPU 1

### 5. Reduce Token Generation

```python
max_tokens=128,  # Current for translations
max_tokens=64,   # Optimized - shorter responses ✅

max_tokens=200,  # Current for AI analysis
max_tokens=150,  # Optimized - more concise ✅
```

**Benefit:** Faster inference, less CPU time

## 🧪 Testing Strategy

### Test 1: Idle Baseline

```bash
# No LLM inference
→ Game FPS: 20 ✅
→ Mouse response: Instant ✅
```

### Test 2: During Translation

```bash
# User types an NL command during inference
→ Game FPS: Should stay at 20 ✅
→ Mouse clicks: Should respond immediately ✅
→ Unit movement: Should execute smoothly ✅
```

### Test 3: During AI Analysis

```bash
# Game requests a tactical analysis
→ Game FPS: Should stay at 20 ✅
→ User input: Should respond immediately ✅
→ Combat: Should continue smoothly ✅
```

### Test 4: Concurrent

```bash
# Translation + analysis at the same time
→ Game FPS: Should stay at 18-20 (slight drop OK) ✅
→ Critical: Mouse/keyboard should keep working! ✅
```
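All four tests exercise the single-worker queue that the snippets in this document (`_process_requests`, `_stop_worker`, `_current_request_id`) assume. For reference, a minimal sketch of that pattern - the class name, `submit`, and the callback signature are illustrative placeholders, not the project's actual code:

```python
import queue
import threading


class LLMWorker:
    """Minimal sketch of the single-worker queue assumed in this document.

    `_process_requests`, `_stop_worker` and `_current_request_id` match the
    other snippets here; `submit` and the callback signature are illustrative.
    """

    def __init__(self, model):
        self.model = model                 # llama_cpp.Llama instance
        self._requests = queue.Queue()     # holds (request_id, messages, callback)
        self._stop_worker = False
        self._current_request_id = None
        self._worker = threading.Thread(target=self._process_requests, daemon=True)
        self._worker.start()

    def submit(self, request_id, messages, callback):
        """Called from the game/websocket side; returns immediately."""
        self._requests.put((request_id, messages, callback))

    def _process_requests(self):
        # Single background thread: with n_threads=1 / n_threads_batch=1,
        # at most one vCPU is busy with inference at any time.
        while not self._stop_worker:
            try:
                request_id, messages, callback = self._requests.get(timeout=0.5)
            except queue.Empty:
                continue
            self._current_request_id = request_id
            response = self.model.create_chat_completion(messages=messages, max_tokens=64)
            callback(request_id, response)
            self._current_request_id = None

    def stop(self):
        self._stop_worker = True
        self._worker.join()
```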
## 📈 Expected Improvements

### Before Fix

```
During LLM inference (n_threads_batch unset, potentially 2+ threads):
├─ LLM uses both vCPUs
├─ Game thread starved
├─ Mouse clicks delayed/lost
└─ Units don't respond to orders ❌
```

### After Fix

```
During LLM inference (n_threads=1, n_threads_batch=1):
├─ LLM uses only 1 vCPU
├─ Game has 1 dedicated vCPU
├─ Mouse clicks instant
└─ Units respond immediately ✅
```

## 🔍 Monitoring

**Add CPU usage logging:**

```python
import time

import psutil


def _process_requests(self):
    while not self._stop_worker:
        # Sample CPU load before inference
        cpu_before = psutil.cpu_percent(interval=0.1)

        # Process the request
        start = time.time()
        response = self.model.create_chat_completion(...)
        elapsed = time.time() - start

        # Sample CPU load after inference
        cpu_after = psutil.cpu_percent(interval=0.1)

        print(f"⚙️ LLM: {elapsed:.1f}s, CPU: {cpu_before:.0f}%→{cpu_after:.0f}%")
```

## 🎯 Recommendations

### Immediate (Done ✅)

- [x] Set `n_threads=1`
- [x] Set `n_threads_batch=1`

### High Priority

- [ ] Reduce `n_ctx` to 2048
- [ ] Increase `n_batch` to 256
- [ ] Reduce `max_tokens` (64 for translation, 150 for analysis)

### Medium Priority

- [ ] Add CPU monitoring logs
- [ ] Test on different command types
- [ ] Benchmark inference times

### Low Priority (Only if still laggy)

- [ ] Set thread priority with `os.nice()`
- [ ] Set CPU affinity with `sched_setaffinity()`
- [ ] Consider an even smaller model (0.5B variant)

## 📊 Performance Targets

| Metric | Target | Acceptable | Critical |
|--------|--------|------------|----------|
| **Game FPS** | 20 | 18-20 | < 15 ❌ |
| **Mouse latency** | < 50ms | < 100ms | > 200ms ❌ |
| **LLM inference** | 10-15s | < 20s | > 30s ❌ |
| **Translation time** | 5-10s | < 15s | > 20s ❌ |
| **Analysis time** | 10-15s | < 20s | > 30s ❌ |

## 🚨 If Still Laggy

**Option 1: Smaller Model**

- Switch to Qwen2.5-0.5B (even faster)
- Trades quality for speed

**Option 2: Larger Batch**

```python
n_batch=512  # Process more tokens per batch
```

**Option 3: Limit Concurrent Requests**

```python
# Don't allow translation + analysis simultaneously
if self._current_request_id is not None:
    return "Please wait for the current inference to complete"
```

**Option 4: CPU Pinning**

```python
import os

# Pin the LLM worker to CPU 0 only
# (call from inside the worker thread; pid 0 = calling thread on Linux)
os.sched_setaffinity(0, {0})
```

**Option 5: Reduce Model Precision**

```python
# Use Q2_K instead of Q4_0
# Smaller and faster, at lower output quality
model = "qwen2.5-coder-1.5b-instruct-q2_k.gguf"
```

## 📝 Summary

**Problem:** With `n_threads_batch` unset, the LLM could use more than one thread during prompt processing.

**Solution:** Explicitly set both `n_threads=1` and `n_threads_batch=1`.

**Result:** The LLM uses only 1 vCPU, leaving the game a dedicated vCPU.

**Expected:** Smooth mouse/unit controls during inference! 🎮

---

**Commit:** Added `n_threads_batch=1` parameter

**Status:** Testing required to confirm improvement

**Next:** Monitor game responsiveness during inference
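For quick reference, a consolidated constructor call with the immediate fix plus the high-priority recommendations applied - a sketch only; the model path and the example prompt are placeholders:

```python
from llama_cpp import Llama

# Sketch of the recommended settings combined (placeholder model path).
llm = Llama(
    model_path="models/qwen2.5-coder-1.5b-instruct-q4_0.gguf",  # placeholder path
    n_ctx=2048,          # reduced context window
    n_batch=256,         # larger prompt-processing batch
    n_threads=1,         # token generation: 1 thread
    n_threads_batch=1,   # prompt/batch processing: 1 thread
)

# Shorter completions per the max_tokens recommendation (example prompt only).
result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Move the scouts to the north bridge"}],
    max_tokens=64,
)
```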