# LLM Thread Management on a 2 vCPU System

## Problem Discovered

**Symptom:**

During LLM inference, game units respond sluggishly to mouse orders: controls lag or drop input even though the async request system is in place.

**Root Cause:**

llama-cpp-python has **two thread parameters**:

1. `n_threads` - threads for prompt processing
2. `n_threads_batch` - threads for batch/token generation (**falls back to a library default if not set!**)

**Previous Config:**

```python
Llama(
    n_threads=1,          # set to 1
    # n_threads_batch=?   # NOT SET -> library default applies
)
```

**BUT** - when `n_threads_batch` is not explicitly set, llama.cpp uses an internal default that can be higher than 1 on a multi-core host.
## Solution

**Explicitly set BOTH parameters to 1:**

```python
Llama(
    n_threads=1,        # prompt processing: 1 thread
    n_threads_batch=1,  # token generation: 1 thread (CRITICAL!)
    n_batch=128,        # batch size
)
```

**CPU Allocation:**

- **vCPU 0**: LLM inference (1 thread total)
- **vCPU 1**: game loop, websockets, async I/O

This ensures the game always has one full vCPU available.
## HuggingFace Spaces Constraints

**Available Resources:**

- **2 vCPUs** (shared, not dedicated)
- **16GB RAM**
- **No GPU** (CPU-only inference)
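It is worth sanity-checking these limits from inside the running Space, since shared hosts sometimes advertise more logical CPUs than the container can actually use. A minimal check (`psutil` is assumed available, as in the monitoring section below):

```python
import os

import psutil

# Logical CPUs visible to the OS vs. the CPUs this process may actually be scheduled on.
print("os.cpu_count():  ", os.cpu_count())
print("usable CPUs:     ", len(os.sched_getaffinity(0)))

# Total memory visible to the container, in GiB.
print("total RAM (GiB): ", psutil.virtual_memory().total / 2**30)
```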
**Challenges:**

1. **CPU-bound LLM**: Qwen2.5-Coder-1.5B takes 10-15s per inference
2. **Real-time game**: needs a consistent 20 FPS (50ms per frame)
3. **WebSocket server**: must respond to user input instantly (see the worker sketch after this list)
4. **Shared system**: other processes may compete for CPU
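Challenges 2 and 3 are what the async system addresses: all inference runs on a single background worker thread fed by a queue, so the asyncio game loop and WebSocket handlers never block. Below is a minimal sketch of that pattern; `_process_requests` and `_stop_worker` match names used elsewhere in this document, while the class and the remaining names are assumptions.

```python
import queue
import threading
from concurrent.futures import Future

class LLMWorker:
    """Single background thread that serializes all LLM calls (illustrative sketch)."""

    def __init__(self, model):
        self.model = model                  # loaded llama_cpp.Llama instance
        self._requests = queue.Queue()
        self._stop_worker = False
        self._thread = threading.Thread(target=self._process_requests, daemon=True)
        self._thread.start()

    def submit(self, messages, **kwargs) -> Future:
        """Called from the asyncio side; returns immediately with a Future."""
        fut = Future()
        self._requests.put((messages, kwargs, fut))
        return fut

    def _process_requests(self):
        while not self._stop_worker:
            try:
                messages, kwargs, fut = self._requests.get(timeout=0.5)
            except queue.Empty:
                continue
            try:
                fut.set_result(self.model.create_chat_completion(messages=messages, **kwargs))
            except Exception as exc:
                fut.set_exception(exc)

    def stop(self):
        self._stop_worker = True
        self._thread.join()
```

From the asyncio side, `await asyncio.wrap_future(worker.submit(messages, max_tokens=64))` waits for the result without blocking the frame loop.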
## Additional Optimizations

### 1. Reduce Context Window

```python
n_ctx=4096,  # Current - higher memory use, slower prompt processing
n_ctx=2048,  # Optimized - lower memory use, faster
```

**Benefit:** Faster prompt processing, less memory
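A smaller window only works if prompts actually fit. A quick guard, assuming `llm` is the loaded `llama_cpp.Llama` instance (the helper name is hypothetical):

```python
def fits_in_context(llm, prompt: str, n_ctx: int = 2048, max_tokens: int = 150) -> bool:
    """Return True if the prompt plus the generation budget fits in n_ctx."""
    n_prompt_tokens = len(llm.tokenize(prompt.encode("utf-8")))
    return n_prompt_tokens + max_tokens <= n_ctx
```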
### 2. Increase Batch Size

`n_batch` controls how many prompt tokens are evaluated per batch, so larger values mean fewer evaluation passes:

```python
n_batch=128,  # Current - more passes, more per-call overhead
n_batch=256,  # Optimized - fewer passes, faster prompt processing
```

**Benefit:** Faster prompt processing, less overhead
### 3. Set Thread Priority (OS Level)

```python
import os

# Lower the LLM worker thread's priority. On Linux the nice value is
# effectively per-thread, so calling os.nice() from inside the worker
# deprioritizes only that thread.
def _process_requests(self):
    try:
        os.nice(10)  # positive nice value = lower priority
    except OSError:
        pass  # not permitted - run at normal priority
    while not self._stop_worker:
        # ... process requests
```

**Benefit:** The OS scheduler favors the game thread when both are runnable.
### 4. CPU Affinity (Advanced)

```python
import os

# Pin the LLM worker to CPU 0 only. With pid 0, sched_setaffinity
# applies to the calling thread, so run this inside the worker thread.
try:
    os.sched_setaffinity(0, {0})
except OSError:
    pass  # affinity not available - leave scheduling to the OS
```

**Benefit:** The game thread effectively gets exclusive access to CPU 1.
### 5. Reduce Token Generation

```python
max_tokens=128,  # Current for translations
max_tokens=64,   # Optimized - shorter responses

max_tokens=200,  # Current for AI analysis
max_tokens=150,  # Optimized - more concise
```

**Benefit:** Faster inference, less CPU time
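For reference, these budgets are applied per request. A hedged example with llama-cpp-python's chat API, where `worker_model` stands in for the loaded `Llama` instance and the prompts are placeholders:

```python
# Translation request: small output budget.
translation = worker_model.create_chat_completion(
    messages=[
        {"role": "system", "content": "Translate the player's order into a game command."},
        {"role": "user", "content": "send all tanks to the north bridge"},
    ],
    max_tokens=64,   # optimized translation budget
    temperature=0.2,
)

# Tactical analysis request: slightly larger budget.
analysis = worker_model.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the current battlefield situation."}],
    max_tokens=150,  # optimized analysis budget
)
```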
## Testing Strategy

### Test 1: Idle Baseline

```bash
# No LLM inference
✓ Game FPS: 20
✓ Mouse response: instant
```

### Test 2: During Translation

```bash
# User types NL command during inference
✓ Game FPS: should stay at 20
✓ Mouse clicks: should respond immediately
✓ Unit movement: should execute smoothly
```

### Test 3: During AI Analysis

```bash
# Game requests tactical analysis
✓ Game FPS: should stay at 20
✓ User input: should respond immediately
✓ Combat: should continue smoothly
```

### Test 4: Concurrent

```bash
# Translation + analysis at the same time
✓ Game FPS: should stay at 18-20 (slight drop acceptable)
✓ Critical: mouse/keyboard must keep working!
```
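For a rough, game-independent version of these tests, the sketch below runs a 20 FPS-style loop on the main thread while a CPU-bound background task stands in for inference, and reports the worst frame overshoot. All names are illustrative; hashing a large buffer is used as the stand-in because it releases the GIL, roughly like llama.cpp's native compute loop, but it is only a proxy for real inference.

```python
import hashlib
import threading
import time

def fake_inference(seconds: float = 10.0) -> None:
    """CPU-bound stand-in for an LLM call; sha256 on large buffers releases the GIL."""
    data = b"x" * (1 << 20)
    end = time.perf_counter() + seconds
    while time.perf_counter() < end:
        hashlib.sha256(data).digest()

def worst_frame_overshoot_ms(frame_time: float = 0.05, frames: int = 200) -> float:
    """Run a fixed-rate loop (20 FPS by default) and return the worst overshoot in ms."""
    worst = 0.0
    for _ in range(frames):
        start = time.perf_counter()
        time.sleep(frame_time)  # placeholder for one game tick
        overshoot = (time.perf_counter() - start - frame_time) * 1000
        worst = max(worst, overshoot)
    return worst

if __name__ == "__main__":
    threading.Thread(target=fake_inference, daemon=True).start()
    print(f"Worst frame overshoot under load: {worst_frame_overshoot_ms():.1f} ms")
```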
## Expected Improvements

### Before Fix

```
During LLM inference (n_threads_batch unset, potentially 2+ threads):
├─ LLM uses both vCPUs
├─ Game thread starved
├─ Mouse clicks delayed/lost
└─ Units don't respond to orders ❌
```

### After Fix

```
During LLM inference (n_threads=1, n_threads_batch=1):
├─ LLM uses only 1 vCPU
├─ Game has 1 dedicated vCPU
├─ Mouse clicks instant
└─ Units respond immediately ✅
```
## Monitoring

**Add CPU usage logging:**

```python
import time

import psutil

def _process_requests(self):
    while not self._stop_worker:
        # System-wide CPU usage sampled just before inference
        cpu_before = psutil.cpu_percent(interval=0.1)

        # Process the request
        start = time.time()
        response = self.model.create_chat_completion(...)
        elapsed = time.time() - start

        # ... and sampled again just after
        cpu_after = psutil.cpu_percent(interval=0.1)
        print(f"LLM: {elapsed:.1f}s, CPU: {cpu_before:.0f}% -> {cpu_after:.0f}%")
```
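Because `psutil.cpu_percent()` aggregates both vCPUs, a per-core breakdown makes it easier to confirm that inference really stays on one core; `percpu=True` is the standard psutil option for that:

```python
import psutil

# Per-core utilisation sampled over half a second. With the fix in place,
# one core should sit near 100% during inference while the other stays mostly free.
per_core = psutil.cpu_percent(interval=0.5, percpu=True)
print("per-core CPU%:", per_core)  # e.g. [97.0, 15.0] on a healthy 2 vCPU setup
```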
## Recommendations

### Immediate (Done)

- [x] Set `n_threads=1`
- [x] Set `n_threads_batch=1`

### High Priority

- [ ] Reduce `n_ctx` to 2048
- [ ] Increase `n_batch` to 256
- [ ] Reduce `max_tokens` (64 for translation, 150 for analysis)

### Medium Priority

- [ ] Add CPU monitoring logs
- [ ] Test on different command types
- [ ] Benchmark inference times

### Low Priority (Only if still laggy)

- [ ] Set thread priority with `os.nice()`
- [ ] Set CPU affinity with `sched_setaffinity()`
- [ ] Consider an even smaller model (0.5B variant)
## Performance Targets

| Metric | Target | Acceptable | Critical |
|--------|--------|------------|----------|
| **Game FPS** | 20 | 18-20 | < 15 |
| **Mouse latency** | < 50ms | < 100ms | > 200ms |
| **LLM inference** | 10-15s | < 20s | > 30s |
| **Translation time** | 5-10s | < 15s | > 20s |
| **Analysis time** | 10-15s | < 20s | > 30s |
## If Still Laggy

**Option 1: Smaller Model**

- Switch to Qwen2.5-0.5B (even faster)
- Trades quality for speed

**Option 2: Larger Batch**

```python
n_batch=512  # process more prompt tokens per batch
```

**Option 3: Limit Concurrent Requests**

```python
# Don't allow translation + analysis simultaneously
if self._current_request_id is not None:
    return "Please wait for the current inference to complete"
```

**Option 4: CPU Pinning**

```python
# Pin the LLM worker to CPU 0 only. Call this from inside the worker
# thread: with pid 0, sched_setaffinity applies to the calling thread,
# whereas passing os.getpid() would pin the main (game) thread instead.
os.sched_setaffinity(0, {0})
```

**Option 5: Reduce Model Precision**

```python
# Use Q2_K instead of Q4_0: smaller, faster, slightly lower quality
model = "qwen2.5-coder-1.5b-instruct-q2_k.gguf"
```
## Summary

**Problem:** LLM inference could use more than one thread because `n_threads_batch` was unset

**Solution:** Explicitly set both `n_threads=1` and `n_threads_batch=1`

**Result:** LLM uses only 1 vCPU; the game gets a dedicated vCPU

**Expected:** Smooth mouse/unit controls during inference

---

**Commit:** Added `n_threads_batch=1` parameter

**Status:** Testing required to confirm improvement

**Next:** Monitor game responsiveness during inference