# ๐ŸŽฎ LLM Thread Management on 2 vCPU System

## ๐Ÿ› Problem Discovered

**Symptom:**
During LLM inference, game units respond sluggishly (or not at all) to mouse orders - controls lag even though the async inference system is already in place.

**Root Cause:**
llama-cpp-python exposes **TWO thread parameters**:
1. `n_threads` - Threads used for token generation
2. `n_threads_batch` - Threads used for batch (prompt) processing

**Previous Config:**
```python
Llama(
    n_threads=1,          # ✅ Set to 1
    # n_threads_batch=?    # ❌ NOT SET → falls back to the library default
)
```

**BUT** - when `n_threads_batch` is not explicitly set, llama-cpp-python falls back to its own default, which can be higher than 1 and lets inference spill onto the second vCPU.

## ๐Ÿ”ง Solution

**Explicitly set BOTH parameters to 1:**
```python
Llama(
    n_threads=1,          # Token generation: 1 thread
    n_threads_batch=1,    # Batch/prompt processing: 1 thread (CRITICAL!)
    n_batch=128,          # Batch size (tokens processed per eval)
)
```

**CPU Allocation:**
- **vCPU 0**: LLM inference (1 thread total)
- **vCPU 1**: Game loop, websockets, async I/O

This ensures the game always has one full vCPU available! 🎯
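
For reference, a self-contained initialization sketch with both thread parameters pinned explicitly (the model path is illustrative, and `n_ctx=2048` anticipates the optimization below):
```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/qwen2.5-coder-1.5b-instruct-q4_0.gguf",  # illustrative path
    n_ctx=2048,            # context window (see "Reduce Context Window" below)
    n_threads=1,           # token generation: 1 thread
    n_threads_batch=1,     # batch/prompt processing: 1 thread
    n_batch=128,           # tokens processed per batch
)
```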

## ๐Ÿ“Š HuggingFace Spaces Constraints

**Available Resources:**
- **2 vCPUs** (shared, not dedicated)
- **16GB RAM** 
- **No GPU** (CPU-only inference)

**Challenges:**
1. **CPU-bound LLM**: Qwen2.5-Coder-1.5B takes 10-15s per inference
2. **Real-time game**: Needs consistent 20 FPS (50ms per frame)
3. **WebSocket server**: Needs to respond to user input instantly
4. **Shared system**: Other processes may use CPU
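
All of this only works because inference runs on a dedicated worker thread fed by a request queue, so the game loop and websocket handlers never block on llama.cpp. A minimal sketch of that pattern (the class and method names mirror the `_process_requests` / `_stop_worker` snippets later in this document and are illustrative):
```python
import queue
import threading

class LLMWorker:
    """Runs LLM inference on a background thread so the game loop never blocks."""

    def __init__(self, model):
        self.model = model
        self._requests = queue.Queue()
        self._stop_worker = False
        self._thread = threading.Thread(target=self._process_requests, daemon=True)
        self._thread.start()

    def submit(self, messages, callback):
        # Called from the game/websocket side; returns immediately
        self._requests.put((messages, callback))

    def _process_requests(self):
        while not self._stop_worker:
            try:
                messages, callback = self._requests.get(timeout=0.5)
            except queue.Empty:
                continue
            # The blocking llama.cpp call happens here, on the worker thread only
            response = self.model.create_chat_completion(messages=messages)
            callback(response)
```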

## ๐ŸŽ›๏ธ Additional Optimizations

### 1. Reduce Context Window
```python
n_ctx=4096,  # Current - high memory, slower
n_ctx=2048,  # Optimized - lower memory, faster โœ…
```
**Benefit:** Faster prompt processing, less memory

### 2. Increase Batch Size
```python
n_batch=128,   # Current - more frequent updates
n_batch=256,   # Optimized - fewer updates, faster overall โœ…
```
**Benefit:** Faster generation, less overhead

### 3. Set Thread Priority (OS Level)
```python
import os

# Lower the LLM worker thread's priority so the OS scheduler favors the game
def _process_requests(self):
    # Raise the nice value (10-19 = lower priority); raising it needs no privileges
    try:
        os.nice(10)
    except OSError:
        pass

    while not self._stop_worker:
        # ... process requests
```
**Benefit:** OS scheduler favors game thread

### 4. CPU Affinity (Advanced)
```python
import os

# Pin the LLM worker to CPU 0 only - call this from inside the worker thread,
# since with pid=0 the affinity applies to the calling thread (Linux-only API)
try:
    os.sched_setaffinity(0, {0})
except (AttributeError, OSError):
    pass
```
**Benefit:** Game thread has exclusive access to CPU 1

### 5. Reduce Token Generation
```python
max_tokens=128,  # Current for translations
max_tokens=64,   # Optimized - shorter responses โœ…

max_tokens=200,  # Current for AI analysis
max_tokens=150,  # Optimized - more concise โœ…
```
**Benefit:** Faster inference, less CPU time

## ๐Ÿงช Testing Strategy

### Test 1: Idle Baseline
```bash
# No LLM inference
โ†’ Game FPS: 20 โœ…
โ†’ Mouse response: Instant โœ…
```

### Test 2: During Translation
```bash
# User types NL command during inference
โ†’ Game FPS: Should stay 20 โœ…
โ†’ Mouse clicks: Should respond immediately โœ…
โ†’ Unit movement: Should execute smoothly โœ…
```

### Test 3: During AI Analysis
```bash
# Game requests tactical analysis
โ†’ Game FPS: Should stay 20 โœ…
โ†’ User input: Should respond immediately โœ…
โ†’ Combat: Should continue smoothly โœ…
```

### Test 4: Concurrent
```bash
# Translation + Analysis at same time
โ†’ Game FPS: Should stay 18-20 (slight drop ok) โœ…
โ†’ Critical: Mouse/keyboard should work! โœ…
```
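
To make these checks measurable rather than eyeballed, the game loop can log any frame that blows the 50 ms budget while an inference is in flight. A hedged asyncio sketch (the loop and function names are illustrative):
```python
import asyncio
import time

TARGET_FRAME_S = 1 / 20  # 20 FPS → 50 ms per frame

async def game_loop(step):
    while True:
        start = time.perf_counter()
        step()  # advance game state, apply queued mouse orders
        elapsed = time.perf_counter() - start
        if elapsed > TARGET_FRAME_S:
            print(f"⚠️ Slow frame: {elapsed * 1000:.0f} ms")
        # Sleep off the remainder of the frame budget
        await asyncio.sleep(max(0.0, TARGET_FRAME_S - elapsed))
```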

## ๐Ÿ“ˆ Expected Improvements

### Before Fix
```
During LLM Inference (n_threads_batch unset, potentially 2+):
โ”œโ”€ LLM uses both vCPUs
โ”œโ”€ Game thread starved
โ”œโ”€ Mouse clicks delayed/lost
โ””โ”€ Units don't respond to orders โŒ
```

### After Fix
```
During LLM Inference (n_threads=1, n_threads_batch=1):
โ”œโ”€ LLM uses only 1 vCPU
โ”œโ”€ Game has 1 dedicated vCPU
โ”œโ”€ Mouse clicks instant
โ””โ”€ Units respond immediately โœ…
```

## ๐Ÿ” Monitoring

**Add CPU usage logging:**
```python
import psutil
import time

def _process_requests(self):
    while not self._stop_worker:
        # Monitor CPU before inference
        cpu_before = psutil.cpu_percent(interval=0.1)
        
        # Process request
        start = time.time()
        response = self.model.create_chat_completion(...)
        elapsed = time.time() - start
        
        # Monitor CPU after
        cpu_after = psutil.cpu_percent(interval=0.1)
        
        print(f"โš™๏ธ LLM: {elapsed:.1f}s, CPU: {cpu_before:.0f}%โ†’{cpu_after:.0f}%")
```

## ๐ŸŽฏ Recommendations

### Immediate (Done โœ…)
- [x] Set `n_threads=1`
- [x] Set `n_threads_batch=1`

### High Priority
- [ ] Reduce `n_ctx` to 2048
- [ ] Increase `n_batch` to 256
- [ ] Reduce `max_tokens` (64 for translation, 150 for analysis)

### Medium Priority
- [ ] Add CPU monitoring logs
- [ ] Test on different command types
- [ ] Benchmark inference times

### Low Priority (Only if still laggy)
- [ ] Set thread priority with `os.nice()`
- [ ] CPU affinity with `sched_setaffinity()`
- [ ] Consider even smaller model (0.5B variant)

## ๐Ÿ“Š Performance Targets

| Metric | Target | Acceptable | Critical |
|--------|--------|------------|----------|
| **Game FPS** | 20 | 18-20 | < 15 โŒ |
| **Mouse latency** | < 50ms | < 100ms | > 200ms โŒ |
| **LLM inference** | 10-15s | < 20s | > 30s โŒ |
| **Translation time** | 5-10s | < 15s | > 20s โŒ |
| **Analysis time** | 10-15s | < 20s | > 30s โŒ |

## ๐Ÿšจ If Still Laggy

**Option 1: Smaller Model**
- Switch to Qwen2.5-0.5B (even faster)
- Trade quality for speed

**Option 2: Longer Batch**
```python
n_batch=512  # Process more at once
```

**Option 3: Limit Concurrent Requests**
```python
# Don't allow translation + analysis simultaneously
if self._current_request_id is not None:
    return "Please wait for current inference to complete"
```

**Option 4: CPU Pinning**
```python
# Force the LLM to CPU 0 only - run inside the LLM worker thread (see §4 above)
# so the pin applies to that thread rather than the rest of the process
os.sched_setaffinity(0, {0})
```

**Option 5: Reduce Model Precision**
```python
# Use Q2_K instead of Q4_0
# Smaller, faster, slightly lower quality
model = "qwen2.5-coder-1.5b-instruct-q2_k.gguf"
```

## ๐Ÿ“ Summary

**Problem:** With `n_threads_batch` unset, LLM inference could use more than one thread
**Solution:** Explicitly set both `n_threads=1` and `n_threads_batch=1`
**Result:** The LLM uses at most 1 vCPU; the game keeps a dedicated vCPU
**Expected:** Smooth mouse/unit controls during inference! ๐ŸŽฎ

---

**Commit:** Added `n_threads_batch=1` parameter
**Status:** Testing required to confirm improvement
**Next:** Monitor game responsiveness during inference