# LLM Performance Fix - Non-Blocking Architecture

## Problem

The game was **laggy and losing instructions** during LLM inference because:

1. **Blocking LLM calls**: When a user sent an NL command, the model took 15+ seconds to respond
2. **Blocked game loop**: During this time, other commands could be lost or delayed
3. **Fallback spawned new processes**: When the timeout hit, the system spawned a new LLM process (even slower!)
4. **No request management**: Old requests accumulated in memory

**Log evidence:**

```
⚠️ Shared model failed: Request timeout after 15.0s, falling back to process isolation
llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
```

Multiple commands were sent, but some were lost or severely delayed.
## Solution

Implemented a **fully asynchronous, non-blocking LLM architecture**:

### 1. Async Model Manager (`model_manager.py`)

**New classes:**
- `RequestStatus` enum: PENDING, PROCESSING, COMPLETED, FAILED, CANCELLED
- `AsyncRequest` dataclass: tracks individual requests with status and timestamps

**New methods** (sketched below):
- `submit_async()`: Submit a request; returns immediately with a request_id
- `get_result()`: Poll for a result without blocking
- `cancel_request()`: Cancel pending requests
- `cleanup_old_requests()`: Remove completed requests older than max_age
- `get_queue_status()`: Monitor the queue for debugging

**Key changes:**
- The worker thread now updates `AsyncRequest` objects directly
- No more blocking queues for results
- Requests are tracked in a `_requests` dict together with their status
- Prints timing info: `✅ LLM request completed in X.XXs`
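
For reference, a minimal sketch of how these pieces could fit together. This illustrates the pattern, not the actual implementation: the field names, the id format, and the locking details are assumptions.

```python
# Sketch only: mirrors the API described above; internals are assumptions.
import itertools
import threading
import time
from dataclasses import dataclass, field
from enum import Enum
from queue import Queue
from typing import Any, Dict, Optional

class RequestStatus(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"

@dataclass
class AsyncRequest:
    request_id: str
    prompt: str
    status: RequestStatus = RequestStatus.PENDING
    result: Optional[Any] = None
    error: Optional[str] = None
    created_at: float = field(default_factory=time.time)

class ModelManager:
    def __init__(self) -> None:
        self._queue: Queue = Queue()
        self._requests: Dict[str, AsyncRequest] = {}
        self._requests_lock = threading.Lock()
        self._next_id = itertools.count(1000)  # id format mirrors the log lines

    def submit_async(self, prompt: str) -> str:
        """Enqueue a request and return its id immediately; never blocks."""
        request_id = f"req_{int(time.time())}_{next(self._next_id)}"
        request = AsyncRequest(request_id=request_id, prompt=prompt)
        with self._requests_lock:
            self._requests[request_id] = request
        self._queue.put(request)
        return request_id

    def get_result(self, request_id: str) -> Optional[AsyncRequest]:
        """Non-blocking poll: return the tracked request, or None if unknown."""
        with self._requests_lock:
            return self._requests.get(request_id)

    def cancel_request(self, request_id: str) -> bool:
        """Mark a still-pending request as cancelled; the worker skips it."""
        with self._requests_lock:
            request = self._requests.get(request_id)
            if request and request.status is RequestStatus.PENDING:
                request.status = RequestStatus.CANCELLED
                return True
        return False

    def cleanup_old_requests(self, max_age: float = 300.0) -> int:
        """Drop finished requests older than max_age seconds."""
        done = (RequestStatus.COMPLETED, RequestStatus.FAILED,
                RequestStatus.CANCELLED)
        now = time.time()
        with self._requests_lock:
            stale = [rid for rid, r in self._requests.items()
                     if r.status in done and now - r.created_at > max_age]
            for rid in stale:
                del self._requests[rid]
        return len(stale)
```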
### 2. Async NL Translator (`nl_translator_async.py`)

**New file** with a completely non-blocking API.

**Core methods** (usage sketched below):
- `submit_translation()`: Submit an NL command; returns a request_id immediately
- `check_translation()`: Poll for the result; returns `{ready, status, result/error}`
- `translate_blocking()`: Backward-compatible wrapper with a short timeout (5s instead of 10s)

**Key features:**
- Never blocks for more than 5 seconds
- Returns a timeout error if the LLM is busy (the game continues!)
- Auto-cleanup of old requests
- Same language detection and examples as the original

**Compatibility:**
- Keeps the legacy `translate()` and `translate_command()` methods
- Keeps `get_example_commands()` for the UI
- Drop-in replacement for the old `nl_translator.py`
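
A hedged usage sketch of the submit/poll cycle; `AsyncNLTranslator`, `execute_command()`, and the exact keys of the status dict are assumptions based on the description above:

```python
# Hypothetical usage of the async translator API (names assumed, see above).
translator = AsyncNLTranslator(model_manager)

request_id = translator.submit_translation("move tanks north")  # returns instantly

# Later, e.g. once per game tick:
status = translator.check_translation(request_id)
if status["ready"]:
    if status["status"] == "completed":
        execute_command(status["result"])   # game-side handler (assumed)
    else:
        print("Translation failed:", status.get("error"))
# Not ready yet? Do nothing: the game loop keeps running at 20 FPS.
```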
### 3. Game Loop Integration (`app.py`)

**Changes:**
- Import from `nl_translator_async` instead of `nl_translator`
- Added periodic cleanup every 30 seconds (600 ticks at 20 ticks/s):

```python
# Clean up old LLM requests every 30 seconds (600 ticks at 20 ticks/s)
if self.game_state.tick % 600 == 0:
    model.cleanup_old_requests(max_age=300.0)       # model requests: keep 5 min
    translator.cleanup_old_requests(max_age=60.0)   # translator requests: keep 1 min
```
## Performance Improvements

### Before:
- LLM inference: **15+ seconds, blocking**
- Game loop: **FROZEN during inference**
- Commands: **LOST if sent during the freeze**
- Fallback: **spawned a new process** (30+ seconds on top)

### After:
- LLM inference: **still ~15s**, but **NON-BLOCKING**
- Game loop: **CONTINUES at 20 FPS** during inference
- Commands: **QUEUED and processed** when the LLM becomes available
- Fallback: **NO process spawning**, just a timeout message
- Cleanup: **automatic**, every 30 seconds

### User Experience:

**Before:**
```
User: "move tanks north"
[15 second freeze]
User: "attack base"
[Lost - not processed]
User: "build infantry"
[Lost - not processed]
[Finally tanks move after 15s]
```

**After:**
```
User: "move tanks north"
[Immediate "Processing..." feedback]
User: "attack base"
[Queued]
User: "build infantry"
[Queued]
[Tanks move after 15s when LLM finishes]
[Attack executes after 30s]
[Build executes after 45s]
```
## Technical Details

### Request Flow:
1. User sends an NL command via `/api/nl/translate`
2. `translator.translate()` calls `submit_translation()`
3. The request is immediately submitted to the model_manager queue
4. The request ID is returned; the translator polls with a 5s timeout (sketched below)
5. If the LLM is not done within 5s, a timeout is returned (the game continues)
6. If it completed, the result is returned and the command executes
7. Old requests are auto-cleaned every 30s
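
Steps 4-5 amount to a bounded poll. A sketch, written as a free function for brevity (in the real file this would be the `translate_blocking()` method; the 0.1s poll interval is an illustrative choice):

```python
import time

def translate_blocking(translator, text: str, timeout: float = 5.0):
    """Submit and poll, but never block for longer than `timeout` seconds."""
    request_id = translator.submit_translation(text)
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = translator.check_translation(request_id)
        if status["ready"]:
            return status  # completed or failed
        time.sleep(0.1)    # short poll interval keeps CPU usage low
    # LLM still busy: report a timeout but leave the request queued
    return {"ready": False, "status": "timeout", "request_id": request_id}
```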
### Memory Management:
- Completed requests are kept for 5 minutes (for debugging)
- Translator requests are kept for 1 minute
- Auto-cleanup prevents memory leaks
- Status monitoring via `get_queue_status()` (see below)
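
One plausible shape for `get_queue_status()`, continuing the `ModelManager` sketch above; the keys match the example output in the Testing section, but the exact schema is an assumption:

```python
def get_queue_status(self) -> dict:
    """Summarize the request table for debugging (counts per status)."""
    with self._requests_lock:
        counts = {status.name.lower(): 0 for status in RequestStatus}
        for request in self._requests.values():
            counts[request.status.name.lower()] += 1
    counts["queue_size"] = self._queue.qsize()
    return counts
```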
### Thread Safety:
- All request access is protected by `_requests_lock`
- The worker thread processes only one request at a time
- No race conditions on status updates
- No deadlocks (no nested locks; see the worker-loop sketch below)
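
A sketch of the worker loop these guarantees imply, continuing the `ModelManager` sketch; `_run_inference()` is a hypothetical helper. The lock is held only around status updates, never across the inference call, which is what rules out nested locks:

```python
def _worker_loop(self) -> None:
    """Single worker thread: one request at a time, lock held only briefly."""
    while True:
        request = self._queue.get()  # blocks until a request is queued
        with self._requests_lock:
            if request.status is RequestStatus.CANCELLED:
                continue  # skip requests cancelled while waiting in the queue
            request.status = RequestStatus.PROCESSING
        try:
            result = self._run_inference(request.prompt)  # LLM call, no lock held
            with self._requests_lock:
                request.result = result
                request.status = RequestStatus.COMPLETED
        except Exception as exc:
            with self._requests_lock:
                request.error = str(exc)
                request.status = RequestStatus.FAILED
```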
## Testing

To verify the fix works:

1. **Check the logs** for async messages:
```
📤 LLM request submitted: req_1234567890_1234
✅ LLM request completed in 14.23s
🧹 Cleaned up 3 old LLM requests
```
2. **Monitor the game loop**:
```
⏱️ Game tick: 100 (loop running)
[User sends command]
⏱️ Game tick: 200 (loop running) <- should NOT freeze!
⏱️ Game tick: 300 (loop running)
```
3. **Send rapid commands**:
   - Type 3-4 commands quickly
   - All should be queued (none lost)
   - They execute sequentially as the LLM finishes each one
4. **Check the queue status** (add a debug endpoint if needed):
```python
status = model.get_queue_status()
# {'queue_size': 2, 'pending': 1, 'processing': 1, ...}
```
## Rollback

If issues occur, revert:

```bash
cd /home/luigi/rts/web
git diff model_manager.py > llm_fix.patch
git checkout HEAD -- model_manager.py
# ...and change the app.py import back to nl_translator
```
## Future Optimizations

1. **Reduce max_tokens further**: 128 → 64 for faster responses
2. **Reduce n_ctx**: 4096 → 2048 to lower memory use
3. **Add request priority**: game commands > NL translation > AI analysis
4. **Batch similar requests**: multiple "move" commands → a single LLM call
5. **Cache common commands**: "build infantry" → skip the LLM and reuse cached JSON (see the sketch below)
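
As an illustration of item 5, a minimal exact-match cache in front of the translator; the command JSON schema here is invented for the example:

```python
# Hypothetical command cache: exact-match lookup before hitting the LLM.
_COMMAND_CACHE = {
    "build infantry": {"action": "build", "unit": "infantry"},  # schema assumed
}

def translate_with_cache(translator, text):
    key = text.strip().lower()
    if key in _COMMAND_CACHE:
        # Instant hit: no LLM round-trip at all.
        return {"ready": True, "status": "completed",
                "result": _COMMAND_CACHE[key]}
    return translator.translate_blocking(text)  # fall back to the async path
```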
## Commit Message

```
perf: Non-blocking LLM architecture to prevent game lag

- Implemented async request submission/polling in model_manager
- Created AsyncRequest tracking with status enum
- Added nl_translator_async with instant response
- Added automatic cleanup every 30s (prevents memory leak)
- Reduced timeout: 15s→5s for NL translation
- Game loop now continues smoothly during LLM inference

BEFORE: 15s freeze, lost commands, unresponsive
AFTER: Smooth 20 FPS, all commands queued, no blocking

Fixes lag and lost instructions reported in production
```
---

**Status**: ✅ Ready to test
**Risk**: Low (backward-compatible API, graceful fallback)
**Performance impact**: Large improvement in responsiveness; the game loop no longer blocks during inference