# LLM Performance Fix - Non-Blocking Architecture

## Problem
The game was laggy and losing instructions during LLM inference because:
- Blocking LLM calls: when a user sent an NL command, the model took 15+ seconds to respond
- Blocked game loop: during that time, other commands could be lost or delayed
- Process-spawning fallback: when the timeout hit, the system spawned a new LLM process (even slower!)
- No request management: old requests accumulated in memory
Log evidence:

```
⚠️ Shared model failed: Request timeout after 15.0s, falling back to process isolation
llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
```

Multiple commands were sent, but some were lost or severely delayed.
## Solution

Implemented a fully asynchronous, non-blocking LLM architecture:

### 1. Async Model Manager (`model_manager.py`)
New classes:
- `RequestStatus` enum: `PENDING`, `PROCESSING`, `COMPLETED`, `FAILED`, `CANCELLED`
- `AsyncRequest` dataclass: tracks individual requests with status and timestamps
New methods:
- `submit_async()`: submit a request; returns immediately with a request_id
- `get_result()`: poll for a result without blocking
- `cancel_request()`: cancel a pending request
- `cleanup_old_requests()`: remove completed requests older than `max_age`
- `get_queue_status()`: monitor the queue for debugging
Key changes:
- Worker thread now updates `AsyncRequest` objects directly; no more blocking queues for results
- Requests are tracked in a `_requests` dict with their status
- Prints timing info: `✅ LLM request completed in X.XXs`
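
A minimal sketch of how these pieces could fit together. Only the names `RequestStatus`, `AsyncRequest`, `submit_async()`, `get_result()`, `_requests`, and `_requests_lock` come from this document; the field layout, request-id format, and everything else are assumptions:

```python
import itertools
import queue
import threading
import time
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Any, Optional


class RequestStatus(Enum):
    PENDING = auto()
    PROCESSING = auto()
    COMPLETED = auto()
    FAILED = auto()
    CANCELLED = auto()


@dataclass
class AsyncRequest:
    request_id: str
    prompt: str
    status: RequestStatus = RequestStatus.PENDING
    result: Optional[Any] = None
    error: Optional[str] = None
    submitted_at: float = field(default_factory=time.time)
    completed_at: Optional[float] = None


class ModelManager:
    def __init__(self) -> None:
        self._queue: "queue.Queue[AsyncRequest]" = queue.Queue()
        self._requests: dict[str, AsyncRequest] = {}
        self._requests_lock = threading.Lock()
        self._counter = itertools.count(1)  # id format is an assumption

    def submit_async(self, prompt: str) -> str:
        """Enqueue a request and return its id immediately (never blocks)."""
        request_id = f"req_{int(time.time())}_{next(self._counter):04d}"
        req = AsyncRequest(request_id=request_id, prompt=prompt)
        with self._requests_lock:
            self._requests[request_id] = req
        self._queue.put(req)
        return request_id

    def get_result(self, request_id: str) -> Optional[AsyncRequest]:
        """Non-blocking poll: return the tracked request, or None if unknown."""
        with self._requests_lock:
            return self._requests.get(request_id)
```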
### 2. Async NL Translator (`nl_translator_async.py`)

New file with a completely non-blocking API.
Core methods:
- `submit_translation()`: submit an NL command; returns a request_id immediately
- `check_translation()`: poll for the result; returns `{ready, status, result/error}`
- `translate_blocking()`: backward-compatible, with a short timeout (5s instead of 15s)
Key features:
- Never blocks for more than 5 seconds
- Returns a timeout error if the LLM is busy (the game continues!)
- Auto-cleanup of old requests
- Same language detection and examples as the original
Compatibility:
- Keeps the legacy `translate()` and `translate_command()` methods
- Keeps `get_example_commands()` for the UI
- Drop-in replacement for the old `nl_translator.py`
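
A usage sketch of this polling API from the caller's side. Only `submit_translation()`, `check_translation()`, and the `{ready, status, result/error}` shape come from this document; the class name, constructor, and `handle_command` helper are assumptions:

```python
from nl_translator_async import AsyncNLTranslator  # class name is an assumption

translator = AsyncNLTranslator(model)  # constructor signature assumed

# Submit returns a request id immediately; the game loop keeps running.
request_id = translator.submit_translation("move tanks north")

# Later (e.g. once per tick), poll without blocking.
status = translator.check_translation(request_id)
if status["ready"]:
    if status["status"] == "completed":
        handle_command(status["result"])   # game-side handler, hypothetical
    else:
        print("Translation failed:", status["error"])
```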
### 3. Game Loop Integration (`app.py`)

Changes:
- Import from `nl_translator_async` instead of `nl_translator`
- Added periodic cleanup every 30 seconds (600 ticks):

```python
# Cleanup old LLM requests every 30 seconds
if self.game_state.tick % 600 == 0:
    model.cleanup_old_requests(max_age=300.0)       # 5 min
    translator.cleanup_old_requests(max_age=60.0)   # 1 min
```
## Performance Improvements
Before:
- LLM inference: 15+ seconds blocking
- Game loop: FROZEN during inference
- Commands: LOST if sent during freeze
- Fallback: Spawned new process (30+ seconds additional)
After:
- LLM inference: Still ~15s but NON-BLOCKING
- Game loop: CONTINUES at 20 FPS during inference
- Commands: QUEUED and processed when LLM available
- Fallback: NO process spawning, just timeout message
- Cleanup: Automatic every 30 seconds
User Experience:

Before:

```
User: "move tanks north"
[15 second freeze]
User: "attack base"
[Lost - not processed]
User: "build infantry"
[Lost - not processed]
[Finally tanks move after 15s]
```

After:

```
User: "move tanks north"
[Immediate "Processing..." feedback]
User: "attack base"
[Queued]
User: "build infantry"
[Queued]
[Tanks move after 15s when LLM finishes]
[Attack executes after 30s]
[Build executes after 45s]
```
## Technical Details
Request Flow:
- User sends an NL command via `/api/nl/translate`
- `translator.translate()` calls `submit_translation()`
- The request is immediately submitted to the model_manager queue
- A request ID is returned; the translator polls with a 5s timeout
- If the LLM isn't done within 5s, a timeout is returned (the game continues)
- If completed, the result is returned and the command executes
- Old requests are auto-cleaned every 30s
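
To make the flow concrete, here is a hypothetical handler in the shape `app.py` might take. The web framework (Flask here), the payload shape, and the `translator` wiring are all assumptions, not the project's actual code:

```python
import time

from flask import Flask, jsonify, request

app = Flask(__name__)
# `translator` is the AsyncNLTranslator instance from the earlier sketch.

@app.route("/api/nl/translate", methods=["POST"])
def nl_translate():
    text = request.json["command"]  # request payload shape assumed
    request_id = translator.submit_translation(text)

    # Poll for up to 5 seconds, then give up so the game keeps running;
    # the request stays queued and finishes in the background.
    deadline = time.time() + 5.0
    while time.time() < deadline:
        status = translator.check_translation(request_id)
        if status["ready"]:
            return jsonify(status)
        time.sleep(0.1)

    return jsonify({"ready": False, "status": "timeout",
                    "error": "LLM busy, request still queued"})
```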
Memory Management:
- Completed requests are kept for 5 minutes (for debugging)
- Translator requests are kept for 1 minute
- Auto-cleanup prevents memory leaks
- Status monitoring via `get_queue_status()`
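
One plausible implementation of the retention policy above, reusing the field names from the earlier `AsyncRequest` sketch (everything beyond the method name and `max_age` parameter is an assumption):

```python
def cleanup_old_requests(self, max_age: float = 300.0) -> int:
    """Drop finished requests older than max_age seconds; return the count."""
    now = time.time()
    finished = (RequestStatus.COMPLETED, RequestStatus.FAILED,
                RequestStatus.CANCELLED)
    with self._requests_lock:
        stale = [rid for rid, req in self._requests.items()
                 if req.status in finished
                 and req.completed_at is not None
                 and now - req.completed_at > max_age]
        for rid in stale:
            del self._requests[rid]
    return len(stale)
```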
Thread Safety:
- All request access is protected by `_requests_lock`
- The worker thread processes only one request at a time
- No race conditions on status updates
- No deadlocks (no nested locks)
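
A sketch of the lock discipline this implies: the worker holds `_requests_lock` only for status flips, never across inference. The loop body and the `_run_inference` helper are assumptions:

```python
def _worker_loop(self) -> None:
    while True:
        req = self._queue.get()  # blocks the worker thread, not the game loop
        with self._requests_lock:
            if req.status is RequestStatus.CANCELLED:
                continue
            req.status = RequestStatus.PROCESSING
        try:
            # Inference runs WITHOUT the lock held, so pollers never wait on it.
            output = self._run_inference(req.prompt)  # hypothetical helper
            with self._requests_lock:
                req.result = output
                req.status = RequestStatus.COMPLETED
                req.completed_at = time.time()
        except Exception as exc:
            with self._requests_lock:
                req.error = str(exc)
                req.status = RequestStatus.FAILED
                req.completed_at = time.time()
```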
## Testing

To verify the fix works:

1. Check the logs for async messages:

```
📤 LLM request submitted: req_1234567890_1234
✅ LLM request completed in 14.23s
🧹 Cleaned up 3 old LLM requests
```

2. Monitor the game loop:

```
⏱️ Game tick: 100 (loop running)
[User sends command]
⏱️ Game tick: 200 (loop running)   <- Should NOT freeze!
⏱️ Game tick: 300 (loop running)
```

3. Send rapid commands:
   - Type 3-4 commands quickly
   - All should be queued (not lost)
   - They should execute sequentially as the LLM finishes each one

4. Check the queue status (add a debug endpoint if needed):

```python
status = model.get_queue_status()
# {'queue_size': 2, 'pending': 1, 'processing': 1, ...}
```
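
If you want that debug endpoint, a hypothetical route (assuming the Flask app from the earlier sketch):

```python
@app.route("/api/debug/llm-queue")
def llm_queue_status():
    # Expose queue metrics for quick inspection during testing.
    return jsonify(model.get_queue_status())
```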
## Rollback

If issues occur, revert:

```bash
cd /home/luigi/rts/web
git diff model_manager.py > llm_fix.patch
git checkout HEAD -- model_manager.py
# And change the app.py import back to nl_translator
```
## Future Optimizations

- Reduce `max_tokens` further: 128→64 for faster responses
- Reduce `n_ctx`: 4096→2048 for less memory
- Add request priority: game commands > NL translation > AI analysis
- Batch similar requests: multiple "move" commands → a single LLM call
- Cache common commands: "build infantry" → skip the LLM, use cached JSON (see the sketch below)
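
A hedged sketch of that last idea: an exact-match cache consulted before the LLM. The cache contents and the command JSON shape are invented for illustration:

```python
from typing import Optional

# Hypothetical cache of pre-translated commands; entries are illustrative.
COMMAND_CACHE: dict[str, dict] = {
    "build infantry": {"action": "build", "unit": "infantry"},
    "move tanks north": {"action": "move", "unit": "tank", "direction": "north"},
}

def translate_cached(text: str) -> Optional[dict]:
    """Return a cached translation, or None to fall through to the LLM."""
    return COMMAND_CACHE.get(text.strip().lower())
```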
## Commit Message

```
perf: Non-blocking LLM architecture to prevent game lag

- Implemented async request submission/polling in model_manager
- Created AsyncRequest tracking with status enum
- Added nl_translator_async with instant response
- Added automatic cleanup every 30s (prevents memory leak)
- Reduced timeout: 15s→5s for NL translation
- Game loop now continues smoothly during LLM inference

BEFORE: 15s freeze, lost commands, unresponsive
AFTER: Smooth 20 FPS, all commands queued, no blocking

Fixes lag and lost instructions reported in production
```
Status: ✅ Ready to test
Risk: Low (backward-compatible API, graceful fallback)
Performance impact: Massive improvement in responsiveness