# LLM Performance Fix - Non-Blocking Architecture

## Problem
The game was laggy and losing instructions during LLM inference because:
- Blocking LLM calls: when a user sent an NL command, the model took 15+ seconds to respond
- Blocked game loop: during that time, other commands could be lost or delayed
- Process-spawning fallback: when the timeout hit, the system spawned a new LLM process (even slower!)
- No request management: old requests accumulated in memory
Log evidence:

```
⚠️ Shared model failed: Request timeout after 15.0s, falling back to process isolation
llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
```

Multiple commands were sent, but some were lost or severely delayed.
## Solution

Implemented a fully asynchronous, non-blocking LLM architecture:

### 1. Async Model Manager (`model_manager.py`)
New classes:
- `RequestStatus` enum: `PENDING`, `PROCESSING`, `COMPLETED`, `FAILED`, `CANCELLED`
- `AsyncRequest` dataclass: tracks individual requests with status and timestamps
New methods:
- `submit_async()`: submit a request; returns immediately with a request_id
- `get_result()`: poll for a result without blocking
- `cancel_request()`: cancel a pending request
- `cleanup_old_requests()`: remove completed requests older than `max_age`
- `get_queue_status()`: monitor the queue for debugging
Key changes:
- Worker thread now updates `AsyncRequest` objects directly; no more blocking queues for results
- Requests are tracked in a `_requests` dict with their status
- Prints timing info: `✅ LLM request completed in X.XXs`
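
A minimal sketch of how these pieces could fit together. Only the names `RequestStatus`, `AsyncRequest`, `submit_async()`, `get_result()`, `_requests`, and `_requests_lock` come from this document; the field layout, request-id format, and everything else are assumptions:

```python
import itertools
import queue
import threading
import time
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Any, Optional


class RequestStatus(Enum):
    PENDING = auto()
    PROCESSING = auto()
    COMPLETED = auto()
    FAILED = auto()
    CANCELLED = auto()


@dataclass
class AsyncRequest:
    request_id: str
    prompt: str
    status: RequestStatus = RequestStatus.PENDING
    result: Optional[Any] = None
    error: Optional[str] = None
    submitted_at: float = field(default_factory=time.time)
    completed_at: Optional[float] = None


class ModelManager:
    def __init__(self) -> None:
        self._queue: "queue.Queue[AsyncRequest]" = queue.Queue()
        self._requests: dict[str, AsyncRequest] = {}
        self._requests_lock = threading.Lock()
        self._counter = itertools.count(1)  # id format is an assumption

    def submit_async(self, prompt: str) -> str:
        """Enqueue a request and return its id immediately (never blocks)."""
        request_id = f"req_{int(time.time())}_{next(self._counter):04d}"
        req = AsyncRequest(request_id=request_id, prompt=prompt)
        with self._requests_lock:
            self._requests[request_id] = req
        self._queue.put(req)
        return request_id

    def get_result(self, request_id: str) -> Optional[AsyncRequest]:
        """Non-blocking poll: return the tracked request, or None if unknown."""
        with self._requests_lock:
            return self._requests.get(request_id)
```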
### 2. Async NL Translator (`nl_translator_async.py`)

New file with a completely non-blocking API.
Core methods:
- `submit_translation()`: submit an NL command; returns a request_id immediately
- `check_translation()`: poll for the result; returns `{ready, status, result/error}`
- `translate_blocking()`: backward-compatible, with a short timeout (5s instead of 15s)
Key features:
- Never blocks for more than 5 seconds
- Returns a timeout error if the LLM is busy (the game continues!)
- Auto-cleanup of old requests
- Same language detection and examples as the original
Compatibility:
- Keeps the legacy `translate()` and `translate_command()` methods
- Keeps `get_example_commands()` for the UI
- Drop-in replacement for the old `nl_translator.py`
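
A usage sketch of this polling API from the caller's side. Only `submit_translation()`, `check_translation()`, and the `{ready, status, result/error}` shape come from this document; the class name, constructor, and `handle_command` helper are assumptions:

```python
from nl_translator_async import AsyncNLTranslator  # class name is an assumption

translator = AsyncNLTranslator(model)  # constructor signature assumed

# Submit returns a request id immediately; the game loop keeps running.
request_id = translator.submit_translation("move tanks north")

# Later (e.g. once per tick), poll without blocking.
status = translator.check_translation(request_id)
if status["ready"]:
    if status["status"] == "completed":
        handle_command(status["result"])   # game-side handler, hypothetical
    else:
        print("Translation failed:", status["error"])
```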
### 3. Game Loop Integration (`app.py`)

Changes:
- Import from `nl_translator_async` instead of `nl_translator`
- Added periodic cleanup every 30 seconds (600 ticks):

```python
# Cleanup old LLM requests every 30 seconds
if self.game_state.tick % 600 == 0:
    model.cleanup_old_requests(max_age=300.0)       # 5 min
    translator.cleanup_old_requests(max_age=60.0)   # 1 min
```
## Performance Improvements
Before:
- LLM inference: 15+ seconds blocking
- Game loop: FROZEN during inference
- Commands: LOST if sent during freeze
- Fallback: Spawned new process (30+ seconds additional)
After:
- LLM inference: Still ~15s but NON-BLOCKING
- Game loop: CONTINUES at 20 FPS during inference
- Commands: QUEUED and processed when LLM available
- Fallback: NO process spawning, just timeout message
- Cleanup: Automatic every 30 seconds
User Experience:

Before:

```
User: "move tanks north"
[15 second freeze]
User: "attack base"
[Lost - not processed]
User: "build infantry"
[Lost - not processed]
[Finally tanks move after 15s]
```

After:

```
User: "move tanks north"
[Immediate "Processing..." feedback]
User: "attack base"
[Queued]
User: "build infantry"
[Queued]
[Tanks move after 15s when LLM finishes]
[Attack executes after 30s]
[Build executes after 45s]
```
## Technical Details
Request Flow:
- User sends an NL command via `/api/nl/translate`
- `translator.translate()` calls `submit_translation()`
- The request is immediately submitted to the model_manager queue
- A request ID is returned; the translator polls with a 5s timeout
- If the LLM isn't done within 5s, a timeout is returned (the game continues)
- If completed, the result is returned and the command executes
- Old requests are auto-cleaned every 30s
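
To make the flow concrete, here is a hypothetical handler in the shape `app.py` might take. The web framework (Flask here), the payload shape, and the `translator` wiring are all assumptions, not the project's actual code:

```python
import time

from flask import Flask, jsonify, request

app = Flask(__name__)
# `translator` is the AsyncNLTranslator instance from the earlier sketch.

@app.route("/api/nl/translate", methods=["POST"])
def nl_translate():
    text = request.json["command"]  # request payload shape assumed
    request_id = translator.submit_translation(text)

    # Poll for up to 5 seconds, then give up so the game keeps running;
    # the request stays queued and finishes in the background.
    deadline = time.time() + 5.0
    while time.time() < deadline:
        status = translator.check_translation(request_id)
        if status["ready"]:
            return jsonify(status)
        time.sleep(0.1)

    return jsonify({"ready": False, "status": "timeout",
                    "error": "LLM busy, request still queued"})
```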
Memory Management:
- Completed requests are kept for 5 minutes (for debugging)
- Translator requests are kept for 1 minute
- Auto-cleanup prevents memory leaks
- Status monitoring via `get_queue_status()`
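
One plausible implementation of the retention policy above, reusing the field names from the earlier `AsyncRequest` sketch (everything beyond the method name and `max_age` parameter is an assumption):

```python
def cleanup_old_requests(self, max_age: float = 300.0) -> int:
    """Drop finished requests older than max_age seconds; return the count."""
    now = time.time()
    finished = (RequestStatus.COMPLETED, RequestStatus.FAILED,
                RequestStatus.CANCELLED)
    with self._requests_lock:
        stale = [rid for rid, req in self._requests.items()
                 if req.status in finished
                 and req.completed_at is not None
                 and now - req.completed_at > max_age]
        for rid in stale:
            del self._requests[rid]
    return len(stale)
```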
Thread Safety:
- All request access is protected by `_requests_lock`
- The worker thread processes only one request at a time
- No race conditions on status updates
- No deadlocks (no nested locks)
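
A sketch of the lock discipline this implies: the worker holds `_requests_lock` only for status flips, never across inference. The loop body and the `_run_inference` helper are assumptions:

```python
def _worker_loop(self) -> None:
    while True:
        req = self._queue.get()  # blocks the worker thread, not the game loop
        with self._requests_lock:
            if req.status is RequestStatus.CANCELLED:
                continue
            req.status = RequestStatus.PROCESSING
        try:
            # Inference runs WITHOUT the lock held, so pollers never wait on it.
            output = self._run_inference(req.prompt)  # hypothetical helper
            with self._requests_lock:
                req.result = output
                req.status = RequestStatus.COMPLETED
                req.completed_at = time.time()
        except Exception as exc:
            with self._requests_lock:
                req.error = str(exc)
                req.status = RequestStatus.FAILED
                req.completed_at = time.time()
```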
## Testing

To verify the fix works:

1. Check the logs for async messages:

```
📤 LLM request submitted: req_1234567890_1234
✅ LLM request completed in 14.23s
🧹 Cleaned up 3 old LLM requests
```

2. Monitor the game loop:

```
⏱️ Game tick: 100 (loop running)
[User sends command]
⏱️ Game tick: 200 (loop running)   <- Should NOT freeze!
⏱️ Game tick: 300 (loop running)
```

3. Send rapid commands:
   - Type 3-4 commands quickly
   - All should be queued (not lost)
   - They should execute sequentially as the LLM finishes each one

4. Check the queue status (add a debug endpoint if needed):

```python
status = model.get_queue_status()
# {'queue_size': 2, 'pending': 1, 'processing': 1, ...}
```
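
If you want that debug endpoint, a hypothetical route (assuming the Flask app from the earlier sketch):

```python
@app.route("/api/debug/llm-queue")
def llm_queue_status():
    # Expose queue metrics for quick inspection during testing.
    return jsonify(model.get_queue_status())
```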
## Rollback

If issues occur, revert:

```bash
cd /home/luigi/rts/web
git diff model_manager.py > llm_fix.patch
git checkout HEAD -- model_manager.py
# And change the app.py import back to nl_translator
```
## Future Optimizations

- Reduce `max_tokens` further: 128→64 for faster responses
- Reduce `n_ctx`: 4096→2048 for less memory
- Add request priority: game commands > NL translation > AI analysis
- Batch similar requests: multiple "move" commands → a single LLM call
- Cache common commands: "build infantry" → skip the LLM, use cached JSON (see the sketch below)
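
A hedged sketch of that last idea: an exact-match cache consulted before the LLM. The cache contents and the command JSON shape are invented for illustration:

```python
from typing import Optional

# Hypothetical cache of pre-translated commands; entries are illustrative.
COMMAND_CACHE: dict[str, dict] = {
    "build infantry": {"action": "build", "unit": "infantry"},
    "move tanks north": {"action": "move", "unit": "tank", "direction": "north"},
}

def translate_cached(text: str) -> Optional[dict]:
    """Return a cached translation, or None to fall through to the LLM."""
    return COMMAND_CACHE.get(text.strip().lower())
```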
## Commit Message

```
perf: Non-blocking LLM architecture to prevent game lag

- Implemented async request submission/polling in model_manager
- Created AsyncRequest tracking with status enum
- Added nl_translator_async with instant response
- Added automatic cleanup every 30s (prevents memory leak)
- Reduced timeout: 15s→5s for NL translation
- Game loop now continues smoothly during LLM inference

BEFORE: 15s freeze, lost commands, unresponsive
AFTER: Smooth 20 FPS, all commands queued, no blocking

Fixes lag and lost instructions reported in production
```
Status: ✅ Ready to test
Risk: Low (backward-compatible API, graceful fallback)
Performance impact: Massive improvement in responsiveness