KillerKing93 committed on
Commit 7cd14d8 · verified · 1 Parent(s): 8c57892

Sync from GitHub 8f6d598

Files changed (13)
  1. .env.example +34 -0
  2. .gitignore +31 -0
  3. ARCHITECTURE.md +192 -0
  4. CLAUDE.md +259 -0
  5. Dockerfile +73 -0
  6. HISTORY.md +3 -0
  7. LICENSE +50 -0
  8. README.md +358 -12
  9. RULES.md +207 -0
  10. TODO.md +14 -0
  11. main.py +1108 -0
  12. requirements.txt +30 -0
  13. tests/test_api.py +274 -0
.env.example ADDED
@@ -0,0 +1,34 @@
+ # Server
+ PORT=3000
+
+ # Model from Hugging Face (Transformers)
+ MODEL_REPO_ID=Qwen/Qwen3-VL-2B-Thinking
+ # HF token for gated/private models (optional)
+ HF_TOKEN=
+
+ # Inference parameters
+ MAX_TOKENS=4096
+ TEMPERATURE=0.7
+
+ # Multimedia processing
+ MAX_VIDEO_FRAMES=16
+
+ # Transformers loading hints
+ DEVICE_MAP=auto
+ TORCH_DTYPE=auto
+ # Persistent SSE session store (SQLite)
+ # Enable to persist streaming chunks per session_id and allow resume after server restarts.
+ # 1=true, 0=false
+ PERSIST_SESSIONS=1
+ SESSIONS_DB_PATH=sessions.db
+ # TTL for sessions (seconds). Finished sessions older than TTL are garbage collected.
+ SESSIONS_TTL_SECONDS=600
+ # Auto compression and context reporting
+ # Enable automatic prompt compression if context would overflow. Drops oldest non-system messages.
+ ENABLE_AUTO_COMPRESSION=1
+ # Force a max context window for budgeting; 0 = use model/tokenizer defaults
+ CONTEXT_MAX_TOKENS_AUTO=0
+ # Safety margin kept free for generation and special tokens
+ CONTEXT_SAFETY_MARGIN=256
+ # Compression strategy: truncate (default). summarize reserved for future use.
+ COMPRESSION_STRATEGY=truncate
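For illustration, the truncate strategy described by the compression variables above could look like the following sketch. This is an assumption about the approach, not the code in main.py, and `count_tokens` is a hypothetical stand-in for whatever token counting the server uses:

```python
# Illustrative sketch of the "truncate" compression strategy: drop the oldest
# non-system messages until the prompt fits (max context minus safety margin).
# `count_tokens` is a hypothetical callable, not a function from main.py.
from typing import Callable, Dict, List

def compress_messages(
    messages: List[Dict],
    count_tokens: Callable[[List[Dict]], int],
    max_context: int,
    safety_margin: int = 256,
) -> List[Dict]:
    budget = max_context - safety_margin
    compressed = list(messages)
    while count_tokens(compressed) > budget:
        # Find the oldest message that is not a system prompt and drop it.
        idx = next((i for i, m in enumerate(compressed) if m.get("role") != "system"), None)
        if idx is None or len(compressed) <= 1:
            break  # nothing left that can be dropped safely
        compressed.pop(idx)
    return compressed
```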
.gitignore ADDED
@@ -0,0 +1,31 @@
+ # Node (legacy)
+ node_modules
+
+ .env
+
+ # Python
+ .venv/
+ venv/
+ __pycache__/
+ *.py[cod]
+ *.pyo
+ *.pyd
+
+ # Tool caches
+ .cache/
+ .mypy_cache/
+ .pyright/
+ .pytest_cache/
+ .ipynb_checkpoints/
+
+ # Editors / OS
+ .DS_Store
+ Thumbs.db
+ .idea/
+ .vscode/
+
+ # Data / models (large artifacts)
+ models/
+ data/
+ uploads/
+ tmp/
ARCHITECTURE.md ADDED
@@ -0,0 +1,192 @@
+ # Architecture (Python FastAPI + Transformers)
+
+ This document describes the Python-based, OpenAI-compatible inference server for Qwen3-VL, replacing the previous Node.js/llama.cpp stack.
+
+ Key source files
+ - Server entry: [main.py](main.py)
+ - Inference engine: [Python.class Engine](main.py:231)
+ - Multimodal parsing: [Python.function build_mm_messages](main.py:251), [Python.function load_image_from_any](main.py:108), [Python.function load_video_frames_from_any](main.py:150)
+ - Endpoints: Health [Python.app.get()](main.py:577), Chat Completions [Python.app.post()](main.py:591), Cancel [Python.app.post()](main.py:792)
+ - Streaming + resume: [Python.class _SSESession](main.py:435), [Python.class _SessionStore](main.py:449), [Python.class _SQLiteStore](main.py:482), [Python.function chat_completions](main.py:591)
+ - Local run (uvicorn): [Python.main()](main.py:807)
+ - Configuration template: [.env.example](.env.example)
+ - Dependencies: [requirements.txt](requirements.txt)
+
+ Model target (default)
+ - Hugging Face: Qwen/Qwen3-VL-2B-Thinking (Transformers, multimodal)
+ - Overridable via environment variable: MODEL_REPO_ID
+
+ ## Overview
+
+ The server exposes an OpenAI-compatible endpoint for chat completions that supports:
+ - Text-only prompts
+ - Images (URL or base64)
+ - Videos (URL or base64; frames sampled)
+
+ Two response modes are implemented:
+ - Non-streaming JSON
+ - Streaming via Server-Sent Events (SSE) with resumable delivery using Last-Event-ID. Resumability is achieved with an in‑memory ring buffer and optional SQLite persistence.
+
+ ## Components
+
+ 1) FastAPI application
+ - Instantiated in [Python.main module](main.py:541) and endpoints mounted at:
+ - Health: [Python.app.get()](main.py:577)
+ - Chat Completions (non-stream + SSE): [Python.app.post()](main.py:591)
+ - Manual cancel (custom): [Python.app.post()](main.py:792)
+ - CORS is enabled for simplicity.
+
+ 2) Inference Engine (Transformers)
+ - Class: [Python.class Engine](main.py:231)
+ - Loads:
+ - Processor: AutoProcessor(trust_remote_code=True)
+ - Model: AutoModelForCausalLM (device_map, dtype configurable via env)
+ - Core methods:
+ - Input building: [Python.function build_mm_messages](main.py:251)
+ - Text-only generate: [Python.function infer](main.py:326)
+ - Streaming generate (iterator): [Python.function infer_stream](main.py:375)
+
+ 3) Multimodal preprocessing
+ - Images:
+ - URL (http/https), data URL, base64, or local path
+ - Loader: [Python.function load_image_from_any](main.py:108)
+ - Videos:
+ - URL (downloaded to temp), base64 to temp file, or local path
+ - Frame extraction via imageio.v3 (preferred) or OpenCV fallback
+ - Uniform sampling up to MAX_VIDEO_FRAMES
+ - Loader: [Python.function load_video_frames_from_any](main.py:150)
+
+ 4) SSE streaming with resume
+ - Session objects:
+ - [Python.class _SSESession](main.py:435): ring buffer, condition variable, producer thread reference, cancellation event, listener count, and disconnect timer
+ - [Python.class _SessionStore](main.py:449): in-memory map with TTL + GC
+ - Optional persistence: [Python.class _SQLiteStore](main.py:482) for replaying chunks across restarts
+ - SSE id format: "session_id:index"
+ - Resume:
+ - Client sends Last-Event-ID header (or query ?last_event_id=...) and the same session_id in the body
+ - Server replays cached/persisted chunks after the provided index, then continues live streaming
+ - Producer:
+ - Created on demand per session; runs generation in a daemon thread and pushes chunks into the ring buffer and SQLite (if enabled)
+ - See producer closure inside [Python.function chat_completions](main.py:591)
+ - Auto-cancel on disconnect:
+ - If all clients disconnect for CANCEL_AFTER_DISCONNECT_SECONDS (default 3600s), a timer signals cancellation via a stopping criteria in [Python.function infer_stream](main.py:375)
+
+ ## Request flow
+
+ Non-streaming (POST /v1/chat/completions)
+ 1. Validate input, load engine singleton via [Python.function get_engine](main.py:558)
+ 2. Convert OpenAI-style messages to Qwen chat template via [Python.function build_mm_messages](main.py:251) and apply_chat_template
+ 3. Preprocess images/videos into processor inputs
+ 4. Generate with [Python.function infer](main.py:326)
+ 5. Return OpenAI-compatible response (choices[0].message.content)
+
+ Streaming (POST /v1/chat/completions with "stream": true)
+ 1. Determine session_id:
+ - Use body.session_id if provided; otherwise generated server-side
+ 2. Parse Last-Event-ID (or query ?last_event_id) to get last delivered index
+ 3. Create/start or reuse producer thread for this session
+ 4. StreamingResponse generator:
+ - Replays persisted events (SQLite, if enabled) and in-memory buffer after last index
+ - Waits on condition variable for new tokens
+ - Emits "[DONE]" at the end or upon buffer completion
+ 5. Clients can reconnect and resume by sending Last-Event-ID: "session_id:index"
+ 6. If all clients disconnect, an auto-cancel timer can stop generation (configurable via env)
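For illustration, the SSE framing implied by the streaming flow above could be produced roughly like this. The payload field names mirror the OpenAI chunk shape documented elsewhere in this repo; the exact construction in main.py may differ:

```python
# Sketch of the SSE framing described above: each event carries an id "session_id:index"
# so a client can resume with Last-Event-ID. Not the actual code in main.py.
import json
import time

def sse_event(session_id: str, index: int, delta: dict, finish_reason=None) -> str:
    payload = {
        "id": session_id,
        "object": "chat.completion.chunk",
        "created": int(time.time()),
        "choices": [{"index": 0, "delta": delta, "finish_reason": finish_reason}],
    }
    return f"id: {session_id}:{index}\ndata: {json.dumps(payload)}\n\n"

def sse_done() -> str:
    # Emitted after the final chunk that carries finish_reason "stop".
    return "data: [DONE]\n\n"
```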
+
+ Manual cancel (POST /v1/cancel/{session_id})
+ - Custom operational shortcut to cancel an in-flight generation for a session id.
+ - This is not part of the legacy OpenAI Chat Completions spec (OpenAI’s newer Responses API defines cancel); it is provided for practical control.
+
+ ## Message and content mapping
+
+ Input format (OpenAI-like):
+ - "messages" list of role/content entries
+ - content can be:
+ - string (text)
+ - array of parts with "type":
+ - "text": { text: "..."}
+ - "image_url": { image_url: { url: "..." } } or { image_url: "..." }
+ - "input_image": { b64_json: "..." } or { image: "..." }
+ - "video_url": { video_url: { url: "..." } } or { video_url: "..." }
+ - "input_video": { b64_json: "..." } or { video: "..." }
+
+ Conversion:
+ - [Python.function build_mm_messages](main.py:251) constructs a multimodal content list per message:
+ - { type: "text", text: ... }
+ - { type: "image", image: PIL.Image }
+ - { type: "video", video: [PIL.Image frames] }
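As a concrete illustration of the mapping just described (this is an example of the data shapes, not the actual build_mm_messages code):

```python
# An OpenAI-style message with an image_url part...
openai_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
    ],
}

# ...is converted into a Qwen-style multimodal content list, roughly:
# {
#     "role": "user",
#     "content": [
#         {"type": "text", "text": "What is in this image?"},
#         {"type": "image", "image": <PIL.Image.Image>},  # loaded by load_image_from_any
#     ],
# }
```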
+
+ Template:
+ - Qwen apply_chat_template:
+ - See usage in [Python.function infer](main.py:326) and [Python.function infer_stream](main.py:375)
+
+ ## Configuration (.env)
+
+ See [.env.example](.env.example)
+ - PORT (default 3000)
+ - MODEL_REPO_ID (default "Qwen/Qwen3-VL-2B-Thinking")
+ - HF_TOKEN (optional)
+ - MAX_TOKENS (default 256)
+ - TEMPERATURE (default 0.7)
+ - MAX_VIDEO_FRAMES (default 16)
+ - DEVICE_MAP (default "auto")
+ - TORCH_DTYPE (default "auto")
+ - PERSIST_SESSIONS (default 0; set 1 to enable SQLite persistence)
+ - SESSIONS_DB_PATH (default sessions.db)
+ - SESSIONS_TTL_SECONDS (default 600)
+ - CANCEL_AFTER_DISCONNECT_SECONDS (default 3600; set 0 to disable)
+
+ ## Error handling and readiness
+
+ - Health endpoint: [Python.app.get()](main.py:577)
+ - Returns { ok, modelReady, modelId, error }
+ - Chat endpoint:
+ - 400 for invalid messages or multimodal parsing errors
+ - 503 when model failed to load
+ - 500 for unexpected generation errors
+ - During first request, the model is lazily loaded; subsequent requests reuse the singleton
+
+ ## Performance and scaling
+
+ - GPU recommended:
+ - Set DEVICE_MAP=auto and TORCH_DTYPE=bfloat16/float16 if supported
+ - Reduce MAX_VIDEO_FRAMES to speed up video processing
+ - For concurrency:
+ - FastAPI/Uvicorn workers and model sharing: typically 1 model per process
+ - For high throughput, prefer multiple processes or a queueing layer
+
+ ## Data and directories
+
+ - models/ contains downloaded model artifacts (implicitly created by Transformers cache); ignored by git
+ - tmp/ used transiently for video decoding (temporary files)
+
+ Ignored artifacts (see [.gitignore](.gitignore))
+ - Python: .venv/, __pycache__/, .cache/, etc.
+ - Large artifacts: models/, data/, uploads/, tmp/
+
+ ## Streaming resume details
+
+ - Session store:
+ - In-memory ring buffer for fast replay
+ - Optional SQLite persistence for robust replay across process restarts
+ - See GC in [Python.class _SessionStore](main.py:449) and [Python.method _SQLiteStore.gc](main.py:526)
+ - Limits:
+ - Ring buffer stores ~2048 SSE events per session by default
+ - If the buffer overflows before a client resumes and persistence is disabled, the earliest chunks may be unavailable
+ - End-of-stream:
+ - Final chunk contains finish_reason: "stop"
+ - "[DONE]" sentinel is emitted afterwards
+
+ ## Future enhancements
+
+ - Redis persistence:
+ - Add a Redis-backed store as a drop-in alongside SQLite
+ - Token accounting:
+ - Populate usage prompt/completion/total tokens when model exposes tokenization costs
+ - Logging/observability:
+ - Structured logs, request IDs, and metrics
+
+ ## Migration notes (from Node.js)
+
+ - All Node.js server files and scripts were removed (index.js, package*.json, scripts/)
+ - The server now targets Transformers models directly and supports multimodal inputs out of the box
+ - The API remains OpenAI-compatible on /v1/chat/completions with resumable SSE and optional SQLite persistence
CLAUDE.md ADDED
@@ -0,0 +1,259 @@
+ # CLAUDE Technical Log and Decisions (Python FastAPI + Transformers)
+ ## Progress Log — 2025-10-23 (Asia/Jakarta)
+
+ - Migrated stack from Node.js/llama.cpp to Python + FastAPI + Transformers
+ - New server: [main.py](main.py)
+ - Default model: Qwen/Qwen3-VL-2B-Thinking via Transformers with trust_remote_code
+ - Implemented endpoints
+ - Health: [Python.app.get()](main.py:577)
+ - OpenAI-compatible Chat Completions (non-stream + SSE): [Python.app.post()](main.py:591)
+ - Manual cancel (custom extension): [Python.app.post()](main.py:792)
+ - Multimodal support
+ - OpenAI-style messages mapped in [Python.function build_mm_messages](main.py:251)
+ - Image loader: [Python.function load_image_from_any](main.py:108)
+ - Video loader (frame sampling): [Python.function load_video_frames_from_any](main.py:150)
+ - Streaming + resume + persistence
+ - SSE with session_id + Last-Event-ID
+ - In-memory session ring buffer: [Python.class _SSESession](main.py:435), manager [Python.class _SessionStore](main.py:449)
+ - Optional SQLite persistence: [Python.class _SQLiteStore](main.py:482) with replay across restarts
+ - Cancellation
+ - Auto-cancel after all clients disconnect for CANCEL_AFTER_DISCONNECT_SECONDS, timer wiring in [Python.function chat_completions](main.py:733), cooperative stop in [Python.function infer_stream](main.py:375)
+ - Manual cancel API: [Python.function cancel_session](main.py:792)
+ - Configuration and dependencies
+ - Env template updated: [.env.example](.env.example) with MODEL_REPO_ID, PERSIST_SESSIONS, SESSIONS_DB_PATH, SESSIONS_TTL_SECONDS, CANCEL_AFTER_DISCONNECT_SECONDS, etc.
+ - Python deps: [requirements.txt](requirements.txt)
+ - Git ignores for Python + artifacts: [.gitignore](.gitignore)
+ - Documentation refreshed
+ - Operator docs: [README.md](README.md) including SSE resume, SQLite, cancel API
+ - Architecture: [ARCHITECTURE.md](ARCHITECTURE.md) aligned to Python flows
+ - Rules: [RULES.md](RULES.md) updated — Git usage is mandatory
+ - Legacy removal
+ - Deleted Node files and scripts (index.js, package*.json, scripts/) as requested
+
+ Suggested Git commit series (run in order)
+ - git add .
+ - git commit -m "feat(server): add FastAPI OpenAI-compatible /v1/chat/completions with Qwen3-VL [Python.main()](main.py:1)"
+ - git commit -m "feat(stream): SSE streaming with session_id resume and in-memory sessions [Python.function chat_completions()](main.py:591)"
+ - git commit -m "feat(persist): SQLite-backed replay for SSE sessions [Python.class _SQLiteStore](main.py:482)"
+ - git commit -m "feat(cancel): auto-cancel after disconnect and POST /v1/cancel/{session_id} [Python.function cancel_session](main.py:792)"
+ - git commit -m "docs: update README/ARCHITECTURE/RULES for Python stack and streaming resume"
+ - git push
+
+ Verification snapshot
+ - Non-stream text works via [Python.function infer](main.py:326)
+ - Streaming emits chunks and ends with [DONE]
+ - Resume works with Last-Event-ID; persists across restart when PERSIST_SESSIONS=1
+ - Manual cancel stops generation; auto-cancel triggers after disconnect threshold
+
+
+ This is the developer-facing changelog and design rationale for the Python migration. Operator docs live in [README.md](README.md); architecture details in [ARCHITECTURE.md](ARCHITECTURE.md); rules in [RULES.md](RULES.md); task tracking in [TODO.md](TODO.md).
+
+ Key source file references
+ - Server entry: [Python.main()](main.py:807)
+ - Health endpoint: [Python.app.get()](main.py:577)
+ - Chat Completions endpoint (non-stream + SSE): [Python.app.post()](main.py:591)
+ - Manual cancel endpoint (custom): [Python.app.post()](main.py:792)
+ - Engine (Transformers): [Python.class Engine](main.py:231)
+ - Multimodal mapping: [Python.function build_mm_messages](main.py:251)
+ - Image loader: [Python.function load_image_from_any](main.py:108)
+ - Video loader: [Python.function load_video_frames_from_any](main.py:150)
+ - Non-stream inference: [Python.function infer](main.py:326)
+ - Streaming inference + stopping criteria: [Python.function infer_stream](main.py:375)
+ - In-memory sessions: [Python.class _SSESession](main.py:435), [Python.class _SessionStore](main.py:449)
+ - SQLite persistence: [Python.class _SQLiteStore](main.py:482)
+
+ Summary of the migration
+ - Replaced the Node.js/llama.cpp stack with a Python FastAPI server that uses Hugging Face Transformers for Qwen3-VL multimodal inference.
+ - Exposes an OpenAI-compatible /v1/chat/completions endpoint (non-stream and streaming via SSE).
+ - Supports text, images, and videos:
+ - Messages can include array parts such as "text", "image_url" / "input_image" (base64), "video_url" / "input_video" (base64).
+ - Images are decoded to PIL in [Python.function load_image_from_any](main.py:108).
+ - Videos are read via imageio.v3 (preferred) or OpenCV, sampled to up to MAX_VIDEO_FRAMES in [Python.function load_video_frames_from_any](main.py:150).
+ - Streaming includes resumability with session_id + Last-Event-ID:
+ - In-memory ring buffer: [Python.class _SSESession](main.py:435)
+ - Optional SQLite persistence: [Python.class _SQLiteStore](main.py:482)
+ - Added a manual cancel endpoint (custom) and implemented auto-cancel after disconnect.
+
+ Why Python + Transformers?
+ - Qwen3-VL-2B-Thinking is published for Transformers and includes multimodal processors (preprocessor_config.json, video_preprocessor_config.json, chat_template.json). Python + Transformers is the first-class path.
+ - trust_remote_code=True allows the model repo to provide custom processing logic and templates, used in [Python.class Engine](main.py:231) via AutoProcessor/AutoModelForCausalLM.
+
+ Core design choices
+
+ 1) OpenAI compatibility
+ - Non-stream path returns choices[0].message.content from [Python.function infer](main.py:326).
+ - Streaming path (SSE) produces OpenAI-style "chat.completion.chunk" deltas, with id lines "session_id:index" for resume.
+ - We retained Chat Completions (legacy) rather than the newer Responses API for compatibility with existing SDKs. A custom cancel endpoint is provided to fill the gap.
+
+ 2) Multimodal input handling
+ - The API accepts "messages" with content either as a string or an array of parts typed as "text" / "image_url" / "input_image" / "video_url" / "input_video".
+ - Images: URLs (http/https or data URL), base64, or local path are supported by [Python.function load_image_from_any](main.py:108).
+ - Videos: URLs and base64 are materialized to a temp file; frames extracted and uniformly sampled by [Python.function load_video_frames_from_any](main.py:150).
+
+ 3) Engine and generation
+ - Qwen chat template applied via processor.apply_chat_template in both [Python.function infer](main.py:326) and [Python.function infer_stream](main.py:375).
+ - Generation sampling uses temperature; do_sample toggled when temperature > 0.
+ - Streams are produced using TextIteratorStreamer.
+ - Optional cooperative cancellation is implemented with a StoppingCriteria bound to a session cancel event in [Python.function infer_stream](main.py:375).
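A condensed sketch of the pattern described in this section — generation in a background thread, TextIteratorStreamer yielding deltas, and a StoppingCriteria polling a per-session cancel event. This is illustrative, not the exact code in main.py (e.g., `processor.tokenizer` and the argument wiring are assumptions):

```python
# Illustrative sketch of cooperative cancellation with TextIteratorStreamer.
import threading
from transformers import StoppingCriteria, StoppingCriteriaList, TextIteratorStreamer

class CancelledCriteria(StoppingCriteria):
    def __init__(self, cancel_event: threading.Event):
        self.cancel_event = cancel_event

    def __call__(self, input_ids, scores, **kwargs) -> bool:
        # Returning True asks generate() to stop cooperatively.
        return self.cancel_event.is_set()

def stream_generate(model, processor, inputs, cancel_event, max_new_tokens=4096):
    streamer = TextIteratorStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
    gen_kwargs = dict(
        **inputs,
        max_new_tokens=max_new_tokens,
        streamer=streamer,
        stopping_criteria=StoppingCriteriaList([CancelledCriteria(cancel_event)]),
    )
    threading.Thread(target=model.generate, kwargs=gen_kwargs, daemon=True).start()
    for text in streamer:  # yields decoded text deltas as they are produced
        yield text
```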
+
+ 4) Streaming, resume, and persistence
+ - In-memory buffer per session for immediate replay: [Python.class _SSESession](main.py:435).
+ - Optional SQLite persistence to survive restarts and handle long gaps: [Python.class _SQLiteStore](main.py:482).
+ - Resume protocol:
+ - Client provides session_id in the request body and Last-Event-ID header "session_id:index", or pass ?last_event_id=...
+ - Server replays events after index from SQLite (if enabled) and the in-memory buffer.
+ - Producer appends events to both the ring buffer and SQLite (when enabled).
+
+ 5) Cancellation and disconnects
+ - Manual cancel endpoint [Python.app.post()](main.py:792) sets the session cancel event and marks finished in SQLite.
+ - Auto-cancel after disconnect:
+ - If all clients disconnect, a timer fires after CANCEL_AFTER_DISCONNECT_SECONDS (default 3600) that sets the cancel event.
+ - The StoppingCriteria checks this event cooperatively and halts generation.
+
+ 6) Environment configuration
+ - See [.env.example](.env.example).
+ - Important variables:
+ - MODEL_REPO_ID (default "Qwen/Qwen3-VL-2B-Thinking")
+ - HF_TOKEN (optional)
+ - MAX_TOKENS, TEMPERATURE
+ - MAX_VIDEO_FRAMES (video frame sampling)
+ - DEVICE_MAP, TORCH_DTYPE (Transformers loading hints)
+ - PERSIST_SESSIONS, SESSIONS_DB_PATH, SESSIONS_TTL_SECONDS (SQLite)
+ - CANCEL_AFTER_DISCONNECT_SECONDS (auto-cancel threshold)
+
+ Security and privacy notes
+ - trust_remote_code=True executes code from the model repository when loading AutoProcessor/AutoModel. This is standard for many HF multimodal models but should be understood in terms of supply-chain risk.
+ - Do not log sensitive data. Avoid dumping raw request bodies or tokens.
+
+ Operational guidance
+
+ Running locally
+ - Install Python dependencies from [requirements.txt](requirements.txt) and install a suitable PyTorch wheel for your platform/CUDA.
+ - copy .env.example .env and adjust as needed.
+ - Start: python [Python.main()](main.py:807)
+
+ Testing endpoints
+ - Health: GET /health
+ - Chat (non-stream): POST /v1/chat/completions with messages array.
+ - Chat (stream): add "stream": true; optionally pass "session_id".
+ - Resume: send Last-Event-ID with "session_id:index".
+ - Cancel: POST /v1/cancel/{session_id}.
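For automated checks, a health probe could be written against the FastAPI app directly. The following is a minimal illustration using FastAPI's TestClient, not a copy of tests/test_api.py; the asserted field names follow the /health response documented in README.md:

```python
# Minimal illustrative test (not taken from tests/test_api.py).
from fastapi.testclient import TestClient
from main import app

def test_health_reports_model_state():
    client = TestClient(app)
    resp = client.get("/health")
    assert resp.status_code == 200
    body = resp.json()
    assert "ok" in body and "modelReady" in body and "modelId" in body
```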
+
+ Scaling notes
+ - Typically deploy one model per process. For throughput, run multiple workers behind a load balancer; sessions are process-local unless persistence is used.
+ - SQLite persistence supports replay but does not synchronize cancel/producer state across processes. A Redis-based store (future work) can coordinate multi-process session state more robustly.
+
+ Known limitations and follow-ups
+ - Token accounting (usage prompt/completion/total) is stubbed at zeros. Populate if/when needed.
+ - Redis store not yet implemented (design leaves a clear seam via _SQLiteStore analog).
+ - No structured logging/tracing yet; follow-up for observability.
+ - Cancellation is best-effort cooperative; it relies on the stopping criteria hook in generation.
+
+ Changelog (2025-10-23)
+ - feat(server): Python FastAPI server with Qwen3-VL (Transformers), OpenAI-compatible /v1/chat/completions.
+ - feat(stream): SSE streaming with session_id + Last-Event-ID resumability.
+ - feat(persist): Optional SQLite-backed session persistence for replay across restarts.
+ - feat(cancel): Manual cancel endpoint /v1/cancel/{session_id}; auto-cancel after disconnect threshold.
+ - docs: Updated [README.md](README.md), [ARCHITECTURE.md](ARCHITECTURE.md), [RULES.md](RULES.md). Rewrote [TODO.md](TODO.md) pending/complete items (see repo TODO).
+ - chore: Removed Node.js and scripts from the prior stack.
+
+ Verification checklist
+ - Non-stream text-only request returns a valid completion.
+ - Image and video prompts pass through preprocessing and generate coherent output.
+ - Streaming emits OpenAI-style deltas and ends with [DONE].
+ - Resume works with Last-Event-ID and session_id across reconnects; works after server restart when PERSIST_SESSIONS=1.
+ - Manual cancel halts generation and marks session finished; subsequent resumes return a finished stream.
+ - Auto-cancel fires after all clients disconnect for CANCEL_AFTER_DISCONNECT_SECONDS and cooperatively stops generation.
+
+ End of entry.
+ ## Progress Log Template (Mandatory per RULES)
+
+ Use this template for every change or progress step. Add a new entry before/with each commit, then append the final commit hash after push. See enforcement in [RULES.md](RULES.md:33) and the progress policy in [RULES.md](RULES.md:49).
+
+ Entry template
+ - Date/Time (Asia/Jakarta): YYYY-MM-DD HH:mm
+ - Commit: <hash> - <conventional message>
+ - Scope/Files (clickable anchors required):
+ - [Python.function chat_completions()](main.py:591)
+ - [Python.function infer_stream()](main.py:375)
+ - [README.md](README.md:1), [ARCHITECTURE.md](ARCHITECTURE.md:1), [RULES.md](RULES.md:1), [TODO.md](TODO.md:1)
+ - Summary:
+ - What changed and why (problem/requirement)
+ - Changes:
+ - Short bullet list of code edits with anchors
+ - Verification:
+ - Commands:
+ - curl examples (non-stream, stream with session_id, resume with Last-Event-ID)
+ - cancel API test: curl -X POST http://localhost:3000/v1/cancel/mysession123
+ - Expected vs Actual:
+ - …
+ - Follow-ups/Limitations:
+ - …
+ - Notes:
+ - If commit hash unknown at authoring time, update the entry after git push.
+
+ Git sequence (run every time)
+ - git add .
+ - git commit -m "type(scope): short description"
+ - git push
+ - Update this entry with the final commit hash.
+
+ Example (filled)
+ - Date/Time: 2025-10-23 14:30 (Asia/Jakarta)
+ - Commit: f724450 - feat(stream): add SQLite persistence for SSE resume
+ - Scope/Files:
+ - [Python.class _SQLiteStore](main.py:482)
+ - [Python.function chat_completions()](main.py:591)
+ - [README.md](README.md:1), [ARCHITECTURE.md](ARCHITECTURE.md:1)
+ - Summary:
+ - Persist SSE chunks to SQLite for replay across restarts; enable via PERSIST_SESSIONS.
+ - Changes:
+ - Add _SQLiteStore with schema and CRUD
+ - Wire producer to append events to DB
+ - Replay DB events on resume before in-memory buffer
+ - Verification:
+ - curl -N -H "Content-Type: application/json" ^
+ -d "{\"session_id\":\"mysession123\",\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: 17*23?\"}],\"stream\":true}" ^
+ http://localhost:3000/v1/chat/completions
+ - Restart server; resume:
+ curl -N -H "Content-Type: application/json" ^
+ -H "Last-Event-ID: mysession123:42" ^
+ -d "{\"session_id\":\"mysession123\",\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: 17*23?\"}],\"stream\":true}" ^
+ http://localhost:3000/v1/chat/completions
+ - Expected vs Actual: replayed chunks after index 42, continued live, ended with [DONE].
+ - Follow-ups:
+ - Consider Redis store for multi-process coordination
+ ## Progress Log — 2025-10-23 14:31 (Asia/Jakarta)
+
+ - Commit: f724450 - docs: sync README/ARCHITECTURE/RULES with main.py; add progress log in CLAUDE.md; enforce mandatory Git
+ - Scope/Files (anchors):
+ - [Python.function chat_completions()](main.py:591)
+ - [Python.function infer_stream()](main.py:375)
+ - [Python.class _SSESession](main.py:435), [Python.class _SessionStore](main.py:449), [Python.class _SQLiteStore](main.py:482)
+ - [README.md](README.md:1), [ARCHITECTURE.md](ARCHITECTURE.md:1), [RULES.md](RULES.md:1), [CLAUDE.md](CLAUDE.md:1), [.env.example](.env.example:1)
+ - Summary:
+ - Completed Python migration and synchronized documentation. Implemented SSE streaming with resume, optional SQLite persistence, auto-cancel on disconnect, and manual cancel API. RULES now mandate Git usage and progress logging.
+ - Changes:
+ - Document streaming/resume/persistence/cancel in [README.md](README.md:1) and [ARCHITECTURE.md](ARCHITECTURE.md:1)
+ - Enforce Git workflow and progress logging in [RULES.md](RULES.md:33)
+ - Add Progress Log template and entries in [CLAUDE.md](CLAUDE.md:1)
+ - Verification:
+ - Non-stream:
+ curl -X POST http://localhost:3000/v1/chat/completions ^
+ -H "Content-Type: application/json" ^
+ -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}]}"
+ - Stream:
+ curl -N -H "Content-Type: application/json" ^
+ -d "{\"session_id\":\"mysession123\",\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: 17*23?\"}],\"stream\":true}" ^
+ http://localhost:3000/v1/chat/completions
+ - Resume:
+ curl -N -H "Content-Type: application/json" ^
+ -H "Last-Event-ID: mysession123:42" ^
+ -d "{\"session_id\":\"mysession123\",\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: 17*23?\"}],\"stream\":true}" ^
+ http://localhost:3000/v1/chat/completions
+ - Cancel:
+ curl -X POST http://localhost:3000/v1/cancel/mysession123
+ - Results:
+ - Streaming emits chunks, ends with [DONE]; resume replays after index; cancel terminates generation; auto-cancel after disconnect threshold works via timer + stopping criteria.
+ - Follow-ups:
+ - Optional Redis store for multi-process coordination.
Dockerfile ADDED
@@ -0,0 +1,73 @@
+ # Use Python 3.12 slim for smaller image
+ FROM python:3.12-slim
+
+ # Install system deps for image/video processing and HF
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+ git \
+ curl \
+ libglib2.0-0 \
+ libgomp1 \
+ && apt-get clean \
+ && rm -rf /var/lib/apt/lists/*
+
+ # Set working directory
+ WORKDIR /app
+
+ # Copy requirements first for better caching
+ COPY requirements.txt .
+
+ # Backend selector: cpu | nvidia | amd
+ ARG BACKEND=cpu
+ # Pin torch versions per backend index
+ # - CPU index publishes newer (2.9.0 ok)
+ # - CUDA cu124 index publishes up to 2.6.0 (auto-resolves to +cu124)
+ # - ROCm 6.2 index publishes up to 2.5.1+rocm6.2 (must include local tag)
+ ARG TORCH_VER_CPU=2.9.0
+ ARG TORCH_VER_NVIDIA=2.6.0
+ ARG TORCH_VER_AMD=2.5.1+rocm6.2
+
+ # Control whether to bake the model into the image (1) or skip and download at runtime (0)
+ ARG BAKE_MODEL=0
+
+ ENV BACKEND=${BACKEND}
+ ENV BAKE_MODEL=${BAKE_MODEL}
+ ENV PIP_NO_CACHE_DIR=1
+
+ # Install appropriate PyTorch for the selected backend, then the rest
+ RUN if [ "$BACKEND" = "cpu" ]; then \
+ pip install --no-cache-dir --index-url https://download.pytorch.org/whl/cpu torch==${TORCH_VER_CPU}; \
+ elif [ "$BACKEND" = "nvidia" ]; then \
+ pip install --no-cache-dir --index-url https://download.pytorch.org/whl/cu124 torch==${TORCH_VER_NVIDIA}; \
+ elif [ "$BACKEND" = "amd" ]; then \
+ pip install --no-cache-dir --index-url https://download.pytorch.org/whl/rocm6.2 "torch==${TORCH_VER_AMD}"; \
+ else \
+ echo "Unsupported BACKEND: $BACKEND" && exit 1; \
+ fi && \
+ pip install --no-cache-dir -r requirements.txt
+
+ # Copy source code
+ COPY main.py .
+ COPY tests/ tests/
+
+ # Copy env template (users can override with volume or env)
+ COPY .env.example .env
+
+ # HF cache and optional model bake-in (skippable for huge GPU builds to avoid runner disk exhaustion)
+ ENV HF_HOME=/app/hf-cache
+ ENV TRANSFORMERS_CACHE=/app/hf-cache
+ RUN mkdir -p /app/hf-cache && \
+ if [ "$BAKE_MODEL" = "1" ]; then \
+ python -c "import os; from huggingface_hub import snapshot_download; repo_id='Qwen/Qwen3-VL-2B-Thinking'; token=os.getenv('HF_TOKEN'); print(f'Downloading {repo_id}...'); snapshot_download(repo_id, token=token, local_dir='/app/hf-cache/Qwen_Qwen3-VL-2B-Thinking', local_dir_use_symlinks=False); print('Model downloaded.');"; \
+ else \
+ echo 'Skipping model bake-in (BAKE_MODEL=0). The server will prefetch to /app/hf-cache at startup.'; \
+ fi
+
+ # Expose port
+ EXPOSE 3000
+
+ # Health check
+ HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
+ CMD curl -f http://localhost:3000/health || exit 1
+
+ # Run the server
+ CMD ["python", "main.py"]
HISTORY.md ADDED
@@ -0,0 +1,3 @@
+ # History
+
+ This file tracks the chat history.
LICENSE ADDED
@@ -0,0 +1,50 @@
+ Modified Apache License, Version 2.0 (Royalty-Linked)
+
+ Copyright (c) 2025 Alif Nurhidayat (GitHub: KillerKing93) <alifnurhidayatwork@gmail.com>
+
+ 1. Incorporation of Apache License, Version 2.0
+ This Software is provided under the terms of the Apache License, Version 2.0, with the additional terms set forth below. Except as modified herein, the full text of the Apache License, Version 2.0 applies and is incorporated by reference:
+ https://www.apache.org/licenses/LICENSE-2.0
+
+ 2. Additional Terms — Commercial Royalty
+ 2.1. Commercial Use of this Software is subject to a royalty obligation to the Licensor (the copyright holder), unless explicitly exempted below.
+ 2.2. “Commercial Use” means any use, distribution, SaaS offering, internal deployment tied to revenue generation or cost reduction, embedding in a paid product or service, or use by for-profit entities for business operations.
+ 2.3. “Licensor” means the copyright holder named above.
+
+ 3. Royalty Schedule (Guidance)
+ The payable royalty ranges from 5% up to 25% of net revenue attributable to the Software, determined as follows:
+ - 5%: Minimal or ancillary use (non-core functionality, prototypes, or internal tools with limited scope).
+ - 10%: Moderate use (Software forms a notable but non-primary component of a commercial product or service).
+ - 15%: Significant use (Software materially contributes to value delivery or operational savings).
+ - 20%: Core use (Software is a primary component enabling the product/service).
+ - 25%: White-label or redistribution scenarios where the Software is central and repackaged for resale.
+ Notes:
+ - “Net revenue attributable” should reasonably apportion revenue or savings connected to the Software’s role.
+ - When in doubt, contact the Licensor to agree on a fair rate and basis. Written waivers or adjustments override this schedule.
+
+ 4. Exemptions
+ 4.1. PT. ASCON INOVASI DATA is granted a perpetual, worldwide, royalty-free license to use, reproduce, distribute, modify, and create derivative works of the Software for any purpose.
+ 4.2. Non-commercial academic research and open-source contributions not tied to revenue generation are generally exempt from royalties. However, redistribution or commercial hosting still requires compliance with Section 2.
+
+ 5. Reporting and Payment
+ 5.1. For Commercial Use, Licensee shall make a good-faith effort to notify the Licensor within 30 days of first commercial deployment and, if requested, provide a brief description of the use case to determine the appropriate royalty tier.
+ 5.2. Royalties shall be settled quarterly unless otherwise agreed in writing.
+
+ 6. No Warranty; Limitation of Liability
+ As per Apache License 2.0 Sections 7 and 8, the Software is provided “AS IS,” without warranties or conditions of any kind, and Licensor shall not be liable for any damages arising from the use of the Software.
+
+ 7. Patent Grant
+ As per Apache License 2.0 Section 3. No additional patent licenses are granted or implied beyond Apache 2.0.
+
+ 8. Attribution
+ Attribution notices required by Apache 2.0 must be preserved in source distributions and, where practical, in documentation and About screens of products/services using the Software.
+
+ 9. Severability
+ If any provision of these Additional Terms is held unenforceable, the remaining provisions shall remain in full force and effect, and the unenforceable provision shall be enforced to the maximum extent permissible.
+
+ 10. Contact
+ For royalty discussions, waivers, or clarifications:
+ - Licensor: Alif Nurhidayat (KillerKing93)
+ - Email: alifnurhidayatwork@gmail.com
+
+ By using this Software in a Commercial Use, you acknowledge the applicability of these Additional Terms alongside the Apache License, Version 2.0.
README.md CHANGED
@@ -1,12 +1,358 @@
- ---
- title: Transformers InferenceServer OpenAPI Compatible
- emoji: 🌖
- colorFrom: green
- colorTo: green
- sdk: docker
- pinned: false
- license: other
- short_description: Transformers-InferenceServer-OpenAPI-Compatible
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Python FastAPI Inference Server (OpenAI-Compatible) for Qwen3-VL-2B-Thinking
+
+ This repository has been migrated from a Node.js/llama.cpp stack to a Python/Transformers stack to fully support multimodal inference (text, images, videos) with the Hugging Face Qwen3 models.
+
+ Key files:
+
+ - Server entry: [main.py](main.py)
+ - Environment template: [.env.example](.env.example)
+ - Python dependencies: [requirements.txt](requirements.txt)
+ - Architecture: [ARCHITECTURE.md](ARCHITECTURE.md) (updated to reflect the Python stack)
+
+ Model:
+
+ - Default: Qwen/Qwen3-VL-2B-Thinking (Transformers; supports multimodal)
+ - You can change the model via environment variable MODEL_REPO_ID.
+
+ Node.js artifacts and scripts from the previous project have been removed.
+
+ ## Quick Start
+
+ ### Option 1: Run with Docker (with-model images: CPU / NVIDIA / AMD)
+
+ Tags built by CI:
+ - ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-cpu
+ - ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-nvidia
+ - ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-amd
+
+ Pull:
+
+ ```bash
+ # CPU
+ docker pull ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-cpu
+
+ # NVIDIA (CUDA 12.4 wheel)
+ docker pull ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-nvidia
+
+ # AMD (ROCm 6.2 wheel)
+ docker pull ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-amd
+ ```
+
+ Run:
+
+ ```bash
+ # CPU
+ docker run -p 3000:3000 \
+ -e HF_TOKEN=your_hf_token_here \
+ ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-cpu
+
+ # NVIDIA GPU (requires NVIDIA drivers + nvidia-container-toolkit on the host)
+ docker run --gpus all -p 3000:3000 \
+ -e HF_TOKEN=your_hf_token_here \
+ ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-nvidia
+
+ # AMD GPU ROCm (requires ROCm 6.2+ drivers on the host; Linux only)
+ # Map ROCm devices and video group (may vary by distro)
+ docker run --device=/dev/kfd --device=/dev/dri --group-add video \
+ -p 3000:3000 \
+ -e HF_TOKEN=your_hf_token_here \
+ ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-amd
+ ```
+
+ Health check:
+ ```bash
+ curl http://localhost:3000/health
+ ```
+
+ Notes:
+ - These are with-model images; the first pull is large. In CI, after "Model downloaded." BuildKit may appear idle while tarring/committing the multi‑GB layer.
+ - Host requirements:
+ - NVIDIA: recent driver + nvidia-container-toolkit.
+ - AMD: ROCm 6.2+ driver stack, supported GPU, and mapped /dev/kfd and /dev/dri devices.
+
+ ### Option 2: Run Locally
+
+ Requirements
+
+ - Python 3.10+
+ - pip
+ - PyTorch (install a wheel matching your platform/CUDA)
+ - Optionally a GPU with enough VRAM for the chosen model
+
+ Install
+
+ 1. Create and activate a virtual environment (Windows CMD):
+ python -m venv .venv
+ .venv\Scripts\activate
+
+ 2. Install dependencies:
+ pip install -r requirements.txt
+
+ 3. Install PyTorch appropriate for your platform (examples):
+ CPU-only:
+ pip install torch --index-url https://download.pytorch.org/whl/cpu
+ CUDA 12.4 example:
+ pip install torch --index-url https://download.pytorch.org/whl/cu124
+
+ 4. Create a .env from the template and adjust if needed:
+ copy .env.example .env
+ - Set HF_TOKEN if the model is gated
+ - Adjust MAX_TOKENS, TEMPERATURE, DEVICE_MAP, TORCH_DTYPE, MAX_VIDEO_FRAMES as desired
+
+ Configuration via .env
+ See [.env.example](.env.example). Important variables:
+
+ - PORT=3000
+ - MODEL_REPO_ID=Qwen/Qwen3-VL-2B-Thinking
+ - HF_TOKEN= # optional if gated
+ - MAX_TOKENS=4096
+ - TEMPERATURE=0.7
+ - MAX_VIDEO_FRAMES=16
+ - DEVICE_MAP=auto
+ - TORCH_DTYPE=auto
+
+ Additional streaming/persistence configuration
+
+ - PERSIST_SESSIONS=1 # enable SQLite-backed resumable SSE
+ - SESSIONS_DB_PATH=sessions.db # SQLite db path
+ - SESSIONS_TTL_SECONDS=600 # TTL for finished sessions before GC
+ - CANCEL_AFTER_DISCONNECT_SECONDS=3600 # auto-cancel generation if all clients disconnect for this many seconds (0=disable)
+
+ Cancel session API (custom extension)
+
+ - Endpoint: POST /v1/cancel/{session_id}
+ - Purpose: Manually cancel an in-flight streaming generation for the given session_id. Not part of OpenAI Chat Completions spec (the newer OpenAI Responses API has cancel), so this is provided as a practical extension.
+ - Example (Windows CMD):
+ curl -X POST http://localhost:3000/v1/cancel/mysession123
+ Run
+
+ - Direct:
+ python main.py
+
+ - Using uvicorn:
+ uvicorn main:app --host 0.0.0.0 --port 3000
+
+ Endpoints (OpenAI-compatible)
+
+ - Health
+ GET /health
+ Example:
+ curl http://localhost:3000/health
+ Response:
+ {
+ "ok": true,
+ "modelReady": true,
+ "modelId": "Qwen/Qwen3-VL-2B-Thinking",
+ "error": null
+ }
+
+ - Chat Completions (non-streaming)
+ POST /v1/chat/completions
+ Example (Windows CMD):
+ curl -X POST http://localhost:3000/v1/chat/completions ^
+ -H "Content-Type: application/json" ^
+ -d "{\"model\":\"qwen-local\",\"messages\":[{\"role\":\"user\",\"content\":\"Describe this image briefly\"}],\"max_tokens\":128}"
+
+ Example (PowerShell):
+ $body = @{
+ model = "qwen-local"
+ messages = @(@{ role = "user"; content = "Hello Qwen3!" })
+ max_tokens = 128
+ } | ConvertTo-Json -Depth 5
+ curl -Method POST http://localhost:3000/v1/chat/completions -ContentType "application/json" -Body $body
+
+ - Chat Completions (streaming via Server-Sent Events)
+ Set "stream": true to receive partial deltas as they are generated.
+ Example (Windows CMD):
+ curl -N -H "Content-Type: application/json" ^
+ -d "{\"model\":\"qwen-local\",\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: what is 17 * 23?\"}],\"stream\":true}" ^
+ http://localhost:3000/v1/chat/completions
+
+ The stream format follows OpenAI-style SSE:
+ data: { "id": "...", "object": "chat.completion.chunk", "choices":[{ "delta": {"role": "assistant"} }]}
+ data: { "choices":[{ "delta": {"content": "To"} }]}
+ data: { "choices":[{ "delta": {"content": " think..."} }]}
+ ...
+ data: { "choices":[{ "delta": {}, "finish_reason": "stop"}]}
+ data: [DONE]
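Because the endpoint is OpenAI-compatible, the official openai Python SDK (>= 1.0) can usually be pointed at it as well. This is an illustrative sketch; the api_key is a placeholder on the assumption that the server does not enforce authentication:

```python
# Illustrative sketch: using the openai Python SDK against this server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="not-needed")

# Non-streaming
resp = client.chat.completions.create(
    model="qwen-local",
    messages=[{"role": "user", "content": "Hello Qwen3!"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="qwen-local",
    messages=[{"role": "user", "content": "Think step by step: what is 17 * 23?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```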
+
+ Multimodal Usage
+
+ - Text only:
+ { "role": "user", "content": "Summarize: The quick brown fox ..." }
+
+ - Image by URL:
+ {
+ "role": "user",
+ "content": [
+ { "type": "text", "text": "What is in this image?" },
+ { "type": "image_url", "image_url": { "url": "https://example.com/cat.jpg" } }
+ ]
+ }
+
+ - Image by base64:
+ {
+ "role": "user",
+ "content": [
+ { "type": "text", "text": "OCR this." },
+ { "type": "input_image", "b64_json": "<base64 of image bytes>" }
+ ]
+ }
+
+ - Video by URL (frames are sampled up to MAX_VIDEO_FRAMES):
+ {
+ "role": "user",
+ "content": [
+ { "type": "text", "text": "Describe this clip." },
+ { "type": "video_url", "video_url": { "url": "https://example.com/clip.mp4" } }
+ ]
+ }
+
+ - Video by base64:
+ {
+ "role": "user",
+ "content": [
+ { "type": "text", "text": "Count the number of cars." },
+ { "type": "input_video", "b64_json": "<base64 of full video file>" }
+ ]
+ }
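A short Python sketch of sending one of the base64 payloads above from a local file, using only the request shapes documented in this section (file name and timeout are arbitrary examples):

```python
# Build an "input_image" message from a local file and post it with requests.
import base64
import requests

with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "model": "qwen-local",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "OCR this."},
                {"type": "input_image", "b64_json": b64},
            ],
        }
    ],
    "max_tokens": 256,
}

resp = requests.post("http://localhost:3000/v1/chat/completions", json=payload, timeout=600)
print(resp.json()["choices"][0]["message"]["content"])
```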
+
+ Implementation Notes
+
+ - Server code: [main.py](main.py)
+ - FastAPI with CORS enabled
+ - Non-streaming and streaming endpoints
+ - Uses AutoProcessor and AutoModelForCausalLM with trust_remote_code=True
+ - Converts OpenAI-style messages into the Qwen multimodal format
+ - Images loaded via PIL; videos loaded via imageio.v3 (preferred) or OpenCV as fallback; frames sampled
+
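For orientation, the loading step described in the notes above typically boils down to something like the following. This is a condensed sketch under the stated configuration (trust_remote_code, DEVICE_MAP, TORCH_DTYPE); the actual Engine class in main.py handles more, such as lazy loading and error reporting:

```python
# Condensed sketch of model/processor loading (not the actual Engine code).
import os
from transformers import AutoModelForCausalLM, AutoProcessor

repo_id = os.getenv("MODEL_REPO_ID", "Qwen/Qwen3-VL-2B-Thinking")
token = os.getenv("HF_TOKEN") or None

processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True, token=token)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    token=token,
    device_map=os.getenv("DEVICE_MAP", "auto"),
    torch_dtype=os.getenv("TORCH_DTYPE", "auto"),
)
```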
+ Performance Tips
+
+ - On GPUs: set DEVICE_MAP=auto and TORCH_DTYPE=bfloat16 or float16 if supported
+ - Reduce MAX_VIDEO_FRAMES to speed up video processing
+ - Tune MAX_TOKENS and TEMPERATURE according to your needs
+
+ Troubleshooting
+
+ - ImportError or no CUDA found:
+ - Ensure PyTorch is installed with the correct wheel for your environment.
+ - OOM / CUDA out of memory:
+ - Use a smaller model, lower MAX_VIDEO_FRAMES, lower MAX_TOKENS, or run on CPU.
+ - 503 Model not ready:
+ - The first request triggers model load; check /health for errors and HF_TOKEN if gated.
+
+ License
+
+ - See LICENSE for terms.
+
+ Changelog and Architecture
+
+ - [ARCHITECTURE.md](ARCHITECTURE.md) has been updated to reflect the Python server flow.
+
+ ## Streaming behavior, resume, and reconnections
+
+ The server streams responses using Server‑Sent Events (SSE) from [Python.function chat_completions()](main.py:457), driven by token iteration in [Python.function infer_stream](main.py:361). It now supports resumable streaming using an in‑memory ring buffer and SSE Last-Event-ID, with optional SQLite persistence (enable PERSIST_SESSIONS=1).
+
+ What’s implemented
+
+ - Per-session in-memory ring buffer keyed by session_id (no external storage).
+ - Each SSE event carries an SSE id line in the format "session_id:index" so clients can resume with Last-Event-ID.
+ - On reconnect:
+ - Provide the same session_id in the request body, and
+ - Provide "Last-Event-ID: session_id:index" header (or query ?last_event_id=session_id:index).
+ - The server replays cached events after index and continues streaming new tokens.
+ - Session TTL: ~10 minutes, buffer capacity: ~2048 events. Old or finished sessions are garbage-collected in-memory.
+
+ How to start a streaming session
+
+ - Minimal (server generates a session_id internally for SSE id lines):
+ Windows CMD:
+ curl -N -H "Content-Type: application/json" ^
+ -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: 17*23?\"}],\"stream\":true}" ^
+ http://localhost:3000/v1/chat/completions
+
+ - With explicit session_id (recommended if you want to resume):
+ Windows CMD:
+ curl -N -H "Content-Type: application/json" ^
+ -d "{\"session_id\":\"mysession123\",\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: 17*23?\"}],\"stream\":true}" ^
+ http://localhost:3000/v1/chat/completions
+
+ How to resume after disconnect
+
+ - Use the same session_id and the SSE Last-Event-ID header (or ?last_event_id=...):
+ Windows CMD (resume from index 42):
+ curl -N -H "Content-Type: application/json" ^
+ -H "Last-Event-ID: mysession123:42" ^
+ -d "{\"session_id\":\"mysession123\",\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: 17*23?\"}],\"stream\":true}" ^
+ http://localhost:3000/v1/chat/completions
+
+ Alternatively with query string:
+ http://localhost:3000/v1/chat/completions?last_event_id=mysession123:42
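The same resume flow can be scripted. Below is a minimal sketch using only `requests`: it tracks the last SSE id it saw and sends it back as Last-Event-ID when reconnecting; session_id and the prompt match the curl examples above:

```python
# Sketch of a resuming SSE client built on requests (illustrative only).
import json
import requests

URL = "http://localhost:3000/v1/chat/completions"
body = {
    "session_id": "mysession123",
    "messages": [{"role": "user", "content": "Think step by step: 17*23?"}],
    "stream": True,
}

def stream_once(last_event_id=None):
    headers = {"Content-Type": "application/json"}
    if last_event_id:
        headers["Last-Event-ID"] = last_event_id
    with requests.post(URL, json=body, headers=headers, stream=True, timeout=600) as r:
        last_id = last_event_id
        for raw in r.iter_lines(decode_unicode=True):
            if not raw:
                continue
            if raw.startswith("id: "):
                last_id = raw[4:].strip()          # remember "session_id:index"
            elif raw.startswith("data: "):
                data = raw[6:]
                if data == "[DONE]":
                    return last_id, True
                delta = json.loads(data)["choices"][0]["delta"].get("content", "")
                print(delta, end="", flush=True)
        return last_id, False

# First attempt; on a dropped connection, call stream_once(last_id) again to resume.
last_id, finished = stream_once()
```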
+
+ Event format
+
+ - Chunks follow the OpenAI-style "chat.completion.chunk" shape in data payloads, plus an SSE id:
+ id: mysession123:5
+ data: {"id":"mysession123","object":"chat.completion.chunk","created":..., "model":"...", "choices":[{"index":0,"delta":{"content":" token"},"finish_reason":null}]}
+
+ - The stream ends with:
+ data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
+ data: [DONE]
+
+ Notes and limits
+
+ - Without SQLite persistence enabled (PERSIST_SESSIONS=0), session state is kept only in memory; restarts will drop buffers.
+ - If the buffer overflows before you resume, the earliest chunks may be unavailable.
+ - Cancellation on client disconnect is handled by an auto-cancel timer after CANCEL_AFTER_DISCONNECT_SECONDS (see "Cancellation and session persistence" below); if that timer is disabled, generation runs to completion in the background.
+
+ ## Hugging Face repository files support
+
+ This server loads the Qwen3-VL model via Transformers with `trust_remote_code=True`, so the standard files from the repo are supported and consumed automatically. Summary for https://huggingface.co/Qwen/Qwen3-VL-2B-Thinking/tree/main:
+
+ - Used by model weights and architecture
+
+ - model.safetensors — main weights loaded by AutoModelForCausalLM
+ - config.json — architecture/config
+ - generation_config.json — default gen params (we may override via request or env)
+
+ - Used by tokenizer
+
+ - tokenizer.json — primary tokenizer specification
+ - tokenizer_config.json — tokenizer settings
+ - merges.txt and vocab.json — fallback/compat files; if tokenizer.json exists, HF generally prefers it
+
+ - Used by processors (multimodal)
+
+ - preprocessor_config.json — image/text processor config
+ - video_preprocessor_config.json — video processor config (frame sampling, etc.)
+ - chat_template.json — chat formatting used by [Python.function infer](main.py:312) and [Python.function infer_stream](main.py:361) via `processor.apply_chat_template(...)`
+
+ - Not required for runtime
+ - README.md, .gitattributes — ignored by runtime
+
+ Notes:
+
+ - We rely on Transformers’ AutoModelForCausalLM and AutoProcessor to resolve and use the above files; no manual parsing is required in our code.
+ - With `trust_remote_code=True`, model-specific code from the repo may load additional assets transparently.
+ - If the repo updates configs (e.g., new chat template), the server will pick them up on next load.
+
+ ## Cancellation and session persistence
+
+ - Auto-cancel on disconnect:
+
+ - Generation is automatically cancelled if all clients disconnect for more than CANCEL_AFTER_DISCONNECT_SECONDS (default 3600 seconds = 1 hour). Configure in [.env.example](.env.example) via `CANCEL_AFTER_DISCONNECT_SECONDS`.
+ - Implemented by a timer in [Python.function chat_completions](main.py:732) that triggers a cooperative stop through a stopping criteria in [Python.function infer_stream](main.py:375).
+
+ - Manual cancel API (custom extension):
+
+ - Endpoint: `POST /v1/cancel/{session_id}`
+ - Cancels an ongoing streaming session and marks it finished in the store. Example (Windows CMD):
+ curl -X POST http://localhost:3000/v1/cancel/mysession123
+ - This is not part of OpenAI’s legacy Chat Completions spec. OpenAI’s newer Responses API has a cancel endpoint, but Chat Completions does not. We provide this custom endpoint for operational control.
+
+ - Persistence:
+ - Optional SQLite-backed persistence for resumable SSE (enable `PERSIST_SESSIONS=1` in [.env.example](.env.example)).
+ - Database path: `SESSIONS_DB_PATH` (default: sessions.db)
+ - Session TTL for GC: `SESSIONS_TTL_SECONDS` (default: 600)
+ - See implementation in [Python.class \_SQLiteStore](main.py:481) and integration in [Python.function chat_completions](main.py:591).
+ - Redis is not implemented yet; the design isolates persistence so a Redis-backed store can be added as a drop-in.
RULES.md ADDED
@@ -0,0 +1,207 @@
1
+ # Project Rules and Workflow (Python FastAPI + Transformers)
2
+
3
+ These rules are binding for every change. Keep code, docs, and behavior synchronized at all times.
4
+
5
+ Files referenced below:
6
+ - [README.md](README.md)
7
+ - [ARCHITECTURE.md](ARCHITECTURE.md)
8
+ - [TODO.md](TODO.md)
9
+ - [CLAUDE.md](CLAUDE.md)
10
+ - [.env.example](.env.example)
11
+ - [.gitignore](.gitignore)
12
+ - [requirements.txt](requirements.txt)
13
+ - [Python.main()](main.py:1)
14
+
15
+ ## 1) Documentation rules (must-do on every change)
16
+
17
+ Always update documentation when code or behavior changes.
18
+
19
+ Minimum documentation checklist:
20
+ - What changed and where (filenames, sections, or callable links like [Python.function chat_completions()](main.py:591)).
21
+ - Why the change was made (problem or requirement).
22
+ - How to operate or verify (commands, endpoints, examples).
23
+ - Follow-ups or known limitations.
24
+
25
+ Where to update:
26
+ - Operator-facing: [README.md](README.md)
27
+ - Developer-facing: [CLAUDE.md](CLAUDE.md) (rationale, alternatives, caveats)
28
+ - Architecture or flows: [ARCHITECTURE.md](ARCHITECTURE.md)
29
+ - Tasks and statuses: [TODO.md](TODO.md)
30
+
31
+ Never skip documentation. If a change is reverted, document the revert.
32
+
33
+ ## 2) Git discipline (mandatory)
34
+
35
+ - Always use Git. Every change or progress step MUST be committed and pushed.
36
+ - Windows CMD example:
37
+ - git add .
38
+ - git commit -m "type(scope): short description"
39
+ - git push
40
+ - No exceptions. If no remote exists, commit locally and configure a remote as soon as possible. Record any temporary push limitations in [README.md](README.md) and [CLAUDE.md](CLAUDE.md), but commits are still required locally.
41
+ - Commit style:
42
+ - Conventional types: chore, docs, feat, fix, refactor, perf, test, build, ci
43
+ - Keep commits small and atomic (one concern per commit).
44
+ - Reference important files in the commit body, for example: updated [Python.function chat_completions()](main.py:591), [README.md](README.md).
45
+ - After updating code or docs, commit immediately. Do not batch unrelated changes.
46
+
47
+ ## 2.1) Progress log (mandatory)
48
+
49
+ - Every commit MUST include a corresponding entry in [CLAUDE.md](CLAUDE.md) under a “Progress Log” section.
50
+ - Each entry must include:
51
+ - Date/time (Asia/Jakarta)
52
+ - Scope and short summary of the change
53
+ - The final Git commit hash and commit message
54
+ - Files and exact callable anchors touched (use clickable anchors), e.g. [Python.function chat_completions()](main.py:591), [README.md](README.md:1), [ARCHITECTURE.md](ARCHITECTURE.md:1)
55
+ - Verification steps and results (curl examples, expected vs actual, notes)
56
+ - Required sequence:
57
+ 1) Make code changes
58
+ 2) Update docs: [README.md](README.md), [ARCHITECTURE.md](ARCHITECTURE.md), [TODO.md](TODO.md), and add a new progress log entry in [CLAUDE.md](CLAUDE.md)
59
+ 3) Run Git commands:
60
+ - git add .
61
+ - git commit -m "type(scope): short description"
62
+ - git push
63
+ 4) Append the final commit hash to the [CLAUDE.md](CLAUDE.md) entry if it was not known at authoring time
64
+ - No code change may land without a synchronized progress log entry.
65
+
66
+ ## 3) Large artifacts policy (.gitignore)
67
+
68
+ Never commit large/generated artifacts. Keep the repository lean and reproducible.
69
+
70
+ Must be ignored:
71
+ - models/ (downloaded by HF/Transformers cache or tools at runtime)
72
+ - .venv/, venv/
73
+ - __pycache__/
74
+ - .cache/
75
+ - uploads/, data/, tmp/
76
+
77
+ See [.gitignore](.gitignore) and extend as needed for new generated outputs. If you add ignores, document the rationale in [CLAUDE.md](CLAUDE.md).
78
+
79
+ ## 4) Model policy (Hugging Face / Transformers)
80
+
81
+ Target default model:
82
+ - Qwen/Qwen3-VL-2B-Thinking (Transformers; multimodal).
83
+
84
+ Rules:
85
+ - Use Hugging Face Transformers with trust_remote_code=True (AutoProcessor plus an AutoModel* class; the engine prefers AutoModelForImageTextToText and falls back to AutoModelForVision2Seq, then AutoModelForCausalLM). A minimal loading sketch follows at the end of this section.
86
+ - Do not commit model weights or caches. Let from_pretrained() download to local caches.
87
+ - Handle authentication for gated models via HF_TOKEN in [.env.example](.env.example).
88
+ - The server must remain OpenAI-compatible at /v1/chat/completions and support multimodal inputs (text, images, videos).
89
+ - Keep configuration via environment variables (see [Python.os.getenv()](main.py:67)).
90
+
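+ A minimal loading sketch for the policy above (illustrative only; it assumes a Transformers version that exposes AutoModelForImageTextToText, while the engine in [Python.main()](main.py:1) adds fallbacks to other Auto classes, eager prefetch, and error handling):
+
+ ```python
+ # Load processor + model as described above; weights download into the HF cache.
+ import os
+ from transformers import AutoProcessor, AutoModelForImageTextToText
+
+ repo_id = os.getenv("MODEL_REPO_ID", "Qwen/Qwen3-VL-2B-Thinking")
+ token = os.getenv("HF_TOKEN") or None  # only needed for gated/private repos
+
+ processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True, token=token)
+ model = AutoModelForImageTextToText.from_pretrained(
+     repo_id,
+     trust_remote_code=True,
+     token=token,
+     device_map=os.getenv("DEVICE_MAP", "auto"),
+     torch_dtype=os.getenv("TORCH_DTYPE", "auto"),
+ ).eval()
+ ```
+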
91
+ ## 5) API contract
92
+
93
+ Provide an OpenAI-compatible endpoint:
94
+ - POST /v1/chat/completions
95
+
96
+ Minimum behavior:
97
+ - Accept model and messages per OpenAI schema (we honor messages; model is informational since server is pinned via env).
98
+ - Non-streaming JSON response (an example request is shown at the end of this section).
99
+ - Streaming SSE response when body.stream=true:
100
+ - Emit OpenAI-style chat.completion.chunk deltas.
101
+ - Include SSE id lines "session_id:index" to support resume via Last-Event-ID.
102
+
103
+ Resume semantics:
104
+ - Client provides a session_id (or server generates one).
105
+ - Client may reconnect and send Last-Event-ID: session_id:index to replay missed chunks.
106
+ - Session data can be persisted (SQLite) if enabled.
107
+
108
+ Manual cancel (custom extension):
109
+ - POST /v1/cancel/{session_id} cancels a streaming generation.
110
+ - Note: Not part of legacy OpenAI Chat Completions spec. It mirrors the spirit of the newer OpenAI Responses API cancel endpoint.
111
+
112
+ All endpoints must validate inputs, handle timeouts/failures, and return structured JSON errors.
113
+
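+ An example non-streaming multimodal request (the host, image URL, and parameter values are placeholders; `requests` is used only for illustration):
+
+ ```python
+ # OpenAI-style chat completion with a text part and an image part.
+ import requests
+
+ body = {
+     "model": "Qwen/Qwen3-VL-2B-Thinking",  # informational; the served model is pinned via env
+     "messages": [
+         {
+             "role": "user",
+             "content": [
+                 {"type": "text", "text": "What is in this image?"},
+                 {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
+             ],
+         }
+     ],
+     "max_tokens": 128,
+     "temperature": 0.2,
+ }
+ resp = requests.post("http://localhost:3000/v1/chat/completions", json=body, timeout=300)
+ print(resp.json()["choices"][0]["message"]["content"])
+ ```
+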
114
+ ## 6) Streaming, persistence, and cancellation
115
+
116
+ - Streaming is implemented via SSE in [Python.function chat_completions()](main.py:591) with token iteration in [Python.function infer_stream](main.py:375).
117
+ - In-memory ring buffer per session and optional SQLite persistence for replay across restarts:
118
+ - In-memory: [Python.class _SSESession](main.py:435), [Python.class _SessionStore](main.py:449)
119
+ - SQLite: [Python.class _SQLiteStore](main.py:482) (enabled with PERSIST_SESSIONS=1)
120
+ - Resume:
121
+ - Uses SSE id "session_id:index" and Last-Event-ID header (or ?last_event_id=...).
122
+ - Auto-cancel on disconnect:
123
+ - If all clients disconnect, generation is cancelled after CANCEL_AFTER_DISCONNECT_SECONDS (default 3600 seconds; configurable via env).
124
+ - Cooperative stop via StoppingCriteria in [Python.function infer_stream](main.py:375).
125
+ - Manual cancel:
126
+ - [Python.function cancel_session](main.py:792) to stop a session on demand; see the sketch below.
127
+
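+ A manual-cancel sketch (the host, port, and `demo-cancel` session_id are assumptions; `requests` is used only for illustration):
+
+ ```python
+ # Start a streaming request in a background thread, then cancel it by session_id.
+ import threading
+ import time
+ import requests
+
+ BASE = "http://localhost:3000"
+ SID = "demo-cancel"
+
+ def consume():
+     body = {
+         "session_id": SID,
+         "stream": True,
+         "messages": [{"role": "user", "content": "Write a very long story."}],
+     }
+     with requests.post(f"{BASE}/v1/chat/completions", json=body, stream=True) as r:
+         for _ in r.iter_lines():
+             pass  # discard chunks; the stream ends early once cancel is requested
+
+ t = threading.Thread(target=consume, daemon=True)
+ t.start()
+ time.sleep(1.0)  # give the stream a moment to start so the session exists
+
+ requests.post(f"{BASE}/v1/cancel/{SID}")  # sets the session's cancel event; the stream finishes with [DONE]
+ t.join(timeout=30)
+ ```
+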
128
+ ## 7) Logging and error handling
129
+
130
+ - Log key lifecycle stages (startup, model load, stream start/stop, resume).
131
+ - Redact sensitive fields (e.g., tokens, credentials).
132
+ - User errors → 400; model-not-ready → 503; unexpected failures → 500 (see the example below).
133
+ - Optionally add structured logging and request IDs in a follow-up.
134
+
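+ For reference, a sketch of the error shape the handlers return (FastAPI renders HTTPException as a JSON object with a `detail` field; the host is an assumption):
+
+ ```python
+ # An empty messages array triggers the 400 path in chat_completions.
+ import requests
+
+ r = requests.post("http://localhost:3000/v1/chat/completions", json={"messages": []})
+ assert r.status_code == 400
+ print(r.json())  # {"detail": "messages must be a non-empty array"}
+ ```
+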
135
+ ## 8) Architecture documentation
136
+
137
+ Keep [ARCHITECTURE.md](ARCHITECTURE.md) authoritative for:
138
+ - Startup flow and lazy model load
139
+ - Multimodal preprocessing (images/videos)
140
+ - Streaming, resume, persistence, and cancellation flows
141
+ - Error/timeout handling
142
+ - Extensibility (persistence strategies, cancellation hooks, scaling patterns)
143
+
144
+ Update when code paths or data flows change.
145
+
146
+ ## 9) TODO hygiene
147
+
148
+ Track all planned work in [TODO.md](TODO.md):
149
+ - Update statuses immediately when tasks start/complete.
150
+ - Add newly discovered tasks as soon as they are identified.
151
+ - Keep TODO focused, scoped, and prioritized.
152
+
153
+ ## 10) Operational requirements and environment
154
+
155
+ Required:
156
+ - Python: >= 3.10
157
+ - pip
158
+ - PyTorch: install a wheel matching platform/CUDA (see [requirements.txt](requirements.txt) notes)
159
+
160
+ Recommended:
161
+ - GPU with sufficient VRAM for the chosen model
162
+ - Windows 11 supported; Linux/macOS should also work
163
+
164
+ Environment variables (see [.env.example](.env.example)):
165
+ - PORT=3000
166
+ - MODEL_REPO_ID=Qwen/Qwen3-VL-2B-Thinking
167
+ - HF_TOKEN=
168
+ - MAX_TOKENS=256
169
+ - TEMPERATURE=0.7
170
+ - MAX_VIDEO_FRAMES=16
171
+ - DEVICE_MAP=auto
172
+ - TORCH_DTYPE=auto
173
+ - PERSIST_SESSIONS=1|0, SESSIONS_DB_PATH, SESSIONS_TTL_SECONDS
174
+ - CANCEL_AFTER_DISCONNECT_SECONDS=3600 (0 to disable)
175
+
176
+ ## 11) File responsibilities overview
177
+
178
+ - Server: [Python.main()](main.py:1)
179
+ - API routing, model singleton, inference, streaming, resume, cancel
180
+ - Docs: [README.md](README.md), [ARCHITECTURE.md](ARCHITECTURE.md)
181
+ - Dev log: [CLAUDE.md](CLAUDE.md)
182
+ - Tasks: [TODO.md](TODO.md)
183
+ - Config template: [.env.example](.env.example)
184
+ - Dependencies: [requirements.txt](requirements.txt)
185
+ - Ignores: [.gitignore](.gitignore)
186
+
187
+ ## 12) Workflow example (single iteration)
188
+
189
+ 1) Make a small, isolated change (e.g., enable SQLite persistence).
190
+ 2) Update docs:
191
+ - [CLAUDE.md](CLAUDE.md): what/why/how
192
+ - [README.md](README.md): operator usage changes
193
+ - [ARCHITECTURE.md](ARCHITECTURE.md): persistence/resume flow
194
+ - [TODO.md](TODO.md): status changes
195
+ 3) Commit and push:
196
+ - git add .
197
+ - git commit -m "feat(stream): add SQLite persistence for SSE resume"
198
+ - git push
199
+ 4) Verify locally; record any issues or follow-ups in [CLAUDE.md](CLAUDE.md).
200
+
201
+ ## 13) Compliance checklist (pre-merge / pre-push)
202
+
203
+ - Code runs locally (uvicorn main:app …).
204
+ - Docs updated ([README.md](README.md), [CLAUDE.md](CLAUDE.md), [ARCHITECTURE.md](ARCHITECTURE.md), [TODO.md](TODO.md)).
205
+ - No large artifacts added to git.
206
+ - Commit message follows conventional style.
207
+ - Endpoint contract honored (including streaming/resume semantics and cancel extension).
TODO.md ADDED
@@ -0,0 +1,14 @@
1
+ # TODO
2
+
3
+ - [ ] Initialize Git repository and perform initial commit.
4
+ - [x] Install necessary dependencies: `express`, `node-llama-cpp`.
5
+ - [x] Create the basic server structure in `index.js`.
6
+ - [ ] Implement the `/v1/chat/completions` endpoint.
7
+ - [ ] Load the Qwen3 model.
8
+ - [ ] Implement the inference logic using `node-llama-cpp`.
9
+ - [ ] Add error handling.
10
+ - [ ] Add logging.
11
+ - [ ] Write tests for the API endpoint.
12
+ - [ ] Update `CLAUDE.md` with detailed documentation.
13
+ - [ ] Update `ARCHITECTURE.md` with the project architecture.
14
+ - [ ] Push the initial project to the GitHub repository.
main.py ADDED
@@ -0,0 +1,1108 @@
1
+ #!/usr/bin/env python
2
+ # -*- coding: utf-8 -*-
3
+ """
4
+ FastAPI Inference Server (OpenAI-compatible) for Qwen3-VL multimodal model.
5
+
6
+ - Default model: Qwen/Qwen3-VL-2B-Thinking
7
+ - Endpoints:
8
+ * GET /openapi.yaml (OpenAPI schema in YAML)
9
+ * GET /health (readiness + context report)
10
+ * POST /v1/chat/completions (non-stream and streaming SSE)
11
+ * POST /v1/cancel/{session_id} (custom cancel endpoint)
12
+
13
+ Notes:
14
+ - Uses Hugging Face Transformers with trust_remote_code=True.
15
+ - Supports OpenAI-style chat messages with text, image_url/input_image, video_url/input_video.
16
+ - Streaming SSE supports resume (session_id + Last-Event-ID) with optional SQLite persistence.
17
+ - Auto prompt compression prevents context overflow with a simple truncate strategy.
18
+ """
19
+
20
+ import os
21
+ import io
22
+ import re
23
+ import base64
24
+ import tempfile
25
+ import contextlib
26
+ from typing import Any, Dict, List, Optional, Tuple, Deque
27
+
28
+ from fastapi import FastAPI, HTTPException, Request
29
+ from fastapi.middleware.cors import CORSMiddleware
30
+ from pydantic import BaseModel
31
+ from starlette.responses import JSONResponse
32
+ from fastapi.responses import StreamingResponse, Response
33
+ import json
34
+ import yaml
35
+ import threading
36
+ import time
37
+ import uuid
38
+ import sqlite3
39
+ from collections import deque
40
+ import subprocess
41
+ import sys
42
+ import shutil
43
+
44
+ # Load env
45
+ try:
46
+ from dotenv import load_dotenv
47
+ load_dotenv()
48
+ except Exception:
49
+ pass
50
+
51
+ # Ensure HF cache dirs are relative to this project by default
52
+ ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
53
+ DEFAULT_HF_CACHE = os.path.join(ROOT_DIR, "hf-cache")
54
+ if not os.getenv("HF_HOME"):
55
+ os.environ["HF_HOME"] = DEFAULT_HF_CACHE
56
+ if not os.getenv("TRANSFORMERS_CACHE"):
57
+ os.environ["TRANSFORMERS_CACHE"] = DEFAULT_HF_CACHE
58
+ # Create directory eagerly to avoid later mkdir races
59
+ try:
60
+ os.makedirs(os.environ["HF_HOME"], exist_ok=True)
61
+ except Exception:
62
+ pass
63
+
64
+ # Optional heavy deps are imported lazily inside Engine to improve startup UX
65
+ import requests
66
+ from PIL import Image
67
+ import numpy as np
68
+ from huggingface_hub import snapshot_download, list_repo_files, hf_hub_download, hf_hub_url, get_hf_file_metadata
69
+
70
+ # Server config
71
+ PORT = int(os.getenv("PORT", "3000"))
72
+ DEFAULT_MODEL_ID = os.getenv("MODEL_REPO_ID", "Qwen/Qwen3-VL-2B-Thinking")
73
+ HF_TOKEN = os.getenv("HF_TOKEN", "").strip() or None
74
+ DEFAULT_MAX_TOKENS = int(os.getenv("MAX_TOKENS", "256"))
75
+ DEFAULT_TEMPERATURE = float(os.getenv("TEMPERATURE", "0.7"))
76
+ MAX_VIDEO_FRAMES = int(os.getenv("MAX_VIDEO_FRAMES", "16"))
77
+ DEVICE_MAP = os.getenv("DEVICE_MAP", "auto")
78
+ TORCH_DTYPE = os.getenv("TORCH_DTYPE", "auto")
79
+
80
+ # Persistent session store (SQLite)
81
+ PERSIST_SESSIONS = str(os.getenv("PERSIST_SESSIONS", "0")).lower() in ("1", "true", "yes", "y")
82
+ SESSIONS_DB_PATH = os.getenv("SESSIONS_DB_PATH", "sessions.db")
83
+ SESSIONS_TTL_SECONDS = int(os.getenv("SESSIONS_TTL_SECONDS", "600"))
84
+ # Auto-cancel if all clients disconnect for duration (seconds). 0 disables it.
85
+ CANCEL_AFTER_DISCONNECT_SECONDS = int(os.getenv("CANCEL_AFTER_DISCONNECT_SECONDS", "3600"))
86
+
87
+ # Auto compression settings
88
+ ENABLE_AUTO_COMPRESSION = str(os.getenv("ENABLE_AUTO_COMPRESSION", "1")).lower() in ("1", "true", "yes", "y")
89
+ CONTEXT_MAX_TOKENS_AUTO = int(os.getenv("CONTEXT_MAX_TOKENS_AUTO", "0")) # 0 -> infer from model/tokenizer
90
+ CONTEXT_SAFETY_MARGIN = int(os.getenv("CONTEXT_SAFETY_MARGIN", "256"))
91
+ COMPRESSION_STRATEGY = os.getenv("COMPRESSION_STRATEGY", "truncate") # truncate | summarize (future)
92
+
93
+ # Eager model loading (download/check at startup before serving traffic)
94
+ EAGER_LOAD_MODEL = str(os.getenv("EAGER_LOAD_MODEL", "1")).lower() in ("1", "true", "yes", "y")
95
+
96
+ def _log(msg: str):
97
+ # Consistent, flush-immediate startup logs
98
+ print(f"[startup] {msg}", flush=True)
99
+
100
+ def prefetch_model_assets(repo_id: str, token: Optional[str]) -> Optional[str]:
101
+ """
102
+ Reproducible prefetch driven by huggingface-cli:
103
+ - Downloads the ENTIRE repo using CLI (visible progress bar).
104
+ - Returns the local directory path where the repo is mirrored.
105
+ - If CLI is unavailable, falls back to verbose API prefetch.
106
+ """
107
+ try:
108
+ # Enable accelerated transfer + xet if available
109
+ os.environ.setdefault("HF_HUB_ENABLE_HF_TRANSFER", "1")
110
+ os.environ.setdefault("HF_HUB_ENABLE_XET", "1")
111
+
112
+ cache_dir = os.getenv("HF_HOME") or os.getenv("TRANSFORMERS_CACHE") or ""
113
+ if cache_dir:
114
+ os.makedirs(cache_dir, exist_ok=True)
115
+
116
+ # Resolve huggingface-cli path (Windows-friendly)
117
+ cli_path = shutil.which("huggingface-cli")
118
+ if not cli_path:
119
+ candidates = []
120
+ appdata = os.getenv("APPDATA")
121
+ if appdata:
122
+ candidates.append(os.path.join(appdata, "Python", "Python312", "Scripts", "huggingface-cli.exe"))
123
+ candidates.append(os.path.join(os.path.dirname(sys.executable), "Scripts", "huggingface-cli.exe"))
124
+ cli_path = next((p for p in candidates if os.path.exists(p)), None)
125
+
126
+ # Preferred: one-shot CLI download for the whole repo (shows live progress)
127
+ if cli_path:
128
+ local_root = os.path.join(cache_dir if cache_dir else ".", repo_id.replace("/", "_"))
129
+ os.makedirs(local_root, exist_ok=True)
130
+ _log(f"Using huggingface-cli to download entire repo -> '{local_root}'")
131
+ cmd = [
132
+ cli_path,
133
+ "download",
134
+ repo_id,
135
+ "--repo-type",
136
+ "model",
137
+ "--local-dir",
138
+ local_root,
139
+ "--local-dir-use-symlinks",
140
+ "False",
141
+ "--resume",
142
+ ]
143
+ if token:
144
+ cmd += ["--token", token]
145
+ # Inherit stdio; users will see a proper progress bar
146
+ subprocess.run(cmd, check=False)
147
+ # Verify we have the essential files
148
+ if os.path.exists(os.path.join(local_root, "config.json")) or os.path.exists(os.path.join(local_root, "model.safetensors")):
149
+ _log("CLI prefetch completed")
150
+ return local_root
151
+ else:
152
+ _log("CLI prefetch finished but essential files not found; will fallback to API mirroring")
153
+
154
+ # Fallback: verbose API-driven prefetch with per-file logging
155
+ _log(f"Prefetching (API) repo={repo_id} to cache='{cache_dir}'")
156
+ try:
157
+ files = list_repo_files(repo_id, repo_type="model", token=token)
158
+ except Exception as e:
159
+ _log(f"list_repo_files failed ({type(e).__name__}: {e}); falling back to snapshot_download")
160
+ snapshot_download(repo_id, token=token, local_files_only=False)
161
+ _log("Prefetch completed (snapshot)")
162
+ return None
163
+
164
+ total = len(files)
165
+ _log(f"Found {total} files to ensure cached (API)")
166
+ for i, fn in enumerate(files, start=1):
167
+ try:
168
+ meta = get_hf_file_metadata(hf_hub_url(repo_id=repo_id, filename=fn, repo_type="model"), token=token)  # get_hf_file_metadata expects a resolved file URL
169
+ size_bytes = meta.size or 0
170
+ except Exception:
171
+ size_bytes = 0
172
+ size_mb = size_bytes / (1024 * 1024) if size_bytes else 0.0
173
+ _log(f"[{i}/{total}] fetching '{fn}' (~{size_mb:.2f} MB)")
174
+ _ = hf_hub_download(
175
+ repo_id=repo_id,
176
+ filename=fn,
177
+ repo_type="model",
178
+ token=token,
179
+ local_files_only=False,
180
+ resume_download=True,
181
+ )
182
+ _log(f"[{i}/{total}] done '{fn}'")
183
+ _log("Prefetch completed (API)")
184
+ return None
185
+ except Exception as e:
186
+ _log(f"Prefetch skipped: {type(e).__name__}: {e}")
187
+ return None
188
+
189
+ def is_data_url(url: str) -> bool:
190
+ return url.startswith("data:") and ";base64," in url
191
+
192
+
193
+ def is_http_url(url: str) -> bool:
194
+ return url.startswith("http://") or url.startswith("https://")
195
+
196
+
197
+ def decode_base64_to_bytes(b64: str) -> bytes:
198
+ # strip possible "data:*;base64," prefix
199
+ if "base64," in b64:
200
+ b64 = b64.split("base64,", 1)[1]
201
+ return base64.b64decode(b64, validate=False)
202
+
203
+
204
+ def fetch_bytes(url: str, headers: Optional[Dict[str, str]] = None, timeout: int = 60) -> bytes:
205
+ if not is_http_url(url):
206
+ raise ValueError(f"Only http(s) URLs supported for fetch, got: {url}")
207
+ resp = requests.get(url, headers=headers or {}, timeout=timeout, stream=True)
208
+ resp.raise_for_status()
209
+ return resp.content
210
+
211
+
212
+ def load_image_from_any(src: Dict[str, Any]) -> Image.Image:
213
+ """
214
+ src can be:
215
+ - { "url": "http(s)://..." } (also supports data URL)
216
+ - { "b64_json": "<base64>" }
217
+ - { "path": "local_path" } (optional)
218
+ """
219
+ if "b64_json" in src and src["b64_json"]:
220
+ data = decode_base64_to_bytes(str(src["b64_json"]))
221
+ return Image.open(io.BytesIO(data)).convert("RGB")
222
+
223
+ if "url" in src and src["url"]:
224
+ url = str(src["url"])
225
+ if is_data_url(url):
226
+ data = decode_base64_to_bytes(url)
227
+ return Image.open(io.BytesIO(data)).convert("RGB")
228
+ if is_http_url(url):
229
+ data = fetch_bytes(url)
230
+ return Image.open(io.BytesIO(data)).convert("RGB")
231
+ # treat as local path
232
+ if os.path.exists(url):
233
+ with open(url, "rb") as f:
234
+ return Image.open(io.BytesIO(f.read())).convert("RGB")
235
+ raise ValueError(f"Invalid image url/path: {url}")
236
+
237
+ if "path" in src and src["path"]:
238
+ p = str(src["path"])
239
+ if os.path.exists(p):
240
+ with open(p, "rb") as f:
241
+ return Image.open(io.BytesIO(f.read())).convert("RGB")
242
+ raise ValueError(f"Image path not found: {p}")
243
+
244
+ raise ValueError("Unsupported image source payload")
245
+
246
+
247
+ def write_bytes_tempfile(data: bytes, suffix: str) -> str:
248
+ tmp = tempfile.NamedTemporaryFile(delete=False, suffix=suffix)
249
+ with tmp as f:
250
+ f.write(data)
251
+ return tmp.name
252
+
253
+
254
+ def load_video_frames_from_any(src: Dict[str, Any], max_frames: int = MAX_VIDEO_FRAMES) -> List[Image.Image]:
255
+ """
256
+ Returns a list of PIL.Image frames (RGB) sampled up to max_frames.
257
+ src can be:
258
+ - { "url": "http(s)://..." } (mp4/mov/webm/etc.)
259
+ - { "b64_json": "<base64 of a video file>" }
260
+ - { "path": "local_path" }
261
+ """
262
+ # Prefer imageio.v3 if present, fallback to OpenCV
263
+ # We load all frames then uniform sample if too many.
264
+ def _load_all_frames(path: str) -> List[Image.Image]:
265
+ frames: List[Image.Image] = []
266
+ with contextlib.suppress(ImportError):
267
+ import imageio.v3 as iio
268
+ arr_iter = iio.imiter(path) # yields numpy arrays HxWxC
269
+ for arr in arr_iter:
270
+ if arr is None:
271
+ continue
272
+ if arr.ndim == 2:
273
+ arr = np.stack([arr, arr, arr], axis=-1)
274
+ if arr.shape[-1] == 4:
275
+ arr = arr[..., :3]
276
+ frames.append(Image.fromarray(arr).convert("RGB"))
277
+ return frames
278
+
279
+ # Fallback to OpenCV
280
+ import cv2 # type: ignore
281
+ cap = cv2.VideoCapture(path)
282
+ ok, frame = cap.read()
283
+ while ok:
284
+ frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
285
+ frames.append(Image.fromarray(frame))
286
+ ok, frame = cap.read()
287
+ cap.release()
288
+ return frames
289
+
290
+ # Resolve to a local path
291
+ local_path = None
292
+ if "b64_json" in src and src["b64_json"]:
293
+ data = decode_base64_to_bytes(str(src["b64_json"]))
294
+ local_path = write_bytes_tempfile(data, suffix=".mp4")
295
+ elif "url" in src and src["url"]:
296
+ url = str(src["url"])
297
+ if is_data_url(url):
298
+ data = decode_base64_to_bytes(url)
299
+ local_path = write_bytes_tempfile(data, suffix=".mp4")
300
+ elif is_http_url(url):
301
+ data = fetch_bytes(url)
302
+ local_path = write_bytes_tempfile(data, suffix=".mp4")
303
+ elif os.path.exists(url):
304
+ local_path = url
305
+ else:
306
+ raise ValueError(f"Invalid video url/path: {url}")
307
+ elif "path" in src and src["path"]:
308
+ p = str(src["path"])
309
+ if os.path.exists(p):
310
+ local_path = p
311
+ else:
312
+ raise ValueError(f"Video path not found: {p}")
313
+ else:
314
+ raise ValueError("Unsupported video source payload")
315
+
316
+ frames = _load_all_frames(local_path)
317
+ # Uniform sample if too many frames
318
+ if len(frames) > max_frames and max_frames > 0:
319
+ idxs = np.linspace(0, len(frames) - 1, max_frames).astype(int).tolist()
320
+ frames = [frames[i] for i in idxs]
321
+ return frames
322
+
323
+
324
+ class ChatRequest(BaseModel):
325
+ model: Optional[str] = None
326
+ messages: List[Dict[str, Any]]
327
+ max_tokens: Optional[int] = None
328
+ temperature: Optional[float] = None
329
+ stream: Optional[bool] = None
330
+ session_id: Optional[str] = None
331
+
332
+
333
+ class Engine:
334
+ def __init__(self, model_id: str, hf_token: Optional[str] = None):
335
+ # Lazy import heavy deps
336
+ from transformers import AutoProcessor, AutoModelForCausalLM, AutoModelForVision2Seq, AutoModel
337
+ # AutoModelForImageTextToText is the v5+ replacement for Vision2Seq in Transformers
338
+ try:
339
+ from transformers import AutoModelForImageTextToText # type: ignore
340
+ except Exception:
341
+ AutoModelForImageTextToText = None # type: ignore
342
+
343
+ model_kwargs: Dict[str, Any] = {
344
+ "trust_remote_code": True,
345
+ }
346
+ if hf_token:
347
+ # Only pass 'token' (use_auth_token is deprecated and causes conflicts)
348
+ model_kwargs["token"] = hf_token
349
+ # Device and dtype
350
+ model_kwargs["device_map"] = DEVICE_MAP
351
+ model_kwargs["torch_dtype"] = TORCH_DTYPE if TORCH_DTYPE != "auto" else "auto"
352
+
353
+ # Processor (handles text + images/videos)
354
+ proc_kwargs: Dict[str, Any] = {"trust_remote_code": True}
355
+ if hf_token:
356
+ proc_kwargs["token"] = hf_token
357
+ self.processor = AutoProcessor.from_pretrained(
358
+ model_id,
359
+ **proc_kwargs,
360
+ ) # pragma: no cover
361
+
362
+ # Prefer ImageTextToText (Transformers v5 path), then Vision2Seq, then CausalLM as a last resort
363
+ model = None
364
+ if AutoModelForImageTextToText is not None:  # local import above; None when this Transformers version lacks the class
365
+ try:
366
+ model = AutoModelForImageTextToText.from_pretrained(model_id, **model_kwargs) # pragma: no cover
367
+ except Exception:
368
+ model = None
369
+ if model is None:
370
+ try:
371
+ model = AutoModelForVision2Seq.from_pretrained(model_id, **model_kwargs) # pragma: no cover
372
+ except Exception:
373
+ model = None
374
+ if model is None:
375
+ try:
376
+ model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs) # pragma: no cover
377
+ except Exception:
378
+ model = None
379
+ if model is None:
380
+ # Generic AutoModel as last-resort with trust_remote_code to load custom architectures
381
+ model = AutoModel.from_pretrained(model_id, **model_kwargs) # pragma: no cover
382
+ self.model = model.eval() # pragma: no cover
383
+
384
+ self.model_id = model_id
385
+ self.tokenizer = getattr(self.processor, "tokenizer", None)
386
+ self.last_context_info: Dict[str, Any] = {}
387
+
388
+ def _model_max_context(self) -> int:
389
+ try:
390
+ cfg = getattr(self.model, "config", None)
391
+ if cfg is not None:
392
+ v = getattr(cfg, "max_position_embeddings", None)
393
+ if isinstance(v, int) and v > 0 and v < 10_000_000:
394
+ return v
395
+ except Exception:
396
+ pass
397
+ try:
398
+ mx = int(getattr(self.tokenizer, "model_max_length", 0) or 0)
399
+ if mx > 0 and mx < 10_000_000_000:
400
+ return mx
401
+ except Exception:
402
+ pass
403
+ return 32768
404
+
405
+ def _count_prompt_tokens(self, text: str) -> int:
406
+ try:
407
+ if self.tokenizer is not None:
408
+ enc = self.tokenizer([text], add_special_tokens=False, return_attention_mask=False)
409
+ ids = enc["input_ids"][0]
410
+ return len(ids)
411
+ except Exception:
412
+ pass
413
+ return max(1, int(len(text.split()) * 1.3))
414
+
415
+ def _auto_compress_if_needed(
416
+ self, mm_messages: List[Dict[str, Any]], max_new_tokens: int
417
+ ) -> Tuple[List[Dict[str, Any]], Dict[str, Any]]:
418
+ info: Dict[str, Any] = {}
419
+ # Build once to measure
420
+ text0 = self.processor.apply_chat_template(mm_messages, tokenize=False, add_generation_prompt=True)
421
+ prompt_tokens = self._count_prompt_tokens(text0)
422
+ max_ctx = CONTEXT_MAX_TOKENS_AUTO if CONTEXT_MAX_TOKENS_AUTO > 0 else self._model_max_context()
423
+ budget = max(1024, max_ctx - CONTEXT_SAFETY_MARGIN - int(max_new_tokens))
424
+ if not ENABLE_AUTO_COMPRESSION or prompt_tokens <= budget:
425
+ info = {
426
+ "compressed": False,
427
+ "prompt_tokens": int(prompt_tokens),
428
+ "max_context": int(max_ctx),
429
+ "budget": int(budget),
430
+ "strategy": COMPRESSION_STRATEGY,
431
+ "dropped_messages": 0,
432
+ }
433
+ return mm_messages, info
434
+
435
+ # Truncate earliest non-system messages until within budget
436
+ msgs = list(mm_messages)
437
+ dropped = 0
438
+ guard = 0
439
+ while True:
440
+ text = self.processor.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
441
+ prompt_tokens = self._count_prompt_tokens(text)
442
+ if prompt_tokens <= budget or len(msgs) <= 1:
443
+ break
444
+ # drop earliest non-system
445
+ drop_idx = None
446
+ for j, m in enumerate(msgs):
447
+ if (m.get("role") or "user") != "system":
448
+ drop_idx = j
449
+ break
450
+ if drop_idx is None:
451
+ break
452
+ msgs.pop(drop_idx)
453
+ dropped += 1
454
+ guard += 1
455
+ if guard > 10000:
456
+ break
457
+
458
+ info = {
459
+ "compressed": True,
460
+ "prompt_tokens": int(prompt_tokens),
461
+ "max_context": int(max_ctx),
462
+ "budget": int(budget),
463
+ "strategy": "truncate",
464
+ "dropped_messages": int(dropped),
465
+ }
466
+ return msgs, info
467
+
468
+ def get_context_report(self) -> Dict[str, Any]:
469
+ try:
470
+ tk_max = int(getattr(self.tokenizer, "model_max_length", 0) or 0)
471
+ except Exception:
472
+ tk_max = 0
473
+ return {
474
+ "compressionEnabled": ENABLE_AUTO_COMPRESSION,
475
+ "strategy": COMPRESSION_STRATEGY,
476
+ "safetyMargin": CONTEXT_SAFETY_MARGIN,
477
+ "modelMaxContext": self._model_max_context(),
478
+ "tokenizerModelMaxLength": tk_max,
479
+ "last": self.last_context_info or {},
480
+ }
481
+
482
+ def build_mm_messages(
483
+ self, openai_messages: List[Dict[str, Any]]
484
+ ) -> Tuple[List[Dict[str, Any]], List[Image.Image], List[List[Image.Image]]]:
485
+ """
486
+ Convert OpenAI-style messages to Qwen multimodal messages.
487
+ Returns:
488
+ - messages for apply_chat_template
489
+ - flat list of images in encounter order
490
+ - list of videos (each is list of PIL frames)
491
+ """
492
+ mm_msgs: List[Dict[str, Any]] = []
493
+ images: List[Image.Image] = []
494
+ videos: List[List[Image.Image]] = []
495
+
496
+ for msg in openai_messages:
497
+ role = msg.get("role", "user")
498
+ content = msg.get("content", "")
499
+
500
+ parts: List[Dict[str, Any]] = []
501
+
502
+ if isinstance(content, str):
503
+ if content:
504
+ parts.append({"type": "text", "text": content})
505
+ elif isinstance(content, list):
506
+ for p in content:
507
+ ptype = p.get("type")
508
+ if ptype == "text":
509
+ txt = p.get("text", "")
510
+ if txt:
511
+ parts.append({"type": "text", "text": txt})
512
+ elif ptype in ("image_url", "input_image"):
513
+ src: Dict[str, Any] = {}
514
+ if ptype == "image_url":
515
+ u = (p.get("image_url") or {}).get("url") if isinstance(p.get("image_url"), dict) else p.get("image_url")
516
+ src["url"] = u
517
+ else:
518
+ b64 = p.get("image") or p.get("b64_json") or p.get("data") or (p.get("image_url") or {}).get("url")
519
+ if b64:
520
+ src["b64_json"] = b64
521
+ try:
522
+ img = load_image_from_any(src)
523
+ images.append(img)
524
+ parts.append({"type": "image", "image": img})
525
+ except Exception as e:
526
+ raise ValueError(f"Failed to parse image part: {e}") from e
527
+ elif ptype in ("video_url", "input_video"):
528
+ src = {}
529
+ if ptype == "video_url":
530
+ u = (p.get("video_url") or {}).get("url") if isinstance(p.get("video_url"), dict) else p.get("video_url")
531
+ src["url"] = u
532
+ else:
533
+ b64 = p.get("video") or p.get("b64_json") or p.get("data")
534
+ if b64:
535
+ src["b64_json"] = b64
536
+ try:
537
+ frames = load_video_frames_from_any(src, max_frames=MAX_VIDEO_FRAMES)
538
+ videos.append(frames)
539
+ parts.append({"type": "video", "video": frames})
540
+ except Exception as e:
541
+ raise ValueError(f"Failed to parse video part: {e}") from e
542
+ else:
543
+ if isinstance(p, dict):
544
+ txt = p.get("text")
545
+ if isinstance(txt, str) and txt:
546
+ parts.append({"type": "text", "text": txt})
547
+ else:
548
+ if content:
549
+ parts.append({"type": "text", "text": str(content)})
550
+
551
+ mm_msgs.append({"role": role, "content": parts})
552
+
553
+ return mm_msgs, images, videos
554
+
555
+ def infer(self, messages: List[Dict[str, Any]], max_tokens: int, temperature: float) -> str:
556
+ mm_messages, images, videos = self.build_mm_messages(messages)
557
+ # Auto-compress if needed based on context budget
558
+ mm_messages, ctx_info = self._auto_compress_if_needed(mm_messages, max_tokens)
559
+ self.last_context_info = ctx_info
560
+
561
+ # Build chat template
562
+ text = self.processor.apply_chat_template(
563
+ mm_messages,
564
+ tokenize=False,
565
+ add_generation_prompt=True,
566
+ )
567
+
568
+ proc_kwargs: Dict[str, Any] = {"text": [text], "return_tensors": "pt"}
569
+ if images:
570
+ proc_kwargs["images"] = images
571
+ if videos:
572
+ proc_kwargs["videos"] = videos
573
+
574
+ inputs = self.processor(**proc_kwargs)
575
+ # Move tensors to model device if present
576
+ try:
577
+ device = getattr(self.model, "device", None) or next(self.model.parameters()).device
578
+ inputs = {k: (v.to(device) if hasattr(v, "to") else v) for k, v in inputs.items()}
579
+ except Exception:
580
+ pass
581
+
582
+ do_sample = temperature is not None and float(temperature) > 0.0
583
+
584
+ gen_ids = self.model.generate(
585
+ **inputs,
586
+ max_new_tokens=int(max_tokens),
587
+ temperature=float(temperature),
588
+ do_sample=do_sample,
589
+ use_cache=True,
590
+ )
591
+ # Decode
592
+ output = self.processor.batch_decode(
593
+ gen_ids,
594
+ skip_special_tokens=True,
595
+ clean_up_tokenization_spaces=False,
596
+ )[0]
597
+
598
+ # Best-effort: return only the assistant reply after the last template marker if present
599
+ parts = re.split(r"\n?assistant:\s*", output, flags=re.IGNORECASE)
600
+ if len(parts) >= 2:
601
+ return parts[-1].strip()
602
+ return output.strip()
603
+
604
+ def infer_stream(
605
+ self,
606
+ messages: List[Dict[str, Any]],
607
+ max_tokens: int,
608
+ temperature: float,
609
+ cancel_event: Optional[threading.Event] = None,
610
+ ):
611
+ from transformers import TextIteratorStreamer, StoppingCriteria, StoppingCriteriaList
612
+
613
+ mm_messages, images, videos = self.build_mm_messages(messages)
614
+ # Auto-compress if needed based on context budget
615
+ mm_messages, ctx_info = self._auto_compress_if_needed(mm_messages, max_tokens)
616
+ self.last_context_info = ctx_info
617
+
618
+ text = self.processor.apply_chat_template(
619
+ mm_messages,
620
+ tokenize=False,
621
+ add_generation_prompt=True,
622
+ )
623
+
624
+ proc_kwargs: Dict[str, Any] = {"text": [text], "return_tensors": "pt"}
625
+ if images:
626
+ proc_kwargs["images"] = images
627
+ if videos:
628
+ proc_kwargs["videos"] = videos
629
+
630
+ inputs = self.processor(**proc_kwargs)
631
+ try:
632
+ device = getattr(self.model, "device", None) or next(self.model.parameters()).device
633
+ inputs = {k: (v.to(device) if hasattr(v, "to") else v) for k, v in inputs.items()}
634
+ except Exception:
635
+ pass
636
+
637
+ do_sample = temperature is not None and float(temperature) > 0.0
638
+
639
+ streamer = TextIteratorStreamer(
640
+ getattr(self.processor, "tokenizer", None),
641
+ skip_prompt=True,
642
+ skip_special_tokens=True,
643
+ )
644
+
645
+ gen_kwargs = dict(
646
+ **inputs,
647
+ max_new_tokens=int(max_tokens),
648
+ temperature=float(temperature),
649
+ do_sample=do_sample,
650
+ use_cache=True,
651
+ streamer=streamer,
652
+ )
653
+
654
+ # Optional cooperative cancellation via StoppingCriteria
655
+ if cancel_event is not None:
656
+ class _CancelCrit(StoppingCriteria):
657
+ def __init__(self, ev: threading.Event):
658
+ self.ev = ev
659
+
660
+ def __call__(self, input_ids, scores, **kwargs):
661
+ return bool(self.ev.is_set())
662
+
663
+ gen_kwargs["stopping_criteria"] = StoppingCriteriaList([_CancelCrit(cancel_event)])
664
+
665
+ th = threading.Thread(target=self.model.generate, kwargs=gen_kwargs)
666
+ th.start()
667
+
668
+ for piece in streamer:
669
+ if piece:
670
+ yield piece
671
+
672
+
673
+ # Simple in-memory resumable SSE session store + optional SQLite persistence
674
+ class _SSESession:
675
+ def __init__(self, maxlen: int = 2048, ttl_seconds: int = 600):
676
+ self.buffer: Deque[Tuple[int, str]] = deque(maxlen=maxlen) # (idx, sse_line_block)
677
+ self.last_idx: int = -1
678
+ self.created: float = time.time()
679
+ self.finished: bool = False
680
+ self.cond = threading.Condition()
681
+ self.thread: Optional[threading.Thread] = None
682
+ self.ttl_seconds = ttl_seconds
683
+ # Cancellation + client tracking
684
+ self.cancel_event = threading.Event()
685
+ self.listeners: int = 0
686
+ self.cancel_timer = None # type: ignore
687
+
688
+
689
+ class _SessionStore:
690
+ def __init__(self, ttl_seconds: int = 600, max_sessions: int = 256):
691
+ self._sessions: Dict[str, _SSESession] = {}
692
+ self._lock = threading.Lock()
693
+ self._ttl = ttl_seconds
694
+ self._max_sessions = max_sessions
695
+
696
+ def get_or_create(self, sid: str) -> _SSESession:
697
+ with self._lock:
698
+ sess = self._sessions.get(sid)
699
+ if sess is None:
700
+ sess = _SSESession(ttl_seconds=self._ttl)
701
+ self._sessions[sid] = sess
702
+ return sess
703
+
704
+ def get(self, sid: str) -> Optional[_SSESession]:
705
+ with self._lock:
706
+ return self._sessions.get(sid)
707
+
708
+ def gc(self):
709
+ now = time.time()
710
+ with self._lock:
711
+ # remove expired
712
+ expired = [k for k, v in self._sessions.items() if (now - v.created) > self._ttl or (v.finished and (now - v.created) > self._ttl / 4)]
713
+ for k in expired:
714
+ self._sessions.pop(k, None)
715
+ # bound session count
716
+ if len(self._sessions) > self._max_sessions:
717
+ for k, _ in sorted(self._sessions.items(), key=lambda kv: kv[1].created)[: max(0, len(self._sessions) - self._max_sessions)]:
718
+ self._sessions.pop(k, None)
719
+
720
+
721
+ class _SQLiteStore:
722
+ def __init__(self, db_path: str):
723
+ self.db_path = db_path
724
+ self._lock = threading.Lock()
725
+ self._conn = sqlite3.connect(self.db_path, check_same_thread=False)
726
+ self._conn.execute("PRAGMA journal_mode=WAL;")
727
+ self._conn.execute("PRAGMA synchronous=NORMAL;")
728
+ self._ensure_schema()
729
+
730
+ def _ensure_schema(self):
731
+ cur = self._conn.cursor()
732
+ cur.execute(
733
+ "CREATE TABLE IF NOT EXISTS sessions (session_id TEXT PRIMARY KEY, created REAL, finished INTEGER DEFAULT 0)"
734
+ )
735
+ cur.execute(
736
+ "CREATE TABLE IF NOT EXISTS events (session_id TEXT, idx INTEGER, data TEXT, created REAL, PRIMARY KEY(session_id, idx))"
737
+ )
738
+ cur.execute("CREATE INDEX IF NOT EXISTS idx_events_session ON events(session_id, idx)")
739
+ self._conn.commit()
740
+
741
+ def ensure_session(self, session_id: str, created: int):
742
+ with self._lock:
743
+ self._conn.execute(
744
+ "INSERT OR IGNORE INTO sessions(session_id, created, finished) VALUES (?, ?, 0)",
745
+ (session_id, float(created)),
746
+ )
747
+ self._conn.commit()
748
+
749
+ def append_event(self, session_id: str, idx: int, payload: Dict[str, Any]):
750
+ data = json.dumps(payload, ensure_ascii=False)
751
+ with self._lock:
752
+ self._conn.execute(
753
+ "INSERT OR REPLACE INTO events(session_id, idx, data, created) VALUES (?, ?, ?, ?)",
754
+ (session_id, idx, data, time.time()),
755
+ )
756
+ self._conn.commit()
757
+
758
+ def get_events_after(self, session_id: str, last_idx: int) -> List[Tuple[int, str]]:
759
+ with self._lock:
760
+ cur = self._conn.execute(
761
+ "SELECT idx, data FROM events WHERE session_id=? AND idx>? ORDER BY idx ASC", (session_id, last_idx)
762
+ )
763
+ return [(int(r[0]), str(r[1])) for r in cur.fetchall()]
764
+
765
+ def mark_finished(self, session_id: str):
766
+ with self._lock:
767
+ self._conn.execute("UPDATE sessions SET finished=1 WHERE session_id=?", (session_id,))
768
+ self._conn.commit()
769
+
770
+ def session_meta(self, session_id: str) -> Tuple[bool, int]:
771
+ with self._lock:
772
+ row = self._conn.execute("SELECT finished FROM sessions WHERE session_id=?", (session_id,)).fetchone()
773
+ finished = bool(row[0]) if row else False
774
+ row2 = self._conn.execute("SELECT MAX(idx) FROM events WHERE session_id=?", (session_id,)).fetchone()
775
+ last_idx = int(row2[0]) if row2 and row2[0] is not None else -1
776
+ return finished, last_idx
777
+
778
+ def gc(self, ttl_seconds: int):
779
+ cutoff = time.time() - float(ttl_seconds)
780
+ with self._lock:
781
+ cur = self._conn.execute("SELECT session_id FROM sessions WHERE finished=1 AND created<?", (cutoff,))
782
+ ids = [r[0] for r in cur.fetchall()]
783
+ for sid in ids:
784
+ self._conn.execute("DELETE FROM events WHERE session_id=?", (sid,))
785
+ self._conn.execute("DELETE FROM sessions WHERE session_id=?", (sid,))
786
+ self._conn.commit()
787
+
788
+
789
+ def _sse_event(session_id: str, idx: int, payload: Dict[str, Any]) -> str:
790
+ # Include SSE id line so clients can send Last-Event-ID to resume.
791
+ return f"id: {session_id}:{idx}\n" + f"data: {json.dumps(payload, ensure_ascii=False)}\n\n"
792
+
793
+
794
+ _STORE = _SessionStore()
795
+ _DB_STORE = _SQLiteStore(SESSIONS_DB_PATH) if PERSIST_SESSIONS else None
796
+
797
+ # FastAPI app and OpenAPI tags
798
+ tags_metadata = [
799
+ {"name": "meta", "description": "Service metadata and OpenAPI schema"},
800
+ {"name": "health", "description": "Readiness and runtime info including context window report"},
801
+ {"name": "chat", "description": "OpenAI-compatible chat completions (non-stream and streaming SSE)"},
802
+ ]
803
+
804
+ app = FastAPI(
805
+ title="Qwen3-VL Inference Server",
806
+ version="1.0.0",
807
+ description="OpenAI-compatible inference server for Qwen3-VL with multimodal support, streaming SSE with resume, context auto-compression, and optional SQLite persistence.",
808
+ openapi_tags=tags_metadata,
809
+ )
810
+ app.add_middleware(
811
+ CORSMiddleware,
812
+ allow_origins=["*"],
813
+ allow_methods=["*"],
814
+ allow_headers=["*"],
815
+ )
816
+
817
+ # Startup hook is defined after get_engine() so globals are initialized first.
818
+
819
+ # Engine singletons
820
+ _engine: Optional[Engine] = None
821
+ _engine_error: Optional[str] = None
822
+
823
+
824
+ def get_engine() -> Engine:
825
+ global _engine, _engine_error
826
+ if _engine is not None:
827
+ return _engine
828
+ try:
829
+ model_id = DEFAULT_MODEL_ID
830
+ _log(f"Preparing model '{model_id}' (HF_HOME={os.getenv('HF_HOME')}, cache={os.getenv('TRANSFORMERS_CACHE')})")
831
+ local_repo_dir = prefetch_model_assets(model_id, HF_TOKEN)
832
+ load_id = local_repo_dir if (local_repo_dir and os.path.exists(os.path.join(local_repo_dir, 'config.json'))) else model_id
833
+ _log(f"Loading processor and model from: {load_id}")
834
+ _engine = Engine(model_id=load_id, hf_token=HF_TOKEN)
835
+ _engine_error = None
836
+ _log(f"Model ready: {_engine.model_id}")
837
+ return _engine
838
+ except Exception as e:
839
+ _engine_error = f"{type(e).__name__}: {e}"
840
+ _log(f"Engine init failed: {_engine_error}")
841
+ raise
842
+
843
+ # Eager-load model at startup after definitions so it downloads/checks before serving traffic.
844
+ @app.on_event("startup")
845
+ def _startup_load_model():
846
+ if EAGER_LOAD_MODEL:
847
+ print("[startup] EAGER_LOAD_MODEL=1: initializing model...")
848
+ try:
849
+ _ = get_engine()
850
+ print("[startup] Model loaded:", _engine.model_id if _engine else "unknown")
851
+ except Exception as e:
852
+ # Fail fast if model cannot be initialized
853
+ print("[startup] Model load failed:", e)
854
+ raise
855
+
856
+
857
+ @app.get("/", tags=["meta"])
858
+ def root():
859
+ """Liveness check."""
860
+ return JSONResponse({"ok": True})
861
+
862
+
863
+ @app.get("/openapi.yaml", tags=["meta"])
864
+ def openapi_yaml():
865
+ """Serve OpenAPI schema as YAML for tooling compatibility."""
866
+ schema = app.openapi()
867
+ yml = yaml.safe_dump(schema, sort_keys=False)
868
+ return Response(yml, media_type="application/yaml")
869
+
870
+
871
+ @app.get("/health", tags=["health"])
872
+ def health():
873
+ ready = False
874
+ err = None
875
+ model_id = DEFAULT_MODEL_ID
876
+ global _engine, _engine_error
877
+ if _engine is not None:
878
+ ready = True
879
+ model_id = _engine.model_id
880
+ elif _engine_error:
881
+ err = _engine_error
882
+ ctx = None
883
+ try:
884
+ if _engine is not None:
885
+ ctx = _engine.get_context_report()
886
+ except Exception:
887
+ ctx = None
888
+ return JSONResponse({"ok": True, "modelReady": ready, "modelId": model_id, "error": err, "context": ctx})
889
+
890
+
891
+ @app.post("/v1/chat/completions", tags=["chat"])
892
+ def chat_completions(request: Request, body: ChatRequest):
893
+ # Ensure engine is loaded
894
+ try:
895
+ engine = get_engine()
896
+ except Exception as e:
897
+ raise HTTPException(status_code=503, detail=f"Model not ready: {e}")
898
+
899
+ if not body or not isinstance(body.messages, list) or len(body.messages) == 0:
900
+ raise HTTPException(status_code=400, detail="messages must be a non-empty array")
901
+
902
+ max_tokens = int(body.max_tokens) if isinstance(body.max_tokens, int) else DEFAULT_MAX_TOKENS
903
+ temperature = float(body.temperature) if body.temperature is not None else DEFAULT_TEMPERATURE
904
+ do_stream = bool(body.stream)
905
+
906
+ # Parse Last-Event-ID for resuming and derive/align session_id
907
+ last_event_id_header = request.headers.get("last-event-id")
908
+ sid_from_header: Optional[str] = None
909
+ last_idx_from_header: int = -1
910
+ if last_event_id_header:
911
+ try:
912
+ sid_from_header, idx_str = last_event_id_header.split(":", 1)
913
+ last_idx_from_header = int(idx_str)
914
+ except Exception:
915
+ sid_from_header = None
916
+ last_idx_from_header = -1
917
+
918
+ session_id = body.session_id or sid_from_header or f"sess-{uuid.uuid4().hex[:12]}"
919
+ sess = _STORE.get_or_create(session_id)
920
+ created_ts = int(sess.created)
921
+ if _DB_STORE is not None:
922
+ _DB_STORE.ensure_session(session_id, created_ts)
923
+
924
+ if not do_stream:
925
+ # Non-streaming path
926
+ try:
927
+ content = engine.infer(body.messages, max_tokens=max_tokens, temperature=temperature)
928
+ except ValueError as e:
929
+ # Parsing/user payload errors from engine -> HTTP 400
930
+ raise HTTPException(status_code=400, detail=str(e))
931
+ except Exception as e:
932
+ raise HTTPException(status_code=500, detail=f"Inference error: {e}")
933
+
934
+ now = int(time.time())
935
+ prompt_tokens = int((engine.last_context_info or {}).get("prompt_tokens") or 0)
936
+ completion_tokens = max(1, len((content or "").split()))
937
+ total_tokens = prompt_tokens + completion_tokens
938
+ resp: Dict[str, Any] = {
939
+ "id": f"chatcmpl-{uuid.uuid4().hex[:12]}",
940
+ "object": "chat.completion",
941
+ "created": now,
942
+ "model": engine.model_id,
943
+ "choices": [
944
+ {
945
+ "index": 0,
946
+ "message": {"role": "assistant", "content": content},
947
+ "finish_reason": "stop",
948
+ }
949
+ ],
950
+ "usage": {
951
+ "prompt_tokens": prompt_tokens,
952
+ "completion_tokens": completion_tokens,
953
+ "total_tokens": total_tokens,
954
+ },
955
+ "context": engine.last_context_info or {},
956
+ }
957
+ return JSONResponse(resp)
958
+
959
+ # Streaming SSE with resumable support
960
+ def sse_generator():
961
+ # Manage listener count and cancel timer
962
+ sess.listeners += 1
963
+ try:
964
+ # Cancel any pending cancel timer when a listener attaches
965
+ if getattr(sess, "cancel_timer", None):
966
+ try:
967
+ sess.cancel_timer.cancel()
968
+ except Exception:
969
+ pass
970
+ sess.cancel_timer = None
971
+
972
+ # Replay if Last-Event-ID was provided
973
+ replay_from = last_idx_from_header if sid_from_header == session_id else -1
974
+ if replay_from >= -1:
975
+ # First try in-memory buffer
976
+ for idx, block in list(sess.buffer):
977
+ if idx > replay_from:
978
+ yield block.encode("utf-8")
979
+ # Optionally pull from SQLite persistence
980
+ if _DB_STORE is not None:
981
+ try:
982
+ for idx, data in _DB_STORE.get_events_after(session_id, replay_from):
983
+ block = f"id: {session_id}:{idx}\n" + f"data: {data}\n\n"
984
+ yield block.encode("utf-8")
985
+ except Exception:
986
+ pass
987
+ if sess.finished:
988
+ # Already finished; send terminal and exit
989
+ yield b"data: [DONE]\n\n"
990
+ return
991
+
992
+ # Fresh generation path
993
+ # Helper to append to buffers and yield to client
994
+ def push(payload: Dict[str, Any]):
995
+ sess.last_idx += 1
996
+ idx = sess.last_idx
997
+ block = _sse_event(session_id, idx, payload)
998
+ sess.buffer.append((idx, block))
999
+ if _DB_STORE is not None:
1000
+ try:
1001
+ _DB_STORE.append_event(session_id, idx, payload)
1002
+ except Exception:
1003
+ pass
1004
+ return block
1005
+
1006
+ # Initial assistant role delta
1007
+ head = {
1008
+ "id": session_id,
1009
+ "object": "chat.completion.chunk",
1010
+ "created": int(time.time()),
1011
+ "model": engine.model_id,
1012
+ "choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": None}],
1013
+ "system_fingerprint": "fastapi",
1014
+ }
1015
+ yield push(head).encode("utf-8")
1016
+
1017
+ # Stream model pieces
1018
+ try:
1019
+ for piece in engine.infer_stream(
1020
+ body.messages, max_tokens=max_tokens, temperature=temperature, cancel_event=sess.cancel_event
1021
+ ):
1022
+ if not piece:
1023
+ continue
1024
+ payload = {
1025
+ "id": session_id,
1026
+ "object": "chat.completion.chunk",
1027
+ "created": int(time.time()),
1028
+ "model": engine.model_id,
1029
+ "choices": [{"index": 0, "delta": {"content": piece}, "finish_reason": None}],
1030
+ }
1031
+ yield push(payload).encode("utf-8")
1032
+ # Cooperative early-exit if cancel requested
1033
+ if sess.cancel_event.is_set():
1034
+ break
1035
+ except Exception:
1036
+ # On engine error, terminate gracefully
1037
+ pass
1038
+
1039
+ # Finish chunk
1040
+ finish = {
1041
+ "id": session_id,
1042
+ "object": "chat.completion.chunk",
1043
+ "created": int(time.time()),
1044
+ "model": engine.model_id,
1045
+ "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}],
1046
+ }
1047
+ yield push(finish).encode("utf-8")
1048
+
1049
+ finally:
1050
+ # Mark finished and persist
1051
+ sess.finished = True
1052
+ if _DB_STORE is not None:
1053
+ try:
1054
+ _DB_STORE.mark_finished(session_id)
1055
+ # Optionally GC older finished sessions
1056
+ _DB_STORE.gc(SESSIONS_TTL_SECONDS)
1057
+ except Exception:
1058
+ pass
1059
+
1060
+ # Always send terminal [DONE]
1061
+ yield b"data: [DONE]\n\n"
1062
+
1063
+ # Listener bookkeeping and optional auto-cancel if all disconnect
1064
+ try:
1065
+ sess.listeners = max(0, sess.listeners - 1)
1066
+ if sess.listeners == 0 and CANCEL_AFTER_DISCONNECT_SECONDS > 0 and not sess.cancel_event.is_set():
1067
+ def _later_cancel():
1068
+ # If still no listeners, cancel
1069
+ if sess.listeners == 0 and not sess.cancel_event.is_set():
1070
+ sess.cancel_event.set()
1071
+ sess.cancel_timer = threading.Timer(CANCEL_AFTER_DISCONNECT_SECONDS, _later_cancel)
1072
+ sess.cancel_timer.daemon = True
1073
+ sess.cancel_timer.start()
1074
+ except Exception:
1075
+ pass
1076
+
1077
+ # In-memory store GC
1078
+ try:
1079
+ _STORE.gc()
1080
+ except Exception:
1081
+ pass
1082
+
1083
+ headers = {
1084
+ "Cache-Control": "no-cache",
1085
+ "Connection": "keep-alive",
1086
+ "X-Accel-Buffering": "no",
1087
+ }
1088
+ return StreamingResponse(sse_generator(), media_type="text/event-stream", headers=headers)
1089
+
1090
+
1091
+ @app.post("/v1/cancel/{session_id}", tags=["chat"])
1092
+ def cancel_session(session_id: str):
1093
+ sess = _STORE.get(session_id)
1094
+ if sess is not None:
1095
+ try:
1096
+ sess.cancel_event.set()
1097
+ sess.finished = True
1098
+ if _DB_STORE is not None:
1099
+ _DB_STORE.mark_finished(session_id)
1100
+ except Exception:
1101
+ pass
1102
+ return JSONResponse({"ok": True, "session_id": session_id})
1103
+
1104
+
1105
+ if __name__ == "__main__":
1106
+ import uvicorn
1107
+
1108
+ uvicorn.run("main:app", host="0.0.0.0", port=PORT, reload=False)
requirements.txt ADDED
@@ -0,0 +1,30 @@
1
+ # Core server
2
+ fastapi>=0.115.0
3
+ uvicorn[standard]>=0.30.0
4
+
5
+ # HF ecosystem
6
+ transformers>=4.44.0
7
+ accelerate>=0.33.0
8
+
9
+ # Multimedia + utils
10
+ pillow>=10.0.0
11
+ numpy>=1.24.0
12
+ requests>=2.31.0
13
+ imageio[ffmpeg]>=2.34.0
14
+
15
+ # Config
16
+ python-dotenv>=1.0.1
17
+
18
+ # FastAPI runtime model layer
19
+ pydantic>=2.0.0
20
+
21
+ # IMPORTANT:
22
+ # - Install PyTorch separately to match your platform/CUDA:
23
+ # CPU (Windows/Linux/macOS):
24
+ # pip install torch --index-url https://download.pytorch.org/whl/cpu
25
+ # NVIDIA CUDA (example for cu124, adjust as needed):
26
+ # pip install torch --index-url https://download.pytorch.org/whl/cu124
27
+ # Testing and tooling
28
+ pytest>=8.0.0
29
+ pytest-cov>=4.1.0
30
+ pyyaml>=6.0.0
tests/test_api.py ADDED
@@ -0,0 +1,274 @@
1
+ import json
2
+ import time
3
+ from contextlib import contextmanager
4
+
5
+ import pytest
6
+ from fastapi.testclient import TestClient
7
+
8
+ import main
9
+
10
+
11
+ class FakeEngine:
12
+ def __init__(self, model_id="fake-model"):
13
+ self.model_id = model_id
14
+ self.last_context_info = {
15
+ "compressed": False,
16
+ "prompt_tokens": 5,
17
+ "max_context": 8192,
18
+ "budget": 7900,
19
+ "strategy": "truncate",
20
+ "dropped_messages": 0,
21
+ }
22
+
23
+ def infer(self, messages, max_tokens, temperature):
24
+ # Simulate parse error pathway when special trigger is present
25
+ if messages and isinstance(messages[0].get("content"), str) and "PARSE_ERR" in messages[0]["content"]:
26
+ raise ValueError("Simulated parse error")
27
+ # Return echo content for deterministic test
28
+ parts = []
29
+ for m in messages:
30
+ c = m.get("content", "")
31
+ if isinstance(c, list):
32
+ for p in c:
33
+ if isinstance(p, dict) and p.get("type") == "text":
34
+ parts.append(p.get("text", ""))
35
+ elif isinstance(c, str):
36
+ parts.append(c)
37
+ txt = " ".join(parts) or "OK"
38
+ # Simulate context accounting changing with request
39
+ self.last_context_info = {
40
+ "compressed": False,
41
+ "prompt_tokens": max(1, len(txt.split())),
42
+ "max_context": 8192,
43
+ "budget": 7900,
44
+ "strategy": "truncate",
45
+ "dropped_messages": 0,
46
+ }
47
+ return f"OK: {txt}"
48
+
49
+ def infer_stream(self, messages, max_tokens, temperature, cancel_event=None):
50
+ # simple two-piece stream; respects cancel_event if set during streaming
51
+ outputs = ["hello", " world"]
52
+ for piece in outputs:
53
+ if cancel_event is not None and cancel_event.is_set():
54
+ break
55
+ yield piece
56
+ # tiny delay to allow cancel test to interleave
57
+ time.sleep(0.01)
58
+
59
+ def get_context_report(self):
60
+ return {
61
+ "compressionEnabled": True,
62
+ "strategy": "truncate",
63
+ "safetyMargin": 256,
64
+ "modelMaxContext": 8192,
65
+ "tokenizerModelMaxLength": 8192,
66
+ "last": self.last_context_info,
67
+ }
68
+
69
+
70
+ @contextmanager
71
+ def patched_engine():
72
+ # Patch global engine so server does not load real model
73
+ prev_engine = main._engine
74
+ prev_err = main._engine_error
75
+ fake = FakeEngine()
76
+ main._engine = fake
77
+ main._engine_error = None
78
+ try:
79
+ yield fake
80
+ finally:
81
+ main._engine = prev_engine
82
+ main._engine_error = prev_err
83
+
84
+
85
+ def get_client():
86
+ return TestClient(main.app)
87
+
88
+
89
+ def test_health_ready_and_context():
90
+ with patched_engine():
91
+ client = get_client()
92
+ r = client.get("/health")
93
+ assert r.status_code == 200
94
+ body = r.json()
95
+ assert body["ok"] is True
96
+ assert body["modelReady"] is True
97
+ assert body["modelId"] == "fake-model"
98
+ # context block exists with required fields
99
+ ctx = body["context"]
100
+ assert ctx["compressionEnabled"] is True
101
+ assert "last" in ctx
102
+ assert isinstance(ctx["last"].get("prompt_tokens"), int)
103
+
104
+
105
+ def test_health_with_engine_error():
106
+ # simulate model load error path
107
+ prev_engine = main._engine
108
+ prev_err = main._engine_error
109
+ try:
110
+ main._engine = None
111
+ main._engine_error = "boom"
112
+ client = get_client()
113
+ r = client.get("/health")
114
+ assert r.status_code == 200
115
+ body = r.json()
116
+ assert body["modelReady"] is False
117
+ assert body["error"] == "boom"
118
+ finally:
119
+ main._engine = prev_engine
120
+ main._engine_error = prev_err
121
+
122
+
123
+ def test_chat_non_stream_validation():
124
+ with patched_engine():
125
+ client = get_client()
126
+ # missing messages should 400
127
+ r = client.post("/v1/chat/completions", json={"messages": []})
128
+ assert r.status_code == 400
129
+
+
+def test_chat_non_stream_success_and_usage_context():
+    with patched_engine():
+        client = get_client()
+        payload = {
+            "messages": [{"role": "user", "content": "Hello Qwen"}],
+            "max_tokens": 8,
+            "temperature": 0.0,
+        }
+        r = client.post("/v1/chat/completions", json=payload)
+        assert r.status_code == 200
+        body = r.json()
+        assert body["object"] == "chat.completion"
+        assert body["choices"][0]["message"]["content"].startswith("OK:")
+        # usage prompt_tokens filled from engine.last_context_info
+        assert body["usage"]["prompt_tokens"] >= 1
+        # response includes context echo
+        assert "context" in body
+        assert "prompt_tokens" in body["context"]
+
+
+def test_chat_non_stream_parse_error_to_400():
+    with patched_engine():
+        client = get_client()
+        payload = {
+            "messages": [{"role": "user", "content": "PARSE_ERR trigger"}],
+            "max_tokens": 4,
+        }
+        r = client.post("/v1/chat/completions", json=payload)
+        # ValueError in engine -> 400 per API contract
+        assert r.status_code == 400
161
+
+
+def read_sse_lines(resp):
+    # Utility to split an event-stream response into its SSE frames (data payloads, including [DONE])
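+    # Frames are separated by a blank line; an illustrative sequence (exact chunk JSON may differ):
+    #   data: {"choices": [{"delta": {"content": "hello"}}]}
+    #
+    #   data: [DONE]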
165
+    lines = []
+    buf = b""
+
+    # Starlette TestClient (httpx) responses expose iter_bytes()/iter_raw(), not requests.iter_content().
+    # Fall back to available iterator or to full content if streaming isn't supported.
+    iterator = None
+    for name in ("iter_bytes", "iter_raw", "iter_content"):
+        it = getattr(resp, name, None)
+        if callable(it):
+            iterator = it
+            break
+
+    if iterator is None:
+        data = getattr(resp, "content", b"")
+        if isinstance(data, str):
+            data = data.encode("utf-8", "ignore")
+        buf = data
+    else:
+        for chunk in iterator():
+            if not chunk:
+                continue
+            if isinstance(chunk, str):
+                chunk = chunk.encode("utf-8", "ignore")
+            buf += chunk
+    while b"\n\n" in buf:
+        frame, buf = buf.split(b"\n\n", 1)
+        # keep original frame text for asserts
+        lines.append(frame.decode("utf-8", errors="ignore"))
+
+    # Drain any leftover
+    if buf:
+        lines.append(buf.decode("utf-8", errors="ignore"))
+    return lines
198
+
+
+def test_chat_stream_sse_flow_and_resume():
+    with patched_engine():
+        client = get_client()
+        payload = {
+            "session_id": "s1",
+            "stream": True,
+            "messages": [{"role": "user", "content": "stream please"}],
+            "max_tokens": 8,
+            "temperature": 0.2,
+        }
+        with client.stream("POST", "/v1/chat/completions", json=payload) as resp:
+            assert resp.status_code == 200
+            lines = read_sse_lines(resp)
+            # Must contain role delta, content pieces, finish chunk, and [DONE]
+            joined = "\n".join(lines)
+            assert "delta" in joined
+            assert "[DONE]" in joined
+
+        # Resume from event index 0 should receive at least one subsequent event
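+        # Last-Event-ID is "<session_id>:<last_seen_event_index>"; "s1:0" replays events after index 0.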
219
+        headers = {"Last-Event-ID": "s1:0"}
+        with client.stream("POST", "/v1/chat/completions", headers=headers, json=payload) as resp2:
+            assert resp2.status_code == 200
+            lines2 = read_sse_lines(resp2)
+            assert any("data:" in l for l in lines2)
+            assert "[DONE]" in "\n".join(lines2)
+
+        # Invalid Last-Event-ID format should not crash (covered by try/except)
+        headers_bad = {"Last-Event-ID": "not-an-index"}
+        with client.stream("POST", "/v1/chat/completions", headers=headers_bad, json=payload) as resp3:
+            assert resp3.status_code == 200
+            _ = read_sse_lines(resp3)  # just ensure no crash
231
+
+
+def test_cancel_endpoint_stops_generation():
+    with patched_engine():
+        client = get_client()
+        payload = {
+            "session_id": "to-cancel",
+            "stream": True,
+            "messages": [{"role": "user", "content": "cancel me"}],
+        }
+        # Start streaming in background (client.stream keeps the connection open)
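+        # The fake engine sleeps briefly between chunks, so cancelling before the body is
+        # drained gives the server a chance to observe the cancel event mid-stream.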
242
+        with client.stream("POST", "/v1/chat/completions", json=payload) as resp:
+            # Immediately cancel
+            rc = client.post("/v1/cancel/to-cancel")
+            assert rc.status_code == 200
+            # Stream should end with [DONE] without hanging
+            lines = read_sse_lines(resp)
+            assert "[DONE]" in "\n".join(lines)
+
+
+def test_cancel_unknown_session_is_ok():
+    with patched_engine():
+        client = get_client()
+        rc = client.post("/v1/cancel/does-not-exist")
+        # Endpoint returns ok regardless (idempotent, operationally safe)
+        assert rc.status_code == 200
+
+
+def test_edge_large_last_event_id_after_finish_yields_done():
+    with patched_engine():
+        client = get_client()
+        payload = {
+            "session_id": "done-session",
+            "stream": True,
+            "messages": [{"role": "user", "content": "edge"}],
+        }
+        # Complete a run
+        with client.stream("POST", "/v1/chat/completions", json=payload) as resp:
+            _ = read_sse_lines(resp)
+        # Resume with huge index; should return DONE quickly
+        headers = {"Last-Event-ID": "done-session:99999"}
+        with client.stream("POST", "/v1/chat/completions", headers=headers, json=payload) as resp2:
+            lines2 = read_sse_lines(resp2)
+            assert "[DONE]" in "\n".join(lines2)