KillerKing93 committed on
Commit 7cd14d8 · verified · 1 Parent(s): 8c57892

Sync from GitHub 8f6d598

Files changed (13)
  1. .env.example +34 -0
  2. .gitignore +31 -0
  3. ARCHITECTURE.md +192 -0
  4. CLAUDE.md +259 -0
  5. Dockerfile +73 -0
  6. HISTORY.md +3 -0
  7. LICENSE +50 -0
  8. README.md +358 -12
  9. RULES.md +207 -0
  10. TODO.md +14 -0
  11. main.py +1108 -0
  12. requirements.txt +30 -0
  13. tests/test_api.py +274 -0
.env.example ADDED
@@ -0,0 +1,34 @@
+ # Server
+ PORT=3000
+
+ # Model from Hugging Face (Transformers)
+ MODEL_REPO_ID=Qwen/Qwen3-VL-2B-Thinking
+ # HF token for gated/private models (optional)
+ HF_TOKEN=
+
+ # Inference parameters
+ MAX_TOKENS=4096
+ TEMPERATURE=0.7
+
+ # Multimedia processing
+ MAX_VIDEO_FRAMES=16
+
+ # Transformers loading hints
+ DEVICE_MAP=auto
+ TORCH_DTYPE=auto
+ # Persistent SSE session store (SQLite)
+ # Enable to persist streaming chunks per session_id and allow resume after server restarts.
+ # 1=true, 0=false
+ PERSIST_SESSIONS=1
+ SESSIONS_DB_PATH=sessions.db
+ # TTL for sessions (seconds). Finished sessions older than TTL are garbage collected.
+ SESSIONS_TTL_SECONDS=600
+ # Auto compression and context reporting
+ # Enable automatic prompt compression if context would overflow. Drops oldest non-system messages.
+ ENABLE_AUTO_COMPRESSION=1
+ # Force a max context window for budgeting; 0 = use model/tokenizer defaults
+ CONTEXT_MAX_TOKENS_AUTO=0
+ # Safety margin kept free for generation and special tokens
+ CONTEXT_SAFETY_MARGIN=256
+ # Compression strategy: truncate (default). summarize reserved for future use.
+ COMPRESSION_STRATEGY=truncate
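For illustration, the truncate strategy described by the compression variables above could look like the following sketch. This is an assumption about the approach, not the code in main.py, and `count_tokens` is a hypothetical stand-in for whatever token counting the server uses:

```python
# Illustrative sketch of the "truncate" compression strategy: drop the oldest
# non-system messages until the prompt fits (max context minus safety margin).
# `count_tokens` is a hypothetical callable, not a function from main.py.
from typing import Callable, Dict, List

def compress_messages(
    messages: List[Dict],
    count_tokens: Callable[[List[Dict]], int],
    max_context: int,
    safety_margin: int = 256,
) -> List[Dict]:
    budget = max_context - safety_margin
    compressed = list(messages)
    while count_tokens(compressed) > budget:
        # Find the oldest message that is not a system prompt and drop it.
        idx = next((i for i, m in enumerate(compressed) if m.get("role") != "system"), None)
        if idx is None or len(compressed) <= 1:
            break  # nothing left that can be dropped safely
        compressed.pop(idx)
    return compressed
```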
.gitignore ADDED
@@ -0,0 +1,31 @@
+ # Node (legacy)
+ node_modules
+
+ .env
+
+ # Python
+ .venv/
+ venv/
+ __pycache__/
+ *.py[cod]
+ *.pyo
+ *.pyd
+
+ # Tool caches
+ .cache/
+ .mypy_cache/
+ .pyright/
+ .pytest_cache/
+ .ipynb_checkpoints/
+
+ # Editors / OS
+ .DS_Store
+ Thumbs.db
+ .idea/
+ .vscode/
+
+ # Data / models (large artifacts)
+ models/
+ data/
+ uploads/
+ tmp/
ARCHITECTURE.md ADDED
@@ -0,0 +1,192 @@
+ # Architecture (Python FastAPI + Transformers)
+
+ This document describes the Python-based, OpenAI-compatible inference server for Qwen3-VL, replacing the previous Node.js/llama.cpp stack.
+
+ Key source files
+ - Server entry: [main.py](main.py)
+ - Inference engine: [Python.class Engine](main.py:231)
+ - Multimodal parsing: [Python.function build_mm_messages](main.py:251), [Python.function load_image_from_any](main.py:108), [Python.function load_video_frames_from_any](main.py:150)
+ - Endpoints: Health [Python.app.get()](main.py:577), Chat Completions [Python.app.post()](main.py:591), Cancel [Python.app.post()](main.py:792)
+ - Streaming + resume: [Python.class _SSESession](main.py:435), [Python.class _SessionStore](main.py:449), [Python.class _SQLiteStore](main.py:482), [Python.function chat_completions](main.py:591)
+ - Local run (uvicorn): [Python.main()](main.py:807)
+ - Configuration template: [.env.example](.env.example)
+ - Dependencies: [requirements.txt](requirements.txt)
+
+ Model target (default)
+ - Hugging Face: Qwen/Qwen3-VL-2B-Thinking (Transformers, multimodal)
+ - Overridable via environment variable: MODEL_REPO_ID
+
+ ## Overview
+
+ The server exposes an OpenAI-compatible endpoint for chat completions that supports:
+ - Text-only prompts
+ - Images (URL or base64)
+ - Videos (URL or base64; frames sampled)
+
+ Two response modes are implemented:
+ - Non-streaming JSON
+ - Streaming via Server-Sent Events (SSE) with resumable delivery using Last-Event-ID. Resumability is achieved with an in‑memory ring buffer and optional SQLite persistence.
+
+ ## Components
+
+ 1) FastAPI application
+ - Instantiated in [Python.main module](main.py:541) and endpoints mounted at:
+ - Health: [Python.app.get()](main.py:577)
+ - Chat Completions (non-stream + SSE): [Python.app.post()](main.py:591)
+ - Manual cancel (custom): [Python.app.post()](main.py:792)
+ - CORS is enabled for simplicity.
+
+ 2) Inference Engine (Transformers)
+ - Class: [Python.class Engine](main.py:231)
+ - Loads:
+ - Processor: AutoProcessor(trust_remote_code=True)
+ - Model: AutoModelForCausalLM (device_map, dtype configurable via env)
+ - Core methods:
+ - Input building: [Python.function build_mm_messages](main.py:251)
+ - Text-only generate: [Python.function infer](main.py:326)
+ - Streaming generate (iterator): [Python.function infer_stream](main.py:375)
+
+ 3) Multimodal preprocessing
+ - Images:
+ - URL (http/https), data URL, base64, or local path
+ - Loader: [Python.function load_image_from_any](main.py:108)
+ - Videos:
+ - URL (downloaded to temp), base64 to temp file, or local path
+ - Frame extraction via imageio.v3 (preferred) or OpenCV fallback
+ - Uniform sampling up to MAX_VIDEO_FRAMES
+ - Loader: [Python.function load_video_frames_from_any](main.py:150)
+
+ 4) SSE streaming with resume
+ - Session objects:
+ - [Python.class _SSESession](main.py:435): ring buffer, condition variable, producer thread reference, cancellation event, listener count, and disconnect timer
+ - [Python.class _SessionStore](main.py:449): in-memory map with TTL + GC
+ - Optional persistence: [Python.class _SQLiteStore](main.py:482) for replaying chunks across restarts
+ - SSE id format: "session_id:index"
+ - Resume:
+ - Client sends Last-Event-ID header (or query ?last_event_id=...) and the same session_id in the body
+ - Server replays cached/persisted chunks after the provided index, then continues live streaming
+ - Producer:
+ - Created on demand per session; runs generation in a daemon thread and pushes chunks into the ring buffer and SQLite (if enabled)
+ - See producer closure inside [Python.function chat_completions](main.py:591)
+ - Auto-cancel on disconnect:
+ - If all clients disconnect for CANCEL_AFTER_DISCONNECT_SECONDS (default 3600s), a timer signals cancellation via a stopping criteria in [Python.function infer_stream](main.py:375)
+
+ ## Request flow
+
+ Non-streaming (POST /v1/chat/completions)
+ 1. Validate input, load engine singleton via [Python.function get_engine](main.py:558)
+ 2. Convert OpenAI-style messages to Qwen chat template via [Python.function build_mm_messages](main.py:251) and apply_chat_template
+ 3. Preprocess images/videos into processor inputs
+ 4. Generate with [Python.function infer](main.py:326)
+ 5. Return OpenAI-compatible response (choices[0].message.content)
+
+ Streaming (POST /v1/chat/completions with "stream": true)
+ 1. Determine session_id:
+ - Use body.session_id if provided; otherwise generated server-side
+ 2. Parse Last-Event-ID (or query ?last_event_id) to get last delivered index
+ 3. Create/start or reuse producer thread for this session
+ 4. StreamingResponse generator:
+ - Replays persisted events (SQLite, if enabled) and in-memory buffer after last index
+ - Waits on condition variable for new tokens
+ - Emits "[DONE]" at the end or upon buffer completion
+ 5. Clients can reconnect and resume by sending Last-Event-ID: "session_id:index"
+ 6. If all clients disconnect, an auto-cancel timer can stop generation (configurable via env)
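For illustration, the SSE framing implied by the streaming flow above could be produced roughly like this. The payload field names mirror the OpenAI chunk shape documented elsewhere in this repo; the exact construction in main.py may differ:

```python
# Sketch of the SSE framing described above: each event carries an id "session_id:index"
# so a client can resume with Last-Event-ID. Not the actual code in main.py.
import json
import time

def sse_event(session_id: str, index: int, delta: dict, finish_reason=None) -> str:
    payload = {
        "id": session_id,
        "object": "chat.completion.chunk",
        "created": int(time.time()),
        "choices": [{"index": 0, "delta": delta, "finish_reason": finish_reason}],
    }
    return f"id: {session_id}:{index}\ndata: {json.dumps(payload)}\n\n"

def sse_done() -> str:
    # Emitted after the final chunk that carries finish_reason "stop".
    return "data: [DONE]\n\n"
```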
+
+ Manual cancel (POST /v1/cancel/{session_id})
+ - Custom operational shortcut to cancel an in-flight generation for a session id.
+ - This is not part of the legacy OpenAI Chat Completions spec (OpenAI’s newer Responses API defines cancel); it is provided for practical control.
+
+ ## Message and content mapping
+
+ Input format (OpenAI-like):
+ - "messages" list of role/content entries
+ - content can be:
+ - string (text)
+ - array of parts with "type":
+ - "text": { text: "..."}
+ - "image_url": { image_url: { url: "..." } } or { image_url: "..." }
+ - "input_image": { b64_json: "..." } or { image: "..." }
+ - "video_url": { video_url: { url: "..." } } or { video_url: "..." }
+ - "input_video": { b64_json: "..." } or { video: "..." }
+
+ Conversion:
+ - [Python.function build_mm_messages](main.py:251) constructs a multimodal content list per message:
+ - { type: "text", text: ... }
+ - { type: "image", image: PIL.Image }
+ - { type: "video", video: [PIL.Image frames] }
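As a concrete illustration of the mapping just described (this is an example of the data shapes, not the actual build_mm_messages code):

```python
# An OpenAI-style message with an image_url part...
openai_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
    ],
}

# ...is converted into a Qwen-style multimodal content list, roughly:
# {
#     "role": "user",
#     "content": [
#         {"type": "text", "text": "What is in this image?"},
#         {"type": "image", "image": <PIL.Image.Image>},  # loaded by load_image_from_any
#     ],
# }
```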
+
+ Template:
+ - Qwen apply_chat_template:
+ - See usage in [Python.function infer](main.py:326) and [Python.function infer_stream](main.py:375)
+
+ ## Configuration (.env)
+
+ See [.env.example](.env.example)
+ - PORT (default 3000)
+ - MODEL_REPO_ID (default "Qwen/Qwen3-VL-2B-Thinking")
+ - HF_TOKEN (optional)
+ - MAX_TOKENS (default 256)
+ - TEMPERATURE (default 0.7)
+ - MAX_VIDEO_FRAMES (default 16)
+ - DEVICE_MAP (default "auto")
+ - TORCH_DTYPE (default "auto")
+ - PERSIST_SESSIONS (default 0; set 1 to enable SQLite persistence)
+ - SESSIONS_DB_PATH (default sessions.db)
+ - SESSIONS_TTL_SECONDS (default 600)
+ - CANCEL_AFTER_DISCONNECT_SECONDS (default 3600; set 0 to disable)
+
+ ## Error handling and readiness
+
+ - Health endpoint: [Python.app.get()](main.py:577)
+ - Returns { ok, modelReady, modelId, error }
+ - Chat endpoint:
+ - 400 for invalid messages or multimodal parsing errors
+ - 503 when model failed to load
+ - 500 for unexpected generation errors
+ - During first request, the model is lazily loaded; subsequent requests reuse the singleton
+
+ ## Performance and scaling
+
+ - GPU recommended:
+ - Set DEVICE_MAP=auto and TORCH_DTYPE=bfloat16/float16 if supported
+ - Reduce MAX_VIDEO_FRAMES to speed up video processing
+ - For concurrency:
+ - FastAPI/Uvicorn workers and model sharing: typically 1 model per process
+ - For high throughput, prefer multiple processes or a queueing layer
+
+ ## Data and directories
+
+ - models/ contains downloaded model artifacts (implicitly created by Transformers cache); ignored by git
+ - tmp/ used transiently for video decoding (temporary files)
+
+ Ignored artifacts (see [.gitignore](.gitignore))
+ - Python: .venv/, __pycache__/, .cache/, etc.
+ - Large artifacts: models/, data/, uploads/, tmp/
+
+ ## Streaming resume details
+
+ - Session store:
+ - In-memory ring buffer for fast replay
+ - Optional SQLite persistence for robust replay across process restarts
+ - See GC in [Python.class _SessionStore](main.py:449) and [Python.method _SQLiteStore.gc](main.py:526)
+ - Limits:
+ - Ring buffer stores ~2048 SSE events per session by default
+ - If the buffer overflows before a client resumes and persistence is disabled, the earliest chunks may be unavailable
+ - End-of-stream:
+ - Final chunk contains finish_reason: "stop"
+ - "[DONE]" sentinel is emitted afterwards
+
+ ## Future enhancements
+
+ - Redis persistence:
+ - Add a Redis-backed store as a drop-in alongside SQLite
+ - Token accounting:
+ - Populate usage prompt/completion/total tokens when model exposes tokenization costs
+ - Logging/observability:
+ - Structured logs, request IDs, and metrics
+
+ ## Migration notes (from Node.js)
+
+ - All Node.js server files and scripts were removed (index.js, package*.json, scripts/)
+ - The server now targets Transformers models directly and supports multimodal inputs out of the box
+ - The API remains OpenAI-compatible on /v1/chat/completions with resumable SSE and optional SQLite persistence
CLAUDE.md ADDED
@@ -0,0 +1,259 @@
+ # CLAUDE Technical Log and Decisions (Python FastAPI + Transformers)
+ ## Progress Log — 2025-10-23 (Asia/Jakarta)
+
+ - Migrated stack from Node.js/llama.cpp to Python + FastAPI + Transformers
+ - New server: [main.py](main.py)
+ - Default model: Qwen/Qwen3-VL-2B-Thinking via Transformers with trust_remote_code
+ - Implemented endpoints
+ - Health: [Python.app.get()](main.py:577)
+ - OpenAI-compatible Chat Completions (non-stream + SSE): [Python.app.post()](main.py:591)
+ - Manual cancel (custom extension): [Python.app.post()](main.py:792)
+ - Multimodal support
+ - OpenAI-style messages mapped in [Python.function build_mm_messages](main.py:251)
+ - Image loader: [Python.function load_image_from_any](main.py:108)
+ - Video loader (frame sampling): [Python.function load_video_frames_from_any](main.py:150)
+ - Streaming + resume + persistence
+ - SSE with session_id + Last-Event-ID
+ - In-memory session ring buffer: [Python.class _SSESession](main.py:435), manager [Python.class _SessionStore](main.py:449)
+ - Optional SQLite persistence: [Python.class _SQLiteStore](main.py:482) with replay across restarts
+ - Cancellation
+ - Auto-cancel after all clients disconnect for CANCEL_AFTER_DISCONNECT_SECONDS, timer wiring in [Python.function chat_completions](main.py:733), cooperative stop in [Python.function infer_stream](main.py:375)
+ - Manual cancel API: [Python.function cancel_session](main.py:792)
+ - Configuration and dependencies
+ - Env template updated: [.env.example](.env.example) with MODEL_REPO_ID, PERSIST_SESSIONS, SESSIONS_DB_PATH, SESSIONS_TTL_SECONDS, CANCEL_AFTER_DISCONNECT_SECONDS, etc.
+ - Python deps: [requirements.txt](requirements.txt)
+ - Git ignores for Python + artifacts: [.gitignore](.gitignore)
+ - Documentation refreshed
+ - Operator docs: [README.md](README.md) including SSE resume, SQLite, cancel API
+ - Architecture: [ARCHITECTURE.md](ARCHITECTURE.md) aligned to Python flows
+ - Rules: [RULES.md](RULES.md) updated — Git usage is mandatory
+ - Legacy removal
+ - Deleted Node files and scripts (index.js, package*.json, scripts/) as requested
+
+ Suggested Git commit series (run in order)
+ - git add .
+ - git commit -m "feat(server): add FastAPI OpenAI-compatible /v1/chat/completions with Qwen3-VL [Python.main()](main.py:1)"
+ - git commit -m "feat(stream): SSE streaming with session_id resume and in-memory sessions [Python.function chat_completions()](main.py:591)"
+ - git commit -m "feat(persist): SQLite-backed replay for SSE sessions [Python.class _SQLiteStore](main.py:482)"
+ - git commit -m "feat(cancel): auto-cancel after disconnect and POST /v1/cancel/{session_id} [Python.function cancel_session](main.py:792)"
+ - git commit -m "docs: update README/ARCHITECTURE/RULES for Python stack and streaming resume"
+ - git push
+
+ Verification snapshot
+ - Non-stream text works via [Python.function infer](main.py:326)
+ - Streaming emits chunks and ends with [DONE]
+ - Resume works with Last-Event-ID; persists across restart when PERSIST_SESSIONS=1
+ - Manual cancel stops generation; auto-cancel triggers after disconnect threshold
+
+
+ This is the developer-facing changelog and design rationale for the Python migration. Operator docs live in [README.md](README.md); architecture details in [ARCHITECTURE.md](ARCHITECTURE.md); rules in [RULES.md](RULES.md); task tracking in [TODO.md](TODO.md).
+
+ Key source file references
+ - Server entry: [Python.main()](main.py:807)
+ - Health endpoint: [Python.app.get()](main.py:577)
+ - Chat Completions endpoint (non-stream + SSE): [Python.app.post()](main.py:591)
+ - Manual cancel endpoint (custom): [Python.app.post()](main.py:792)
+ - Engine (Transformers): [Python.class Engine](main.py:231)
+ - Multimodal mapping: [Python.function build_mm_messages](main.py:251)
+ - Image loader: [Python.function load_image_from_any](main.py:108)
+ - Video loader: [Python.function load_video_frames_from_any](main.py:150)
+ - Non-stream inference: [Python.function infer](main.py:326)
+ - Streaming inference + stopping criteria: [Python.function infer_stream](main.py:375)
+ - In-memory sessions: [Python.class _SSESession](main.py:435), [Python.class _SessionStore](main.py:449)
+ - SQLite persistence: [Python.class _SQLiteStore](main.py:482)
+
+ Summary of the migration
+ - Replaced the Node.js/llama.cpp stack with a Python FastAPI server that uses Hugging Face Transformers for Qwen3-VL multimodal inference.
+ - Exposes an OpenAI-compatible /v1/chat/completions endpoint (non-stream and streaming via SSE).
+ - Supports text, images, and videos:
+ - Messages can include array parts such as "text", "image_url" / "input_image" (base64), "video_url" / "input_video" (base64).
+ - Images are decoded to PIL in [Python.function load_image_from_any](main.py:108).
+ - Videos are read via imageio.v3 (preferred) or OpenCV, sampled to up to MAX_VIDEO_FRAMES in [Python.function load_video_frames_from_any](main.py:150).
+ - Streaming includes resumability with session_id + Last-Event-ID:
+ - In-memory ring buffer: [Python.class _SSESession](main.py:435)
+ - Optional SQLite persistence: [Python.class _SQLiteStore](main.py:482)
+ - Added a manual cancel endpoint (custom) and implemented auto-cancel after disconnect.
+
+ Why Python + Transformers?
+ - Qwen3-VL-2B-Thinking is published for Transformers and includes multimodal processors (preprocessor_config.json, video_preprocessor_config.json, chat_template.json). Python + Transformers is the first-class path.
+ - trust_remote_code=True allows the model repo to provide custom processing logic and templates, used in [Python.class Engine](main.py:231) via AutoProcessor/AutoModelForCausalLM.
+
+ Core design choices
+
+ 1) OpenAI compatibility
+ - Non-stream path returns choices[0].message.content from [Python.function infer](main.py:326).
+ - Streaming path (SSE) produces OpenAI-style "chat.completion.chunk" deltas, with id lines "session_id:index" for resume.
+ - We retained Chat Completions (legacy) rather than the newer Responses API for compatibility with existing SDKs. A custom cancel endpoint is provided to fill the gap.
+
+ 2) Multimodal input handling
+ - The API accepts "messages" with content either as a string or an array of parts typed as "text" / "image_url" / "input_image" / "video_url" / "input_video".
+ - Images: URLs (http/https or data URL), base64, or local path are supported by [Python.function load_image_from_any](main.py:108).
+ - Videos: URLs and base64 are materialized to a temp file; frames extracted and uniformly sampled by [Python.function load_video_frames_from_any](main.py:150).
+
+ 3) Engine and generation
+ - Qwen chat template applied via processor.apply_chat_template in both [Python.function infer](main.py:326) and [Python.function infer_stream](main.py:375).
+ - Generation sampling uses temperature; do_sample toggled when temperature > 0.
+ - Streams are produced using TextIteratorStreamer.
+ - Optional cooperative cancellation is implemented with a StoppingCriteria bound to a session cancel event in [Python.function infer_stream](main.py:375).
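A condensed sketch of the pattern described in this section — generation in a background thread, TextIteratorStreamer yielding deltas, and a StoppingCriteria polling a per-session cancel event. This is illustrative, not the exact code in main.py (e.g., `processor.tokenizer` and the argument wiring are assumptions):

```python
# Illustrative sketch of cooperative cancellation with TextIteratorStreamer.
import threading
from transformers import StoppingCriteria, StoppingCriteriaList, TextIteratorStreamer

class CancelledCriteria(StoppingCriteria):
    def __init__(self, cancel_event: threading.Event):
        self.cancel_event = cancel_event

    def __call__(self, input_ids, scores, **kwargs) -> bool:
        # Returning True asks generate() to stop cooperatively.
        return self.cancel_event.is_set()

def stream_generate(model, processor, inputs, cancel_event, max_new_tokens=4096):
    streamer = TextIteratorStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
    gen_kwargs = dict(
        **inputs,
        max_new_tokens=max_new_tokens,
        streamer=streamer,
        stopping_criteria=StoppingCriteriaList([CancelledCriteria(cancel_event)]),
    )
    threading.Thread(target=model.generate, kwargs=gen_kwargs, daemon=True).start()
    for text in streamer:  # yields decoded text deltas as they are produced
        yield text
```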
+
+ 4) Streaming, resume, and persistence
+ - In-memory buffer per session for immediate replay: [Python.class _SSESession](main.py:435).
+ - Optional SQLite persistence to survive restarts and handle long gaps: [Python.class _SQLiteStore](main.py:482).
+ - Resume protocol:
+ - Client provides session_id in the request body and Last-Event-ID header "session_id:index", or pass ?last_event_id=...
+ - Server replays events after index from SQLite (if enabled) and the in-memory buffer.
+ - Producer appends events to both the ring buffer and SQLite (when enabled).
+
+ 5) Cancellation and disconnects
+ - Manual cancel endpoint [Python.app.post()](main.py:792) sets the session cancel event and marks finished in SQLite.
+ - Auto-cancel after disconnect:
+ - If all clients disconnect, a timer fires after CANCEL_AFTER_DISCONNECT_SECONDS (default 3600) that sets the cancel event.
+ - The StoppingCriteria checks this event cooperatively and halts generation.
+
+ 6) Environment configuration
+ - See [.env.example](.env.example).
+ - Important variables:
+ - MODEL_REPO_ID (default "Qwen/Qwen3-VL-2B-Thinking")
+ - HF_TOKEN (optional)
+ - MAX_TOKENS, TEMPERATURE
+ - MAX_VIDEO_FRAMES (video frame sampling)
+ - DEVICE_MAP, TORCH_DTYPE (Transformers loading hints)
+ - PERSIST_SESSIONS, SESSIONS_DB_PATH, SESSIONS_TTL_SECONDS (SQLite)
+ - CANCEL_AFTER_DISCONNECT_SECONDS (auto-cancel threshold)
+
+ Security and privacy notes
+ - trust_remote_code=True executes code from the model repository when loading AutoProcessor/AutoModel. This is standard for many HF multimodal models but should be understood in terms of supply-chain risk.
+ - Do not log sensitive data. Avoid dumping raw request bodies or tokens.
+
+ Operational guidance
+
+ Running locally
+ - Install Python dependencies from [requirements.txt](requirements.txt) and install a suitable PyTorch wheel for your platform/CUDA.
+ - copy .env.example .env and adjust as needed.
+ - Start: python [Python.main()](main.py:807)
+
+ Testing endpoints
+ - Health: GET /health
+ - Chat (non-stream): POST /v1/chat/completions with messages array.
+ - Chat (stream): add "stream": true; optionally pass "session_id".
+ - Resume: send Last-Event-ID with "session_id:index".
+ - Cancel: POST /v1/cancel/{session_id}.
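For automated checks, a health probe could be written against the FastAPI app directly. The following is a minimal illustration using FastAPI's TestClient, not a copy of tests/test_api.py; the asserted field names follow the /health response documented in README.md:

```python
# Minimal illustrative test (not taken from tests/test_api.py).
from fastapi.testclient import TestClient
from main import app

def test_health_reports_model_state():
    client = TestClient(app)
    resp = client.get("/health")
    assert resp.status_code == 200
    body = resp.json()
    assert "ok" in body and "modelReady" in body and "modelId" in body
```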
+
+ Scaling notes
+ - Typically deploy one model per process. For throughput, run multiple workers behind a load balancer; sessions are process-local unless persistence is used.
+ - SQLite persistence supports replay but does not synchronize cancel/producer state across processes. A Redis-based store (future work) can coordinate multi-process session state more robustly.
+
+ Known limitations and follow-ups
+ - Token accounting (usage prompt/completion/total) is stubbed at zeros. Populate if/when needed.
+ - Redis store not yet implemented (design leaves a clear seam via _SQLiteStore analog).
+ - No structured logging/tracing yet; follow-up for observability.
+ - Cancellation is best-effort cooperative; it relies on the stopping criteria hook in generation.
+
+ Changelog (2025-10-23)
+ - feat(server): Python FastAPI server with Qwen3-VL (Transformers), OpenAI-compatible /v1/chat/completions.
+ - feat(stream): SSE streaming with session_id + Last-Event-ID resumability.
+ - feat(persist): Optional SQLite-backed session persistence for replay across restarts.
+ - feat(cancel): Manual cancel endpoint /v1/cancel/{session_id}; auto-cancel after disconnect threshold.
+ - docs: Updated [README.md](README.md), [ARCHITECTURE.md](ARCHITECTURE.md), [RULES.md](RULES.md). Rewrote [TODO.md](TODO.md) pending/complete items (see repo TODO).
+ - chore: Removed Node.js and scripts from the prior stack.
+
+ Verification checklist
+ - Non-stream text-only request returns a valid completion.
+ - Image and video prompts pass through preprocessing and generate coherent output.
+ - Streaming emits OpenAI-style deltas and ends with [DONE].
+ - Resume works with Last-Event-ID and session_id across reconnects; works after server restart when PERSIST_SESSIONS=1.
+ - Manual cancel halts generation and marks session finished; subsequent resumes return a finished stream.
+ - Auto-cancel fires after all clients disconnect for CANCEL_AFTER_DISCONNECT_SECONDS and cooperatively stops generation.
+
+ End of entry.
+ ## Progress Log Template (Mandatory per RULES)
+
+ Use this template for every change or progress step. Add a new entry before/with each commit, then append the final commit hash after push. See enforcement in [RULES.md](RULES.md:33) and the progress policy in [RULES.md](RULES.md:49).
+
+ Entry template
+ - Date/Time (Asia/Jakarta): YYYY-MM-DD HH:mm
+ - Commit: <hash> - <conventional message>
+ - Scope/Files (clickable anchors required):
+ - [Python.function chat_completions()](main.py:591)
+ - [Python.function infer_stream()](main.py:375)
+ - [README.md](README.md:1), [ARCHITECTURE.md](ARCHITECTURE.md:1), [RULES.md](RULES.md:1), [TODO.md](TODO.md:1)
+ - Summary:
+ - What changed and why (problem/requirement)
+ - Changes:
+ - Short bullet list of code edits with anchors
+ - Verification:
+ - Commands:
+ - curl examples (non-stream, stream with session_id, resume with Last-Event-ID)
+ - cancel API test: curl -X POST http://localhost:3000/v1/cancel/mysession123
+ - Expected vs Actual:
+ - …
+ - Follow-ups/Limitations:
+ - …
+ - Notes:
+ - If commit hash unknown at authoring time, update the entry after git push.
+
+ Git sequence (run every time)
+ - git add .
+ - git commit -m "type(scope): short description"
+ - git push
+ - Update this entry with the final commit hash.
+
+ Example (filled)
+ - Date/Time: 2025-10-23 14:30 (Asia/Jakarta)
+ - Commit: f724450 - feat(stream): add SQLite persistence for SSE resume
+ - Scope/Files:
+ - [Python.class _SQLiteStore](main.py:482)
+ - [Python.function chat_completions()](main.py:591)
+ - [README.md](README.md:1), [ARCHITECTURE.md](ARCHITECTURE.md:1)
+ - Summary:
+ - Persist SSE chunks to SQLite for replay across restarts; enable via PERSIST_SESSIONS.
+ - Changes:
+ - Add _SQLiteStore with schema and CRUD
+ - Wire producer to append events to DB
+ - Replay DB events on resume before in-memory buffer
+ - Verification:
+ - curl -N -H "Content-Type: application/json" ^
+ -d "{\"session_id\":\"mysession123\",\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: 17*23?\"}],\"stream\":true}" ^
+ http://localhost:3000/v1/chat/completions
+ - Restart server; resume:
+ curl -N -H "Content-Type: application/json" ^
+ -H "Last-Event-ID: mysession123:42" ^
+ -d "{\"session_id\":\"mysession123\",\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: 17*23?\"}],\"stream\":true}" ^
+ http://localhost:3000/v1/chat/completions
+ - Expected vs Actual: replayed chunks after index 42, continued live, ended with [DONE].
+ - Follow-ups:
+ - Consider Redis store for multi-process coordination
+ ## Progress Log — 2025-10-23 14:31 (Asia/Jakarta)
+
+ - Commit: f724450 - docs: sync README/ARCHITECTURE/RULES with main.py; add progress log in CLAUDE.md; enforce mandatory Git
+ - Scope/Files (anchors):
+ - [Python.function chat_completions()](main.py:591)
+ - [Python.function infer_stream()](main.py:375)
+ - [Python.class _SSESession](main.py:435), [Python.class _SessionStore](main.py:449), [Python.class _SQLiteStore](main.py:482)
+ - [README.md](README.md:1), [ARCHITECTURE.md](ARCHITECTURE.md:1), [RULES.md](RULES.md:1), [CLAUDE.md](CLAUDE.md:1), [.env.example](.env.example:1)
+ - Summary:
+ - Completed Python migration and synchronized documentation. Implemented SSE streaming with resume, optional SQLite persistence, auto-cancel on disconnect, and manual cancel API. RULES now mandate Git usage and progress logging.
+ - Changes:
+ - Document streaming/resume/persistence/cancel in [README.md](README.md:1) and [ARCHITECTURE.md](ARCHITECTURE.md:1)
+ - Enforce Git workflow and progress logging in [RULES.md](RULES.md:33)
+ - Add Progress Log template and entries in [CLAUDE.md](CLAUDE.md:1)
+ - Verification:
+ - Non-stream:
+ curl -X POST http://localhost:3000/v1/chat/completions ^
+ -H "Content-Type: application/json" ^
+ -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}]}"
+ - Stream:
+ curl -N -H "Content-Type: application/json" ^
+ -d "{\"session_id\":\"mysession123\",\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: 17*23?\"}],\"stream\":true}" ^
+ http://localhost:3000/v1/chat/completions
+ - Resume:
+ curl -N -H "Content-Type: application/json" ^
+ -H "Last-Event-ID: mysession123:42" ^
+ -d "{\"session_id\":\"mysession123\",\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: 17*23?\"}],\"stream\":true}" ^
+ http://localhost:3000/v1/chat/completions
+ - Cancel:
+ curl -X POST http://localhost:3000/v1/cancel/mysession123
+ - Results:
+ - Streaming emits chunks, ends with [DONE]; resume replays after index; cancel terminates generation; auto-cancel after disconnect threshold works via timer + stopping criteria.
+ - Follow-ups:
+ - Optional Redis store for multi-process coordination.
Dockerfile ADDED
@@ -0,0 +1,73 @@
+ # Use Python 3.12 slim for smaller image
+ FROM python:3.12-slim
+
+ # Install system deps for image/video processing and HF
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+ git \
+ curl \
+ libglib2.0-0 \
+ libgomp1 \
+ && apt-get clean \
+ && rm -rf /var/lib/apt/lists/*
+
+ # Set working directory
+ WORKDIR /app
+
+ # Copy requirements first for better caching
+ COPY requirements.txt .
+
+ # Backend selector: cpu | nvidia | amd
+ ARG BACKEND=cpu
+ # Pin torch versions per backend index
+ # - CPU index publishes newer (2.9.0 ok)
+ # - CUDA cu124 index publishes up to 2.6.0 (auto-resolves to +cu124)
+ # - ROCm 6.2 index publishes up to 2.5.1+rocm6.2 (must include local tag)
+ ARG TORCH_VER_CPU=2.9.0
+ ARG TORCH_VER_NVIDIA=2.6.0
+ ARG TORCH_VER_AMD=2.5.1+rocm6.2
+
+ # Control whether to bake the model into the image (1) or skip and download at runtime (0)
+ ARG BAKE_MODEL=0
+
+ ENV BACKEND=${BACKEND}
+ ENV BAKE_MODEL=${BAKE_MODEL}
+ ENV PIP_NO_CACHE_DIR=1
+
+ # Install appropriate PyTorch for the selected backend, then the rest
+ RUN if [ "$BACKEND" = "cpu" ]; then \
+ pip install --no-cache-dir --index-url https://download.pytorch.org/whl/cpu torch==${TORCH_VER_CPU}; \
+ elif [ "$BACKEND" = "nvidia" ]; then \
+ pip install --no-cache-dir --index-url https://download.pytorch.org/whl/cu124 torch==${TORCH_VER_NVIDIA}; \
+ elif [ "$BACKEND" = "amd" ]; then \
+ pip install --no-cache-dir --index-url https://download.pytorch.org/whl/rocm6.2 "torch==${TORCH_VER_AMD}"; \
+ else \
+ echo "Unsupported BACKEND: $BACKEND" && exit 1; \
+ fi && \
+ pip install --no-cache-dir -r requirements.txt
+
+ # Copy source code
+ COPY main.py .
+ COPY tests/ tests/
+
+ # Copy env template (users can override with volume or env)
+ COPY .env.example .env
+
+ # HF cache and optional model bake-in (skippable for huge GPU builds to avoid runner disk exhaustion)
+ ENV HF_HOME=/app/hf-cache
+ ENV TRANSFORMERS_CACHE=/app/hf-cache
+ RUN mkdir -p /app/hf-cache && \
+ if [ "$BAKE_MODEL" = "1" ]; then \
+ python -c "import os; from huggingface_hub import snapshot_download; repo_id='Qwen/Qwen3-VL-2B-Thinking'; token=os.getenv('HF_TOKEN'); print(f'Downloading {repo_id}...'); snapshot_download(repo_id, token=token, local_dir='/app/hf-cache/Qwen_Qwen3-VL-2B-Thinking', local_dir_use_symlinks=False); print('Model downloaded.');"; \
+ else \
+ echo 'Skipping model bake-in (BAKE_MODEL=0). The server will prefetch to /app/hf-cache at startup.'; \
+ fi
+
+ # Expose port
+ EXPOSE 3000
+
+ # Health check
+ HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
+ CMD curl -f http://localhost:3000/health || exit 1
+
+ # Run the server
+ CMD ["python", "main.py"]
HISTORY.md ADDED
@@ -0,0 +1,3 @@
+ # History
+
+ This file tracks the chat history.
LICENSE ADDED
@@ -0,0 +1,50 @@
+ Modified Apache License, Version 2.0 (Royalty-Linked)
+
+ Copyright (c) 2025 Alif Nurhidayat (GitHub: KillerKing93) <alifnurhidayatwork@gmail.com>
+
+ 1. Incorporation of Apache License, Version 2.0
+ This Software is provided under the terms of the Apache License, Version 2.0, with the additional terms set forth below. Except as modified herein, the full text of the Apache License, Version 2.0 applies and is incorporated by reference:
+ https://www.apache.org/licenses/LICENSE-2.0
+
+ 2. Additional Terms — Commercial Royalty
+ 2.1. Commercial Use of this Software is subject to a royalty obligation to the Licensor (the copyright holder), unless explicitly exempted below.
+ 2.2. “Commercial Use” means any use, distribution, SaaS offering, internal deployment tied to revenue generation or cost reduction, embedding in a paid product or service, or use by for-profit entities for business operations.
+ 2.3. “Licensor” means the copyright holder named above.
+
+ 3. Royalty Schedule (Guidance)
+ The payable royalty ranges from 5% up to 25% of net revenue attributable to the Software, determined as follows:
+ - 5%: Minimal or ancillary use (non-core functionality, prototypes, or internal tools with limited scope).
+ - 10%: Moderate use (Software forms a notable but non-primary component of a commercial product or service).
+ - 15%: Significant use (Software materially contributes to value delivery or operational savings).
+ - 20%: Core use (Software is a primary component enabling the product/service).
+ - 25%: White-label or redistribution scenarios where the Software is central and repackaged for resale.
+ Notes:
+ - “Net revenue attributable” should reasonably apportion revenue or savings connected to the Software’s role.
+ - When in doubt, contact the Licensor to agree on a fair rate and basis. Written waivers or adjustments override this schedule.
+
+ 4. Exemptions
+ 4.1. PT. ASCON INOVASI DATA is granted a perpetual, worldwide, royalty-free license to use, reproduce, distribute, modify, and create derivative works of the Software for any purpose.
+ 4.2. Non-commercial academic research and open-source contributions not tied to revenue generation are generally exempt from royalties. However, redistribution or commercial hosting still requires compliance with Section 2.
+
+ 5. Reporting and Payment
+ 5.1. For Commercial Use, Licensee shall make a good-faith effort to notify the Licensor within 30 days of first commercial deployment and, if requested, provide a brief description of the use case to determine the appropriate royalty tier.
+ 5.2. Royalties shall be settled quarterly unless otherwise agreed in writing.
+
+ 6. No Warranty; Limitation of Liability
+ As per Apache License 2.0 Sections 7 and 8, the Software is provided “AS IS,” without warranties or conditions of any kind, and Licensor shall not be liable for any damages arising from the use of the Software.
+
+ 7. Patent Grant
+ As per Apache License 2.0 Section 3. No additional patent licenses are granted or implied beyond Apache 2.0.
+
+ 8. Attribution
+ Attribution notices required by Apache 2.0 must be preserved in source distributions and, where practical, in documentation and About screens of products/services using the Software.
+
+ 9. Severability
+ If any provision of these Additional Terms is held unenforceable, the remaining provisions shall remain in full force and effect, and the unenforceable provision shall be enforced to the maximum extent permissible.
+
+ 10. Contact
+ For royalty discussions, waivers, or clarifications:
+ - Licensor: Alif Nurhidayat (KillerKing93)
+ - Email: alifnurhidayatwork@gmail.com
+
+ By using this Software in a Commercial Use, you acknowledge the applicability of these Additional Terms alongside the Apache License, Version 2.0.
README.md CHANGED
@@ -1,12 +1,358 @@
- ---
- title: Transformers InferenceServer OpenAPI Compatible
- emoji: 🌖
- colorFrom: green
- colorTo: green
- sdk: docker
- pinned: false
- license: other
- short_description: Transformers-InferenceServer-OpenAPI-Compatible
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Python FastAPI Inference Server (OpenAI-Compatible) for Qwen3-VL-2B-Thinking
+
+ This repository has been migrated from a Node.js/llama.cpp stack to a Python/Transformers stack to fully support multimodal inference (text, images, videos) with the Hugging Face Qwen3 models.
+
+ Key files:
+
+ - Server entry: [main.py](main.py)
+ - Environment template: [.env.example](.env.example)
+ - Python dependencies: [requirements.txt](requirements.txt)
+ - Architecture: [ARCHITECTURE.md](ARCHITECTURE.md) (updated to reflect the Python stack)
+
+ Model:
+
+ - Default: Qwen/Qwen3-VL-2B-Thinking (Transformers; supports multimodal)
+ - You can change the model via environment variable MODEL_REPO_ID.
+
+ Node.js artifacts and scripts from the previous project have been removed.
+
+ ## Quick Start
+
+ ### Option 1: Run with Docker (with-model images: CPU / NVIDIA / AMD)
+
+ Tags built by CI:
+ - ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-cpu
+ - ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-nvidia
+ - ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-amd
+
+ Pull:
+
+ ```bash
+ # CPU
+ docker pull ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-cpu
+
+ # NVIDIA (CUDA 12.4 wheel)
+ docker pull ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-nvidia
+
+ # AMD (ROCm 6.2 wheel)
+ docker pull ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-amd
+ ```
+
+ Run:
+
+ ```bash
+ # CPU
+ docker run -p 3000:3000 \
+ -e HF_TOKEN=your_hf_token_here \
+ ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-cpu
+
+ # NVIDIA GPU (requires NVIDIA drivers + nvidia-container-toolkit on the host)
+ docker run --gpus all -p 3000:3000 \
+ -e HF_TOKEN=your_hf_token_here \
+ ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-nvidia
+
+ # AMD GPU ROCm (requires ROCm 6.2+ drivers on the host; Linux only)
+ # Map ROCm devices and video group (may vary by distro)
+ docker run --device=/dev/kfd --device=/dev/dri --group-add video \
+ -p 3000:3000 \
+ -e HF_TOKEN=your_hf_token_here \
+ ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-amd
+ ```
+
+ Health check:
+ ```bash
+ curl http://localhost:3000/health
+ ```
+
+ Notes:
+ - These are with-model images; the first pull is large. In CI, after "Model downloaded." BuildKit may appear idle while tarring/committing the multi‑GB layer.
+ - Host requirements:
+ - NVIDIA: recent driver + nvidia-container-toolkit.
+ - AMD: ROCm 6.2+ driver stack, supported GPU, and mapped /dev/kfd and /dev/dri devices.
+
+ ### Option 2: Run Locally
+
+ Requirements
+
+ - Python 3.10+
+ - pip
+ - PyTorch (install a wheel matching your platform/CUDA)
+ - Optionally a GPU with enough VRAM for the chosen model
+
+ Install
+
+ 1. Create and activate a virtual environment (Windows CMD):
+ python -m venv .venv
+ .venv\Scripts\activate
+
+ 2. Install dependencies:
+ pip install -r requirements.txt
+
+ 3. Install PyTorch appropriate for your platform (examples):
+ CPU-only:
+ pip install torch --index-url https://download.pytorch.org/whl/cpu
+ CUDA 12.4 example:
+ pip install torch --index-url https://download.pytorch.org/whl/cu124
+
+ 4. Create a .env from the template and adjust if needed:
+ copy .env.example .env
+ - Set HF_TOKEN if the model is gated
+ - Adjust MAX_TOKENS, TEMPERATURE, DEVICE_MAP, TORCH_DTYPE, MAX_VIDEO_FRAMES as desired
+
+ Configuration via .env
+ See [.env.example](.env.example). Important variables:
+
+ - PORT=3000
+ - MODEL_REPO_ID=Qwen/Qwen3-VL-2B-Thinking
+ - HF_TOKEN= # optional if gated
+ - MAX_TOKENS=4096
+ - TEMPERATURE=0.7
+ - MAX_VIDEO_FRAMES=16
+ - DEVICE_MAP=auto
+ - TORCH_DTYPE=auto
+
+ Additional streaming/persistence configuration
+
+ - PERSIST_SESSIONS=1 # enable SQLite-backed resumable SSE
+ - SESSIONS_DB_PATH=sessions.db # SQLite db path
+ - SESSIONS_TTL_SECONDS=600 # TTL for finished sessions before GC
+ - CANCEL_AFTER_DISCONNECT_SECONDS=3600 # auto-cancel generation if all clients disconnect for this many seconds (0=disable)
+
+ Cancel session API (custom extension)
+
+ - Endpoint: POST /v1/cancel/{session_id}
+ - Purpose: Manually cancel an in-flight streaming generation for the given session_id. Not part of OpenAI Chat Completions spec (the newer OpenAI Responses API has cancel), so this is provided as a practical extension.
+ - Example (Windows CMD):
+ curl -X POST http://localhost:3000/v1/cancel/mysession123
+ Run
+
+ - Direct:
+ python main.py
+
+ - Using uvicorn:
+ uvicorn main:app --host 0.0.0.0 --port 3000
+
+ Endpoints (OpenAI-compatible)
+
+ - Health
+ GET /health
+ Example:
+ curl http://localhost:3000/health
+ Response:
+ {
+ "ok": true,
+ "modelReady": true,
+ "modelId": "Qwen/Qwen3-VL-2B-Thinking",
+ "error": null
+ }
+
+ - Chat Completions (non-streaming)
+ POST /v1/chat/completions
+ Example (Windows CMD):
+ curl -X POST http://localhost:3000/v1/chat/completions ^
+ -H "Content-Type: application/json" ^
+ -d "{\"model\":\"qwen-local\",\"messages\":[{\"role\":\"user\",\"content\":\"Describe this image briefly\"}],\"max_tokens\":128}"
+
+ Example (PowerShell):
+ $body = @{
+ model = "qwen-local"
+ messages = @(@{ role = "user"; content = "Hello Qwen3!" })
+ max_tokens = 128
+ } | ConvertTo-Json -Depth 5
+ curl -Method POST http://localhost:3000/v1/chat/completions -ContentType "application/json" -Body $body
+
+ - Chat Completions (streaming via Server-Sent Events)
+ Set "stream": true to receive partial deltas as they are generated.
+ Example (Windows CMD):
+ curl -N -H "Content-Type: application/json" ^
+ -d "{\"model\":\"qwen-local\",\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: what is 17 * 23?\"}],\"stream\":true}" ^
+ http://localhost:3000/v1/chat/completions
+
+ The stream format follows OpenAI-style SSE:
+ data: { "id": "...", "object": "chat.completion.chunk", "choices":[{ "delta": {"role": "assistant"} }]}
+ data: { "choices":[{ "delta": {"content": "To"} }]}
+ data: { "choices":[{ "delta": {"content": " think..."} }]}
+ ...
+ data: { "choices":[{ "delta": {}, "finish_reason": "stop"}]}
+ data: [DONE]
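Because the endpoint is OpenAI-compatible, the official openai Python SDK (>= 1.0) can usually be pointed at it as well. This is an illustrative sketch; the api_key is a placeholder on the assumption that the server does not enforce authentication:

```python
# Illustrative sketch: using the openai Python SDK against this server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="not-needed")

# Non-streaming
resp = client.chat.completions.create(
    model="qwen-local",
    messages=[{"role": "user", "content": "Hello Qwen3!"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="qwen-local",
    messages=[{"role": "user", "content": "Think step by step: what is 17 * 23?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```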
+
+ Multimodal Usage
+
+ - Text only:
+ { "role": "user", "content": "Summarize: The quick brown fox ..." }
+
+ - Image by URL:
+ {
+ "role": "user",
+ "content": [
+ { "type": "text", "text": "What is in this image?" },
+ { "type": "image_url", "image_url": { "url": "https://example.com/cat.jpg" } }
+ ]
+ }
+
+ - Image by base64:
+ {
+ "role": "user",
+ "content": [
+ { "type": "text", "text": "OCR this." },
+ { "type": "input_image", "b64_json": "<base64 of image bytes>" }
+ ]
+ }
+
+ - Video by URL (frames are sampled up to MAX_VIDEO_FRAMES):
+ {
+ "role": "user",
+ "content": [
+ { "type": "text", "text": "Describe this clip." },
+ { "type": "video_url", "video_url": { "url": "https://example.com/clip.mp4" } }
+ ]
+ }
+
+ - Video by base64:
+ {
+ "role": "user",
+ "content": [
+ { "type": "text", "text": "Count the number of cars." },
+ { "type": "input_video", "b64_json": "<base64 of full video file>" }
+ ]
+ }
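A short Python sketch of sending one of the base64 payloads above from a local file, using only the request shapes documented in this section (file name and timeout are arbitrary examples):

```python
# Build an "input_image" message from a local file and post it with requests.
import base64
import requests

with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "model": "qwen-local",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "OCR this."},
                {"type": "input_image", "b64_json": b64},
            ],
        }
    ],
    "max_tokens": 256,
}

resp = requests.post("http://localhost:3000/v1/chat/completions", json=payload, timeout=600)
print(resp.json()["choices"][0]["message"]["content"])
```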
+
+ Implementation Notes
+
+ - Server code: [main.py](main.py)
+ - FastAPI with CORS enabled
+ - Non-streaming and streaming endpoints
+ - Uses AutoProcessor and AutoModelForCausalLM with trust_remote_code=True
+ - Converts OpenAI-style messages into the Qwen multimodal format
+ - Images loaded via PIL; videos loaded via imageio.v3 (preferred) or OpenCV as fallback; frames sampled
+
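For orientation, the loading step described in the notes above typically boils down to something like the following. This is a condensed sketch under the stated configuration (trust_remote_code, DEVICE_MAP, TORCH_DTYPE); the actual Engine class in main.py handles more, such as lazy loading and error reporting:

```python
# Condensed sketch of model/processor loading (not the actual Engine code).
import os
from transformers import AutoModelForCausalLM, AutoProcessor

repo_id = os.getenv("MODEL_REPO_ID", "Qwen/Qwen3-VL-2B-Thinking")
token = os.getenv("HF_TOKEN") or None

processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True, token=token)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    token=token,
    device_map=os.getenv("DEVICE_MAP", "auto"),
    torch_dtype=os.getenv("TORCH_DTYPE", "auto"),
)
```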
+ Performance Tips
+
+ - On GPUs: set DEVICE_MAP=auto and TORCH_DTYPE=bfloat16 or float16 if supported
+ - Reduce MAX_VIDEO_FRAMES to speed up video processing
+ - Tune MAX_TOKENS and TEMPERATURE according to your needs
+
+ Troubleshooting
+
+ - ImportError or no CUDA found:
+ - Ensure PyTorch is installed with the correct wheel for your environment.
+ - OOM / CUDA out of memory:
+ - Use a smaller model, lower MAX_VIDEO_FRAMES, lower MAX_TOKENS, or run on CPU.
+ - 503 Model not ready:
+ - The first request triggers model load; check /health for errors and HF_TOKEN if gated.
+
+ License
+
+ - See LICENSE for terms.
+
+ Changelog and Architecture
+
+ - [ARCHITECTURE.md](ARCHITECTURE.md) has been updated to reflect the Python server flow.
+
+ ## Streaming behavior, resume, and reconnections
+
+ The server streams responses using Server‑Sent Events (SSE) from [Python.function chat_completions()](main.py:457), driven by token iteration in [Python.function infer_stream](main.py:361). It now supports resumable streaming using an in‑memory ring buffer and SSE Last-Event-ID, with optional SQLite persistence (enable PERSIST_SESSIONS=1).
+
+ What’s implemented
+
+ - Per-session in-memory ring buffer keyed by session_id (no external storage).
+ - Each SSE event carries an SSE id line in the format "session_id:index" so clients can resume with Last-Event-ID.
+ - On reconnect:
+ - Provide the same session_id in the request body, and
+ - Provide "Last-Event-ID: session_id:index" header (or query ?last_event_id=session_id:index).
+ - The server replays cached events after index and continues streaming new tokens.
+ - Session TTL: ~10 minutes, buffer capacity: ~2048 events. Old or finished sessions are garbage-collected in-memory.
+
+ How to start a streaming session
+
+ - Minimal (server generates a session_id internally for SSE id lines):
+ Windows CMD:
+ curl -N -H "Content-Type: application/json" ^
+ -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: 17*23?\"}],\"stream\":true}" ^
+ http://localhost:3000/v1/chat/completions
+
+ - With explicit session_id (recommended if you want to resume):
+ Windows CMD:
+ curl -N -H "Content-Type: application/json" ^
+ -d "{\"session_id\":\"mysession123\",\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: 17*23?\"}],\"stream\":true}" ^
+ http://localhost:3000/v1/chat/completions
+
+ How to resume after disconnect
+
+ - Use the same session_id and the SSE Last-Event-ID header (or ?last_event_id=...):
+ Windows CMD (resume from index 42):
+ curl -N -H "Content-Type: application/json" ^
+ -H "Last-Event-ID: mysession123:42" ^
+ -d "{\"session_id\":\"mysession123\",\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: 17*23?\"}],\"stream\":true}" ^
+ http://localhost:3000/v1/chat/completions
+
+ Alternatively with query string:
+ http://localhost:3000/v1/chat/completions?last_event_id=mysession123:42
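The same resume flow can be scripted. Below is a minimal sketch using only `requests`: it tracks the last SSE id it saw and sends it back as Last-Event-ID when reconnecting; session_id and the prompt match the curl examples above:

```python
# Sketch of a resuming SSE client built on requests (illustrative only).
import json
import requests

URL = "http://localhost:3000/v1/chat/completions"
body = {
    "session_id": "mysession123",
    "messages": [{"role": "user", "content": "Think step by step: 17*23?"}],
    "stream": True,
}

def stream_once(last_event_id=None):
    headers = {"Content-Type": "application/json"}
    if last_event_id:
        headers["Last-Event-ID"] = last_event_id
    with requests.post(URL, json=body, headers=headers, stream=True, timeout=600) as r:
        last_id = last_event_id
        for raw in r.iter_lines(decode_unicode=True):
            if not raw:
                continue
            if raw.startswith("id: "):
                last_id = raw[4:].strip()          # remember "session_id:index"
            elif raw.startswith("data: "):
                data = raw[6:]
                if data == "[DONE]":
                    return last_id, True
                delta = json.loads(data)["choices"][0]["delta"].get("content", "")
                print(delta, end="", flush=True)
        return last_id, False

# First attempt; on a dropped connection, call stream_once(last_id) again to resume.
last_id, finished = stream_once()
```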
+
+ Event format
+
+ - Chunks follow the OpenAI-style "chat.completion.chunk" shape in data payloads, plus an SSE id:
+ id: mysession123:5
+ data: {"id":"mysession123","object":"chat.completion.chunk","created":..., "model":"...", "choices":[{"index":0,"delta":{"content":" token"},"finish_reason":null}]}
+
+ - The stream ends with:
+ data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
+ data: [DONE]
+
+ Notes and limits
+
+ - Without SQLite persistence enabled (PERSIST_SESSIONS=0), session state is kept only in memory; restarts will drop buffers.
+ - If the buffer overflows before you resume, the earliest chunks may be unavailable.
+ - Cancellation on client disconnect is handled by an auto-cancel timer after CANCEL_AFTER_DISCONNECT_SECONDS (see "Cancellation and session persistence" below); if that timer is disabled, generation runs to completion in the background.
+
+ ## Hugging Face repository files support
+
+ This server loads the Qwen3-VL model via Transformers with `trust_remote_code=True`, so the standard files from the repo are supported and consumed automatically. Summary for https://huggingface.co/Qwen/Qwen3-VL-2B-Thinking/tree/main:
+
+ - Used by model weights and architecture
+
+ - model.safetensors — main weights loaded by AutoModelForCausalLM
+ - config.json — architecture/config
+ - generation_config.json — default gen params (we may override via request or env)
+
+ - Used by tokenizer
+
+ - tokenizer.json — primary tokenizer specification
+ - tokenizer_config.json — tokenizer settings
+ - merges.txt and vocab.json — fallback/compat files; if tokenizer.json exists, HF generally prefers it
+
+ - Used by processors (multimodal)
+
+ - preprocessor_config.json — image/text processor config
+ - video_preprocessor_config.json — video processor config (frame sampling, etc.)
+ - chat_template.json — chat formatting used by [Python.function infer](main.py:312) and [Python.function infer_stream](main.py:361) via `processor.apply_chat_template(...)`
+
+ - Not required for runtime
+ - README.md, .gitattributes — ignored by runtime
+
+ Notes:
+
+ - We rely on Transformers’ AutoModelForCausalLM and AutoProcessor to resolve and use the above files; no manual parsing is required in our code.
+ - With `trust_remote_code=True`, model-specific code from the repo may load additional assets transparently.
+ - If the repo updates configs (e.g., new chat template), the server will pick them up on next load.
+
+ ## Cancellation and session persistence
+
+ - Auto-cancel on disconnect:
+
+ - Generation is automatically cancelled if all clients disconnect for more than CANCEL_AFTER_DISCONNECT_SECONDS (default 3600 seconds = 1 hour). Configure in [.env.example](.env.example) via `CANCEL_AFTER_DISCONNECT_SECONDS`.
+ - Implemented by a timer in [Python.function chat_completions](main.py:732) that triggers a cooperative stop through a stopping criteria in [Python.function infer_stream](main.py:375).
+
+ - Manual cancel API (custom extension):
+
+ - Endpoint: `POST /v1/cancel/{session_id}`
+ - Cancels an ongoing streaming session and marks it finished in the store. Example (Windows CMD):
+ curl -X POST http://localhost:3000/v1/cancel/mysession123
+ - This is not part of OpenAI’s legacy Chat Completions spec. OpenAI’s newer Responses API has a cancel endpoint, but Chat Completions does not. We provide this custom endpoint for operational control.
+
+ - Persistence:
+ - Optional SQLite-backed persistence for resumable SSE (enable `PERSIST_SESSIONS=1` in [.env.example](.env.example)).
+ - Database path: `SESSIONS_DB_PATH` (default: sessions.db)
+ - Session TTL for GC: `SESSIONS_TTL_SECONDS` (default: 600)
+ - See implementation in [Python.class \_SQLiteStore](main.py:481) and integration in [Python.function chat_completions](main.py:591).
+ - Redis is not implemented yet; the design isolates persistence so a Redis-backed store can be added as a drop-in.
RULES.md ADDED
@@ -0,0 +1,207 @@
1
+ # Project Rules and Workflow (Python FastAPI + Transformers)
2
+
3
+ These rules are binding for every change. Keep code, docs, and behavior synchronized at all times.
4
+
5
+ Files referenced below:
6
+ - [README.md](README.md)
7
+ - [ARCHITECTURE.md](ARCHITECTURE.md)
8
+ - [TODO.md](TODO.md)
9
+ - [CLAUDE.md](CLAUDE.md)
10
+ - [.env.example](.env.example)
11
+ - [.gitignore](.gitignore)
12
+ - [requirements.txt](requirements.txt)
13
+ - [Python.main()](main.py:1)
14
+
15
+ ## 1) Documentation rules (must-do on every change)
16
+
17
+ Always update documentation when code or behavior changes.
18
+
19
+ Minimum documentation checklist:
20
+ - What changed and where (filenames, sections, or callable links like [Python.function chat_completions()](main.py:591)).
21
+ - Why the change was made (problem or requirement).
22
+ - How to operate or verify (commands, endpoints, examples).
23
+ - Follow-ups or known limitations.
24
+
25
+ Where to update:
26
+ - Operator-facing: [README.md](README.md)
27
+ - Developer-facing: [CLAUDE.md](CLAUDE.md) (rationale, alternatives, caveats)
28
+ - Architecture or flows: [ARCHITECTURE.md](ARCHITECTURE.md)
29
+ - Tasks and statuses: [TODO.md](TODO.md)
30
+
31
+ Never skip documentation. If a change is reverted, document the revert.
32
+
33
+ ## 2) Git discipline (mandatory)
34
+
35
+ - Always use Git. Every change or progress step MUST be committed and pushed.
36
+ - Windows CMD example:
37
+ - git add .
38
+ - git commit -m "type(scope): short description"
39
+ - git push
40
+ - No exceptions. If no remote exists, commit locally and configure a remote as soon as possible. Record any temporary push limitations in [README.md](README.md) and [CLAUDE.md](CLAUDE.md), but commits are still required locally.
41
+ - Commit style:
42
+ - Conventional types: chore, docs, feat, fix, refactor, perf, test, build, ci
43
+ - Keep commits small and atomic (one concern per commit).
44
+ - Reference important files in the commit body, for example: updated [Python.function chat_completions()](main.py:591), [README.md](README.md).
45
+ - After updating code or docs, commit immediately. Do not batch unrelated changes.
46
+
47
+ ## 2.1) Progress log (mandatory)
48
+
49
+ - Every commit MUST include a corresponding entry in [CLAUDE.md](CLAUDE.md) under a “Progress Log” section.
50
+ - Each entry must include:
51
+ - Date/time (Asia/Jakarta)
52
+ - Scope and short summary of the change
53
+ - The final Git commit hash and commit message
54
+ - Files and exact callable anchors touched (use clickable anchors), e.g. [Python.function chat_completions()](main.py:591), [README.md](README.md:1), [ARCHITECTURE.md](ARCHITECTURE.md:1)
55
+ - Verification steps and results (curl examples, expected vs actual, notes)
56
+ - Required sequence:
57
+ 1) Make code changes
58
+ 2) Update docs: [README.md](README.md), [ARCHITECTURE.md](ARCHITECTURE.md), [TODO.md](TODO.md), and add a new progress log entry in [CLAUDE.md](CLAUDE.md)
59
+ 3) Run Git commands:
60
+ - git add .
61
+ - git commit -m "type(scope): short description"
62
+ - git push
63
+ 4) Append the final commit hash to the [CLAUDE.md](CLAUDE.md) entry if it was not known at authoring time
64
+ - No code change may land without a synchronized progress log entry.
65
+
66
+ ## 3) Large artifacts policy (.gitignore)
67
+
68
+ Never commit large/generated artifacts. Keep the repository lean and reproducible.
69
+
70
+ Must be ignored:
71
+ - models/ (downloaded by HF/Transformers cache or tools at runtime)
72
+ - .venv/, venv/
73
+ - __pycache__/
74
+ - .cache/
75
+ - uploads/, data/, tmp/
76
+
77
+ See [.gitignore](.gitignore) and extend as needed for new generated outputs. If you add ignores, document the rationale in [CLAUDE.md](CLAUDE.md).
78
+
79
+ ## 4) Model policy (Hugging Face / Transformers)
80
+
81
+ Target default model:
82
+ - Qwen/Qwen3-VL-2B-Thinking (Transformers; multimodal).
83
+
84
+ Rules:
85
+ - Use Hugging Face Transformers with trust_remote_code=True (AutoProcessor plus an AutoModel* class; the engine prefers AutoModelForImageTextToText and falls back to AutoModelForVision2Seq, then AutoModelForCausalLM). A minimal loading sketch follows at the end of this section.
86
+ - Do not commit model weights or caches. Let from_pretrained() download to local caches.
87
+ - Handle authentication for gated models via HF_TOKEN in [.env.example](.env.example).
88
+ - The server must remain OpenAI-compatible at /v1/chat/completions and support multimodal inputs (text, images, videos).
89
+ - Keep configuration via environment variables (see [Python.os.getenv()](main.py:67)).
90
+
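+ A minimal loading sketch for the policy above (illustrative only; it assumes a Transformers version that exposes AutoModelForImageTextToText, while the engine in [Python.main()](main.py:1) adds fallbacks to other Auto classes, eager prefetch, and error handling):
+
+ ```python
+ # Load processor + model as described above; weights download into the HF cache.
+ import os
+ from transformers import AutoProcessor, AutoModelForImageTextToText
+
+ repo_id = os.getenv("MODEL_REPO_ID", "Qwen/Qwen3-VL-2B-Thinking")
+ token = os.getenv("HF_TOKEN") or None  # only needed for gated/private repos
+
+ processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True, token=token)
+ model = AutoModelForImageTextToText.from_pretrained(
+     repo_id,
+     trust_remote_code=True,
+     token=token,
+     device_map=os.getenv("DEVICE_MAP", "auto"),
+     torch_dtype=os.getenv("TORCH_DTYPE", "auto"),
+ ).eval()
+ ```
+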
91
+ ## 5) API contract
92
+
93
+ Provide an OpenAI-compatible endpoint:
94
+ - POST /v1/chat/completions
95
+
96
+ Minimum behavior:
97
+ - Accept model and messages per OpenAI schema (we honor messages; model is informational since server is pinned via env).
98
+ - Non-streaming JSON response (an example request is shown at the end of this section).
99
+ - Streaming SSE response when body.stream=true:
100
+ - Emit OpenAI-style chat.completion.chunk deltas.
101
+ - Include SSE id lines "session_id:index" to support resume via Last-Event-ID.
102
+
103
+ Resume semantics:
104
+ - Client provides a session_id (or server generates one).
105
+ - Client may reconnect and send Last-Event-ID: session_id:index to replay missed chunks.
106
+ - Session data can be persisted (SQLite) if enabled.
107
+
108
+ Manual cancel (custom extension):
109
+ - POST /v1/cancel/{session_id} cancels a streaming generation.
110
+ - Note: Not part of legacy OpenAI Chat Completions spec. It mirrors the spirit of the newer OpenAI Responses API cancel endpoint.
111
+
112
+ All endpoints must validate inputs, handle timeouts/failures, and return structured JSON errors.
113
+
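+ An example non-streaming multimodal request (the host, image URL, and parameter values are placeholders; `requests` is used only for illustration):
+
+ ```python
+ # OpenAI-style chat completion with a text part and an image part.
+ import requests
+
+ body = {
+     "model": "Qwen/Qwen3-VL-2B-Thinking",  # informational; the served model is pinned via env
+     "messages": [
+         {
+             "role": "user",
+             "content": [
+                 {"type": "text", "text": "What is in this image?"},
+                 {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
+             ],
+         }
+     ],
+     "max_tokens": 128,
+     "temperature": 0.2,
+ }
+ resp = requests.post("http://localhost:3000/v1/chat/completions", json=body, timeout=300)
+ print(resp.json()["choices"][0]["message"]["content"])
+ ```
+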
114
+ ## 6) Streaming, persistence, and cancellation
115
+
116
+ - Streaming is implemented via SSE in [Python.function chat_completions()](main.py:591) with token iteration in [Python.function infer_stream](main.py:375).
117
+ - In-memory ring buffer per session and optional SQLite persistence for replay across restarts:
118
+ - In-memory: [Python.class _SSESession](main.py:435), [Python.class _SessionStore](main.py:449)
119
+ - SQLite: [Python.class _SQLiteStore](main.py:482) (enabled with PERSIST_SESSIONS=1)
120
+ - Resume:
121
+ - Uses SSE id "session_id:index" and Last-Event-ID header (or ?last_event_id=...).
122
+ - Auto-cancel on disconnect:
123
+ - If all clients disconnect, generation is cancelled after CANCEL_AFTER_DISCONNECT_SECONDS (default 3600 seconds; configurable via env).
124
+ - Cooperative stop via StoppingCriteria in [Python.function infer_stream](main.py:375).
125
+ - Manual cancel:
126
+ - [Python.function cancel_session](main.py:792) to stop a session on demand; see the sketch below.
127
+
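+ A manual-cancel sketch (the host, port, and `demo-cancel` session_id are assumptions; `requests` is used only for illustration):
+
+ ```python
+ # Start a streaming request in a background thread, then cancel it by session_id.
+ import threading
+ import time
+ import requests
+
+ BASE = "http://localhost:3000"
+ SID = "demo-cancel"
+
+ def consume():
+     body = {
+         "session_id": SID,
+         "stream": True,
+         "messages": [{"role": "user", "content": "Write a very long story."}],
+     }
+     with requests.post(f"{BASE}/v1/chat/completions", json=body, stream=True) as r:
+         for _ in r.iter_lines():
+             pass  # discard chunks; the stream ends early once cancel is requested
+
+ t = threading.Thread(target=consume, daemon=True)
+ t.start()
+ time.sleep(1.0)  # give the stream a moment to start so the session exists
+
+ requests.post(f"{BASE}/v1/cancel/{SID}")  # sets the session's cancel event; the stream finishes with [DONE]
+ t.join(timeout=30)
+ ```
+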
128
+ ## 7) Logging and error handling
129
+
130
+ - Log key lifecycle stages (startup, model load, stream start/stop, resume).
131
+ - Redact sensitive fields (e.g., tokens, credentials).
132
+ - User errors → 400; model-not-ready → 503; unexpected failures → 500 (see the example below).
133
+ - Optionally add structured logging and request IDs in a follow-up.
134
+
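+ For reference, a sketch of the error shape the handlers return (FastAPI renders HTTPException as a JSON object with a `detail` field; the host is an assumption):
+
+ ```python
+ # An empty messages array triggers the 400 path in chat_completions.
+ import requests
+
+ r = requests.post("http://localhost:3000/v1/chat/completions", json={"messages": []})
+ assert r.status_code == 400
+ print(r.json())  # {"detail": "messages must be a non-empty array"}
+ ```
+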
135
+ ## 8) Architecture documentation
136
+
137
+ Keep [ARCHITECTURE.md](ARCHITECTURE.md) authoritative for:
138
+ - Startup flow and lazy model load
139
+ - Multimodal preprocessing (images/videos)
140
+ - Streaming, resume, persistence, and cancellation flows
141
+ - Error/timeout handling
142
+ - Extensibility (persistence strategies, cancellation hooks, scaling patterns)
143
+
144
+ Update when code paths or data flows change.
145
+
146
+ ## 9) TODO hygiene
147
+
148
+ Track all planned work in [TODO.md](TODO.md):
149
+ - Update statuses immediately when tasks start/complete.
150
+ - Add newly discovered tasks as soon as they are identified.
151
+ - Keep TODO focused, scoped, and prioritized.
152
+
153
+ ## 10) Operational requirements and environment
154
+
155
+ Required:
156
+ - Python: >= 3.10
157
+ - pip
158
+ - PyTorch: install a wheel matching platform/CUDA (see [requirements.txt](requirements.txt) notes)
159
+
160
+ Recommended:
161
+ - GPU with sufficient VRAM for the chosen model
162
+ - Windows 11 supported; Linux/macOS should also work
163
+
164
+ Environment variables (see [.env.example](.env.example)):
165
+ - PORT=3000
166
+ - MODEL_REPO_ID=Qwen/Qwen3-VL-2B-Thinking
167
+ - HF_TOKEN=
168
+ - MAX_TOKENS=256
169
+ - TEMPERATURE=0.7
170
+ - MAX_VIDEO_FRAMES=16
171
+ - DEVICE_MAP=auto
172
+ - TORCH_DTYPE=auto
173
+ - PERSIST_SESSIONS=1|0, SESSIONS_DB_PATH, SESSIONS_TTL_SECONDS
174
+ - CANCEL_AFTER_DISCONNECT_SECONDS=3600 (0 to disable)
175
+
176
+ ## 11) File responsibilities overview
177
+
178
+ - Server: [Python.main()](main.py:1)
179
+ - API routing, model singleton, inference, streaming, resume, cancel
180
+ - Docs: [README.md](README.md), [ARCHITECTURE.md](ARCHITECTURE.md)
181
+ - Dev log: [CLAUDE.md](CLAUDE.md)
182
+ - Tasks: [TODO.md](TODO.md)
183
+ - Config template: [.env.example](.env.example)
184
+ - Dependencies: [requirements.txt](requirements.txt)
185
+ - Ignores: [.gitignore](.gitignore)
186
+
187
+ ## 12) Workflow example (single iteration)
188
+
189
+ 1) Make a small, isolated change (e.g., enable SQLite persistence).
190
+ 2) Update docs:
191
+ - [CLAUDE.md](CLAUDE.md): what/why/how
192
+ - [README.md](README.md): operator usage changes
193
+ - [ARCHITECTURE.md](ARCHITECTURE.md): persistence/resume flow
194
+ - [TODO.md](TODO.md): status changes
195
+ 3) Commit and push:
196
+ - git add .
197
+ - git commit -m "feat(stream): add SQLite persistence for SSE resume"
198
+ - git push
199
+ 4) Verify locally; record any issues or follow-ups in [CLAUDE.md](CLAUDE.md).
200
+
201
+ ## 13) Compliance checklist (pre-merge / pre-push)
202
+
203
+ - Code runs locally (uvicorn main:app …).
204
+ - Docs updated ([README.md](README.md), [CLAUDE.md](CLAUDE.md), [ARCHITECTURE.md](ARCHITECTURE.md), [TODO.md](TODO.md)).
205
+ - No large artifacts added to git.
206
+ - Commit message follows conventional style.
207
+ - Endpoint contract honored (including streaming/resume semantics and cancel extension).
TODO.md ADDED
@@ -0,0 +1,14 @@
1
+ # TODO
2
+
3
+ - [ ] Initialize Git repository and perform initial commit.
4
+ - [x] Install necessary dependencies: `express`, `node-llama-cpp`.
5
+ - [x] Create the basic server structure in `index.js`.
6
+ - [ ] Implement the `/v1/chat/completions` endpoint.
7
+ - [ ] Load the Qwen3 model.
8
+ - [ ] Implement the inference logic using `node-llama-cpp`.
9
+ - [ ] Add error handling.
10
+ - [ ] Add logging.
11
+ - [ ] Write tests for the API endpoint.
12
+ - [ ] Update `CLAUDE.md` with detailed documentation.
13
+ - [ ] Update `ARCHITECTURE.md` with the project architecture.
14
+ - [ ] Push the initial project to the GitHub repository.
main.py ADDED
@@ -0,0 +1,1108 @@
1
+ #!/usr/bin/env python
2
+ # -*- coding: utf-8 -*-
3
+ """
4
+ FastAPI Inference Server (OpenAI-compatible) for Qwen3-VL multimodal model.
5
+
6
+ - Default model: Qwen/Qwen3-VL-2B-Thinking
7
+ - Endpoints:
8
+ * GET /openapi.yaml (OpenAPI schema in YAML)
9
+ * GET /health (readiness + context report)
10
+ * POST /v1/chat/completions (non-stream and streaming SSE)
11
+ * POST /v1/cancel/{session_id} (custom cancel endpoint)
12
+
13
+ Notes:
14
+ - Uses Hugging Face Transformers with trust_remote_code=True.
15
+ - Supports OpenAI-style chat messages with text, image_url/input_image, video_url/input_video.
16
+ - Streaming SSE supports resume (session_id + Last-Event-ID) with optional SQLite persistence.
17
+ - Auto prompt compression prevents context overflow with a simple truncate strategy.
18
+ """
19
+
20
+ import os
21
+ import io
22
+ import re
23
+ import base64
24
+ import tempfile
25
+ import contextlib
26
+ from typing import Any, Dict, List, Optional, Tuple, Deque
27
+
28
+ from fastapi import FastAPI, HTTPException, Request
29
+ from fastapi.middleware.cors import CORSMiddleware
30
+ from pydantic import BaseModel
31
+ from starlette.responses import JSONResponse
32
+ from fastapi.responses import StreamingResponse, Response
33
+ import json
34
+ import yaml
35
+ import threading
36
+ import time
37
+ import uuid
38
+ import sqlite3
39
+ from collections import deque
40
+ import subprocess
41
+ import sys
42
+ import shutil
43
+
44
+ # Load env
45
+ try:
46
+ from dotenv import load_dotenv
47
+ load_dotenv()
48
+ except Exception:
49
+ pass
50
+
51
+ # Ensure HF cache dirs are relative to this project by default
52
+ ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
53
+ DEFAULT_HF_CACHE = os.path.join(ROOT_DIR, "hf-cache")
54
+ if not os.getenv("HF_HOME"):
55
+ os.environ["HF_HOME"] = DEFAULT_HF_CACHE
56
+ if not os.getenv("TRANSFORMERS_CACHE"):
57
+ os.environ["TRANSFORMERS_CACHE"] = DEFAULT_HF_CACHE
58
+ # Create directory eagerly to avoid later mkdir races
59
+ try:
60
+ os.makedirs(os.environ["HF_HOME"], exist_ok=True)
61
+ except Exception:
62
+ pass
63
+
64
+ # Optional heavy deps are imported lazily inside Engine to improve startup UX
65
+ import requests
66
+ from PIL import Image
67
+ import numpy as np
68
+ from huggingface_hub import snapshot_download, list_repo_files, hf_hub_download, hf_hub_url, get_hf_file_metadata
69
+
70
+ # Server config
71
+ PORT = int(os.getenv("PORT", "3000"))
72
+ DEFAULT_MODEL_ID = os.getenv("MODEL_REPO_ID", "Qwen/Qwen3-VL-2B-Thinking")
73
+ HF_TOKEN = os.getenv("HF_TOKEN", "").strip() or None
74
+ DEFAULT_MAX_TOKENS = int(os.getenv("MAX_TOKENS", "256"))
75
+ DEFAULT_TEMPERATURE = float(os.getenv("TEMPERATURE", "0.7"))
76
+ MAX_VIDEO_FRAMES = int(os.getenv("MAX_VIDEO_FRAMES", "16"))
77
+ DEVICE_MAP = os.getenv("DEVICE_MAP", "auto")
78
+ TORCH_DTYPE = os.getenv("TORCH_DTYPE", "auto")
79
+
80
+ # Persistent session store (SQLite)
81
+ PERSIST_SESSIONS = str(os.getenv("PERSIST_SESSIONS", "0")).lower() in ("1", "true", "yes", "y")
82
+ SESSIONS_DB_PATH = os.getenv("SESSIONS_DB_PATH", "sessions.db")
83
+ SESSIONS_TTL_SECONDS = int(os.getenv("SESSIONS_TTL_SECONDS", "600"))
84
+ # Auto-cancel if all clients disconnect for duration (seconds). 0 disables it.
85
+ CANCEL_AFTER_DISCONNECT_SECONDS = int(os.getenv("CANCEL_AFTER_DISCONNECT_SECONDS", "3600"))
86
+
87
+ # Auto compression settings
88
+ ENABLE_AUTO_COMPRESSION = str(os.getenv("ENABLE_AUTO_COMPRESSION", "1")).lower() in ("1", "true", "yes", "y")
89
+ CONTEXT_MAX_TOKENS_AUTO = int(os.getenv("CONTEXT_MAX_TOKENS_AUTO", "0")) # 0 -> infer from model/tokenizer
90
+ CONTEXT_SAFETY_MARGIN = int(os.getenv("CONTEXT_SAFETY_MARGIN", "256"))
91
+ COMPRESSION_STRATEGY = os.getenv("COMPRESSION_STRATEGY", "truncate") # truncate | summarize (future)
92
+
93
+ # Eager model loading (download/check at startup before serving traffic)
94
+ EAGER_LOAD_MODEL = str(os.getenv("EAGER_LOAD_MODEL", "1")).lower() in ("1", "true", "yes", "y")
95
+
96
+ def _log(msg: str):
97
+ # Consistent, flush-immediate startup logs
98
+ print(f"[startup] {msg}", flush=True)
99
+
100
+ def prefetch_model_assets(repo_id: str, token: Optional[str]) -> Optional[str]:
101
+ """
102
+ Reproducible prefetch driven by huggingface-cli:
103
+ - Downloads the ENTIRE repo using CLI (visible progress bar).
104
+ - Returns the local directory path where the repo is mirrored.
105
+ - If CLI is unavailable, falls back to verbose API prefetch.
106
+ """
107
+ try:
108
+ # Enable accelerated transfer + xet if available
109
+ os.environ.setdefault("HF_HUB_ENABLE_HF_TRANSFER", "1")
110
+ os.environ.setdefault("HF_HUB_ENABLE_XET", "1")
111
+
112
+ cache_dir = os.getenv("HF_HOME") or os.getenv("TRANSFORMERS_CACHE") or ""
113
+ if cache_dir:
114
+ os.makedirs(cache_dir, exist_ok=True)
115
+
116
+ # Resolve huggingface-cli path (Windows-friendly)
117
+ cli_path = shutil.which("huggingface-cli")
118
+ if not cli_path:
119
+ candidates = []
120
+ appdata = os.getenv("APPDATA")
121
+ if appdata:
122
+ candidates.append(os.path.join(appdata, "Python", "Python312", "Scripts", "huggingface-cli.exe"))
123
+ candidates.append(os.path.join(os.path.dirname(sys.executable), "Scripts", "huggingface-cli.exe"))
124
+ cli_path = next((p for p in candidates if os.path.exists(p)), None)
125
+
126
+ # Preferred: one-shot CLI download for the whole repo (shows live progress)
127
+ if cli_path:
128
+ local_root = os.path.join(cache_dir if cache_dir else ".", repo_id.replace("/", "_"))
129
+ os.makedirs(local_root, exist_ok=True)
130
+ _log(f"Using huggingface-cli to download entire repo -> '{local_root}'")
131
+ cmd = [
132
+ cli_path,
133
+ "download",
134
+ repo_id,
135
+ "--repo-type",
136
+ "model",
137
+ "--local-dir",
138
+ local_root,
139
+ "--local-dir-use-symlinks",
140
+ "False",
141
+ "--resume",
142
+ ]
143
+ if token:
144
+ cmd += ["--token", token]
145
+ # Inherit stdio; users will see a proper progress bar
146
+ subprocess.run(cmd, check=False)
147
+ # Verify we have the essential files
148
+ if os.path.exists(os.path.join(local_root, "config.json")) or os.path.exists(os.path.join(local_root, "model.safetensors")):
149
+ _log("CLI prefetch completed")
150
+ return local_root
151
+ else:
152
+ _log("CLI prefetch finished but essential files not found; will fallback to API mirroring")
153
+
154
+ # Fallback: verbose API-driven prefetch with per-file logging
155
+ _log(f"Prefetching (API) repo={repo_id} to cache='{cache_dir}'")
156
+ try:
157
+ files = list_repo_files(repo_id, repo_type="model", token=token)
158
+ except Exception as e:
159
+ _log(f"list_repo_files failed ({type(e).__name__}: {e}); falling back to snapshot_download")
160
+ snapshot_download(repo_id, token=token, local_files_only=False)
161
+ _log("Prefetch completed (snapshot)")
162
+ return None
163
+
164
+ total = len(files)
165
+ _log(f"Found {total} files to ensure cached (API)")
166
+ for i, fn in enumerate(files, start=1):
167
+ try:
168
+ meta = get_hf_file_metadata(hf_hub_url(repo_id=repo_id, filename=fn, repo_type="model"), token=token)  # get_hf_file_metadata expects a resolved file URL
169
+ size_bytes = meta.size or 0
170
+ except Exception:
171
+ size_bytes = 0
172
+ size_mb = size_bytes / (1024 * 1024) if size_bytes else 0.0
173
+ _log(f"[{i}/{total}] fetching '{fn}' (~{size_mb:.2f} MB)")
174
+ _ = hf_hub_download(
175
+ repo_id=repo_id,
176
+ filename=fn,
177
+ repo_type="model",
178
+ token=token,
179
+ local_files_only=False,
180
+ resume_download=True,
181
+ )
182
+ _log(f"[{i}/{total}] done '{fn}'")
183
+ _log("Prefetch completed (API)")
184
+ return None
185
+ except Exception as e:
186
+ _log(f"Prefetch skipped: {type(e).__name__}: {e}")
187
+ return None
188
+
189
+ def is_data_url(url: str) -> bool:
190
+ return url.startswith("data:") and ";base64," in url
191
+
192
+
193
+ def is_http_url(url: str) -> bool:
194
+ return url.startswith("http://") or url.startswith("https://")
195
+
196
+
197
+ def decode_base64_to_bytes(b64: str) -> bytes:
198
+ # strip possible "data:*;base64," prefix
199
+ if "base64," in b64:
200
+ b64 = b64.split("base64,", 1)[1]
201
+ return base64.b64decode(b64, validate=False)
202
+
203
+
204
+ def fetch_bytes(url: str, headers: Optional[Dict[str, str]] = None, timeout: int = 60) -> bytes:
205
+ if not is_http_url(url):
206
+ raise ValueError(f"Only http(s) URLs supported for fetch, got: {url}")
207
+ resp = requests.get(url, headers=headers or {}, timeout=timeout, stream=True)
208
+ resp.raise_for_status()
209
+ return resp.content
210
+
211
+
212
+ def load_image_from_any(src: Dict[str, Any]) -> Image.Image:
213
+ """
214
+ src can be:
215
+ - { "url": "http(s)://..." } (also supports data URL)
216
+ - { "b64_json": "<base64>" }
217
+ - { "path": "local_path" } (optional)
218
+ """
219
+ if "b64_json" in src and src["b64_json"]:
220
+ data = decode_base64_to_bytes(str(src["b64_json"]))
221
+ return Image.open(io.BytesIO(data)).convert("RGB")
222
+
223
+ if "url" in src and src["url"]:
224
+ url = str(src["url"])
225
+ if is_data_url(url):
226
+ data = decode_base64_to_bytes(url)
227
+ return Image.open(io.BytesIO(data)).convert("RGB")
228
+ if is_http_url(url):
229
+ data = fetch_bytes(url)
230
+ return Image.open(io.BytesIO(data)).convert("RGB")
231
+ # treat as local path
232
+ if os.path.exists(url):
233
+ with open(url, "rb") as f:
234
+ return Image.open(io.BytesIO(f.read())).convert("RGB")
235
+ raise ValueError(f"Invalid image url/path: {url}")
236
+
237
+ if "path" in src and src["path"]:
238
+ p = str(src["path"])
239
+ if os.path.exists(p):
240
+ with open(p, "rb") as f:
241
+ return Image.open(io.BytesIO(f.read())).convert("RGB")
242
+ raise ValueError(f"Image path not found: {p}")
243
+
244
+ raise ValueError("Unsupported image source payload")
245
+
246
+
247
+ def write_bytes_tempfile(data: bytes, suffix: str) -> str:
248
+ tmp = tempfile.NamedTemporaryFile(delete=False, suffix=suffix)
249
+ with tmp as f:
250
+ f.write(data)
251
+ return tmp.name
252
+
253
+
254
+ def load_video_frames_from_any(src: Dict[str, Any], max_frames: int = MAX_VIDEO_FRAMES) -> List[Image.Image]:
255
+ """
256
+ Returns a list of PIL.Image frames (RGB) sampled up to max_frames.
257
+ src can be:
258
+ - { "url": "http(s)://..." } (mp4/mov/webm/etc.)
259
+ - { "b64_json": "<base64 of a video file>" }
260
+ - { "path": "local_path" }
261
+ """
262
+ # Prefer imageio.v3 if present, fallback to OpenCV
263
+ # We load all frames then uniform sample if too many.
264
+ def _load_all_frames(path: str) -> List[Image.Image]:
265
+ frames: List[Image.Image] = []
266
+ with contextlib.suppress(ImportError):
267
+ import imageio.v3 as iio
268
+ arr_iter = iio.imiter(path) # yields numpy arrays HxWxC
269
+ for arr in arr_iter:
270
+ if arr is None:
271
+ continue
272
+ if arr.ndim == 2:
273
+ arr = np.stack([arr, arr, arr], axis=-1)
274
+ if arr.shape[-1] == 4:
275
+ arr = arr[..., :3]
276
+ frames.append(Image.fromarray(arr).convert("RGB"))
277
+ return frames
278
+
279
+ # Fallback to OpenCV
280
+ import cv2 # type: ignore
281
+ cap = cv2.VideoCapture(path)
282
+ ok, frame = cap.read()
283
+ while ok:
284
+ frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
285
+ frames.append(Image.fromarray(frame))
286
+ ok, frame = cap.read()
287
+ cap.release()
288
+ return frames
289
+
290
+ # Resolve to a local path
291
+ local_path = None
292
+ if "b64_json" in src and src["b64_json"]:
293
+ data = decode_base64_to_bytes(str(src["b64_json"]))
294
+ local_path = write_bytes_tempfile(data, suffix=".mp4")
295
+ elif "url" in src and src["url"]:
296
+ url = str(src["url"])
297
+ if is_data_url(url):
298
+ data = decode_base64_to_bytes(url)
299
+ local_path = write_bytes_tempfile(data, suffix=".mp4")
300
+ elif is_http_url(url):
301
+ data = fetch_bytes(url)
302
+ local_path = write_bytes_tempfile(data, suffix=".mp4")
303
+ elif os.path.exists(url):
304
+ local_path = url
305
+ else:
306
+ raise ValueError(f"Invalid video url/path: {url}")
307
+ elif "path" in src and src["path"]:
308
+ p = str(src["path"])
309
+ if os.path.exists(p):
310
+ local_path = p
311
+ else:
312
+ raise ValueError(f"Video path not found: {p}")
313
+ else:
314
+ raise ValueError("Unsupported video source payload")
315
+
316
+ frames = _load_all_frames(local_path)
317
+ # Uniform sample if too many frames
318
+ if len(frames) > max_frames and max_frames > 0:
319
+ idxs = np.linspace(0, len(frames) - 1, max_frames).astype(int).tolist()
320
+ frames = [frames[i] for i in idxs]
321
+ return frames
322
+
323
+
324
+ class ChatRequest(BaseModel):
325
+ model: Optional[str] = None
326
+ messages: List[Dict[str, Any]]
327
+ max_tokens: Optional[int] = None
328
+ temperature: Optional[float] = None
329
+ stream: Optional[bool] = None
330
+ session_id: Optional[str] = None
331
+
332
+
333
+ class Engine:
334
+ def __init__(self, model_id: str, hf_token: Optional[str] = None):
335
+ # Lazy import heavy deps
336
+ from transformers import AutoProcessor, AutoModelForCausalLM, AutoModelForVision2Seq, AutoModel
337
+ # AutoModelForImageTextToText is the v5+ replacement for Vision2Seq in Transformers
338
+ try:
339
+ from transformers import AutoModelForImageTextToText # type: ignore
340
+ except Exception:
341
+ AutoModelForImageTextToText = None # type: ignore
342
+
343
+ model_kwargs: Dict[str, Any] = {
344
+ "trust_remote_code": True,
345
+ }
346
+ if hf_token:
347
+ # Only pass 'token' (use_auth_token is deprecated and causes conflicts)
348
+ model_kwargs["token"] = hf_token
349
+ # Device and dtype
350
+ model_kwargs["device_map"] = DEVICE_MAP
351
+ model_kwargs["torch_dtype"] = TORCH_DTYPE if TORCH_DTYPE != "auto" else "auto"
352
+
353
+ # Processor (handles text + images/videos)
354
+ proc_kwargs: Dict[str, Any] = {"trust_remote_code": True}
355
+ if hf_token:
356
+ proc_kwargs["token"] = hf_token
357
+ self.processor = AutoProcessor.from_pretrained(
358
+ model_id,
359
+ **proc_kwargs,
360
+ ) # pragma: no cover
361
+
362
+ # Prefer ImageTextToText (Transformers v5 path), then Vision2Seq, then CausalLM as a last resort
363
+ model = None
364
+ if AutoModelForImageTextToText is not None:  # local import above; None when this Transformers version lacks the class
365
+ try:
366
+ model = AutoModelForImageTextToText.from_pretrained(model_id, **model_kwargs) # pragma: no cover
367
+ except Exception:
368
+ model = None
369
+ if model is None:
370
+ try:
371
+ model = AutoModelForVision2Seq.from_pretrained(model_id, **model_kwargs) # pragma: no cover
372
+ except Exception:
373
+ model = None
374
+ if model is None:
375
+ try:
376
+ model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs) # pragma: no cover
377
+ except Exception:
378
+ model = None
379
+ if model is None:
380
+ # Generic AutoModel as last-resort with trust_remote_code to load custom architectures
381
+ model = AutoModel.from_pretrained(model_id, **model_kwargs) # pragma: no cover
382
+ self.model = model.eval() # pragma: no cover
383
+
384
+ self.model_id = model_id
385
+ self.tokenizer = getattr(self.processor, "tokenizer", None)
386
+ self.last_context_info: Dict[str, Any] = {}
387
+
388
+ def _model_max_context(self) -> int:
389
+ try:
390
+ cfg = getattr(self.model, "config", None)
391
+ if cfg is not None:
392
+ v = getattr(cfg, "max_position_embeddings", None)
393
+ if isinstance(v, int) and v > 0 and v < 10_000_000:
394
+ return v
395
+ except Exception:
396
+ pass
397
+ try:
398
+ mx = int(getattr(self.tokenizer, "model_max_length", 0) or 0)
399
+ if mx > 0 and mx < 10_000_000_000:
400
+ return mx
401
+ except Exception:
402
+ pass
403
+ return 32768
404
+
405
+ def _count_prompt_tokens(self, text: str) -> int:
406
+ try:
407
+ if self.tokenizer is not None:
408
+ enc = self.tokenizer([text], add_special_tokens=False, return_attention_mask=False)
409
+ ids = enc["input_ids"][0]
410
+ return len(ids)
411
+ except Exception:
412
+ pass
413
+ return max(1, int(len(text.split()) * 1.3))
414
+
415
+ def _auto_compress_if_needed(
416
+ self, mm_messages: List[Dict[str, Any]], max_new_tokens: int
417
+ ) -> Tuple[List[Dict[str, Any]], Dict[str, Any]]:
418
+ info: Dict[str, Any] = {}
419
+ # Build once to measure
420
+ text0 = self.processor.apply_chat_template(mm_messages, tokenize=False, add_generation_prompt=True)
421
+ prompt_tokens = self._count_prompt_tokens(text0)
422
+ max_ctx = CONTEXT_MAX_TOKENS_AUTO if CONTEXT_MAX_TOKENS_AUTO > 0 else self._model_max_context()
423
+ budget = max(1024, max_ctx - CONTEXT_SAFETY_MARGIN - int(max_new_tokens))
424
+ if not ENABLE_AUTO_COMPRESSION or prompt_tokens <= budget:
425
+ info = {
426
+ "compressed": False,
427
+ "prompt_tokens": int(prompt_tokens),
428
+ "max_context": int(max_ctx),
429
+ "budget": int(budget),
430
+ "strategy": COMPRESSION_STRATEGY,
431
+ "dropped_messages": 0,
432
+ }
433
+ return mm_messages, info
434
+
435
+ # Truncate earliest non-system messages until within budget
436
+ msgs = list(mm_messages)
437
+ dropped = 0
438
+ guard = 0
439
+ while True:
440
+ text = self.processor.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
441
+ prompt_tokens = self._count_prompt_tokens(text)
442
+ if prompt_tokens <= budget or len(msgs) <= 1:
443
+ break
444
+ # drop earliest non-system
445
+ drop_idx = None
446
+ for j, m in enumerate(msgs):
447
+ if (m.get("role") or "user") != "system":
448
+ drop_idx = j
449
+ break
450
+ if drop_idx is None:
451
+ break
452
+ msgs.pop(drop_idx)
453
+ dropped += 1
454
+ guard += 1
455
+ if guard > 10000:
456
+ break
457
+
458
+ info = {
459
+ "compressed": True,
460
+ "prompt_tokens": int(prompt_tokens),
461
+ "max_context": int(max_ctx),
462
+ "budget": int(budget),
463
+ "strategy": "truncate",
464
+ "dropped_messages": int(dropped),
465
+ }
466
+ return msgs, info
467
+
468
+ def get_context_report(self) -> Dict[str, Any]:
469
+ try:
470
+ tk_max = int(getattr(self.tokenizer, "model_max_length", 0) or 0)
471
+ except Exception:
472
+ tk_max = 0
473
+ return {
474
+ "compressionEnabled": ENABLE_AUTO_COMPRESSION,
475
+ "strategy": COMPRESSION_STRATEGY,
476
+ "safetyMargin": CONTEXT_SAFETY_MARGIN,
477
+ "modelMaxContext": self._model_max_context(),
478
+ "tokenizerModelMaxLength": tk_max,
479
+ "last": self.last_context_info or {},
480
+ }
481
+
482
+ def build_mm_messages(
483
+ self, openai_messages: List[Dict[str, Any]]
484
+ ) -> Tuple[List[Dict[str, Any]], List[Image.Image], List[List[Image.Image]]]:
485
+ """
486
+ Convert OpenAI-style messages to Qwen multimodal messages.
487
+ Returns:
488
+ - messages for apply_chat_template
489
+ - flat list of images in encounter order
490
+ - list of videos (each is list of PIL frames)
491
+ """
492
+ mm_msgs: List[Dict[str, Any]] = []
493
+ images: List[Image.Image] = []
494
+ videos: List[List[Image.Image]] = []
495
+
496
+ for msg in openai_messages:
497
+ role = msg.get("role", "user")
498
+ content = msg.get("content", "")
499
+
500
+ parts: List[Dict[str, Any]] = []
501
+
502
+ if isinstance(content, str):
503
+ if content:
504
+ parts.append({"type": "text", "text": content})
505
+ elif isinstance(content, list):
506
+ for p in content:
507
+ ptype = p.get("type")
508
+ if ptype == "text":
509
+ txt = p.get("text", "")
510
+ if txt:
511
+ parts.append({"type": "text", "text": txt})
512
+ elif ptype in ("image_url", "input_image"):
513
+ src: Dict[str, Any] = {}
514
+ if ptype == "image_url":
515
+ u = (p.get("image_url") or {}).get("url") if isinstance(p.get("image_url"), dict) else p.get("image_url")
516
+ src["url"] = u
517
+ else:
518
+ b64 = p.get("image") or p.get("b64_json") or p.get("data") or (p.get("image_url") or {}).get("url")
519
+ if b64:
520
+ src["b64_json"] = b64
521
+ try:
522
+ img = load_image_from_any(src)
523
+ images.append(img)
524
+ parts.append({"type": "image", "image": img})
525
+ except Exception as e:
526
+ raise ValueError(f"Failed to parse image part: {e}") from e
527
+ elif ptype in ("video_url", "input_video"):
528
+ src = {}
529
+ if ptype == "video_url":
530
+ u = (p.get("video_url") or {}).get("url") if isinstance(p.get("video_url"), dict) else p.get("video_url")
531
+ src["url"] = u
532
+ else:
533
+ b64 = p.get("video") or p.get("b64_json") or p.get("data")
534
+ if b64:
535
+ src["b64_json"] = b64
536
+ try:
537
+ frames = load_video_frames_from_any(src, max_frames=MAX_VIDEO_FRAMES)
538
+ videos.append(frames)
539
+ parts.append({"type": "video", "video": frames})
540
+ except Exception as e:
541
+ raise ValueError(f"Failed to parse video part: {e}") from e
542
+ else:
543
+ if isinstance(p, dict):
544
+ txt = p.get("text")
545
+ if isinstance(txt, str) and txt:
546
+ parts.append({"type": "text", "text": txt})
547
+ else:
548
+ if content:
549
+ parts.append({"type": "text", "text": str(content)})
550
+
551
+ mm_msgs.append({"role": role, "content": parts})
552
+
553
+ return mm_msgs, images, videos
554
+
555
+ def infer(self, messages: List[Dict[str, Any]], max_tokens: int, temperature: float) -> str:
556
+ mm_messages, images, videos = self.build_mm_messages(messages)
557
+ # Auto-compress if needed based on context budget
558
+ mm_messages, ctx_info = self._auto_compress_if_needed(mm_messages, max_tokens)
559
+ self.last_context_info = ctx_info
560
+
561
+ # Build chat template
562
+ text = self.processor.apply_chat_template(
563
+ mm_messages,
564
+ tokenize=False,
565
+ add_generation_prompt=True,
566
+ )
567
+
568
+ proc_kwargs: Dict[str, Any] = {"text": [text], "return_tensors": "pt"}
569
+ if images:
570
+ proc_kwargs["images"] = images
571
+ if videos:
572
+ proc_kwargs["videos"] = videos
573
+
574
+ inputs = self.processor(**proc_kwargs)
575
+ # Move tensors to model device if present
576
+ try:
577
+ device = getattr(self.model, "device", None) or next(self.model.parameters()).device
578
+ inputs = {k: (v.to(device) if hasattr(v, "to") else v) for k, v in inputs.items()}
579
+ except Exception:
580
+ pass
581
+
582
+ do_sample = temperature is not None and float(temperature) > 0.0
583
+
584
+ gen_ids = self.model.generate(
585
+ **inputs,
586
+ max_new_tokens=int(max_tokens),
587
+ temperature=float(temperature),
588
+ do_sample=do_sample,
589
+ use_cache=True,
590
+ )
591
+ # Decode
592
+ output = self.processor.batch_decode(
593
+ gen_ids,
594
+ skip_special_tokens=True,
595
+ clean_up_tokenization_spaces=False,
596
+ )[0]
597
+
598
+ # Best-effort: return only the assistant reply after the last template marker if present
599
+ parts = re.split(r"\n?assistant:\s*", output, flags=re.IGNORECASE)
600
+ if len(parts) >= 2:
601
+ return parts[-1].strip()
602
+ return output.strip()
603
+
604
+ def infer_stream(
605
+ self,
606
+ messages: List[Dict[str, Any]],
607
+ max_tokens: int,
608
+ temperature: float,
609
+ cancel_event: Optional[threading.Event] = None,
610
+ ):
611
+ from transformers import TextIteratorStreamer, StoppingCriteria, StoppingCriteriaList
612
+
613
+ mm_messages, images, videos = self.build_mm_messages(messages)
614
+ # Auto-compress if needed based on context budget
615
+ mm_messages, ctx_info = self._auto_compress_if_needed(mm_messages, max_tokens)
616
+ self.last_context_info = ctx_info
617
+
618
+ text = self.processor.apply_chat_template(
619
+ mm_messages,
620
+ tokenize=False,
621
+ add_generation_prompt=True,
622
+ )
623
+
624
+ proc_kwargs: Dict[str, Any] = {"text": [text], "return_tensors": "pt"}
625
+ if images:
626
+ proc_kwargs["images"] = images
627
+ if videos:
628
+ proc_kwargs["videos"] = videos
629
+
630
+ inputs = self.processor(**proc_kwargs)
631
+ try:
632
+ device = getattr(self.model, "device", None) or next(self.model.parameters()).device
633
+ inputs = {k: (v.to(device) if hasattr(v, "to") else v) for k, v in inputs.items()}
634
+ except Exception:
635
+ pass
636
+
637
+ do_sample = temperature is not None and float(temperature) > 0.0
638
+
639
+ streamer = TextIteratorStreamer(
640
+ getattr(self.processor, "tokenizer", None),
641
+ skip_prompt=True,
642
+ skip_special_tokens=True,
643
+ )
644
+
645
+ gen_kwargs = dict(
646
+ **inputs,
647
+ max_new_tokens=int(max_tokens),
648
+ temperature=float(temperature),
649
+ do_sample=do_sample,
650
+ use_cache=True,
651
+ streamer=streamer,
652
+ )
653
+
654
+ # Optional cooperative cancellation via StoppingCriteria
655
+ if cancel_event is not None:
656
+ class _CancelCrit(StoppingCriteria):
657
+ def __init__(self, ev: threading.Event):
658
+ self.ev = ev
659
+
660
+ def __call__(self, input_ids, scores, **kwargs):
661
+ return bool(self.ev.is_set())
662
+
663
+ gen_kwargs["stopping_criteria"] = StoppingCriteriaList([_CancelCrit(cancel_event)])
664
+
665
+ th = threading.Thread(target=self.model.generate, kwargs=gen_kwargs)
666
+ th.start()
667
+
668
+ for piece in streamer:
669
+ if piece:
670
+ yield piece
671
+
672
+
673
+ # Simple in-memory resumable SSE session store + optional SQLite persistence
674
+ class _SSESession:
675
+ def __init__(self, maxlen: int = 2048, ttl_seconds: int = 600):
676
+ self.buffer: Deque[Tuple[int, str]] = deque(maxlen=maxlen) # (idx, sse_line_block)
677
+ self.last_idx: int = -1
678
+ self.created: float = time.time()
679
+ self.finished: bool = False
680
+ self.cond = threading.Condition()
681
+ self.thread: Optional[threading.Thread] = None
682
+ self.ttl_seconds = ttl_seconds
683
+ # Cancellation + client tracking
684
+ self.cancel_event = threading.Event()
685
+ self.listeners: int = 0
686
+ self.cancel_timer = None # type: ignore
687
+
688
+
689
+ class _SessionStore:
690
+ def __init__(self, ttl_seconds: int = 600, max_sessions: int = 256):
691
+ self._sessions: Dict[str, _SSESession] = {}
692
+ self._lock = threading.Lock()
693
+ self._ttl = ttl_seconds
694
+ self._max_sessions = max_sessions
695
+
696
+ def get_or_create(self, sid: str) -> _SSESession:
697
+ with self._lock:
698
+ sess = self._sessions.get(sid)
699
+ if sess is None:
700
+ sess = _SSESession(ttl_seconds=self._ttl)
701
+ self._sessions[sid] = sess
702
+ return sess
703
+
704
+ def get(self, sid: str) -> Optional[_SSESession]:
705
+ with self._lock:
706
+ return self._sessions.get(sid)
707
+
708
+ def gc(self):
709
+ now = time.time()
710
+ with self._lock:
711
+ # remove expired
712
+ expired = [k for k, v in self._sessions.items() if (now - v.created) > self._ttl or (v.finished and (now - v.created) > self._ttl / 4)]
713
+ for k in expired:
714
+ self._sessions.pop(k, None)
715
+ # bound session count
716
+ if len(self._sessions) > self._max_sessions:
717
+ for k, _ in sorted(self._sessions.items(), key=lambda kv: kv[1].created)[: max(0, len(self._sessions) - self._max_sessions)]:
718
+ self._sessions.pop(k, None)
719
+
720
+
721
+ class _SQLiteStore:
722
+ def __init__(self, db_path: str):
723
+ self.db_path = db_path
724
+ self._lock = threading.Lock()
725
+ self._conn = sqlite3.connect(self.db_path, check_same_thread=False)
726
+ self._conn.execute("PRAGMA journal_mode=WAL;")
727
+ self._conn.execute("PRAGMA synchronous=NORMAL;")
728
+ self._ensure_schema()
729
+
730
+ def _ensure_schema(self):
731
+ cur = self._conn.cursor()
732
+ cur.execute(
733
+ "CREATE TABLE IF NOT EXISTS sessions (session_id TEXT PRIMARY KEY, created REAL, finished INTEGER DEFAULT 0)"
734
+ )
735
+ cur.execute(
736
+ "CREATE TABLE IF NOT EXISTS events (session_id TEXT, idx INTEGER, data TEXT, created REAL, PRIMARY KEY(session_id, idx))"
737
+ )
738
+ cur.execute("CREATE INDEX IF NOT EXISTS idx_events_session ON events(session_id, idx)")
739
+ self._conn.commit()
740
+
741
+ def ensure_session(self, session_id: str, created: int):
742
+ with self._lock:
743
+ self._conn.execute(
744
+ "INSERT OR IGNORE INTO sessions(session_id, created, finished) VALUES (?, ?, 0)",
745
+ (session_id, float(created)),
746
+ )
747
+ self._conn.commit()
748
+
749
+ def append_event(self, session_id: str, idx: int, payload: Dict[str, Any]):
750
+ data = json.dumps(payload, ensure_ascii=False)
751
+ with self._lock:
752
+ self._conn.execute(
753
+ "INSERT OR REPLACE INTO events(session_id, idx, data, created) VALUES (?, ?, ?, ?)",
754
+ (session_id, idx, data, time.time()),
755
+ )
756
+ self._conn.commit()
757
+
758
+ def get_events_after(self, session_id: str, last_idx: int) -> List[Tuple[int, str]]:
759
+ with self._lock:
760
+ cur = self._conn.execute(
761
+ "SELECT idx, data FROM events WHERE session_id=? AND idx>? ORDER BY idx ASC", (session_id, last_idx)
762
+ )
763
+ return [(int(r[0]), str(r[1])) for r in cur.fetchall()]
764
+
765
+ def mark_finished(self, session_id: str):
766
+ with self._lock:
767
+ self._conn.execute("UPDATE sessions SET finished=1 WHERE session_id=?", (session_id,))
768
+ self._conn.commit()
769
+
770
+ def session_meta(self, session_id: str) -> Tuple[bool, int]:
771
+ with self._lock:
772
+ row = self._conn.execute("SELECT finished FROM sessions WHERE session_id=?", (session_id,)).fetchone()
773
+ finished = bool(row[0]) if row else False
774
+ row2 = self._conn.execute("SELECT MAX(idx) FROM events WHERE session_id=?", (session_id,)).fetchone()
775
+ last_idx = int(row2[0]) if row2 and row2[0] is not None else -1
776
+ return finished, last_idx
777
+
778
+ def gc(self, ttl_seconds: int):
779
+ cutoff = time.time() - float(ttl_seconds)
780
+ with self._lock:
781
+ cur = self._conn.execute("SELECT session_id FROM sessions WHERE finished=1 AND created<?", (cutoff,))
782
+ ids = [r[0] for r in cur.fetchall()]
783
+ for sid in ids:
784
+ self._conn.execute("DELETE FROM events WHERE session_id=?", (sid,))
785
+ self._conn.execute("DELETE FROM sessions WHERE session_id=?", (sid,))
786
+ self._conn.commit()
787
+
788
+
789
+ def _sse_event(session_id: str, idx: int, payload: Dict[str, Any]) -> str:
790
+ # Include SSE id line so clients can send Last-Event-ID to resume.
791
+ return f"id: {session_id}:{idx}\n" + f"data: {json.dumps(payload, ensure_ascii=False)}\n\n"
792
+
793
+
794
+ _STORE = _SessionStore()
795
+ _DB_STORE = _SQLiteStore(SESSIONS_DB_PATH) if PERSIST_SESSIONS else None
796
+
797
+ # FastAPI app and OpenAPI tags
798
+ tags_metadata = [
799
+ {"name": "meta", "description": "Service metadata and OpenAPI schema"},
800
+ {"name": "health", "description": "Readiness and runtime info including context window report"},
801
+ {"name": "chat", "description": "OpenAI-compatible chat completions (non-stream and streaming SSE)"},
802
+ ]
803
+
804
+ app = FastAPI(
805
+ title="Qwen3-VL Inference Server",
806
+ version="1.0.0",
807
+ description="OpenAI-compatible inference server for Qwen3-VL with multimodal support, streaming SSE with resume, context auto-compression, and optional SQLite persistence.",
808
+ openapi_tags=tags_metadata,
809
+ )
810
+ app.add_middleware(
811
+ CORSMiddleware,
812
+ allow_origins=["*"],
813
+ allow_methods=["*"],
814
+ allow_headers=["*"],
815
+ )
816
+
817
+ # Startup hook is defined after get_engine() so globals are initialized first.
818
+
819
+ # Engine singletons
820
+ _engine: Optional[Engine] = None
821
+ _engine_error: Optional[str] = None
822
+
823
+
824
+ def get_engine() -> Engine:
825
+ global _engine, _engine_error
826
+ if _engine is not None:
827
+ return _engine
828
+ try:
829
+ model_id = DEFAULT_MODEL_ID
830
+ _log(f"Preparing model '{model_id}' (HF_HOME={os.getenv('HF_HOME')}, cache={os.getenv('TRANSFORMERS_CACHE')})")
831
+ local_repo_dir = prefetch_model_assets(model_id, HF_TOKEN)
832
+ load_id = local_repo_dir if (local_repo_dir and os.path.exists(os.path.join(local_repo_dir, 'config.json'))) else model_id
833
+ _log(f"Loading processor and model from: {load_id}")
834
+ _engine = Engine(model_id=load_id, hf_token=HF_TOKEN)
835
+ _engine_error = None
836
+ _log(f"Model ready: {_engine.model_id}")
837
+ return _engine
838
+ except Exception as e:
839
+ _engine_error = f"{type(e).__name__}: {e}"
840
+ _log(f"Engine init failed: {_engine_error}")
841
+ raise
842
+
843
+ # Eager-load model at startup after definitions so it downloads/checks before serving traffic.
844
+ @app.on_event("startup")
845
+ def _startup_load_model():
846
+ if EAGER_LOAD_MODEL:
847
+ print("[startup] EAGER_LOAD_MODEL=1: initializing model...")
848
+ try:
849
+ _ = get_engine()
850
+ print("[startup] Model loaded:", _engine.model_id if _engine else "unknown")
851
+ except Exception as e:
852
+ # Fail fast if model cannot be initialized
853
+ print("[startup] Model load failed:", e)
854
+ raise
855
+
856
+
857
+ @app.get("/", tags=["meta"])
858
+ def root():
859
+ """Liveness check."""
860
+ return JSONResponse({"ok": True})
861
+
862
+
863
+ @app.get("/openapi.yaml", tags=["meta"])
864
+ def openapi_yaml():
865
+ """Serve OpenAPI schema as YAML for tooling compatibility."""
866
+ schema = app.openapi()
867
+ yml = yaml.safe_dump(schema, sort_keys=False)
868
+ return Response(yml, media_type="application/yaml")
869
+
870
+
871
+ @app.get("/health", tags=["health"])
872
+ def health():
873
+ ready = False
874
+ err = None
875
+ model_id = DEFAULT_MODEL_ID
876
+ global _engine, _engine_error
877
+ if _engine is not None:
878
+ ready = True
879
+ model_id = _engine.model_id
880
+ elif _engine_error:
881
+ err = _engine_error
882
+ ctx = None
883
+ try:
884
+ if _engine is not None:
885
+ ctx = _engine.get_context_report()
886
+ except Exception:
887
+ ctx = None
888
+ return JSONResponse({"ok": True, "modelReady": ready, "modelId": model_id, "error": err, "context": ctx})
889
+
890
+
891
+ @app.post("/v1/chat/completions", tags=["chat"])
892
+ def chat_completions(request: Request, body: ChatRequest):
893
+ # Ensure engine is loaded
894
+ try:
895
+ engine = get_engine()
896
+ except Exception as e:
897
+ raise HTTPException(status_code=503, detail=f"Model not ready: {e}")
898
+
899
+ if not body or not isinstance(body.messages, list) or len(body.messages) == 0:
900
+ raise HTTPException(status_code=400, detail="messages must be a non-empty array")
901
+
902
+ max_tokens = int(body.max_tokens) if isinstance(body.max_tokens, int) else DEFAULT_MAX_TOKENS
903
+ temperature = float(body.temperature) if body.temperature is not None else DEFAULT_TEMPERATURE
904
+ do_stream = bool(body.stream)
905
+
906
+ # Parse Last-Event-ID for resuming and derive/align session_id
907
+ last_event_id_header = request.headers.get("last-event-id")
908
+ sid_from_header: Optional[str] = None
909
+ last_idx_from_header: int = -1
910
+ if last_event_id_header:
911
+ try:
912
+ sid_from_header, idx_str = last_event_id_header.split(":", 1)
913
+ last_idx_from_header = int(idx_str)
914
+ except Exception:
915
+ sid_from_header = None
916
+ last_idx_from_header = -1
917
+
918
+ session_id = body.session_id or sid_from_header or f"sess-{uuid.uuid4().hex[:12]}"
919
+ sess = _STORE.get_or_create(session_id)
920
+ created_ts = int(sess.created)
921
+ if _DB_STORE is not None:
922
+ _DB_STORE.ensure_session(session_id, created_ts)
923
+
924
+ if not do_stream:
925
+ # Non-streaming path
926
+ try:
927
+ content = engine.infer(body.messages, max_tokens=max_tokens, temperature=temperature)
928
+ except ValueError as e:
929
+ # Parsing/user payload errors from engine -> HTTP 400
930
+ raise HTTPException(status_code=400, detail=str(e))
931
+ except Exception as e:
932
+ raise HTTPException(status_code=500, detail=f"Inference error: {e}")
933
+
934
+ now = int(time.time())
935
+ prompt_tokens = int((engine.last_context_info or {}).get("prompt_tokens") or 0)
936
+ completion_tokens = max(1, len((content or "").split()))
937
+ total_tokens = prompt_tokens + completion_tokens
938
+ resp: Dict[str, Any] = {
939
+ "id": f"chatcmpl-{uuid.uuid4().hex[:12]}",
940
+ "object": "chat.completion",
941
+ "created": now,
942
+ "model": engine.model_id,
943
+ "choices": [
944
+ {
945
+ "index": 0,
946
+ "message": {"role": "assistant", "content": content},
947
+ "finish_reason": "stop",
948
+ }
949
+ ],
950
+ "usage": {
951
+ "prompt_tokens": prompt_tokens,
952
+ "completion_tokens": completion_tokens,
953
+ "total_tokens": total_tokens,
954
+ },
955
+ "context": engine.last_context_info or {},
956
+ }
957
+ return JSONResponse(resp)
958
+
959
+ # Streaming SSE with resumable support
960
+ def sse_generator():
961
+ # Manage listener count and cancel timer
962
+ sess.listeners += 1
963
+ try:
964
+ # Cancel any pending cancel timer when a listener attaches
965
+ if getattr(sess, "cancel_timer", None):
966
+ try:
967
+ sess.cancel_timer.cancel()
968
+ except Exception:
969
+ pass
970
+ sess.cancel_timer = None
971
+
972
+ # Replay if Last-Event-ID was provided
973
+ replay_from = last_idx_from_header if sid_from_header == session_id else -1
974
+ if replay_from >= -1:
975
+ # First try in-memory buffer
976
+ for idx, block in list(sess.buffer):
977
+ if idx > replay_from:
978
+ yield block.encode("utf-8")
979
+ # Optionally pull from SQLite persistence
980
+ if _DB_STORE is not None:
981
+ try:
982
+ for idx, data in _DB_STORE.get_events_after(session_id, replay_from):
983
+ block = f"id: {session_id}:{idx}\n" + f"data: {data}\n\n"
984
+ yield block.encode("utf-8")
985
+ except Exception:
986
+ pass
987
+ if sess.finished:
988
+ # Already finished; send terminal and exit
989
+ yield b"data: [DONE]\n\n"
990
+ return
991
+
992
+ # Fresh generation path
993
+ # Helper to append to buffers and yield to client
994
+ def push(payload: Dict[str, Any]):
995
+ sess.last_idx += 1
996
+ idx = sess.last_idx
997
+ block = _sse_event(session_id, idx, payload)
998
+ sess.buffer.append((idx, block))
999
+ if _DB_STORE is not None:
1000
+ try:
1001
+ _DB_STORE.append_event(session_id, idx, payload)
1002
+ except Exception:
1003
+ pass
1004
+ return block
1005
+
1006
+ # Initial assistant role delta
1007
+ head = {
1008
+ "id": session_id,
1009
+ "object": "chat.completion.chunk",
1010
+ "created": int(time.time()),
1011
+ "model": engine.model_id,
1012
+ "choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": None}],
1013
+ "system_fingerprint": "fastapi",
1014
+ }
1015
+ yield push(head).encode("utf-8")
1016
+
1017
+ # Stream model pieces
1018
+ try:
1019
+ for piece in engine.infer_stream(
1020
+ body.messages, max_tokens=max_tokens, temperature=temperature, cancel_event=sess.cancel_event
1021
+ ):
1022
+ if not piece:
1023
+ continue
1024
+ payload = {
1025
+ "id": session_id,
1026
+ "object": "chat.completion.chunk",
1027
+ "created": int(time.time()),
1028
+ "model": engine.model_id,
1029
+ "choices": [{"index": 0, "delta": {"content": piece}, "finish_reason": None}],
1030
+ }
1031
+ yield push(payload).encode("utf-8")
1032
+ # Cooperative early-exit if cancel requested
1033
+ if sess.cancel_event.is_set():
1034
+ break
1035
+ except Exception:
1036
+ # On engine error, terminate gracefully
1037
+ pass
1038
+
1039
+ # Finish chunk
1040
+ finish = {
1041
+ "id": session_id,
1042
+ "object": "chat.completion.chunk",
1043
+ "created": int(time.time()),
1044
+ "model": engine.model_id,
1045
+ "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}],
1046
+ }
1047
+ yield push(finish).encode("utf-8")
1048
+
1049
+ finally:
1050
+ # Mark finished and persist
1051
+ sess.finished = True
1052
+ if _DB_STORE is not None:
1053
+ try:
1054
+ _DB_STORE.mark_finished(session_id)
1055
+ # Optionally GC older finished sessions
1056
+ _DB_STORE.gc(SESSIONS_TTL_SECONDS)
1057
+ except Exception:
1058
+ pass
1059
+
1060
+ # Always send terminal [DONE]
1061
+ yield b"data: [DONE]\n\n"
1062
+
1063
+ # Listener bookkeeping and optional auto-cancel if all disconnect
1064
+ try:
1065
+ sess.listeners = max(0, sess.listeners - 1)
1066
+ if sess.listeners == 0 and CANCEL_AFTER_DISCONNECT_SECONDS > 0 and not sess.cancel_event.is_set():
1067
+ def _later_cancel():
1068
+ # If still no listeners, cancel
1069
+ if sess.listeners == 0 and not sess.cancel_event.is_set():
1070
+ sess.cancel_event.set()
1071
+ sess.cancel_timer = threading.Timer(CANCEL_AFTER_DISCONNECT_SECONDS, _later_cancel)
1072
+ sess.cancel_timer.daemon = True
1073
+ sess.cancel_timer.start()
1074
+ except Exception:
1075
+ pass
1076
+
1077
+ # In-memory store GC
1078
+ try:
1079
+ _STORE.gc()
1080
+ except Exception:
1081
+ pass
1082
+
1083
+ headers = {
1084
+ "Cache-Control": "no-cache",
1085
+ "Connection": "keep-alive",
1086
+ "X-Accel-Buffering": "no",
1087
+ }
1088
+ return StreamingResponse(sse_generator(), media_type="text/event-stream", headers=headers)
1089
+
1090
+
1091
+ @app.post("/v1/cancel/{session_id}", tags=["chat"])
1092
+ def cancel_session(session_id: str):
1093
+ sess = _STORE.get(session_id)
1094
+ if sess is not None:
1095
+ try:
1096
+ sess.cancel_event.set()
1097
+ sess.finished = True
1098
+ if _DB_STORE is not None:
1099
+ _DB_STORE.mark_finished(session_id)
1100
+ except Exception:
1101
+ pass
1102
+ return JSONResponse({"ok": True, "session_id": session_id})
1103
+
1104
+
1105
+ if __name__ == "__main__":
1106
+ import uvicorn
1107
+
1108
+ uvicorn.run("main:app", host="0.0.0.0", port=PORT, reload=False)
requirements.txt ADDED
@@ -0,0 +1,30 @@
1
+ # Core server
2
+ fastapi>=0.115.0
3
+ uvicorn[standard]>=0.30.0
4
+
5
+ # HF ecosystem
6
+ transformers>=4.44.0
7
+ accelerate>=0.33.0
8
+
9
+ # Multimedia + utils
10
+ pillow>=10.0.0
11
+ numpy>=1.24.0
12
+ requests>=2.31.0
13
+ imageio[ffmpeg]>=2.34.0
14
+
15
+ # Config
16
+ python-dotenv>=1.0.1
17
+
18
+ # FastAPI runtime model layer
19
+ pydantic>=2.0.0
20
+
21
+ # IMPORTANT:
22
+ # - Install PyTorch separately to match your platform/CUDA:
23
+ # CPU (Windows/Linux/macOS):
24
+ # pip install torch --index-url https://download.pytorch.org/whl/cpu
25
+ # NVIDIA CUDA (example for cu124, adjust as needed):
26
+ # pip install torch --index-url https://download.pytorch.org/whl/cu124
27
+ # Testing and tooling
28
+ pytest>=8.0.0
29
+ pytest-cov>=4.1.0
30
+ pyyaml>=6.0.0
tests/test_api.py ADDED
@@ -0,0 +1,274 @@
1
+ import json
2
+ import time
3
+ from contextlib import contextmanager
4
+
5
+ import pytest
6
+ from fastapi.testclient import TestClient
7
+
8
+ import main
9
+
10
+
11
+ class FakeEngine:
12
+ def __init__(self, model_id="fake-model"):
13
+ self.model_id = model_id
14
+ self.last_context_info = {
15
+ "compressed": False,
16
+ "prompt_tokens": 5,
17
+ "max_context": 8192,
18
+ "budget": 7900,
19
+ "strategy": "truncate",
20
+ "dropped_messages": 0,
21
+ }
22
+
23
+ def infer(self, messages, max_tokens, temperature):
24
+ # Simulate parse error pathway when special trigger is present
25
+ if messages and isinstance(messages[0].get("content"), str) and "PARSE_ERR" in messages[0]["content"]:
26
+ raise ValueError("Simulated parse error")
27
+ # Return echo content for deterministic test
28
+ parts = []
29
+ for m in messages:
30
+ c = m.get("content", "")
31
+ if isinstance(c, list):
32
+ for p in c:
33
+ if isinstance(p, dict) and p.get("type") == "text":
34
+ parts.append(p.get("text", ""))
35
+ elif isinstance(c, str):
36
+ parts.append(c)
37
+ txt = " ".join(parts) or "OK"
38
+ # Simulate context accounting changing with request
39
+ self.last_context_info = {
40
+ "compressed": False,
41
+ "prompt_tokens": max(1, len(txt.split())),
42
+ "max_context": 8192,
43
+ "budget": 7900,
44
+ "strategy": "truncate",
45
+ "dropped_messages": 0,
46
+ }
47
+ return f"OK: {txt}"
48
+
49
+ def infer_stream(self, messages, max_tokens, temperature, cancel_event=None):
50
+ # simple two-piece stream; respects cancel_event if set during streaming
51
+ outputs = ["hello", " world"]
52
+ for piece in outputs:
53
+ if cancel_event is not None and cancel_event.is_set():
54
+ break
55
+ yield piece
56
+ # tiny delay to allow cancel test to interleave
57
+ time.sleep(0.01)
58
+
59
+ def get_context_report(self):
60
+ return {
61
+ "compressionEnabled": True,
62
+ "strategy": "truncate",
63
+ "safetyMargin": 256,
64
+ "modelMaxContext": 8192,
65
+ "tokenizerModelMaxLength": 8192,
66
+ "last": self.last_context_info,
67
+ }
68
+
69
+
70
+ @contextmanager
71
+ def patched_engine():
72
+ # Patch global engine so server does not load real model
73
+ prev_engine = main._engine
74
+ prev_err = main._engine_error
75
+ fake = FakeEngine()
76
+ main._engine = fake
77
+ main._engine_error = None
78
+ try:
79
+ yield fake
80
+ finally:
81
+ main._engine = prev_engine
82
+ main._engine_error = prev_err
83
+
84
+
85
+ def get_client():
86
+ return TestClient(main.app)
87
+
88
+
89
+ def test_health_ready_and_context():
90
+ with patched_engine():
91
+ client = get_client()
92
+ r = client.get("/health")
93
+ assert r.status_code == 200
94
+ body = r.json()
95
+ assert body["ok"] is True
96
+ assert body["modelReady"] is True
97
+ assert body["modelId"] == "fake-model"
98
+ # context block exists with required fields
99
+ ctx = body["context"]
100
+ assert ctx["compressionEnabled"] is True
101
+ assert "last" in ctx
102
+ assert isinstance(ctx["last"].get("prompt_tokens"), int)
103
+
104
+
105
+ def test_health_with_engine_error():
106
+ # simulate model load error path
107
+ prev_engine = main._engine
108
+ prev_err = main._engine_error
109
+ try:
110
+ main._engine = None
111
+ main._engine_error = "boom"
112
+ client = get_client()
113
+ r = client.get("/health")
114
+ assert r.status_code == 200
115
+ body = r.json()
116
+ assert body["modelReady"] is False
117
+ assert body["error"] == "boom"
118
+ finally:
119
+ main._engine = prev_engine
120
+ main._engine_error = prev_err
121
+
122
+
123
+ def test_chat_non_stream_validation():
124
+ with patched_engine():
125
+ client = get_client()
126
+ # missing messages should 400
127
+ r = client.post("/v1/chat/completions", json={"messages": []})
128
+ assert r.status_code == 400
129
+
+
+def test_chat_non_stream_success_and_usage_context():
+    with patched_engine():
+        client = get_client()
+        payload = {
+            "messages": [{"role": "user", "content": "Hello Qwen"}],
+            "max_tokens": 8,
+            "temperature": 0.0,
+        }
+        r = client.post("/v1/chat/completions", json=payload)
+        assert r.status_code == 200
+        body = r.json()
+        assert body["object"] == "chat.completion"
+        assert body["choices"][0]["message"]["content"].startswith("OK:")
+        # usage prompt_tokens filled from engine.last_context_info
+        assert body["usage"]["prompt_tokens"] >= 1
+        # response includes context echo
+        assert "context" in body
+        assert "prompt_tokens" in body["context"]
+
+
+def test_chat_non_stream_parse_error_to_400():
+    with patched_engine():
+        client = get_client()
+        payload = {
+            "messages": [{"role": "user", "content": "PARSE_ERR trigger"}],
+            "max_tokens": 4,
+        }
+        r = client.post("/v1/chat/completions", json=payload)
+        # ValueError in engine -> 400 per API contract
+        assert r.status_code == 400
161
+
+
+def read_sse_lines(resp):
+    # Utility to split an event-stream response into its SSE frames (data payloads, including [DONE])
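+    # Frames are separated by a blank line; an illustrative sequence (exact chunk JSON may differ):
+    #   data: {"choices": [{"delta": {"content": "hello"}}]}
+    #
+    #   data: [DONE]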
165
+    lines = []
+    buf = b""
+
+    # Starlette TestClient (httpx) responses expose iter_bytes()/iter_raw(), not requests.iter_content().
+    # Fall back to available iterator or to full content if streaming isn't supported.
+    iterator = None
+    for name in ("iter_bytes", "iter_raw", "iter_content"):
+        it = getattr(resp, name, None)
+        if callable(it):
+            iterator = it
+            break
+
+    if iterator is None:
+        data = getattr(resp, "content", b"")
+        if isinstance(data, str):
+            data = data.encode("utf-8", "ignore")
+        buf = data
+    else:
+        for chunk in iterator():
+            if not chunk:
+                continue
+            if isinstance(chunk, str):
+                chunk = chunk.encode("utf-8", "ignore")
+            buf += chunk
+    while b"\n\n" in buf:
+        frame, buf = buf.split(b"\n\n", 1)
+        # keep original frame text for asserts
+        lines.append(frame.decode("utf-8", errors="ignore"))
+
+    # Drain any leftover
+    if buf:
+        lines.append(buf.decode("utf-8", errors="ignore"))
+    return lines
198
+
+
+def test_chat_stream_sse_flow_and_resume():
+    with patched_engine():
+        client = get_client()
+        payload = {
+            "session_id": "s1",
+            "stream": True,
+            "messages": [{"role": "user", "content": "stream please"}],
+            "max_tokens": 8,
+            "temperature": 0.2,
+        }
+        with client.stream("POST", "/v1/chat/completions", json=payload) as resp:
+            assert resp.status_code == 200
+            lines = read_sse_lines(resp)
+            # Must contain role delta, content pieces, finish chunk, and [DONE]
+            joined = "\n".join(lines)
+            assert "delta" in joined
+            assert "[DONE]" in joined
+
+        # Resume from event index 0 should receive at least one subsequent event
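+        # Last-Event-ID is "<session_id>:<last_seen_event_index>"; "s1:0" replays events after index 0.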
219
+        headers = {"Last-Event-ID": "s1:0"}
+        with client.stream("POST", "/v1/chat/completions", headers=headers, json=payload) as resp2:
+            assert resp2.status_code == 200
+            lines2 = read_sse_lines(resp2)
+            assert any("data:" in l for l in lines2)
+            assert "[DONE]" in "\n".join(lines2)
+
+        # Invalid Last-Event-ID format should not crash (covered by try/except)
+        headers_bad = {"Last-Event-ID": "not-an-index"}
+        with client.stream("POST", "/v1/chat/completions", headers=headers_bad, json=payload) as resp3:
+            assert resp3.status_code == 200
+            _ = read_sse_lines(resp3)  # just ensure no crash
231
+
+
+def test_cancel_endpoint_stops_generation():
+    with patched_engine():
+        client = get_client()
+        payload = {
+            "session_id": "to-cancel",
+            "stream": True,
+            "messages": [{"role": "user", "content": "cancel me"}],
+        }
+        # Start streaming in background (client.stream keeps the connection open)
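+        # The fake engine sleeps briefly between chunks, so cancelling before the body is
+        # drained gives the server a chance to observe the cancel event mid-stream.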
242
+        with client.stream("POST", "/v1/chat/completions", json=payload) as resp:
+            # Immediately cancel
+            rc = client.post("/v1/cancel/to-cancel")
+            assert rc.status_code == 200
+            # Stream should end with [DONE] without hanging
+            lines = read_sse_lines(resp)
+            assert "[DONE]" in "\n".join(lines)
+
+
+def test_cancel_unknown_session_is_ok():
+    with patched_engine():
+        client = get_client()
+        rc = client.post("/v1/cancel/does-not-exist")
+        # Endpoint returns ok regardless (idempotent, operationally safe)
+        assert rc.status_code == 200
+
+
+def test_edge_large_last_event_id_after_finish_yields_done():
+    with patched_engine():
+        client = get_client()
+        payload = {
+            "session_id": "done-session",
+            "stream": True,
+            "messages": [{"role": "user", "content": "edge"}],
+        }
+        # Complete a run
+        with client.stream("POST", "/v1/chat/completions", json=payload) as resp:
+            _ = read_sse_lines(resp)
+        # Resume with huge index; should return DONE quickly
+        headers = {"Last-Event-ID": "done-session:99999"}
+        with client.stream("POST", "/v1/chat/completions", headers=headers, json=payload) as resp2:
+            lines2 = read_sse_lines(resp2)
+            assert "[DONE]" in "\n".join(lines2)