# Architecture (Python FastAPI + Transformers)
This document describes the Python-based, OpenAI-compatible inference server for Qwen3-VL, replacing the previous Node.js/llama.cpp stack.
## Key source files
- Server entry: [main.py](main.py)
- Inference engine: [Python.class Engine](main.py:231)
- Multimodal parsing: [Python.function build_mm_messages](main.py:251), [Python.function load_image_from_any](main.py:108), [Python.function load_video_frames_from_any](main.py:150)
- Endpoints: Health [Python.app.get()](main.py:577), Chat Completions [Python.app.post()](main.py:591), Cancel [Python.app.post()](main.py:792), KTP OCR [Python.app.post()](main.py:1310)
- Streaming + resume: [Python.class _SSESession](main.py:435), [Python.class _SessionStore](main.py:449), [Python.class _SQLiteStore](main.py:482), [Python.function chat_completions](main.py:591)
- Local run (uvicorn): [Python.main()](main.py:807)
- Configuration template: [.env.example](.env.example)
- Dependencies: [requirements.txt](requirements.txt)
## Model target (default)
- Hugging Face: Qwen/Qwen3-VL-2B-Thinking (Transformers, multimodal)
- Overridable via environment variable: MODEL_REPO_ID
## Overview
The server exposes an OpenAI-compatible endpoint for chat completions that supports:
- Text-only prompts
- Images (URL or base64)
- Videos (URL or base64; frames sampled)
Two response modes are implemented:
- Non-streaming JSON
- Streaming via Server-Sent Events (SSE) with resumable delivery using Last-Event-ID. Resumability is achieved with an in-memory ring buffer and optional SQLite persistence.
## Components
1) FastAPI application
- Instantiated in [Python.main module](main.py:541), with endpoints mounted at:
- Health: [Python.app.get()](main.py:577)
- Chat Completions (non-stream + SSE): [Python.app.post()](main.py:591)
- Manual cancel (custom): [Python.app.post()](main.py:792)
- KTP OCR: [Python.app.post()](main.py:1310)
- CORS is enabled for simplicity.
2) Inference Engine (Transformers)
- Class: [Python.class Engine](main.py:231)
- Loads:
- Processor: AutoProcessor(trust_remote_code=True)
- Model: AutoModelForCausalLM (device_map, dtype configurable via env)
- Core methods:
- Input building: [Python.function build_mm_messages](main.py:251)
- Text-only generate: [Python.function infer](main.py:326)
- Streaming generate (iterator): [Python.function infer_stream](main.py:375)
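A condensed sketch of the load path follows; the auto classes match the list above, while attribute names and the exact call sites are illustrative, not copied from main.py:

```python
# Condensed Engine load-path sketch (attribute names are illustrative).
import os
from transformers import AutoProcessor, AutoModelForCausalLM

class Engine:
    def __init__(self) -> None:
        repo_id = os.getenv("MODEL_REPO_ID", "Qwen/Qwen3-VL-2B-Thinking")
        # trust_remote_code lets the repo provide custom processor/model classes.
        self.processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
        self.model = AutoModelForCausalLM.from_pretrained(
            repo_id,
            trust_remote_code=True,
            device_map=os.getenv("DEVICE_MAP", "auto"),   # e.g. "auto", "cuda:0"
            torch_dtype=os.getenv("TORCH_DTYPE", "auto"), # e.g. "bfloat16"
        )
        self.model.eval()
```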
3) Multimodal preprocessing
- Images:
- URL (http/https), data URL, base64, or local path
- Loader: [Python.function load_image_from_any](main.py:108)
- Videos:
- URL (downloaded to temp), base64 to temp file, or local path
- Frame extraction via imageio.v3 (preferred) with an OpenCV fallback
- Uniform sampling up to MAX_VIDEO_FRAMES
- Loader: [Python.function load_video_frames_from_any](main.py:150)
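A simplified sketch of the image loader's dispatch order (remote URL, data URL, bare base64, then local path); the real branching in main.py may differ:

```python
# Simplified image-loader sketch mirroring load_image_from_any's contract.
import base64, io, re
import requests
from PIL import Image

def load_image_from_any(src: str) -> Image.Image:
    if src.startswith(("http://", "https://")):          # remote URL
        resp = requests.get(src, timeout=30)
        resp.raise_for_status()
        return Image.open(io.BytesIO(resp.content)).convert("RGB")
    m = re.match(r"data:image/[\w.+-]+;base64,(.*)", src, re.DOTALL)
    if m:                                                 # data URL
        return Image.open(io.BytesIO(base64.b64decode(m.group(1)))).convert("RGB")
    try:                                                  # bare base64
        return Image.open(io.BytesIO(base64.b64decode(src, validate=True))).convert("RGB")
    except Exception:
        return Image.open(src).convert("RGB")             # local path
```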
4) SSE streaming with resume
- Session objects:
- [Python.class _SSESession](main.py:435): ring buffer, condition variable, producer thread reference, cancellation event, listener count, and disconnect timer
- [Python.class _SessionStore](main.py:449): in-memory map with TTL + GC
- Optional persistence: [Python.class _SQLiteStore](main.py:482) for replaying chunks across restarts
- SSE id format: "session_id:index"
- Resume:
- Client sends Last-Event-ID header (or query ?last_event_id=...) and the same session_id in the body
- Server replays cached/persisted chunks after the provided index, then continues live streaming
- Producer:
- Created on demand per session; runs generation in a daemon thread and pushes chunks into the ring buffer and SQLite (if enabled)
- See producer closure inside [Python.function chat_completions](main.py:591)
- Auto-cancel on disconnect:
- If all clients disconnect for CANCEL_AFTER_DISCONNECT_SECONDS (default 3600s), a timer signals cancellation via a stopping criteria in [Python.function infer_stream](main.py:375)
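Cooperative cancellation can be expressed as a custom stopping criteria that polls a threading.Event; a minimal sketch (the exact class in main.py may differ):

```python
# Minimal cancellation sketch: generation stops once cancel_event is set,
# whether by the disconnect timer or by the manual cancel endpoint.
import threading
import torch
from transformers import StoppingCriteria

class CancelCriteria(StoppingCriteria):
    def __init__(self, cancel_event: threading.Event) -> None:
        self.cancel_event = cancel_event

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        return self.cancel_event.is_set()  # returning True halts generate()
```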
## Request flow
Non-streaming (POST /v1/chat/completions)
1. Validate input, load engine singleton via [Python.function get_engine](main.py:558)
2. Convert OpenAI-style messages to Qwen chat template via [Python.function build_mm_messages](main.py:251) and apply_chat_template
3. Preprocess images/videos into processor inputs
4. Generate with [Python.function infer](main.py:326)
5. Return OpenAI-compatible response (choices[0].message.content)
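A minimal client-side example of this flow (host/port assume the default PORT=3000; the image URL is a placeholder):

```python
# Non-streaming chat completion with one text part and one image URL part.
import requests

resp = requests.post(
    "http://localhost:3000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen3-VL-2B-Thinking",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this picture?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            ],
        }],
        "max_tokens": 128,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```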
Streaming (POST /v1/chat/completions with "stream": true)
1. Determine session_id:
   - Use body.session_id if provided; otherwise one is generated server-side
2. Parse Last-Event-ID (or query ?last_event_id) to get last delivered index
3. Create/start or reuse producer thread for this session
4. StreamingResponse generator:
- Replays persisted events (SQLite, if enabled) and in-memory buffer after last index
- Waits on condition variable for new tokens
- Emits "[DONE]" at the end or upon buffer completion
5. Clients can reconnect and resume by sending Last-Event-ID: "session_id:index"
6. If all clients disconnect, an auto-cancel timer can stop generation (configurable via env)
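A resume-capable client sketch tying these steps together (the session id "demo" is arbitrary; the id/data line layout follows the SSE format described above):

```python
# SSE client that remembers the last event id and resumes after a drop.
import requests

def consume(session_id: str, last_event_id: str | None = None) -> str | None:
    headers = {"Accept": "text/event-stream"}
    if last_event_id:
        headers["Last-Event-ID"] = last_event_id  # "session_id:index"
    body = {
        "stream": True,
        "session_id": session_id,
        "messages": [{"role": "user", "content": "Tell me a long story."}],
    }
    with requests.post("http://localhost:3000/v1/chat/completions",
                       json=body, headers=headers, stream=True) as r:
        for line in r.iter_lines(decode_unicode=True):
            if line.startswith("id: "):
                last_event_id = line[len("id: "):]
            elif line == "data: [DONE]":
                return None                # stream finished cleanly
    return last_event_id                   # connection dropped; resume from here

last_id = consume("demo")
while last_id is not None:                 # naive reconnect loop
    last_id = consume("demo", last_id)
```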
Manual cancel (POST /v1/cancel/{session_id})
- Custom operational shortcut to cancel an in-flight generation for a session id.
- This is not part of the legacy OpenAI Chat Completions spec (OpenAI’s newer Responses API defines cancel); it is provided for practical control.
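For example (session id "demo" is arbitrary):

```python
# Cancelling an in-flight generation for a known session id.
import requests

requests.post("http://localhost:3000/v1/cancel/demo", timeout=10)
```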
KTP OCR (POST /ktp-ocr/)
- Specialized endpoint for Indonesian ID card (KTP) optical character recognition.
- Accepts multipart form-data with image file, extracts structured JSON data using multimodal inference.
- Returns standardized fields: nik, nama, tempat_lahir, tgl_lahir, jenis_kelamin, alamat (with nested fields), agama, status_perkawinan, pekerjaan, kewarganegaraan, berlaku_hingga.
- Uses custom prompt engineering for accurate structured extraction from Qwen3-VL model.
- Inspired by raflyryhnsyh/Gemini-OCR-KTP but adapted for local, self-hosted inference.
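A sketch of calling the endpoint; the multipart field name "file" is an assumption, so check the endpoint's signature in main.py:

```python
# Upload a KTP image as multipart form-data and print the structured fields.
import requests

with open("ktp.jpg", "rb") as f:
    resp = requests.post(
        "http://localhost:3000/ktp-ocr/",
        files={"file": ("ktp.jpg", f, "image/jpeg")},  # field name assumed
        timeout=120,
    )
print(resp.json())  # {"nik": "...", "nama": "...", "alamat": {...}, ...}
```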
## Message and content mapping
Input format (OpenAI-like):
- "messages" list of role/content entries
- content can be:
- string (text)
- array of parts with "type":
- "text": { text: "..."}
- "image_url": { image_url: { url: "..." } } or { image_url: "..." }
- "input_image": { b64_json: "..." } or { image: "..." }
- "video_url": { video_url: { url: "..." } } or { video_url: "..." }
- "input_video": { b64_json: "..." } or { video: "..." }
Conversion:
- [Python.function build_mm_messages](main.py:251) constructs a multimodal content list per message:
- { type: "text", text: ... }
- { type: "image", image: PIL.Image }
- { type: "video", video: [PIL.Image frames] }
Template:
- Qwen apply_chat_template:
- See usage in [Python.function infer](main.py:326) and [Python.function infer_stream](main.py:375)
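A hedged sketch of the mapping plus templating step; build_mm_messages in main.py is authoritative, and _extract_src is a hypothetical helper introduced only for this example:

```python
# Part-to-Qwen mapping sketch. load_image_from_any / load_video_frames_from_any
# are the loaders described earlier; _extract_src is hypothetical.
from typing import Any

def _extract_src(part: dict) -> str:
    # Hypothetical helper: pull the URL or base64 payload out of a part.
    for key in ("image_url", "video_url"):
        val = part.get(key)
        if isinstance(val, dict):
            return val["url"]
        if isinstance(val, str):
            return val
    return part.get("b64_json") or part.get("image") or part.get("video")

def to_qwen_message(role: str, parts: list[dict[str, Any]]) -> dict:
    content: list[dict] = []
    for part in parts:
        kind = part.get("type")
        if kind == "text":
            content.append({"type": "text", "text": part["text"]})
        elif kind in ("image_url", "input_image"):
            content.append({"type": "image",
                            "image": load_image_from_any(_extract_src(part))})
        elif kind in ("video_url", "input_video"):
            content.append({"type": "video",
                            "video": load_video_frames_from_any(_extract_src(part))})
    return {"role": role, "content": content}

# The resulting messages list is then rendered with:
#   processor.apply_chat_template(messages, tokenize=False,
#                                 add_generation_prompt=True)
```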
## Configuration (.env)
See [.env.example](.env.example)
- PORT (default 3000)
- MODEL_REPO_ID (default "Qwen/Qwen3-VL-2B-Thinking")
- HF_TOKEN (optional)
- MAX_TOKENS (default 256)
- TEMPERATURE (default 0.7)
- MAX_VIDEO_FRAMES (default 16)
- DEVICE_MAP (default "auto")
- TORCH_DTYPE (default "auto")
- PERSIST_SESSIONS (default 0; set 1 to enable SQLite persistence)
- SESSIONS_DB_PATH (default sessions.db)
- SESSIONS_TTL_SECONDS (default 600)
- CANCEL_AFTER_DISCONNECT_SECONDS (default 3600; set 0 to disable)
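These settings are plausibly read with the usual os.getenv pattern and the defaults listed above (a sketch, not the exact code):

```python
# Reading the documented settings with their defaults.
import os

PORT = int(os.getenv("PORT", "3000"))
MODEL_REPO_ID = os.getenv("MODEL_REPO_ID", "Qwen/Qwen3-VL-2B-Thinking")
MAX_TOKENS = int(os.getenv("MAX_TOKENS", "256"))
TEMPERATURE = float(os.getenv("TEMPERATURE", "0.7"))
MAX_VIDEO_FRAMES = int(os.getenv("MAX_VIDEO_FRAMES", "16"))
PERSIST_SESSIONS = os.getenv("PERSIST_SESSIONS", "0") == "1"
SESSIONS_TTL_SECONDS = int(os.getenv("SESSIONS_TTL_SECONDS", "600"))
CANCEL_AFTER_DISCONNECT_SECONDS = int(os.getenv("CANCEL_AFTER_DISCONNECT_SECONDS", "3600"))
```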
## Error handling and readiness
- Health endpoint: [Python.app.get()](main.py:577)
- Returns { ok, modelReady, modelId, error }
- Chat endpoint:
- 400 for invalid messages or multimodal parsing errors
- 503 when model failed to load
- 500 for unexpected generation errors
- During the first request, the model is lazily loaded; subsequent requests reuse the singleton (sketched below)
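The lazy-load pattern behind get_engine, reusing the Engine class sketched in the Components section (the locking detail is an assumption):

```python
# Lazy singleton: the first caller pays the model-load cost; later requests
# reuse the same Engine instance. Double-checked locking is assumed here.
import threading

_engine = None
_engine_lock = threading.Lock()

def get_engine():
    global _engine
    if _engine is None:
        with _engine_lock:
            if _engine is None:
                _engine = Engine()  # Engine as sketched earlier
    return _engine
```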
## Performance and scaling
- GPU recommended:
- Set DEVICE_MAP=auto and TORCH_DTYPE=bfloat16/float16 if supported
- Reduce MAX_VIDEO_FRAMES to speed up video processing
- For concurrency:
  - FastAPI/Uvicorn workers and model sharing: typically one model per process
- For high throughput, prefer multiple processes or a queueing layer
## Data and directories
- models/ contains downloaded model artifacts (implicitly created by Transformers cache); ignored by git
- tmp/ used transiently for video decoding (temporary files)
Ignored artifacts (see [.gitignore](.gitignore))
- Python: .venv/, __pycache__/, .cache/, etc.
- Large artifacts: models/, data/, uploads/, tmp/
## Streaming resume details
- Session store:
- In-memory ring buffer for fast replay
- Optional SQLite persistence for robust replay across process restarts
- See GC in [Python.class _SessionStore](main.py:449) and [Python.method _SQLiteStore.gc](main.py:526)
- Limits:
- Ring buffer stores ~2048 SSE events per session by default
- If the buffer overflows before a client resumes and persistence is disabled, the earliest chunks may be unavailable
- End-of-stream:
- Final chunk contains finish_reason: "stop"
- "[DONE]" sentinel is emitted afterwards
## Future enhancements
- Redis persistence:
- Add a Redis-backed store as a drop-in alongside SQLite
- Token accounting:
- Populate usage prompt/completion/total tokens when model exposes tokenization costs
- Logging/observability:
- Structured logs, request IDs, and metrics
## Migration notes (from Node.js)
- All Node.js server files and scripts were removed (index.js, package*.json, scripts/)
- The server now targets Transformers models directly and supports multimodal inputs out of the box
- The API remains OpenAI-compatible on /v1/chat/completions with resumable SSE and optional SQLite persistence