Architecture (Python FastAPI + Transformers)
This document describes the Python-based, OpenAI-compatible inference server for Qwen3-VL, replacing the previous Node.js/llama.cpp stack.
Key source files
- Server entry: main.py
- Inference engine: the `Engine` class
- Multimodal parsing: `build_mm_messages`, `load_image_from_any`, `load_video_frames_from_any`
- Endpoints: Health (GET), Chat Completions (POST /v1/chat/completions), Cancel (POST /v1/cancel/{session_id}), KTP OCR (POST /ktp-ocr/)
- Streaming + resume: `_SSESession`, `_SessionStore`, `_SQLiteStore`, and the `chat_completions` handler
- Local run (uvicorn): `main()`
- Configuration template: .env.example
- Dependencies: requirements.txt
Model target (default)
- Hugging Face: Qwen/Qwen3-VL-2B-Thinking (Transformers, multimodal)
- Overridable via environment variable: MODEL_REPO_ID
Overview
The server exposes an OpenAI-compatible endpoint for chat completions that supports:
- Text-only prompts
- Images (URL or base64)
- Videos (URL or base64; frames sampled)
Two response modes are implemented:
- Non-streaming JSON
- Streaming via Server-Sent Events (SSE) with resumable delivery using Last-Event-ID. Resumability is achieved with an in-memory ring buffer and optional SQLite persistence.
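As a concrete example, a minimal text-only, non-streaming call (assuming the server listens on the default PORT=3000) looks like:

```python
import requests

# Minimal text-only, non-streaming request (assumes the default PORT=3000).
resp = requests.post(
    "http://localhost:3000/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Say hello."}],
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```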
Components
- FastAPI application
- Instantiated in main.py, with endpoints mounted at:
- Health: GET health-check route
- Chat Completions (non-stream + SSE): POST /v1/chat/completions
- Manual cancel (custom): POST /v1/cancel/{session_id}
- KTP OCR: POST /ktp-ocr/
- CORS is enabled for simplicity.
- Inference Engine (Transformers)
- Class: `Engine`
- Loads:
- Processor: AutoProcessor(trust_remote_code=True)
- Model: AutoModelForCausalLM (device_map, dtype configurable via env)
- Core methods:
- Input building: `build_mm_messages`
- Non-streaming generate: `infer`
- Streaming generate (iterator): `infer_stream`
- Multimodal preprocessing
- Images:
- URL (http/https), data URL, base64, or local path
- Loader: `load_image_from_any` (see the sketch after this list)
- Videos:
- URL (downloaded to temp), base64 to temp file, or local path
- Frame extraction via imageio.v3 (preferred) or OpenCV fallback
- Uniform sampling up to MAX_VIDEO_FRAMES
- Loader: `load_video_frames_from_any`
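As an illustration, a loader along the lines of `load_image_from_any` might look like this (a sketch only; the real implementation may differ in details):

```python
import base64, io, os
import requests
from PIL import Image

def load_image_sketch(src: str) -> Image.Image:
    """Illustrative loader: URL, data URL, raw base64, or local path -> PIL.Image."""
    if src.startswith(("http://", "https://")):
        return Image.open(io.BytesIO(requests.get(src, timeout=30).content)).convert("RGB")
    if src.startswith("data:"):
        # data URL: strip the "data:image/...;base64," prefix
        src = src.split(",", 1)[1]
    if os.path.exists(src):
        return Image.open(src).convert("RGB")
    # Fall back to treating the string as raw base64
    return Image.open(io.BytesIO(base64.b64decode(src))).convert("RGB")
```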
- SSE streaming with resume
- Session objects:
- `_SSESession`: ring buffer, condition variable, producer-thread reference, cancellation event, listener count, and disconnect timer
- `_SessionStore`: in-memory map with TTL + GC
- Optional persistence: `_SQLiteStore` for replaying chunks across restarts
- SSE id format: "session_id:index"
- Resume:
- Client sends Last-Event-ID header (or query ?last_event_id=...) and the same session_id in the body
- Server replays cached/persisted chunks after the provided index, then continues live streaming
- Producer:
- Created on demand per session; runs generation in a daemon thread and pushes chunks into the ring buffer and SQLite (if enabled)
- See the producer closure inside `chat_completions`
- Auto-cancel on disconnect:
- If all clients disconnect for CANCEL_AFTER_DISCONNECT_SECONDS (default 3600s), a timer signals cancellation via a stopping criterion in `infer_stream`
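The moving parts above can be pictured with a stripped-down sketch (illustrative only; the real `_SSESession` also tracks the producer thread, cancel event, listeners, and disconnect timer):

```python
import threading
from collections import deque

class SessionSketch:
    """Illustrative ring-buffer session; not the actual _SSESession."""
    def __init__(self, maxlen: int = 2048):
        self.buffer: deque[tuple[int, str]] = deque(maxlen=maxlen)  # (index, sse_chunk)
        self.next_index = 0
        self.cond = threading.Condition()

    def push(self, chunk: str) -> None:
        # Producer thread appends a chunk; oldest entries fall off when full.
        with self.cond:
            self.buffer.append((self.next_index, chunk))
            self.next_index += 1
            self.cond.notify_all()

    def replay_after(self, last_index: int) -> list[str]:
        # Consumer replays everything newer than the last delivered index.
        with self.cond:
            return [c for i, c in self.buffer if i > last_index]
```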
Request flow
Non-streaming (POST /v1/chat/completions)
- Validate input, load the engine singleton via `get_engine`
- Convert OpenAI-style messages to the Qwen chat template via `build_mm_messages` and `apply_chat_template`
- Preprocess images/videos into processor inputs
- Generate with `infer`
- Return OpenAI-compatible response (choices[0].message.content)
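For example, a non-streaming request with an image part (illustrative; the image URL is a placeholder and the server is assumed on the default port):

```python
import requests

# Non-streaming request with an image part (OpenAI-style content array).
payload = {
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this picture?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        ],
    }],
}
resp = requests.post("http://localhost:3000/v1/chat/completions", json=payload, timeout=300)
print(resp.json()["choices"][0]["message"]["content"])
```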
Streaming (POST /v1/chat/completions with "stream": true)
- Determine session_id:
- Use body.session_id if provided; otherwise one is generated server-side
- Parse Last-Event-ID (or query ?last_event_id) to get last delivered index
- Create/start or reuse producer thread for this session
- StreamingResponse generator:
- Replays persisted events (SQLite, if enabled) and in-memory buffer after last index
- Waits on condition variable for new tokens
- Emits "[DONE]" at the end or upon buffer completion
- Clients can reconnect and resume by sending Last-Event-ID: "session_id:index"
- If all clients disconnect, an auto-cancel timer can stop generation (configurable via env)
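A reconnecting client can therefore be as simple as the sketch below (assumes the default port; session_id is chosen client-side so the same value can be resent on reconnect):

```python
import requests

def stream_once(session_id: str, last_event_id: str | None = None) -> str | None:
    """Connect (or reconnect) to the SSE stream. Returns the last seen event id
    if the connection drops mid-stream, or None once "[DONE]" arrives."""
    headers = {"Accept": "text/event-stream"}
    if last_event_id:
        headers["Last-Event-ID"] = last_event_id  # format: "session_id:index"
    body = {
        "messages": [{"role": "user", "content": "Write a haiku."}],
        "stream": True,
        "session_id": session_id,  # resend the same id on every reconnect
    }
    last_id = last_event_id
    with requests.post(
        "http://localhost:3000/v1/chat/completions",
        json=body, headers=headers, stream=True, timeout=300,
    ) as r:
        for line in r.iter_lines(decode_unicode=True):
            if line.startswith("id:"):
                last_id = line[len("id:"):].strip()
            elif line.startswith("data:"):
                data = line[len("data:"):].strip()
                if data == "[DONE]":
                    return None
                print(data)  # one OpenAI-style chunk per event
    return last_id  # reconnect with this id if the stream was interrupted
```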
Manual cancel (POST /v1/cancel/{session_id})
- Custom operational shortcut to cancel an in-flight generation for a session id.
- This is not part of the legacy OpenAI Chat Completions spec (OpenAI’s newer Responses API defines cancel); it is provided for practical control.
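For example:

```python
import requests

# Cancel an in-flight generation; the path parameter is the session id
# used by the corresponding streaming request.
resp = requests.post("http://localhost:3000/v1/cancel/my-session-id")
print(resp.status_code, resp.text)
```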
KTP OCR (POST /ktp-ocr/)
- Specialized endpoint for Indonesian ID card (KTP) optical character recognition.
- Accepts multipart form-data with an image file and extracts structured JSON data using multimodal inference.
- Returns standardized fields: nik, nama, tempat_lahir, tgl_lahir, jenis_kelamin, alamat (with nested fields), agama, status_perkawinan, pekerjaan, kewarganegaraan, berlaku_hingga.
- Uses custom prompt engineering for accurate structured extraction from the Qwen3-VL model.
- Inspired by raflyryhnsyh/Gemini-OCR-KTP but adapted for local, self-hosted inference.
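A call might look like the following (the multipart field name "file" is an assumption; check main.py for the actual field name):

```python
import requests

# Upload a KTP image as multipart form-data (field name "file" is assumed).
with open("ktp.jpg", "rb") as f:
    resp = requests.post("http://localhost:3000/ktp-ocr/", files={"file": f})
print(resp.json())  # e.g. {"nik": "...", "nama": "...", "alamat": {...}, ...}
```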
Message and content mapping
Input format (OpenAI-like):
- "messages" list of role/content entries
- content can be:
- string (text)
- array of parts with "type":
- "text": { text: "..."}
- "image_url": { image_url: { url: "..." } } or { image_url: "..." }
- "input_image": { b64_json: "..." } or { image: "..." }
- "video_url": { video_url: { url: "..." } } or { video_url: "..." }
- "input_video": { b64_json: "..." } or { video: "..." }
Conversion:
- `build_mm_messages` constructs a multimodal content list per message (illustrated below):
- { type: "text", text: ... }
- { type: "image", image: PIL.Image }
- { type: "video", video: [PIL.Image frames] }
Template:
- Qwen apply_chat_template:
- See usage in `infer` and `infer_stream`
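Typical Transformers usage for this step looks roughly like the sketch below (the exact arguments in `infer`/`infer_stream` may differ; the placeholder image stands in for a loaded frame):

```python
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen3-VL-2B-Thinking", trust_remote_code=True
)

img = Image.new("RGB", (448, 448))  # placeholder for a loaded image
qwen_messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe the image."},
        {"type": "image", "image": img},
    ],
}]

# Render the chat template, then build tensor inputs for generate().
prompt = processor.apply_chat_template(
    qwen_messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=[prompt], images=[img], return_tensors="pt")
```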
Configuration (.env)
See .env.example
- PORT (default 3000)
- MODEL_REPO_ID (default "Qwen/Qwen3-VL-2B-Thinking")
- HF_TOKEN (optional)
- MAX_TOKENS (default 256)
- TEMPERATURE (default 0.7)
- MAX_VIDEO_FRAMES (default 16)
- DEVICE_MAP (default "auto")
- TORCH_DTYPE (default "auto")
- PERSIST_SESSIONS (default 0; set 1 to enable SQLite persistence)
- SESSIONS_DB_PATH (default sessions.db)
- SESSIONS_TTL_SECONDS (default 600)
- CANCEL_AFTER_DISCONNECT_SECONDS (default 3600; set 0 to disable)
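A minimal .env reflecting these settings (with SQLite persistence switched on and bfloat16 selected) might look like:

```
PORT=3000
MODEL_REPO_ID=Qwen/Qwen3-VL-2B-Thinking
MAX_TOKENS=256
TEMPERATURE=0.7
MAX_VIDEO_FRAMES=16
DEVICE_MAP=auto
TORCH_DTYPE=bfloat16
PERSIST_SESSIONS=1
SESSIONS_DB_PATH=sessions.db
SESSIONS_TTL_SECONDS=600
CANCEL_AFTER_DISCONNECT_SECONDS=3600
```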
Error handling and readiness
- Health endpoint (GET):
- Returns { ok, modelReady, modelId, error }
- Chat endpoint:
- 400 for invalid messages or multimodal parsing errors
- 503 when model failed to load
- 500 for unexpected generation errors
- On the first request, the model is lazily loaded; subsequent requests reuse the singleton
Performance and scaling
- GPU recommended:
- Set DEVICE_MAP=auto and TORCH_DTYPE=bfloat16/float16 if supported
- Reduce MAX_VIDEO_FRAMES to speed up video processing
- For concurrency:
- FastAPI/Uvicorn workers and model sharing: typically 1 model per process
- For high throughput, prefer multiple processes or a queueing layer
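One common pattern (a sketch, assuming the FastAPI object in main.py is named app) is to run several worker processes, each holding its own model copy:

```python
import uvicorn

# Each worker process imports main.py and loads its own Engine singleton.
# Passing the app as an import string is required for multiple workers.
if __name__ == "__main__":
    uvicorn.run("main:app", host="0.0.0.0", port=3000, workers=2)
```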
Data and directories
- models/ contains downloaded model artifacts (implicitly created by Transformers cache); ignored by git
- tmp/ used transiently for video decoding (temporary files)
Ignored artifacts (see .gitignore)
- Python: .venv/, __pycache__/, .cache/, etc.
- Large artifacts: models/, data/, uploads/, tmp/
Streaming resume details
- Session store:
- In-memory ring buffer for fast replay
- Optional SQLite persistence for robust replay across process restarts
- See GC in `_SessionStore` and `_SQLiteStore.gc`
- Limits:
- Ring buffer stores ~2048 SSE events per session by default
- If the buffer overflows before a client resumes and persistence is disabled, the earliest chunks may be unavailable
- End-of-stream:
- Final chunk contains finish_reason: "stop"
- "[DONE]" sentinel is emitted afterwards
Future enhancements
- Redis persistence:
- Add a Redis-backed store as a drop-in alongside SQLite
- Token accounting:
- Populate usage prompt/completion/total tokens when model exposes tokenization costs
- Logging/observability:
- Structured logs, request IDs, and metrics
Migration notes (from Node.js)
- All Node.js server files and scripts were removed (index.js, package*.json, scripts/)
- The server now targets Transformers models directly and supports multimodal inputs out of the box
- The API remains OpenAI-compatible on /v1/chat/completions with resumable SSE and optional SQLite persistence