File size: 9,480 Bytes
7cd14d8
 
 
 
 
 
 
 
2a02935
7cd14d8
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
2a02935
7cd14d8
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
2a02935
 
 
 
 
 
 
7cd14d8
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
# Architecture (Python FastAPI + Transformers)

This document describes the Python-based, OpenAI-compatible inference server for Qwen3-VL, replacing the previous Node.js/llama.cpp stack.

Key source files
- Server entry: [main.py](main.py)
- Inference engine: [Python.class Engine](main.py:231)
- Multimodal parsing: [Python.function build_mm_messages](main.py:251), [Python.function load_image_from_any](main.py:108), [Python.function load_video_frames_from_any](main.py:150)
- Endpoints: Health [Python.app.get()](main.py:577), Chat Completions [Python.app.post()](main.py:591), Cancel [Python.app.post()](main.py:792), KTP OCR [Python.app.post()](main.py:1310)
- Streaming + resume: [Python.class _SSESession](main.py:435), [Python.class _SessionStore](main.py:449), [Python.class _SQLiteStore](main.py:482), [Python.function chat_completions](main.py:591)
- Local run (uvicorn): [Python.main()](main.py:807)
- Configuration template: [.env.example](.env.example)
- Dependencies: [requirements.txt](requirements.txt)

Model target (default)
- Hugging Face: Qwen/Qwen3-VL-2B-Thinking (Transformers, multimodal)
- Overridable via environment variable: MODEL_REPO_ID

## Overview

The server exposes an OpenAI-compatible endpoint for chat completions that supports:
- Text-only prompts
- Images (URL or base64)
- Videos (URL or base64; frames sampled)

Two response modes are implemented:
- Non-streaming JSON
- Streaming via Server-Sent Events (SSE) with resumable delivery using Last-Event-ID. Resumability is achieved with an in‑memory ring buffer and optional SQLite persistence.

## Components

1) FastAPI application
- Instantiated in [Python.main module](main.py:541) and endpoints mounted at:
  - Health: [Python.app.get()](main.py:577)
  - Chat Completions (non-stream + SSE): [Python.app.post()](main.py:591)
  - Manual cancel (custom): [Python.app.post()](main.py:792)
  - KTP OCR: [Python.app.post()](main.py:1310)
- CORS is enabled for simplicity.

2) Inference Engine (Transformers)
- Class: [Python.class Engine](main.py:231)
- Loads:
  - Processor: AutoProcessor(trust_remote_code=True)
  - Model: AutoModelForCausalLM (device_map, dtype configurable via env)
- Core methods:
  - Input building: [Python.function build_mm_messages](main.py:251)
  - Text-only generate: [Python.function infer](main.py:326)
  - Streaming generate (iterator): [Python.function infer_stream](main.py:375)

3) Multimodal preprocessing
- Images:
  - URL (http/https), data URL, base64, or local path
  - Loader: [Python.function load_image_from_any](main.py:108)
- Videos:
  - URL (downloaded to temp), base64 to temp file, or local path
  - Frame extraction via imageio.v3 (preferred) or OpenCV fallback
  - Uniform sampling up to MAX_VIDEO_FRAMES
  - Loader: [Python.function load_video_frames_from_any](main.py:150)

4) SSE streaming with resume
- Session objects:
  - [Python.class _SSESession](main.py:435): ring buffer, condition variable, producer thread reference, cancellation event, listener count, and disconnect timer
  - [Python.class _SessionStore](main.py:449): in-memory map with TTL + GC
  - Optional persistence: [Python.class _SQLiteStore](main.py:482) for replaying chunks across restarts
- SSE id format: "session_id:index"
- Resume:
  - Client sends Last-Event-ID header (or query ?last_event_id=...) and the same session_id in the body
  - Server replays cached/persisted chunks after the provided index, then continues live streaming
- Producer:
  - Created on demand per session; runs generation in a daemon thread and pushes chunks into the ring buffer and SQLite (if enabled)
  - See producer closure inside [Python.function chat_completions](main.py:591)
- Auto-cancel on disconnect:
  - If all clients disconnect for CANCEL_AFTER_DISCONNECT_SECONDS (default 3600s), a timer signals cancellation via a stopping criteria in [Python.function infer_stream](main.py:375)

## Request flow

Non-streaming (POST /v1/chat/completions)
1. Validate input, load engine singleton via [Python.function get_engine](main.py:558)
2. Convert OpenAI-style messages to Qwen chat template via [Python.function build_mm_messages](main.py:251) and apply_chat_template
3. Preprocess images/videos into processor inputs
4. Generate with [Python.function infer](main.py:326)
5. Return OpenAI-compatible response (choices[0].message.content)

Streaming (POST /v1/chat/completions with "stream": true)
1. Determine session_id:
   - Use body.session_id if provided; otherwise generated server-side
2. Parse Last-Event-ID (or query ?last_event_id) to get last delivered index
3. Create/start or reuse producer thread for this session
4. StreamingResponse generator:
   - Replays persisted events (SQLite, if enabled) and in-memory buffer after last index
   - Waits on condition variable for new tokens
   - Emits "[DONE]" at the end or upon buffer completion
5. Clients can reconnect and resume by sending Last-Event-ID: "session_id:index"
6. If all clients disconnect, an auto-cancel timer can stop generation (configurable via env)

Manual cancel (POST /v1/cancel/{session_id})
- Custom operational shortcut to cancel an in-flight generation for a session id.
- This is not part of the legacy OpenAI Chat Completions spec (OpenAI’s newer Responses API defines cancel); it is provided for practical control.

KTP OCR (POST /ktp-ocr/)
- Specialized endpoint for Indonesian ID card (KTP) optical character recognition.
- Accepts multipart form-data with image file, extracts structured JSON data using multimodal inference.
- Returns standardized fields: nik, nama, tempat_lahir, tgl_lahir, jenis_kelamin, alamat (with nested fields), agama, status_perkawinan, pekerjaan, kewarganegaraan, berlaku_hingga.
- Uses custom prompt engineering for accurate structured extraction from Qwen3-VL model.
- Inspired by raflyryhnsyh/Gemini-OCR-KTP but adapted for local, self-hosted inference.

## Message and content mapping

Input format (OpenAI-like):
- "messages" list of role/content entries
- content can be:
  - string (text)
  - array of parts with "type":
    - "text": { text: "..."}
    - "image_url": { image_url: { url: "..." } } or { image_url: "..." }
    - "input_image": { b64_json: "..." } or { image: "..." }
    - "video_url": { video_url: { url: "..." } } or { video_url: "..." }
    - "input_video": { b64_json: "..." } or { video: "..." }

Conversion:
- [Python.function build_mm_messages](main.py:251) constructs a multimodal content list per message:
  - { type: "text", text: ... }
  - { type: "image", image: PIL.Image }
  - { type: "video", video: [PIL.Image frames] }

Template:
- Qwen apply_chat_template:
  - See usage in [Python.function infer](main.py:326) and [Python.function infer_stream](main.py:375)

## Configuration (.env)

See [.env.example](.env.example)
- PORT (default 3000)
- MODEL_REPO_ID (default "Qwen/Qwen3-VL-2B-Thinking")
- HF_TOKEN (optional)
- MAX_TOKENS (default 256)
- TEMPERATURE (default 0.7)
- MAX_VIDEO_FRAMES (default 16)
- DEVICE_MAP (default "auto")
- TORCH_DTYPE (default "auto")
- PERSIST_SESSIONS (default 0; set 1 to enable SQLite persistence)
- SESSIONS_DB_PATH (default sessions.db)
- SESSIONS_TTL_SECONDS (default 600)
- CANCEL_AFTER_DISCONNECT_SECONDS (default 3600; set 0 to disable)

## Error handling and readiness

- Health endpoint: [Python.app.get()](main.py:577)
  - Returns { ok, modelReady, modelId, error }
- Chat endpoint:
  - 400 for invalid messages or multimodal parsing errors
  - 503 when model failed to load
  - 500 for unexpected generation errors
- During first request, the model is lazily loaded; subsequent requests reuse the singleton

## Performance and scaling

- GPU recommended:
  - Set DEVICE_MAP=auto and TORCH_DTYPE=bfloat16/float16 if supported
- Reduce MAX_VIDEO_FRAMES to speed up video processing
- For concurrency:
  - FastAPI/Uvicorn workers and model sharing: typically 1 model per process
  - For high throughput, prefer multiple processes or a queueing layer

## Data and directories

- models/ contains downloaded model artifacts (implicitly created by Transformers cache); ignored by git
- tmp/ used transiently for video decoding (temporary files)

Ignored artifacts (see [.gitignore](.gitignore))
- Python: .venv/, __pycache__/, .cache/, etc.
- Large artifacts: models/, data/, uploads/, tmp/

## Streaming resume details

- Session store:
  - In-memory ring buffer for fast replay
  - Optional SQLite persistence for robust replay across process restarts
  - See GC in [Python.class _SessionStore](main.py:449) and [Python.method _SQLiteStore.gc](main.py:526)
- Limits:
  - Ring buffer stores ~2048 SSE events per session by default
  - If the buffer overflows before a client resumes and persistence is disabled, the earliest chunks may be unavailable
- End-of-stream:
  - Final chunk contains finish_reason: "stop"
  - "[DONE]" sentinel is emitted afterwards

## Future enhancements

- Redis persistence:
  - Add a Redis-backed store as a drop-in alongside SQLite
- Token accounting:
  - Populate usage prompt/completion/total tokens when model exposes tokenization costs
- Logging/observability:
  - Structured logs, request IDs, and metrics

## Migration notes (from Node.js)

- All Node.js server files and scripts were removed (index.js, package*.json, scripts/)
- The server now targets Transformers models directly and supports multimodal inputs out of the box
- The API remains OpenAI-compatible on /v1/chat/completions with resumable SSE and optional SQLite persistence