Architecture (Python FastAPI + Transformers)

This document describes the Python-based, OpenAI-compatible inference server for Qwen3-VL, replacing the previous Node.js/llama.cpp stack.

Model target (default)

  • Hugging Face: Qwen/Qwen3-VL-2B-Thinking (Transformers, multimodal)
  • Overridable via environment variable: MODEL_REPO_ID

Overview

The server exposes an OpenAI-compatible endpoint for chat completions that supports:

  • Text-only prompts
  • Images (URL or base64)
  • Videos (URL or base64; frames sampled)

Two response modes are implemented:

  • Non-streaming JSON
  • Streaming via Server-Sent Events (SSE) with resumable delivery using Last-Event-ID. Resumability is achieved with an in‑memory ring buffer and optional SQLite persistence.
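
For illustration, a minimal streaming client might look like the sketch below. It assumes the default PORT of 3000; the session_id value and the chunk/delta field names are illustrative, following the OpenAI-style schema this document describes.

```python
import json
import requests

BASE = "http://localhost:3000"  # assumes default PORT=3000

body = {
    "stream": True,
    "session_id": "demo-session-1",  # client-chosen so the stream can be resumed
    "messages": [{"role": "user", "content": "Hello!"}],
}

last_index = -1
with requests.post(f"{BASE}/v1/chat/completions", json=body, stream=True) as r:
    for line in r.iter_lines(decode_unicode=True):
        if not line:
            continue
        if line.startswith("id: "):
            # SSE id format is "session_id:index"
            last_index = int(line.rsplit(":", 1)[1])
        elif line.startswith("data: "):
            payload = line[len("data: "):]
            if payload == "[DONE]":
                break
            chunk = json.loads(payload)
            print(chunk["choices"][0]["delta"].get("content", ""), end="")

# On reconnect, ask the server to replay everything after last_index:
# headers = {"Last-Event-ID": f"demo-session-1:{last_index}"}
# requests.post(f"{BASE}/v1/chat/completions", json=body, headers=headers, stream=True)
```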

Components

  1. FastAPI application
  2. Inference Engine (Transformers)
  3. Multimodal preprocessing
  4. SSE streaming with resume
  • Session objects: per-session state holding the in-memory ring buffer of SSE events, a condition variable that signals new chunks, and (if enabled) SQLite-persisted rows
  • SSE id format: "session_id:index"
  • Resume:
    • Client sends Last-Event-ID header (or query ?last_event_id=...) and the same session_id in the body
    • Server replays cached/persisted chunks after the provided index, then continues live streaming
  • Producer:
    • Created on demand per session; runs generation in a daemon thread and pushes chunks into the ring buffer and SQLite (if enabled)
    • See the producer closure inside chat_completions
  • Auto-cancel on disconnect:
    • If all clients remain disconnected for CANCEL_AFTER_DISCONNECT_SECONDS (default 3600s), a timer signals cancellation via a stopping criterion in infer_stream (sketched below)
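
A hedged sketch of how the auto-cancel timer could feed a Transformers stopping criterion; the class, event, and wiring below are illustrative, not the repository's actual code.

```python
import threading
from transformers import StoppingCriteria, StoppingCriteriaList

class CancelCriteria(StoppingCriteria):
    """Stops generation once a cancel event is set, e.g. by the
    auto-cancel timer after all clients have disconnected."""

    def __init__(self, cancel_event: threading.Event):
        self.cancel_event = cancel_event

    def __call__(self, input_ids, scores, **kwargs) -> bool:
        return self.cancel_event.is_set()

# Illustrative wiring: the disconnect timer sets the event, and infer_stream
# would pass the criteria into generation, e.g.
#   model.generate(..., stopping_criteria=StoppingCriteriaList([CancelCriteria(ev)]))
ev = threading.Event()
timer = threading.Timer(3600, ev.set)  # CANCEL_AFTER_DISCONNECT_SECONDS
```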

Request flow

Non-streaming (POST /v1/chat/completions)

  1. Validate input; load the engine singleton via get_engine
  2. Convert OpenAI-style messages to the Qwen chat template via build_mm_messages and apply_chat_template
  3. Preprocess images/videos into processor inputs
  4. Generate with infer
  5. Return OpenAI-compatible response (choices[0].message.content)
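
A minimal non-streaming request, as a sketch: the URL assumes the default PORT, the image URL is a placeholder, and max_tokens follows the OpenAI-style schema.

```python
import requests

resp = requests.post(
    "http://localhost:3000/v1/chat/completions",  # assumes default PORT=3000
    json={
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What is in this image?"},
                    {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
                ],
            }
        ],
        "max_tokens": 128,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```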

Streaming (POST /v1/chat/completions with "stream": true)

  1. Determine session_id:
    • Use body.session_id if provided; otherwise one is generated server-side
  2. Parse Last-Event-ID (or query ?last_event_id) to get last delivered index
  3. Create/start or reuse producer thread for this session
  4. StreamingResponse generator:
    • Replays persisted events (SQLite, if enabled) and in-memory buffer after last index
    • Waits on condition variable for new tokens
    • Emits "[DONE]" at the end or upon buffer completion
  5. Clients can reconnect and resume by sending Last-Event-ID: "session_id:index"
  6. If all clients disconnect, an auto-cancel timer can stop generation (configurable via env)
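
The generator's replay-then-wait loop might look like the following sketch; the session object, events_after helper, and condition attribute are hypothetical names (and a real producer thread would have to wake the asyncio condition via the event loop).

```python
async def event_stream(session, last_index: int):
    """Hedged sketch of the StreamingResponse generator."""
    while True:
        # Replay persisted/buffered chunks after the last delivered index.
        for idx, data in session.events_after(last_index):  # hypothetical helper
            yield f"id: {session.id}:{idx}\ndata: {data}\n\n"
            last_index = idx
        if session.finished and last_index >= session.max_index:
            break
        async with session.condition:       # hypothetical asyncio.Condition
            await session.condition.wait()  # woken when the producer appends
    yield "data: [DONE]\n\n"
```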

Manual cancel (POST /v1/cancel/{session_id})

  • Custom operational shortcut to cancel an in-flight generation for a session id.
  • This is not part of the legacy OpenAI Chat Completions spec (OpenAI’s newer Responses API defines cancel); it is provided for practical control.
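
Usage is a single POST with the session id from the original streaming request (the session id below is illustrative):

```python
import requests

# Cancel the in-flight generation for a known session id.
requests.post("http://localhost:3000/v1/cancel/demo-session-1")
```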

KTP OCR (POST /ktp-ocr/)

  • Specialized endpoint for Indonesian ID card (KTP) optical character recognition.
  • Accepts multipart form data with an image file and extracts structured JSON using multimodal inference.
  • Returns standardized fields: nik, nama, tempat_lahir, tgl_lahir, jenis_kelamin, alamat (with nested fields), agama, status_perkawinan, pekerjaan, kewarganegaraan, berlaku_hingga.
  • Uses custom prompt engineering to obtain accurate structured extraction from the Qwen3-VL model.
  • Inspired by raflyryhnsyh/Gemini-OCR-KTP but adapted for local, self-hosted inference.
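
A sketch of calling the endpoint with requests; the multipart field name "file" is an assumption, so check the endpoint's form schema.

```python
import requests

with open("ktp.jpg", "rb") as f:
    resp = requests.post(
        "http://localhost:3000/ktp-ocr/",
        files={"file": ("ktp.jpg", f, "image/jpeg")},  # field name assumed
    )
print(resp.json())  # e.g. {"nik": "...", "nama": "...", ...}
```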

Message and content mapping

Input format (OpenAI-like):

  • "messages" list of role/content entries
  • content can be:
    • string (text)
    • array of parts with "type":
      • "text": { text: "..."}
      • "image_url": { image_url: { url: "..." } } or { image_url: "..." }
      • "input_image": { b64_json: "..." } or { image: "..." }
      • "video_url": { video_url: { url: "..." } } or { video_url: "..." }
      • "input_video": { b64_json: "..." } or { video: "..." }

Conversion:

  • build_mm_messages constructs a multimodal content list per message:
    • { type: "text", text: ... }
    • { type: "image", image: PIL.Image }
    • { type: "video", video: [PIL.Image frames] }

Template:

  • The converted messages are rendered with the model's chat template via apply_chat_template (request flow step 2) before generation, as sketched below
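
A sketch of the templating step, assuming the standard Transformers processor API for Qwen-VL models:

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-2B-Thinking")
mm_messages = build_mm_messages(messages)  # from the sketches above
prompt = processor.apply_chat_template(
    mm_messages,
    add_generation_prompt=True,
    tokenize=False,
)
```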

Configuration (.env)

See .env.example

  • PORT (default 3000)
  • MODEL_REPO_ID (default "Qwen/Qwen3-VL-2B-Thinking")
  • HF_TOKEN (optional)
  • MAX_TOKENS (default 256)
  • TEMPERATURE (default 0.7)
  • MAX_VIDEO_FRAMES (default 16)
  • DEVICE_MAP (default "auto")
  • TORCH_DTYPE (default "auto")
  • PERSIST_SESSIONS (default 0; set 1 to enable SQLite persistence)
  • SESSIONS_DB_PATH (default sessions.db)
  • SESSIONS_TTL_SECONDS (default 600)
  • CANCEL_AFTER_DISCONNECT_SECONDS (default 3600; set 0 to disable)
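
A sketch of how these defaults could be read at startup with os.getenv (variable names match the list above; the parsing is illustrative):

```python
import os

PORT = int(os.getenv("PORT", "3000"))
MODEL_REPO_ID = os.getenv("MODEL_REPO_ID", "Qwen/Qwen3-VL-2B-Thinking")
MAX_TOKENS = int(os.getenv("MAX_TOKENS", "256"))
TEMPERATURE = float(os.getenv("TEMPERATURE", "0.7"))
MAX_VIDEO_FRAMES = int(os.getenv("MAX_VIDEO_FRAMES", "16"))
PERSIST_SESSIONS = os.getenv("PERSIST_SESSIONS", "0") == "1"
SESSIONS_DB_PATH = os.getenv("SESSIONS_DB_PATH", "sessions.db")
SESSIONS_TTL_SECONDS = int(os.getenv("SESSIONS_TTL_SECONDS", "600"))
CANCEL_AFTER_DISCONNECT_SECONDS = int(os.getenv("CANCEL_AFTER_DISCONNECT_SECONDS", "3600"))
```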

Error handling and readiness

  • Health endpoint (a FastAPI app.get route):
    • Returns { ok, modelReady, modelId, error }
  • Chat endpoint:
    • 400 for invalid messages or multimodal parsing errors
    • 503 when model failed to load
    • 500 for unexpected generation errors
  • The model is loaded lazily on the first request; subsequent requests reuse the singleton (see the sketch below)
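
A hedged sketch of the lazy singleton with the 503 behavior described above; InferenceEngine is a hypothetical name for the Transformers-backed engine.

```python
from fastapi import HTTPException

_engine = None
_engine_error = None

def get_engine():
    """Load the model once, reuse it afterwards, surface load failures as 503."""
    global _engine, _engine_error
    if _engine is None and _engine_error is None:
        try:
            _engine = InferenceEngine()  # hypothetical: loads model + processor
        except Exception as e:
            _engine_error = e
    if _engine is None:
        raise HTTPException(status_code=503, detail=f"model failed to load: {_engine_error}")
    return _engine
```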

Performance and scaling

  • GPU recommended:
    • Set DEVICE_MAP=auto and TORCH_DTYPE=bfloat16/float16 if supported
  • Reduce MAX_VIDEO_FRAMES to speed up video processing
  • For concurrency:
    • Each FastAPI/Uvicorn worker process loads its own model instance (typically one model per process)
    • For high throughput, prefer multiple processes or a queueing layer, as sketched below
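
One way to run multiple processes, as a sketch; "app:app" is an assumed import path, and the worker count should be sized to available GPU/CPU memory since each worker loads its own model copy.

```python
import uvicorn

if __name__ == "__main__":
    uvicorn.run("app:app", host="0.0.0.0", port=3000, workers=2)
```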

Data and directories

  • models/ contains downloaded model artifacts (implicitly created by Transformers cache); ignored by git
  • tmp/ used transiently for video decoding (temporary files)

Ignored artifacts (see .gitignore)

  • Python: .venv/, __pycache__/, .cache/, etc.
  • Large artifacts: models/, data/, uploads/, tmp/

Streaming resume details

  • Session store: an in-memory map from session_id to session state (buffer, producer thread, timers), expired after SESSIONS_TTL_SECONDS
  • Limits:
    • Ring buffer stores ~2048 SSE events per session by default (see the sketch after this list)
    • If the buffer overflows before a client resumes and persistence is disabled, the earliest chunks may be unavailable
  • End-of-stream:
    • Final chunk contains finish_reason: "stop"
    • "[DONE]" sentinel is emitted afterwards

Future enhancements

  • Redis persistence:
    • Add a Redis-backed store as a drop-in alongside SQLite
  • Token accounting:
    • Populate usage prompt/completion/total tokens when model exposes tokenization costs
  • Logging/observability:
    • Structured logs, request IDs, and metrics

Migration notes (from Node.js)

  • All Node.js server files and scripts were removed (index.js, package*.json, scripts/)
  • The server now targets Transformers models directly and supports multimodal inputs out of the box
  • The API remains OpenAI-compatible on /v1/chat/completions with resumable SSE and optional SQLite persistence