Architecture (Python FastAPI + Transformers)

This document describes the Python-based, OpenAI-compatible inference server for Qwen3-VL, replacing the previous Node.js/llama.cpp stack.

Model target (default)

  • Hugging Face: Qwen/Qwen3-VL-2B-Thinking (Transformers, multimodal)
  • Overridable via environment variable: MODEL_REPO_ID

Overview

The server exposes an OpenAI-compatible endpoint for chat completions that supports:

  • Text-only prompts
  • Images (URL or base64)
  • Videos (URL or base64; frames sampled)

Two response modes are implemented:

  • Non-streaming JSON
  • Streaming via Server-Sent Events (SSE) with resumable delivery using Last-Event-ID. Resumability is achieved with an in‑memory ring buffer and optional SQLite persistence.
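
For illustration, a minimal streaming client might look like the sketch below. It assumes the default PORT of 3000; the session_id value and the chunk/delta field names are illustrative, following the OpenAI-style schema this document describes.

```python
import json
import requests

BASE = "http://localhost:3000"  # assumes default PORT=3000

body = {
    "stream": True,
    "session_id": "demo-session-1",  # client-chosen so the stream can be resumed
    "messages": [{"role": "user", "content": "Hello!"}],
}

last_index = -1
with requests.post(f"{BASE}/v1/chat/completions", json=body, stream=True) as r:
    for line in r.iter_lines(decode_unicode=True):
        if not line:
            continue
        if line.startswith("id: "):
            # SSE id format is "session_id:index"
            last_index = int(line.rsplit(":", 1)[1])
        elif line.startswith("data: "):
            payload = line[len("data: "):]
            if payload == "[DONE]":
                break
            chunk = json.loads(payload)
            print(chunk["choices"][0]["delta"].get("content", ""), end="")

# On reconnect, ask the server to replay everything after last_index:
# headers = {"Last-Event-ID": f"demo-session-1:{last_index}"}
# requests.post(f"{BASE}/v1/chat/completions", json=body, headers=headers, stream=True)
```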

Components

  1. FastAPI application
  2. Inference Engine (Transformers)
  3. Multimodal preprocessing
  4. SSE streaming with resume
  • Session objects: per-session state holding the in-memory ring buffer of SSE events, a condition variable that signals new chunks, and (if enabled) SQLite-persisted rows
  • SSE id format: "session_id:index"
  • Resume:
    • Client sends Last-Event-ID header (or query ?last_event_id=...) and the same session_id in the body
    • Server replays cached/persisted chunks after the provided index, then continues live streaming
  • Producer:
    • Created on demand per session; runs generation in a daemon thread and pushes chunks into the ring buffer and SQLite (if enabled)
    • See the producer closure inside chat_completions
  • Auto-cancel on disconnect:
    • If all clients remain disconnected for CANCEL_AFTER_DISCONNECT_SECONDS (default 3600s), a timer signals cancellation via a stopping criterion in infer_stream (sketched below)
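
A hedged sketch of how the auto-cancel timer could feed a Transformers stopping criterion; the class, event, and wiring below are illustrative, not the repository's actual code.

```python
import threading
from transformers import StoppingCriteria, StoppingCriteriaList

class CancelCriteria(StoppingCriteria):
    """Stops generation once a cancel event is set, e.g. by the
    auto-cancel timer after all clients have disconnected."""

    def __init__(self, cancel_event: threading.Event):
        self.cancel_event = cancel_event

    def __call__(self, input_ids, scores, **kwargs) -> bool:
        return self.cancel_event.is_set()

# Illustrative wiring: the disconnect timer sets the event, and infer_stream
# would pass the criteria into generation, e.g.
#   model.generate(..., stopping_criteria=StoppingCriteriaList([CancelCriteria(ev)]))
ev = threading.Event()
timer = threading.Timer(3600, ev.set)  # CANCEL_AFTER_DISCONNECT_SECONDS
```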

Request flow

Non-streaming (POST /v1/chat/completions)

  1. Validate input; load the engine singleton via get_engine
  2. Convert OpenAI-style messages to the Qwen chat template via build_mm_messages and apply_chat_template
  3. Preprocess images/videos into processor inputs
  4. Generate with infer
  5. Return OpenAI-compatible response (choices[0].message.content)
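
A minimal non-streaming request, as a sketch: the URL assumes the default PORT, the image URL is a placeholder, and max_tokens follows the OpenAI-style schema.

```python
import requests

resp = requests.post(
    "http://localhost:3000/v1/chat/completions",  # assumes default PORT=3000
    json={
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What is in this image?"},
                    {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
                ],
            }
        ],
        "max_tokens": 128,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```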

Streaming (POST /v1/chat/completions with "stream": true)

  1. Determine session_id:
    • Use body.session_id if provided; otherwise one is generated server-side
  2. Parse Last-Event-ID (or query ?last_event_id) to get last delivered index
  3. Create/start or reuse producer thread for this session
  4. StreamingResponse generator:
    • Replays persisted events (SQLite, if enabled) and in-memory buffer after last index
    • Waits on condition variable for new tokens
    • Emits "[DONE]" at the end or upon buffer completion
  5. Clients can reconnect and resume by sending Last-Event-ID: "session_id:index"
  6. If all clients disconnect, an auto-cancel timer can stop generation (configurable via env)
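
The generator's replay-then-wait loop might look like the following sketch; the session object, events_after helper, and condition attribute are hypothetical names (and a real producer thread would have to wake the asyncio condition via the event loop).

```python
async def event_stream(session, last_index: int):
    """Hedged sketch of the StreamingResponse generator."""
    while True:
        # Replay persisted/buffered chunks after the last delivered index.
        for idx, data in session.events_after(last_index):  # hypothetical helper
            yield f"id: {session.id}:{idx}\ndata: {data}\n\n"
            last_index = idx
        if session.finished and last_index >= session.max_index:
            break
        async with session.condition:       # hypothetical asyncio.Condition
            await session.condition.wait()  # woken when the producer appends
    yield "data: [DONE]\n\n"
```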

Manual cancel (POST /v1/cancel/{session_id})

  • Custom operational shortcut to cancel an in-flight generation for a session id.
  • This is not part of the legacy OpenAI Chat Completions spec (OpenAI’s newer Responses API defines cancel); it is provided for practical control.
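
Usage is a single POST with the session id from the original streaming request (the session id below is illustrative):

```python
import requests

# Cancel the in-flight generation for a known session id.
requests.post("http://localhost:3000/v1/cancel/demo-session-1")
```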

KTP OCR (POST /ktp-ocr/)

  • Specialized endpoint for Indonesian ID card (KTP) optical character recognition.
  • Accepts multipart form data with an image file and extracts structured JSON using multimodal inference.
  • Returns standardized fields: nik, nama, tempat_lahir, tgl_lahir, jenis_kelamin, alamat (with nested fields), agama, status_perkawinan, pekerjaan, kewarganegaraan, berlaku_hingga.
  • Uses custom prompt engineering to obtain accurate structured extraction from the Qwen3-VL model.
  • Inspired by raflyryhnsyh/Gemini-OCR-KTP but adapted for local, self-hosted inference.
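
A sketch of calling the endpoint with requests; the multipart field name "file" is an assumption, so check the endpoint's form schema.

```python
import requests

with open("ktp.jpg", "rb") as f:
    resp = requests.post(
        "http://localhost:3000/ktp-ocr/",
        files={"file": ("ktp.jpg", f, "image/jpeg")},  # field name assumed
    )
print(resp.json())  # e.g. {"nik": "...", "nama": "...", ...}
```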

Message and content mapping

Input format (OpenAI-like):

  • "messages" list of role/content entries
  • content can be:
    • string (text)
    • array of parts with "type":
      • "text": { text: "..."}
      • "image_url": { image_url: { url: "..." } } or { image_url: "..." }
      • "input_image": { b64_json: "..." } or { image: "..." }
      • "video_url": { video_url: { url: "..." } } or { video_url: "..." }
      • "input_video": { b64_json: "..." } or { video: "..." }

Conversion:

  • build_mm_messages constructs a multimodal content list per message:
    • { type: "text", text: ... }
    • { type: "image", image: PIL.Image }
    • { type: "video", video: [PIL.Image frames] }

Template:

  • The converted messages are rendered with the model's chat template via apply_chat_template (request flow step 2) before generation, as sketched below
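
A sketch of the templating step, assuming the standard Transformers processor API for Qwen-VL models:

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-2B-Thinking")
mm_messages = build_mm_messages(messages)  # from the sketches above
prompt = processor.apply_chat_template(
    mm_messages,
    add_generation_prompt=True,
    tokenize=False,
)
```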

Configuration (.env)

See .env.example

  • PORT (default 3000)
  • MODEL_REPO_ID (default "Qwen/Qwen3-VL-2B-Thinking")
  • HF_TOKEN (optional)
  • MAX_TOKENS (default 256)
  • TEMPERATURE (default 0.7)
  • MAX_VIDEO_FRAMES (default 16)
  • DEVICE_MAP (default "auto")
  • TORCH_DTYPE (default "auto")
  • PERSIST_SESSIONS (default 0; set 1 to enable SQLite persistence)
  • SESSIONS_DB_PATH (default sessions.db)
  • SESSIONS_TTL_SECONDS (default 600)
  • CANCEL_AFTER_DISCONNECT_SECONDS (default 3600; set 0 to disable)
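
A sketch of how these defaults could be read at startup with os.getenv (variable names match the list above; the parsing is illustrative):

```python
import os

PORT = int(os.getenv("PORT", "3000"))
MODEL_REPO_ID = os.getenv("MODEL_REPO_ID", "Qwen/Qwen3-VL-2B-Thinking")
MAX_TOKENS = int(os.getenv("MAX_TOKENS", "256"))
TEMPERATURE = float(os.getenv("TEMPERATURE", "0.7"))
MAX_VIDEO_FRAMES = int(os.getenv("MAX_VIDEO_FRAMES", "16"))
PERSIST_SESSIONS = os.getenv("PERSIST_SESSIONS", "0") == "1"
SESSIONS_DB_PATH = os.getenv("SESSIONS_DB_PATH", "sessions.db")
SESSIONS_TTL_SECONDS = int(os.getenv("SESSIONS_TTL_SECONDS", "600"))
CANCEL_AFTER_DISCONNECT_SECONDS = int(os.getenv("CANCEL_AFTER_DISCONNECT_SECONDS", "3600"))
```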

Error handling and readiness

  • Health endpoint (a FastAPI app.get route):
    • Returns { ok, modelReady, modelId, error }
  • Chat endpoint:
    • 400 for invalid messages or multimodal parsing errors
    • 503 when model failed to load
    • 500 for unexpected generation errors
  • The model is loaded lazily on the first request; subsequent requests reuse the singleton (see the sketch below)
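
A hedged sketch of the lazy singleton with the 503 behavior described above; InferenceEngine is a hypothetical name for the Transformers-backed engine.

```python
from fastapi import HTTPException

_engine = None
_engine_error = None

def get_engine():
    """Load the model once, reuse it afterwards, surface load failures as 503."""
    global _engine, _engine_error
    if _engine is None and _engine_error is None:
        try:
            _engine = InferenceEngine()  # hypothetical: loads model + processor
        except Exception as e:
            _engine_error = e
    if _engine is None:
        raise HTTPException(status_code=503, detail=f"model failed to load: {_engine_error}")
    return _engine
```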

Performance and scaling

  • GPU recommended:
    • Set DEVICE_MAP=auto and TORCH_DTYPE=bfloat16/float16 if supported
  • Reduce MAX_VIDEO_FRAMES to speed up video processing
  • For concurrency:
    • Each FastAPI/Uvicorn worker process loads its own model instance (typically one model per process)
    • For high throughput, prefer multiple processes or a queueing layer, as sketched below
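
One way to run multiple processes, as a sketch; "app:app" is an assumed import path, and the worker count should be sized to available GPU/CPU memory since each worker loads its own model copy.

```python
import uvicorn

if __name__ == "__main__":
    uvicorn.run("app:app", host="0.0.0.0", port=3000, workers=2)
```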

Data and directories

  • models/ contains downloaded model artifacts (implicitly created by Transformers cache); ignored by git
  • tmp/ used transiently for video decoding (temporary files)

Ignored artifacts (see .gitignore)

  • Python: .venv/, __pycache__/, .cache/, etc.
  • Large artifacts: models/, data/, uploads/, tmp/

Streaming resume details

  • Session store: an in-memory map from session_id to session state (buffer, producer thread, timers), expired after SESSIONS_TTL_SECONDS
  • Limits:
    • Ring buffer stores ~2048 SSE events per session by default (see the sketch after this list)
    • If the buffer overflows before a client resumes and persistence is disabled, the earliest chunks may be unavailable
  • End-of-stream:
    • Final chunk contains finish_reason: "stop"
    • "[DONE]" sentinel is emitted afterwards

Future enhancements

  • Redis persistence:
    • Add a Redis-backed store as a drop-in alongside SQLite
  • Token accounting:
    • Populate usage prompt/completion/total tokens when model exposes tokenization costs
  • Logging/observability:
    • Structured logs, request IDs, and metrics

Migration notes (from Node.js)

  • All Node.js server files and scripts were removed (index.js, package*.json, scripts/)
  • The server now targets Transformers models directly and supports multimodal inputs out of the box
  • The API remains OpenAI-compatible on /v1/chat/completions with resumable SSE and optional SQLite persistence