---
title: Transformers Inference Server (Qwen3-VL)
emoji: 🐍
colorFrom: purple
colorTo: green
sdk: docker
app_port: 3000
pinned: false
---
Python FastAPI Inference Server (OpenAI-Compatible) for Qwen3-VL-2B-Thinking
This repository has been migrated from a Node.js/llama.cpp stack to a Python/Transformers stack to fully support multimodal inference (text, images, videos) with the Hugging Face Qwen3 models.
Key files:
- Server entry: main.py
- Environment template: .env.example
- Python dependencies: requirements.txt
- Architecture: ARCHITECTURE.md (will be updated to reflect the Python stack)
Model:
- Default: Qwen/Qwen3-VL-2B-Thinking (Transformers; supports multimodal)
- You can change the model via environment variable MODEL_REPO_ID.
Node.js artifacts and scripts from the previous project have been removed.
Quick Start
Option 1: Run with Docker (with-model images: CPU / NVIDIA / AMD)
Tags built by CI:
- ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-cpu
- ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-nvidia
- ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-amd
Pull:
# CPU
docker pull ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-cpu
# NVIDIA (CUDA 12.4 wheel)
docker pull ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-nvidia
# AMD (ROCm 6.2 wheel)
docker pull ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-amd
Run:
# CPU
docker run -p 3000:3000 \
-e HF_TOKEN=your_hf_token_here \
ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-cpu
# NVIDIA GPU (requires NVIDIA drivers + nvidia-container-toolkit on the host)
docker run --gpus all -p 3000:3000 \
-e HF_TOKEN=your_hf_token_here \
ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-nvidia
# AMD GPU ROCm (requires ROCm 6.2+ drivers on the host; Linux only)
# Map ROCm devices and video group (may vary by distro)
docker run --device=/dev/kfd --device=/dev/dri --group-add video \
-p 3000:3000 \
-e HF_TOKEN=your_hf_token_here \
ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-amd
Health check:
curl http://localhost:3000/health
Notes:
- These are with-model images; the first pull is large. In CI, after "Model downloaded." BuildKit may appear idle while tarring/committing the multi‑GB layer.
- Host requirements:
- NVIDIA: recent driver + nvidia-container-toolkit.
- AMD: ROCm 6.2+ driver stack, supported GPU, and mapped /dev/kfd and /dev/dri devices.
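Because the API is OpenAI-compatible, you can also exercise a running container from the official openai Python SDK by overriding base_url. A minimal sketch (assuming the server does not validate API keys, so any placeholder value works):

import openai

# Point the SDK at the local container; the api_key is a dummy value because
# this server is assumed not to authenticate requests.
client = openai.OpenAI(base_url="http://localhost:3000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="qwen-local",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)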
Option 2: Run Locally
Requirements
- Python 3.10+
- pip
- PyTorch (install a wheel matching your platform/CUDA)
- Optionally a GPU with enough VRAM for the chosen model
Install
Create and activate a virtual environment (Windows CMD):
python -m venv .venv
.venv\Scripts\activate
Install dependencies: pip install -r requirements.txt
Install PyTorch appropriate for your platform (examples):
CPU-only: pip install torch --index-url https://download.pytorch.org/whl/cpu
CUDA 12.4: pip install torch --index-url https://download.pytorch.org/whl/cu124
Create a .env from the template and adjust if needed: copy .env.example .env
- Set HF_TOKEN if the model is gated
- Adjust MAX_TOKENS, TEMPERATURE, DEVICE_MAP, TORCH_DTYPE, MAX_VIDEO_FRAMES as desired
Configuration via .env
See .env.example. Important variables:
- PORT=3000
- MODEL_REPO_ID=Qwen/Qwen3-VL-2B-Thinking
- HF_TOKEN= # optional if gated
- MAX_TOKENS=4096
- TEMPERATURE=0.7
- MAX_VIDEO_FRAMES=16
- DEVICE_MAP=auto
- TORCH_DTYPE=auto
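For reference, a minimal sketch of how these variables are typically read on the Python side with python-dotenv (simplified; the actual main.py may load and validate them differently):

import os
from dotenv import load_dotenv  # python-dotenv

load_dotenv()  # read .env from the working directory

PORT = int(os.getenv("PORT", "3000"))
MODEL_REPO_ID = os.getenv("MODEL_REPO_ID", "Qwen/Qwen3-VL-2B-Thinking")
HF_TOKEN = os.getenv("HF_TOKEN") or None          # optional, for gated models
MAX_TOKENS = int(os.getenv("MAX_TOKENS", "4096"))
TEMPERATURE = float(os.getenv("TEMPERATURE", "0.7"))
MAX_VIDEO_FRAMES = int(os.getenv("MAX_VIDEO_FRAMES", "16"))
DEVICE_MAP = os.getenv("DEVICE_MAP", "auto")
TORCH_DTYPE = os.getenv("TORCH_DTYPE", "auto")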
Additional streaming/persistence configuration
- PERSIST_SESSIONS=1 # enable SQLite-backed resumable SSE
- SESSIONS_DB_PATH=sessions.db # SQLite db path
- SESSIONS_TTL_SECONDS=600 # TTL for finished sessions before GC
- CANCEL_AFTER_DISCONNECT_SECONDS=3600 # auto-cancel generation if all clients disconnect for this many seconds (0=disable)
Cancel session API (custom extension)
Endpoint: POST /v1/cancel/{session_id}
Purpose: Manually cancel an in-flight streaming generation for the given session_id. Not part of OpenAI Chat Completions spec (the newer OpenAI Responses API has cancel), so this is provided as a practical extension.
Example (Windows CMD): curl -X POST http://localhost:3000/v1/cancel/mysession123
Run
Direct: python main.py
Using uvicorn: uvicorn main:app --host 0.0.0.0 --port 3000
Endpoints (OpenAI-compatible)
Health
GET /health
Example: curl http://localhost:3000/health
Response: { "ok": true, "modelReady": true, "modelId": "Qwen/Qwen3-VL-2B-Thinking", "error": null }
Chat Completions (non-streaming)
POST /v1/chat/completions
Example (Windows CMD):
curl -X POST http://localhost:3000/v1/chat/completions ^
  -H "Content-Type: application/json" ^
  -d "{\"model\":\"qwen-local\",\"messages\":[{\"role\":\"user\",\"content\":\"Describe this image briefly\"}],\"max_tokens\":128}"
Example (PowerShell):
$body = @{ model = "qwen-local"; messages = @(@{ role = "user"; content = "Hello Qwen3!" }); max_tokens = 128 } | ConvertTo-Json -Depth 5
Invoke-RestMethod -Method Post -Uri http://localhost:3000/v1/chat/completions -ContentType "application/json" -Body $body
Chat Completions (streaming via Server-Sent Events)
Set "stream": true to receive partial deltas as they are generated.
Example (Windows CMD):
curl -N -H "Content-Type: application/json" ^
  -d "{\"model\":\"qwen-local\",\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: what is 17 * 23?\"}],\"stream\":true}" ^
  http://localhost:3000/v1/chat/completions
The stream format follows OpenAI-style SSE:
data: { "id": "...", "object": "chat.completion.chunk", "choices": [{ "delta": { "role": "assistant" } }] }
data: { "choices": [{ "delta": { "content": "To" } }] }
data: { "choices": [{ "delta": { "content": " think..." } }] }
...
data: { "choices": [{ "delta": {}, "finish_reason": "stop" }] }
data: [DONE]
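For programmatic consumption, here is a minimal Python sketch that reads the SSE stream with requests and reassembles the reply (field names follow the chunks shown above; error handling omitted):

import json
import requests

payload = {
    "model": "qwen-local",
    "messages": [{"role": "user", "content": "Think step by step: what is 17 * 23?"}],
    "stream": True,
}
with requests.post("http://localhost:3000/v1/chat/completions", json=payload, stream=True) as r:
    r.raise_for_status()
    text = ""
    for line in r.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue  # skip blank keep-alive lines and SSE "id:" lines
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        text += delta.get("content", "")
print(text)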
Multimodal Usage
Text only: { "role": "user", "content": "Summarize: The quick brown fox ..." }
Image by URL: { "role": "user", "content": [ { "type": "text", "text": "What is in this image?" }, { "type": "image_url", "image_url": { "url": "https://example.com/cat.jpg" } } ] }
Image by base64: { "role": "user", "content": [ { "type": "text", "text": "OCR this." }, { "type": "input_image", "b64_json": "<BASE64_IMAGE_DATA>" } ] }
Video by URL (frames are sampled up to MAX_VIDEO_FRAMES): { "role": "user", "content": [ { "type": "text", "text": "Describe this clip." }, { "type": "video_url", "video_url": { "url": "https://example.com/clip.mp4" } } ] }
Video by base64: { "role": "user", "content": [ { "type": "text", "text": "Count the number of cars." }, { "type": "input_video", "b64_json": "<BASE64_VIDEO_DATA>" } ] }
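To send a base64 image from Python, a small sketch (the content-part shapes mirror the examples above; substitute your own image path and host):

import base64
import requests

with open("cat.jpg", "rb") as f:  # any local image file
    b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "model": "qwen-local",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "input_image", "b64_json": b64},
        ],
    }],
    "max_tokens": 128,
}
resp = requests.post("http://localhost:3000/v1/chat/completions", json=payload, timeout=300)
print(resp.json()["choices"][0]["message"]["content"])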
Implementation Notes
- Server code: main.py
- FastAPI with CORS enabled
- Non-streaming and streaming endpoints
- Uses AutoProcessor and AutoModelForCausalLM with trust_remote_code=True
- Converts OpenAI-style messages into the Qwen multimodal format
- Images loaded via PIL; videos loaded via imageio.v3 (preferred) or OpenCV as fallback; frames sampled
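For orientation, here is a simplified sketch of the load-and-generate flow described above, using the classes named in these notes (not the actual main.py, which adds streaming, image/video handling, and error paths):

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "Qwen/Qwen3-VL-2B-Thinking"
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, device_map="auto", torch_dtype="auto"
)

# Qwen-style multimodal message list (already converted from the OpenAI shape)
messages = [{"role": "user", "content": [{"type": "text", "text": "Hello!"}]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
new_tokens = out[:, inputs["input_ids"].shape[1]:]  # strip the prompt tokens
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])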
Performance Tips
- On GPUs: set DEVICE_MAP=auto and TORCH_DTYPE=bfloat16 or float16 if supported
- Reduce MAX_VIDEO_FRAMES to speed up video processing
- Tune MAX_TOKENS and TEMPERATURE according to your needs
Troubleshooting
- ImportError or no CUDA found:
- Ensure PyTorch is installed with the correct wheel for your environment.
- OOM / CUDA out of memory:
- Use a smaller model, lower MAX_VIDEO_FRAMES, lower MAX_TOKENS, or run on CPU.
- 503 Model not ready:
- The first request triggers model load; check /health for errors and HF_TOKEN if gated.
License
- See LICENSE for terms.
Changelog and Architecture
- We will update ARCHITECTURE.md to reflect the Python server flow.
Streaming behavior, resume, and reconnections
The server streams responses using Server-Sent Events (SSE) from the chat_completions() handler, driven by token iteration in infer_stream() (both in main.py). It now supports resumable streaming using an in-memory ring buffer and SSE Last-Event-ID, with optional SQLite persistence (enable PERSIST_SESSIONS=1).
What’s implemented
- Per-session in-memory ring buffer keyed by session_id (no external storage).
- Each SSE event carries an SSE id line in the format "session_id:index" so clients can resume with Last-Event-ID.
- On reconnect:
- Provide the same session_id in the request body, and
- Provide "Last-Event-ID: session_id:index" header (or query ?last_event_id=session_id:index).
- The server replays cached events after index and continues streaming new tokens.
- Session TTL: ~10 minutes, buffer capacity: ~2048 events. Old or finished sessions are garbage-collected in-memory.
How to start a streaming session
Minimal (server generates a session_id internally for SSE id lines), Windows CMD:
curl -N -H "Content-Type: application/json" ^
  -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: 17*23?\"}],\"stream\":true}" ^
  http://localhost:3000/v1/chat/completions
With explicit session_id (recommended if you want to resume), Windows CMD:
curl -N -H "Content-Type: application/json" ^
  -d "{\"session_id\":\"mysession123\",\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: 17*23?\"}],\"stream\":true}" ^
  http://localhost:3000/v1/chat/completions
How to resume after disconnect
Use the same session_id and the SSE Last-Event-ID header (or ?last_event_id=...). Windows CMD (resume from index 42):
curl -N -H "Content-Type: application/json" ^
  -H "Last-Event-ID: mysession123:42" ^
  -d "{\"session_id\":\"mysession123\",\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: 17*23?\"}],\"stream\":true}" ^
  http://localhost:3000/v1/chat/completions
Alternatively with query string: http://localhost:3000/v1/chat/completions?last_event_id=mysession123:42
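A minimal Python sketch of the resume flow: remember the last SSE id while streaming, and send it back via Last-Event-ID after a dropped connection (simplified; retry backoff and limits omitted):

import json
import requests

URL = "http://localhost:3000/v1/chat/completions"
payload = {
    "session_id": "mysession123",
    "messages": [{"role": "user", "content": "Think step by step: 17*23?"}],
    "stream": True,
}

last_event_id = None  # becomes e.g. "mysession123:42" as events arrive
text = ""
done = False
while not done:
    headers = {"Last-Event-ID": last_event_id} if last_event_id else {}
    try:
        with requests.post(URL, json=payload, headers=headers, stream=True, timeout=(5, 300)) as r:
            r.raise_for_status()
            for line in r.iter_lines(decode_unicode=True):
                if line.startswith("id: "):
                    last_event_id = line[len("id: "):]  # remember our position
                elif line.startswith("data: "):
                    data = line[len("data: "):]
                    if data == "[DONE]":
                        done = True
                        break
                    delta = json.loads(data)["choices"][0].get("delta", {})
                    text += delta.get("content", "")
    except requests.RequestException:
        pass  # connection dropped; loop reconnects and replays from last_event_id
print(text)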
Event format
Chunks follow the OpenAI-style "chat.completion.chunk" shape in data payloads, plus an SSE id line:
id: mysession123:5
data: {"id":"mysession123","object":"chat.completion.chunk","created":..., "model":"...", "choices":[{"index":0,"delta":{"content":" token"},"finish_reason":null}]}
The stream ends with:
data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
Notes and limits
- By default, session state is kept only in memory; restarts drop the buffers unless SQLite persistence is enabled (PERSIST_SESSIONS=1).
- If the buffer overflows before you resume, the earliest chunks may be unavailable.
- Cancellation on client disconnect is not immediate; generation keeps running in the background until CANCEL_AFTER_DISCONNECT_SECONDS elapses (see "Cancellation and session persistence" below) or a manual cancel is issued.
Hugging Face repository files support
This server loads the Qwen3-VL model via Transformers with trust_remote_code=True, so the standard files from the repo are supported and consumed automatically. Summary for https://huggingface.co/Qwen/Qwen3-VL-2B-Thinking/tree/main:
Used by model weights and architecture
- model.safetensors — main weights loaded by AutoModelForCausalLM
- config.json — architecture/config
- generation_config.json — default gen params (we may override via request or env)
Used by tokenizer
- tokenizer.json — primary tokenizer specification
- tokenizer_config.json — tokenizer settings
- merges.txt and vocab.json — fallback/compat files; if tokenizer.json exists, HF generally prefers it
Used by processors (multimodal)
- preprocessor_config.json — image/text processor config
- video_preprocessor_config.json — video processor config (frame sampling, etc.)
- chat_template.json — chat formatting used by infer() and infer_stream() via processor.apply_chat_template(...)
Not required for runtime
- README.md, .gitattributes — ignored by runtime
Notes:
- We rely on Transformers’ AutoModelForCausalLM and AutoProcessor to resolve and use the above files; no manual parsing is required in our code.
- With trust_remote_code=True, model-specific code from the repo may load additional assets transparently.
- If the repo updates configs (e.g., a new chat template), the server will pick them up on the next load.
Cancellation and session persistence
Auto-cancel on disconnect:
- Generation is automatically cancelled if all clients stay disconnected for more than CANCEL_AFTER_DISCONNECT_SECONDS (default 3600 seconds = 1 hour). Configure it via CANCEL_AFTER_DISCONNECT_SECONDS in .env.example.
- Implemented by a timer in chat_completions() that triggers a cooperative stop through a stopping criteria in infer_stream().
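Conceptually, that cooperative stop can be expressed as a StoppingCriteria bound to a shared threading.Event; a simplified sketch of the mechanism (not the exact code in main.py):

import threading
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class CancelCriteria(StoppingCriteria):
    """Stops generation as soon as the shared event is set."""
    def __init__(self, cancel_event: threading.Event):
        self.cancel_event = cancel_event

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        return self.cancel_event.is_set()

cancel_event = threading.Event()
stopping = StoppingCriteriaList([CancelCriteria(cancel_event)])

# The disconnect timer (or POST /v1/cancel/{session_id}) simply calls
# cancel_event.set(), and a running model.generate(..., stopping_criteria=stopping)
# returns early on the next decoding step.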
Manual cancel API (custom extension):
- Endpoint: POST /v1/cancel/{session_id}
- Cancels an ongoing streaming session and marks it finished in the store.
- Example (Windows CMD): curl -X POST http://localhost:3000/v1/cancel/mysession123
- This is not part of OpenAI’s legacy Chat Completions spec. OpenAI’s newer Responses API has a cancel endpoint, but Chat Completions does not. We provide this custom endpoint for operational control.
Persistence:
- Optional SQLite-backed persistence for resumable SSE (enable PERSIST_SESSIONS=1 in .env.example).
- Database path: SESSIONS_DB_PATH (default: sessions.db)
- Session TTL for GC: SESSIONS_TTL_SECONDS (default: 600)
- See the _SQLiteStore class and its integration in chat_completions() in main.py.
- Redis is not implemented yet; the design isolates persistence so a Redis-backed store can be added as a drop-in.
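To illustrate the persistence design, here is a heavily simplified stand-in for a SQLite-backed event store, reduced to append and replay (the real _SQLiteStore differs in schema and housekeeping):

import sqlite3
import time

class SimpleEventStore:
    """Append SSE events per session and replay them after a given index."""
    def __init__(self, path: str = "sessions.db"):
        self.db = sqlite3.connect(path, check_same_thread=False)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            " session_id TEXT, idx INTEGER, payload TEXT, created REAL,"
            " PRIMARY KEY (session_id, idx))"
        )

    def append(self, session_id: str, idx: int, payload: str) -> None:
        self.db.execute(
            "INSERT OR REPLACE INTO events VALUES (?, ?, ?, ?)",
            (session_id, idx, payload, time.time()),
        )
        self.db.commit()

    def replay_after(self, session_id: str, idx: int):
        cur = self.db.execute(
            "SELECT idx, payload FROM events WHERE session_id = ? AND idx > ? ORDER BY idx",
            (session_id, idx),
        )
        return cur.fetchall()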
Deploy on Render
Render has two easy options. Since our image already bakes the model, the fastest path is to deploy the public Docker image (CPU). Render currently doesn’t provide NVIDIA/AMD GPUs for standard Web Services, so use the CPU image.
Option A — Deploy public Docker image (recommended)
- In Render Dashboard: New → Web Service
- Environment: Docker → Public Docker image
- Image
- ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-cpu
- Instance and region
- Region: closest to your users
- Instance type: pick a plan with at least 16 GB RAM (more if you see OOM)
- Port/health
- Render auto-injects PORT; the server binds to it via os.getenv("PORT")
- Health Check Path: /health (served by the health() handler in main.py)
- Start command
- Leave blank; the image uses CMD ["python","main.py"] as defined in the Dockerfile, so main.py is the app entry point.
- Environment variables
- EAGER_LOAD_MODEL=1
- MAX_TOKENS=4096
- HF_TOKEN=your_hf_token_here (only if the model is gated)
- Optional persistence:
- PERSIST_SESSIONS=1
- SESSIONS_DB_PATH=/data/sessions.db (requires a disk)
- Persistent Disk (optional)
- Add a Disk (e.g., 1–5 GB) and mount it at /data if you enable SQLite persistence
- Create Web Service and wait for it to start
- Verify
- curl https://YOUR-SERVICE.onrender.com/health
- OpenAPI YAML: https://YOUR-SERVICE.onrender.com/openapi.yaml (served by the openapi_yaml() handler)
- Chat endpoint: POST https://YOUR-SERVICE.onrender.com/v1/chat/completions (implemented in chat_completions())
Option B — Build directly from this GitHub repo (Dockerfile)
- In Render Dashboard: New → Web Service → Build from a Git repo (connect this repo)
- Render will detect the Dockerfile automatically (no Build Command needed)
- Advanced → Docker Build Args
- BACKEND=cpu (ensures CPU-only torch wheel)
- Health and env vars
- Health Check Path: /health
- Set EAGER_LOAD_MODEL, MAX_TOKENS, HF_TOKEN as needed (same as Option A)
- (Optional) Add a Disk and mount at /data, then set SESSIONS_DB_PATH=/data/sessions.db if you want resumable SSE across restarts
- Deploy (first build can take a while due to the multi-GB model layer)
Notes and limits on Render
- GPU acceleration (NVIDIA/AMD) isn’t available for standard Web Services on Render; use the CPU image.
- The image already contains the Qwen3-VL model under /app/hf-cache, so there’s no model download at runtime.
- SSE is supported; streaming is produced by chat_completions(). Keep the connection open to avoid idle timeouts.
- If you enable SQLite persistence, remember to attach a Disk; otherwise, the DB is ephemeral.
Example render.yaml (optional IaC)
If you prefer infrastructure-as-code, you can use a render.yaml like:
services:
  - type: web
    name: qwen-vl-cpu
    env: docker
    image:
      url: ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-cpu
    plan: standard
    region: oregon
    healthCheckPath: /health
    autoDeploy: true
    envVars:
      - key: EAGER_LOAD_MODEL
        value: "1"
      - key: MAX_TOKENS
        value: "4096"
      - key: HF_TOKEN
        sync: false # set in dashboard or use Render secrets
      - key: PERSIST_SESSIONS
        value: "1"
      - key: SESSIONS_DB_PATH
        value: "/data/sessions.db"
    # Uncomment if using persistence:
    # disks:
    #   - name: data
    #     mountPath: /data
    #     sizeGB: 5
After deploy:
- Health: GET /health
- OpenAPI: GET /openapi.yaml
- Inference:
curl -X POST https://YOUR-SERVICE.onrender.com/v1/chat/completions ^
  -H "Content-Type: application/json" ^
  -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}],\"max_tokens\":128}"
Deploy on Hugging Face Spaces
Recommended: Docker Space (works with our FastAPI app and preserves multimodal behavior). You can run CPU or GPU hardware. To persist the HF cache across restarts, enable Persistent Storage and point HF cache to /data.
A) Create the Space (Docker)
Install CLI and login:
pip install -U "huggingface_hub[cli]"
huggingface-cli login
Create a Docker Space (public or private):
huggingface-cli repo create my-qwen3-vl-server --type space --space_sdk docker
Add the Space as a remote and push this repo:
git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/my-qwen3-vl-server
git push hf main
This pushes Dockerfile, main.py, requirements.txt. The Space will auto-build your container.
B) Configure Space settings
Hardware:
- CPU: works out-of-the-box (fast to build, slower inference).
- GPU: choose a GPU tier (e.g., T4/A10G/L4) for faster inference.
Persistent Storage (recommended):
- Enable Persistent storage (e.g., 10–30 GB).
- This lets you cache models and sessions across restarts.
Variables and Secrets:
- Variables:
- EAGER_LOAD_MODEL=1
- MAX_TOKENS=4096
- HF_HOME=/data/hf-cache
- TRANSFORMERS_CACHE=/data/hf-cache
- Secrets:
- HF_TOKEN=your_hf_token_if_model_is_gated
C) CPU vs GPU on Spaces
- CPU: No change needed. Our Dockerfile defaults to CPU PyTorch and bakes the model during build. It will run on CPU Spaces.
- GPU: Edit the Space’s Dockerfile to switch the backend before the next build:
- In the file editor of the Space UI, change: ARG BACKEND=cpu to: ARG BACKEND=nvidia
- Save/commit; the Space rebuilds with a CUDA-enabled torch. Choose a GPU hardware tier in the Space settings. Note: Building the GPU image pulls CUDA torch wheels and increases build time.
- AMD ROCm is not available on Spaces; use NVIDIA GPUs on Spaces.
D) Speed up cold starts and caching
- With Persistent Storage enabled and HF_HOME/TRANSFORMERS_CACHE pointed to /data/hf-cache, the model cache persists across restarts (subsequent spins are much faster).
- Keep the Space “Always on” if available on your plan to avoid cold starts.
E) Space endpoints
- Base URL: https://huggingface.co/spaces/YOUR_USERNAME/my-qwen3-vl-server (Spaces proxy to your container)
- Health: GET /health (implemented by the health() handler)
- OpenAPI YAML: GET /openapi.yaml (implemented by openapi_yaml())
- Chat Completions: POST /v1/chat/completions (non-stream + SSE, implemented by chat_completions())
- Cancel: POST /v1/cancel/{session_id} (implemented by cancel_session())
F) Quick test after the Space is “Running”
- Health: curl -s https://YOUR-SPACE-Subdomain.hf.space/health
- Non-stream:
curl -s -X POST https://YOUR-SPACE-Subdomain.hf.space/v1/chat/completions ^
  -H "Content-Type: application/json" ^
  -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello from HF Spaces!\"}],\"max_tokens\":128}"
- Streaming:
curl -N -H "Content-Type: application/json" ^
  -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: 17*23?\"}],\"stream\":true}" ^
  https://YOUR-SPACE-Subdomain.hf.space/v1/chat/completions
Notes
- The Space build step can appear “idle” after “Model downloaded.” while Docker commits a multi‑GB layer; this is expected.
- If you hit OOM, increase the Space hardware memory or switch to a GPU tier. Reduce MAX_VIDEO_FRAMES and MAX_TOKENS if needed.