KillerKing93 committed
Commit c996e04 · verified · 1 Parent(s): 7cd14d8

Sync from GitHub bfbbe77

Files changed (1)
  1. README.md +170 -0
README.md CHANGED
@@ -1,3 +1,13 @@
+ ---
+ title: "Transformers Inference Server (Qwen3‑VL)"
+ emoji: 🐍
+ colorFrom: purple
+ colorTo: green
+ sdk: docker
+ app_port: 3000
+ pinned: false
+ ---
+
  # Python FastAPI Inference Server (OpenAI-Compatible) for Qwen3-VL-2B-Thinking
 
  This repository has been migrated from a Node.js/llama.cpp stack to a Python/Transformers stack to fully support multimodal inference (text, images, videos) with the Hugging Face Qwen3 models.
@@ -356,3 +366,163 @@ Notes:
  - Session TTL for GC: `SESSIONS_TTL_SECONDS` (default: 600)
  - See implementation in [Python.class \_SQLiteStore](main.py:481) and integration in [Python.function chat_completions](main.py:591).
  - Redis is not implemented yet; the design isolates persistence so a Redis-backed store can be added as a drop-in.
+
+ ## Deploy on Render
+
+ Render offers two easy options. Since our image already bakes in the model, the fastest path is to deploy the public Docker image (CPU). Render currently doesn’t provide NVIDIA/AMD GPUs for standard Web Services, so use the CPU image.
+
+ Option A — Deploy the public Docker image (recommended)
+ 1) In the Render Dashboard: New → Web Service
+ 2) Environment: Docker → Public Docker image
+ 3) Image
+    - ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-cpu
+ 4) Instance and region
+    - Region: closest to your users
+    - Instance type: pick a plan with at least 16 GB RAM (more if you see OOM)
+ 5) Port/health
+    - Render auto-injects PORT; the server binds to it via [Python.os.getenv()](main.py:71) (a minimal sketch of this pattern follows this list)
+    - Health Check Path: /health (served by [Python.function health](main.py:871))
+ 6) Start command
+    - Leave blank; the image uses CMD ["python","main.py"] as defined in [Dockerfile](Dockerfile:54). The app entry is [Python.main()](main.py:1).
+ 7) Environment variables
+    - EAGER_LOAD_MODEL=1
+    - MAX_TOKENS=4096
+    - HF_TOKEN=your_hf_token_here (only if the model is gated)
+    - Optional persistence:
+      - PERSIST_SESSIONS=1
+      - SESSIONS_DB_PATH=/data/sessions.db (requires a disk)
+ 8) Persistent Disk (optional)
+    - Add a Disk (e.g., 1–5 GB) and mount it at /data if you enable SQLite persistence
+ 9) Create the Web Service and wait for it to start
+ 10) Verify
+    - curl https://YOUR-SERVICE.onrender.com/health
+    - OpenAPI YAML: https://YOUR-SERVICE.onrender.com/openapi.yaml (served by [Python.function openapi_yaml](main.py:863))
+    - Chat endpoint: POST https://YOUR-SERVICE.onrender.com/v1/chat/completions (implemented in [Python.function chat_completions](main.py:891))
+
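+ A minimal sketch of the PORT-binding pattern from step 5 (illustrative only; the repository's main.py is the authoritative implementation, and the /health handler shown here is a stand-in for the real one):
+
+   # port_binding_sketch.py (hypothetical standalone example, not part of this repo)
+   import os
+
+   import uvicorn
+   from fastapi import FastAPI
+
+   app = FastAPI()
+
+   @app.get("/health")
+   def health():
+       # Render's Health Check Path points here
+       return {"status": "ok"}
+
+   if __name__ == "__main__":
+       # Render injects PORT at runtime; fall back to 3000 (the app_port used on Spaces) locally
+       port = int(os.getenv("PORT", "3000"))
+       uvicorn.run(app, host="0.0.0.0", port=port)
+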
+ Option B — Build directly from this GitHub repo (Dockerfile)
+ 1) In the Render Dashboard: New → Web Service → Build from a Git repo (connect this repo)
+ 2) Render will detect the Dockerfile automatically (no Build Command needed)
+ 3) Advanced → Docker Build Args
+    - BACKEND=cpu (ensures CPU-only torch wheel)
+ 4) Health and env vars
+    - Health Check Path: /health
+    - Set EAGER_LOAD_MODEL, MAX_TOKENS, HF_TOKEN as needed (same as Option A)
+ 5) (Optional) Add a Disk and mount it at /data, then set SESSIONS_DB_PATH=/data/sessions.db if you want resumable SSE across restarts
+ 6) Deploy (first build can take a while due to the multi-GB model layer)
+
+ Notes and limits on Render
+ - GPU acceleration (NVIDIA/AMD) isn’t available for standard Web Services on Render; use the CPU image.
+ - The image already contains the Qwen3-VL model under /app/hf-cache, so there’s no model download at runtime.
+ - SSE is supported; streaming is produced by [Python.function chat_completions](main.py:891). Keep the connection open to avoid idle timeouts (a Python streaming-client sketch follows this list).
+ - If you enable SQLite persistence, remember to attach a Disk; otherwise, the DB is ephemeral.
+
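+ If you want to consume the SSE stream from Python instead of curl, here is a minimal client sketch (assumptions: the endpoint emits OpenAI-style `data: {...}` lines ending with `data: [DONE]`, and the `requests` package is installed):
+
+   # sse_client_sketch.py (hypothetical example client, not part of this repo)
+   import json
+
+   import requests
+
+   url = "https://YOUR-SERVICE.onrender.com/v1/chat/completions"
+   payload = {"messages": [{"role": "user", "content": "Hello"}], "stream": True}
+
+   with requests.post(url, json=payload, stream=True, timeout=600) as resp:
+       resp.raise_for_status()
+       for line in resp.iter_lines(decode_unicode=True):
+           if not line or not line.startswith("data: "):
+               continue  # skip blank separators and keep-alives
+           data = line[len("data: "):]
+           if data == "[DONE]":
+               break
+           chunk = json.loads(data)
+           # OpenAI-style streaming chunks carry deltas; adjust if the server's schema differs
+           delta = chunk["choices"][0]["delta"].get("content", "")
+           print(delta, end="", flush=True)
+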
+ Example render.yaml (optional IaC)
+ If you prefer infrastructure-as-code, you can use a render.yaml like:
+
+   services:
+     - type: web
+       name: qwen-vl-cpu
+       env: docker
+       image:
+         url: ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-cpu
+       plan: standard
+       region: oregon
+       healthCheckPath: /health
+       autoDeploy: true
+       envVars:
+         - key: EAGER_LOAD_MODEL
+           value: "1"
+         - key: MAX_TOKENS
+           value: "4096"
+         # - key: HF_TOKEN
+         #   sync: false # set in dashboard or use Render secrets
+         # - key: PERSIST_SESSIONS
+         #   value: "1"
+         # - key: SESSIONS_DB_PATH
+         #   value: "/data/sessions.db"
+       disks:
+         # Uncomment if using persistence
+         # - name: data
+         #   mountPath: /data
+         #   sizeGB: 5
+
449
+ After deploy:
450
+ - Health: GET /health
451
+ - OpenAPI: GET /openapi.yaml
452
+ - Inference:
453
+ curl -X POST https://YOUR-SERVICE.onrender.com/v1/chat/completions \
454
+ -H "Content-Type: application/json" \
455
+ -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}],\"max_tokens\":128}"
456
+
+ ## Deploy on Hugging Face Spaces
+
+ Recommended: a Docker Space (works with our FastAPI app and preserves multimodal behavior). You can run CPU or GPU hardware. To persist the HF cache across restarts, enable Persistent Storage and point the HF cache to /data.
+
+ A) Create the Space (Docker)
+ 1) Install the CLI and log in:
+     pip install -U "huggingface_hub[cli]"
+     huggingface-cli login
+
+ 2) Create a Docker Space (public or private):
+     huggingface-cli repo create my-qwen3-vl-server --type space --space_sdk docker
+
+ 3) Add the Space as a remote and push this repo:
+     git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/my-qwen3-vl-server
+     git push hf main
+
+ This pushes the Dockerfile, main.py, and requirements.txt; the Space will auto-build your container. An equivalent flow using the huggingface_hub Python API is sketched below.
+
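+ As an alternative to the CLI, the same Space can be created and populated with the huggingface_hub Python API (a sketch; YOUR_USERNAME/my-qwen3-vl-server is a placeholder, and upload_folder assumes you run the script from the repo root):
+
+   # create_space_sketch.py (hypothetical helper, not part of this repo)
+   from huggingface_hub import HfApi
+
+   api = HfApi()  # uses the token stored by `huggingface-cli login` (or HF_TOKEN)
+   repo_id = "YOUR_USERNAME/my-qwen3-vl-server"
+
+   # Create a Docker Space (no-op if it already exists)
+   api.create_repo(repo_id=repo_id, repo_type="space", space_sdk="docker", exist_ok=True)
+
+   # Upload Dockerfile, main.py, requirements.txt, etc. from the current directory
+   api.upload_folder(folder_path=".", repo_id=repo_id, repo_type="space")
+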
+ B) Configure Space settings
+ - Hardware:
+   - CPU: works out of the box (fast to build, slower inference).
+   - GPU: choose a GPU tier (e.g., T4/A10G/L4) for faster inference.
+
+ - Persistent Storage (recommended):
+   - Enable Persistent Storage (e.g., 10–30 GB).
+   - This lets you cache models and sessions across restarts.
+
+ - Variables and Secrets (these can also be set from Python; see the sketch after this list):
+   - Variables:
+     - EAGER_LOAD_MODEL=1
+     - MAX_TOKENS=4096
+     - HF_HOME=/data/hf-cache
+     - TRANSFORMERS_CACHE=/data/hf-cache
+   - Secrets:
+     - HF_TOKEN=your_hf_token_if_model_is_gated
+
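+ The same variables and secrets can be set programmatically (a sketch using huggingface_hub's Space-management helpers; assumes a recent huggingface_hub release and a token with write access to the Space):
+
+   # configure_space_sketch.py (hypothetical helper, not part of this repo)
+   from huggingface_hub import HfApi
+
+   api = HfApi()
+   repo_id = "YOUR_USERNAME/my-qwen3-vl-server"
+
+   # Plain variables (visible in the Space settings UI)
+   api.add_space_variable(repo_id, "EAGER_LOAD_MODEL", "1")
+   api.add_space_variable(repo_id, "MAX_TOKENS", "4096")
+   api.add_space_variable(repo_id, "HF_HOME", "/data/hf-cache")
+   api.add_space_variable(repo_id, "TRANSFORMERS_CACHE", "/data/hf-cache")
+
+   # Secret (hidden once set); only needed if the model is gated
+   api.add_space_secret(repo_id, "HF_TOKEN", "hf_xxx")
+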
+ C) CPU vs GPU on Spaces
+ - CPU: no change needed. Our Dockerfile defaults to CPU PyTorch and bakes the model during the build, so it runs on CPU Spaces as-is.
+ - GPU: edit the Space’s Dockerfile to switch the backend before the next build:
+   - In the Space UI file editor, change:
+       ARG BACKEND=cpu
+     to:
+       ARG BACKEND=nvidia
+   - Save/commit; the Space rebuilds with a CUDA-enabled torch. Choose a GPU hardware tier in the Space settings. Note: building the GPU image pulls CUDA torch wheels and increases build time. (A quick GPU-visibility check is sketched after this list.)
+ - AMD ROCm is not available on Spaces; use an NVIDIA GPU tier.
+
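+ After switching to BACKEND=nvidia and selecting a GPU tier, you can sanity-check that the container actually sees the GPU (a small sketch; run it inside the Space, e.g. temporarily at startup or from a debug shell):
+
+   # gpu_check_sketch.py (hypothetical check, not part of this repo)
+   import torch
+
+   if torch.cuda.is_available():
+       # A CUDA build of torch plus visible GPU hardware
+       print("CUDA available:", torch.cuda.get_device_name(0))
+   else:
+       # Either the CPU torch wheel is installed or no GPU tier is attached
+       print("Running on CPU")
+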
+ D) Speed up cold starts and caching
+ - With Persistent Storage enabled and HF_HOME/TRANSFORMERS_CACHE pointed at /data/hf-cache, the model cache persists across restarts, so subsequent startups are much faster (see the cache sketch after this list).
+ - Keep the Space “Always on” if your plan supports it to avoid cold starts.
+
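+ A minimal sketch of how the persistent cache is reused (assumptions: HF_HOME is set before the Hugging Face libraries are imported, and the model id shown is a placeholder for whatever main.py actually loads):
+
+   # cache_warm_sketch.py (hypothetical example, not part of this repo)
+   import os
+
+   os.environ.setdefault("HF_HOME", "/data/hf-cache")  # must be set before importing huggingface_hub/transformers
+
+   from huggingface_hub import snapshot_download
+
+   # The first start downloads into /data/hf-cache; later restarts resolve from the persistent disk
+   local_path = snapshot_download("Qwen/Qwen3-VL-2B-Thinking")  # placeholder model id
+   print("Model files cached at:", local_path)
+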
+ E) Space endpoints
+ - Space page: https://huggingface.co/spaces/YOUR_USERNAME/my-qwen3-vl-server (proxies to your container); for API calls, use the Space’s direct URL, https://YOUR-SPACE-Subdomain.hf.space, as in the quick tests below.
+ - Health: GET /health (implemented by [Python.function health](main.py:871))
+ - OpenAPI YAML: GET /openapi.yaml (implemented by [Python.function openapi_yaml](main.py:863))
+ - Chat Completions: POST /v1/chat/completions (non-stream + SSE) [Python.function chat_completions](main.py:891) (an OpenAI-SDK client example follows this list)
+ - Cancel: POST /v1/cancel/{session_id} [Python.function cancel_session](main.py:1091)
+
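+ Because the server is OpenAI-compatible, any OpenAI client can target the Space's direct URL (a sketch; the dummy api_key and the model name are placeholders, and the server may ignore the model field):
+
+   # openai_client_sketch.py (hypothetical example, not part of this repo)
+   from openai import OpenAI
+
+   client = OpenAI(
+       base_url="https://YOUR-SPACE-Subdomain.hf.space/v1",  # direct Space URL + /v1
+       api_key="not-needed",  # the server requires no key; the SDK just wants a non-empty string
+   )
+
+   resp = client.chat.completions.create(
+       model="qwen3-vl-2b-thinking",  # placeholder; the server may ignore it
+       messages=[{"role": "user", "content": "Hello from the OpenAI SDK!"}],
+       max_tokens=128,
+   )
+   print(resp.choices[0].message.content)
+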
514
+ F) Quick test after the Space is “Running”
515
+ - Health:
516
+ curl -s https://YOUR-SPACE-Subdomain.hf.space/health
517
+ - Non-stream:
518
+ curl -s -X POST https://YOUR-SPACE-Subdomain.hf.space/v1/chat/completions \
519
+ -H "Content-Type: application/json" \
520
+ -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello from HF Spaces!\"}],\"max_tokens\":128}"
521
+ - Streaming:
522
+ curl -N -H "Content-Type: application/json" \
523
+ -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: 17*23?\"}],\"stream\":true}" \
524
+ https://YOUR-SPACE-Subdomain.hf.space/v1/chat/completions
525
+
+ Notes
+ - The Space build step can appear “idle” after “Model downloaded.” while Docker commits a multi‑GB layer; this is expected.
+ - If you hit OOM, increase the Space hardware memory or switch to a GPU tier. Reduce MAX_VIDEO_FRAMES and MAX_TOKENS if needed.