Sync from GitHub 929e477
README.md
CHANGED
@@ -74,6 +74,12 @@ Health check:
 curl http://localhost:3000/health
 ```
 
+Swagger UI:
+http://localhost:3000/docs
+
+OpenAPI (YAML):
+http://localhost:3000/openapi.yaml
+
 Notes:
 - These are with-model images; the first pull is large. In CI, after "Model downloaded." BuildKit may appear idle while tarring/committing the multi‑GB layer.
 - Host requirements:
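The new Swagger UI and OpenAPI routes are easy to verify once a container is up. A minimal smoke test, assuming the server is listening on localhost:3000 as in the health check above:

```python
# Smoke-test the documentation endpoints added above.
# Assumes the server is already running on http://localhost:3000.
import urllib.request

BASE = "http://localhost:3000"

for path in ("/health", "/docs", "/openapi.yaml"):
    with urllib.request.urlopen(BASE + path) as resp:
        # Print the status code and the first 60 bytes of each response.
        print(path, resp.status, resp.read(60))
```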
@@ -144,6 +150,10 @@ Cancel session API (custom extension)
 
 Endpoints (OpenAI-compatible)
 
+- Swagger UI
+  GET /docs
+- OpenAPI (YAML)
+  GET /openapi.yaml
 - Health
   GET /health
   Example:
@@ -161,13 +171,13 @@ Endpoints (OpenAI-compatible)
 Example (Windows CMD):
 curl -X POST http://localhost:3000/v1/chat/completions ^
   -H "Content-Type: application/json" ^
-  -d "{\"model\":\"qwen-local\",\"messages\":[{\"role\":\"user\",\"content\":\"Describe this image briefly\"}],\"max_tokens\":
+  -d "{\"model\":\"qwen-local\",\"messages\":[{\"role\":\"user\",\"content\":\"Describe this image briefly\"}],\"max_tokens\":4096}"
 
 Example (PowerShell):
 $body = @{
   model = "qwen-local"
   messages = @(@{ role = "user"; content = "Hello Qwen3!" })
-  max_tokens =
+  max_tokens = 4096
 } | ConvertTo-Json -Depth 5
 curl -Method POST http://localhost:3000/v1/chat/completions -ContentType "application/json" -Body $body
 
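The escaped JSON in the CMD example is error-prone to retype; the same request reads more cleanly in Python. A sketch using only the standard library, assuming the usual OpenAI-style response shape (choices[0].message.content):

```python
# Non-streaming chat completion against the local server, mirroring the
# Windows CMD/PowerShell examples above.
import json
import urllib.request

payload = {
    "model": "qwen-local",
    "messages": [{"role": "user", "content": "Hello Qwen3!"}],
    "max_tokens": 4096,  # same value the updated examples use
}
req = urllib.request.Request(
    "http://localhost:3000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
print(body["choices"][0]["message"]["content"])
```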
@@ -452,7 +462,7 @@ After deploy:
 - Inference:
   curl -X POST https://YOUR-SERVICE.onrender.com/v1/chat/completions \
     -H "Content-Type: application/json" \
-    -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}],\"max_tokens\":
+    -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}],\"max_tokens\":4096}"
 
 ## Deploy on Hugging Face Spaces
 
@@ -506,10 +516,11 @@ D) Speed up cold starts and caching
 
 E) Space endpoints
 - Base URL: https://huggingface.co/spaces/YOUR_USERNAME/my-qwen3-vl-server (Spaces proxy to your container)
--
--
--
--
+- Swagger UI: GET /docs (interactive API with examples)
+- Health: GET /health (implemented by [Python.function health](main.py:951))
+- OpenAPI YAML: GET /openapi.yaml (implemented by [Python.openapi_yaml](main.py:943))
+- Chat Completions: POST /v1/chat/completions (non-stream + SSE) [Python.function chat_completions](main.py:971)
+- Cancel: POST /v1/cancel/{session_id} [Python.function cancel_session](main.py:1191)
 
 F) Quick test after the Space is “Running”
 - Health:
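The cancel route is the custom extension named above, so its exact contract lives in main.py rather than in the OpenAI spec. A hypothetical invocation; the session_id value is a placeholder, and how a client obtains one is not shown in this excerpt:

```python
# Hypothetical cancel request; "sess-123" is a placeholder session id,
# not a value this README excerpt defines.
import urllib.request

session_id = "sess-123"
req = urllib.request.Request(
    f"https://YOUR-SPACE-Subdomain.hf.space/v1/cancel/{session_id}",
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read())
```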
@@ -517,7 +528,7 @@ F) Quick test after the Space is “Running”
 - Non-stream:
   curl -s -X POST https://YOUR-SPACE-Subdomain.hf.space/v1/chat/completions \
     -H "Content-Type: application/json" \
-    -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello from HF Spaces!\"}],\"max_tokens\":
+    -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello from HF Spaces!\"}],\"max_tokens\":4096}"
 - Streaming:
   curl -N -H "Content-Type: application/json" \
     -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: 17*23?\"}],\"stream\":true}" \
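The streaming variant returns Server-Sent Events. A consumer sketch, assuming OpenAI-style `data: {...}` chunks terminated by `data: [DONE]`:

```python
# Consume the SSE stream from /v1/chat/completions and print tokens as
# they arrive. Assumes OpenAI-style "data: ..." lines ending in [DONE].
import json
import urllib.request

req = urllib.request.Request(
    "https://YOUR-SPACE-Subdomain.hf.space/v1/chat/completions",
    data=json.dumps({
        "messages": [{"role": "user", "content": "Think step by step: 17*23?"}],
        "stream": True,
    }).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for raw in resp:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        delta = json.loads(data)["choices"][0].get("delta", {})
        print(delta.get("content", "") or "", end="", flush=True)
print()
```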
main.py
CHANGED
@@ -71,7 +71,8 @@ from huggingface_hub import snapshot_download, list_repo_files, hf_hub_download,
 PORT = int(os.getenv("PORT", "3000"))
 DEFAULT_MODEL_ID = os.getenv("MODEL_REPO_ID", "Qwen/Qwen3-VL-2B-Thinking")
 HF_TOKEN = os.getenv("HF_TOKEN", "").strip() or None
-
+# Default max tokens: honor the MAX_TOKENS env var, fall back to 4096
+DEFAULT_MAX_TOKENS = int(os.getenv("MAX_TOKENS", "4096"))
 DEFAULT_TEMPERATURE = float(os.getenv("TEMPERATURE", "0.7"))
 MAX_VIDEO_FRAMES = int(os.getenv("MAX_VIDEO_FRAMES", "16"))
 DEVICE_MAP = os.getenv("DEVICE_MAP", "auto")
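This hunk only sets the default; how it interacts with a per-request max_tokens is handled elsewhere in main.py (chat_completions) and is not shown here. A hypothetical helper illustrating the expected precedence, with resolve_max_tokens an illustrative name rather than the repo's actual function:

```python
# Hypothetical helper showing the intended precedence: an explicit
# per-request max_tokens wins, otherwise the env-driven default applies.
import os

DEFAULT_MAX_TOKENS = int(os.getenv("MAX_TOKENS", "4096"))

def resolve_max_tokens(body: dict) -> int:
    value = body.get("max_tokens")
    return int(value) if value is not None else DEFAULT_MAX_TOKENS

assert resolve_max_tokens({}) == DEFAULT_MAX_TOKENS
assert resolve_max_tokens({"max_tokens": 128}) == 128
```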