Sync from GitHub 929e477
README.md
CHANGED
@@ -74,6 +74,12 @@ Health check:
 curl http://localhost:3000/health
 ```
 
+Swagger UI:
+http://localhost:3000/docs
+
+OpenAPI (YAML):
+http://localhost:3000/openapi.yaml
+
 Notes:
 - These are with-model images; the first pull is large. In CI, after "Model downloaded." BuildKit may appear idle while tarring/committing the multi‑GB layer.
 - Host requirements:
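The new Swagger UI and OpenAPI routes are easy to verify once a container is up. A minimal smoke test, assuming the server is listening on localhost:3000 as in the health check above:

```python
# Smoke-test the documentation endpoints added above.
# Assumes the server is already running on http://localhost:3000.
import urllib.request

BASE = "http://localhost:3000"

for path in ("/health", "/docs", "/openapi.yaml"):
    with urllib.request.urlopen(BASE + path) as resp:
        # Print the status code and the first 60 bytes of each response.
        print(path, resp.status, resp.read(60))
```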
@@ -144,6 +150,10 @@ Cancel session API (custom extension)
 
 Endpoints (OpenAI-compatible)
 
+- Swagger UI
+  GET /docs
+- OpenAPI (YAML)
+  GET /openapi.yaml
 - Health
   GET /health
   Example:
@@ -161,13 +171,13 @@ Endpoints (OpenAI-compatible)
 Example (Windows CMD):
 curl -X POST http://localhost:3000/v1/chat/completions ^
   -H "Content-Type: application/json" ^
-  -d "{\"model\":\"qwen-local\",\"messages\":[{\"role\":\"user\",\"content\":\"Describe this image briefly\"}],\"max_tokens\":
+  -d "{\"model\":\"qwen-local\",\"messages\":[{\"role\":\"user\",\"content\":\"Describe this image briefly\"}],\"max_tokens\":4096}"
 
 Example (PowerShell):
 $body = @{
   model = "qwen-local"
   messages = @(@{ role = "user"; content = "Hello Qwen3!" })
-  max_tokens =
+  max_tokens = 4096
 } | ConvertTo-Json -Depth 5
 curl -Method POST http://localhost:3000/v1/chat/completions -ContentType "application/json" -Body $body
 
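The escaped JSON in the CMD example is error-prone to retype; the same request reads more cleanly in Python. A sketch using only the standard library, assuming the usual OpenAI-style response shape (choices[0].message.content):

```python
# Non-streaming chat completion against the local server, mirroring the
# Windows CMD/PowerShell examples above.
import json
import urllib.request

payload = {
    "model": "qwen-local",
    "messages": [{"role": "user", "content": "Hello Qwen3!"}],
    "max_tokens": 4096,  # same value the updated examples use
}
req = urllib.request.Request(
    "http://localhost:3000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
print(body["choices"][0]["message"]["content"])
```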
@@ -452,7 +462,7 @@ After deploy:
 - Inference:
   curl -X POST https://YOUR-SERVICE.onrender.com/v1/chat/completions \
     -H "Content-Type: application/json" \
-    -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}],\"max_tokens\":
+    -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}],\"max_tokens\":4096}"
 
 ## Deploy on Hugging Face Spaces
 
@@ -506,10 +516,11 @@ D) Speed up cold starts and caching
 
 E) Space endpoints
 - Base URL: https://huggingface.co/spaces/YOUR_USERNAME/my-qwen3-vl-server (Spaces proxy to your container)
--
--
--
--
+- Swagger UI: GET /docs (interactive API with examples)
+- Health: GET /health (implemented by [Python.function health](main.py:951))
+- OpenAPI YAML: GET /openapi.yaml (implemented by [Python.openapi_yaml](main.py:943))
+- Chat Completions: POST /v1/chat/completions (non-stream + SSE) [Python.function chat_completions](main.py:971)
+- Cancel: POST /v1/cancel/{session_id} [Python.function cancel_session](main.py:1191)
 
 F) Quick test after the Space is “Running”
 - Health:
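The cancel route is the custom extension named above, so its exact contract lives in main.py rather than in the OpenAI spec. A hypothetical invocation; the session_id value is a placeholder, and how a client obtains one is not shown in this excerpt:

```python
# Hypothetical cancel request; "sess-123" is a placeholder session id,
# not a value this README excerpt defines.
import urllib.request

session_id = "sess-123"
req = urllib.request.Request(
    f"https://YOUR-SPACE-Subdomain.hf.space/v1/cancel/{session_id}",
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read())
```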
@@ -517,7 +528,7 @@ F) Quick test after the Space is “Running”
 - Non-stream:
   curl -s -X POST https://YOUR-SPACE-Subdomain.hf.space/v1/chat/completions \
     -H "Content-Type: application/json" \
-    -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello from HF Spaces!\"}],\"max_tokens\":
+    -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello from HF Spaces!\"}],\"max_tokens\":4096}"
 - Streaming:
   curl -N -H "Content-Type: application/json" \
     -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: 17*23?\"}],\"stream\":true}" \
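The streaming variant returns Server-Sent Events. A consumer sketch, assuming OpenAI-style `data: {...}` chunks terminated by `data: [DONE]`:

```python
# Consume the SSE stream from /v1/chat/completions and print tokens as
# they arrive. Assumes OpenAI-style "data: ..." lines ending in [DONE].
import json
import urllib.request

req = urllib.request.Request(
    "https://YOUR-SPACE-Subdomain.hf.space/v1/chat/completions",
    data=json.dumps({
        "messages": [{"role": "user", "content": "Think step by step: 17*23?"}],
        "stream": True,
    }).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for raw in resp:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        delta = json.loads(data)["choices"][0].get("delta", {})
        print(delta.get("content", "") or "", end="", flush=True)
print()
```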
main.py
CHANGED
@@ -71,7 +71,8 @@ from huggingface_hub import snapshot_download, list_repo_files, hf_hub_download,
 PORT = int(os.getenv("PORT", "3000"))
 DEFAULT_MODEL_ID = os.getenv("MODEL_REPO_ID", "Qwen/Qwen3-VL-2B-Thinking")
 HF_TOKEN = os.getenv("HF_TOKEN", "").strip() or None
-
+# Default max tokens: honor the MAX_TOKENS env var, fall back to 4096
+DEFAULT_MAX_TOKENS = int(os.getenv("MAX_TOKENS", "4096"))
 DEFAULT_TEMPERATURE = float(os.getenv("TEMPERATURE", "0.7"))
 MAX_VIDEO_FRAMES = int(os.getenv("MAX_VIDEO_FRAMES", "16"))
 DEVICE_MAP = os.getenv("DEVICE_MAP", "auto")
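This hunk only sets the default; how it interacts with a per-request max_tokens is handled elsewhere in main.py (chat_completions) and is not shown here. A hypothetical helper illustrating the expected precedence, with resolve_max_tokens an illustrative name rather than the repo's actual function:

```python
# Hypothetical helper showing the intended precedence: an explicit
# per-request max_tokens wins, otherwise the env-driven default applies.
import os

DEFAULT_MAX_TOKENS = int(os.getenv("MAX_TOKENS", "4096"))

def resolve_max_tokens(body: dict) -> int:
    value = body.get("max_tokens")
    return int(value) if value is not None else DEFAULT_MAX_TOKENS

assert resolve_max_tokens({}) == DEFAULT_MAX_TOKENS
assert resolve_max_tokens({"max_tokens": 128}) == 128
```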