KillerKing93 committed
Commit d5b5b09 · verified · 1 parent: e73999d

Sync from GitHub 929e477

Files changed (2):
  1. README.md +19 -8
  2. main.py +2 -1
README.md CHANGED

@@ -74,6 +74,12 @@ Health check:
 curl http://localhost:3000/health
 ```
 
+Swagger UI:
+http://localhost:3000/docs
+
+OpenAPI (YAML):
+http://localhost:3000/openapi.yaml
+
 Notes:
 - These are with-model images; the first pull is large. In CI, after "Model downloaded." BuildKit may appear idle while tarring/committing the multi‑GB layer.
 - Host requirements:
@@ -144,6 +150,10 @@ Cancel session API (custom extension)
 
 Endpoints (OpenAI-compatible)
 
+- Swagger UI
+  GET /docs
+- OpenAPI (YAML)
+  GET /openapi.yaml
 - Health
   GET /health
   Example:
@@ -161,13 +171,13 @@ Endpoints (OpenAI-compatible)
 Example (Windows CMD):
 curl -X POST http://localhost:3000/v1/chat/completions ^
   -H "Content-Type: application/json" ^
-  -d "{\"model\":\"qwen-local\",\"messages\":[{\"role\":\"user\",\"content\":\"Describe this image briefly\"}],\"max_tokens\":128}"
+  -d "{\"model\":\"qwen-local\",\"messages\":[{\"role\":\"user\",\"content\":\"Describe this image briefly\"}],\"max_tokens\":4096}"
 
 Example (PowerShell):
 $body = @{
   model = "qwen-local"
   messages = @(@{ role = "user"; content = "Hello Qwen3!" })
-  max_tokens = 128
+  max_tokens = 4096
 } | ConvertTo-Json -Depth 5
 curl -Method POST http://localhost:3000/v1/chat/completions -ContentType "application/json" -Body $body
 
@@ -452,7 +462,7 @@ After deploy:
 - Inference:
 curl -X POST https://YOUR-SERVICE.onrender.com/v1/chat/completions \
   -H "Content-Type: application/json" \
-  -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}],\"max_tokens\":128}"
+  -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}],\"max_tokens\":4096}"
 
 ## Deploy on Hugging Face Spaces
 
@@ -506,10 +516,11 @@ D) Speed up cold starts and caching
 
 E) Space endpoints
 - Base URL: https://huggingface.co/spaces/YOUR_USERNAME/my-qwen3-vl-server (Spaces proxy to your container)
-- Health: GET /health (implemented by [Python.function health](main.py:871))
-- OpenAPI YAML: GET /openapi.yaml (implemented by [Python.openapi_yaml](main.py:863))
-- Chat Completions: POST /v1/chat/completions (non-stream + SSE) [Python.function chat_completions](main.py:891)
-- Cancel: POST /v1/cancel/{session_id} [Python.function cancel_session](main.py:1091)
+- Swagger UI: GET /docs (interactive API with examples)
+- Health: GET /health (implemented by [Python.function health](main.py:951))
+- OpenAPI YAML: GET /openapi.yaml (implemented by [Python.openapi_yaml](main.py:943))
+- Chat Completions: POST /v1/chat/completions (non-stream + SSE) [Python.function chat_completions](main.py:971)
+- Cancel: POST /v1/cancel/{session_id} [Python.function cancel_session](main.py:1191)
 
 F) Quick test after the Space is “Running”
 - Health:
@@ -517,7 +528,7 @@ F) Quick test after the Space is “Running”
 - Non-stream:
 curl -s -X POST https://YOUR-SPACE-Subdomain.hf.space/v1/chat/completions \
   -H "Content-Type: application/json" \
-  -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello from HF Spaces!\"}],\"max_tokens\":128}"
+  -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello from HF Spaces!\"}],\"max_tokens\":4096}"
 - Streaming:
 curl -N -H "Content-Type: application/json" \
   -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: 17*23?\"}],\"stream\":true}" \
main.py CHANGED

@@ -71,7 +71,8 @@ from huggingface_hub import snapshot_download, list_repo_files, hf_hub_download,
 PORT = int(os.getenv("PORT", "3000"))
 DEFAULT_MODEL_ID = os.getenv("MODEL_REPO_ID", "Qwen/Qwen3-VL-2B-Thinking")
 HF_TOKEN = os.getenv("HF_TOKEN", "").strip() or None
-DEFAULT_MAX_TOKENS = int(os.getenv("MAX_TOKENS", "256"))
+# Default max tokens: honor env, fallback to 4096 as previously discussed
+DEFAULT_MAX_TOKENS = int(os.getenv("MAX_TOKENS", "4096"))
 DEFAULT_TEMPERATURE = float(os.getenv("TEMPERATURE", "0.7"))
 MAX_VIDEO_FRAMES = int(os.getenv("MAX_VIDEO_FRAMES", "16"))
 DEVICE_MAP = os.getenv("DEVICE_MAP", "auto")
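The main.py hunk only changes the DEFAULT_MAX_TOKENS fallback from 256 to 4096; the request-handling code that consumes it is not part of this diff. As a rough illustration of how such an env-driven default is typically applied, here is a hypothetical helper, not code from main.py:

import os

# Same pattern as the diff: the MAX_TOKENS env var wins, otherwise fall back to 4096.
DEFAULT_MAX_TOKENS = int(os.getenv("MAX_TOKENS", "4096"))

def resolve_max_tokens(requested: int | None) -> int:
    """Hypothetical helper: use the per-request value when given, else the default."""
    return DEFAULT_MAX_TOKENS if requested is None else int(requested)

print(resolve_max_tokens(None))  # 4096 unless MAX_TOKENS is set in the environment
print(resolve_max_tokens(128))   # 128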