vLLM support
Hey
Are embeddings & rerankers compatible with vLLM?
You can use them with sglang and infinity
Unfortunately, it fails to load in SGLang:
```shell
docker run --gpus all \
  --restart always \
  --name qwemb-server \
  --shm-size 16g \
  -p 30000:30000 \
  -v hf_cache:/root/.cache/huggingface \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server --model-path Qwen/Qwen3-Embedding-0.6B --host 0.0.0.0 --port 30000 --is-embedding
```
```
...
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
[2025-06-06 06:16:33] Parameter embed_tokens.weight not found in params_dict
[2025-06-06 06:16:33] Parameter layers.0.input_layernorm.weight not found in params_dict
[2025-06-06 06:16:33] Parameter layers.0.mlp.down_proj.weight not found in params_dict
[2025-06-06 06:16:33] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2297, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, pp_rank, dp_rank)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 277, in __init__
    self.tp_worker = TpWorkerClass(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 78, in __init__
    self.model_runner = ModelRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 231, in __init__
    self.initialize(min_per_gpu_memory)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 271, in initialize
    self.load_model()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 534, in load_model
    self.model = get_model(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/__init__.py", line 22, in get_model
    return loader.load_model(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 381, in load_model
    self.load_weights_and_postprocess(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 389, in load_weights_and_postprocess
    model.load_weights(weights)
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen3.py", line 344, in load_weights
    param = params_dict[name]
KeyError: 'layers.0.mlp.gate_up_proj.weight'

Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
[2025-06-06 06:16:33] Received sigquit from a child process. It usually means the child failed.
/usr/lib/python3.10/multiprocessing/resource_tracker.py:104: UserWarning: resource_tracker: process died unexpectedly, relaunching. Some resources might leak.
  warnings.warn('resource_tracker: process died unexpectedly, '
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
    cache[rtype].remove(name)
KeyError: '/mp-v5vgg6aq'
```
https://huggingface.co/woodx/Qwen3-Embedding-0.6B-SGLang

Try this one! You need to rename the original model's parameters with the new prefix `model.`.
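If I read the fix right, the repacked checkpoint simply prepends `model.` to every parameter name so that SGLang's Qwen3 loader finds keys like `model.layers.0.mlp.gate_up_proj.weight` in `params_dict`. A rough sketch of that rename on a plain dict (`add_model_prefix` is a hypothetical helper; a real checkpoint would be round-tripped with `safetensors.torch.load_file`/`save_file`):

```python
def add_model_prefix(state_dict, prefix="model."):
    """Return a copy of state_dict with every parameter name prefixed.

    Names that already carry the prefix are left untouched, so the
    rename is safe to run twice.
    """
    return {
        (key if key.startswith(prefix) else prefix + key): value
        for key, value in state_dict.items()
    }

# The keys below mirror the missing names from the SGLang error log.
weights = {
    "embed_tokens.weight": 0,
    "layers.0.input_layernorm.weight": 1,
    "layers.0.mlp.gate_up_proj.weight": 2,
}
renamed = add_model_prefix(weights)
# "layers.0.mlp.gate_up_proj.weight" -> "model.layers.0.mlp.gate_up_proj.weight"
```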
vLLM supported this model on day one (supporting it with vLLM didn't even require changing any code, LOL).
PTAL:
- https://github.com/vllm-project/vllm/pull/19260
- https://huggingface.co/Qwen/Qwen3-Embedding-0.6B/discussions/2#68425b20bff553c9ed67d4da
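For anyone landing here: once either server has the model loaded, it exposes an OpenAI-compatible `/v1/embeddings` route (vLLM via `vllm serve`, SGLang via the launch command above). A minimal client sketch; the port and the `build_payload`/`embed` names are just assumptions for this example:

```python
import json
import urllib.request

BASE_URL = "http://localhost:30000"  # assumed port; match your server

def build_payload(texts, model="Qwen/Qwen3-Embedding-0.6B"):
    """Build an OpenAI-style embeddings request body."""
    return {"model": model, "input": texts}

def embed(texts):
    """POST to /v1/embeddings and return one vector per input text."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/embeddings",
        data=json.dumps(build_payload(texts)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return [item["embedding"] for item in data["data"]]
```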
After merging https://huggingface.co/Qwen/Qwen3-Embedding-0.6B/discussions/2 (see embeddings-benchmark/mteb#2769 (comment)), Qwen3-Embedding already produces results close to SentenceTransformers.
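The cross-check behind that claim can be sketched roughly like this: embed the same text with vLLM and with sentence-transformers, then compare the two vectors by cosine similarity. This assumes a GPU and a vLLM build that includes the PR above; `compare` is a hypothetical helper, and the exact similarity you get is not from this thread.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def compare(text="What is the capital of France?"):
    """Embed `text` with both stacks and return their cosine similarity."""
    from vllm import LLM
    from sentence_transformers import SentenceTransformer

    vllm_vec = (
        LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")
        .embed([text])[0]
        .outputs.embedding
    )
    st_vec = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B").encode(text).tolist()
    return cosine(vllm_vec, st_vec)  # close to 1.0 means the outputs agree
```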
Please close this discussion (ID: 1); it dates from the first day of the Qwen/Qwen3-Embedding-0.6B launch and is now very outdated.