vLLM support

#1
by ccdv - opened

Hey
Are embeddings & rerankers compatible with vLLM?

You can use them with sglang and infinity

Unfortunately, it fails to load in SGLang:

docker run --gpus all \
    --restart always \
    --name qwemb-server \
    --shm-size 16g \
    -p 30000:30000 \
    -v hf_cache:/root/.cache/huggingface \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path Qwen/Qwen3-Embedding-0.6B --host 0.0.0.0 --port 30000 --is-embedding
...
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
[2025-06-06 06:16:33] Parameter embed_tokens.weight not found in params_dict
[2025-06-06 06:16:33] Parameter layers.0.input_layernorm.weight not found in params_dict
[2025-06-06 06:16:33] Parameter layers.0.mlp.down_proj.weight not found in params_dict
[2025-06-06 06:16:33] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2297, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, pp_rank, dp_rank)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 277, in __init__
    self.tp_worker = TpWorkerClass(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 78, in __init__
    self.model_runner = ModelRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 231, in __init__
    self.initialize(min_per_gpu_memory)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 271, in initialize
    self.load_model()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 534, in load_model
    self.model = get_model(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/__init__.py", line 22, in get_model
    return loader.load_model(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 381, in load_model
    self.load_weights_and_postprocess(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 389, in load_weights_and_postprocess
    model.load_weights(weights)
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen3.py", line 344, in load_weights
    param = params_dict[name]
KeyError: 'layers.0.mlp.gate_up_proj.weight'

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]

[2025-06-06 06:16:33] Received sigquit from a child process. It usually means the child failed.
/usr/lib/python3.10/multiprocessing/resource_tracker.py:104: UserWarning: resource_tracker: process died unexpectedly, relaunching.  Some resources might leak.
  warnings.warn('resource_tracker: process died unexpectedly, '
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
    cache[rtype].remove(name)
KeyError: '/mp-v5vgg6aq'

https://huggingface.co/woodx/Qwen3-Embedding-0.6B-SGLang
Try this one! You need to rename the original model's parameters by adding the "model." prefix.
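For reference, here is a minimal sketch (not the uploader's actual conversion script) of how that rename could be done for a single-shard checkpoint; the local paths are assumptions, and config.json plus the tokenizer files would still need to be copied alongside the renamed weights.

import os
from safetensors.torch import load_file, save_file

src = "Qwen3-Embedding-0.6B/model.safetensors"         # original checkpoint (assumed path)
dst = "Qwen3-Embedding-0.6B-SGLang/model.safetensors"  # renamed checkpoint (assumed path)

state_dict = load_file(src)
# Prepend "model." so keys like "layers.0.mlp.gate_up_proj.weight" become
# "model.layers.0.mlp.gate_up_proj.weight", matching what SGLang's Qwen3 loader expects.
renamed = {f"model.{name}": tensor for name, tensor in state_dict.items()}

os.makedirs(os.path.dirname(dst), exist_ok=True)
save_file(renamed, dst, metadata={"format": "pt"})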

vLLM supported this model on day one (supporting it with vLLM doesn't even require changing any code, LOL).

PTAL
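As a hedged sketch of what "works out of the box" looks like, assuming a recent vLLM release where task="embed" and LLM.embed() are available:

from vllm import LLM

# Load the embedding model with vLLM's offline pooling API.
llm = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")

outputs = llm.embed([
    "What is the capital of China?",
    "Explain gravity in one sentence.",
])
for out in outputs:
    print(len(out.outputs.embedding))  # embedding dimension per input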

After merging
https://huggingface.co/Qwen/Qwen3-Embedding-0.6B/discussions/2
and embeddings-benchmark/mteb#2769 (comment),
Qwen3-Embedding already produces results close to SentenceTransformers.
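A rough sanity check (my own sketch, not taken from the linked PRs) for that claim: compare the two backends on the same text and expect a cosine similarity near 1.0.

import numpy as np
from sentence_transformers import SentenceTransformer
from vllm import LLM

text = "The quick brown fox jumps over the lazy dog."

# SentenceTransformers embedding, L2-normalized.
st_emb = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B").encode(
    [text], normalize_embeddings=True
)[0]

# vLLM embedding, normalized the same way.
vllm_emb = np.array(
    LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed").embed([text])[0].outputs.embedding
)
vllm_emb /= np.linalg.norm(vllm_emb)

print("cosine similarity:", float(np.dot(st_emb, vllm_emb)))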


Please close this discussion (ID: 1); it was opened on the first day of the Qwen/Qwen3-Embedding-0.6B launch and is now very outdated.
