vLLM support
Hey
Are embeddings & rerankers compatible with vLLM?
You can use them with sglang and infinity
Unfortunately, it fails to load in SGLang:
```shell
docker run --gpus all \
  --restart always \
  --name qwemb-server \
  --shm-size 16g \
  -p 30000:30000 \
  -v hf_cache:/root/.cache/huggingface \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server --model-path Qwen/Qwen3-Embedding-0.6B --host 0.0.0.0 --port 30000 --is-embedding
```
```
...
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
[2025-06-06 06:16:33] Parameter embed_tokens.weight not found in params_dict
[2025-06-06 06:16:33] Parameter layers.0.input_layernorm.weight not found in params_dict
[2025-06-06 06:16:33] Parameter layers.0.mlp.down_proj.weight not found in params_dict
[2025-06-06 06:16:33] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2297, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, pp_rank, dp_rank)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 277, in __init__
    self.tp_worker = TpWorkerClass(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 78, in __init__
    self.model_runner = ModelRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 231, in __init__
    self.initialize(min_per_gpu_memory)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 271, in initialize
    self.load_model()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 534, in load_model
    self.model = get_model(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/__init__.py", line 22, in get_model
    return loader.load_model(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 381, in load_model
    self.load_weights_and_postprocess(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 389, in load_weights_and_postprocess
    model.load_weights(weights)
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen3.py", line 344, in load_weights
    param = params_dict[name]
KeyError: 'layers.0.mlp.gate_up_proj.weight'

Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
[2025-06-06 06:16:33] Received sigquit from a child process. It usually means the child failed.
/usr/lib/python3.10/multiprocessing/resource_tracker.py:104: UserWarning: resource_tracker: process died unexpectedly, relaunching. Some resources might leak.
  warnings.warn('resource_tracker: process died unexpectedly, '
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
    cache[rtype].remove(name)
KeyError: '/mp-v5vgg6aq'
```
https://huggingface.co/woodx/Qwen3-Embedding-0.6B-SGLang

Try this one! You need to rename the original model's parameters with the new prefix `model.`.
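If I read the fix right, the repacked checkpoint simply prepends `model.` to every parameter name so that SGLang's Qwen3 loader finds keys like `model.layers.0.mlp.gate_up_proj.weight` in `params_dict`. A rough sketch of that rename on a plain dict (`add_model_prefix` is a hypothetical helper; a real checkpoint would be round-tripped with `safetensors.torch.load_file`/`save_file`):

```python
def add_model_prefix(state_dict, prefix="model."):
    """Return a copy of state_dict with every parameter name prefixed.

    Names that already carry the prefix are left untouched, so the
    rename is safe to run twice.
    """
    return {
        (key if key.startswith(prefix) else prefix + key): value
        for key, value in state_dict.items()
    }

# The keys below mirror the missing names from the SGLang error log.
weights = {
    "embed_tokens.weight": 0,
    "layers.0.input_layernorm.weight": 1,
    "layers.0.mlp.gate_up_proj.weight": 2,
}
renamed = add_model_prefix(weights)
# "layers.0.mlp.gate_up_proj.weight" -> "model.layers.0.mlp.gate_up_proj.weight"
```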
vLLM supported this model on day one (supporting it with vLLM didn't even require changing any code, LOL).
PTAL:
- https://github.com/vllm-project/vllm/pull/19260
- https://huggingface.co/Qwen/Qwen3-Embedding-0.6B/discussions/2#68425b20bff553c9ed67d4da
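For anyone landing here: once either server has the model loaded, it exposes an OpenAI-compatible `/v1/embeddings` route (vLLM via `vllm serve`, SGLang via the launch command above). A minimal client sketch; the port and the `build_payload`/`embed` names are just assumptions for this example:

```python
import json
import urllib.request

BASE_URL = "http://localhost:30000"  # assumed port; match your server

def build_payload(texts, model="Qwen/Qwen3-Embedding-0.6B"):
    """Build an OpenAI-style embeddings request body."""
    return {"model": model, "input": texts}

def embed(texts):
    """POST to /v1/embeddings and return one vector per input text."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/embeddings",
        data=json.dumps(build_payload(texts)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return [item["embedding"] for item in data["data"]]
```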
After merging https://huggingface.co/Qwen/Qwen3-Embedding-0.6B/discussions/2 (see embeddings-benchmark/mteb#2769 (comment)), Qwen3-Embedding already produces results close to SentenceTransformers.
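The cross-check behind that claim can be sketched roughly like this: embed the same text with vLLM and with sentence-transformers, then compare the two vectors by cosine similarity. This assumes a GPU and a vLLM build that includes the PR above; `compare` is a hypothetical helper, and the exact similarity you get is not from this thread.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def compare(text="What is the capital of France?"):
    """Embed `text` with both stacks and return their cosine similarity."""
    from vllm import LLM
    from sentence_transformers import SentenceTransformer

    vllm_vec = (
        LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")
        .embed([text])[0]
        .outputs.embedding
    )
    st_vec = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B").encode(text).tolist()
    return cosine(vllm_vec, st_vec)  # close to 1.0 means the outputs agree
```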
Please close this discussion (ID: 1); it dates from the first day of the Qwen/Qwen3-Embedding-0.6B launch and is now very outdated.