How do I serve a fine-tuned checkpoint on vLLM when the LoRA adapters are not saved separately?
vLLM does not seem to load the LoRA weights, since they are now merged into the model.
I'm hitting the same issue. This is how I load the model:
from vllm import LLM

finetuned = "checkpoints/phi4_audio_reasoning"
llm = LLM(
    model=finetuned,
    trust_remote_code=True,
    gpu_memory_utilization=0.9,
    limit_mm_per_prompt={"audio": 5},
    max_model_len=2048,
    seed=42,
)
Loading the same checkpoint with transformers works perfectly:
import torch
from transformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    finetuned,
    device_map="auto",
    trust_remote_code=True,
    _attn_implementation="sdpa",
    torch_dtype=torch.bfloat16,
)
Any solutions?
I just figured out how to load it correctly with vLLM. The roundabout way is to save only the speech LoRA, then load the base model and request the LoRA module at inference time.
# Instead of doing this (which saves the entire merged model):
# model.set_lora_adapter('speech')
# model.save_pretrained(lora_dir)
# we do this:
model.load_adapter(model_name_or_path, adapter_name="speech", adapter_kwargs={"subfolder": "speech-lora"})
model.set_adapter("speech")
model.save_pretrained(lora_dir)      # saves only the LoRA module
processor.save_pretrained(lora_dir)  # this can introduce some bugs
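As a quick sanity check (my own addition, not from the original post), you can list lora_dir afterwards; it should contain only the adapter and processor files, not multi-GB model shards:
import os

# If the save worked as intended, expect files like adapter_config.json and
# adapter_model.safetensors rather than full model weight shards.
print(sorted(os.listdir(lora_dir)))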
After this I still hit some configuration bugs, so I downloaded the original speech LoRA to a local dir and copied the fine-tuned weights into that folder:
import os
import shutil
from huggingface_hub import snapshot_download

model_path = snapshot_download("microsoft/Phi-4-multimodal-instruct")
speech_lora_path = os.path.join(model_path, "speech-lora")
# Overwrite the stock adapter weights with the fine-tuned ones
shutil.copy("/checkpoints/speech_lora/adapter_model.safetensors", speech_lora_path)
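It can also be worth confirming the adapter's LoRA rank (an optional check I added, not from the original post), since vLLM's max_lora_rank below must be at least that value:
import json

# The PEFT adapter config stores the rank under "r" (320 for the speech LoRA here).
with open(os.path.join(speech_lora_path, "adapter_config.json")) as f:
    print(json.load(f)["r"])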
Now I can make a LoRA request in vLLM:
from vllm import LLM
from vllm.lora.request import LoRARequest

model = "microsoft/Phi-4-multimodal-instruct"
llm = LLM(
    model=model,
    trust_remote_code=True,
    gpu_memory_utilization=0.9,
    limit_mm_per_prompt={"audio": 5},
    max_model_len=2048,
    seed=42,
    enable_prefix_caching=True,
    max_loras=1,
    enable_lora=True,
    max_lora_rank=320,
)
lora_request = LoRARequest("speech_lora", 1, "/checkpoints/speech_lora")
predictions = llm.generate(
    inputs,
    sampling_params=sampling_params,
    use_tqdm=False,
    lora_request=[lora_request] * len(inputs),
)
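In case it helps, here is a rough sketch of how inputs and sampling_params could be built for a single audio prompt. The prompt template follows the Phi-4-multimodal chat format; the audio file path, librosa loading, and sampling settings are my own assumptions, not from the original post:
import librosa
from vllm import SamplingParams

# Load an audio clip (hypothetical path); vLLM takes (waveform, sample_rate) tuples.
audio, sr = librosa.load("sample.wav", sr=None)

# <|audio_1|> marks where the first audio clip goes in the Phi-4-multimodal prompt.
prompt = "<|user|><|audio_1|>Transcribe the audio clip into text.<|end|><|assistant|>"
inputs = [{
    "prompt": prompt,
    "multi_modal_data": {"audio": [(audio, sr)]},
}]
sampling_params = SamplingParams(temperature=0.0, max_tokens=512)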
Thanks @binhquoc, so you mean we should do something like:
model = AutoModelForCausalLM.from_pretrained(
    finetuned_model_path,  # checkpoint with merged weights
    device_map="auto",
    trust_remote_code=True,
    _attn_implementation="sdpa",
    torch_dtype=torch.bfloat16,
)
then do this to separate out the LoRA weights:
model.load_adapter(finetuned_model_path, adapter_name="speech", adapter_kwargs={"subfolder": "speech-lora"})
model.set_adapter("speech")
model.save_pretrained(lora_dir)      # saves only the LoRA module
processor.save_pretrained(lora_dir)  # this can cause some bugs
and then do the vLLM call as above?
