How do I serve a fine-tuned checkpoint on vLLM when the LoRA adapters are not saved separately?
vLLM does not seem to load the LoRA weights, since they are now merged into the model.
I'm hitting the same issue. This is how I load the model:
from vllm import LLM

finetuned = "checkpoints/phi4_audio_reasoning"
llm = LLM(
    model=finetuned,
    trust_remote_code=True,
    gpu_memory_utilization=0.9,
    limit_mm_per_prompt={"audio": 5},
    max_model_len=2048,
    seed=42,
)
Loading the same checkpoint with transformers works perfectly:
import torch
from transformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    finetuned,
    device_map="auto",
    trust_remote_code=True,
    _attn_implementation="sdpa",
    torch_dtype=torch.bfloat16,
)
Any solutions?
I just figured out how to load it correctly with vLLM. The roundabout way is to save only the speech LoRA, then load the base model and request the LoRA module at inference time.
# Instead of doing this (which saves the entire merged model):
# model.set_lora_adapter('speech')
# model.save_pretrained(lora_dir)
# we do this:
model.load_adapter(model_name_or_path, adapter_name="speech", adapter_kwargs={"subfolder": "speech-lora"})
model.set_adapter("speech")
model.save_pretrained(lora_dir)      # saves only the LoRA module
processor.save_pretrained(lora_dir)  # this can introduce some bugs
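As a quick sanity check (my own addition, not from the original post), you can list lora_dir afterwards; it should contain only the adapter and processor files, not multi-GB model shards:
import os

# If the save worked as intended, expect files like adapter_config.json and
# adapter_model.safetensors rather than full model weight shards.
print(sorted(os.listdir(lora_dir)))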
After this I still hit some configuration bugs, so I downloaded the original speech LoRA to a local dir and copied the fine-tuned weights into that folder:
import os
import shutil
from huggingface_hub import snapshot_download

model_path = snapshot_download("microsoft/Phi-4-multimodal-instruct")
speech_lora_path = os.path.join(model_path, "speech-lora")
# Overwrite the stock adapter weights with the fine-tuned ones
shutil.copy("/checkpoints/speech_lora/adapter_model.safetensors", speech_lora_path)
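It can also be worth confirming the adapter's LoRA rank (an optional check I added, not from the original post), since vLLM's max_lora_rank below must be at least that value:
import json

# The PEFT adapter config stores the rank under "r" (320 for the speech LoRA here).
with open(os.path.join(speech_lora_path, "adapter_config.json")) as f:
    print(json.load(f)["r"])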
Now I can make a LoRA request in vLLM:
from vllm import LLM
from vllm.lora.request import LoRARequest

model = "microsoft/Phi-4-multimodal-instruct"
llm = LLM(
    model=model,
    trust_remote_code=True,
    gpu_memory_utilization=0.9,
    limit_mm_per_prompt={"audio": 5},
    max_model_len=2048,
    seed=42,
    enable_prefix_caching=True,
    max_loras=1,
    enable_lora=True,
    max_lora_rank=320,
)
lora_request = LoRARequest("speech_lora", 1, "/checkpoints/speech_lora")
predictions = llm.generate(
    inputs,
    sampling_params=sampling_params,
    use_tqdm=False,
    lora_request=[lora_request] * len(inputs),
)
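In case it helps, here is a rough sketch of how inputs and sampling_params could be built for a single audio prompt. The prompt template follows the Phi-4-multimodal chat format; the audio file path, librosa loading, and sampling settings are my own assumptions, not from the original post:
import librosa
from vllm import SamplingParams

# Load an audio clip (hypothetical path); vLLM takes (waveform, sample_rate) tuples.
audio, sr = librosa.load("sample.wav", sr=None)

# <|audio_1|> marks where the first audio clip goes in the Phi-4-multimodal prompt.
prompt = "<|user|><|audio_1|>Transcribe the audio clip into text.<|end|><|assistant|>"
inputs = [{
    "prompt": prompt,
    "multi_modal_data": {"audio": [(audio, sr)]},
}]
sampling_params = SamplingParams(temperature=0.0, max_tokens=512)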
Thanks @binhquoc, so you mean we should do something like:
model = AutoModelForCausalLM.from_pretrained(
    finetuned_model_path,  # checkpoint with merged weights
    device_map="auto",
    trust_remote_code=True,
    _attn_implementation="sdpa",
    torch_dtype=torch.bfloat16,
)
then do this to separate out the LoRA weights:
model.load_adapter(finetuned_model_path, adapter_name="speech", adapter_kwargs={"subfolder": "speech-lora"})
model.set_adapter("speech")
model.save_pretrained(lora_dir)      # saves only the LoRA module
processor.save_pretrained(lora_dir)  # this can cause some bugs
and then do the vLLM call as above?
