added technical report
Browse files- .gitattributes +1 -0
- README.md +31 -42
- phi_4_mm.tech_report.02252025.pdf +3 -0
    	
        .gitattributes
    CHANGED
    
    | @@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text | |
| 34 | 
             
            *.zst filter=lfs diff=lfs merge=lfs -text
         | 
| 35 | 
             
            *tfevents* filter=lfs diff=lfs merge=lfs -text
         | 
| 36 | 
             
            tokenizer.json filter=lfs diff=lfs merge=lfs -text
         | 
|  | 
|  | |
| 34 | 
             
            *.zst filter=lfs diff=lfs merge=lfs -text
         | 
| 35 | 
             
            *tfevents* filter=lfs diff=lfs merge=lfs -text
         | 
| 36 | 
             
            tokenizer.json filter=lfs diff=lfs merge=lfs -text
         | 
| 37 | 
            +
            *.pdf filter=lfs diff=lfs merge=lfs -text
         | 
    	
        README.md
    CHANGED
    
    | @@ -42,13 +42,13 @@ Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian | |
| 42 | 
             
            - Vision: English
         | 
| 43 | 
             
            - Audio: English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese
         | 
| 44 |  | 
| 45 | 
            -
            💡 [Phi-4-multimodal Portal]() <br>
         | 
| 46 | 
            -
            📰 [Phi-4-multimodal Microsoft Blog]() <br>
         | 
| 47 | 
            -
            📖 [Phi-4-multimodal Technical Report]() <br>
         | 
| 48 | 
            -
            👩‍🍳 [Phi-4-multimodal Cookbook]() <br>
         | 
| 49 | 
             
            🖥️ [Try It](https://aka.ms/try-phi4mm) <br>
         | 
| 50 |  | 
| 51 | 
            -
            **Phi-4**: [[multimodal-instruct](https://huggingface.co/microsoft/Phi- | 
| 52 |  | 
| 53 | 
             
            ## Intended Uses
         | 
| 54 |  | 
| @@ -218,10 +218,14 @@ torch==2.6.0 | |
| 218 | 
             
            transformers==4.48.2
         | 
| 219 | 
             
            accelerate==1.3.0
         | 
| 220 | 
             
            soundfile==0.13.1
         | 
| 221 | 
            -
            pillow== | 
|  | |
|  | |
|  | |
|  | |
| 222 | 
             
            ```
         | 
| 223 |  | 
| 224 | 
            -
            Phi-4-multimodal-instruct is also available in [Azure AI Studio]()
         | 
| 225 |  | 
| 226 | 
             
            ### Tokenizer
         | 
| 227 |  | 
| @@ -324,7 +328,7 @@ If it is a square image, the resolution would be around (8*448 by 8*448). For mu | |
| 324 |  | 
| 325 | 
             
            ### Loading the model locally
         | 
| 326 |  | 
| 327 | 
            -
            After obtaining the Phi-4- | 
| 328 |  | 
| 329 | 
             
            ```python
         | 
| 330 | 
             
            import requests
         | 
| @@ -334,6 +338,8 @@ import io | |
| 334 | 
             
            from PIL import Image
         | 
| 335 | 
             
            import soundfile as sf
         | 
| 336 | 
             
            from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
         | 
|  | |
|  | |
| 337 |  | 
| 338 | 
             
            # Define model path
         | 
| 339 | 
             
            model_path = "microsoft/Phi-4-multimodal-instruct"
         | 
| @@ -380,44 +386,27 @@ print(f'>>> Response\n{response}') | |
| 380 |  | 
| 381 | 
             
            # Part 2: Audio Processing
         | 
| 382 | 
             
            print("\n--- AUDIO PROCESSING ---")
         | 
| 383 | 
            -
            audio_url = "https:// | 
| 384 | 
             
            speech_prompt = "Transcribe the audio to text, and then translate the audio to French. Use <sep> as a separator between the original transcript and the translation."
         | 
| 385 | 
             
            prompt = f'{user_prompt}<|audio_1|>{speech_prompt}{prompt_suffix}{assistant_prompt}'
         | 
| 386 | 
             
            print(f'>>> Prompt\n{prompt}')
         | 
| 387 |  | 
| 388 | 
            -
            #  | 
| 389 | 
            -
             | 
| 390 | 
            -
             | 
| 391 | 
            -
             | 
| 392 | 
            -
             | 
| 393 | 
            -
             | 
| 394 | 
            -
             | 
| 395 | 
            -
                
         | 
| 396 | 
            -
                 | 
| 397 | 
            -
                 | 
| 398 | 
            -
             | 
| 399 | 
            -
             | 
| 400 | 
            -
             | 
| 401 | 
            -
                
         | 
| 402 | 
            -
             | 
| 403 | 
            -
             | 
| 404 | 
            -
                    max_new_tokens=1000,
         | 
| 405 | 
            -
                    generation_config=generation_config,
         | 
| 406 | 
            -
                )
         | 
| 407 | 
            -
                generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
         | 
| 408 | 
            -
                response = processor.batch_decode(
         | 
| 409 | 
            -
                    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
         | 
| 410 | 
            -
                )[0]
         | 
| 411 | 
            -
                print(f'>>> Response\n{response}')
         | 
| 412 | 
            -
                
         | 
| 413 | 
            -
                # Clean up
         | 
| 414 | 
            -
                try:
         | 
| 415 | 
            -
                    os.remove(temp_audio_path)
         | 
| 416 | 
            -
                    print(f"Temporary file {temp_audio_path} removed successfully")
         | 
| 417 | 
            -
                except Exception as e:
         | 
| 418 | 
            -
                    print(f"Error removing temporary file: {e}")
         | 
| 419 | 
            -
            else:
         | 
| 420 | 
            -
                print(f"Failed to download audio file: {audio_response.status_code}")
         | 
| 421 | 
             
            ```
         | 
| 422 |  | 
| 423 | 
             
            ## Responsible AI Considerations
         | 
|  | |
| 42 | 
             
            - Vision: English
         | 
| 43 | 
             
            - Audio: English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese
         | 
| 44 |  | 
| 45 | 
            +
            💡 [Phi-4-multimodal Portal](https://aka.ms/phi-4-multimodal/azure) <br>
         | 
| 46 | 
            +
            📰 [Phi-4-multimodal Microsoft Blog](https://aka.ms/phi4techblog-feb2025) <br>
         | 
| 47 | 
            +
            📖 [Phi-4-multimodal Technical Report](https://aka.ms/phi-4-multimodal/techreport) <br>
         | 
| 48 | 
            +
            👩‍🍳 [Phi-4-multimodal Cookbook](https://github.com/microsoft/PhiCookBook) <br>
         | 
| 49 | 
             
            🖥️ [Try It](https://aka.ms/try-phi4mm) <br>
         | 
| 50 |  | 
| 51 | 
            +
            **Phi-4**: [[multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) | [onnx](https://huggingface.co/microsoft/Phi-4-multimodal-instruct)]; [[mini-instruct]](https://huggingface.co/microsoft/Phi-4-mini-instruct);
         | 
| 52 |  | 
| 53 | 
             
            ## Intended Uses
         | 
| 54 |  | 
|  | |
| 218 | 
             
            transformers==4.48.2
         | 
| 219 | 
             
            accelerate==1.3.0
         | 
| 220 | 
             
            soundfile==0.13.1
         | 
| 221 | 
            +
            pillow==11.1.0
         | 
| 222 | 
            +
            scipy==1.15.2
         | 
| 223 | 
            +
            torchvision==0.21.0
         | 
| 224 | 
            +
            backoff==2.2.1
         | 
| 225 | 
            +
            peft==0.13.2
         | 
| 226 | 
             
            ```
         | 
| 227 |  | 
| 228 | 
            +
            Phi-4-multimodal-instruct is also available in [Azure AI Studio](https://aka.ms/phi-4-multimodal/azure)
         | 
| 229 |  | 
| 230 | 
             
            ### Tokenizer
         | 
| 231 |  | 
|  | |
| 328 |  | 
| 329 | 
             
            ### Loading the model locally
         | 
| 330 |  | 
| 331 | 
            +
            After obtaining the Phi-4-multimodal-instruct model checkpoints, users can use this sample code for inference.
         | 
| 332 |  | 
| 333 | 
             
            ```python
         | 
| 334 | 
             
            import requests
         | 
|  | |
| 338 | 
             
            from PIL import Image
         | 
| 339 | 
             
            import soundfile as sf
         | 
| 340 | 
             
            from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
         | 
| 341 | 
            +
            from urllib.request import urlopen
         | 
| 342 | 
            +
             | 
| 343 |  | 
| 344 | 
             
            # Define model path
         | 
| 345 | 
             
            model_path = "microsoft/Phi-4-multimodal-instruct"
         | 
|  | |
| 386 |  | 
| 387 | 
             
            # Part 2: Audio Processing
         | 
| 388 | 
             
            print("\n--- AUDIO PROCESSING ---")
         | 
| 389 | 
            +
            audio_url = "https://upload.wikimedia.org/wikipedia/commons/b/b0/Barbara_Sahakian_BBC_Radio4_The_Life_Scientific_29_May_2012_b01j5j24.flac"
         | 
| 390 | 
             
            speech_prompt = "Transcribe the audio to text, and then translate the audio to French. Use <sep> as a separator between the original transcript and the translation."
         | 
| 391 | 
             
            prompt = f'{user_prompt}<|audio_1|>{speech_prompt}{prompt_suffix}{assistant_prompt}'
         | 
| 392 | 
             
            print(f'>>> Prompt\n{prompt}')
         | 
| 393 |  | 
| 394 | 
            +
            # Download and open audio file
         | 
| 395 | 
            +
            audio, samplerate = sf.read(io.BytesIO(urlopen(audio_url).read()))
         | 
| 396 | 
            +
             | 
| 397 | 
            +
            # Process with the model
         | 
| 398 | 
            +
            inputs = processor(text=prompt, audios=[(audio, samplerate)], return_tensors='pt').to('cuda:0')
         | 
| 399 | 
            +
             | 
| 400 | 
            +
            generate_ids = model.generate(
         | 
| 401 | 
            +
                **inputs,
         | 
| 402 | 
            +
                max_new_tokens=1000,
         | 
| 403 | 
            +
                generation_config=generation_config,
         | 
| 404 | 
            +
            )
         | 
| 405 | 
            +
            generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
         | 
| 406 | 
            +
            response = processor.batch_decode(
         | 
| 407 | 
            +
                generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
         | 
| 408 | 
            +
            )[0]
         | 
| 409 | 
            +
            print(f'>>> Response\n{response}')
         | 
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
| 410 | 
             
            ```
         | 
| 411 |  | 
| 412 | 
             
            ## Responsible AI Considerations
         | 
    	
        phi_4_mm.tech_report.02252025.pdf
    ADDED
    
    | @@ -0,0 +1,3 @@ | |
|  | |
|  | |
|  | 
|  | |
| 1 | 
            +
            version https://git-lfs.github.com/spec/v1
         | 
| 2 | 
            +
            oid sha256:a5469d9123cbee2b41729db3217cacfeaa96eaf543868caa2eeec7cf2d24547d
         | 
| 3 | 
            +
            size 5295165
         | 
