metadata
library_name: transformers
datasets:
  - Bingsu/zeroth-korean
  - google/fleurs
language:
  - ko
metrics:
  - cer
  - wer
  - bleu
base_model:
  - microsoft/Phi-4-multimodal-instruct
model-index:
  - name: Phi-4-multimodal-instruct-ko-asr
    results:
      - task:
          type: automatic-speech-recognition
        dataset:
          type: Bingsu/zeroth-korean
          name: zeroth-korean-test
        metrics:
          - type: bleu
            name: zeroth-test-BLEU
            value: 94.837
          - type: cer
            name: zeroth-test-CER
            value: 1.316
          - type: wer
            name: zeroth-test-WER
            value: 2.951
      - task:
          type: automatic-speech-recognition
        dataset:
          type: google/fleurs
          name: fleurs-ko-test
        metrics:
          - type: bleu
            name: fleurs-test-BLEU
            value: 67.659
          - type: cer
            name: fleurs-test-CER
            value: 7.951
          - type: wer
            name: fleurs-test-WER
            value: 18.313
pipeline_tag: automatic-speech-recognition
This model is fine-tuned from microsoft/Phi-4-multimodal-instruct on Bingsu/zeroth-korean and google/fleurs for 5 epochs.
Training ran for 960 steps on these datasets for Korean Automatic Speech Recognition (ASR) on H100.
After that, we continued training on CoVoST2 Dataset / CoVoST2-Ko for Automatic Speech Translation (AST).
The AST fine-tuned model is available here: Phi-4-multimodal-instruct-ko-speech
Evaluation
Evaluation was done on the following datasets:
- ASR (Automatic Speech Recognition): Evaluated with CER (Character Error Rate) on the zeroth-test set (457 samples).
- AST (Automatic Speech Translation): Evaluated with BLEU score on fleurs ko <-> en speech translation results (270 samples).
 
The evaluation script is retrieved from here.
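The referenced evaluation script is not reproduced here. As a rough illustration of what the CER and WER columns measure, the following minimal sketch computes both from a Levenshtein edit distance. The function names are hypothetical, and stripping spaces before computing CER is an assumption; the actual script may normalize text differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of chars or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference token missing in hypothesis
                curr[j - 1] + 1,     # insertion: extra token in hypothesis
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate in percent. Assumption: spaces are ignored."""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)


def wer(reference, hypothesis):
    """Word error rate in percent, splitting on whitespace."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_words, hyp_words) / len(ref_words)


if __name__ == "__main__":
    print(cer("안녕하세요", "안녕하세용"))
    print(wer("오늘 날씨가 좋다", "오늘 날씨 좋다"))
```

Because insertions count as errors and the rate is normalized by the reference length, a hypothesis much longer than its reference can push CER above 100.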
Compared to Phi-4-mm-inst-zeroth-kor and Phi-4-multimodal-finetune-ko-speech, ASR accuracy improves: zeroth CER drops to 1.31, from 7.02 and 1.61 respectively.
| Model | zeroth-CER | zeroth-WER | fleurs-ko_en-BLEU | fleurs-ko_en-cot-BLEU | fleurs-en_ko-BLEU | fleurs-en_ko-cot-BLEU | 
|---|---|---|---|---|---|---|
| original | 198.32 | - | 5.63 | 2.42 | 6.86 | 4.17 | 
| daekeun-ml/Phi-4-multimodal-finetune-ko-speech | 1.61 | 3.54 | 7.67 | 8.38 | 12.31 | 9.69 | 
| seastar105/Phi-4-mm-inst-zeroth-kor | 7.02 | - | 7.07 | 9.19 | 13.08 | 9.35 | 
| ASR finetune (this model) | 1.31 | 2.95 | 7.46 | 6.24 | 12.15 | 8.91 |
| + 1 epoch finetune with CoVoST2-Ko | 3.88 | - | 8.07 | 10.09 | 18.82 | 15.41 |
| AST finetuned model | 1.77 | 2.99 | 8.01 | 9.09 | 17.09 | 11.82 |
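
To put the zeroth-CER column in relative terms, the sketch below computes the relative error reduction of this model against the two fine-tuned baselines. The CER values are copied from the table above; the helper name `relative_reduction` is hypothetical and used only for illustration.

```python
def relative_reduction(baseline, new):
    """Relative reduction of an error rate, in percent of the baseline."""
    return 100.0 * (baseline - new) / baseline


# zeroth-CER values taken from the table above
baseline_cer = {
    "daekeun-ml/Phi-4-multimodal-finetune-ko-speech": 1.61,
    "seastar105/Phi-4-mm-inst-zeroth-kor": 7.02,
}
this_model_cer = 1.31

if __name__ == "__main__":
    for name, base in baseline_cer.items():
        print(f"{name}: {relative_reduction(base, this_model_cer):.1f}% lower CER")
```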