Parakeet TDT 0.6B V3 - FP16 ONNX

FP16 (half-precision) quantized version of the Parakeet TDT 0.6B V3 ONNX model.

Overview

This repository contains FP16-quantized ONNX models for NVIDIA's Parakeet TDT (Token-and-Duration Transducer) 0.6B V3, a multilingual automatic speech recognition (ASR) model.

Key Benefits:

  • 50% smaller: 1.25GB total vs. 2.4GB for the original FP32 models
  • Faster inference: FP16 operations are accelerated on modern GPUs
  • Near-identical accuracy: minimal quality loss from the FP16 conversion
  • Drop-in replacement: compatible with the onnx-asr library via the quantization='fp16' parameter

Model Files

File                           Size   Description
encoder-model.fp16.onnx        1.2GB  FP16 encoder
decoder_joint-model.fp16.onnx  35MB   FP16 decoder and joint network

Note: You'll also need the supporting files from the original repository (a download sketch follows this list):

  • config.json - Model configuration
  • vocab.txt - Vocabulary file
  • nemo128.onnx - Tokenizer model
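
If these files are not already on disk, one way to fetch them is with the huggingface_hub library (a sketch; the repo_id below is a placeholder for the original ONNX repository, not a verified name):

from huggingface_hub import hf_hub_download

# Placeholder repo id -- substitute the original Parakeet ONNX repository
REPO_ID = 'original-org/parakeet-tdt-0.6b-v3-onnx'

for filename in ['config.json', 'vocab.txt', 'nemo128.onnx']:
    hf_hub_download(repo_id=REPO_ID, filename=filename, local_dir='./models/parakeet')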

Installation

pip install onnx-asr

Usage

Basic Usage

import onnx_asr

# Load FP16 model
model = onnx_asr.load_model(
    'nemo-parakeet-tdt-0.6b-v3',
    './models/parakeet',  # Directory containing the FP16 models and supporting files
    quantization='fp16',  # Use FP16 quantized models
    cpu_preprocessing=False
)

# Recognize speech from audio file
text = model.recognize('audio.wav')
print(text)

With NumPy Arrays

import numpy as np

# Placeholder input: 1 second of random noise (16kHz, mono, float32)
audio = np.random.randn(16000).astype(np.float32)

# Recognize
text = model.recognize(audio)
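
The random array above is only a stand-in. To recognize real audio, decode a file into the same format first, for example with the soundfile library (an assumption; any decoder that yields 16kHz mono float32 works):

import soundfile as sf

# Decode to float32 samples; the model expects 16kHz mono input
audio, sample_rate = sf.read('audio.wav', dtype='float32')
assert sample_rate == 16000, 'resample to 16kHz before recognition'

text = model.recognize(audio)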

GPU Acceleration

FP16 models work best with GPU acceleration:

model = onnx_asr.load_model(
    'nemo-parakeet-tdt-0.6b-v3',
    './models/parakeet',
    quantization='fp16',
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider'],  # GPU first
    cpu_preprocessing=False
)
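
Whether CUDA is actually used depends on having the onnxruntime-gpu package installed. A quick way to check which execution providers your build offers (standard onnxruntime API):

import onnxruntime as ort

# Lists providers compiled into the installed onnxruntime build;
# 'CUDAExecutionProvider' appears only with onnxruntime-gpu
print(ort.get_available_providers())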

How It Was Created

This FP16 model was created using a two-step process:

Step 1: FP32 β†’ FP16 Conversion

from onnxconverter_common import float16
import onnx

model = onnx.load('encoder-model.onnx')
model_fp16 = float16.convert_float_to_float16(
    model,
    keep_io_types=True,           # Keep inputs/outputs as FP32
    disable_shape_infer=True      # Skip shape inference, which fails on large models with external data
)
onnx.save(model_fp16, 'encoder-model.fp16.onnx')

Step 2: Fix Cast Operations

The initial conversion leaves some Cast operations targeting FP32, causing type mismatches. A post-processing script fixes these by converting internal Cast(to=FLOAT) operations to Cast(to=FLOAT16) while preserving output casts for compatibility.
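
A minimal sketch of that fix, assuming the behavior described above (the actual script may differ): it rewrites the `to` attribute of internal Cast nodes from FLOAT to FLOAT16, skipping any cast that produces a graph output.

import onnx
from onnx import TensorProto

model = onnx.load('encoder-model.fp16.onnx')
graph = model.graph
graph_outputs = {o.name for o in graph.output}

for node in graph.node:
    if node.op_type != 'Cast':
        continue
    # Keep casts that feed graph outputs so the model still emits FP32
    if any(out in graph_outputs for out in node.output):
        continue
    for attr in node.attribute:
        if attr.name == 'to' and attr.i == TensorProto.FLOAT:
            attr.i = TensorProto.FLOAT16

onnx.save(model, 'encoder-model.fp16.onnx')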

See the accompanying conversion scripts for the full details.

Supported Languages

Supports 25 languages, the same as the original model:

  • English, Spanish, French, German, Italian, Portuguese
  • Russian, Polish, Ukrainian, Czech, Slovak
  • Chinese (Mandarin), Japanese, Korean
  • Arabic, Hebrew, Turkish
  • Dutch, Swedish, Danish, Norwegian, Finnish
  • And more...

License

This model is licensed under CC-BY-4.0 (Creative Commons Attribution 4.0), the same as the original Parakeet model.

See the original model's Hugging Face repository for details.

Citation

If you use this model, please cite both the original Parakeet model and the ONNX conversion.
