Parakeet TDT 0.6B V3 - FP16 ONNX
FP16 (half-precision) quantized version of the Parakeet TDT 0.6B V3 ONNX model.
Overview
This repository contains FP16-quantized ONNX models for NVIDIA's Parakeet TDT (Token-and-Duration Transducer) 0.6B V3, a multilingual automatic speech recognition (ASR) model.
Key Benefits:
- About half the size: 1.25GB total vs 2.4GB original
- Faster inference: FP16 operations accelerated on modern GPUs
- Near-identical accuracy: minimal quality loss from the FP16 conversion
- Drop-in replacement: compatible with the onnx-asr library via the quantization='fp16' parameter
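To give a feel for the rounding that FP16 storage introduces, the sketch below round-trips a few values through half precision using only Python's standard `struct` module. The helper name `to_fp16` is illustrative and not part of onnx-asr. It shows per-value precision only and says nothing about end-to-end recognition accuracy.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value."""
    # 'e' is the struct format code for a 16-bit float.
    return struct.unpack('<e', struct.pack('<e', x))[0]

for value in (1.0, 0.1, 3.14159, 1234.567):
    rounded = to_fp16(value)
    rel_err = abs(rounded - value) / abs(value)
    print(f"{value!r:>10} -> {rounded!r:<18} relative error {rel_err:.2e}")
```

Half precision keeps an 11-bit significand, so for values in its normal range the relative rounding error stays at or below 2^-11 (about 4.9e-4).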
Model Files
| File | Size | Description |
|---|---|---|
| encoder-model.fp16.onnx | 1.2GB | FP16 encoder model |
| decoder_joint-model.fp16.onnx | 35MB | FP16 decoder model |
Note: You'll also need the supporting files from the original repository:
- config.json - Model configuration
- vocab.txt - Vocabulary file
- nemo128.onnx - Tokenizer model
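Before loading, it may help to confirm that the model directory holds both the FP16 files and the supporting files. The following is a minimal sketch, assuming everything lives in one flat directory. The function name `missing_model_files` is hypothetical, and the file names are the ones listed in this README.

```python
from pathlib import Path

# File names taken from the table and note in this README.
FP16_FILES = ("encoder-model.fp16.onnx", "decoder_joint-model.fp16.onnx")
SUPPORT_FILES = ("config.json", "vocab.txt", "nemo128.onnx")

def missing_model_files(model_dir):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in FP16_FILES + SUPPORT_FILES
            if not (root / name).is_file()]

if __name__ == "__main__":
    missing = missing_model_files("./models/parakeet")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected files are present.")
```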
Installation
pip install onnx-asr
Usage
Basic Usage
import onnx_asr
# Load FP16 model
model = onnx_asr.load_model(
    'nemo-parakeet-tdt-0.6b-v3',
    './models/parakeet',  # Directory containing both FP32 and FP16 files
    quantization='fp16',  # Use FP16 quantized models
    cpu_preprocessing=False
)
# Recognize speech from audio file
text = model.recognize('audio.wav')
print(text)
With NumPy Arrays
import numpy as np
# Placeholder input: one second of random noise in the expected format (16kHz, mono, float32)
audio = np.random.randn(16000).astype(np.float32)
# Recognize
text = model.recognize(audio)
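For real audio rather than random noise, the samples first have to be decoded into mono floats. The sketch below reads a 16-bit PCM WAV file with the standard library only. The helper name `read_pcm16_mono` is hypothetical. It assumes 16-bit PCM input, downmixes by averaging channels, and does not resample.

```python
import array
import sys
import wave

def read_pcm16_mono(path):
    """Read a 16-bit PCM WAV file; return (samples in [-1, 1), sample rate)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = wav.getnchannels()
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    pcm = array.array("h", raw)   # signed 16-bit integers
    if sys.byteorder == "big":
        pcm.byteswap()            # WAV sample data is little-endian
    # Average interleaved channels down to mono, then scale to floats.
    mono = [sum(pcm[i:i + channels]) / channels
            for i in range(0, len(pcm), channels)]
    return [s / 32768.0 for s in mono], rate
```

The returned list can be converted with `np.asarray(samples, dtype=np.float32)` and passed to `model.recognize`. If the reported sample rate is not 16000, the audio would need resampling first.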
GPU Acceleration
FP16 models work best with GPU acceleration:
model = onnx_asr.load_model(
    'nemo-parakeet-tdt-0.6b-v3',
    './models/parakeet',
    quantization='fp16',
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider'],  # GPU first
    cpu_preprocessing=False
)
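On machines that may or may not have a CUDA-enabled ONNX Runtime build, the provider list can be computed up front instead of hard-coded. This is a small sketch of that idea. `pick_providers` and `PREFERRED` are illustrative names, not part of onnx-asr or ONNX Runtime.

```python
PREFERRED = ("CUDAExecutionProvider", "CPUExecutionProvider")

def pick_providers(available, preferred=PREFERRED):
    """Keep the preferred providers that are actually available, in order."""
    chosen = [name for name in preferred if name in available]
    if not chosen:
        raise RuntimeError(f"none of {preferred} found in {list(available)}")
    return chosen

# With ONNX Runtime installed, the available list would come from:
#   import onnxruntime as ort
#   providers = pick_providers(ort.get_available_providers())
print(pick_providers(["CPUExecutionProvider"]))
```

The result can then be passed as the `providers` argument shown above.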
How It Was Created
This FP16 model was created using a two-step process:
Step 1: FP32 → FP16 Conversion
from onnxconverter_common import float16
import onnx
model = onnx.load('encoder-model.onnx')
model_fp16 = float16.convert_float_to_float16(
    model,
    keep_io_types=True,  # Keep inputs/outputs as FP32
    disable_shape_infer=True  # Preserve external data
)
onnx.save(model_fp16, 'encoder-model.fp16.onnx')
Step 2: Fix Cast Operations
The initial conversion leaves some Cast operations targeting FP32, causing type mismatches. A post-processing script fixes these by converting internal Cast(to=FLOAT) operations to Cast(to=FLOAT16) while preserving output casts for compatibility.
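The idea behind such a fix can be sketched as follows. This is not the repository's actual script. It assumes that "output casts" means Cast nodes whose outputs are graph outputs, it does not descend into subgraphs, and the function name `retarget_internal_casts` is hypothetical. The function is written against the attribute layout of `onnx.GraphProto`, so it needs no imports of its own.

```python
FLOAT, FLOAT16 = 1, 10  # values of onnx.TensorProto.FLOAT / FLOAT16

def retarget_internal_casts(graph):
    """Switch Cast(to=FLOAT) nodes to FLOAT16 unless they feed a graph output.

    Works on any object shaped like onnx.GraphProto (.node, .output).
    Returns the number of Cast nodes changed.
    """
    graph_outputs = {out.name for out in graph.output}
    changed = 0
    for node in graph.node:
        if node.op_type != "Cast":
            continue
        if any(name in graph_outputs for name in node.output):
            continue  # output cast: keep FP32 so the model's I/O types stay the same
        for attr in node.attribute:
            if attr.name == "to" and attr.i == FLOAT:
                attr.i = FLOAT16
                changed += 1
    return changed
```

With the `onnx` package installed, this would be applied as `retarget_internal_casts(model.graph)` on a model returned by `onnx.load`, followed by `onnx.save`.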
See the conversion scripts:
Supported Languages
Supports 25 European languages (same as the original model):
- Bulgarian, Croatian, Czech, Danish, Dutch, English
- Estonian, Finnish, French, German, Greek, Hungarian
- Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese
- Romanian, Russian, Slovak, Slovenian, Spanish, Swedish
- Ukrainian
License
This model is licensed under CC-BY-4.0 (Creative Commons Attribution 4.0), same as the original Parakeet model.
See the Hugging Face repository for details.
Citation
If you use this model, please cite both the original Parakeet model and the ONNX conversion.
Credits
- Original Model: NVIDIA Parakeet TDT 0.6B V3
- ONNX Conversion: Igor Stupakov
- FP16 Quantization: this repository
Related Links
Support
For issues or questions:
- Original model questions: See nvidia/parakeet-tdt-0.6b-v3
- onnx-asr library: See onnx-asr documentation