# Orpheus 3B – INT4 AWQ Quantized

A streaming-optimized Orpheus 3B quant, meant to run on an A100 with TensorRT-LLM.

## Model Overview

This is a streaming-optimized INT4 AWQ quantized version of `canopylabs/orpheus-3b-0.1-ft`, built to run with TensorRT-LLM.
Key Features:
- Optimized for Production: Built for high-throughput, low-latency TTS serving
- Faster Inference: Up to ~3x faster than FP16 with minimal perceived quality loss
- Memory Efficient: ~4x smaller weights vs. FP16 (INT4)
- Ready for Streaming: Designed for real-time streaming TTS backends
- Calibrated: 48-token inputs and up to 1024-token outputs (roughly 12 seconds of audio per text chunk); see the chunking sketch below
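Because the engine was calibrated for 48-token inputs, long passages are best split into sentence-sized chunks before synthesis. Below is a minimal sketch of such a splitter, assuming the source model's `transformers` tokenizer; the helper name `chunk_text` is illustrative, not part of this repo:

```python
# Illustrative pre-processing: greedily pack sentences into chunks
# that stay within the 48-token calibration window.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("canopylabs/orpheus-3b-0.1-ft")

def chunk_text(text: str, max_tokens: int = 48) -> list[str]:
    chunks, current = [], ""
    for sentence in text.split("."):
        sentence = sentence.strip()
        if not sentence:
            continue
        candidate = (current + " " + sentence + ".").strip()
        # Start a new chunk once adding this sentence would overflow
        if current and len(tokenizer.encode(candidate)) > max_tokens:
            chunks.append(current)
            current = sentence + "."
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```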
## Technical Specifications
| Specification | Details |
|---|---|
| Source Model | canopylabs/orpheus-3b-0.1-ft |
| Quantization Method | AWQ (Activation-aware Weight Quantization) |
| Precision | INT4 weights, INT8 KV cache |
| AWQ Group/Block Size | 128 |
| TensorRT-LLM Version | 1.0.0 |
| Generated | 2025-10-09 |
| Pipeline | TensorRT-LLM AWQ Quantization |
## Artifact Layout

```
trt-llm/
  checkpoints/                             # Quantized TRT-LLM checkpoints (portable)
    *.safetensors
    config.json
  engines/sm80_trt-llm-1.0.0_cuda12.4/     # Built TensorRT-LLM engines (hardware-specific)
    rank*.engine
    build_metadata.json
    build_command.sh
```
## Quick Start

### Download Artifacts (Python)
```python
from huggingface_hub import snapshot_download

# Download quantized checkpoints (portable)
ckpt_path = snapshot_download(
    repo_id="yapwithai/orpheus-3b-trt-int4-awq",
    allow_patterns=["trt-llm/checkpoints/**"],
)

# Or download TensorRT-LLM engines for a specific build label
eng_path = snapshot_download(
    repo_id="yapwithai/orpheus-3b-trt-int4-awq",
    allow_patterns=["trt-llm/engines/sm80_trt-llm-1.0.0_cuda12.4/**"],
)

print("checkpoints:", ckpt_path)
print("engines:", eng_path)
```
### Run a Streaming TTS Server (TensorRT-LLM)

```bash
# Point your server to the downloaded engines
export TRTLLM_ENGINE_DIR=/path/to/trt-llm/engines/sm80_trt-llm-1.0.0_cuda12.4

# Start your TTS server (example: FastAPI + WebSocket)
python -m server.server
```
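To give a sense of how a server drives the engines, here is a minimal streaming sketch using TensorRT-LLM's `ModelRunner`. Exact signatures vary between TRT-LLM releases, so treat this as an outline rather than this repo's actual server code:

```python
# Sketch only: stream tokens from the built engine with ModelRunner.
import os
import torch
from tensorrt_llm.runtime import ModelRunner
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("canopylabs/orpheus-3b-0.1-ft")
runner = ModelRunner.from_dir(engine_dir=os.environ["TRTLLM_ENGINE_DIR"])

input_ids = [torch.tensor(tokenizer.encode("Hello from Orpheus!"),
                          dtype=torch.int32)]

# With streaming=True, generate() yields partial outputs as tokens decode;
# a TTS server would turn each increment into audio right away.
for step_output in runner.generate(batch_input_ids=input_ids,
                                   max_new_tokens=1024,
                                   streaming=True):
    pass  # decode the incremental ids into audio in your server loop
```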
## Quantization Details
- Method: Activation-aware weight quantization (AWQ)
- Calibration size: 256
- AWQ block/group size: 128
- Build dtype: float16
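For reference, a quantization run matching these settings would look roughly like the following, assuming TensorRT-LLM's `examples/quantization/quantize.py` script (flag names can shift between releases):

```bash
# Representative command; adjust paths and flags to your TRT-LLM version
python examples/quantization/quantize.py \
    --model_dir canopylabs/orpheus-3b-0.1-ft \
    --dtype float16 \
    --qformat int4_awq \
    --awq_block_size 128 \
    --kv_cache_dtype int8 \
    --calib_size 256 \
    --output_dir trt-llm/checkpoints
```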
## Configuration Summary

```json
{
  "quantization": {
    "weights_precision": "int4_awq",
    "kv_cache_dtype": "int8",
    "awq_block_size": 128,
    "calib_size": 256
  },
  "build": {
    "dtype": "float16",
    "max_input_len": 48,
    "max_output_len": 1024,
    "max_batch_size": 16,
    "engine_label": "sm80_trt-llm-1.0.0_cuda12.4",
    "tensorrt_llm_version": "1.0.0"
  },
  "environment": {
    "sm_arch": "sm80",
    "gpu_name": "NVIDIA A100 80GB PCIe",
    "cuda_toolkit": "12.4",
    "nvidia_driver": "550.127.05"
  }
}
```
## Use Cases
- Realtime Voice: assistants, product demos, interactive agents
- High-throughput Serving: batch TTS pipelines, APIs
- Edge & Cost-sensitive: limited VRAM environments
## Advanced Configuration (Build-time)

- Max input length: tune `--max_input_len`
- Max output length: tune `--max_seq_len`
- Batch size: tune `--max_batch_size`
- Plugins: `--gpt_attention_plugin`, `--context_fmha`, `--paged_kv_cache`
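Putting those knobs together, a build matching this repo's configuration might look like the sketch below; `--max_seq_len 1072` is simply 48 input plus 1024 output tokens. The repo's `build_command.sh` records the exact invocation used, so prefer it over this outline:

```bash
# Sketch of an engine build mirroring the configuration summary above
trtllm-build \
    --checkpoint_dir trt-llm/checkpoints \
    --output_dir trt-llm/engines/sm80_trt-llm-1.0.0_cuda12.4 \
    --gpt_attention_plugin float16 \
    --context_fmha enable \
    --paged_kv_cache enable \
    --max_input_len 48 \
    --max_seq_len 1072 \
    --max_batch_size 16
```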
## Requirements & Compatibility

### System Requirements

- GPU: NVIDIA, Compute Capability ≥ 8.0 (A100/RTX 40/H100 class recommended)
- VRAM: ≥ 1.6 GB for INT4 engines (per GPU)
- CUDA: 12.x recommended
- Python: 3.10+
### Framework Compatibility

- TensorRT-LLM (engines), version `1.0.0`
- TRT-LLM checkpoints are portable across systems; engines are not
## Installation

```bash
pip install huggingface_hub

# Install TensorRT-LLM per NVIDIA docs:
# https://nvidia.github.io/TensorRT-LLM/
```
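One common install path is NVIDIA's package index (check the docs linked above for the currently recommended method and CUDA pairing):

```bash
# TensorRT-LLM wheels are hosted on NVIDIA's PyPI index
pip install tensorrt_llm==1.0.0 --extra-index-url https://pypi.nvidia.com
```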
## Troubleshooting

**Engine not portable**

Engines are specific to the GPU's SM architecture and the TensorRT-LLM/CUDA versions used at build time. Rebuild on the target system, or download a matching `engines/sm80_trt-llm-1.0.0_cuda12.4` variant if provided.
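To confirm the local GPU actually matches the `sm80` label before rebuilding, a quick check (assuming PyTorch is installed):

```python
# Engines labeled sm80 need a compute capability 8.0 GPU (e.g. A100)
import torch

major, minor = torch.cuda.get_device_capability()
print(f"This GPU is sm{major}{minor}")
```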
**OOM or slow loading**

Reduce `max_seq_len`, lower `max_batch_size`, and make sure any `gpu_memory_utilization` setting on your server is tuned to your GPU.

## Additional Resources
- TensorRT-LLM Docs: https://nvidia.github.io/TensorRT-LLM/
- Activation-aware Weight Quantization (AWQ): https://github.com/mit-han-lab/llm-awq
## License
This quantized model inherits the license from the original model: Apache 2.0
**Model tree for `yapwithai/orpheus-3b-trt-int4-awq`:**

- Base model: `meta-llama/Llama-3.2-3B-Instruct`
- Finetuned: `canopylabs/orpheus-3b-0.1-pretrained`
- Finetuned: `canopylabs/orpheus-3b-0.1-ft` (source of this quant)