Orpheus 3B - INT4 AWQ Quantized

A streaming-optimized Orpheus 3B quant, meant to run on an A100 with TensorRT-LLM.



Model Overview

This is a streaming-optimized INT4 AWQ quantized version of canopylabs/orpheus-3b-0.1-ft, intended to run with TensorRT-LLM.

Key Features:

  • Optimized for Production: Built for high-throughput, low-latency TTS serving
  • Faster Inference: Up to ~3x faster than FP16 with minimal perceived quality loss
  • Memory Efficient: ≈4x smaller weights vs. FP16 (INT4)
  • Ready for Streaming: Designed for real-time streaming TTS backends
  • Calibrated: Calibration targets 48 tokens of input and up to 1024 tokens of output, roughly 12 seconds of audio per text chunk (see the chunking sketch below)
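
For long inputs, the calibration budget above matters: text should be split into chunks of at most 48 tokens before synthesis. A minimal chunking sketch, assuming the source model's tokenizer is available via transformers; the chunk_text helper is illustrative, not part of this repo:

# Greedily pack whole words into chunks that fit the 48-token input budget.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("canopylabs/orpheus-3b-0.1-ft")

def chunk_text(text: str, max_tokens: int = 48) -> list[str]:
    chunks, current = [], []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and len(tokenizer.encode(candidate)) > max_tokens:
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks

print(chunk_text("A long paragraph of text to synthesize ..."))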

Technical Specifications

  • Source Model: canopylabs/orpheus-3b-0.1-ft
  • Quantization Method: AWQ (Activation-aware Weight Quantization)
  • Precision: INT4 weights, INT8 KV cache
  • AWQ Group/Block Size: 128
  • TensorRT-LLM Version: 1.0.0
  • Generated: 2025-10-09
  • Pipeline: TensorRT-LLM AWQ Quantization

Artifact Layout

trt-llm/
  checkpoints/            # Quantized TRT-LLM checkpoints (portable)
    *.safetensors
    config.json
  engines/sm80_trt-llm-1.0.0_cuda12.4/     # Built TensorRT-LLM engines (hardware-specific)
    rank*.engine
    build_metadata.json
    build_command.sh

Quick Start

Download Artifacts (Python)

from huggingface_hub import snapshot_download

# Download quantized checkpoints (portable)
ckpt_path = snapshot_download(
    repo_id="yapwithai/orpheus-3b-trt-int4-awq",
    allow_patterns=["trt-llm/checkpoints/**"],
)

# Or download TensorRT-LLM engines for a specific build label
eng_path = snapshot_download(
    repo_id="yapwithai/orpheus-3b-trt-int4-awq",
    allow_patterns=["trt-llm/engines/sm80_trt-llm-1.0.0_cuda12.4/**"],
)
print("checkpoints:", ckpt_path)
print("engines:", eng_path)

Run a Streaming TTS Server (TensorRT-LLM)

# Point your server to the downloaded engines
export TRTLLM_ENGINE_DIR=/path/to/trt-llm/engines/sm80_trt-llm-1.0.0_cuda12.4

# Start your TTS server (example: FastAPI + WebSocket)
python -m server.server
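
The server command above is a placeholder for whatever streaming backend you run. As an illustration only, here is a minimal client sketch for a hypothetical FastAPI WebSocket endpoint; the /tts path, the JSON request schema, and the binary-PCM framing are assumptions, not something this repo defines:

import asyncio
import json
import websockets

async def synthesize(text: str, url: str = "ws://localhost:8000/tts") -> bytes:
    audio = bytearray()
    async with websockets.connect(url) as ws:
        await ws.send(json.dumps({"text": text}))   # hypothetical request schema
        async for message in ws:
            if isinstance(message, bytes):          # assume binary frames carry PCM audio
                audio.extend(message)
            else:
                break                               # assume a text frame marks end-of-stream
    return bytes(audio)

if __name__ == "__main__":
    pcm = asyncio.run(synthesize("Hello from Orpheus!"))
    print(f"received {len(pcm)} bytes of audio")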

Quantization Details

  • Method: Activation-aware weight quantization (AWQ); see the reproduction sketch after this list
  • Calibration size: 256
  • AWQ block/group size: 128
  • DType for build: float16
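
A sketch of how a checkpoint with these settings is typically produced using TensorRT-LLM's example quantization script; the script path and flag names below reflect common TensorRT-LLM releases and may differ in yours:

import subprocess

subprocess.run(
    [
        "python", "examples/quantization/quantize.py",  # run from a TensorRT-LLM checkout
        "--model_dir", "canopylabs/orpheus-3b-0.1-ft",
        "--dtype", "float16",
        "--qformat", "int4_awq",
        "--awq_block_size", "128",
        "--calib_size", "256",
        "--kv_cache_dtype", "int8",
        "--output_dir", "trt-llm/checkpoints",
    ],
    check=True,
)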

Configuration Summary

{
  "quantization": {
    "weights_precision": "int4_awq",
    "kv_cache_dtype": "int8",
    "awq_block_size": 128,
    "calib_size": 256
  },
  "build": {
    "dtype": "float16",
    "max_input_len": 48,
    "max_output_len": 1024,
    "max_batch_size": 16,
    "engine_label": "sm80_trt-llm-1.0.0_cuda12.4",
    "tensorrt_llm_version": "1.0.0"
  },
  "environment": {
    "sm_arch": "sm80",
    "gpu_name": "NVIDIA A100 80GB PCIe",
    "cuda_toolkit": "12.4",
    "nvidia_driver": "550.127.05"
  }
}

Use Cases

  • Realtime Voice: assistants, product demos, interactive agents
  • High-throughput Serving: batch TTS pipelines, APIs
  • Edge & Cost-sensitive: limited VRAM environments

Advanced Configuration (Build-time)

  • Max input length: tune --max_input_len
  • Max output length: tune --max_seq_len
  • Batch size: tune --max_batch_size
  • Plugins: --gpt_attention_plugin, --context_fmha, --paged_kv_cache (see the build sketch below)
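
A sketch of passing these knobs to trtllm-build when turning the portable checkpoints into engines; the values are illustrative and should be checked against `trtllm-build --help` for your TensorRT-LLM version:

import subprocess

subprocess.run(
    [
        "trtllm-build",
        "--checkpoint_dir", "trt-llm/checkpoints",
        "--output_dir", "trt-llm/engines/sm80_trt-llm-1.0.0_cuda12.4",
        "--max_input_len", "48",
        "--max_seq_len", "1072",        # input + output budget (48 + 1024)
        "--max_batch_size", "16",
        "--gpt_attention_plugin", "float16",
        "--context_fmha", "enable",
        "--paged_kv_cache", "enable",
    ],
    check=True,
)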

Requirements & Compatibility

System Requirements

  • GPU: NVIDIA, Compute Capability ≥ 8.0 (A100/RTX 40/H100 class recommended); see the capability check below
  • VRAM: ≥ 1.6 GB for INT4 engines (per GPU)
  • CUDA: 12.x recommended
  • Python: 3.10+
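
A quick local check of the GPU requirement, assuming a CUDA-enabled PyTorch (already a TensorRT-LLM dependency) is installed:

import torch

major, minor = torch.cuda.get_device_capability(0)
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: compute capability {major}.{minor}, "
      f"{props.total_memory / 1024**3:.1f} GiB VRAM")
assert (major, minor) >= (8, 0), "Compute capability >= 8.0 is required for these engines"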

Framework Compatibility

  • TensorRT-LLM (engines), version 1.0.0
  • TRT-LLM checkpoints are portable across systems; built engines are not

Installation

pip install huggingface_hub
# Install TensorRT-LLM per NVIDIA docs
# https://nvidia.github.io/TensorRT-LLM/

Troubleshooting

  • Engine not portable: Engines are specific to the GPU SM architecture and the TensorRT-LLM/CUDA versions they were built with. Rebuild on the target system or download a matching `engines/sm80_trt-llm-1.0.0_cuda12.4` variant if provided.
  • OOM or slow loading: Reduce `max_seq_len`, lower `max_batch_size`, and make sure your server's `gpu_memory_utilization` (or equivalent setting) is tuned for your GPU.

License

This quantized model inherits the license from the original model: Apache 2.0
