library_name: transformers
tags:
  - turn-detection
  - voice-ai
  - multimodal
  - conversational-ai
  - speech
extra_gated_prompt: >-
  If you are using this model as part of a horizontal voice agent platform, you
  agree *not* to set Vogent-Turn-80M as a default option, and to require users
  to select an option labeled 'Vogent Turn Detector' if they would like to use
  the model.
extra_gated_fields:
  Company: text
  Email: text

Vogent-Turn-80M

State-of-the-art multimodal turn detection model for voice AI systems, achieving 94.1% accuracy by combining acoustic and linguistic signals for real-time conversational applications.

Technical Report

HF Space

Inference Code

Model Details

Model Description

Vogent-Turn-80M is a multimodal turn detection model that addresses the critical challenge of determining when a speaker has finished their turn in a conversation. Unlike traditional approaches that rely solely on audio or text, Vogent-Turn-80M processes both acoustic features (via a Whisper encoder) and semantic context to make accurate predictions in real time (~7 ms on a T4 GPU).

  • Developed by: Vogent AI
  • Model type: Multimodal Turn Detection (Binary Classification)
  • Language(s) (NLP): English
  • License: Vogent-Turn-80M is licensed under a modified Apache-2.0 license. Horizontal voice agent platforms may not set Vogent-Turn-80M as the default turn-detection model, and must require any end users who wish to use the model to select an option labeled 'Vogent Turn Detector'. Otherwise, standard Apache 2.0 provisions apply.
  • Finetuned from model: SmolLM2-135M (reduced to ~80M parameters by keeping only the first 12 layers)

Model Sources

Uses

Vogent-Turn-80M is designed for real-time turn detection in voice assistant applications, determining when a user has finished speaking to enable natural conversational flow without premature interruptions or awkward delays.

Bias, Risks, and Limitations

Technical Limitations:

  • English-only support; turn-taking conventions vary across languages and cultures
  • CPU inference may be too slow for some real-time applications

How to Get Started with the Model

For complete installation and usage instructions, visit: https://github.com/vogent/vogent-turn

Quick Install

# Clone the repository
git clone https://github.com/vogent/vogent-turn.git
cd vogent-turn

# Install in development mode
pip install -e .

Basic Usage

from vogent_turn import TurnDetector
import soundfile as sf
import urllib.request

# Initialize detector
detector = TurnDetector(compile_model=True, warmup=True)

# Download and load audio
audio_url = "https://storage.googleapis.com/voturn-sample-recordings/incomplete_number_sample.wav"
urllib.request.urlretrieve(audio_url, "sample.wav")
audio, sr = sf.read("sample.wav")

# Run turn detection with conversational context
result = detector.predict(
    audio,
    prev_line="What is your phone number",
    curr_line="My number is 804",
    sample_rate=sr,
    return_probs=True,
)

print(f"Turn complete: {result['is_endpoint']}")
print(f"Done speaking probability: {result['prob_endpoint']:.1%}")
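
The `prob_endpoint` value returned above can also be compared against a custom threshold, for example to trade response latency against the risk of interrupting the user. The following is a minimal sketch: `should_respond` and its default threshold are hypothetical and not part of the `vogent_turn` library, and the dictionaries are made-up stand-ins shaped like the `predict()` output.

```python
def should_respond(result: dict, threshold: float = 0.5) -> bool:
    """Return True when the endpoint probability clears the threshold.

    `result` is assumed to carry the 'prob_endpoint' key shown in the
    usage example; this helper itself is hypothetical.
    """
    return result["prob_endpoint"] >= threshold


# Stand-in dicts shaped like the predict() output (values are made up).
likely_done = {"is_endpoint": True, "prob_endpoint": 0.92}
mid_sentence = {"is_endpoint": False, "prob_endpoint": 0.30}

print(should_respond(likely_done, threshold=0.8))   # True
print(should_respond(mid_sentence, threshold=0.8))  # False
```

A higher threshold makes the agent wait longer before replying; a lower one makes it respond faster but interrupt more often.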

Available Interfaces

  • Python Library: Direct integration with TurnDetector class
  • CLI Tool: vogent-turn-predict speech.wav --prev "What is your phone number" --curr "My number is 804"

See the GitHub repository for detailed documentation, performance benchmarks, and advanced usage.

Training Details

Training Data

The model was trained on a diverse dataset combining human-collected and synthetic conversational data.

Training Procedure

Preprocessing

  • Audio: Last 8 seconds extracted via Whisper-Tiny encoder → ~400 audio tokens
  • Text: Full conversational context including assistant and user utterances
  • Labels: Binary classification (turn complete/incomplete)
  • Multimodal fusion: Audio embeddings projected into LLM's input space and concatenated with text
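
The "~400 audio tokens" figure follows from the window length: the Whisper encoder emits one position per 20 ms of audio (50 per second), so 8 seconds gives about 400 positions. The sketch below illustrates the windowing and the arithmetic only. The helper names are hypothetical, and it assumes the encoder output is trimmed to the real audio length rather than Whisper's usual 30-second padding.

```python
SAMPLE_RATE = 16_000            # Hz, from the model card
WINDOW_SECONDS = 8              # "last 8 seconds", from the model card
ENCODER_FRAMES_PER_SECOND = 50  # Whisper encoder: one position per 20 ms


def last_window(samples, sample_rate=SAMPLE_RATE, seconds=WINDOW_SECONDS):
    """Keep only the trailing `seconds` of a sequence of audio samples."""
    max_samples = sample_rate * seconds
    return samples[-max_samples:]


def audio_token_count(num_samples, sample_rate=SAMPLE_RATE):
    """Approximate number of encoder positions for `num_samples` of audio."""
    return num_samples * ENCODER_FRAMES_PER_SECOND // sample_rate


ten_seconds = [0.0] * (10 * SAMPLE_RATE)  # silent placeholder audio
window = last_window(ten_seconds)
print(len(window))                     # 128000
print(audio_token_count(len(window)))  # 400
```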

Training Hyperparameters

  • Training regime: fp16 mixed precision
  • Base model initialization: SmolLM2-135M (first 12 layers)
  • Architecture modifications: Reduced from 135M to ~80M parameters through layer ablation
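
As a rough check on the layer ablation, the decoder's parameter count can be worked out by hand. The configuration values below (hidden size, heads, vocabulary, and so on) are assumptions taken from the publicly released SmolLM2-135M configuration, not from this model card, and should be verified against the published `config.json`.

```python
# Assumed SmolLM2-135M configuration (verify against the published config.json).
HIDDEN = 576
INTERMEDIATE = 1536
NUM_HEADS = 9
NUM_KV_HEADS = 3
VOCAB = 49_152
FULL_LAYERS = 30


def layer_params():
    """Parameters in one Llama-style decoder layer (no bias terms)."""
    head_dim = HIDDEN // NUM_HEADS    # 64
    kv_dim = head_dim * NUM_KV_HEADS  # 192 (grouped-query attention)
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim  # q, o + k, v
    mlp = 3 * HIDDEN * INTERMEDIATE   # gate, up, and down projections
    norms = 2 * HIDDEN                # two RMSNorm weight vectors
    return attention + mlp + norms


def decoder_params(num_layers):
    """Embeddings (tied with the output head) + layers + final norm."""
    return VOCAB * HIDDEN + num_layers * layer_params() + HIDDEN


print(decoder_params(FULL_LAYERS))  # 134515008
print(decoder_params(12))           # 70793280
```

Under these assumptions the truncated decoder holds roughly 71M parameters. The gap to the ~80M total would presumably be covered by the audio encoder, the projection, and the classifier head; the source does not give that breakdown.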

Speeds, Sizes, Times

  • Model size: ~80M parameters

Evaluation

Testing Data, Factors & Metrics

Testing Data

Internal test set covering diverse conversational scenarios and edge cases where audio-only or text-only approaches fail.

  • Accuracy: 94.1%
  • AUPRC: 0.975
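
For readers unfamiliar with the two metrics, the sketch below shows how they are commonly computed. It is illustrative only: the labels and scores are made-up toy values, the 0.5 accuracy threshold is an assumption, and average precision is just one common estimator of AUPRC. The source does not state which estimator or threshold was used for the reported numbers.

```python
def accuracy(labels, probs, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    correct = sum((p >= threshold) == bool(y) for p, y in zip(probs, labels))
    return correct / len(labels)


def average_precision(labels, probs):
    """Mean of the precision values measured at each positive, ranked by score."""
    ranked = sorted(zip(probs, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Toy data: 1 = turn complete, 0 = turn incomplete (values are made up).
toy_labels = [1, 1, 0, 1, 0]
toy_probs = [0.9, 0.8, 0.7, 0.6, 0.2]

print(accuracy(toy_labels, toy_probs))                     # 0.8
print(round(average_precision(toy_labels, toy_probs), 3))  # 0.917
```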

Technical Specifications

Model Architecture and Objective

Architecture:

  • Audio Encoder: Whisper-Tiny (processes up to 8 seconds of 16kHz audio)
  • Text Model: SmolLM2-135M (first 12 layers, ~80M parameters)
  • Multimodal Fusion: Audio embeddings projected into LLM's input space
  • Classifier: Binary classification head (turn complete/incomplete)

Processing Flow:

  1. Audio (16kHz PCM) → Whisper Encoder → Audio Embeddings (~400 tokens)
  2. Text Context → SmolLM Tokenizer → Text Embeddings
  3. Concatenate embeddings → SmolLM Transformer → Last token hidden state
  4. Linear Classifier → Softmax → [P(continue), P(endpoint)]
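
The final step of the flow can be illustrated in isolation. The sketch below applies a numerically stable softmax to two classifier logits and reads off the two probabilities in the order given above. The logit values are illustrative and not taken from the model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative logits in the order [continue, endpoint].
logits = [0.0, math.log(3.0)]
p_continue, p_endpoint = softmax(logits)

print(f"P(continue) = {p_continue:.2f}")  # P(continue) = 0.25
print(f"P(endpoint) = {p_endpoint:.2f}")  # P(endpoint) = 0.75
```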

Compute Infrastructure

Hardware

Optimization Features:

  • torch.compile with max-autotune mode
  • Dynamic tensor shapes without recompilation
  • Pre-warmed bucket sizes (64, 128, 256, 512, 1024)
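
The source does not say how the bucket sizes are used. One plausible reading is that each input sequence is padded up to the smallest bucket that fits, so the compiled model only ever sees a handful of shapes. The sketch below shows that idea under this assumption; `pick_bucket`, `pad_to_bucket`, and `pad_id` are hypothetical names, not part of the library.

```python
import bisect

BUCKETS = (64, 128, 256, 512, 1024)  # sizes listed in the model card


def pick_bucket(length, buckets=BUCKETS):
    """Return the smallest bucket size that can hold `length` tokens."""
    index = bisect.bisect_left(buckets, length)
    if index == len(buckets):
        raise ValueError(f"sequence of length {length} exceeds largest bucket")
    return buckets[index]


def pad_to_bucket(token_ids, pad_id=0):
    """Right-pad a list of token ids up to its bucket size."""
    target = pick_bucket(len(token_ids))
    return token_ids + [pad_id] * (target - len(token_ids))


print(pick_bucket(64))                   # 64
print(pick_bucket(430))                  # 512
print(len(pad_to_bucket([7] * 100)))     # 128
```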

Software

  • Framework: PyTorch with torch.compile
  • Audio processing: Whisper encoder (up to 8 seconds)

Citation

BibTeX:

@misc{voturn2025,
  title={Vogent-Turn-80M: State-of-the-Art Turn Detection for Voice Agents},
  author={Varadarajan, Vignesh and Vytheeswaran, Jagath},
  year={2025},
  publisher={Vogent AI},
  howpublished={\url{https://huggingface.co/vogent/Vogent-Turn-80M}},
  note={Blog: \url{https://blog.vogent.ai/posts/voturn-80m-state-of-the-art-turn-detection-for-voice-agents}}
}

More Information

Vogent-Turn-80M is part of Vogent's comprehensive voice AI platform.

Resources:

Upcoming releases:

  • Int8 quantized model for faster CPU deployment
  • Multilingual versions
  • Domain-specific adaptations

Model Card Authors

Vogent AI Team

Model Card Contact