library_name: transformers
tags:
- turn-detection
- voice-ai
- multimodal
- conversational-ai
- speech
extra_gated_prompt: >-
  If you are using this model as part of a horizontal voice agent platform, you
  agree *not* to set Vogent-Turn-80M as a default option, and to require users
  to select an option labeled 'Vogent Turn Detector' if they would like to use
  the model.
extra_gated_fields:
  Company: text
  Email: text
Vogent-Turn-80M
State-of-the-art multimodal turn detection model for voice AI systems. It reaches 94.1% accuracy on an internal test set by combining acoustic and linguistic signals for real-time conversational applications.
Model Details
Model Description
Vogent-Turn-80M is a multimodal turn detection model that determines when a speaker has finished their turn in a conversation. Unlike traditional approaches that rely solely on audio or text, it processes both acoustic features (via a Whisper-Tiny encoder) and the semantic context of the conversation, and makes predictions in real time (~7 ms on a T4 GPU).
- Developed by: Vogent AI
- Model type: Multimodal Turn Detection (Binary Classification)
- Language(s) (NLP): English
- License: Vogent-Turn-80M is licensed under a modified Apache-2.0 license; horizontal voice agent platforms may not select Vogent-Turn-80M as the default turn-detection model, and any end users who wish to use the model must be required to select 'Vogent Turn Detector'. Otherwise, standard Apache 2.0 provisions apply.
- Finetuned from model: SmolLM2-135M (reduced to ~80M parameters by using only the first 12 layers)
Model Sources
- GitHub Repository: https://github.com/vogent/vogent-turn
- Blog post: https://blog.vogent.ai/posts/voturn-80m-state-of-the-art-turn-detection-for-voice-agents
Uses
Vogent-Turn-80M is designed for real-time turn detection in voice assistant applications, determining when a user has finished speaking to enable natural conversational flow without premature interruptions or awkward delays.
Bias, Risks, and Limitations
Technical Limitations:
- English-only support; turn-taking conventions vary across languages and cultures
- CPU inference may be too slow for some real-time applications
How to Get Started with the Model
For complete installation and usage instructions, visit: https://github.com/vogent/vogent-turn
Quick Install
# Clone the repository
git clone https://github.com/vogent/vogent-turn.git
cd vogent-turn
# Install in development mode
pip install -e .
Basic Usage
from vogent_turn import TurnDetector
import soundfile as sf
import urllib.request
# Initialize detector
detector = TurnDetector(compile_model=True, warmup=True)
# Download and load audio
audio_url = "https://storage.googleapis.com/voturn-sample-recordings/incomplete_number_sample.wav"
urllib.request.urlretrieve(audio_url, "sample.wav")
audio, sr = sf.read("sample.wav")
# Run turn detection with conversational context
result = detector.predict(
    audio,
    prev_line="What is your phone number",
    curr_line="My number is 804",
    sample_rate=sr,
    return_probs=True,
)
print(f"Turn complete: {result['is_endpoint']}")
print(f"Done speaking probability: {result['prob_endpoint']:.1%}")
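The result printed above exposes both a boolean decision and a probability. An application that wants to trade response latency against the risk of interrupting the user could apply its own cutoff to `prob_endpoint`. The following is a minimal sketch of that idea; the helper name `should_respond` and the 0.7 default are illustrative assumptions and are not part of the vogent_turn API.

```python
def should_respond(result: dict, threshold: float = 0.7) -> bool:
    """Return True when the endpoint probability clears a caller-chosen cutoff.

    `result` is assumed to look like the dict used above, i.e. it carries a
    'prob_endpoint' key holding a float in [0, 1]. The helper name and the
    default threshold are hypothetical, chosen only for illustration.
    """
    prob = float(result["prob_endpoint"])
    if not 0.0 <= prob <= 1.0:
        raise ValueError(f"prob_endpoint out of range: {prob}")
    return prob >= threshold


# Stand-in values for demonstration; a real result would come from
# detector.predict(..., return_probs=True).
example = {"is_endpoint": False, "prob_endpoint": 0.35}
print(should_respond(example))                 # False
print(should_respond(example, threshold=0.3))  # True
```

A higher threshold makes the agent wait longer before replying; a lower one makes it more responsive but more likely to cut the user off.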
Available Interfaces
- Python Library: Direct integration with the TurnDetector class
- CLI Tool: vogent-turn-predict speech.wav --prev "What is your phone number" --curr "My number is 804"
See the GitHub repository for detailed documentation, performance benchmarks, and advanced usage.
Training Details
Training Data
The model was trained on a diverse dataset combining human-collected and synthetic conversational data.
Training Procedure
Preprocessing
- Audio: Last 8 seconds extracted via Whisper-Tiny encoder → ~400 audio tokens
- Text: Full conversational context including assistant and user utterances
- Labels: Binary classification (turn complete/incomplete)
- Multimodal fusion: Audio embeddings projected into LLM's input space and concatenated with text
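The audio step above can be sketched in a few lines to show where the figure of roughly 400 audio tokens comes from. Whisper's front end emits one mel frame per 160 samples (10 ms at 16 kHz), and the encoder's strided convolution halves that rate, so the encoder yields 50 frames per second and 8 seconds gives 400. The function names below are hypothetical, and how the library itself crops or pads audio is not described here, so treat this as an illustration of the arithmetic only.

```python
SAMPLE_RATE = 16_000  # Hz, the input rate stated in this card
MAX_SECONDS = 8       # audio context stated in this card
MEL_HOP = 160         # samples per Whisper mel frame (10 ms at 16 kHz)
CONV_STRIDE = 2       # Whisper's encoder downsamples mel frames by 2


def last_window(samples, sample_rate=SAMPLE_RATE, max_seconds=MAX_SECONDS):
    """Keep only the trailing `max_seconds` of a 1-D sample sequence.

    Works on lists or NumPy arrays; shorter clips are returned unchanged.
    """
    max_len = sample_rate * max_seconds
    return samples[-max_len:]


def encoder_frames(num_samples):
    """Approximate number of Whisper encoder output frames for a clip."""
    return num_samples // (MEL_HOP * CONV_STRIDE)


# 12 seconds of silence stands in for real audio.
audio = [0.0] * (SAMPLE_RATE * 12)
window = last_window(audio)
print(len(window))                  # 128000 samples, i.e. 8 seconds
print(encoder_frames(len(window)))  # 400
```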
Training Hyperparameters
- Training regime: fp16 mixed precision
- Base model initialization: SmolLM2-135M (first 12 layers)
- Architecture modifications: Reduced from 135M to ~80M parameters through layer ablation
Speeds, Sizes, Times
- Model size: ~80M parameters
Evaluation
Testing Data, Factors & Metrics
Testing Data
Internal test set covering diverse conversational scenarios and edge cases where audio-only or text-only approaches fail.
- Accuracy: 94.1%
- AUPRC: 0.975
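For readers who want to compute the same two kinds of metric on their own labeled data, the sketch below shows accuracy at a fixed threshold and average precision, a common estimator of the area under the precision-recall curve. The card does not state how its AUPRC was computed, so the estimator, the 0.5 threshold, and the function names are assumptions, and the toy numbers are made up for illustration rather than taken from the model.

```python
def accuracy(labels, probs, threshold=0.5):
    """Fraction of examples where (prob >= threshold) matches the label."""
    correct = sum(int((p >= threshold) == bool(y)) for y, p in zip(labels, probs))
    return correct / len(labels)


def average_precision(labels, probs):
    """Mean of precision values taken at the rank of each positive example."""
    ranked = sorted(zip(probs, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(1 for _, y in ranked if y)
    if total_pos == 0:
        raise ValueError("average precision is undefined without positives")
    hits = 0
    running = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y:
            hits += 1
            running += hits / rank  # precision at this recall step
    return running / total_pos


# Toy data: 1 = turn complete, 0 = turn incomplete.
labels = [1, 0, 1, 1]
probs = [0.9, 0.8, 0.7, 0.1]
print(accuracy(labels, probs))                     # 0.5
print(round(average_precision(labels, probs), 4))  # 0.8056
```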
Technical Specifications
Model Architecture and Objective
Architecture:
- Audio Encoder: Whisper-Tiny (processes up to 8 seconds of 16kHz audio)
- Text Model: SmolLM2-135M (first 12 layers, ~80M parameters)
- Multimodal Fusion: Audio embeddings projected into LLM's input space
- Classifier: Binary classification head (turn complete/incomplete)
Processing Flow:
- Audio (16kHz PCM) → Whisper Encoder → Audio Embeddings (~400 tokens)
- Text Context → SmolLM Tokenizer → Text Embeddings
- Concatenate embeddings → SmolLM Transformer → Last token hidden state
- Linear Classifier → Softmax → [P(continue), P(endpoint)]
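The last step of the flow above turns two classifier logits into the pair [P(continue), P(endpoint)]. A minimal pure-Python sketch of that softmax step follows. The logit values are invented, and the function names, the 0.5 decision threshold, and the mapping onto `is_endpoint` and `prob_endpoint` are assumptions made for illustration, not a description of the library's internals.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decide(logits, threshold=0.5):
    """Map [continue_logit, endpoint_logit] to a decision dict.

    The output keys mirror the result fields shown in Basic Usage; the
    threshold is an assumed default.
    """
    p_continue, p_endpoint = softmax(logits)
    return {"is_endpoint": p_endpoint >= threshold, "prob_endpoint": p_endpoint}


probs = softmax([0.0, 0.0])
print(probs)                # [0.5, 0.5]
print(decide([2.0, -1.0]))  # endpoint probability is about 0.047, so is_endpoint is False
```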
Compute Infrastructure
Hardware
Optimization Features:
- torch.compile with max-autotune mode
- Dynamic tensor shapes without recompilation
- Pre-warmed bucket sizes (64, 128, 256, 512, 1024)
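One plausible reading of the bucket sizes above is that each input sequence is padded up to the nearest bucket, so the compiled model only ever sees a handful of pre-warmed shapes. The card lists the sizes but not the mechanism, so the sketch below is an assumption about the approach, with hypothetical helper names and a placeholder pad id.

```python
import bisect

BUCKETS = (64, 128, 256, 512, 1024)  # sizes listed in this card


def pick_bucket(seq_len, buckets=BUCKETS):
    """Return the smallest bucket that can hold `seq_len` tokens."""
    if seq_len < 1:
        raise ValueError("seq_len must be positive")
    idx = bisect.bisect_left(buckets, seq_len)
    if idx == len(buckets):
        raise ValueError(f"sequence of length {seq_len} exceeds largest bucket")
    return buckets[idx]


def pad_to_bucket(token_ids, pad_id=0, buckets=BUCKETS):
    """Right-pad a token list to its bucket size (pad_id is a placeholder)."""
    target = pick_bucket(len(token_ids), buckets)
    return list(token_ids) + [pad_id] * (target - len(token_ids))


print(pick_bucket(64))                   # 64
print(pick_bucket(65))                   # 128
print(len(pad_to_bucket([1] * 430)))     # 512
```

A real pipeline would also need an attention mask and a padding side that is compatible with reading the last token's hidden state.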
Software
- Framework: PyTorch with torch.compile
- Audio processing: Whisper encoder (up to 8 seconds)
Citation
BibTeX:
@misc{voturn2025,
  title={Vogent-Turn-80M: State-of-the-Art Turn Detection for Voice Agents},
  author={Varadarajan, Vignesh and Vytheeswaran, Jagath},
  year={2025},
  publisher={Vogent AI},
  howpublished={\url{https://huggingface.co/vogent/Vogent-Turn-80M}},
  note={Blog: \url{https://blog.vogent.ai/posts/voturn-80m-state-of-the-art-turn-detection-for-voice-agents}}
}
More Information
Vogent-Turn-80M is part of Vogent's comprehensive voice AI platform.
Resources:
- Full documentation and code: https://github.com/vogent/vogent-turn
- Platform access: https://vogent.ai
- Enterprise solutions: Contact j@vogent.ai
Upcoming releases:
- Int8 quantized model for faster CPU deployment
- Multilingual versions
- Domain-specific adaptations
Model Card Authors
Vogent AI Team
Model Card Contact
- GitHub Repository: https://github.com/vogent/vogent-turn
- GitHub Issues: https://github.com/vogent/vogent-turn/issues
- Website: https://vogent.ai