
library_name: transformers
tags:
  - turn-detection
  - voice-ai
  - multimodal
  - conversational-ai
  - speech
extra_gated_prompt: >-
  If you are using this model as part of a horizontal voice agent platform, you
  agree *not* to set Vogent-Turn-80M as a default option, and to require users
  to select an option labeled 'Vogent Turn Detector' if they would like to use
  the model.
extra_gated_fields:
  Company: text
  Email: text

Vogent-Turn-80M

State-of-the-art multimodal turn detection model for voice AI systems, achieving 94.1% accuracy by combining acoustic and linguistic signals for real-time conversational applications.

Technical Report

HF Space

Inference Code

Model Details

Model Description

Vogent-Turn-80M is a multimodal turn detection model that addresses the critical challenge of determining when a speaker has finished their turn in a conversation. Unlike traditional approaches that rely solely on audio or text, Vogent-Turn-80M processes both acoustic features (via a Whisper encoder) and semantic context to make accurate predictions in real time (~7 ms on a T4 GPU).

Developed by: Vogent AI
Model type: Multimodal Turn Detection (Binary Classification)
Language(s) (NLP): English
License: Vogent-Turn-80M is licensed under a modified Apache-2.0 license; horizontal voice agent platforms may not select Vogent-Turn-80M as the default turn-detection model, and any end users who wish to use the model must be required to select 'Vogent Turn Detector'. Otherwise, standard Apache 2.0 provisions apply.
Finetuned from model: SmolLM2-135M (reduced to 80M parameters by using only the first 12 layers)

Model Sources

GitHub Repository: https://github.com/vogent/vogent-turn
Blog post: https://blog.vogent.ai/posts/voturn-80m-state-of-the-art-turn-detection-for-voice-agents

Uses

Vogent-Turn-80M is designed for real-time turn detection in voice assistant applications, determining when a user has finished speaking to enable natural conversational flow without premature interruptions or awkward delays.

Bias, Risks, and Limitations

Technical Limitations:

English-only support; turn-taking conventions vary across languages and cultures
CPU inference may be too slow for some real-time applications

How to Get Started with the Model

For complete installation and usage instructions, visit: https://github.com/vogent/vogent-turn

Quick Install

# Clone the repository
git clone https://github.com/vogent/vogent-turn.git
cd vogent-turn

# Install in development mode
pip install -e .

Basic Usage

from vogent_turn import TurnDetector
import soundfile as sf
import urllib.request

# Initialize detector
detector = TurnDetector(compile_model=True, warmup=True)

# Download and load audio
audio_url = "https://storage.googleapis.com/voturn-sample-recordings/incomplete_number_sample.wav"
urllib.request.urlretrieve(audio_url, "sample.wav")
audio, sr = sf.read("sample.wav")

# Run turn detection with conversational context
result = detector.predict(
    audio,
    prev_line="What is your phone number",
    curr_line="My number is 804",
    sample_rate=sr,
    return_probs=True,
)

print(f"Turn complete: {result['is_endpoint']}")
print(f"Done speaking probability: {result['prob_endpoint']:.1%}")
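The result dictionary exposes both a boolean decision and a probability, so a caller can apply its own cutoff instead of relying on `is_endpoint`. The sketch below is a minimal illustration: `should_respond` and its `threshold` parameter are hypothetical names, not part of the library, and the example values are made up. The cutoff the library itself uses to set `is_endpoint` is not stated here.

```python
def should_respond(result: dict, threshold: float = 0.5) -> bool:
    """Hypothetical helper: True when the endpoint probability reaches the cutoff."""
    return result["prob_endpoint"] >= threshold


# A dictionary shaped like the output of detector.predict(..., return_probs=True).
# The numbers are illustrative, not real model output.
example = {"is_endpoint": False, "prob_endpoint": 0.12}

print(should_respond(example))        # False
print(should_respond(example, 0.1))   # True
```

A higher threshold makes the agent wait longer before replying; a lower one makes it respond sooner at the risk of interrupting.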

Available Interfaces

Python Library: Direct integration with TurnDetector class
CLI Tool: vogent-turn-predict speech.wav --prev "What is your phone number" --curr "My number is 804"

See the GitHub repository for detailed documentation, performance benchmarks, and advanced usage.

Training Details

Training Data

The model was trained on a diverse dataset combining human-collected and synthetic conversational data.

Training Procedure

Preprocessing

Audio: Last 8 seconds of audio, encoded with the Whisper-Tiny encoder → ~400 audio tokens
Text: Full conversational context including assistant and user utterances
Labels: Binary classification (turn complete/incomplete)
Multimodal fusion: Audio embeddings projected into LLM's input space and concatenated with text
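The "8 seconds → ~400 tokens" figure in the audio step follows from the Whisper encoder's frame rate: a 10 ms mel hop followed by a stride-2 convolution gives 50 encoder frames per second. The sketch below shows the trimming and the arithmetic; `last_seconds` and `audio_token_count` are hypothetical helper names, and the library's own preprocessing may differ.

```python
ENCODER_FRAMES_PER_SECOND = 50  # Whisper: 10 ms mel hop, then a stride-2 convolution


def last_seconds(audio, sample_rate: int, seconds: float = 8.0):
    """Keep only the trailing `seconds` of audio (shorter clips are returned whole)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    n = int(round(seconds * sample_rate))
    return audio[-n:]


def audio_token_count(num_samples: int, sample_rate: int) -> int:
    """Approximate number of encoder frames for a clip of the given length."""
    return int(num_samples / sample_rate * ENCODER_FRAMES_PER_SECOND)


# Ten seconds of silence at 16 kHz, trimmed to the last eight seconds.
samples = [0.0] * (16_000 * 10)
tail = last_seconds(samples, 16_000)
print(len(tail))                             # 128000
print(audio_token_count(len(tail), 16_000))  # 400
```

Slicing along the first axis also works for the NumPy array returned by `sf.read`.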

Training Hyperparameters

Training regime: fp16 mixed precision
Base model initialization: SmolLM2-135M (first 12 layers)
Architecture modifications: Reduced from 135M to ~80M parameters through layer ablation
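To see roughly how dropping layers shrinks the text model, the sketch below counts parameters for a Llama-style decoder stack. The configuration values are assumptions taken from the published SmolLM2-135M configuration (they are not stated in this card), and `text_stack_params` is a hypothetical helper. It counts only the text stack; the audio encoder, projection, and classifier head are not included, so the result is not expected to equal the ~80M total.

```python
# Assumed SmolLM2-135M configuration (not given in this card).
VOCAB, HIDDEN, FFN = 49_152, 576, 1_536
HEADS, KV_HEADS = 9, 3
HEAD_DIM = HIDDEN // HEADS  # 64


def text_stack_params(num_layers: int) -> int:
    """Parameter count of the embeddings plus `num_layers` decoder layers."""
    kv_dim = KV_HEADS * HEAD_DIM                           # grouped-query K/V width
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim  # q and o, then k and v
    mlp = 3 * HIDDEN * FFN                                 # gate, up, down projections
    norms = 2 * HIDDEN                                     # two RMSNorm weights per layer
    per_layer = attention + mlp + norms
    embeddings = VOCAB * HIDDEN                            # tied with the output head
    final_norm = HIDDEN
    return embeddings + num_layers * per_layer + final_norm


print(text_stack_params(30))  # 134515008
print(text_stack_params(12))  # 70793280
```

Because the embedding table is unaffected by layer removal, keeping 12 of 30 layers cuts the text stack by less than the 60% reduction in depth might suggest.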

Speeds, Sizes, Times

Model size: ~80M parameters

Evaluation

Testing Data, Factors & Metrics

Testing Data

Internal test set covering diverse conversational scenarios and edge cases where audio-only or text-only approaches fail.

Accuracy: 94.1%
AUPRC: 0.975
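For readers unfamiliar with the two metrics, the sketch below computes accuracy at a fixed threshold and average precision, a common estimator of AUPRC, on a toy example. The labels and scores are made up and unrelated to the internal test set, and the card does not say which AUPRC estimator or threshold was used for the reported numbers.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def average_precision(labels, scores):
    """Mean of the precision values at each positive, ranked by descending score.

    Assumes at least one positive label and no tied scores.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / sum(labels)


# Toy data: 1 = turn complete, 0 = turn incomplete.
labels = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
print(accuracy(labels, scores))                     # 0.8
print(round(average_precision(labels, scores), 4))  # 0.8056
```

Accuracy depends on the chosen threshold, while AUPRC summarizes ranking quality across all thresholds.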

Technical Specifications

Model Architecture and Objective

Architecture:

Audio Encoder: Whisper-Tiny (processes up to 8 seconds of 16kHz audio)
Text Model: SmolLM2-135M (first 12 layers, ~80M parameters)
Multimodal Fusion: Audio embeddings projected into LLM's input space
Classifier: Binary classification head (turn complete/incomplete)
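The fusion step can be pictured as a shape transformation: audio embeddings are mapped to the text model's width and then joined with the text embeddings along the sequence axis. The sketch below uses random arrays to show the shapes only. The widths (384 for Whisper-Tiny, 576 for SmolLM2-135M), the single linear projection, the audio-before-text ordering, and the 20-token text length are all assumptions for illustration; the card does not specify them.

```python
import numpy as np

AUDIO_DIM = 384  # assumed Whisper-Tiny encoder width
TEXT_DIM = 576   # assumed SmolLM2-135M hidden size


def fuse(audio_emb, text_emb, weight, bias):
    """Project audio embeddings to the text width and concatenate along the sequence."""
    projected = audio_emb @ weight + bias      # (n_audio, TEXT_DIM)
    return np.concatenate([projected, text_emb], axis=0)


rng = np.random.default_rng(0)
audio_emb = rng.standard_normal((400, AUDIO_DIM))  # ~400 audio tokens
text_emb = rng.standard_normal((20, TEXT_DIM))     # hypothetical 20 text tokens
weight = rng.standard_normal((AUDIO_DIM, TEXT_DIM)) * 0.02
bias = np.zeros(TEXT_DIM)

fused = fuse(audio_emb, text_emb, weight, bias)
print(fused.shape)  # (420, 576)
```

The transformer then treats the fused sequence like any other sequence of input embeddings.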

Processing Flow:

Audio (16kHz PCM) → Whisper Encoder → Audio Embeddings (~400 tokens)
Text Context → SmolLM Tokenizer → Text Embeddings
Concatenate embeddings → SmolLM Transformer → Last token hidden state
Linear Classifier → Softmax → [P(continue), P(endpoint)]
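The last step of the flow turns two classifier logits into the probability pair [P(continue), P(endpoint)]. A minimal sketch of that softmax, with made-up logits rather than real model output:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative logits in the order [continue, endpoint].
p_continue, p_endpoint = softmax([-1.0, 2.0])
print(f"P(continue)={p_continue:.3f}  P(endpoint)={p_endpoint:.3f}")
# P(continue)=0.047  P(endpoint)=0.953
```

With two classes this is equivalent to a sigmoid applied to the difference of the logits.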

Compute Infrastructure

Hardware

Optimization Features:

torch.compile with max-autotune mode
Dynamic tensor shapes without recompilation
Pre-warmed bucket sizes (64, 128, 256, 512, 1024)
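One common way to use fixed bucket sizes with a compiled model is to pad each input up to the nearest bucket so that only a handful of shapes are ever seen. The sketch below illustrates that idea with the sizes listed above. Whether the library pads this way, and what unit the bucket sizes measure, is not stated in this card; `bucket_for` is a hypothetical helper.

```python
from bisect import bisect_left

BUCKETS = (64, 128, 256, 512, 1024)  # sizes listed in the card


def bucket_for(length: int) -> int:
    """Return the smallest bucket that can hold `length` items."""
    index = bisect_left(BUCKETS, length)
    if index == len(BUCKETS):
        raise ValueError(f"length {length} exceeds the largest bucket {BUCKETS[-1]}")
    return BUCKETS[index]


print(bucket_for(100))   # 128
print(bucket_for(64))    # 64
print(bucket_for(1024))  # 1024
```

Warming up each bucket once at startup moves compilation cost out of the first real request.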

Software

Framework: PyTorch with torch.compile
Audio processing: Whisper encoder (up to 8 seconds)

Citation

BibTeX:

@misc{voturn2025,
  title={Vogent-Turn-80M: State-of-the-Art Turn Detection for Voice Agents},
  author={Varadarajan, Vignesh and Vytheeswaran, Jagath},
  year={2025},
  publisher={Vogent AI},
  howpublished={\url{https://huggingface.co/vogent/Vogent-Turn-80M}},
  note={Blog: \url{https://blog.vogent.ai/posts/voturn-80m-state-of-the-art-turn-detection-for-voice-agents}}
}

More Information

Vogent-Turn-80M is part of Vogent's comprehensive voice AI platform.

Resources:

Full documentation and code: https://github.com/vogent/vogent-turn
Platform access: https://vogent.ai
Enterprise solutions: Contact j@vogent.ai

Upcoming releases:

Int8 quantized model for faster CPU deployment
Multilingual versions
Domain-specific adaptations

Model Card Authors

Vogent AI Team

Model Card Contact

GitHub Repository: https://github.com/vogent/vogent-turn
GitHub Issues: https://github.com/vogent/vogent-turn/issues
Website: https://vogent.ai