# TinyWave Expressive Speech 2B
**TinyWave Expressive Speech 2B** is a compact speech-to-speech language model distilled from the 7B SPIRIT-LM-Expressive teacher. It is trained to generate rich, expressive spoken language—capturing prosody, emotion, and speaker variation—purely from speech inputs.
Using a HuBERT-based discrete tokenizer augmented with **pitch and style tokens**, this model offers **high-fidelity expressive generation** with just 2B parameters, making it ideal for **low-latency deployment** in storytelling, assistive speech technologies, and interactive voice systems.
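Conceptually, an encoded utterance is a single flat string that interleaves phonetic, pitch, and style units. A rough illustration with made-up IDs (the exact vocabulary and ordering come from the SPIRIT-LM expressive tokenizer):
```python
# Illustrative only: interleaved HuBERT ([Hu*]), pitch ([Pi*]), and style ([St*]) tokens
example_tokens = "[St71][Pi39][Hu99][Hu49][Pi48][Hu38][Hu149]"
```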
> 📖 For details, see the [TinyWave paper (arXiv:2506.23670)](https://arxiv.org/abs/2506.23670) and [project website](https://mohammadmahdinoori.github.io/tinywave-landing/).
---
## 🔧 Usage
This model requires **SPIRIT-LM's expressive speech tokenizer** for both encoding and decoding HuBERT-based audio tokens.
### 1. Clone SPIRIT-LM and Install Dependencies
```bash
git clone https://github.com/facebookresearch/spiritlm
cd spiritlm
pip install -e '.[eval]'
```
---
### 2. Load Tokenizer
```python
from spiritlm.speech_tokenizer import spiritlm_expressive
speech_tokenizer = spiritlm_expressive()
```
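As a quick sanity check, you can encode a short clip directly; `encode_string` returns the flat token string the model consumes (the file path here is a placeholder):
```python
import torchaudio

# Hypothetical 16 kHz mono clip; replace with your own recording.
wav, sr = torchaudio.load("prompt.wav")
tokens = speech_tokenizer.encode_string(wav.view(1, 1, -1).float())
print(tokens[:120])
```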
---
### 3. Inference Code (Speech-to-Speech)
```python
from transformers import LlamaForCausalLM, AutoTokenizer
from spiritlm.speech_tokenizer import spiritlm_expressive
import torchaudio
import torch

# Load the distilled 2B model and its token vocabulary
MODEL_PATH = "tinywave/speech-expressive-2b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = LlamaForCausalLM.from_pretrained(MODEL_PATH, device_map="auto", torch_dtype=torch.bfloat16)

# Load the expressive speech tokenizer (HuBERT units + pitch and style tokens)
speech_tokenizer = spiritlm_expressive()

def get_inference(audio_path):
    # Read the prompt audio and reshape to (batch, channel, samples)
    audio, _ = torchaudio.load(audio_path)
    input_values = audio.view(1, 1, -1).to(speech_tokenizer.hubert_model.device).float()

    # Encode the waveform into a flat string of discrete speech tokens
    tokens = speech_tokenizer.encode_string(input_values)

    # Continue the token stream with the language model
    input_ids = tokenizer(tokens, return_tensors="pt").input_ids.to(model.device)
    output = model.generate(input_ids, max_new_tokens=256, top_p=0.9, temperature=0.9, do_sample=True)
    return tokenizer.decode(output[0])
```
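The helper returns the generated token string; a minimal usage sketch (the prompt path is a placeholder):
```python
# Any short spoken sentence works as a prompt (hypothetical file name).
generated_output = get_inference("prompt.wav")
print(generated_output[:120])  # flat string of discrete speech tokens
```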
---
### 4. Decode to WAV
```python
import numpy as np
from scipy.io.wavfile import write

def save_array_to_wav_int16(audio_array: np.ndarray, sampling_rate=16000, filename="output.wav"):
    # Peak-normalize and convert to 16-bit PCM before writing to disk
    scaled = np.int16(audio_array / np.max(np.abs(audio_array)) * 32767)
    write(filename, sampling_rate, scaled)

# `generated_output` is the token string produced by get_inference() in step 3.
# Strip spaces and the BOS/EOS markers before decoding back to a waveform.
clean_tokens = generated_output.replace(" ", "").replace("<s>", "").replace("</s>", "")
decoded_audio = speech_tokenizer.decode(clean_tokens, speaker_id=2)
save_array_to_wav_int16(decoded_audio, filename="generated.wav")
```
---
## 🗣️ Inference Examples
### 🎧 Expressive Speech Continuation
* **Input:** a spoken sentence (`.wav`)
* **Output:** an expressive continuation that preserves the prompt's tone, pitch, and speaking style (see the combined example below)
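Putting steps 2–4 together into one round trip (file names are placeholders):
```python
# Speech prompt -> token continuation -> waveform, using the helpers above.
generated = get_inference("story_prompt.wav")
clean = generated.replace(" ", "").replace("<s>", "").replace("</s>", "")
waveform = speech_tokenizer.decode(clean, speaker_id=2)
save_array_to_wav_int16(waveform, filename="continuation.wav")
```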
---
## 🧠 Model Details
| Feature | Description |
| ------------------- | ----------------------------------------------- |
| Architecture | 2B parameter distilled transformer |
| Tokenizer | SPIRIT-LM Expressive (HuBERT + prosody) |
| Input Type | Discrete HuBERT tokens only (speech-only) |
| Output Type | Discrete audio tokens (speech continuation) |
| Teacher Model | SPIRIT-LM-Expressive 7B |
| Tasks | Expressive speech continuation |
| Distillation Method | Layer-aligned: hidden states, attention, logits |
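The last row of the table can be made concrete with a small sketch. This is illustrative only, not the TinyWave training code: it assumes HF-style outputs exposing `.logits`, `.hidden_states`, and `.attentions`, matching hidden sizes between student and teacher (or a projection applied beforehand), and made-up loss weights:
```python
import torch.nn.functional as F

def distill_loss(student_out, teacher_out, layer_map, T=2.0, w_kl=1.0, w_hid=1.0, w_att=1.0):
    """Illustrative layer-aligned KD loss: logits (KL) + hidden states (MSE) + attentions (MSE).

    `layer_map` pairs student layers with teacher layers, e.g. [(0, 0), (1, 2), ...];
    the weights and temperature T are assumptions, not TinyWave's actual values.
    """
    # Temperature-scaled soft-label KL on the vocabulary distribution
    kl = F.kl_div(
        F.log_softmax(student_out.logits / T, dim=-1),
        F.softmax(teacher_out.logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # Match intermediate representations on the aligned layer pairs
    # (assumes equal hidden sizes / head counts, or projections applied upstream)
    hid = sum(F.mse_loss(student_out.hidden_states[s], teacher_out.hidden_states[t])
              for s, t in layer_map)
    att = sum(F.mse_loss(student_out.attentions[s], teacher_out.attentions[t])
              for s, t in layer_map)

    return w_kl * kl + w_hid * hid + w_att * att
```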
---
## 📎 Citation
```bibtex
@article{nouriborji2025tinywave,
title={Efficient Interleaved Speech Modeling through Knowledge Distillation},
author={Nouriborji, Mohammadmahdi and Rohanian, Morteza},
journal={arXiv preprint arXiv:2506.23670},
year={2025}
}
```
---
## 📂 Resources
* 🔗 [Project Page](https://mohammadmahdinoori.github.io/tinywave-landing/)
* 💬 [Demo Samples](https://mohammadmahdinoori.github.io/tinywave-landing/#samples)
* 🧠 [Training & Codebase](https://github.com/mohammadmahdinoori/TinyWave)