# TinyWave Expressive Speech 2B
**TinyWave Expressive Speech 2B** is a compact speech-to-speech language model distilled from the 7B SPIRIT-LM-Expressive teacher. It is trained to generate rich, expressive spoken language—capturing prosody, emotion, and speaker variation—purely from speech inputs.
Using a HuBERT-based discrete tokenizer augmented with **pitch and style tokens**, this model offers **high-fidelity expressive generation** with just 2B parameters, making it ideal for **low-latency deployment** in storytelling, assistive speech technologies, and interactive voice systems.
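Conceptually, an encoded utterance is a single flat string that interleaves phonetic, pitch, and style units. A rough illustration with made-up IDs (the exact vocabulary and ordering come from the SPIRIT-LM expressive tokenizer):
```python
# Illustrative only: interleaved HuBERT ([Hu*]), pitch ([Pi*]), and style ([St*]) tokens
example_tokens = "[St71][Pi39][Hu99][Hu49][Pi48][Hu38][Hu149]"
```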
> 📖 For details, see the [TinyWave paper (arXiv:2506.23670)](https://arxiv.org/abs/2506.23670) and [project website](https://mohammadmahdinoori.github.io/tinywave-landing/).
---
## 🔧 Usage
This model requires **SPIRIT-LM's expressive speech tokenizer** for both encoding and decoding HuBERT-based audio tokens.
### 1. Clone SPIRIT-LM and Install Dependencies
```bash
git clone https://github.com/facebookresearch/spiritlm
cd spiritlm
pip install -e '.[eval]'
```
---
### 2. Load Tokenizer
```python
from spiritlm.speech_tokenizer import spiritlm_expressive
speech_tokenizer = spiritlm_expressive()
```
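As a quick sanity check, you can encode a short clip directly; `encode_string` returns the flat token string the model consumes (the file path here is a placeholder):
```python
import torchaudio

# Hypothetical 16 kHz mono clip; replace with your own recording.
wav, sr = torchaudio.load("prompt.wav")
tokens = speech_tokenizer.encode_string(wav.view(1, 1, -1).float())
print(tokens[:120])
```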
---
### 3. Inference Code (Speech-to-Speech)
```python
from transformers import LlamaForCausalLM, AutoTokenizer
from spiritlm.speech_tokenizer import spiritlm_expressive
import torchaudio
import torch

# Load the distilled 2B model and its token vocabulary
MODEL_PATH = "tinywave/speech-expressive-2b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = LlamaForCausalLM.from_pretrained(MODEL_PATH, device_map="auto", torch_dtype=torch.bfloat16)

# Load the expressive speech tokenizer (HuBERT units + pitch and style tokens)
speech_tokenizer = spiritlm_expressive()

def get_inference(audio_path):
    # Read the prompt audio and reshape to (batch, channel, samples)
    audio, _ = torchaudio.load(audio_path)
    input_values = audio.view(1, 1, -1).to(speech_tokenizer.hubert_model.device).float()

    # Encode the waveform into a flat string of discrete speech tokens
    tokens = speech_tokenizer.encode_string(input_values)

    # Continue the token stream with the language model
    input_ids = tokenizer(tokens, return_tensors="pt").input_ids.to(model.device)
    output = model.generate(input_ids, max_new_tokens=256, top_p=0.9, temperature=0.9, do_sample=True)
    return tokenizer.decode(output[0])
```
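The helper returns the generated token string; a minimal usage sketch (the prompt path is a placeholder):
```python
# Any short spoken sentence works as a prompt (hypothetical file name).
generated_output = get_inference("prompt.wav")
print(generated_output[:120])  # flat string of discrete speech tokens
```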
---
### 4. Decode to WAV
```python
import numpy as np
from scipy.io.wavfile import write

def save_array_to_wav_int16(audio_array: np.ndarray, sampling_rate=16000, filename="output.wav"):
    # Peak-normalize and convert to 16-bit PCM before writing to disk
    scaled = np.int16(audio_array / np.max(np.abs(audio_array)) * 32767)
    write(filename, sampling_rate, scaled)

# `generated_output` is the token string produced by get_inference() in step 3.
# Strip spaces and the BOS/EOS markers before decoding back to a waveform.
clean_tokens = generated_output.replace(" ", "").replace("<s>", "").replace("</s>", "")
decoded_audio = speech_tokenizer.decode(clean_tokens, speaker_id=2)
save_array_to_wav_int16(decoded_audio, filename="generated.wav")
```
---
## 🗣️ Inference Examples
### 🎧 Expressive Speech Continuation
* **Input:** a spoken sentence (`.wav`)
* **Output:** an expressive continuation that preserves the prompt's tone, pitch, and speaking style (see the combined example below)
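Putting steps 2–4 together into one round trip (file names are placeholders):
```python
# Speech prompt -> token continuation -> waveform, using the helpers above.
generated = get_inference("story_prompt.wav")
clean = generated.replace(" ", "").replace("<s>", "").replace("</s>", "")
waveform = speech_tokenizer.decode(clean, speaker_id=2)
save_array_to_wav_int16(waveform, filename="continuation.wav")
```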
---
## 🧠 Model Details
| Feature | Description |
| ------------------- | ----------------------------------------------- |
| Architecture | 2B parameter distilled transformer |
| Tokenizer | SPIRIT-LM Expressive (HuBERT + prosody) |
| Input Type | Discrete HuBERT tokens only (speech-only) |
| Output Type | Discrete audio tokens (speech continuation) |
| Teacher Model | SPIRIT-LM-Expressive 7B |
| Tasks | Expressive speech continuation |
| Distillation Method | Layer-aligned: hidden states, attention, logits |
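The last row of the table can be made concrete with a small sketch. This is illustrative only, not the TinyWave training code: it assumes HF-style outputs exposing `.logits`, `.hidden_states`, and `.attentions`, matching hidden sizes between student and teacher (or a projection applied beforehand), and made-up loss weights:
```python
import torch.nn.functional as F

def distill_loss(student_out, teacher_out, layer_map, T=2.0, w_kl=1.0, w_hid=1.0, w_att=1.0):
    """Illustrative layer-aligned KD loss: logits (KL) + hidden states (MSE) + attentions (MSE).

    `layer_map` pairs student layers with teacher layers, e.g. [(0, 0), (1, 2), ...];
    the weights and temperature T are assumptions, not TinyWave's actual values.
    """
    # Temperature-scaled soft-label KL on the vocabulary distribution
    kl = F.kl_div(
        F.log_softmax(student_out.logits / T, dim=-1),
        F.softmax(teacher_out.logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # Match intermediate representations on the aligned layer pairs
    # (assumes equal hidden sizes / head counts, or projections applied upstream)
    hid = sum(F.mse_loss(student_out.hidden_states[s], teacher_out.hidden_states[t])
              for s, t in layer_map)
    att = sum(F.mse_loss(student_out.attentions[s], teacher_out.attentions[t])
              for s, t in layer_map)

    return w_kl * kl + w_hid * hid + w_att * att
```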
---
## 📎 Citation
```bibtex
@article{nouriborji2025tinywave,
title={Efficient Interleaved Speech Modeling through Knowledge Distillation},
author={Nouriborji, Mohammadmahdi and Rohanian, Morteza},
journal={arXiv preprint arXiv:2506.23670},
year={2025}
}
```
---
## 📂 Resources
* 🔗 [Project Page](https://mohammadmahdinoori.github.io/tinywave-landing/)
* 💬 [Demo Samples](https://mohammadmahdinoori.github.io/tinywave-landing/#samples)
* 🧠 [Training & Codebase](https://github.com/mohammadmahdinoori/TinyWave)