Update README.md

README.md CHANGED

@@ -2,39 +2,68 @@
**Authors:** Nadav Har-Tuv, Or Tal, Yossi Adi  
**Affiliation:** The Hebrew University of Jerusalem  
-
-
-
-

---

-## …
-

```bash
-git clone https://github.com/…
-cd past
conda create -n past_env python=3.10 -y
conda activate past_env
pip install -r requirements.txt
```

-### …

```python
from past.models.past_model import PastModel
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
-model = PastModel.from_pretrained("…
-print("Sample rate:", model.sample_rate)
-```

-### Run on Audio
-
import torchaudio

def read_one_wav(path, target_sr):
@@ -52,32 +81,10 @@ with torch.no_grad():
    reconstructed = model.decode(codes, scale)
```

-### Listen and Evaluate
-
-```python
-from IPython.display import Audio, display
-display(Audio(wav.cpu().numpy().squeeze(), rate=model.sample_rate))
-display(Audio(reconstructed.cpu().numpy().squeeze(), rate=model.sample_rate))
-
-# Evaluate
-from audiocraft.losses.sisnr import SISNR
-from pypesq import pesq
-
-sisnr_val = SISNR(sample_rate=model.sample_rate)(reconstructed, wav)
-pesq_val = pesq(wav.squeeze().cpu().numpy(), reconstructed.squeeze().cpu().numpy(), model.sample_rate)
-
-print(f"PESQ: {pesq_val:.2f}, SI-SNR: {sisnr_val:.2f}")
-```
-
----

-
-
-- **Reconstruct** audio from tokens (no vocoder needed)
-- **Use tokens** in speech language modeling tasks
-- **Evaluate** token quality (PESQ, SI-SNR, ABX, PNMI)
-- Use the **streamable variant** for real-time applications

---
@@ -85,26 +92,34 @@ print(f"PESQ: {pesq_val:.2f}, SI-SNR: {sisnr_val:.2f}")
### Phonetic Information

-| Tokenizer …
-| …
-| …
-| …
-| …

### Reconstruction Quality

-| Tokenizer …
-| …
-| EnCodec …
-| …
-| …

### Speech Language Modeling (sWUGGY)

-| Tokenizer …
-| …
-| …
-| …

---
@@ -121,11 +136,3 @@ print(f"PESQ: {pesq_val:.2f}, SI-SNR: {sisnr_val:.2f}")
}
```

----
-
-## Abstract and Figure
-
-> **Abstract:**  
-We present **PAST**, a novel end-to-end framework that jointly models phonetic information alongside signal reconstruction, eliminating the need for external pretrained models. [...] Results demonstrate that PAST surpasses existing tokenizers across phonetic representation, speech reconstruction, and language modeling. We also introduce a **streamable variant** for real-time use.
-

Updated README.md:

**Authors:** Nadav Har-Tuv, Or Tal, Yossi Adi  
**Affiliation:** The Hebrew University of Jerusalem  

[Paper PDF](https://arxiv.org/abs/2505.14470) | [Project Page](https://pages.cs.huji.ac.il/adiyoss-lab/PAST/) | [Code](https://github.com/slp-rl/PAST)

**Abstract:**

We present PAST, a novel end-to-end framework that jointly models phonetic information alongside signal reconstruction, eliminating the need for external pretrained models. Unlike previous approaches that rely on pretrained self-supervised models, PAST employs supervised phonetic data, directly integrating domain knowledge into the tokenization process via auxiliary tasks. Additionally, we introduce a streamable, causal variant of PAST, enabling real-time speech applications. Results demonstrate that PAST surpasses existing evaluated baseline tokenizers across common evaluation metrics, including phonetic representation and speech reconstruction. Notably, PAST also achieves superior performance when serving as a speech representation for speech language models, further highlighting its effectiveness as a foundation for spoken language generation.

## Samples

Audio samples are available on our [project demo page](https://pages.cs.huji.ac.il/adiyoss-lab/PAST/).

## Model List

| Model | Variant | Description |
|:------|:--------|:------------|
| `PAST` | Full | PAST model trained on LibriSpeech + TIMIT |
| `PAST_streamable` | Streamable | Causal variant with 20 ms look-ahead |
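Both variants load through the same `PastModel.from_pretrained` call shown in the Inference section below; here is a minimal sketch of switching between them. The checkpoint identifiers are assumptions inferred from that example's `"PAST.th"` argument and its `['PAST', 'PAST_streamable']` comment, not confirmed by this README.

```python
# Hypothetical sketch: choose the full model for offline use, or the
# causal, 20 ms look-ahead variant for streaming / real-time applications.
# Checkpoint names below are assumed, not confirmed by this README.
import torch
from past.models.past_model import PastModel

device = "cuda" if torch.cuda.is_available() else "cpu"

checkpoint = "PAST.th"               # full model
# checkpoint = "PAST_streamable.th"  # streamable variant (assumed file name)

model = PastModel.from_pretrained(checkpoint, device=device)
print("Sample rate:", model.sample_rate)
```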
---

## Usage

### Pre-requisites

Install:

```bash
conda create -n past_env python=3.10 -y
conda activate past_env
pip install git+https://github.com/slp-rl/PAST.git
```

Or clone and install from source:

```bash
git clone https://github.com/slp-rl/PAST.git
cd PAST
conda create -n past_env python=3.10 -y
conda activate past_env
pip install -r requirements.txt
```
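Either route can be sanity-checked by importing the package from the new environment; a minimal check (it only verifies the install, nothing is downloaded):

```python
# Quick post-install check: the model class used throughout this README
# should be importable from the past_env environment created above.
import torch
from past.models.past_model import PastModel

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("PastModel found in module:", PastModel.__module__)
```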
### Inference

```python
# ---------------
# load PAST model
# ---------------
from past.models.past_model import PastModel
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = PastModel.from_pretrained("PAST.th", device=device)  # one of ['PAST', 'PAST_streamable']

# ----------------------------------------------------------------------
# Run on audio: PAST expects a batched input format [Batch, Channels, T]
# ----------------------------------------------------------------------
import torchaudio

def read_one_wav(path, target_sr):
    # ... (unchanged lines collapsed in this diff) ...

with torch.no_grad():
    # ... (unchanged lines collapsed in this diff) ...
    reconstructed = model.decode(codes, scale)
```
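Because the middle of that snippet is collapsed in this diff, here is a minimal end-to-end sketch of the same flow. `read_one_wav`, `model.sample_rate`, `with torch.no_grad():`, and `model.decode(codes, scale)` appear in the README; the body of `read_one_wav`, the example paths, and the `model.encode` call returning `(codes, scale)` are assumptions (an EnCodec-style API implied by the decode call).

```python
import torch
import torchaudio
from past.models.past_model import PastModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model = PastModel.from_pretrained("PAST.th", device=device)

def read_one_wav(path, target_sr):
    # Assumed helper body: load, resample to the model's rate, add a batch dim.
    wav, sr = torchaudio.load(path)
    if sr != target_sr:
        wav = torchaudio.functional.resample(wav, sr, target_sr)
    return wav.unsqueeze(0)  # [Batch, Channels, T]

wav = read_one_wav("sample.wav", model.sample_rate).to(device)  # placeholder path

with torch.no_grad():
    codes, scale = model.encode(wav)           # assumed encode -> (codes, scale)
    reconstructed = model.decode(codes, scale)

torchaudio.save("reconstructed.wav", reconstructed.squeeze(0).cpu(), model.sample_rate)
```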
### Evaluation

See the [Eval README](https://github.com/slp-rl/PAST/eval_readme.md).
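An earlier revision of this README also showed a small in-Python check of the reconstruction; a minimal sketch along the same lines (assumes `audiocraft` and `pypesq` are installed, and reuses `model`, `wav`, and `reconstructed` from the inference example above):

```python
# Compare the reconstruction against the original waveform with
# SI-SNR (audiocraft) and PESQ (pypesq), as in the earlier quick start.
from audiocraft.losses.sisnr import SISNR
from pypesq import pesq

sisnr_val = SISNR(sample_rate=model.sample_rate)(reconstructed, wav)
pesq_val = pesq(
    wav.squeeze().cpu().numpy(),
    reconstructed.squeeze().cpu().numpy(),
    model.sample_rate,
)

print(f"PESQ: {pesq_val:.2f}, SI-SNR: {sisnr_val:.2f}")
```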
---

### Phonetic Information

| **Tokenizer**         | **PNMI ↑** | **ABX ↓ Within** | **ABX ↓ Across** | **WER ↓ Clean** | **WER ↓ Other** |
|-----------------------|------------|------------------|------------------|-----------------|-----------------|
| D. HuBERT 500         | 0.67       | 3.91             | 4.73             | 11.3            | 24.7            |
| SpeechTokenizer       | 0.72       | 3.43             | 4.50             | 18.5            | 41.3            |
| X-Codec               | 0.40       | 9.42             | 12.6             | 17.1            | 37.1            |
| **PAST**              | **0.75**   | **2.82**         | **3.54**         | 15.7            | 36.8            |
| **PAST - Streamable** | 0.74       | 3.05             | 3.89             | **14.3**        | **32.3**        |

### Reconstruction Quality

| **Tokenizer**         | **SISNR ↑** | **VISQOL ↑** | **PESQ ↑** |
|-----------------------|-------------|--------------|------------|
| EnCodec               | 7.49        | 4.48         | 3.88       |
| SpeechTokenizer       | 0.44        | 4.38         | 3.15       |
| X-Codec               | -7.12       | **4.46**     | 3.33       |
| **PAST**              | **4.84**    | 4.40         | **3.55**   |
| **PAST - Streamable** | 3.90        | 4.37         | 3.40       |

### Speech Language Modeling (sWUGGY)

| **Tokenizer**         | **sWUGGY ↑ Inter** | **sWUGGY ↑ OOV** |
|-----------------------|--------------------|------------------|
| EnCodec               | 56.3               | 53.7             |
| D. HuBERT 500         | 67.9               | 55.4             |
| SpeechTokenizer       | 63.7               | 55.6             |
| X-Codec               | 55.1               | 52.9             |
| **PAST**              | **71.8**           | **57.5**         |
| **PAST - Streamable** | 70.2               | 56.3             |

---
