ortal1602 committed on
Commit
341ef1d
·
verified ·
1 Parent(s): 8e7e856

Update README.md

Files changed (1)
  1. README.md +67 -60
README.md CHANGED
@@ -2,39 +2,68 @@
2
 
3
  **Authors:** Nadav Har-Tuv, Or Tal, Yossi Adi
4
  **Affiliation:** The Hebrew University of Jerusalem
5
- 📄 [Paper PDF](https://huggingface.co/path/to/pdf) | 🌐 [Project Page](https://pastpaper2025.github.io/past) | 📦 [Model Repo](https://huggingface.co/username/past-model)
6
- 🧠 **Abstract:** See below
7
- 📸 **Figure:** See below
8
- 📊 Sample results and evaluation: See tables below
9
 
10
  ---
11
 
12
- ## 🧭 Quick Start
13
 
14
- ### 📥 Clone and Set Up
15
 
16
  ```bash
17
- git clone https://github.com/yourname/past.git
18
- cd past
19
  conda create -n past_env python=3.10 -y
20
  conda activate past_env
21
  pip install -r requirements.txt
22
  ```
23
 
24
- ### 🚀 Load the Model
25
 
26
  ```python
 
 
 
 
27
  from past.models.past_model import PastModel
28
  import torch
29
 
30
  device = "cuda" if torch.cuda.is_available() else "cpu"
31
- model = PastModel.from_pretrained("path/to/checkpoint.th", device=device)
32
- print("Sample rate:", model.sample_rate)
33
- ```
34
 
35
- ### 🔊 Run on Audio
36
 
37
- ```python
 
 
38
  import torchaudio
39
 
40
  def read_one_wav(path, target_sr):
@@ -52,32 +81,10 @@ with torch.no_grad():
52
  reconstructed = model.decode(codes, scale)
53
  ```
54
 
55
- ### 🎧 Listen and Evaluate
56
-
57
- ```python
58
- from IPython.display import Audio, display
59
- display(Audio(wav.cpu().numpy().squeeze(), rate=model.sample_rate))
60
- display(Audio(reconstructed.cpu().numpy().squeeze(), rate=model.sample_rate))
61
-
62
- # Evaluate
63
- from audiocraft.losses.sisnr import SISNR
64
- from pypesq import pesq
65
-
66
- sisnr_val = SISNR(sample_rate=model.sample_rate)(reconstructed, wav)
67
- pesq_val = pesq(wav.squeeze().cpu().numpy(), reconstructed.squeeze().cpu().numpy(), model.sample_rate)
68
-
69
- print(f"PESQ: {pesq_val:.2f}, SI-SNR: {sisnr_val:.2f}")
70
- ```
71
-
72
- ---
73
 
74
- ## 📌 What You Can Do
75
 
76
- 🎙️ **Tokenize** audio into discrete phonetic-acoustic tokens
77
- 🔁 **Reconstruct** audio from tokens (no vocoder needed)
78
- 🧠 **Use tokens** in speech language modeling tasks
79
- 📊 **Evaluate** token quality (PESQ, SI-SNR, ABX, PNMI)
80
- 🛰️ Use the **streamable variant** for real-time applications
81
 
82
  ---
83
 
@@ -85,26 +92,34 @@ print(f"PESQ: {pesq_val:.2f}, SI-SNR: {sisnr_val:.2f}")
85
 
86
  ### 🧠 Phonetic Information
87
 
88
- | Tokenizer | PNMI ↑ | ABX↓ (W/A) | WER ↓ |
89
- |------------------|--------|------------|--------|
90
- | Deep HuBERT 500 | 0.67 | 3.91 / 4.73| 11.3 / 24.7 |
91
- | **PAST** | **0.75** | **2.82 / 3.54** | 15.7 / 36.8 |
92
- | PAST Streamable | 0.74 | 3.05 / 3.89| **14.3 / 32.3** |
 
 
93
 
94
 ### 🔊 Reconstruction Quality
95
 
96
- | Tokenizer | SI-SNR ↑ | ViSQOL ↑ | PESQ ↑ |
97
- |------------------|----------|-----------|--------|
98
- | EnCodec | **7.49** | 4.48 | 3.88 |
99
- | PAST | 4.84 | 4.40 | 3.55 |
100
- | PAST Streamable | 3.90 | 4.37 | 3.40 |
 
 
101
 
102
 ### 📖 Speech Language Modeling (sWUGGY)
103
 
104
- | Tokenizer | Inter ↑ | OOV ↑ |
105
- |------------------|---------|--------|
106
- | PAST | **71.8** | **57.5** |
107
- | PAST Streamable | 70.2 | 56.3 |
 
 
 
 
108
 
109
  ---
110
 
@@ -121,11 +136,3 @@ print(f"PESQ: {pesq_val:.2f}, SI-SNR: {sisnr_val:.2f}")
121
  }
122
  ```
123
 
124
- ---
125
-
126
- ## πŸ–ΌοΈ Abstract and Figure
127
-
128
- > **Abstract:**
129
- We present **PAST**, a novel end-to-end framework that jointly models phonetic information alongside signal reconstruction, eliminating the need for external pretrained models. [...] Results demonstrate that PAST surpasses existing tokenizers across phonetic representation, speech reconstruction, and language modeling. We also introduce a **streamable variant** for real-time use.
130
-
131
- ![Figure 1: PAST pipeline](path/to/figure.png)
 
2
 
3
  **Authors:** Nadav Har-Tuv, Or Tal, Yossi Adi
4
  **Affiliation:** The Hebrew University of Jerusalem
5
+
6
+ 📄 [Paper PDF](https://arxiv.org/abs/2505.14470) | 🌐 [Project Page](https://pages.cs.huji.ac.il/adiyoss-lab/PAST/) | 💻 [Code](https://github.com/slp-rl/PAST)
7
+
8
+ ![Schematic of the PAST pipeline. The auxiliary heads use the output of the first vector quantization module as input.](PAST_figure.png)
9
+
10
+
11
+ 🧠 **Abstract:**
12
+
13
+ We present PAST, a novel end-to-end framework that jointly models phonetic information alongside signal reconstruction, eliminating the need for external pretrained models. Unlike previous approaches that rely on pretrained self-supervised models, PAST employs supervised phonetic data, directly integrating domain knowledge into the tokenization process via auxiliary tasks. Additionally, we introduce a streamable, causal variant of PAST, enabling real-time speech applications. Results demonstrate that PAST surpasses existing evaluated baseline tokenizers across common evaluation metrics, including phonetic representation and speech reconstruction. Notably, PAST also achieves superior performance when serving as a speech representation for speech language models, further highlighting its effectiveness as a foundation for spoken language generation.
14
+
15
+
16
+ ## Samples
17
+ Audio samples are available on our [project demo page](https://pages.cs.huji.ac.il/adiyoss-lab/PAST/).
18
+
19
+ ## Model List
20
+ | Model | Variant | Description |
21
+ |:------|:--------|:------------|
22
+ | `PAST` | Full | PAST model trained on LibriSpeech + TIMIT |
23
+ | `PAST_streamable` | Streamable | Causal variant with 20ms look-ahead |
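
The 20 ms look-ahead listed for the streamable variant can be turned into a sample count once the sample rate is known. The sketch below only does that arithmetic. `lookahead_samples` is a hypothetical helper written for this note and is not part of the PAST package, and the 16 kHz rate in the usage note is an assumed example value; the real rate should come from the loaded model (the earlier README printed it via `model.sample_rate`).

```python
def lookahead_samples(lookahead_ms: float, sample_rate: int) -> int:
    """Convert a look-ahead given in milliseconds to a number of audio samples.

    Hypothetical helper for illustration; not part of the PAST code base.
    """
    if lookahead_ms < 0 or sample_rate <= 0:
        raise ValueError("lookahead_ms must be >= 0 and sample_rate must be > 0")
    # samples = seconds * samples_per_second
    return round(lookahead_ms * sample_rate / 1000)
```

Under the assumed 16 kHz rate, `lookahead_samples(20, 16_000)` gives 320 samples of future context per decision.
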
24
 
25
  ---
26
 
27
+ ## Usage
28
+
29
+ ### 📥 Pre-requisites
30
+
31
+ Install
32
+
33
+ ```bash
34
+ conda create -n past_env python=3.10 -y
35
+ conda activate past_env
36
+ pip install git+https://github.com/slp-rl/PAST.git
37
+
38
+ ```
39
+
40
 
41
+ Clone
42
 
43
  ```bash
44
+ git clone https://github.com/slp-rl/PAST.git
 
45
  conda create -n past_env python=3.10 -y
46
  conda activate past_env
47
  pip install -r requirements.txt
48
  ```
49
 
50
+ ### 🚀 Inference
51
 
52
  ```python
53
+ # ---------------
54
+ # load PAST model
55
+ # ---------------
56
+
57
  from past.models.past_model import PastModel
58
  import torch
59
 
60
  device = "cuda" if torch.cuda.is_available() else "cpu"
61
+ model = PastModel.from_pretrained("PAST.th", device=device) # one of ['PAST', 'PAST_streamable']
 
 
62
 
 
63
 
64
+ # ----------------------------------------------------------------------
65
+ # Run on audio: PAST expects a batched input format [Batch, Channels, T]
66
+ # ----------------------------------------------------------------------
67
  import torchaudio
68
 
69
  def read_one_wav(path, target_sr):
 
81
  reconstructed = model.decode(codes, scale)
82
  ```
83
 
84
 
85
+ ### Evaluation
86
 
87
+ See [Eval README](https://github.com/slp-rl/PAST/eval_readme.md)
 
 
 
 
88
 
89
  ---
90
 
 
92
 
93
  ### 🧠 Phonetic Information
94
 
95
+ | **Tokenizer** | **PNMI ↑** | **ABX ↓ Within** | **ABX ↓ Across** | **WER ↓ Clean** | **WER ↓ Other** |
96
+ |------------------------|------------|------------------|------------------|------------------|------------------|
97
+ | D. HuBERT 500 | 0.67 | 3.91 | 4.73 | 11.3 | 24.7 |
98
+ | SpeechTokenizer | 0.72 | 3.43 | 4.50 | 18.5 | 41.3 |
99
+ | X-Codec | 0.40 | 9.42 | 12.6 | 17.1 | 37.1 |
100
+ | **PAST** | **0.75** | **2.82** | **3.54** | 15.7 | 36.8 |
101
+ | **PAST - Streamable** | 0.74 | 3.05 | 3.89 | **14.3** | **32.3** |
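
The WER columns above are word error rates in percent. As a reminder of how the metric itself is defined (not of how the PAST evaluation produces its transcripts, which is covered by the Eval README), here is a minimal sketch. `word_error_rate` is an illustrative helper written for this note; it splits on whitespace and applies no text normalization, which a real evaluation would likely add.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper; multiply the result by 100 to get a percentage.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For example, comparing the reference "the cat sat on the mat" with the hypothesis "the cat sit on mat" counts one substitution and one deletion over six reference words.
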
102
 
103
 ### 🔊 Reconstruction Quality
104
 
105
+ | **Tokenizer** | **SISNR ↑** | **VISQOL ↑** | **PESQ ↑** |
106
+ |-------------------------|-------------|--------------|------------|
107
+ | EnCodec | 7.49 | 4.48 | 3.88 |
108
+ | SpeechTokenizer | 0.44 | 4.38 | 3.15 |
109
+ | X-Codec | -7.12 | **4.46** | 3.33 |
110
+ | **PAST** | **4.84** | 4.40 | **3.55** |
111
+ | **PAST - Streamable** | 3.90 | 4.37 | 3.40 |
112
 
113
 ### 📖 Speech Language Modeling (sWUGGY)
114
 
115
+ | **Tokenizer** | **sWUGGY ↑ Inter** | **sWUGGY ↑ OOV** |
116
+ |-------------------------|--------------------|------------------|
117
+ | EnCodec | 56.3 | 53.7 |
118
+ | D. HuBERT 500 | 67.9 | 55.4 |
119
+ | SpeechTokenizer | 63.7 | 55.6 |
120
+ | X-Codec | 55.1 | 52.9 |
121
+ | **PAST** | **71.8** | **57.5** |
122
+ | **PAST - Streamable** | 70.2 | 56.3 |
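
sWUGGY is a spot-the-word task: a speech language model scores a real word and a matched non-word, and a pair counts as correct when the real word gets the higher score. The sketch below computes that accuracy from precomputed scores. `spot_the_word_accuracy` is a hypothetical helper written for this note, the scores are assumed to be log-likelihoods from some language model, and counting ties as errors is a simplification that may differ from the official scoring.

```python
def spot_the_word_accuracy(pairs):
    """Percentage of (word_score, nonword_score) pairs where the word wins.

    Hypothetical helper; ties are counted as errors in this sketch.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one pair is required")
    correct = sum(1 for word_score, nonword_score in pairs if word_score > nonword_score)
    return 100.0 * correct / len(pairs)
```

With four pairs of which three favor the real word, the function returns 75.0, on the same 0 to 100 scale as the table.
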
123
 
124
  ---
125
 
 
136
  }
137
  ```
138