Update README.md

README.md CHANGED

@@ -2,39 +2,68 @@
**Authors:** Nadav Har-Tuv, Or Tal, Yossi Adi  
**Affiliation:** The Hebrew University of Jerusalem  
-
-
-
-

---

-## …
-

```bash
-git clone https://github.com/…
-cd past
conda create -n past_env python=3.10 -y
conda activate past_env
pip install -r requirements.txt
```

-### …

```python
from past.models.past_model import PastModel
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
-model = PastModel.from_pretrained("…
-print("Sample rate:", model.sample_rate)
-```

-### Run on Audio
-
import torchaudio

def read_one_wav(path, target_sr):
@@ -52,32 +81,10 @@ with torch.no_grad():
    reconstructed = model.decode(codes, scale)
```

-### Listen and Evaluate
-
-```python
-from IPython.display import Audio, display
-display(Audio(wav.cpu().numpy().squeeze(), rate=model.sample_rate))
-display(Audio(reconstructed.cpu().numpy().squeeze(), rate=model.sample_rate))
-
-# Evaluate
-from audiocraft.losses.sisnr import SISNR
-from pypesq import pesq
-
-sisnr_val = SISNR(sample_rate=model.sample_rate)(reconstructed, wav)
-pesq_val = pesq(wav.squeeze().cpu().numpy(), reconstructed.squeeze().cpu().numpy(), model.sample_rate)
-
-print(f"PESQ: {pesq_val:.2f}, SI-SNR: {sisnr_val:.2f}")
-```
-
----

-
-
-- **Reconstruct** audio from tokens (no vocoder needed)
-- **Use tokens** in speech language modeling tasks
-- **Evaluate** token quality (PESQ, SI-SNR, ABX, PNMI)
-- Use the **streamable variant** for real-time applications

---
@@ -85,26 +92,34 @@ print(f"PESQ: {pesq_val:.2f}, SI-SNR: {sisnr_val:.2f}")
### Phonetic Information

-| Tokenizer …
-| …
-| …
-| …
-| …

### Reconstruction Quality

-| Tokenizer …
-| …
-| EnCodec …
-| …
-| …

### Speech Language Modeling (sWUGGY)

-| Tokenizer …
-| …
-| …
-| …

---
@@ -121,11 +136,3 @@ print(f"PESQ: {pesq_val:.2f}, SI-SNR: {sisnr_val:.2f}")
}
```

----
-
-## Abstract and Figure
-
-> **Abstract:**  
-We present **PAST**, a novel end-to-end framework that jointly models phonetic information alongside signal reconstruction, eliminating the need for external pretrained models. [...] Results demonstrate that PAST surpasses existing tokenizers across phonetic representation, speech reconstruction, and language modeling. We also introduce a **streamable variant** for real-time use.
-

Updated README.md:

**Authors:** Nadav Har-Tuv, Or Tal, Yossi Adi  
**Affiliation:** The Hebrew University of Jerusalem  

[Paper PDF](https://arxiv.org/abs/2505.14470) | [Project Page](https://pages.cs.huji.ac.il/adiyoss-lab/PAST/) | [Code](https://github.com/slp-rl/PAST)

**Abstract:**

We present PAST, a novel end-to-end framework that jointly models phonetic information alongside signal reconstruction, eliminating the need for external pretrained models. Unlike previous approaches that rely on pretrained self-supervised models, PAST employs supervised phonetic data, directly integrating domain knowledge into the tokenization process via auxiliary tasks. Additionally, we introduce a streamable, causal variant of PAST, enabling real-time speech applications. Results demonstrate that PAST surpasses existing evaluated baseline tokenizers across common evaluation metrics, including phonetic representation and speech reconstruction. Notably, PAST also achieves superior performance when serving as a speech representation for speech language models, further highlighting its effectiveness as a foundation for spoken language generation.

## Samples

Audio samples are available on our [project demo page](https://pages.cs.huji.ac.il/adiyoss-lab/PAST/).

## Model List

| Model | Variant | Description |
|:------|:--------|:------------|
| `PAST` | Full | PAST model trained on LibriSpeech + TIMIT |
| `PAST_streamable` | Streamable | Causal variant with 20 ms look-ahead |
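Both variants load through the same `PastModel.from_pretrained` call shown in the Inference section below; here is a minimal sketch of switching between them. The checkpoint identifiers are assumptions inferred from that example's `"PAST.th"` argument and its `['PAST', 'PAST_streamable']` comment, not confirmed by this README.

```python
# Hypothetical sketch: choose the full model for offline use, or the
# causal, 20 ms look-ahead variant for streaming / real-time applications.
# Checkpoint names below are assumed, not confirmed by this README.
import torch
from past.models.past_model import PastModel

device = "cuda" if torch.cuda.is_available() else "cpu"

checkpoint = "PAST.th"               # full model
# checkpoint = "PAST_streamable.th"  # streamable variant (assumed file name)

model = PastModel.from_pretrained(checkpoint, device=device)
print("Sample rate:", model.sample_rate)
```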
---

## Usage

### Pre-requisites

Install:

```bash
conda create -n past_env python=3.10 -y
conda activate past_env
pip install git+https://github.com/slp-rl/PAST.git
```

Or clone and install from source:

```bash
git clone https://github.com/slp-rl/PAST.git
cd PAST
conda create -n past_env python=3.10 -y
conda activate past_env
pip install -r requirements.txt
```
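Either route can be sanity-checked by importing the package from the new environment; a minimal check (it only verifies the install, nothing is downloaded):

```python
# Quick post-install check: the model class used throughout this README
# should be importable from the past_env environment created above.
import torch
from past.models.past_model import PastModel

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("PastModel found in module:", PastModel.__module__)
```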
### Inference

```python
# ---------------
# load PAST model
# ---------------
from past.models.past_model import PastModel
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = PastModel.from_pretrained("PAST.th", device=device)  # one of ['PAST', 'PAST_streamable']

# ----------------------------------------------------------------------
# Run on audio: PAST expects a batched input format [Batch, Channels, T]
# ----------------------------------------------------------------------
import torchaudio

def read_one_wav(path, target_sr):
    # ... (unchanged lines collapsed in this diff) ...

with torch.no_grad():
    # ... (unchanged lines collapsed in this diff) ...
    reconstructed = model.decode(codes, scale)
```
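Because the middle of that snippet is collapsed in this diff, here is a minimal end-to-end sketch of the same flow. `read_one_wav`, `model.sample_rate`, `with torch.no_grad():`, and `model.decode(codes, scale)` appear in the README; the body of `read_one_wav`, the example paths, and the `model.encode` call returning `(codes, scale)` are assumptions (an EnCodec-style API implied by the decode call).

```python
import torch
import torchaudio
from past.models.past_model import PastModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model = PastModel.from_pretrained("PAST.th", device=device)

def read_one_wav(path, target_sr):
    # Assumed helper body: load, resample to the model's rate, add a batch dim.
    wav, sr = torchaudio.load(path)
    if sr != target_sr:
        wav = torchaudio.functional.resample(wav, sr, target_sr)
    return wav.unsqueeze(0)  # [Batch, Channels, T]

wav = read_one_wav("sample.wav", model.sample_rate).to(device)  # placeholder path

with torch.no_grad():
    codes, scale = model.encode(wav)           # assumed encode -> (codes, scale)
    reconstructed = model.decode(codes, scale)

torchaudio.save("reconstructed.wav", reconstructed.squeeze(0).cpu(), model.sample_rate)
```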
### Evaluation

See the [Eval README](https://github.com/slp-rl/PAST/eval_readme.md).
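An earlier revision of this README also showed a small in-Python check of the reconstruction; a minimal sketch along the same lines (assumes `audiocraft` and `pypesq` are installed, and reuses `model`, `wav`, and `reconstructed` from the inference example above):

```python
# Compare the reconstruction against the original waveform with
# SI-SNR (audiocraft) and PESQ (pypesq), as in the earlier quick start.
from audiocraft.losses.sisnr import SISNR
from pypesq import pesq

sisnr_val = SISNR(sample_rate=model.sample_rate)(reconstructed, wav)
pesq_val = pesq(
    wav.squeeze().cpu().numpy(),
    reconstructed.squeeze().cpu().numpy(),
    model.sample_rate,
)

print(f"PESQ: {pesq_val:.2f}, SI-SNR: {sisnr_val:.2f}")
```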
---

### Phonetic Information

| **Tokenizer**         | **PNMI ↑** | **ABX ↓ Within** | **ABX ↓ Across** | **WER ↓ Clean** | **WER ↓ Other** |
|-----------------------|------------|------------------|------------------|-----------------|-----------------|
| D. HuBERT 500         | 0.67       | 3.91             | 4.73             | 11.3            | 24.7            |
| SpeechTokenizer       | 0.72       | 3.43             | 4.50             | 18.5            | 41.3            |
| X-Codec               | 0.40       | 9.42             | 12.6             | 17.1            | 37.1            |
| **PAST**              | **0.75**   | **2.82**         | **3.54**         | 15.7            | 36.8            |
| **PAST - Streamable** | 0.74       | 3.05             | 3.89             | **14.3**        | **32.3**        |

### Reconstruction Quality

| **Tokenizer**         | **SISNR ↑** | **VISQOL ↑** | **PESQ ↑** |
|-----------------------|-------------|--------------|------------|
| EnCodec               | 7.49        | 4.48         | 3.88       |
| SpeechTokenizer       | 0.44        | 4.38         | 3.15       |
| X-Codec               | -7.12       | **4.46**     | 3.33       |
| **PAST**              | **4.84**    | 4.40         | **3.55**   |
| **PAST - Streamable** | 3.90        | 4.37         | 3.40       |

### Speech Language Modeling (sWUGGY)

| **Tokenizer**         | **sWUGGY ↑ Inter** | **sWUGGY ↑ OOV** |
|-----------------------|--------------------|------------------|
| EnCodec               | 56.3               | 53.7             |
| D. HuBERT 500         | 67.9               | 55.4             |
| SpeechTokenizer       | 63.7               | 55.6             |
| X-Codec               | 55.1               | 52.9             |
| **PAST**              | **71.8**           | **57.5**         |
| **PAST - Streamable** | 70.2               | 56.3             |

---
