vampnet-conditional-music-generation

lllindsey0615 committed on Aug 18

Commit

a189727

1 Parent(s): 78b8054

initial commit

This view is limited to 50 files because it contains too many changes.

Files changed (50)

.DS_Store +0 -0
DEFAULT_HF_MODEL_REPO +1 -0
DEFAULT_MODEL +1 -0
LICENSE +21 -0
README.md +243 -7
TODOS +1 -0
app.py +760 -0
assets/.DS_Store +0 -0
conf/c2f.yml +14 -0
conf/generated/cat/c2f.yml +15 -0
conf/generated/cat/coarse.yml +8 -0
conf/generated/cat/interface.yml +6 -0
conf/generated/cat10/c2f.yml +15 -0
conf/generated/cat10/coarse.yml +8 -0
conf/generated/cat10/interface.yml +6 -0
conf/generated/ivo/c2f.yml +15 -0
conf/generated/ivo/coarse.yml +8 -0
conf/generated/ivo/interface.yml +6 -0
conf/generated/lazaro-ros-sep/c2f.yml +15 -0
conf/generated/lazaro-ros-sep/coarse.yml +8 -0
conf/generated/lazaro-ros-sep/interface.yml +6 -0
conf/generated/lazaro-ros/c2f.yml +15 -0
conf/generated/lazaro-ros/coarse.yml +8 -0
conf/generated/lazaro-ros/interface.yml +6 -0
conf/generated/le-poisson-steve/c2f.yml +15 -0
conf/generated/le-poisson-steve/coarse.yml +8 -0
conf/generated/le-poisson-steve/interface.yml +6 -0
conf/generated/march-31/c2f.yml +15 -0
conf/generated/march-31/coarse.yml +8 -0
conf/generated/march-31/interface.yml +6 -0
conf/generated/sax-new/c2f.yml +15 -0
conf/generated/sax-new/coarse.yml +8 -0
conf/generated/sax-new/interface.yml +6 -0
conf/generated/saxophone/c2f.yml +15 -0
conf/generated/saxophone/coarse.yml +8 -0
conf/generated/saxophone/interface.yml +6 -0
conf/interface.yml +10 -0
conf/lora/lora-s2s.yml +27 -0
conf/lora/lora.yml +22 -0
conf/salad_bowl.yml +0 -0
conf/vampnet.yml +49 -0
hello.py +48 -0
requirements.txt +11 -0
scratch/convert_to_wav.sh +1 -0
scratch/rms_mask.txt +14 -0
scratch/separate_folder.sh +1 -0
scripts/exp/eval.py +110 -0
scripts/exp/experiment.py +254 -0
scripts/exp/export.py +75 -0
scripts/exp/fine_tune.py +87 -0
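
The `conf/generated/` entries above share one layout per fine-tune name: `c2f.yml`, `coarse.yml`, and `interface.yml`. Here is a minimal sketch, assuming that layout and the `runs/<fine_tune_name>/coarse` and `runs/<fine_tune_name>/c2f` save paths described in the README, of how one might enumerate the expected locations for a given name. The helper `finetune_paths` is hypothetical and not part of this repository.

```python
from pathlib import Path


def finetune_paths(name: str, root: Path = Path(".")) -> dict[str, Path]:
    """Return the config and checkpoint locations expected for one fine-tune."""
    conf_dir = root / "conf" / "generated" / name  # layout taken from the file list
    run_dir = root / "runs" / name                 # save paths named in the README
    return {
        "coarse_conf": conf_dir / "coarse.yml",
        "c2f_conf": conf_dir / "c2f.yml",
        "interface_conf": conf_dir / "interface.yml",
        "coarse_ckpt": run_dir / "coarse",
        "c2f_ckpt": run_dir / "c2f",
    }


if __name__ == "__main__":
    for key, path in finetune_paths("saxophone").items():
        print(f"{key}: {path.as_posix()}")
```

A check like `finetune_paths(name)["coarse_conf"].is_file()` could then confirm that `fine_tune.py` has already generated the configs before a training job is launched.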

.DS_Store ADDED

Binary file (6.15 kB).

DEFAULT_HF_MODEL_REPO ADDED

@@ -0,0 +1 @@
+hugggof/vampnet

DEFAULT_MODEL ADDED

@@ -0,0 +1 @@
+default
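
`DEFAULT_HF_MODEL_REPO` and `DEFAULT_MODEL` are both one-line text files, and `app.py` reads the latter with `open("DEFAULT_MODEL").read().strip()`. The sketch below shows one way such a file could be read with a fallback when it is missing or empty. The helper `read_setting` is hypothetical, and the fallback values are simply the contents committed here.

```python
from pathlib import Path


def read_setting(path: str, fallback: str) -> str:
    """Read a one-line settings file, falling back if it is missing or empty."""
    p = Path(path)
    if not p.is_file():
        return fallback
    value = p.read_text(encoding="utf-8").strip()
    return value or fallback


if __name__ == "__main__":
    print(read_setting("DEFAULT_MODEL", fallback="default"))
    print(read_setting("DEFAULT_HF_MODEL_REPO", fallback="hugggof/vampnet"))
```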

LICENSE ADDED

	@@ -0,0 +1,21 @@

+MIT License
+Copyright (c) 2023 Hugo Flores García and Prem Seetharaman
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md CHANGED

@@ -1,13 +1,249 @@
 ---
-title: Vampnet Music HARP V3
-emoji: 🐢
-colorFrom: red
-colorTo: yellow
+title: salad bowl (vampnet)
+emoji: 🥗
+colorFrom: yellow
+colorTo: green
 sdk: gradio
-sdk_version: 5.42.0
+sdk_version: 5.23.2
+python_version: 3.11
 app_file: app.py
 pinned: false
-short_description: Wrapped VampNet model for HARP3
+license: cc-by-nc-4.0
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# VampNet
+# Table of contents
+- [setting up](#setting-up)
+- [programmatic usage](#programmatic-usage)
+- [launching the web app](#launching-the-web-app)
+- [training / fine-tuning](#training--fine-tuning)
+  - [training a model](#training-a-model)
+  - [debugging training](#debugging-training)
+  - [fine-tuning](#fine-tuning)
+- [exporting your model](#exporting-your-model)
+- [unloop](#unloop)
+- [token telephone](#token-telephone)
+- [a note on argbind](#a-note-on-argbind)
+- [take a look at the pretrained models](#take-a-look-at-the-pretrained-models)
+- [licensing for pretrained models](#licensing-for-pretrained-models)
+## setting up
+python 3.9-3.11 works well. (for example, using conda)
+```bash
+conda create -n vampnet python=3.9
+conda activate vampnet
+```
+install VampNet
+```bash
+git clone https://github.com/hugofloresgarcia/vampnet.git
+pip install -e ./vampnet
+```
+## programmatic usage
+quick start!
+```python
+import random
+import vampnet
+import audiotools as at
+# load the default vampnet model
+interface = vampnet.interface.Interface.default()
+# list available finetuned models
+finetuned_model_choices = interface.available_models()
+print(f"available finetuned models: {finetuned_model_choices}")
+# pick a random finetuned model
+model_choice = random.choice(finetuned_model_choices)
+print(f"choosing model: {model_choice}")
+# load a finetuned model
+interface.load_finetuned(model_choice)
+# load an example audio file
+signal = at.AudioSignal("assets/example.wav")
+# get the tokens for the audio
+codes = interface.encode(signal)
+# build a mask for the audio
+mask = interface.build_mask(
+    codes, signal,
+    periodic_prompt=7,
+    upper_codebook_mask=3,
+)
+# generate the output tokens
+output_tokens = interface.vamp(
+    codes, mask, return_mask=False,
+    temperature=1.0,
+    typical_filtering=True,
+)
+# convert them to a signal
+output_signal = interface.decode(output_tokens)
+# save the output signal
+output_signal.write("scratch/output.wav")
+```
+# Launching the Web app
+You can launch a gradio UI to play with vampnet.
+```bash
+python app.py
+```
+# Training / Fine-tuning
+## Training a model
+To train a model, run the following script:
+```bash
+python scripts/exp/train.py --args.load conf/vampnet.yml --save_path /path/to/checkpoints
+```
+for multi-gpu training, use torchrun:
+```bash
+torchrun --nproc_per_node gpu scripts/exp/train.py --args.load conf/vampnet.yml --save_path path/to/ckpt
+```
+You can edit `conf/vampnet.yml` to change the dataset paths or any training hyperparameters.
+For coarse2fine models, you can use `conf/c2f.yml` as a starting configuration.
+See `python scripts/exp/train.py -h` for a list of options.
+## Debugging training
+To debug training, it's easiest to use 1 GPU and 0 workers:
+```bash
+CUDA_VISIBLE_DEVICES=0 python -m pdb scripts/exp/train.py --args.load conf/vampnet.yml --save_path /path/to/checkpoints --num_workers 0
+```
+# Fine-tuning
+To fine-tune a model, use the script in `scripts/exp/fine_tune.py`
+for an audio folder
+```bash
+python scripts/exp/fine_tune.py /path/to/audio/folder <fine_tune_name>
+```
+for multiple files
+```bash
+python scripts/exp/fine_tune.py "/path/to/audio1.mp3 /path/to/audio2/ /path/to/audio3.wav" <fine_tune_name>
+```
+This creates configuration files for a fine-tuning training job. The save_paths will be set to `runs/<fine_tune_name>/coarse` and `runs/<fine_tune_name>/c2f`.
+launch the coarse job:
+```bash
+python scripts/exp/train.py --args.load conf/generated/<fine_tune_name>/coarse.yml
+```
+this will save the coarse model to `runs/<fine_tune_name>/coarse/ckpt/best/`.
+launch the c2f job:
+```bash
+python scripts/exp/train.py --args.load conf/generated/<fine_tune_name>/c2f.yml
+```
+# Resuming a Training/Finetuning Job from checkpoint.
+To resume from checkpoint, use the `--resume` flag and the `--save_path` to point to the checkpoint you want to resume from.
+```bash
+python scripts/exp/train.py --args.load conf/generated/steve/coarse.yml --save_path runs/steve/coarse --resume
+```
+# Exporting your model
+Once your model has been fine-tuned, you can export it to a HuggingFace model.
+In order to use your model in `app.py`, you will need to export it to HuggingFace.
+**NOTE**: In order to export, you will need a [huggingface account](https://huggingface.co/).
+Now, log in to huggingface using the command line:
+```bash
+huggingface-cli login
+```
+replace the contents of the file named `./DEFAULT_HF_MODEL_REPO` with your `<HUGGINGFACE_USERNAME>/vampnet`. A model repo will be automatically created for you with `export.py`. The default is `hugggof/vampnet`.
+for example, if my username is `hugggof`, I would run the following command:
+```bash
+echo 'hugggof/vampnet' > ./DEFAULT_HF_MODEL_REPO
+```
+Now, run the following command to export your model (replace `<your_finetuned_model_name>` with the name of your model):
+```bash
+python scripts/exp/export.py --name <your_finetuned_model_name> --model latest
+```
+Once that's done, your model should appear on the list of available models in the gradio interface.
+Simply run `python app.py` and select your model from the dropdown list.
+# Unloop
+Make sure you have Max installed on your laptop!
+**NOTE**: To run unloop (with a GPU-powered server), you will need to install the vampnet repo in both your local machine and your GPU server.
+## start a vampnet gradio server
+First, **on your GPU server**, run the gradio server:
+```bash
+python app.py --args.load conf/interface.yml --Interface.device cuda
+```
+This will run a vampnet gradio API on your GPU server. Copy the address. It will be something like `https://127.0.0.1:7860/`.
+**IMPORTANT** Make sure that this gradio port (by default `7860`) is forwarded to your local machine, where you have Max installed.
+## start the unloop gradio client
+Now, **on your local machine**, run the unloop gradio client.
+```bash
+cd unloop
+pip install -r requirements.txt
+python client.py --vampnet_url https://127.0.0.1:7860/ # replace with your gradio server address
+```
+This will start a gradio client that connects to the gradio server running on your GPU server.
+## start the unloop Max patch
+Now, open the unloop Max patch. It's located at `unloop/max/unloop.maxpat`.
+In the tape controls, check the heartbeat (`<3`) to make sure the connection to the local gradio client is working.
+have fun!
+# Token Telephone
+Instructions forthcoming, but the sauce is in `token_telephone/tt.py`
+## A note on argbind
+This repository relies on [argbind](https://github.com/pseeth/argbind) to manage CLIs and config files.
+Config files are stored in the `conf/` folder.
+### Take a look at the pretrained models
+All the pretrained models (trained by hugo) are stored here: https://huggingface.co/hugggof/vampnet
+### Licensing for Pretrained Models:
+The weights for the models are licensed [`CC BY-NC-SA 4.0`](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.ml). Likewise, any VampNet models fine-tuned on the pretrained models are also licensed [`CC BY-NC-SA 4.0`](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.ml).
+Download the pretrained models from [this link](https://zenodo.org/record/8136629). Then, extract the models to the `models/` folder.
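
The README's quick start passes `periodic_prompt=7` and `upper_codebook_mask=3` to `interface.build_mask`. As a rough illustration of what such a mask could look like, here is a NumPy sketch built on three assumptions:

- the convention suggested by `app.py`, where 1 marks a token to regenerate and 0 marks a token kept as a prompt;
- a `(batch, n_codebooks, time)` layout, like the `(1, 14, 80)` tensor used for the mask preview;
- a simplified reading of the two parameters: keep every `periodic_prompt`-th timestep, then regenerate all codebooks at index `upper_codebook_mask` and above.

`toy_mask` is a hypothetical helper, not VampNet's implementation, which may differ in detail.

```python
import numpy as np


def toy_mask(n_codebooks: int, n_steps: int,
             periodic_prompt: int, upper_codebook_mask: int) -> np.ndarray:
    """Build a (1, n_codebooks, n_steps) int mask: 1 = regenerate, 0 = keep."""
    mask = np.ones((1, n_codebooks, n_steps), dtype=np.int64)
    if periodic_prompt > 0:
        # keep (unmask) every periodic_prompt-th timestep as a prompt
        mask[:, :, ::periodic_prompt] = 0
    # regenerate every codebook at or above the cutoff, whatever the timestep
    mask[:, upper_codebook_mask:, :] = 1
    return mask


if __name__ == "__main__":
    m = toy_mask(n_codebooks=14, n_steps=80, periodic_prompt=7, upper_codebook_mask=3)
    kept = int((m == 0).sum())
    print(f"kept {kept} of {m.size} tokens as prompt")
```

Under these assumptions, a larger `periodic_prompt` keeps fewer timesteps and so leaves the model more freedom, while `periodic_prompt=0` keeps nothing. This matches the "unconditional" preset in `app.py`.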

TODOS ADDED

@@ -0,0 +1 @@
+[ ] add sketch2sound finetuning

app.py ADDED

	@@ -0,0 +1,760 @@

+import spaces
+from pathlib import Path
+import yaml
+import time
+import uuid
+import numpy as np
+import audiotools as at
+import argbind
+import shutil
+import torch
+from datetime import datetime
+from pyharp.core import build_endpoint, ModelCard
+from pyharp.labels import OutputLabel, LabelList
+from pyharp.media.audio import save_audio
+import gradio as gr
+from vampnet.interface import Interface, signal_concat
+from vampnet import mask as pmask
+device="cpu"
+print(f"using device {device}\n"*10)
+interface = Interface.default()
+init_model_choice = open("DEFAULT_MODEL").read().strip()
+# load the init model
+interface.load_finetuned(init_model_choice)
+def to_output(sig):
+    return sig.sample_rate, sig.cpu().detach().numpy()[0][0]
+MAX_DURATION_S = 10
+def load_audio(file):
+    print(file)
+    if isinstance(file, str):
+        filepath = file
+    elif isinstance(file, tuple):
+        # not a file
+        sr, samples = file
+        samples = samples / np.iinfo(samples.dtype).max
+        return sr, samples
+    else:
+        filepath = file.name
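+    # NOTE: the salient excerpt computed here is discarded by the reload
+    # below, so the full file (not a MAX_DURATION_S excerpt) is returned.
+    sig = at.AudioSignal.salient_excerpt(
+        filepath, duration=MAX_DURATION_S
+    )
+    sig = at.AudioSignal(filepath)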
+    return to_output(sig)
+def load_example_audio():
+    return load_audio("./assets/example.wav")
+from torch_pitch_shift import pitch_shift, get_fast_shifts
+def shift_pitch(signal, interval: int):
+    signal.samples = pitch_shift(
+        signal.samples,
+        shift=interval,
+        sample_rate=signal.sample_rate
+    )
+    return signal
+def onsets(sig: at.AudioSignal, hop_length: int):
+    assert sig.batch_size == 1, "batch size must be 1"
+    assert sig.num_channels == 1, "mono signals only"
+    import librosa
+    onset_frame_idxs = librosa.onset.onset_detect(
+        y=sig.samples[0][0].detach().cpu().numpy(), sr=sig.sample_rate,
+        hop_length=hop_length,
+        backtrack=True,
+    )
+    return onset_frame_idxs
+@spaces.GPU
+def new_vampnet_mask(self,
+    codes,
+    onset_idxs,
+    width: int = 5,
+    periodic_prompt=2,
+    upper_codebook_mask=1,
+    drop_amt: float = 0.1
+):
+    from vampnet.newmask import mask_and, mask_or, onset_mask, periodic_mask, drop_ones, codebook_mask
+    mask =  mask_and(
+        periodic_mask(codes, periodic_prompt, 1, random_roll=False),
+        mask_or( # this re-masks the onsets, according to a periodic schedule
+            onset_mask(onset_idxs, codes, width=width),
+            periodic_mask(codes, periodic_prompt, 1, random_roll=False),
+        )
+    ).int()
+    # make sure the onset idxs themselves are unmasked
+    # mask = 1 - mask
+    mask[:, :, onset_idxs] = 0
+    mask = mask.cpu() # debug
+    mask = 1-drop_ones(1-mask, drop_amt)
+    mask = codebook_mask(mask, upper_codebook_mask)
+    # save mask as txt (ints)
+    np.savetxt("scratch/rms_mask.txt", mask[0].cpu().numpy(), fmt='%d')
+    mask = mask.to(self.device)
+    return mask[:, :, :]
+@spaces.GPU
+def mask_preview(periodic_p, n_mask_codebooks, onset_mask_width, dropout):
+    # make a mask preview
+    codes = torch.zeros((1, 14, 80)).to(device)
+    mask = interface.build_mask(
+        codes,
+        periodic_prompt=periodic_p,
+        # onset_mask_width=onset_mask_width,
+        _dropout=dropout,
+        upper_codebook_mask=n_mask_codebooks,
+    )
+    # mask = mask.cpu().numpy()
+    import matplotlib.pyplot as plt
+    plt.clf()
+    interface.visualize_codes(mask)
+    plt.title("mask preview")
+    plt.savefig("scratch/mask-prev.png")
+    return "scratch/mask-prev.png"
+@spaces.GPU
+def _vamp_internal(
+        seed, input_audio, model_choice,
+        pitch_shift_amt, periodic_p,
+        n_mask_codebooks, onset_mask_width,
+        dropout, sampletemp, typical_filtering,
+        typical_mass, typical_min_tokens, top_p,
+        sample_cutoff, stretch_factor, sampling_steps, beat_mask_ms, num_feedback_steps, api=False, harp=False
+    ):
+    if torch.cuda.is_available():
+        device = "cuda"
+    elif torch.backends.mps.is_available():
+        device = "mps"
+    else:
+        device = "cpu"
+    print("args!")
+    print(f"seed: {seed}")
+    print(f"input_audio: {input_audio}")
+    print(f"model_choice: {model_choice}")
+    print(f"pitch_shift_amt: {pitch_shift_amt}")
+    print(f"periodic_p: {periodic_p}")
+    print(f"n_mask_codebooks: {n_mask_codebooks}")
+    print(f"onset_mask_width: {onset_mask_width}")
+    print(f"dropout: {dropout}")
+    print(f"sampletemp: {sampletemp}")
+    print(f"typical_filtering: {typical_filtering}")
+    print(f"typical_mass: {typical_mass}")
+    print(f"typical_min_tokens: {typical_min_tokens}")
+    print(f"top_p: {top_p}")
+    print(f"sample_cutoff: {sample_cutoff}")
+    print(f"stretch_factor: {stretch_factor}")
+    print(f"sampling_steps: {sampling_steps}")
+    print(f"api: {api}")
+    print(f"beat_mask_ms: {beat_mask_ms}")
+    print(f"using device {interface.device}")
+    print(f"num feedback steps: {num_feedback_steps}")
+    t0 = time.time()
+    interface.to(device)
+    print(f"using device {interface.device}")
+    _seed = seed if seed > 0 else None
+    if _seed is None:
+        _seed = int(torch.randint(0, 2**32, (1,)).item())
+    at.util.seed(_seed)
+    if input_audio is None:
+        raise gr.Error("no input audio received!")
+    sr, input_audio = input_audio
+    input_audio = input_audio / np.iinfo(input_audio.dtype).max
+    sig = at.AudioSignal(input_audio, sr).to_mono()
+    loudness = sig.loudness()
+    sig = interface._preprocess(sig)
+    # reload the model if necessary
+    interface.load_finetuned(model_choice)
+    if pitch_shift_amt != 0:
+        sig = shift_pitch(sig, pitch_shift_amt)
+    codes = interface.encode(sig)
+    # mask = new_vampnet_mask(
+    #     interface,
+    #     codes,
+    #     onset_idxs=onsets(sig, hop_length=interface.codec.hop_length),
+    #     width=onset_mask_width,
+    #     periodic_prompt=periodic_p,
+    #     upper_codebook_mask=n_mask_codebooks,
+    #     drop_amt=dropout
+    # ).long()
+    mask = interface.build_mask(
+        codes,
+        sig=sig,
+        periodic_prompt=periodic_p,
+        onset_mask_width=onset_mask_width,
+        _dropout=dropout,
+        upper_codebook_mask=n_mask_codebooks,
+    )
+    if beat_mask_ms > 0:
+        # bm = pmask.mask_or(
+        #     pmask.periodic_mask(
+        #         codes, periodic_p, random_roll=False
+        #     ),
+        # )
+        mask = pmask.mask_and(
+            mask, interface.make_beat_mask(
+                sig, after_beat_s=beat_mask_ms/1000.,
+            )
+        )
+        mask = pmask.codebook_mask(mask, n_mask_codebooks)
+    np.savetxt("scratch/rms_mask.txt", mask[0].cpu().numpy(), fmt='%d')
+    interface.set_chunk_size(10.0)
+    # treat a non-positive top_p as "top-p sampling disabled"
+    if top_p is not None and top_p <= 0:
+        top_p = None
+    codes, mask_z = interface.vamp(
+        codes, mask,
+        batch_size=2,
+        feedback_steps=num_feedback_steps,
+        _sampling_steps=sampling_steps,
+        time_stretch_factor=stretch_factor,
+        return_mask=True,
+        temperature=sampletemp,
+        typical_filtering=typical_filtering,
+        typical_mass=typical_mass,
+        typical_min_tokens=typical_min_tokens,
+        top_p=top_p,
+        seed=_seed,
+        sample_cutoff=sample_cutoff,
+    )
+    print(f"vamp took {time.time() - t0} seconds")
+    sig = interface.decode(codes)
+    sig = sig.normalize(loudness)
+    import matplotlib.pyplot as plt
+    plt.clf()
+    # plt.imshow(mask_z[0].cpu().numpy(), aspect='auto
+    interface.visualize_codes(mask)
+    plt.title("actual mask")
+    plt.savefig("scratch/mask.png")
+    plt.clf()
+    if harp:
+        return sig
+    if not api:
+        return to_output(sig[0]), to_output(sig[1]), "scratch/mask.png"
+    else:
+        return to_output(sig[0]), to_output(sig[1])
+@spaces.GPU
+def vamp(input_audio,
+        sampletemp,
+        top_p,
+        periodic_p,
+        dropout,
+        stretch_factor,
+        onset_mask_width,
+        typical_filtering,
+        typical_mass,
+        typical_min_tokens,
+        seed,
+        model_choice,
+        n_mask_codebooks,
+        pitch_shift_amt,
+        sample_cutoff,
+        sampling_steps,
+        beat_mask_ms,
+        num_feedback_steps):
+    return _vamp_internal(
+        seed=seed,
+        input_audio=input_audio,
+        model_choice=model_choice,
+        pitch_shift_amt=pitch_shift_amt,
+        periodic_p=periodic_p,
+        n_mask_codebooks=n_mask_codebooks,
+        onset_mask_width=onset_mask_width,
+        dropout=dropout,
+        sampletemp=sampletemp,
+        typical_filtering=typical_filtering,
+        typical_mass=typical_mass,
+        typical_min_tokens=typical_min_tokens,
+        top_p=top_p,
+        sample_cutoff=sample_cutoff,
+        stretch_factor=stretch_factor,
+        sampling_steps=sampling_steps,
+        beat_mask_ms=beat_mask_ms,
+        num_feedback_steps=num_feedback_steps,
+        api=False,
+    )
+@spaces.GPU
+def api_vamp(input_audio,
+                sampletemp, top_p,
+                periodic_p,
+                dropout,
+                stretch_factor,
+                onset_mask_width,
+                typical_filtering,
+                typical_mass,
+                typical_min_tokens,
+                seed,
+                model_choice,
+                n_mask_codebooks,
+                pitch_shift_amt,
+                sample_cutoff,
+                sampling_steps,
+                beat_mask_ms, num_feedback_steps):
+    return _vamp_internal(
+        seed=seed,
+        input_audio=input_audio,
+        model_choice=model_choice,
+        pitch_shift_amt=pitch_shift_amt,
+        periodic_p=periodic_p,
+        n_mask_codebooks=n_mask_codebooks,
+        onset_mask_width=onset_mask_width,
+        dropout=dropout,
+        sampletemp=sampletemp,
+        typical_filtering=typical_filtering,
+        typical_mass=typical_mass,
+        typical_min_tokens=typical_min_tokens,
+        top_p=top_p,
+        sample_cutoff=sample_cutoff,
+        stretch_factor=stretch_factor,
+        sampling_steps=sampling_steps,
+        beat_mask_ms=beat_mask_ms,
+        num_feedback_steps=num_feedback_steps,
+        api=True,
+    )
+@spaces.GPU
+def harp_vamp(input_audio, sampletemp, periodic_p, dropout, n_mask_codebooks, model_choice, stretch_factor):
+    sig = at.AudioSignal(input_audio).to_mono()
+    input_audio = sig.cpu().detach().numpy()[0][0]
+    input_audio = input_audio * np.iinfo(np.int16).max
+    input_audio = input_audio.astype(np.int16)
+    input_audio = input_audio.reshape(1, -1)
+    input_audio = (sig.sample_rate, input_audio)
+    sig =  _vamp_internal(
+        seed=0,
+        input_audio=input_audio,
+        model_choice=model_choice,
+        pitch_shift_amt=0,
+        periodic_p=int(periodic_p),
+        n_mask_codebooks=int(n_mask_codebooks),
+        onset_mask_width=0,
+        dropout=dropout,
+        sampletemp=sampletemp,
+        typical_filtering=False,
+        typical_mass=0.15,
+        typical_min_tokens=1,
+        top_p=None,
+        sample_cutoff=1.0,
+        stretch_factor=stretch_factor,
+        sampling_steps=36,
+        beat_mask_ms=int(0),
+        num_feedback_steps=1,
+        api=False,
+        harp=True,
+    )
+    ll = LabelList()
+    ll.append(OutputLabel(label='short label', t=0.0, description='longer description'))
+    return save_audio(sig.detach().cpu()), ll
+with gr.Blocks() as demo:
+    with gr.Row():
+        with gr.Column():
+            manual_audio_upload = gr.File(
+                label=f"upload some audio (will be randomly trimmed to max of 100s)",
+                file_types=["audio"]
+            )
+            load_example_audio_button = gr.Button("or load example audio")
+            input_audio = gr.Audio(
+                label="input audio",
+                interactive=False,
+                type="numpy",
+            )
+            # audio_mask = gr.Audio(
+            #     label="audio mask (listen to this to hear the mask hints)",
+            #     interactive=False,
+            #     type="numpy",
+            # )
+            # connect widgets
+            load_example_audio_button.click(
+                fn=load_example_audio,
+                inputs=[],
+                outputs=[ input_audio]
+            )
+            manual_audio_upload.change(
+                fn=load_audio,
+                inputs=[manual_audio_upload],
+                outputs=[ input_audio]
+            )
+        # mask settings
+        with gr.Column():
+            with gr.Accordion("manual controls", open=True):
+                periodic_p = gr.Slider(
+                    label="periodic prompt",
+                    minimum=0,
+                    maximum=13,
+                    step=1,
+                    value=7,
+                )
+                onset_mask_width = gr.Slider(
+                    label="onset mask width (multiplies with the periodic mask, 1 step ~= 10milliseconds) does not affect mask preview",
+                    minimum=0,
+                    maximum=100,
+                    step=1,
+                    value=0, visible=True
+                )
+                beat_mask_ms = gr.Slider(
+                    label="beat mask width (milliseconds) does not affect mask preview",
+                    minimum=0,
+                    maximum=200,
+                    step=1,
+                    value=0,
+                    visible=True
+                )
+                n_mask_codebooks = gr.Slider(
+                    label="compression prompt ",
+                    value=3,
+                    minimum=1,
+                    maximum=14,
+                    step=1,
+                )
+                dropout = gr.Slider(
+                    label="mask dropout",
+                    minimum=0.0,
+                    maximum=1.0,
+                    step=0.01,
+                    value=0.0
+                )
+                num_feedback_steps = gr.Slider(
+                    label="feedback steps (token telephone) -- turn it up for better timbre/rhythm transfer quality, but it's slower!",
+                    minimum=1,
+                    maximum=8,
+                    step=1,
+                    value=1
+                )
+                preset_dropdown = gr.Dropdown(
+                    label="preset",
+                    choices=["timbre transfer", "small variation", "small variation (follow beat)", "medium variation", "medium variation (follow beat)", "large variation", "large variation (follow beat)", "unconditional"],
+                    value="medium variation"
+                )
+                def change_preset(preset_dropdown):
+                    if preset_dropdown == "timbre transfer":
+                        periodic_p = 2
+                        n_mask_codebooks = 1
+                        onset_mask_width = 0
+                        dropout = 0.0
+                        beat_mask_ms = 0
+                    elif preset_dropdown == "small variation":
+                        periodic_p = 5
+                        n_mask_codebooks = 4
+                        onset_mask_width = 0
+                        dropout = 0.0
+                        beat_mask_ms = 0
+                    elif preset_dropdown == "small variation (follow beat)":
+                        periodic_p = 7
+                        n_mask_codebooks = 4
+                        onset_mask_width = 0
+                        dropout = 0.0
+                        beat_mask_ms = 50
+                    elif preset_dropdown == "medium variation":
+                        periodic_p = 7
+                        n_mask_codebooks = 4
+                        onset_mask_width = 0
+                        dropout = 0.0
+                        beat_mask_ms = 0
+                    elif preset_dropdown == "medium variation (follow beat)":
+                        periodic_p = 13
+                        n_mask_codebooks = 4
+                        onset_mask_width = 0
+                        dropout = 0.0
+                        beat_mask_ms = 50
+                    elif preset_dropdown == "large variation":
+                        periodic_p = 13
+                        n_mask_codebooks = 4
+                        onset_mask_width = 0
+                        dropout = 0.2
+                        beat_mask_ms = 0
+                    elif preset_dropdown == "large variation (follow beat)":
+                        periodic_p = 0
+                        n_mask_codebooks = 4
+                        onset_mask_width = 0
+                        dropout = 0.0
+                        beat_mask_ms=80
+                    elif preset_dropdown == "unconditional":
+                        periodic_p=0
+                        n_mask_codebooks=1
+                        onset_mask_width=0
+                        dropout=0.0
+                        beat_mask_ms=0
+                    return periodic_p, n_mask_codebooks, onset_mask_width, dropout, beat_mask_ms
+                preset_dropdown.change(
+                    fn=change_preset,
+                    inputs=[preset_dropdown],
+                    outputs=[periodic_p, n_mask_codebooks, onset_mask_width, dropout, beat_mask_ms]
+                )
+                # preset_dropdown.change(
+            maskimg = gr.Image(
+                label="mask image",
+                interactive=False,
+                type="filepath"
+            )
+            with gr.Accordion("extras ", open=False):
+                pitch_shift_amt = gr.Slider(
+                    label="pitch shift amount (semitones)",
+                    minimum=-12,
+                    maximum=12,
+                    step=1,
+                    value=0,
+                )
+                stretch_factor = gr.Slider(
+                    label="time stretch factor",
+                    minimum=0,
+                    maximum=8,
+                    step=1,
+                    value=1,
+                )
+            with gr.Accordion("sampling settings", open=False):
+                sampletemp = gr.Slider(
+                    label="sample temperature",
+                    minimum=0.1,
+                    maximum=10.0,
+                    value=1.0,
+                    step=0.001
+                )
+                top_p = gr.Slider(
+                    label="top p (0.0 = off)",
+                    minimum=0.0,
+                    maximum=1.0,
+                    value=0.0
+                )
+                typical_filtering = gr.Checkbox(
+                    label="typical filtering ",
+                    value=True
+                )
+                typical_mass = gr.Slider(
+                    label="typical mass (should probably stay between 0.1 and 0.5)",
+                    minimum=0.01,
+                    maximum=0.99,
+                    value=0.15
+                )
+                typical_min_tokens = gr.Slider(
+                    label="typical min tokens (should probably stay between 1 and 256)",
+                    minimum=1,
+                    maximum=256,
+                    step=1,
+                    value=64
+                )
+                sample_cutoff = gr.Slider(
+                    label="sample cutoff",
+                    minimum=0.0,
+                    maximum=1.0,
+                    value=1.0,
+                    step=0.01
+                )
+                sampling_steps = gr.Slider(
+                    label="sampling steps",
+                    minimum=1,
+                    maximum=128,
+                    step=1,
+                    value=36
+                )
+            seed = gr.Number(
+                label="seed (0 for random)",
+                value=0,
+                precision=0,
+            )
+        # mask settings
+        with gr.Column():
+            model_choice = gr.Dropdown(
+                label="model choice",
+                choices=list(interface.available_models()),
+                value=init_model_choice,
+                visible=True
+            )
+            vamp_button = gr.Button("generate (vamp)!!!")
+            audio_outs = []
+            use_as_input_btns = []
+            for i in range(2):
+                with gr.Column():
+                    audio_outs.append(gr.Audio(
+                        label=f"output audio {i+1}",
+                        interactive=False,
+                        type="numpy"
+                    ))
+                    use_as_input_btns.append(
+                        gr.Button(f"use as input (feedback)")
+                    )
+            thank_you = gr.Markdown("")
+            # download all the outputs
+            # download = gr.File(type="filepath", label="download outputs")
+    # mask preview change
+    for widget in (
+        periodic_p, n_mask_codebooks,
+        onset_mask_width, dropout
+    ):
+        widget.change(
+            fn=mask_preview,
+            inputs=[periodic_p, n_mask_codebooks,
+                    onset_mask_width, dropout],
+            outputs=[maskimg]
+        )
+    _inputs = [
+            input_audio,
+            sampletemp,
+            top_p,
+            periodic_p,
+            dropout,
+            stretch_factor,
+            onset_mask_width,
+            typical_filtering,
+            typical_mass,
+            typical_min_tokens,
+            seed,
+            model_choice,
+            n_mask_codebooks,
+            pitch_shift_amt,
+            sample_cutoff,
+            sampling_steps,
+            beat_mask_ms,
+            num_feedback_steps
+    ]
+    # connect widgets
+    vamp_button.click(
+        fn=vamp,
+        inputs=_inputs,
+        outputs=[audio_outs[0], audio_outs[1], maskimg],
+    )
+    api_vamp_button = gr.Button("api vamp", visible=True)
+    api_vamp_button.click(
+        fn=api_vamp,
+        inputs=[input_audio,
+                sampletemp, top_p,
+                periodic_p,
+                dropout,
+                stretch_factor,
+                onset_mask_width,
+                typical_filtering,
+                typical_mass,
+                typical_min_tokens,
+                seed,
+                model_choice,
+                n_mask_codebooks,
+                pitch_shift_amt,
+                sample_cutoff,
+                sampling_steps,
+                beat_mask_ms,
+                num_feedback_steps
+        ],
+        outputs=[audio_outs[0], audio_outs[1]],
+        api_name="vamp"
+    )
+    # NEW: HARP endpoint (new PyHARP API)
+    harp_model_card = ModelCard(
+        name="vampnet",
+        description="generating audio by filling in the blanks.",
+        author="hugo flores garcía et al. (descript/northwestern)",
+        tags=["sound", "generation"]
+    )
+    harp_input_components = [
+        gr.Audio(type="filepath", label="Input Audio").harp_required(True),
+        gr.Slider(label="Sample Temperature", minimum=0.1, maximum=10.0, value=1.0, step=0.001),
+        gr.Slider(label="Periodic Prompt", minimum=0, maximum=13, step=1, value=7),
+        gr.Slider(label="Mask Dropout", minimum=0.0, maximum=1.0, step=0.01, value=0.0),
+        gr.Slider(label="Compression Prompt", value=3, minimum=1, maximum=14, step=1),
+        gr.Dropdown(label="Model Choice", choices=list(interface.available_models()), value=init_model_choice),
+        gr.Slider(label="Time Stretch Factor", minimum=0, maximum=8, step=1, value=1),
+    ]
+    harp_output_components = [
+        gr.Audio(type="filepath", label="Generated Audio"),
+        gr.JSON(label="Generated Labels"),
+    ]
+    harp_app = build_endpoint(
+        model_card=harp_model_card,
+        input_components=harp_input_components,
+        output_components=harp_output_components,
+        process_fn=harp_vamp
+    )
+    with gr.Row():
+        gr.Markdown("### VST / HARP Plugin Controls")
+        for comp in harp_app.values():
+            comp.render()
+try:
+    demo.queue()
+    demo.launch(share=True)
+except KeyboardInterrupt:
+    shutil.rmtree("gradio-outputs", ignore_errors=True)
+    raise

assets/.DS_Store ADDED Viewed

Binary file (6.15 kB). View file

conf/c2f.yml ADDED Viewed

	@@ -0,0 +1,14 @@

+$include:
+  - conf/vampnet.yml
+VampNet.n_codebooks: 14
+VampNet.n_conditioning_codebooks: 4
+VampNet.embedding_dim: 1280
+VampNet.n_layers: 16
+VampNet.n_heads: 20
+AudioDataset.duration: 3.0
+AudioDataset.loudness_cutoff: -40.0

conf/generated/cat/c2f.yml ADDED Viewed

	@@ -0,0 +1,15 @@

+$include:
+- conf/lora/lora.yml
+AudioDataset.duration: 3.0
+AudioDataset.loudness_cutoff: -40.0
+VampNet.embedding_dim: 1280
+VampNet.n_codebooks: 14
+VampNet.n_conditioning_codebooks: 4
+VampNet.n_heads: 20
+VampNet.n_layers: 16
+fine_tune: true
+fine_tune_checkpoint: ./models/vampnet/c2f.pth
+save_path: ./runs/cat/c2f
+train/AudioLoader.sources: &id001
+- scratch/cat-audio
+val/AudioLoader.sources: *id001

conf/generated/cat/coarse.yml ADDED Viewed

	@@ -0,0 +1,8 @@

+$include:
+- conf/lora/lora.yml
+fine_tune: true
+fine_tune_checkpoint: ./models/vampnet/coarse.pth
+save_path: ./runs/cat/coarse
+train/AudioLoader.sources: &id001
+- scratch/cat-audio
+val/AudioLoader.sources: *id001

conf/generated/cat/interface.yml ADDED Viewed

	@@ -0,0 +1,6 @@

+AudioLoader.sources:
+- - scratch/cat-audio
+Interface.coarse2fine_ckpt: ./runs/cat/c2f/latest/vampnet/weights.pth
+Interface.coarse_ckpt: ./runs/cat/coarse/latest/vampnet/weights.pth
+Interface.codec_ckpt: ./models/vampnet/codec.pth
+Interface.wavebeat_ckpt: ./models/wavebeat.pth
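
The `conf/generated/<name>/` directories in this commit all follow the same three-file pattern (`c2f.yml`, `coarse.yml`, `interface.yml`), differing only in the run name and the audio source folder. The sketch below rebuilds that pattern as plain Python dictionaries. The helper `generated_configs` is hypothetical and written only for illustration; it is not the repository's own generation script, and it does not write any files.

```python
def generated_configs(name, sources, base="conf/lora/lora.yml"):
    """Hypothetical helper: mirror the conf/generated/<name>/ layout as dicts."""
    common = {
        "$include": [base],
        "fine_tune": True,
        "train/AudioLoader.sources": sources,
        "val/AudioLoader.sources": sources,
    }
    coarse = {
        **common,
        "fine_tune_checkpoint": "./models/vampnet/coarse.pth",
        "save_path": f"./runs/{name}/coarse",
    }
    # The coarse-to-fine model overrides the architecture and dataset settings.
    c2f = {
        **common,
        "AudioDataset.duration": 3.0,
        "AudioDataset.loudness_cutoff": -40.0,
        "VampNet.embedding_dim": 1280,
        "VampNet.n_codebooks": 14,
        "VampNet.n_conditioning_codebooks": 4,
        "VampNet.n_heads": 20,
        "VampNet.n_layers": 16,
        "fine_tune_checkpoint": "./models/vampnet/c2f.pth",
        "save_path": f"./runs/{name}/c2f",
    }
    # The interface config points at the fine-tuned weights of both runs.
    interface = {
        "AudioLoader.sources": [sources],
        "Interface.coarse2fine_ckpt": f"./runs/{name}/c2f/latest/vampnet/weights.pth",
        "Interface.coarse_ckpt": f"./runs/{name}/coarse/latest/vampnet/weights.pth",
        "Interface.codec_ckpt": "./models/vampnet/codec.pth",
        "Interface.wavebeat_ckpt": "./models/wavebeat.pth",
    }
    return {"c2f": c2f, "coarse": coarse, "interface": interface}


configs = generated_configs("cat", ["scratch/cat-audio"])
print(configs["interface"]["Interface.coarse_ckpt"])
```

Passing the same `sources` list for training and validation corresponds to the `&id001` / `*id001` anchor and alias pair in the YAML files above.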

conf/generated/cat10/c2f.yml ADDED Viewed

	@@ -0,0 +1,15 @@

+$include:
+- conf/lora/lora.yml
+AudioDataset.duration: 3.0
+AudioDataset.loudness_cutoff: -40.0
+VampNet.embedding_dim: 1280
+VampNet.n_codebooks: 14
+VampNet.n_conditioning_codebooks: 4
+VampNet.n_heads: 20
+VampNet.n_layers: 16
+fine_tune: true
+fine_tune_checkpoint: ./models/vampnet/c2f.pth
+save_path: ./runs/cat10/c2f
+train/AudioLoader.sources: &id001
+- scratch/cat-audio-10s
+val/AudioLoader.sources: *id001

conf/generated/cat10/coarse.yml ADDED Viewed

	@@ -0,0 +1,8 @@

+$include:
+- conf/lora/lora.yml
+fine_tune: true
+fine_tune_checkpoint: ./models/vampnet/coarse.pth
+save_path: ./runs/cat10/coarse
+train/AudioLoader.sources: &id001
+- scratch/cat-audio-10s
+val/AudioLoader.sources: *id001

conf/generated/cat10/interface.yml ADDED Viewed

	@@ -0,0 +1,6 @@

+AudioLoader.sources:
+- - scratch/cat-audio-10s
+Interface.coarse2fine_ckpt: ./runs/cat10/c2f/latest/vampnet/weights.pth
+Interface.coarse_ckpt: ./runs/cat10/coarse/latest/vampnet/weights.pth
+Interface.codec_ckpt: ./models/vampnet/codec.pth
+Interface.wavebeat_ckpt: ./models/wavebeat.pth

conf/generated/ivo/c2f.yml ADDED Viewed

	@@ -0,0 +1,15 @@

+$include:
+- conf/lora/lora.yml
+AudioDataset.duration: 3.0
+AudioDataset.loudness_cutoff: -40.0
+VampNet.embedding_dim: 1280
+VampNet.n_codebooks: 14
+VampNet.n_conditioning_codebooks: 4
+VampNet.n_heads: 20
+VampNet.n_layers: 16
+fine_tune: true
+fine_tune_checkpoint: ./models/vampnet/c2f.pth
+save_path: ./runs/ivo/c2f
+train/AudioLoader.sources: &id001
+- ./scratch/miguel/ivo/separated
+val/AudioLoader.sources: *id001

conf/generated/ivo/coarse.yml ADDED Viewed

	@@ -0,0 +1,8 @@

+$include:
+- conf/lora/lora.yml
+fine_tune: true
+fine_tune_checkpoint: ./models/vampnet/coarse.pth
+save_path: ./runs/ivo/coarse
+train/AudioLoader.sources: &id001
+- ./scratch/miguel/ivo/separated
+val/AudioLoader.sources: *id001

conf/generated/ivo/interface.yml ADDED Viewed

	@@ -0,0 +1,6 @@

+AudioLoader.sources:
+- - ./scratch/miguel/ivo/separated
+Interface.coarse2fine_ckpt: ./runs/ivo/c2f/latest/vampnet/weights.pth
+Interface.coarse_ckpt: ./runs/ivo/coarse/latest/vampnet/weights.pth
+Interface.codec_ckpt: ./models/vampnet/codec.pth
+Interface.wavebeat_ckpt: ./models/wavebeat.pth

conf/generated/lazaro-ros-sep/c2f.yml ADDED Viewed

	@@ -0,0 +1,15 @@

+$include:
+- conf/lora/lora.yml
+AudioDataset.duration: 3.0
+AudioDataset.loudness_cutoff: -40.0
+VampNet.embedding_dim: 1280
+VampNet.n_codebooks: 14
+VampNet.n_conditioning_codebooks: 4
+VampNet.n_heads: 20
+VampNet.n_layers: 16
+fine_tune: true
+fine_tune_checkpoint: ./models/vampnet/c2f.pth
+save_path: ./runs/lazaro-ros-sep/c2f
+train/AudioLoader.sources: &id001
+- ./scratch/miguel/lazaro-ros/separated
+val/AudioLoader.sources: *id001

conf/generated/lazaro-ros-sep/coarse.yml ADDED Viewed

	@@ -0,0 +1,8 @@

+$include:
+- conf/lora/lora.yml
+fine_tune: true
+fine_tune_checkpoint: ./models/vampnet/coarse.pth
+save_path: ./runs/lazaro-ros-sep/coarse
+train/AudioLoader.sources: &id001
+- ./scratch/miguel/lazaro-ros/separated
+val/AudioLoader.sources: *id001

conf/generated/lazaro-ros-sep/interface.yml ADDED Viewed

	@@ -0,0 +1,6 @@

+AudioLoader.sources:
+- - ./scratch/miguel/lazaro-ros/separated
+Interface.coarse2fine_ckpt: ./runs/lazaro-ros-sep/c2f/latest/vampnet/weights.pth
+Interface.coarse_ckpt: ./runs/lazaro-ros-sep/coarse/latest/vampnet/weights.pth
+Interface.codec_ckpt: ./models/vampnet/codec.pth
+Interface.wavebeat_ckpt: ./models/wavebeat.pth

conf/generated/lazaro-ros/c2f.yml ADDED Viewed

	@@ -0,0 +1,15 @@

+$include:
+- conf/lora/lora.yml
+AudioDataset.duration: 3.0
+AudioDataset.loudness_cutoff: -40.0
+VampNet.embedding_dim: 1280
+VampNet.n_codebooks: 14
+VampNet.n_conditioning_codebooks: 4
+VampNet.n_heads: 20
+VampNet.n_layers: 16
+fine_tune: true
+fine_tune_checkpoint: ./models/vampnet/c2f.pth
+save_path: ./runs/lazaro-ros/c2f
+train/AudioLoader.sources: &id001
+- ./scratch/miguel/lazaro-ros
+val/AudioLoader.sources: *id001

conf/generated/lazaro-ros/coarse.yml ADDED Viewed

	@@ -0,0 +1,8 @@

+$include:
+- conf/lora/lora.yml
+fine_tune: true
+fine_tune_checkpoint: ./models/vampnet/coarse.pth
+save_path: ./runs/lazaro-ros/coarse
+train/AudioLoader.sources: &id001
+- ./scratch/miguel/lazaro-ros
+val/AudioLoader.sources: *id001

conf/generated/lazaro-ros/interface.yml ADDED Viewed

	@@ -0,0 +1,6 @@

+AudioLoader.sources:
+- - ./scratch/miguel/lazaro-ros
+Interface.coarse2fine_ckpt: ./runs/lazaro-ros/c2f/latest/vampnet/weights.pth
+Interface.coarse_ckpt: ./runs/lazaro-ros/coarse/latest/vampnet/weights.pth
+Interface.codec_ckpt: ./models/vampnet/codec.pth
+Interface.wavebeat_ckpt: ./models/wavebeat.pth

conf/generated/le-poisson-steve/c2f.yml ADDED Viewed

	@@ -0,0 +1,15 @@

+$include:
+- conf/lora/lora.yml
+AudioDataset.duration: 3.0
+AudioDataset.loudness_cutoff: -40.0
+VampNet.embedding_dim: 1280
+VampNet.n_codebooks: 14
+VampNet.n_conditioning_codebooks: 4
+VampNet.n_heads: 20
+VampNet.n_layers: 16
+fine_tune: true
+fine_tune_checkpoint: ./models/vampnet/c2f.pth
+save_path: ./runs/le-poisson-steve/c2f
+train/AudioLoader.sources: &id001
+- scratch/steve
+val/AudioLoader.sources: *id001

conf/generated/le-poisson-steve/coarse.yml ADDED Viewed

	@@ -0,0 +1,8 @@

+$include:
+- conf/lora/lora.yml
+fine_tune: true
+fine_tune_checkpoint: ./models/vampnet/coarse.pth
+save_path: ./runs/le-poisson-steve/coarse
+train/AudioLoader.sources: &id001
+- scratch/steve
+val/AudioLoader.sources: *id001

conf/generated/le-poisson-steve/interface.yml ADDED Viewed

	@@ -0,0 +1,6 @@

+AudioLoader.sources:
+- - scratch/steve
+Interface.coarse2fine_ckpt: ./runs/le-poisson-steve/c2f/latest/vampnet/weights.pth
+Interface.coarse_ckpt: ./runs/le-poisson-steve/coarse/latest/vampnet/weights.pth
+Interface.codec_ckpt: ./models/vampnet/codec.pth
+Interface.wavebeat_ckpt: ./models/wavebeat.pth

conf/generated/march-31/c2f.yml ADDED Viewed

	@@ -0,0 +1,15 @@

+$include:
+- conf/lora/lora.yml
+AudioDataset.duration: 3.0
+AudioDataset.loudness_cutoff: -40.0
+VampNet.embedding_dim: 1280
+VampNet.n_codebooks: 14
+VampNet.n_conditioning_codebooks: 4
+VampNet.n_heads: 20
+VampNet.n_layers: 16
+fine_tune: true
+fine_tune_checkpoint: ./models/vampnet/c2f.pth
+save_path: ./runs/march-31/c2f
+train/AudioLoader.sources: &id001
+- sound-journal-march-31
+val/AudioLoader.sources: *id001

conf/generated/march-31/coarse.yml ADDED Viewed

	@@ -0,0 +1,8 @@

+$include:
+- conf/lora/lora.yml
+fine_tune: true
+fine_tune_checkpoint: ./models/vampnet/coarse.pth
+save_path: ./runs/march-31/coarse
+train/AudioLoader.sources: &id001
+- sound-journal-march-31
+val/AudioLoader.sources: *id001

conf/generated/march-31/interface.yml ADDED Viewed

	@@ -0,0 +1,6 @@

+AudioLoader.sources:
+- - sound-journal-march-31
+Interface.coarse2fine_ckpt: ./runs/march-31/c2f/latest/vampnet/weights.pth
+Interface.coarse_ckpt: ./runs/march-31/coarse/latest/vampnet/weights.pth
+Interface.codec_ckpt: ./models/vampnet/codec.pth
+Interface.wavebeat_ckpt: ./models/wavebeat.pth

conf/generated/sax-new/c2f.yml ADDED Viewed

	@@ -0,0 +1,15 @@

+$include:
+- conf/lora/lora.yml
+AudioDataset.duration: 3.0
+AudioDataset.loudness_cutoff: -40.0
+VampNet.embedding_dim: 1280
+VampNet.n_codebooks: 14
+VampNet.n_conditioning_codebooks: 4
+VampNet.n_heads: 20
+VampNet.n_layers: 16
+fine_tune: true
+fine_tune_checkpoint: ./models/vampnet/c2f.pth
+save_path: ./runs/sax-new/c2f
+train/AudioLoader.sources: &id001
+- ./scratch/miguel/saxophone-new/
+val/AudioLoader.sources: *id001

conf/generated/sax-new/coarse.yml ADDED Viewed

	@@ -0,0 +1,8 @@

+$include:
+- conf/lora/lora.yml
+fine_tune: true
+fine_tune_checkpoint: ./models/vampnet/coarse.pth
+save_path: ./runs/sax-new/coarse
+train/AudioLoader.sources: &id001
+- ./scratch/miguel/saxophone-new/
+val/AudioLoader.sources: *id001

conf/generated/sax-new/interface.yml ADDED Viewed

	@@ -0,0 +1,6 @@

+AudioLoader.sources:
+- - ./scratch/miguel/saxophone-new/
+Interface.coarse2fine_ckpt: ./runs/sax-new/c2f/latest/vampnet/weights.pth
+Interface.coarse_ckpt: ./runs/sax-new/coarse/latest/vampnet/weights.pth
+Interface.codec_ckpt: ./models/vampnet/codec.pth
+Interface.wavebeat_ckpt: ./models/wavebeat.pth

conf/generated/saxophone/c2f.yml ADDED Viewed

	@@ -0,0 +1,15 @@

+$include:
+- conf/lora/lora.yml
+AudioDataset.duration: 3.0
+AudioDataset.loudness_cutoff: -40.0
+VampNet.embedding_dim: 1280
+VampNet.n_codebooks: 14
+VampNet.n_conditioning_codebooks: 4
+VampNet.n_heads: 20
+VampNet.n_layers: 16
+fine_tune: true
+fine_tune_checkpoint: ./models/vampnet/c2f.pth
+save_path: ./runs/saxophone/c2f
+train/AudioLoader.sources: &id001
+- scratch/sounds
+val/AudioLoader.sources: *id001

conf/generated/saxophone/coarse.yml ADDED Viewed

	@@ -0,0 +1,8 @@

+$include:
+- conf/lora/lora.yml
+fine_tune: true
+fine_tune_checkpoint: ./models/vampnet/coarse.pth
+save_path: ./runs/saxophone/coarse
+train/AudioLoader.sources: &id001
+- scratch/sounds
+val/AudioLoader.sources: *id001

conf/generated/saxophone/interface.yml ADDED Viewed

	@@ -0,0 +1,6 @@

+AudioLoader.sources:
+- - scratch/sounds
+Interface.coarse2fine_ckpt: ./runs/saxophone/c2f/latest/vampnet/weights.pth
+Interface.coarse_ckpt: ./runs/saxophone/coarse/latest/vampnet/weights.pth
+Interface.codec_ckpt: ./models/vampnet/codec.pth
+Interface.wavebeat_ckpt: ./models/wavebeat.pth

conf/interface.yml ADDED Viewed

	@@ -0,0 +1,10 @@

+Interface.coarse_ckpt: ./models/vampnet/coarse.pth
+Interface.coarse2fine_ckpt: ./models/vampnet/c2f.pth
+Interface.codec_ckpt: ./models/vampnet/codec.pth
+Interface.coarse_chunk_size_s: 10
+Interface.coarse2fine_chunk_size_s: 3
+Interface.wavebeat_ckpt: ./models/wavebeat.pth
+# AudioLoader.sources:
+#   - /media/CHONK/null

conf/lora/lora-s2s.yml ADDED Viewed

	@@ -0,0 +1,27 @@

+$include:
+  - conf/vampnet.yml
+fine_tune: True
+train/AudioDataset.n_examples: 100000000
+val/AudioDataset.n_examples: 500
+NoamScheduler.warmup: 500
+batch_size: 7
+num_workers: 7
+save_iters: [2000, 4000, 10000, 20000, 40000, 100000]
+sample_freq: 2000
+val_freq: 1000
+AdamW.lr: 0.0001
+# lets us organize sound classes into folders and choose from those sound classes uniformly
+AudioDataset.without_replacement: False
+num_iters: 500000
+# control signals to use as conditioning.
+Sketch2SoundController.ctrl_keys: ['rmsq16',]

conf/lora/lora.yml ADDED Viewed

	@@ -0,0 +1,22 @@

+$include:
+  - conf/vampnet.yml
+fine_tune: True
+train/AudioDataset.n_examples: 100000000
+val/AudioDataset.n_examples: 500
+NoamScheduler.warmup: 500
+batch_size: 7
+num_workers: 7
+save_iters: [2000, 4000, 10000, 20000, 40000, 100000]
+sample_freq: 2000
+val_freq: 1000
+AdamW.lr: 0.0001
+# lets us organize sound classes into folders and choose from those sound classes uniformly
+AudioDataset.without_replacement: False
+num_iters: 500000

conf/salad_bowl.yml ADDED Viewed

File without changes

conf/vampnet.yml ADDED Viewed

	@@ -0,0 +1,49 @@

+codec_ckpt: ./models/vampnet/codec.pth
+save_path: ckpt
+num_iters: 1000000000
+save_iters: [10000, 50000, 100000, 300000, 500000]
+val_idx: [0,1,2,3,4,5,6,7,8,9]
+sample_freq: 10000
+val_freq: 1000
+batch_size: 8
+num_workers: 10
+# Optimization
+amp: false
+CrossEntropyLoss.label_smoothing: 0.1
+AdamW.lr: 0.001
+NoamScheduler.factor: 2.0
+NoamScheduler.warmup: 10000
+VampNet.vocab_size: 1024
+VampNet.n_codebooks: 4
+VampNet.n_conditioning_codebooks: 0
+VampNet.r_cond_dim: 0
+VampNet.noise_mode: mask
+VampNet.embedding_dim: 1280
+VampNet.n_layers: 20
+VampNet.n_heads: 20
+VampNet.flash_attn: false
+VampNet.dropout: 0.1
+AudioLoader.relative_path: ""
+AudioDataset.loudness_cutoff: -30.0
+AudioDataset.without_replacement: true
+AudioLoader.shuffle: true
+AudioDataset.duration: 10.0
+train/AudioDataset.n_examples: 10000000
+train/AudioLoader.sources:
+  - /media/CHONK/hugo/spotdl/audio-train
+val/AudioDataset.n_examples: 2000
+val/AudioLoader.sources:
+  - /media/CHONK/hugo/spotdl/audio-val

hello.py ADDED Viewed

	@@ -0,0 +1,48 @@

+import random
+import vampnet
+import audiotools as at
+# load the default vampnet model
+interface = vampnet.interface.Interface.default()
+# list available finetuned models
+finetuned_model_choices = interface.available_models()
+print(f"available finetuned models: {finetuned_model_choices}")
+# pick a random finetuned model
+model_choice = random.choice(finetuned_model_choices)
+print(f"choosing model: {model_choice}")
+# or pick a specific finetuned model
+print("actually, forcing model: default")
+model_choice = "default"
+# load a finetuned model
+interface.load_finetuned(model_choice)
+# load an example audio file
+signal = at.AudioSignal("assets/example.wav")
+# get the tokens for the audio
+codes = interface.encode(signal)
+# build a mask for the audio
+mask = interface.build_mask(
+    codes, signal,
+    periodic_prompt=13,
+    upper_codebook_mask=3,
+)
+# generate the output tokens
+output_tokens = interface.vamp(
+    codes, mask, return_mask=False,
+    temperature=1.0,
+    typical_filtering=False,
+    debug=True
+)
+# convert them to a signal
+output_signal = interface.decode(output_tokens)
+# save the output signal
+output_signal.write("scratch/output.wav")

requirements.txt ADDED Viewed

	@@ -0,0 +1,11 @@

+torch
+argbind>=0.3.2
+numpy==1.23
+loralib
+wavebeat @ git+https://github.com/hugofloresgarcia/wavebeat
+lac @ git+https://github.com/hugofloresgarcia/lac.git
+descript-audiotools @ git+https://github.com/hugofloresgarcia/audiotools.git
+-e git+https://github.com/audacitorch/pyharp.git@develop#egg=pyharp
+torch_pitch_shift
+gradio
+pydantic==2.10.6

scratch/convert_to_wav.sh ADDED Viewed

	@@ -0,0 +1 @@


+for f in *.mp3; do ffmpeg -i "$f" "${f%.mp3}.wav"; done

scratch/rms_mask.txt ADDED Viewed

	@@ -0,0 +1,14 @@

+0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1
+0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1
+0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1
+1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
+1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
+1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
+1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
+1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
+1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
+1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
+1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
+1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
+1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
+1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
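
The grid in `scratch/rms_mask.txt` has 14 rows of 100 values: the first three rows are 0 at every seventh position and 1 elsewhere, and the remaining rows are all 1. The sketch below reproduces that pattern in plain Python. It assumes, without confirmation from the file itself, that rows are codebooks, columns are time steps, and 1 marks a token to be regenerated while 0 marks a token kept as the prompt. The function name and its parameters are made up for illustration.

```python
def periodic_codebook_mask(n_codebooks=14, n_steps=100, period=7, n_kept_codebooks=3):
    """Build a mask as a list of rows; 1 = masked, 0 = kept (assumed convention)."""
    mask = []
    for codebook in range(n_codebooks):
        if codebook < n_kept_codebooks:
            # Keep one token every `period` steps in the lowest codebooks.
            row = [0 if step % period == 0 else 1 for step in range(n_steps)]
        else:
            # Mask the upper codebooks entirely.
            row = [1] * n_steps
        mask.append(row)
    return mask


mask = periodic_codebook_mask()
for row in mask:
    print(" ".join(str(value) for value in row))
```

With the defaults, the period of 7 and the three kept codebooks match the "Periodic Prompt" and "Compression Prompt" slider defaults in the HARP input components of `app.py`.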

scratch/separate_folder.sh ADDED Viewed

	@@ -0,0 +1 @@


+for f in *.mp3; do demucs "$f" --two-stems=vocals; done

scripts/exp/eval.py ADDED Viewed

	@@ -0,0 +1,110 @@

+from pathlib import Path
+import os
+from functools import partial
+from frechet_audio_distance import FrechetAudioDistance
+import pandas
+import argbind
+import torch
+from tqdm import tqdm
+import audiotools
+from audiotools import AudioSignal
+@argbind.bind(without_prefix=True)
+def eval(
+    exp_dir: str = None,
+    baseline_key: str = "baseline",
+    audio_ext: str = ".wav",
+):
+    assert exp_dir is not None
+    exp_dir = Path(exp_dir)
+    assert exp_dir.exists(), f"exp_dir {exp_dir} does not exist"
+    # set up our metrics
+    # sisdr_loss = audiotools.metrics.distance.SISDRLoss()
+    # stft_loss = audiotools.metrics.spectral.MultiScaleSTFTLoss()
+    mel_loss = audiotools.metrics.spectral.MelSpectrogramLoss()
+    frechet = FrechetAudioDistance(
+        use_pca=False,
+        use_activation=False,
+        verbose=True,
+        audio_load_worker=4,
+    )
+    frechet.model.to("cuda" if torch.cuda.is_available() else "cpu")
+    # figure out what conditions we have
+    conditions = [d.name for d in exp_dir.iterdir() if d.is_dir()]
+    assert baseline_key in conditions, f"baseline_key {baseline_key} not found in {exp_dir}"
+    conditions.remove(baseline_key)
+    print(f"Found {len(conditions)} conditions in {exp_dir}")
+    print(f"conditions: {conditions}")
+    baseline_dir = exp_dir / baseline_key
+    baseline_files = sorted(list(baseline_dir.glob(f"*{audio_ext}")), key=lambda x: int(x.stem))
+    metrics = []
+    for condition in tqdm(conditions):
+        cond_dir = exp_dir / condition
+        cond_files = sorted(list(cond_dir.glob(f"*{audio_ext}")), key=lambda x: int(x.stem))
+        print(f"computing fad for {baseline_dir} and {cond_dir}")
+        frechet_score = frechet.score(baseline_dir, cond_dir)
+        # make sure we have the same number of files
+        num_files = min(len(baseline_files), len(cond_files))
+        baseline_files = baseline_files[:num_files]
+        cond_files = cond_files[:num_files]
+        assert len(list(baseline_files)) == len(list(cond_files)), f"number of files in {baseline_dir} and {cond_dir} do not match. {len(list(baseline_files))} vs {len(list(cond_files))}"
+        def process(baseline_file, cond_file):
+            # make sure the files match (same name)
+            assert baseline_file.stem == cond_file.stem, f"baseline file {baseline_file} and cond file {cond_file} do not match"
+            # load the files
+            baseline_sig = AudioSignal(str(baseline_file))
+            cond_sig = AudioSignal(str(cond_file))
+            cond_sig.resample(baseline_sig.sample_rate)
+            cond_sig.truncate_samples(baseline_sig.length)
+            # if our condition is inpainting, we need to trim the conditioning off
+            if "inpaint" in condition:
+                ctx_amt = float(condition.split("_")[-1])
+                ctx_samples = int(ctx_amt * baseline_sig.sample_rate)
+                print(f"found inpainting condition. trimming off {ctx_samples} samples from {cond_file} and {baseline_file}")
+                cond_sig.trim(ctx_samples, ctx_samples)
+                baseline_sig.trim(ctx_samples, ctx_samples)
+            return {
+                # "sisdr": -sisdr_loss(baseline_sig, cond_sig).item(),
+                # "stft": stft_loss(baseline_sig, cond_sig).item(),
+                "mel": mel_loss(baseline_sig, cond_sig).item(),
+                "frechet": frechet_score,
+                # "visqol": vsq,
+                "condition": condition,
+                "file": baseline_file.stem,
+            }
+        print(f"processing {len(baseline_files)} files in {baseline_dir} and {cond_dir}")
+        metrics.extend(tqdm(map(process, baseline_files, cond_files), total=len(baseline_files)))
+    metric_keys = [k for k in metrics[0].keys() if k not in ("condition", "file")]
+    for mk in metric_keys:
+        stat = pandas.DataFrame(metrics)
+        stat = stat.groupby(['condition'])[mk].agg(['mean', 'count', 'std'])
+        stat.to_csv(exp_dir / f"stats-{mk}.csv")
+    df = pandas.DataFrame(metrics)
+    df.to_csv(exp_dir / "metrics-all.csv", index=False)
+if __name__ == "__main__":
+    args = argbind.parse_args()
+    with argbind.scope(args):
+        eval()
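
For inpainting conditions, `eval.py` reads the context length in seconds from the end of the directory name (for example `inpaint_0.5`) and trims that many samples from both ends of each signal before computing the mel loss. A minimal sketch of just that parsing step is below; `context_samples` is a hypothetical name, and the 44100 Hz sample rate in the usage line is an assumed value, not one taken from the repository.

```python
def context_samples(condition, sample_rate):
    """Samples to trim from each end for an 'inpaint_<seconds>' condition, else 0."""
    if "inpaint" not in condition:
        return 0
    # The context length in seconds is the last underscore-separated field.
    ctx_seconds = float(condition.split("_")[-1])
    return int(ctx_seconds * sample_rate)


for name in ["inpaint_0.5", "inpaint_1.0", "beat_mask_0.075"]:
    print(name, context_samples(name, sample_rate=44100))
```

Because the same amount is trimmed from the left and the right, the total audio removed is twice the value in the condition name.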

scripts/exp/experiment.py ADDED Viewed

	@@ -0,0 +1,254 @@

+from pathlib import Path
+import random
+from typing import List
+import tempfile
+import subprocess
+import argbind
+from tqdm import tqdm
+import torch
+from vampnet.interface import Interface
+from vampnet import mask as pmask
+import audiotools as at
+Interface: Interface = argbind.bind(Interface)
+def calculate_bitrate(
+        interface, num_codebooks,
+        downsample_factor
+    ):
+    bit_width = 10
+    sr = interface.codec.sample_rate
+    hop = interface.codec.hop_size
+    rate = (sr / hop) * ((bit_width * num_codebooks) / downsample_factor)
+    return rate
+def baseline(sig, interface):
+    return interface.preprocess(sig)
+def reconstructed(sig, interface):
+    return interface.decode(
+        interface.encode(sig)
+    )
+def coarse2fine(sig, interface):
+    z = interface.encode(sig)
+    z = z[:, :interface.c2f.n_conditioning_codebooks, :]
+    z = interface.coarse_to_fine(z)
+    return interface.decode(z)
+class CoarseCond:
+    def __init__(self, num_conditioning_codebooks, downsample_factor):
+        self.num_conditioning_codebooks = num_conditioning_codebooks
+        self.downsample_factor = downsample_factor
+    def __call__(self, sig, interface):
+        z = interface.encode(sig)
+        mask = pmask.full_mask(z)
+        mask = pmask.codebook_unmask(mask, self.num_conditioning_codebooks)
+        mask = pmask.periodic_mask(mask, self.downsample_factor)
+        zv = interface.coarse_vamp(z, mask)
+        zv = interface.coarse_to_fine(zv)
+        return interface.decode(zv)
+def opus(sig, interface, bitrate=128):
+    sig = interface.preprocess(sig)
+    with tempfile.NamedTemporaryFile(suffix=".wav") as f:
+        sig.write(f.name)
+        opus_name = Path(f.name).with_suffix(".opus")
+        # convert to opus
+        cmd = [
+            "ffmpeg", "-y", "-i", f.name,
+            "-c:a", "libopus",
+            "-b:a", f"{bitrate}",
+           opus_name
+        ]
+        subprocess.run(cmd, check=True)
+        # convert back to wav
+        output_name = Path(f"{f.name}-opus").with_suffix(".wav")
+        cmd = [
+            "ffmpeg", "-y", "-i", opus_name,
+            output_name
+        ]
+        subprocess.run(cmd, check=True)
+        sig = at.AudioSignal(
+            output_name,
+            sample_rate=sig.sample_rate
+        )
+    return sig
+def mask_ratio_1_step(ratio=1.0):
+    def wrapper(sig, interface):
+        z = interface.encode(sig)
+        mask = pmask.linear_random(z, ratio)
+        zv = interface.coarse_vamp(
+            z,
+            mask,
+            sampling_steps=1,
+        )
+        return interface.decode(zv)
+    return wrapper
+def num_sampling_steps(num_steps=1):
+    def wrapper(sig, interface: Interface):
+        z = interface.encode(sig)
+        mask = pmask.periodic_mask(z, 16)
+        zv = interface.coarse_vamp(
+            z,
+            mask,
+            sampling_steps=num_steps,
+        )
+        zv = interface.coarse_to_fine(zv)
+        return interface.decode(zv)
+    return wrapper
+def beat_mask(ctx_time):
+    def wrapper(sig, interface):
+        beat_mask = interface.make_beat_mask(
+            sig,
+            before_beat_s=ctx_time/2,
+            after_beat_s=ctx_time/2,
+            invert=True
+        )
+        z = interface.encode(sig)
+        zv = interface.coarse_vamp(
+            z, beat_mask
+        )
+        zv = interface.coarse_to_fine(zv)
+        return interface.decode(zv)
+    return wrapper
+def inpaint(ctx_time):
+    def wrapper(sig, interface: Interface):
+        z = interface.encode(sig)
+        mask = pmask.inpaint(z, interface.s2t(ctx_time), interface.s2t(ctx_time))
+        zv = interface.coarse_vamp(z, mask)
+        zv = interface.coarse_to_fine(zv)
+        return interface.decode(zv)
+    return wrapper
+def token_noise(noise_amt):
+    def wrapper(sig, interface: Interface):
+        z = interface.encode(sig)
+        mask = pmask.random(z, noise_amt)
+        z = torch.where(
+            mask,
+            torch.randint_like(z, 0, interface.coarse.vocab_size),
+            z
+        )
+        return interface.decode(z)
+    return wrapper
+EXP_REGISTRY = {}
+EXP_REGISTRY["gen-compression"] = {
+    "baseline": baseline,
+    "reconstructed": reconstructed,
+    "coarse2fine": coarse2fine,
+    **{
+        f"{n}_codebooks_downsampled_{x}x": CoarseCond(num_conditioning_codebooks=n, downsample_factor=x)
+            for (n, x) in (
+                (1, 1), # 1 codebook, no downsampling
+                (4, 4), # 4 codebooks, downsampled 4x
+                (4, 16), # 4 codebooks, downsampled 16x
+                (4, 32), # 4 codebooks, downsampled 32x
+            )
+    },
+    **{
+        f"token_noise_{x}": mask_ratio_1_step(ratio=x)
+            for x in [0.25, 0.5, 0.75]
+    },
+}
+EXP_REGISTRY["sampling-steps"] = {
+    # "codec": reconstructed,
+    **{f"steps_{n}": num_sampling_steps(n)  for n in [1, 4, 12, 36, 64, 72]},
+}
+EXP_REGISTRY["musical-sampling"] = {
+    **{f"beat_mask_{t}": beat_mask(t) for t in [0.075]},
+    **{f"inpaint_{t}": inpaint(t) for t in [0.5, 1.0,]}, # multiply these by 2 (they go left and right)
+}
+@argbind.bind(without_prefix=True)
+def main(
+        sources=[
+            "/media/CHONK/hugo/spotdl/val",
+        ],
+        output_dir: str = "./samples",
+        max_excerpts: int = 2000,
+        exp_type: str = "gen-compression",
+        seed: int = 0,
+        ext: List[str] = [".mp3"],
+    ):
+    at.util.seed(seed)
+    interface = Interface()
+    output_dir = Path(output_dir)
+    output_dir.mkdir(exist_ok=True, parents=True)
+    from audiotools.data.datasets import AudioLoader, AudioDataset
+    loader = AudioLoader(sources=sources, shuffle_state=seed, ext=ext)
+    dataset = AudioDataset(loader,
+        sample_rate=interface.codec.sample_rate,
+        duration=interface.coarse.chunk_size_s,
+        n_examples=max_excerpts,
+        without_replacement=True,
+    )
+    if exp_type in EXP_REGISTRY:
+        SAMPLE_CONDS = EXP_REGISTRY[exp_type]
+    else:
+        raise ValueError(f"Unknown exp_type {exp_type}")
+    indices = list(range(max_excerpts))
+    random.shuffle(indices)
+    for i in tqdm(indices):
+        # if all our files are already there, skip
+        done = []
+        for name in SAMPLE_CONDS:
+            o_dir = Path(output_dir) / name
+            done.append((o_dir / f"{i}.wav").exists())
+        if all(done):
+            continue
+        sig = dataset[i]["signal"]
+        results = {
+            name: cond(sig, interface).cpu()
+            for name, cond in SAMPLE_CONDS.items()
+        }
+        for name, sig in results.items():
+            o_dir = Path(output_dir) / name
+            o_dir.mkdir(exist_ok=True, parents=True)
+            sig.write(o_dir / f"{i}.wav")
+if __name__ == "__main__":
+    args = argbind.parse_args()
+    with argbind.scope(args):
+        main()
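
The loop in `main()` looks up a dictionary of named conditions and skips any excerpt whose outputs already exist for every condition, so an interrupted run can be resumed. Below is a minimal sketch of that pattern in isolation. The names `make_scale`, `REGISTRY`, `pending`, and `run` are hypothetical stand-ins, and plain integers written to text files replace the audio signals and `.wav` outputs of the real script.

```python
from pathlib import Path
import tempfile


def make_scale(factor):
    # Hypothetical condition factory, standing in for e.g. num_sampling_steps(n).
    def wrapper(x):
        return x * factor
    return wrapper


# Named conditions, built with a dict comprehension as in EXP_REGISTRY.
REGISTRY = {f"scale_{k}": make_scale(k) for k in (1, 2, 3)}


def pending(output_dir, index, conditions):
    """Return True unless every condition already has an output for this index."""
    return not all(
        (Path(output_dir) / name / f"{index}.txt").exists() for name in conditions
    )


def run(output_dir, n_items):
    """Apply every condition to every item, skipping items that are already done."""
    written = 0
    for i in range(n_items):
        if not pending(output_dir, i, REGISTRY):
            continue
        for name, cond in REGISTRY.items():
            o_dir = Path(output_dir) / name
            o_dir.mkdir(parents=True, exist_ok=True)
            (o_dir / f"{i}.txt").write_text(str(cond(i)))
            written += 1
    return written


with tempfile.TemporaryDirectory() as tmp:
    first = run(tmp, 4)   # writes one file per (item, condition) pair
    second = run(tmp, 4)  # nothing left to do, so nothing is written
```

Because the skip check requires all outputs to exist, an excerpt with only some of its files present is regenerated for every condition.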

scripts/exp/export.py ADDED

	@@ -0,0 +1,75 @@

+from pathlib import Path
+import shutil
+import argparse
+from vampnet import DEFAULT_HF_MODEL_REPO
+from huggingface_hub import create_repo, repo_exists, HfApi
+parser = argparse.ArgumentParser(description="Export the fine-tuned model to the repo")
+parser.add_argument(
+    "--name", type=str, default="lazaro-ros-sep",
+    help="name of the fine-tuned model to export"
+)
+parser.add_argument(
+    "--model", type=str, default="latest",
+    help="model version to export. check runs/<name> for available versions"
+)
+parser.add_argument(
+    "--repo", type=str, default=DEFAULT_HF_MODEL_REPO,
+    help="name of the repo to export to"
+)
+args = parser.parse_args()
+name = args.name
+version = args.model
+##
+print("~~~~~~~~~~~ vampnet export! ~~~~~~~~~~~~")
+print(f"exporting {name} version {version} to {args.repo}\n")
+run_dir = Path(f"runs/{name}")
+repo_dir = Path("models/vampnet")
+# create our repo
+new_repo = False
+if not repo_exists(args.repo):
+    print(f"repo {args.repo} does not exist, creating it")
+    create_repo(args.repo)
+    new_repo = True
+paths = []
+for part in ("coarse", "c2f"):
+    outdir = repo_dir / "loras" / name
+    outdir.mkdir(parents=True, exist_ok=True)
+    outpath = outdir / f"{part}.pth"
+    path = run_dir / part / version / "vampnet" / "weights.pth"
+    # path.rename(outpath)
+    shutil.copy(path, outpath)
+    paths.append(outpath)
+    print(f"copied {path} to {outpath}")
+print(f"uploading files to {args.repo}")
+# upload files to the repo
+# if it's a new repo, let's add the default models too
+if new_repo:
+    paths.extend([repo_dir / "c2f.pth", repo_dir / "coarse.pth", repo_dir / "codec.pth", repo_dir / "wavebeat.pth"])
+api = HfApi()
+for path in paths:
+    # use forward slashes so the repo path is also valid when run on Windows
+    path_in_repo = path.relative_to(repo_dir).as_posix()
+    print(f"uploading {path} to {args.repo}/{path_in_repo}")
+    api.upload_file(
+        path_or_fileobj=path,
+        path_in_repo=path_in_repo,
+        repo_id=args.repo,
+        token=True,
+        commit_message=f"uploading {path_in_repo}",
+    )
+print("done!!! >::0")
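
The export script derives each file's location in the Hub repo by making the local path relative to `models/vampnet`. The sketch below reproduces only that path arithmetic, with no copying or uploading. The helper `repo_paths` is hypothetical, and `PurePosixPath` is an assumption made here so the result does not depend on the operating system.

```python
from pathlib import PurePosixPath

repo_dir = PurePosixPath("models/vampnet")


def repo_paths(name, new_repo):
    """List the in-repo paths the export would upload for a fine-tuned model."""
    # The two LoRA weight files are always exported.
    paths = [repo_dir / "loras" / name / f"{part}.pth" for part in ("coarse", "c2f")]
    # A freshly created repo also receives the default models.
    if new_repo:
        paths += [
            repo_dir / f
            for f in ("c2f.pth", "coarse.pth", "codec.pth", "wavebeat.pth")
        ]
    return [p.relative_to(repo_dir).as_posix() for p in paths]


existing = repo_paths("lazaro-ros-sep", new_repo=False)
fresh = repo_paths("lazaro-ros-sep", new_repo=True)
print(existing)
print(fresh)
```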

scripts/exp/fine_tune.py ADDED

	@@ -0,0 +1,87 @@

+import argbind
+from pathlib import Path
+import yaml
+from typing import List
+"""example output: (yaml)
+"""
+@argbind.bind(without_prefix=True, positional=True)
+def fine_tune(audio_files_or_folders: List[str], name: str):
+    conf_dir = Path("conf")
+    assert conf_dir.exists(), "conf directory not found. are you in the vampnet directory?"
+    conf_dir = conf_dir / "generated"
+    conf_dir.mkdir(exist_ok=True)
+    finetune_dir = conf_dir / name
+    finetune_dir.mkdir(exist_ok=True)
+    finetune_c2f_conf = {
+        "$include": ["conf/lora/lora.yml"],
+        "fine_tune": True,
+        "train/AudioLoader.sources": audio_files_or_folders,
+        "val/AudioLoader.sources": audio_files_or_folders,
+        "VampNet.n_codebooks": 14,
+        "VampNet.n_conditioning_codebooks": 4,
+        "VampNet.embedding_dim": 1280,
+        "VampNet.n_layers": 16,
+        "VampNet.n_heads": 20,
+        "AudioDataset.duration": 3.0,
+        "AudioDataset.loudness_cutoff": -40.0,
+        "save_path": f"./runs/{name}/c2f",
+        "fine_tune_checkpoint": "./models/vampnet/c2f.pth"
+    }
+    finetune_coarse_conf = {
+        "$include": ["conf/lora/lora.yml"],
+        "fine_tune": True,
+        "train/AudioLoader.sources": audio_files_or_folders,
+        "val/AudioLoader.sources": audio_files_or_folders,
+        "save_path": f"./runs/{name}/coarse",
+        "fine_tune_checkpoint": "./models/vampnet/coarse.pth"
+    }
+    interface_conf = {
+        "Interface.coarse_ckpt": f"./runs/{name}/coarse/latest/vampnet/weights.pth",
+        "Interface.coarse2fine_ckpt": f"./runs/{name}/c2f/latest/vampnet/weights.pth",
+        "Interface.wavebeat_ckpt": "./models/wavebeat.pth",
+        "Interface.codec_ckpt": "./models/vampnet/codec.pth",
+        "AudioLoader.sources": audio_files_or_folders,
+    }
+    # save the confs
+    with open(finetune_dir / "c2f.yml", "w") as f:
+        yaml.dump(finetune_c2f_conf, f)
+    with open(finetune_dir / "coarse.yml", "w") as f:
+        yaml.dump(finetune_coarse_conf, f)
+    with open(finetune_dir / "interface.yml", "w") as f:
+        yaml.dump(interface_conf, f)
+    # print(f"generated confs in {finetune_dir}.
+    # run training jobs with `python scripts/exp/train.py --args.load {finetune_dir}/<c2f/coarse>.yml` ")
+    print(f"generated confs in {finetune_dir}.")
+    print()
+    print("you'll need to run two training jobs, though they can run in parallel on separate GPUs.")
+    print(f"run the coarse job with \n\tpython scripts/exp/train.py --args.load {finetune_dir}/coarse.yml\n")
+    print(f"run the c2f job with \n\tpython scripts/exp/train.py --args.load {finetune_dir}/c2f.yml\n")
+if __name__ == "__main__":
+    args = argbind.parse_args()
+    with argbind.scope(args):
+        fine_tune()
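
As a rough illustration of what `fine_tune()` produces, the sketch below builds the coarse configuration as a plain dictionary and formats the matching training command, without writing any YAML. The helpers `coarse_conf` and `train_command` are hypothetical, as are the inputs `./my-audio` and `my-sounds`.

```python
from pathlib import Path


def coarse_conf(sources, name):
    # Same keys that fine_tune() writes to coarse.yml.
    return {
        "$include": ["conf/lora/lora.yml"],
        "fine_tune": True,
        "train/AudioLoader.sources": list(sources),
        "val/AudioLoader.sources": list(sources),
        "save_path": f"./runs/{name}/coarse",
        "fine_tune_checkpoint": "./models/vampnet/coarse.pth",
    }


def train_command(name, part):
    # Path of the generated conf, following the conf/generated/<name>/ layout.
    conf_path = Path("conf") / "generated" / name / f"{part}.yml"
    return f"python scripts/exp/train.py --args.load {conf_path.as_posix()}"


conf = coarse_conf(["./my-audio"], "my-sounds")  # hypothetical inputs
cmd = train_command("my-sounds", "coarse")
print(conf["save_path"])
print(cmd)
```

Training and validation point at the same sources here, as they do in the script.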