# Fine-tuning

Obviously, when you opened this page, you were not satisfied with the performance of the few-shot pre-trained model. You want to fine-tune a model to improve its performance on your dataset.

In the current version, you only need to fine-tune the `LLAMA` part.

## Fine-tuning LLAMA

### 1. Prepare the dataset

```
.
├── SPK1
│   ├── 21.15-26.44.lab
│   ├── 21.15-26.44.mp3
│   ├── 27.51-29.98.lab
│   ├── 27.51-29.98.mp3
│   ├── 30.1-32.71.lab
│   └── 30.1-32.71.mp3
└── SPK2
    ├── 38.79-40.85.lab
    └── 38.79-40.85.mp3
```

You need to convert your dataset into the above format and place it under `data`. Audio files can have the extension `.mp3`, `.wav`, or `.flac`, and each annotation file should have the `.lab` extension.

!!! info "Dataset Format"
    The `.lab` annotation file only needs to contain the transcription of the audio, with no special formatting required. For example, if `hi.mp3` says "Hello, goodbye," then the `hi.lab` file would contain a single line of text: "Hello, goodbye."

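Before moving on, you may want to verify that every audio clip has a matching transcript. The following shell sketch (assuming the layout above, with the dataset under `data`) lists any audio file that is missing its `.lab` counterpart:

```bash
# List any audio files under data/ that lack a matching .lab transcript.
# Assumes the directory layout shown above; adjust the root path if yours differs.
find data -type f \( -name "*.mp3" -o -name "*.wav" -o -name "*.flac" \) | while read -r f; do
    [ -f "${f%.*}.lab" ] || echo "missing transcript: $f"
done
```
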
!!! warning
    It's recommended to apply loudness normalization to the dataset. You can use [fish-audio-preprocess](https://github.com/fishaudio/audio-preprocess) to do this.

    ```bash
    fap loudness-norm data-raw data --clean
    ```

### 2. Batch extraction of semantic tokens

Make sure you have downloaded the VQGAN weights. If not, run the following command:

```bash
huggingface-cli download fishaudio/fish-speech-1.4 --local-dir checkpoints/fish-speech-1.4
```

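If the weights are already in place, a quick listing of the generator checkpoint used by the extraction step confirms it (the path matches the `--checkpoint-path` argument in the next command):

```bash
# Quick check that the generator weights expected by the extraction step exist.
ls -lh checkpoints/fish-speech-1.4/firefly-gan-vq-fsq-8x1024-21hz-generator.pth
```
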
You can then run the following command to extract semantic tokens:

```bash
python tools/vqgan/extract_vq.py data \
    --num-workers 1 --batch-size 16 \
    --config-name "firefly_gan_vq" \
    --checkpoint-path "checkpoints/fish-speech-1.4/firefly-gan-vq-fsq-8x1024-21hz-generator.pth"
```

!!! note
    You can adjust `--num-workers` and `--batch-size` to increase extraction speed, but please make sure not to exceed your GPU memory limit.

    For the VITS format, you can specify a file list using `--filelist xxx.list`.

This command will create `.npy` files in the `data` directory, as shown below:

```
.
├── SPK1
│   ├── 21.15-26.44.lab
│   ├── 21.15-26.44.mp3
│   ├── 21.15-26.44.npy
│   ├── 27.51-29.98.lab
│   ├── 27.51-29.98.mp3
│   ├── 27.51-29.98.npy
│   ├── 30.1-32.71.lab
│   ├── 30.1-32.71.mp3
│   └── 30.1-32.71.npy
└── SPK2
    ├── 38.79-40.85.lab
    ├── 38.79-40.85.mp3
    └── 38.79-40.85.npy
```

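As a quick sanity check, you can compare the number of audio files with the number of extracted token files; the two counts should match:

```bash
# Each audio file should have produced one .npy token file, so the counts should match.
echo "audio: $(find data -type f \( -name '*.mp3' -o -name '*.wav' -o -name '*.flac' \) | wc -l)"
echo "npy:   $(find data -type f -name '*.npy' | wc -l)"
```
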
### 3. Pack the dataset into protobuf

```bash
python tools/llama/build_dataset.py \
    --input "data" \
    --output "data/protos" \
    --text-extension .lab \
    --num-workers 16
```

After the command finishes executing, you should see the `quantized-dataset-ft.protos` file in the `data/protos` directory.

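A quick listing of the output directory confirms that the packed dataset was written:

```bash
# The packed protobuf dataset should appear under the --output directory given above.
ls -lh data/protos
```
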
### 4. Finally, fine-tuning with LoRA

Similarly, make sure you have downloaded the `LLAMA` weights. If not, run the following command:

```bash
huggingface-cli download fishaudio/fish-speech-1.4 --local-dir checkpoints/fish-speech-1.4
```

Finally, you can start the fine-tuning by running the following command:

```bash
python fish_speech/train.py --config-name text2semantic_finetune \
    project=$project \
    +lora@model.model.lora_config=r_8_alpha_16
```

!!! note
    You can modify training parameters such as `batch_size`, `gradient_accumulation_steps`, etc. to fit your GPU memory by editing `fish_speech/configs/text2semantic_finetune.yaml`.

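Because the training entry point uses Hydra-style configuration (note the `--config-name` and `+lora@...` arguments above), you can usually override individual values on the command line instead of editing the YAML. The key paths below are assumptions for illustration only; check `fish_speech/configs/text2semantic_finetune.yaml` for the exact names before using them:

```bash
# Hypothetical Hydra overrides; the key paths (data.batch_size,
# trainer.accumulate_grad_batches) are assumptions, so verify them against
# fish_speech/configs/text2semantic_finetune.yaml before use.
python fish_speech/train.py --config-name text2semantic_finetune \
    project=$project \
    +lora@model.model.lora_config=r_8_alpha_16 \
    data.batch_size=4 \
    trainer.accumulate_grad_batches=2
```
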
!!! note
    For Windows users, you can use `trainer.strategy.process_group_backend=gloo` to avoid `nccl` issues.

After training is complete, you can refer to the [inference](inference.md) section to generate speech.

!!! info
    By default, the model will only learn the speaker's speech patterns and not the timbre. You still need to use prompts to ensure timbre stability.

    If you want the model to learn the timbre as well, you can increase the number of training steps, but this may lead to overfitting.

After training, you need to convert the LoRA weights to regular weights before performing inference.

```bash
python tools/llama/merge_lora.py \
    --lora-config r_8_alpha_16 \
    --base-weight checkpoints/fish-speech-1.4 \
    --lora-weight results/$project/checkpoints/step_000000010.ckpt \
    --output checkpoints/fish-speech-1.4-yth-lora/
```

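You can then list the output directory to confirm that the merged weights were written before moving on to inference:

```bash
# The merged weights should now be in the directory given to --output above.
ls -lh checkpoints/fish-speech-1.4-yth-lora/
```
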
!!! note
    You may also try other checkpoints. We suggest using the earliest checkpoint that meets your requirements, as they often perform better on out-of-distribution (OOD) data.