# llama.cpp/example/tts

This example demonstrates the Text To Speech feature. It uses a
[model](https://www.outeai.com/blog/outetts-0.2-500m) from
[outeai](https://www.outeai.com/).

## Quickstart

If you have built llama.cpp with `-DLLAMA_CURL=ON` you can simply run the
following command and the required models will be downloaded automatically:
```console
$ build/bin/llama-tts --tts-oute-default -p "Hello world" && aplay output.wav
```
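If your build does not have curl support enabled, a typical CMake configuration and build looks like the following (a sketch, assuming you build in a `build` directory as in the commands above):
```console
$ cmake -B build -DLLAMA_CURL=ON
$ cmake --build build --config Release
```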
For details about the models and how to convert them to the required format
see the following sections.

### Model conversion

Check out or download the repository that contains the LLM model:
```console
$ pushd models
$ git clone --branch main --single-branch --depth 1 https://huggingface.co/OuteAI/OuteTTS-0.2-500M
$ cd OuteTTS-0.2-500M && git lfs install && git lfs pull
$ popd
```
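The conversion command below is run inside a Python virtual environment (the `(venv)` prompt). If you do not have one set up yet, a minimal sketch, assuming the `requirements.txt` at the root of the llama.cpp repository covers the conversion script's dependencies:
```console
$ python3 -m venv venv
$ source venv/bin/activate
(venv) pip install -r requirements.txt
```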
Convert the model to .gguf format:
```console
(venv) python convert_hf_to_gguf.py models/OuteTTS-0.2-500M \
    --outfile models/outetts-0.2-0.5B-f16.gguf --outtype f16
```
The generated model will be `models/outetts-0.2-0.5B-f16.gguf`.

We can optionally quantize this to Q8_0 using the following command:
```console
$ build/bin/llama-quantize models/outetts-0.2-0.5B-f16.gguf \
    models/outetts-0.2-0.5B-q8_0.gguf q8_0
```
The quantized model will be `models/outetts-0.2-0.5B-q8_0.gguf`.
Next we do something similar for the audio decoder. First download or check out
the model for the voice decoder:
```console
$ pushd models
$ git clone --branch main --single-branch --depth 1 https://huggingface.co/novateur/WavTokenizer-large-speech-75token
$ cd WavTokenizer-large-speech-75token && git lfs install && git lfs pull
$ popd
```
This model file is a PyTorch checkpoint (.ckpt) and we first need to convert it
to Hugging Face format:
```console
(venv) python examples/tts/convert_pt_to_hf.py \
    models/WavTokenizer-large-speech-75token/wavtokenizer_large_speech_320_24k.ckpt
...
Model has been successfully converted and saved to models/WavTokenizer-large-speech-75token/model.safetensors
Metadata has been saved to models/WavTokenizer-large-speech-75token/index.json
Config has been saved to models/WavTokenizer-large-speech-75token/config.json
```
Then we can convert the Hugging Face format to gguf:
```console
(venv) python convert_hf_to_gguf.py models/WavTokenizer-large-speech-75token \
    --outfile models/wavtokenizer-large-75-f16.gguf --outtype f16
...
INFO:hf-to-gguf:Model successfully exported to models/wavtokenizer-large-75-f16.gguf
```
### Running the example

With both of the models generated, the LLM model and the voice decoder model,
we can run the example:
```console
$ build/bin/llama-tts -m ./models/outetts-0.2-0.5B-q8_0.gguf \
    -mv ./models/wavtokenizer-large-75-f16.gguf \
    -p "Hello world"
...
main: audio written to file 'output.wav'
```
The `output.wav` file will contain the audio of the prompt. This can be heard
by playing the file with a media player. On Linux the following command will
play the audio:
```console
$ aplay output.wav
```
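If you want to sanity-check the generated file without listening to it, the WAV header can be inspected with Python's standard `wave` module. A minimal sketch (the `24k` in the WavTokenizer checkpoint name suggests a 24 kHz sample rate):
```console
(venv) python -c "import wave; w = wave.open('output.wav'); print(w.getframerate(), 'Hz,', w.getnframes(), 'frames')"
```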
### Running the example with llama-server

Running this example with `llama-server` is also possible and requires two
server instances to be started. One will serve the LLM model and the other
will serve the voice decoder model.

The LLM model server can be started with the following command:
```console
$ ./build/bin/llama-server -m ./models/outetts-0.2-0.5B-q8_0.gguf --port 8020
```

And the voice decoder model server can be started using:
```console
$ ./build/bin/llama-server -m ./models/wavtokenizer-large-75-f16.gguf --port 8021 --embeddings --pooling none
```
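Before generating audio you can check that both instances are up and ready; `llama-server` exposes a `/health` endpoint which reports a status of `ok` once the model has finished loading:
```console
$ curl http://localhost:8020/health
$ curl http://localhost:8021/health
```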
Then we can run [tts-outetts.py](tts-outetts.py) to generate the audio.

First create a virtual environment for Python and install the required
dependencies (this is only required to be done once; if you created a virtual
environment for the model conversion above you can reuse it and just install
these packages):
```console
$ python3 -m venv venv
$ source venv/bin/activate
(venv) pip install requests numpy
```

And then run the Python script using:
```console
(venv) python ./examples/tts/tts-outetts.py http://localhost:8020 http://localhost:8021 "Hello world"
spectrogram generated: n_codes: 90, n_embd: 1282
converting to audio ...
audio generated: 28800 samples
audio written to file "output.wav"
```

And to play the audio we can again use aplay or any other media player:
```console
$ aplay output.wav
```