# Speech to Speech Example
In this example, we showcase how to build a simple speech-to-speech voice assistant pipeline with nvidia-pipecat and the pipecat-ai library, and how to deploy it for testing. The pipeline uses a WebSocket-based ACETransport, Riva ASR and TTS models, and the NVIDIA LLM service. We recommend first reading [the Pipecat documentation](https://docs.pipecat.ai/getting-started/core-concepts) or the Pipecat overview section of [the ACE Controller documentation](https://docs.nvidia.com/ace/ace-controller-microservice/latest/user-guide.html#pipecat-overview) to understand the core concepts.
## Prerequisites
1. Copy and configure the environment file:
```bash
cp env.example .env # and add your credentials
```
2. Ensure you have the required API keys (a sketch of the resulting `.env` is shown after this list):
- `NVIDIA_API_KEY` - Required for accessing NIM ASR, TTS, and LLM models
- (Optional) `ZEROSHOT_TTS_NVIDIA_API_KEY` - Required for zero-shot TTS
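As a rough sketch (the authoritative list of variables lives in `env.example`), the configured `.env` looks something like this, with placeholder values:
```bash
# .env (sketch; replace the placeholders with your own keys)
NVIDIA_API_KEY=<your NVIDIA API key>
# Only needed when using the zero-shot Magpie TTS model
ZEROSHOT_TTS_NVIDIA_API_KEY=<your zero-shot TTS API key>
```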
## Option 1: Deploy Using Docker
#### Prerequisites
- You have access and are logged into NVIDIA NGC. For step-by-step instructions, refer to [the NGC Getting Started Guide](https://docs.nvidia.com/ngc/ngc-overview/index.html#registering-activating-ngc-account).
- You have access to an NVIDIA Volta™, NVIDIA Turing™, or an NVIDIA Ampere architecture-based A100 GPU. For more information, refer to [the Support Matrix](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/support-matrix.html#support-matrix).
- You have Docker installed with support for NVIDIA GPUs. For more information, refer to [the Support Matrix](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/support-matrix.html#support-matrix).
From the examples/speech-to-speech directory, run the following command:
```bash
docker compose up -d
```
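Once the containers are up, you can check their status and follow the logs with standard Docker Compose commands:
```bash
docker compose ps        # verify that all services are running
docker compose logs -f   # follow the combined service logs
```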
## Option 2: Deploy Using a Python Environment
#### Prerequisites
From the examples/speech-to-speech directory, run the following commands to create a virtual environment and install the dependencies:
```bash
# Create and activate virtual environment
uv venv
source .venv/bin/activate
# Install dependencies
uv sync
```
Make sure you've configured the `.env` file with your API keys before proceeding.
After making all required changes and customizations in `bot.py`, you can deploy the pipeline with the following command:
```bash
python bot.py
```
## Start interacting with the application
This hosts the static web client along with the ACE controller server. Visit `http://WORKSTATION_IP:8100/static/index.html` in your browser to start a session.
Note: For microphone access, you will need to open chrome://flags/ and add http://WORKSTATION_IP:8100 under the "Insecure origins treated as secure" section.
If you want to change the port, update the `uvicorn.run` call in [bot.py](bot.py) and the `wsUrl` in [static/index.html](../static/index.html).
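As a rough illustration (the actual code in `bot.py` is more involved), the server side of that change looks something like the sketch below; the port passed to `uvicorn.run` must match the port used by `wsUrl` in `static/index.html`:
```python
# Sketch only: bot.py builds the real ACE controller pipeline; this just shows
# where the server port is set. Keep it in sync with wsUrl in static/index.html.
import uvicorn
from fastapi import FastAPI

app = FastAPI()  # placeholder app for illustration

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8100)  # change 8100 to your desired port
```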
## Bot customizations
### Enabling Speculative Speech Processing
Speculative speech processing reduces bot response latency by acting on early interim user transcripts from Riva ASR instead of waiting for final transcripts. This feature only works when using Riva ASR.
- Refer to the comments in [bot.py](bot.py) for guidance on enabling or disabling specific frame processors as needed.
- See the [ACE Controller Microservice documentation on Speculative Speech Processing](https://docs.nvidia.com/ace/ace-controller-microservice/1.0/user-guide.html#speculative-speech-processing) for more details.
### Switching ASR, LLM, and TTS Models
You can easily customize ASR (Automatic Speech Recognition), LLM (Large Language Model), and TTS (Text-to-Speech) services by configuring environment variables. This allows you to switch between NIM cloud-hosted models and locally deployed models.
The following environment variables control the endpoints and models:
- `RIVA_ASR_URL`: Address of the Riva ASR (speech-to-text) service (e.g., `localhost:50051` for a local deployment, `grpc.nvcf.nvidia.com:443` for the cloud endpoint).
- `RIVA_TTS_URL`: Address of the Riva TTS (text-to-speech) service (e.g., `localhost:50051` for a local deployment, `grpc.nvcf.nvidia.com:443` for the cloud endpoint).
- `NVIDIA_LLM_URL`: URL of the NVIDIA LLM service (e.g., `http://<machine-ip>:8000/v1` for a local deployment, `https://integrate.api.nvidia.com/v1` for the cloud endpoint).
You can set model, language, and voice using the `RIVA_ASR_MODEL`, `RIVA_TTS_MODEL`, `NVIDIA_LLM_MODEL`, `RIVA_ASR_LANGUAGE`, `RIVA_TTS_LANGUAGE`, and `RIVA_TTS_VOICE_ID` environment variables.
Update these variables in your Docker Compose configuration to match your deployment and desired models. For more details on available models and configuration options, refer to the [NIM NVIDIA Magpie](https://build.nvidia.com/nvidia/magpie-tts-multilingual), [NIM NVIDIA Parakeet](https://build.nvidia.com/nvidia/parakeet-ctc-1_1b-asr/api), and [NIM META Llama](https://build.nvidia.com/meta/llama-3_1-8b-instruct) documentation.
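For example, to point the pipeline at locally deployed Riva and NIM containers, the `environment` section of the `python-app` service in `docker-compose.yml` could look roughly like this (hostnames, ports, and the model and voice names are placeholders; adjust them to your deployment):
```yaml
services:
  python-app:
    environment:
      # Riva endpoints: local gRPC address, or grpc.nvcf.nvidia.com:443 for the cloud endpoint
      - RIVA_ASR_URL=<riva-asr-host>:50051
      - RIVA_TTS_URL=<riva-tts-host>:50051
      # LLM endpoint: local NIM, or https://integrate.api.nvidia.com/v1 for the cloud endpoint
      - NVIDIA_LLM_URL=http://<machine-ip>:8000/v1
      - NVIDIA_LLM_MODEL=meta/llama-3.1-8b-instruct
      # Optional model, language, and voice overrides
      - RIVA_ASR_MODEL=<asr-model-name>
      - RIVA_ASR_LANGUAGE=en-US
      - RIVA_TTS_MODEL=<tts-model-name>
      - RIVA_TTS_VOICE_ID=<tts-voice-id>
      - RIVA_TTS_LANGUAGE=en-US
```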
#### Example: Switching to the Llama 3.3-70B Model
To use a larger LLM such as the Llama 3.3-70B model in your deployment, you need to update both the Docker Compose configuration and the environment variables for your Python application. Follow these steps (a sketch of the changed fields follows the list):
- In your `docker-compose.yml` file, find the `nvidia-llm` service section.
- Change the NIM image to the 70B model: `nvcr.io/nim/meta/llama-3.3-70b-instruct:latest`.
- Update the `device_ids` to allocate at least two GPUs (for example, `['2', '3']`).
- Update the environment variable under the `python-app` service to `NVIDIA_LLM_MODEL=meta/llama-3.3-70b-instruct`.
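A sketch of the changed fields is shown below; the rest of the `nvidia-llm` and `python-app` service definitions in `docker-compose.yml` stay as they are:
```yaml
services:
  nvidia-llm:
    image: nvcr.io/nim/meta/llama-3.3-70b-instruct:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['2', '3']   # at least two GPUs for the 70B model
              capabilities: [gpu]
  python-app:
    environment:
      - NVIDIA_LLM_MODEL=meta/llama-3.3-70b-instruct
```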
#### Setting up Zero-shot Magpie Latest Model
Follow these steps to configure and use the latest Zero-shot Magpie TTS model:
1. **Update Docker Compose Configuration**
Modify the `riva-tts-magpie` service in your docker-compose file with the following configuration:
```yaml
riva-tts-magpie:
image: <magpie-tts-zeroshot-image:version> # Replace this with the actual image tag
environment:
- NGC_API_KEY=${ZEROSHOT_TTS_NVIDIA_API_KEY}
- NIM_HTTP_API_PORT=9000
- NIM_GRPC_API_PORT=50051
ports:
- "49000:50051"
shm_size: 16GB
deploy:
resources:
reservations:
devices:
- driver: nvidia
device_ids: ['0']
capabilities: [gpu]
```
- Ensure `ZEROSHOT_TTS_NVIDIA_API_KEY` is properly set in your `.env` file:
```bash
ZEROSHOT_TTS_NVIDIA_API_KEY=
```
2. **Configure TTS Voice Settings**
Update the following environment variables under the `python-app` service:
```bash
RIVA_TTS_VOICE_ID=Magpie-ZeroShot.Female-1
RIVA_TTS_MODEL=magpie_tts_ensemble-Magpie-ZeroShot
```
3. **Zero-shot Audio Prompt Configuration**
To use a custom voice with zero-shot learning:
- Add your audio prompt file to the workspace
- Mount the audio file into your container by adding a volume in your `docker-compose.yml` under the `python-app` service:
```yaml
services:
python-app:
# ... existing code ...
volumes:
- ./audio_prompts:/app/audio_prompts
```
- Set the `ZERO_SHOT_AUDIO_PROMPT` environment variable to the path relative to your application root:
```yaml
environment:
- ZERO_SHOT_AUDIO_PROMPT=audio_prompts/voice_sample.wav # Path relative to app root
```
Note: The zero-shot audio prompt is only required when using the Magpie Zero-shot model. For standard Magpie multilingual models, this configuration should be omitted.