fciannella committed
Commit 06523e9 · 1 Parent(s): e9446cb

Added the readme
README.md CHANGED
@@ -1,92 +1,117 @@
- ---
- title: Ace Controller Pipeline
- emoji: 🐠
- colorFrom: indigo
- colorTo: gray
- sdk: docker
- pinned: false
- short_description: Voice Demos with Ace Controller
- ---

- # ACE Controller SDK

- The ACE Controller SDK allows you to build your own ACE Controller service to manage multimodal, real-time interactions with voice bots and avatars using NVIDIA ACE. With the SDK, you can create controllers that leverage the Python-based open-source [Pipecat framework](https://github.com/pipecat-ai/pipecat) for creating real-time, voice-enabled, and multimodal conversational AI agents. The SDK contains enhancements to the Pipecat framework, enabling developers to effortlessly customize, debug, and deploy complex pipelines while integrating robust NVIDIA services into the Pipecat ecosystem.

- ## Main Features

- - **Pipecat Extension:** A Pipecat extension to connect with ACE services and NVIDIA NIMs, facilitating the creation of human-avatar interactions. The NVIDIA Pipecat library augments [the Pipecat framework](https://github.com/pipecat-ai/pipecat) by adding additional frame processors and services, as well as new multimodal frames to enhance avatar interactions. This includes the integration of NVIDIA services and NIMs such as [NVIDIA Riva](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/index.html), [NVIDIA Audio2Face](https://build.nvidia.com/nvidia/audio2face-3d), and [NVIDIA Foundational RAG](https://build.nvidia.com/nvidia/build-an-enterprise-rag-pipeline).

- - **HTTP and WebSocket Server Implementation:** The SDK provides a FastAPI-based HTTP and WebSocket server implementation compatible with ACE. It includes functionality for stream and pipeline management by offering new Pipecat pipeline runners and transports. For ease of use and distribution, this functionality is currently included in the `nvidia-pipecat` Python library as well.

- ## ACE Controller Microservice

- The ACE Controller SDK was used to build the [ACE Controller Microservice](https://docs.nvidia.com/ace/ace-controller-microservice/latest/index.html). Check out the [ACE documentation](https://docs.nvidia.com/ace/tokkio/latest/customization/customization-options.html) for more details on how to configure the ACE Controller MS with your custom pipelines.

- ## Getting Started

- The NVIDIA Pipecat package is released as a wheel on PyPI. Create a Python virtual environment and use pip to install the `nvidia-pipecat` package.

- ```bash
- pip install nvidia-pipecat
- ```

- You can start building Pipecat pipelines using services from the NVIDIA Pipecat package. For more details, follow [the ACE Controller](https://docs.nvidia.com/ace/ace-controller-microservice/latest/index.html) and [the Pipecat Framework](https://docs.pipecat.ai/getting-started/overview) documentation.
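For orientation, here is a minimal, hedged sketch of constructing one of these services and placing it in a Pipecat `Pipeline`. The `RivaTTSService` parameters mirror the example pipeline elsewhere in this repository; a real agent would also need a transport, ASR, an LLM service, and a runner, so treat this as a starting point rather than a working pipeline.

```python
# Minimal sketch (assumes RIVA_API_KEY is exported; defaults mirror the
# example pipeline in this repository).
import os

from pipecat.pipeline.pipeline import Pipeline
from nvidia_pipecat.services.riva_speech import RivaTTSService

tts = RivaTTSService(
    api_key=os.getenv("RIVA_API_KEY"),
    function_id=os.getenv("NVIDIA_TTS_FUNCTION_ID", "4e813649-d5e4-4020-b2be-2b918396d19d"),
    voice_id=os.getenv("RIVA_TTS_VOICE_ID", "Magpie-ZeroShot.Female-1"),
    model=os.getenv("RIVA_TTS_MODEL", "magpie_tts_ensemble-Magpie-ZeroShot"),
    language=os.getenv("RIVA_TTS_LANGUAGE", "en-US"),
)

# A complete agent would chain transport input, ASR, LLM, TTS, and transport
# output; see the Pipecat documentation for full examples.
pipeline = Pipeline([tts])
```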

- ## Hacking on the framework itself

- If you wish to work directly with the source code or modify services from the nvidia-pipecat package, you can use either the UV or Nix development setup as outlined below.

- ### Using UV

- To get started, first install the [UV package manager](https://docs.astral.sh/uv/#highlights).

- Then, create a virtual environment with all the required dependencies by running the following commands:
 ```bash
- uv venv
- uv sync
- source .venv/bin/activate
 ```

- Once the environment is set up, you can begin building pipelines or modifying the services in the source code.

- If you wish to contribute your changes to the repository, please ensure you run the unit tests, linter, and formatting tool.

- To run unit tests, use:
- ```bash
- uv run pytest
- ```

- To format the code, use:
 ```bash
- ruff format
- ```

- To run the linter, use:
- ```bash
- ruff check
 ```

- ### Using Nix

- To set up your development environment using [Nix](https://nixos.org/download/#nix-install-linux), follow these steps:

- Initialize the development environment by running the following command:
- ```bash
- nix develop
 ```

- This setup provides you with a fully configured environment, allowing you to focus on development without worrying about dependency management.

- To ensure that all checks, such as formatting and linting, are passing for the repository, use the following command:

- ```bash
- nix flake check
- ```

- ## CONTRIBUTING

- We invite contributions! Open a GitHub issue or pull request! See contributing guidelines [here](./CONTRIBUTING.md).

+ # Voice Agent WebRTC + LangGraph (Quick Start)

+ This repository includes a complete voice agent stack:
+ - LangGraph dev server for local agents
+ - Pipecat-based speech pipeline (WebRTC, ASR, LangGraph LLM adapter, TTS)
+ - Static UI you can open in a browser

+ Primary example: `examples/voice_agent_webrtc_langgraph/`

+ ## 1) Mandatory environment variables
+ Create `.env` in `examples/voice_agent_webrtc_langgraph/` (copy from `env.example`) and set at least the following; a sample `.env` sketch follows the lists below:

+ - `RIVA_API_KEY` or `NVIDIA_API_KEY`: required for NVIDIA NIM-hosted Riva ASR/TTS
+ - `LANGGRAPH_BASE_URL` (default `http://127.0.0.1:2024`)
+ - `LANGGRAPH_ASSISTANT` (default `ace-base-agent`)
+ - `USER_EMAIL` (e.g. `test@example.com`)
+ - `LANGGRAPH_STREAM_MODE` (default `values`)
+ - `LANGGRAPH_DEBUG_STREAM` (default `true`)

+ Optional but useful:
+ - `RIVA_ASR_LANGUAGE` (default `en-US`)
+ - `RIVA_TTS_LANGUAGE` (default `en-US`)
+ - `RIVA_TTS_VOICE_ID` (e.g. `Magpie-ZeroShot.Female-1`)
+ - `RIVA_TTS_MODEL` (e.g. `magpie_tts_ensemble-Magpie-ZeroShot`)
+ - `ZERO_SHOT_AUDIO_PROMPT` if using Magpie Zero‑shot with a custom audio prompt
+ - `ZERO_SHOT_AUDIO_PROMPT_URL` to auto-download the prompt on startup
+ - `ENABLE_SPECULATIVE_SPEECH` (default `true`)
+ - `LANGGRAPH_AUTH_TOKEN` (or `AUTH0_ACCESS_TOKEN`/`AUTH_BEARER_TOKEN`) if your LangGraph server requires auth
+ - TURN/Twilio for WebRTC if needed: `TWILIO_ACCOUNT_SID`, `TWILIO_AUTH_TOKEN`, or `TURN_SERVER_URL`, `TURN_USERNAME`, `TURN_PASSWORD`
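For reference, a sample `.env` sketch using the defaults listed above (the API key value is a placeholder, and every entry can be adjusted or omitted per the lists):

```bash
# Example .env — values are illustrative; replace the key with your own.
RIVA_API_KEY=nvapi-REPLACE_ME
LANGGRAPH_BASE_URL=http://127.0.0.1:2024
LANGGRAPH_ASSISTANT=ace-base-agent
USER_EMAIL=test@example.com
LANGGRAPH_STREAM_MODE=values
LANGGRAPH_DEBUG_STREAM=true
# Optional overrides
RIVA_ASR_LANGUAGE=en-US
RIVA_TTS_LANGUAGE=en-US
RIVA_TTS_VOICE_ID=Magpie-ZeroShot.Female-1
RIVA_TTS_MODEL=magpie_tts_ensemble-Magpie-ZeroShot
ENABLE_SPECULATIVE_SPEECH=true
```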

+ ## 2) What it does
+ - Starts the LangGraph dev server serving agents from `examples/voice_agent_webrtc_langgraph/agents/`.
+ - Starts the Pipecat pipeline (`pipeline.py`) exposing:
+   - HTTP: `http://<host>:7860` (health, RTC config)
+   - WebSocket: `ws://<host>:7860/ws` (audio + transcripts)
+ - Serves the built UI at `http://<host>:9000/` (via Docker).

+ Defaults:
+ - ASR: NVIDIA Riva (NIM) via `RIVA_API_KEY` and the built-in `NVIDIA_ASR_FUNCTION_ID`
+ - LLM: LangGraph adapter, streaming from the selected assistant
+ - TTS: NVIDIA Riva Magpie (NIM) via `RIVA_API_KEY` and the built-in `NVIDIA_TTS_FUNCTION_ID`

+ ## 3) Run

+ ### Option A: Docker (recommended)
+ From `examples/voice_agent_webrtc_langgraph/`:

 ```bash
+ docker compose up --build -d
 ```

+ Then open `http://<machine-ip>:9000/`.

+ Chrome on http origins: enable “Insecure origins treated as secure” at `chrome://flags/` and add `http://<machine-ip>:9000`.

+ ### Option B: Python (local)
+ Requires Python 3.12 and `uv`.

 ```bash
+ cd examples/voice_agent_webrtc_langgraph
+ uv run pipeline.py
 ```
+ Then start the UI from `ui/` (see `examples/voice_agent_webrtc_langgraph/ui/README.md`).

+ ## 4) Swap TTS providers (Magpie ⇄ ElevenLabs)
+ The default TTS in `examples/voice_agent_webrtc_langgraph/pipeline.py` is NVIDIA Riva Magpie via NIM:

+ ```python
+ from nvidia_pipecat.services.riva_speech import RivaTTSService
+
+ tts = RivaTTSService(
+     api_key=os.getenv("RIVA_API_KEY"),
+     function_id=os.getenv("NVIDIA_TTS_FUNCTION_ID", "4e813649-d5e4-4020-b2be-2b918396d19d"),
+     voice_id=os.getenv("RIVA_TTS_VOICE_ID", "Magpie-ZeroShot.Female-1"),
+     model=os.getenv("RIVA_TTS_MODEL", "magpie_tts_ensemble-Magpie-ZeroShot"),
+     language=os.getenv("RIVA_TTS_LANGUAGE", "en-US"),
+     zero_shot_audio_prompt_file=(
+         Path(os.getenv("ZERO_SHOT_AUDIO_PROMPT")) if os.getenv("ZERO_SHOT_AUDIO_PROMPT") else None
+     ),
+ )
 ```

+ To use ElevenLabs instead:
+ 1) Ensure ElevenLabs support is available (included via project deps).
+ 2) Set environment:
+    - `ELEVENLABS_API_KEY`
+    - Optionally `ELEVENLABS_VOICE_ID` and any model-specific settings
+ 3) Edit `examples/voice_agent_webrtc_langgraph/pipeline.py` to import and construct the ElevenLabs TTS service:
+
+ ```python
+ from nvidia_pipecat.services.elevenlabs import ElevenLabsTTSServiceWithEndOfSpeech
+
+ # Replace the RivaTTSService(...) block with:
+ tts = ElevenLabsTTSServiceWithEndOfSpeech(
+     api_key=os.getenv("ELEVENLABS_API_KEY"),
+     voice_id=os.getenv("ELEVENLABS_VOICE_ID", "Rachel"),
+     sample_rate=16000,
+     channels=1,
+ )
+ ```

+ No other pipeline changes are required; transcript synchronization supports ElevenLabs end‑of‑speech events.

+ Notes for Magpie Zero‑shot:
+ - Set `RIVA_TTS_VOICE_ID` like `Magpie-ZeroShot.Female-1` and `RIVA_TTS_MODEL` like `magpie_tts_ensemble-Magpie-ZeroShot`.
+ - If using a custom voice prompt, mount it via `docker-compose.yml` (as sketched below) and set `ZERO_SHOT_AUDIO_PROMPT`, or set `ZERO_SHOT_AUDIO_PROMPT_URL` to auto-download on startup.
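A minimal sketch of that mount and environment setting, based on the older example configuration in this repository (the `audio_prompts` directory and file name are illustrative):

```yaml
# docker-compose.yml sketch — the prompt path and file name are illustrative.
services:
  python-app:
    volumes:
      - ./audio_prompts:/app/audio_prompts
    environment:
      - ZERO_SHOT_AUDIO_PROMPT=audio_prompts/voice_sample.wav  # path relative to the app root
```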

+ ## 5) Troubleshooting
+ - Healthcheck: `curl -f http://localhost:7860/get_prompt`
+ - If the UI can’t access the mic on http, use the Chrome flag above or host the UI via HTTPS.
+ - For NAT/firewall issues, configure TURN or provide Twilio credentials.

examples/voice_agent_webrtc_langgraph/README.md CHANGED
@@ -1,186 +1,112 @@
- # Speech to Speech Demo

- In this example, we showcase how to build a speech-to-speech voice assistant pipeline using WebRTC with real-time transcripts. It uses a Pipecat pipeline with FastAPI on the backend and React on the frontend. The pipeline uses a WebRTC-based SmallWebRTCTransport, Riva ASR and TTS models, and the NVIDIA LLM service.

- ## Prerequisites
- - You have access to and are logged into NVIDIA NGC. For step-by-step instructions, refer to [the NGC Getting Started Guide](https://docs.nvidia.com/ngc/ngc-overview/index.html#registering-activating-ngc-account).
- - You have Docker installed with support for NVIDIA GPUs. For more information, refer to [the Support Matrix](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/support-matrix.html#support-matrix).

- ## Set up API keys

- 1. Copy and configure the environment file:

- ```bash
- cp env.example .env # and add your credentials
- ```

- 2. Ensure you have the required API keys:
-    - NVIDIA_API_KEY - Required for accessing NIM ASR, TTS, and LLM models
-    - (Optional) ZEROSHOT_TTS_NVIDIA_API_KEY - Required for zero-shot TTS

- ## Option 1: Deploy Using Docker

- From the `examples/voice_agent_webrtc` directory, run:

 ```bash
 docker compose up --build -d
 ```

- Then visit `http://<machine-ip>:9000/` in your browser to start interacting with the application.

- Note: To enable microphone access in Chrome, go to `chrome://flags/`, enable "Insecure origins treated as secure", add `http://<machine-ip>:9000` to the list, and restart Chrome.

- ## Option 2: Deploy Using Python Environment

- ### Requirements

- - Python (>=3.12)
- - [uv](https://github.com/astral-sh/uv)

- All Python dependencies are listed in `pyproject.toml` and can be installed with `uv`.

- ### Run

 ```bash
 uv run pipeline.py
 ```

- Then run the UI following [`ui/README.md`](ui/README.md).

- ## Using Coturn Server

- If you want to share the demo widely or deploy on cloud platforms, you will need to set up a coturn server. Follow the instructions below for the modifications required in the example code to use coturn:

- ### Deploy Coturn Server

- Update `HOST_IP_EXTERNAL` and run the command below:

- ```bash
- docker run -d --network=host instrumentisto/coturn -n --verbose --log-file=stdout --external-ip=<HOST_IP_EXTERNAL> --listening-ip=<HOST_IP_EXTERNAL> --lt-cred-mech --fingerprint --user=admin:admin --no-multicast-peers --realm=tokkio.realm.org --min-port=51000 --max-port=52000
 ```

- #### Update pipeline.py

- Add the following configuration to your `pipeline.py` file to use the coturn server:

 ```python
- ice_servers = [
-     IceServer(
-         urls="<TURN_SERVER_URL>",
-         username="<TURN_USERNAME>",
-         credential="<TURN_PASSWORD>"
-     )
- ]
- ```

- #### Update ui/src/config.ts

- Add the following configuration to your `ui/src/config.ts` file to use the coturn server:

- ```typescript
- export const RTC_CONFIG: ConstructorParameters<typeof RTCPeerConnection>[0] = {
-   iceServers: [
-     {
-       urls: "<turn_server_url>",
-       username: "<turn_server_username>",
-       credential: "<turn_server_credential>",
-     },
-   ],
- };
- ```

- ## Bot customizations

- ### Enabling Speculative Speech Processing

- Speculative speech processing reduces bot response latency by working directly on early interim user transcripts from Riva ASR instead of waiting for final transcripts. This feature only works when using Riva ASR.

- - Set the `ENABLE_SPECULATIVE_SPEECH` environment variable to `true` in `docker-compose.yml` under the `python-app` service.
- - See the [ACE Controller Microservice documentation on Speculative Speech Processing](https://docs.nvidia.com/ace/ace-controller-microservice/1.0/user-guide.html#speculative-speech-processing) for more details.

- ### Switching ASR, LLM, and TTS Models

- You can easily customize the ASR (Automatic Speech Recognition), LLM (Large Language Model), and TTS (Text-to-Speech) services by configuring environment variables. This allows you to switch between NIM cloud-hosted models and locally deployed models.

- The following environment variables control the endpoints and models:

- - `RIVA_ASR_URL`: Address of the Riva ASR (speech-to-text) service (e.g., `localhost:50051` for local, `grpc.nvcf.nvidia.com:443` for the cloud endpoint).
- - `RIVA_TTS_URL`: Address of the Riva TTS (text-to-speech) service (e.g., `localhost:50051` for local, `grpc.nvcf.nvidia.com:443` for the cloud endpoint).
- - `NVIDIA_LLM_URL`: URL for the NVIDIA LLM service (e.g., `http://<machine-ip>:8000/v1` for local, `https://integrate.api.nvidia.com/v1` for the cloud endpoint).

- You can set the model, language, and voice using the `RIVA_ASR_MODEL`, `RIVA_TTS_MODEL`, `NVIDIA_LLM_MODEL`, `RIVA_ASR_LANGUAGE`, `RIVA_TTS_LANGUAGE`, and `RIVA_TTS_VOICE_ID` environment variables.

- Update these variables in your Docker Compose configuration to match your deployment and desired models; a sample local configuration is sketched below. For more details on available models and configuration options, refer to the [NIM NVIDIA Magpie](https://build.nvidia.com/nvidia/magpie-tts-multilingual), [NIM NVIDIA Parakeet](https://build.nvidia.com/nvidia/parakeet-ctc-1_1b-asr/api), and [NIM META Llama](https://build.nvidia.com/meta/llama-3_1-8b-instruct) documentation.
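For example, a local deployment might be configured as sketched below (the hosts echo the examples above; the model name assumes the Llama 3.1 8B NIM referenced above):

```bash
# Illustrative environment values for locally deployed services.
RIVA_ASR_URL=localhost:50051
RIVA_TTS_URL=localhost:50051
NVIDIA_LLM_URL=http://<machine-ip>:8000/v1
NVIDIA_LLM_MODEL=meta/llama-3.1-8b-instruct
```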

- #### Example: Switching to the Llama 3.3-70B Model

- To use a larger LLM such as the Llama 3.3-70B model in your deployment, you need to update both the Docker Compose configuration and the environment variables for your Python application. Follow these steps (a compose sketch follows the list):

- - In your `docker-compose.yml` file, find the `nvidia-llm` service section.
- - Change the NIM image to the 70B model: `nvcr.io/nim/meta/llama-3.3-70b-instruct:latest`
- - Update the `device_ids` to allocate at least two GPUs (for example, `['2', '3']`).
- - Update the environment variable under the `python-app` service to `NVIDIA_LLM_MODEL=meta/llama-3.3-70b-instruct`
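Taken together, a sketch of the affected parts of `docker-compose.yml` (only the fields mentioned above are shown; the surrounding service definitions and the exact GPU layout are assumptions about your setup):

```yaml
# Sketch: only the fields touched by the steps above.
services:
  nvidia-llm:
    image: nvcr.io/nim/meta/llama-3.3-70b-instruct:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['2', '3']  # at least two GPUs
              capabilities: [gpu]
  python-app:
    environment:
      - NVIDIA_LLM_MODEL=meta/llama-3.3-70b-instruct
```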

- #### Setting up the Latest Zero-shot Magpie Model

- Follow these steps to configure and use the latest Zero-shot Magpie TTS model:

- 1. **Update Docker Compose Configuration**

-    Modify the `riva-tts-magpie` service in your docker-compose file with the following configuration:

-    ```yaml
-    riva-tts-magpie:
-      image: <magpie-tts-zeroshot-image:version> # Replace this with the actual image tag
-      environment:
-        - NGC_API_KEY=${ZEROSHOT_TTS_NVIDIA_API_KEY}
-        - NIM_HTTP_API_PORT=9000
-        - NIM_GRPC_API_PORT=50051
-      ports:
-        - "49000:50051"
-      shm_size: 16GB
-      deploy:
-        resources:
-          reservations:
-            devices:
-              - driver: nvidia
-                device_ids: ['0']
-                capabilities: [gpu]
 ```

-    - Ensure your `ZEROSHOT_TTS_NVIDIA_API_KEY` is properly set in your `.env` file:
-      ```bash
-      ZEROSHOT_TTS_NVIDIA_API_KEY=
-      ```

- 2. **Configure TTS Voice Settings**

-    Update the following environment variables under the `python-app` service:

-    ```bash
-    RIVA_TTS_VOICE_ID=Magpie-ZeroShot.Female-1
-    RIVA_TTS_MODEL=magpie_tts_ensemble-Magpie-ZeroShot
-    ```

- 3. **Zero-shot Audio Prompt Configuration**

-    To use a custom voice with zero-shot learning:

-    - Add your audio prompt file to the workspace.
-    - Mount the audio file into your container by adding a volume in your `docker-compose.yml` under the `python-app` service:
-      ```yaml
-      services:
-        python-app:
-          # ... existing code ...
-          volumes:
-            - ./audio_prompts:/app/audio_prompts
-      ```
-    - Set the `ZERO_SHOT_AUDIO_PROMPT` environment variable to the path relative to your application root:
-      ```yaml
-      environment:
-        - ZERO_SHOT_AUDIO_PROMPT=audio_prompts/voice_sample.wav # Path relative to app root
-      ```

- Note: The zero-shot audio prompt is only required when using the Magpie Zero-shot model. For standard Magpie multilingual models, this configuration should be omitted.

+ # Voice Agent WebRTC + LangGraph (Quick Start)

+ This example launches a complete voice agent stack:
+ - LangGraph dev server for local agents
+ - Pipecat-based speech pipeline (WebRTC, ASR, LLM adapter, TTS)
+ - Static UI you can open in a browser

+ ## 1) Mandatory environment variables
+ Create `.env` next to this README (or copy from `env.example`) and set at least:

+ - `NVIDIA_API_KEY` or `RIVA_API_KEY`: required for NVIDIA NIM-hosted Riva ASR/TTS
+ - `USE_LANGGRAPH=true`: enable the LangGraph-backed LLM
+ - `LANGGRAPH_BASE_URL` (default `http://127.0.0.1:2024`)
+ - `LANGGRAPH_ASSISTANT` (default `ace-base-agent`)
+ - `USER_EMAIL` (any email for routing, e.g. `test@example.com`)
+ - `LANGGRAPH_STREAM_MODE` (default `values`)
+ - `LANGGRAPH_DEBUG_STREAM` (default `true`)

+ Optional but commonly used:
+ - `RIVA_ASR_LANGUAGE` (default `en-US`)
+ - `RIVA_TTS_LANGUAGE` (default `en-US`)
+ - `RIVA_TTS_VOICE_ID` (e.g. `Magpie-ZeroShot.Female-1`)
+ - `RIVA_TTS_MODEL` (e.g. `magpie_tts_ensemble-Magpie-ZeroShot`)
+ - `ZERO_SHOT_AUDIO_PROMPT` if using Magpie Zero‑shot and a custom voice prompt
+ - `ZERO_SHOT_AUDIO_PROMPT_URL` to auto-download the prompt on startup
+ - `ENABLE_SPECULATIVE_SPEECH` (default `true`)
+ - TURN/Twilio for WebRTC if needed: `TWILIO_ACCOUNT_SID`, `TWILIO_AUTH_TOKEN`, or `TURN_SERVER_URL`, `TURN_USERNAME`, `TURN_PASSWORD`

+ ## 2) What it does
+ - Starts the LangGraph dev server to serve local agents from `agents/`.
+ - Starts the Pipecat pipeline (`pipeline.py`) exposing:
+   - HTTP: `http://<host>:7860` (health and RTC config)
+   - WebSocket: `ws://<host>:7860/ws` for audio and transcripts
+ - Serves the built UI at `http://<host>:9000/` (via the container).

+ By default it uses:
+ - ASR: NVIDIA Riva (NIM) with `RIVA_API_KEY` and `NVIDIA_ASR_FUNCTION_ID`
+ - LLM: LangGraph adapter streaming from the selected assistant
+ - TTS: NVIDIA Riva Magpie (NIM) with `RIVA_API_KEY` and `NVIDIA_TTS_FUNCTION_ID`

+ ## 3) Run

+ ### Option A: Docker (recommended)
+ From this directory:

 ```bash
 docker compose up --build -d
 ```

+ Then open `http://<machine-ip>:9000/`.

+ Chrome on http origins: enable “Insecure origins treated as secure” at `chrome://flags/` and add `http://<machine-ip>:9000`.

+ ### Option B: Python (local)
+ Requires Python 3.12 and `uv`.

 ```bash
 uv run pipeline.py
 ```
+ Then start the UI from `ui/` (see `ui/README.md`).

+ ## 4) Swap TTS providers (Magpie ⇄ ElevenLabs)
+ The default TTS in `pipeline.py` is NVIDIA Riva Magpie via NIM:

+ ```python
+ tts = RivaTTSService(
+     api_key=os.getenv("RIVA_API_KEY"),
+     function_id=os.getenv("NVIDIA_TTS_FUNCTION_ID", "4e813649-d5e4-4020-b2be-2b918396d19d"),
+     voice_id=os.getenv("RIVA_TTS_VOICE_ID", "Magpie-ZeroShot.Female-1"),
+     model=os.getenv("RIVA_TTS_MODEL", "magpie_tts_ensemble-Magpie-ZeroShot"),
+     language=os.getenv("RIVA_TTS_LANGUAGE", "en-US"),
+     zero_shot_audio_prompt_file=(
+         Path(os.getenv("ZERO_SHOT_AUDIO_PROMPT")) if os.getenv("ZERO_SHOT_AUDIO_PROMPT") else None
+     ),
+ )
 ```

+ To use ElevenLabs instead:
+ 1) Ensure the `pipecat` ElevenLabs dependency is available (already included via project deps).
+ 2) Set environment:
+    - `ELEVENLABS_API_KEY`
+    - Optionally `ELEVENLABS_VOICE_ID` and model settings supported by ElevenLabs
+ 3) Change the TTS construction in `pipeline.py` to use `ElevenLabsTTSServiceWithEndOfSpeech`:

 ```python
+ from nvidia_pipecat.services.elevenlabs import ElevenLabsTTSServiceWithEndOfSpeech
+
+ # Replace RivaTTSService(...) with:
+ tts = ElevenLabsTTSServiceWithEndOfSpeech(
+     api_key=os.getenv("ELEVENLABS_API_KEY"),
+     voice_id=os.getenv("ELEVENLABS_VOICE_ID", "Rachel"),
+     sample_rate=16000,
+     channels=1,
+ )
 ```

+ That’s it. No other pipeline changes are required; the transcript synchronization already supports ElevenLabs end‑of‑speech events.

+ Notes for Magpie Zero‑shot:
+ - Provide `RIVA_TTS_VOICE_ID` like `Magpie-ZeroShot.Female-1` and `RIVA_TTS_MODEL` like `magpie_tts_ensemble-Magpie-ZeroShot`.
+ - If using a custom voice prompt, mount it via `docker-compose.yml` and set `ZERO_SHOT_AUDIO_PROMPT`. You can also set `ZERO_SHOT_AUDIO_PROMPT_URL` to auto-download it at startup.

+ ## 5) Troubleshooting
+ - Healthcheck: `curl -f http://localhost:7860/get_prompt`
+ - If the UI can’t access the mic on http, use the Chrome flag above or host the UI via HTTPS.
+ - For NAT/firewall issues, configure TURN or Twilio credentials.