feat: Update Dockerfile and requirements.txt to resolve PyAudio build issues
The main changes are:
1. Added `build-essential` and `libasound2-dev` to the system dependencies in the Dockerfile to ensure the necessary build tools are available.
2. Removed PyAudio from the `requirements.txt` file to avoid the pip installation issues.
3. Added a separate `RUN pip install PyAudio==0.2.14` command in the Dockerfile to install PyAudio manually.
These changes should resolve the build issues with PyAudio on the CUDA server.
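A quick way to confirm the manual install worked inside the built image is to import PyAudio and list the devices PortAudio can see. This is a hedged sketch rather than part of the commit; in a headless container the device count may simply be 0:

```python
# Sanity check that the manually installed PyAudio wheel links against the
# system audio libraries installed above (sketch, not part of the commit).
import pyaudio

pa = pyaudio.PyAudio()
print("devices visible to PortAudio:", pa.get_device_count())
pa.terminate()
```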
Revert "fix: Handle CUDA availability in OmniChatServer"
This reverts commit 28ed763269f75cea8298b3d64449fd7776d05f52.
docs: add PyAudio to dependencies
feat: Replace PyAudio with streamlit-webrtc for user recording
fix: Replace PyAudio with streamlit-webrtc for audio recording
feat: Serve HTML demo instead of Streamlit app
fix: Update API_URL and error handling in webui/omni_html_demo.html
fix: Replace audio playback with text-to-speech
feat: Implement audio processing and response generation
fix: Use a Docker data volume for caching
feat: Add Docker data volume and environment variables for caching
diff --git a/inference.py b/inference.py
index 4d4d4d1..d4d4d1a 100644
--- a/inference.py
+++ b/inference.py
@@ -1,6 +1,7 @@
 def download_model(ckpt_dir):
     repo_id = "gpt-omni/mini-omni"
-    snapshot_download(repo_id, local_dir=ckpt_dir, revision="main")
+    cache_dir = os.environ.get('XDG_CACHE_HOME', '/tmp')
+    snapshot_download(repo_id, local_dir=ckpt_dir, revision="main", cache_dir=cache_dir)
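For context, a self-contained sketch of the same download path with its imports spelled out (assuming `huggingface_hub` is installed; `./checkpoint` is only an example target directory):

```python
import os
from huggingface_hub import snapshot_download

def download_model(ckpt_dir: str) -> None:
    repo_id = "gpt-omni/mini-omni"
    # Respect XDG_CACHE_HOME when set (e.g. /data/cache inside the Docker image),
    # otherwise fall back to /tmp, which is always writable on Spaces.
    cache_dir = os.environ.get("XDG_CACHE_HOME", "/tmp")
    snapshot_download(repo_id, local_dir=ckpt_dir, revision="main", cache_dir=cache_dir)

if __name__ == "__main__":
    download_model("./checkpoint")
```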
fix: Remove cache-related code and update Dockerfile
fix: Add Docker volume and set permissions for model download
fix: Set correct permissions for checkpoint directory
feat: Use DATA volume to store model checkpoint
fix: Set permissions and create necessary subdirectories in the DATA volume
fix: Implement error handling and CUDA Tensor Cores optimization in serve_html.py
fix: Improve error handling and logging in chat endpoint
- Dockerfile +23 -11
- README.md +125 -124
- inference.py +7 -5
- requirements.txt +8 -2
- serve_html.py +70 -0
- server.py +4 -7
- webui/index.html +0 -258
- webui/omni_html_demo.html +13 -8
- webui/omni_streamlit.py +134 -257
diff --git a/Dockerfile b/Dockerfile
@@ -7,7 +7,6 @@ WORKDIR /app
 # Install system dependencies
 RUN apt-get update && apt-get install -y \
     ffmpeg \
-    portaudio19-dev \
     && rm -rf /var/lib/apt/lists/*

 # Copy the current directory contents into the container at /app
@@ -16,20 +15,33 @@ COPY . /app
 # Install any needed packages specified in requirements.txt
 RUN pip install --no-cache-dir -r requirements.txt

-# Make ports 7860 and 60808 available to the world outside this container
-EXPOSE 7860 60808
+# Make port 7860 available to the world outside this container
+EXPOSE 7860

 # Set environment variable for API_URL
-ENV API_URL=http://0.0.0.0:
+ENV API_URL=http://0.0.0.0:7860/chat

 # Set PYTHONPATH
 ENV PYTHONPATH=./

-# Run
-CMD ["
+# Set environment variables
+ENV MPLCONFIGDIR=/tmp/matplotlib
+ENV HF_HOME=/data/huggingface
+ENV XDG_CACHE_HOME=/data/cache
+
+# Create a volume for data
+VOLUME /data
+
+# Set permissions for the /data directory and create necessary subdirectories
+RUN mkdir -p /data/checkpoint /data/cache /data/huggingface && \
+    chown -R 1000:1000 /data && \
+    chmod -R 777 /data
+
+# Install Flask
+RUN pip install flask
+
+# Copy the HTML demo file
+COPY webui/omni_html_demo.html .
+
+# Run the Flask app to serve the HTML demo
+CMD ["python", "serve_html.py"]
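A small sketch (not from the commit; paths taken from the ENV and RUN lines above) for checking at container start-up that the redirected cache locations are writable by the Space's non-root user, which is exactly what the `mkdir`/`chown`/`chmod` step guards against:

```python
import os
import tempfile

# Cache locations configured in the Dockerfile above.
CACHE_DIRS = ["/data/checkpoint", "/data/cache", "/data/huggingface", "/tmp/matplotlib"]

def check_writable(path: str) -> bool:
    """Return True if a file can be created inside `path`."""
    os.makedirs(path, exist_ok=True)
    try:
        with tempfile.NamedTemporaryFile(dir=path):
            return True
    except OSError as exc:
        print(f"{path} is not writable: {exc}")
        return False

if __name__ == "__main__":
    for d in CACHE_DIRS:
        print(d, "writable" if check_writable(d) else "NOT writable")
```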
diff --git a/README.md b/README.md
@@ -1,124 +1,125 @@
+---
+title: Omni Docker
+emoji: 🦀
+colorFrom: green
+colorTo: red
+sdk: docker
+pinned: false
+---
+
+Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+# Mini-Omni
+
+<p align="center"><strong style="font-size: 18px;">
+Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
+</strong>
+</p>
+
+<p align="center">
+🤗 <a href="https://huggingface.co/gpt-omni/mini-omni">Hugging Face</a> | 📖 <a href="https://github.com/gpt-omni/mini-omni">Github</a>
+| 📑 <a href="https://arxiv.org/abs/2408.16725">Technical report</a>
+</p>
+
+Mini-Omni is an open-source multimodal large language model that can **hear, talk while thinking**. Featuring real-time end-to-end speech input and **streaming audio output** conversational capabilities.
+
+<p align="center">
+    <img src="data/figures/frameworkv3.jpg" width="100%"/>
+</p>
+
+
+## Features
+
+✅ **Real-time speech-to-speech** conversational capabilities. No extra ASR or TTS models required.
+
+✅ **Talking while thinking**, with the ability to generate text and audio at the same time.
+
+✅ **Streaming audio output** capabilities.
+
+✅ With "Audio-to-Text" and "Audio-to-Audio" **batch inference** to further boost the performance.
+
+## Demo
+
+NOTE: need to unmute first.
+
+https://github.com/user-attachments/assets/03bdde05-9514-4748-b527-003bea57f118
+
+
+## Install
+
+Create a new conda environment and install the required packages:
+
+```sh
+conda create -n omni python=3.10
+conda activate omni
+
+git clone https://github.com/gpt-omni/mini-omni.git
+cd mini-omni
+pip install -r requirements.txt
+pip install PyAudio==0.2.14
+```
+
+## Quick start
+
+**Interactive demo**
+
+- start server
+
+NOTE: you need to start the server before running the streamlit or gradio demo with API_URL set to the server address.
+
+```sh
+sudo apt-get install ffmpeg
+conda activate omni
+cd mini-omni
+python3 server.py --ip '0.0.0.0' --port 60808
+```
+
+
+- run streamlit demo
+
+NOTE: you need to run streamlit locally with PyAudio installed. For error: `ModuleNotFoundError: No module named 'utils.vad'`, please run `export PYTHONPATH=./` first.
+
+```sh
+pip install PyAudio==0.2.14
+API_URL=http://0.0.0.0:60808/chat streamlit run webui/omni_streamlit.py
+```
+
+- run gradio demo
+```sh
+API_URL=http://0.0.0.0:60808/chat python3 webui/omni_gradio.py
+```
+
+example:
+
+NOTE: need to unmute first. Gradio seems can not play audio stream instantly, so the latency feels a bit longer.
+
+https://github.com/user-attachments/assets/29187680-4c42-47ff-b352-f0ea333496d9
+
+
+**Local test**
+
+```sh
+conda activate omni
+cd mini-omni
+# test run the preset audio samples and questions
+python inference.py
+```
+
+## Common issues
+
+- Error: `ModuleNotFoundError: No module named 'utils.xxxx'`
+
+Answer: run `export PYTHONPATH=./` first.
+
+## Acknowledgements
+
+- [Qwen2](https://github.com/QwenLM/Qwen2/) as the LLM backbone.
+- [litGPT](https://github.com/Lightning-AI/litgpt/) for training and inference.
+- [whisper](https://github.com/openai/whisper/) for audio encoding.
+- [snac](https://github.com/hubertsiuzdak/snac/) for audio decoding.
+- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) for generating synthetic speech.
+- [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) and [MOSS](https://github.com/OpenMOSS/MOSS/tree/main) for alignment.
+
+## Star History
+
+[](https://star-history.com/#gpt-omni/mini-omni&Date)
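As a complement to the quick start above, a hedged sketch of a minimal Python client for the chat endpoint. The host, sample file name, and raw-bytes payload format are assumptions taken from the streamlit demo further down, not part of this commit:

```python
import os
import requests

# The streamlit demo posts raw 16 kHz, 16-bit mono PCM bytes to the /chat endpoint.
API_URL = os.getenv("API_URL", "http://127.0.0.1:60808/chat")

def chat_once(wav_path: str, out_path: str = "answer.wav") -> None:
    with open(wav_path, "rb") as f:
        audio_bytes = f.read()
    # Stream the audio response back and write it to disk.
    with requests.post(API_URL, data=audio_bytes, stream=True, timeout=300) as resp:
        resp.raise_for_status()
        with open(out_path, "wb") as out:
            for chunk in resp.iter_content(chunk_size=4096):
                out.write(chunk)
    print(f"wrote response audio to {out_path}")

if __name__ == "__main__":
    chat_once("data/samples/output1.wav")  # hypothetical sample path
```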
diff --git a/inference.py b/inference.py
@@ -7,6 +7,8 @@ from litgpt import Tokenizer
 from litgpt.utils import (
     num_parameters,
 )
+import matplotlib
+matplotlib.use('Agg')  # Use a non-GUI backend
 from litgpt.generate.base import (
     generate_AA,
     generate_ASR,
@@ -347,8 +349,8 @@ def T1_T2(fabric, input_ids, model, text_tokenizer, step):


 def load_model(ckpt_dir, device):
-    snacmodel = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval().to(device)
-    whispermodel = whisper.load_model("small").to(device)
+    snacmodel = SNAC.from_pretrained("hubertsiuzdak/snac_24khz", cache_dir="/data/cache/snac").eval().to(device)
+    whispermodel = whisper.load_model("small", download_root="/data/cache/whisper").to(device)
     text_tokenizer = Tokenizer(ckpt_dir)
     fabric = L.Fabric(devices=1, strategy="auto")
     config = Config.from_file(ckpt_dir + "/model_config.yaml")
@@ -367,12 +369,12 @@ def load_model(ckpt_dir, device):

 def download_model(ckpt_dir):
     repo_id = "gpt-omni/mini-omni"
-    snapshot_download(repo_id, local_dir=ckpt_dir, revision="main")
+    snapshot_download(repo_id, local_dir=ckpt_dir, revision="main", cache_dir="/data/huggingface")


 class OmniInference:

-    def __init__(self, ckpt_dir='
+    def __init__(self, ckpt_dir='/data/checkpoint', device='cuda:0'):
         self.device = device
         if not os.path.exists(ckpt_dir):
             print(f"checkpoint directory {ckpt_dir} not found, downloading from huggingface")
@@ -508,7 +510,7 @@ class OmniInference:
 def test_infer():
     device = "cuda:0"
     out_dir = f"./output/{get_time_str()}"
-    ckpt_dir = f"
+    ckpt_dir = f"/data/checkpoint"
     if not os.path.exists(ckpt_dir):
         print(f"checkpoint directory {ckpt_dir} not found, downloading from huggingface")
         download_model(ckpt_dir)
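A short usage sketch showing how the relocated defaults are picked up. Assumptions: the `/data` paths exist and a CUDA device is available, as hard-coded in the diff above; the sample wav path is hypothetical:

```python
import os
from inference import OmniInference, download_model

ckpt_dir = "/data/checkpoint"  # default baked into the Dockerfile's DATA volume
if not os.path.exists(ckpt_dir):
    download_model(ckpt_dir)  # caches under /data/huggingface per the diff above

# device defaults to 'cuda:0'; pass 'cpu' explicitly on machines without a GPU.
omni = OmniInference(ckpt_dir=ckpt_dir, device="cuda:0")
omni.warm_up()

# Stream the reply audio for a prompt wav (hypothetical example path).
for chunk in omni.run_AT_batch_stream("data/samples/output1.wav"):
    pass  # each chunk is wav-encoded bytes that can be written to a file or socket
```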
diff --git a/requirements.txt b/requirements.txt
@@ -6,8 +6,14 @@ snac==1.2.0
 soundfile==0.12.1
 openai-whisper==20231117
 tokenizers==0.15.2
-
-
+torch==2.2.1
+torchvision==0.17.1
+torchaudio==2.2.1
+litgpt==0.4.3
+snac==1.2.0
+soundfile==0.12.1
+openai-whisper==20231117
+tokenizers==0.15.2
 pydub==0.25.1
 onnxruntime==1.17.1
 numpy==1.26.4
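A small sketch for verifying inside the image that the pinned packages above resolved to the expected versions (distribution names taken from the list above; anything missing is reported rather than raised):

```python
from importlib.metadata import version

PINNED = ["torch", "torchvision", "torchaudio", "litgpt", "snac",
          "soundfile", "openai-whisper", "tokenizers", "pydub",
          "onnxruntime", "numpy"]

for name in PINNED:
    try:
        print(f"{name}=={version(name)}")
    except Exception as exc:  # e.g. PackageNotFoundError
        print(f"{name}: not installed ({exc})")
```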
diff --git a/serve_html.py b/serve_html.py
@@ -0,0 +1,70 @@
+import torch
+torch.set_float32_matmul_precision('high')
+
+from flask import Flask, send_from_directory, request, Response
+import os
+import base64
+import numpy as np
+from inference import OmniInference
+import io
+
+app = Flask(__name__)
+
+# Initialize OmniInference
+try:
+    print("Initializing OmniInference...")
+    omni = OmniInference()
+    print("OmniInference initialized successfully.")
+except Exception as e:
+    print(f"Error initializing OmniInference: {str(e)}")
+    raise
+
+@app.route('/')
+def serve_html():
+    return send_from_directory('.', 'webui/omni_html_demo.html')
+
+@app.route('/chat', methods=['POST'])
+def chat():
+    try:
+        audio_data = request.json['audio']
+        if not audio_data:
+            return "No audio data received", 400
+
+        # Check if the audio_data contains the expected base64 prefix
+        if ',' in audio_data:
+            audio_bytes = base64.b64decode(audio_data.split(',')[1])
+        else:
+            audio_bytes = base64.b64decode(audio_data)
+
+        # Save audio to a temporary file
+        temp_audio_path = 'temp_audio.wav'
+        with open(temp_audio_path, 'wb') as f:
+            f.write(audio_bytes)
+
+        # Generate response using OmniInference
+        try:
+            response_generator = omni.run_AT_batch_stream(temp_audio_path)
+
+            # Concatenate all audio chunks
+            all_audio = b''
+            for audio_chunk in response_generator:
+                all_audio += audio_chunk
+
+            # Clean up temporary file
+            os.remove(temp_audio_path)
+
+            return Response(all_audio, mimetype='audio/wav')
+        except Exception as inner_e:
+            print(f"Error in OmniInference processing: {str(inner_e)}")
+            return f"An error occurred during audio processing: {str(inner_e)}", 500
+        finally:
+            # Ensure temporary file is removed even if an error occurs
+            if os.path.exists(temp_audio_path):
+                os.remove(temp_audio_path)
+
+    except Exception as e:
+        print(f"Error in chat endpoint: {str(e)}")
+        return f"An error occurred: {str(e)}", 500
+
+if __name__ == '__main__':
+    app.run(host='0.0.0.0', port=7860)
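For reference, a hedged client sketch for this Flask endpoint (the host and port come from the Dockerfile above; the sample file name is hypothetical). It mirrors what the HTML demo does from the browser: base64-encode the recording, POST it as JSON, and save the returned WAV:

```python
import base64
import requests

SPACE_URL = "http://127.0.0.1:7860/chat"  # assumption: container exposed locally on 7860

def send_recording(wav_path: str, out_path: str = "reply.wav") -> None:
    with open(wav_path, "rb") as f:
        payload = {"audio": base64.b64encode(f.read()).decode("ascii")}
    resp = requests.post(SPACE_URL, json=payload, timeout=300)
    resp.raise_for_status()
    # The endpoint returns the concatenated response audio as audio/wav bytes.
    with open(out_path, "wb") as out:
        out.write(resp.content)

if __name__ == "__main__":
    send_recording("my_question.wav")  # hypothetical input recording
```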
diff --git a/server.py b/server.py
@@ -2,21 +2,17 @@ import flask
 import base64
 import tempfile
 import traceback
-import torch
 from flask import Flask, Response, stream_with_context
 from inference import OmniInference


 class OmniChatServer(object):
     def __init__(self, ip='0.0.0.0', port=60808, run_app=True,
-                 ckpt_dir='./checkpoint', device=
+                 ckpt_dir='./checkpoint', device='cuda:0') -> None:
         server = Flask(__name__)
         # CORS(server, resources=r"/*")
         # server.config["JSON_AS_ASCII"] = False

-        if device is None:
-            device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
-
         self.client = OmniInference(ckpt_dir, device)
         self.client.warm_up()

@@ -50,8 +46,9 @@ def create_app():
     return server.server


-def serve(ip='0.0.0.0', port=60808, device=
-
+def serve(ip='0.0.0.0', port=60808, device='cuda:0'):
+
+    OmniChatServer(ip, port=port, run_app=True, device=device)


 if __name__ == "__main__":
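The revert noted in the commit list put the hard-coded `'cuda:0'` default back. A hedged sketch of the kind of fallback that was removed, which callers can still apply themselves when constructing the server:

```python
import torch
from server import OmniChatServer  # class shown in the diff above

def pick_device(preferred: str = "cuda:0") -> str:
    """Use the preferred CUDA device when available, otherwise fall back to CPU."""
    return preferred if torch.cuda.is_available() else "cpu"

if __name__ == "__main__":
    # run_app=True starts the Flask app, mirroring serve() in server.py.
    OmniChatServer("0.0.0.0", port=60808, run_app=True, device=pick_device())
```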
diff --git a/webui/index.html b/webui/index.html
@@ -1,258 +0,0 @@
-<!DOCTYPE html>
-<html lang="en">
-<head>
-    <meta charset="UTF-8">
-    <meta name="viewport" content="width=device-width, initial-scale=1.0">
-    <title>Mini-Omni Chat Demo</title>
-    <style>
-        body {
-            background-color: black;
-            color: white;
-            font-family: Arial, sans-serif;
-        }
-        #chat-container {
-            height: 300px;
-            overflow-y: auto;
-            border: 1px solid #444;
-            padding: 10px;
-            margin-bottom: 10px;
-        }
-        #status-message {
-            margin-bottom: 10px;
-        }
-        button {
-            margin-right: 10px;
-        }
-    </style>
-</head>
-<body>
-    <div id="svg-container"></div>
-    <div id="chat-container"></div>
-    <div id="status-message">Current status: idle</div>
-    <button id="start-button">Start</button>
-    <button id="stop-button" disabled>Stop</button>
-    <main>
-        <p id="current-status">Current status: idle</p>
-    </main>
-</body>
-<script>
-    // Load the SVG
-    const svgContainer = document.getElementById('svg-container');
-    const svgContent = `
-        <svg width="800" height="600" viewBox="0 0 800 600" xmlns="http://www.w3.org/2000/svg">
-            <ellipse id="left-eye" cx="340" cy="200" rx="20" ry="20" fill="white"/>
-            <circle id="left-pupil" cx="340" cy="200" r="8" fill="black"/>
-            <ellipse id="right-eye" cx="460" cy="200" rx="20" ry="20" fill="white"/>
-            <circle id="right-pupil" cx="460" cy="200" r="8" fill="black"/>
-            <path id="upper-lip" d="M 300 300 C 350 284, 450 284, 500 300" stroke="white" stroke-width="10" fill="none"/>
-            <path id="lower-lip" d="M 300 300 C 350 316, 450 316, 500 300" stroke="white" stroke-width="10" fill="none"/>
-        </svg>`;
-    svgContainer.innerHTML = svgContent;
-
-    // Set up audio context
-    const audioContext = new (window.AudioContext || window.webkitAudioContext)();
-    const analyser = audioContext.createAnalyser();
-    analyser.fftSize = 256;
-
-    // Animation variables
-    let isAudioPlaying = false;
-    let lastBlinkTime = 0;
-    let eyeMovementOffset = { x: 0, y: 0 };
-
-    // Chat variables
-    let mediaRecorder;
-    let audioChunks = [];
-    let isRecording = false;
-    const API_URL = 'http://127.0.0.1:60808/chat';
-
-    // Idle eye animation function
-    function animateIdleEyes(timestamp) {
-        const leftEye = document.getElementById('left-eye');
-        const rightEye = document.getElementById('right-eye');
-        const leftPupil = document.getElementById('left-pupil');
-        const rightPupil = document.getElementById('right-pupil');
-        const baseEyeX = { left: 340, right: 460 };
-        const baseEyeY = 200;
-
-        // Blink effect
-        const blinkInterval = 4000 + Math.random() * 2000; // Random blink interval between 4-6 seconds
-        if (timestamp - lastBlinkTime > blinkInterval) {
-            leftEye.setAttribute('ry', '2');
-            rightEye.setAttribute('ry', '2');
-            leftPupil.setAttribute('ry', '0.8');
-            rightPupil.setAttribute('ry', '0.8');
-            setTimeout(() => {
-                leftEye.setAttribute('ry', '20');
-                rightEye.setAttribute('ry', '20');
-                leftPupil.setAttribute('ry', '8');
-                rightPupil.setAttribute('ry', '8');
-            }, 150);
-            lastBlinkTime = timestamp;
-        }
-
-        // Subtle eye movement
-        const movementSpeed = 0.001;
-        eyeMovementOffset.x = Math.sin(timestamp * movementSpeed) * 6;
-        eyeMovementOffset.y = Math.cos(timestamp * movementSpeed * 1.3) * 1; // Reduced vertical movement
-
-        leftEye.setAttribute('cx', baseEyeX.left + eyeMovementOffset.x);
-        leftEye.setAttribute('cy', baseEyeY + eyeMovementOffset.y);
-        rightEye.setAttribute('cx', baseEyeX.right + eyeMovementOffset.x);
-        rightEye.setAttribute('cy', baseEyeY + eyeMovementOffset.y);
-        leftPupil.setAttribute('cx', baseEyeX.left + eyeMovementOffset.x);
-        leftPupil.setAttribute('cy', baseEyeY + eyeMovementOffset.y);
-        rightPupil.setAttribute('cx', baseEyeX.right + eyeMovementOffset.x);
-        rightPupil.setAttribute('cy', baseEyeY + eyeMovementOffset.y);
-    }
-
-    // Main animation function
-    function animate(timestamp) {
-        if (isAudioPlaying) {
-            const dataArray = new Uint8Array(analyser.frequencyBinCount);
-            analyser.getByteFrequencyData(dataArray);
-
-            // Calculate the average amplitude in the speech frequency range
-            const speechRange = dataArray.slice(5, 80); // Adjust based on your needs
-            const averageAmplitude = speechRange.reduce((a, b) => a + b) / speechRange.length;
-
-            // Normalize the amplitude (0-1 range)
-            const normalizedAmplitude = averageAmplitude / 255;
-
-            // Animate mouth
-            const upperLip = document.getElementById('upper-lip');
-            const lowerLip = document.getElementById('lower-lip');
-            const baseY = 300;
-            const maxMovement = 60;
-            const newUpperY = baseY - normalizedAmplitude * maxMovement;
-            const newLowerY = baseY + normalizedAmplitude * maxMovement;
-
-            // Adjust control points for more natural movement
-            const upperControlY1 = newUpperY - 8;
-            const upperControlY2 = newUpperY - 8;
-            const lowerControlY1 = newLowerY + 8;
-            const lowerControlY2 = newLowerY + 8;
-
-            upperLip.setAttribute('d', `M 300 ${baseY} C 350 ${upperControlY1}, 450 ${upperControlY2}, 500 ${baseY}`);
-            lowerLip.setAttribute('d', `M 300 ${baseY} C 350 ${lowerControlY1}, 450 ${lowerControlY2}, 500 ${baseY}`);
-
-            // Animate eyes
-            const leftEye = document.getElementById('left-eye');
-            const rightEye = document.getElementById('right-eye');
-            const leftPupil = document.getElementById('left-pupil');
-            const rightPupil = document.getElementById('right-pupil');
-            const baseEyeY = 200;
-            const maxEyeMovement = 10;
-            const newEyeY = baseEyeY - normalizedAmplitude * maxEyeMovement;
-
-            leftEye.setAttribute('cy', newEyeY);
-            rightEye.setAttribute('cy', newEyeY);
-            leftPupil.setAttribute('cy', newEyeY);
-            rightPupil.setAttribute('cy', newEyeY);
-        } else {
-            animateIdleEyes(timestamp);
-        }
-
-        requestAnimationFrame(animate);
-    }
-
-    // Start animation
-    animate();
-
-    // Chat functions
-    function startRecording() {
-        navigator.mediaDevices.getUserMedia({ audio: true })
-            .then(stream => {
-                mediaRecorder = new MediaRecorder(stream);
-                mediaRecorder.ondataavailable = event => {
-                    audioChunks.push(event.data);
-                };
-                mediaRecorder.onstop = sendAudioToServer;
-                mediaRecorder.start();
-                isRecording = true;
-                updateStatus('Recording...');
-                document.getElementById('start-button').disabled = true;
-                document.getElementById('stop-button').disabled = false;
-            })
-            .catch(error => {
-                console.error('Error accessing microphone:', error);
-                updateStatus('Error: ' + error.message);
-            });
-    }
-
-    function stopRecording() {
-        if (mediaRecorder && isRecording) {
-            mediaRecorder.stop();
-            isRecording = false;
-            updateStatus('Processing...');
-            document.getElementById('start-button').disabled = false;
-            document.getElementById('stop-button').disabled = true;
-        }
-    }
-
-    function sendAudioToServer() {
-        const audioBlob = new Blob(audioChunks, { type: 'audio/wav' });
-        const reader = new FileReader();
-        reader.readAsDataURL(audioBlob);
-        reader.onloadend = function() {
-            const base64Audio = reader.result.split(',')[1];
-            fetch(API_URL, {
-                method: 'POST',
-                headers: {
-                    'Content-Type': 'application/json',
-                },
-                body: JSON.stringify({ audio: base64Audio }),
-            })
-            .then(response => response.blob())
-            .then(blob => {
-                const audioUrl = URL.createObjectURL(blob);
-                playResponseAudio(audioUrl);
-                updateChatHistory('User', 'Audio message sent');
-                updateChatHistory('Assistant', 'Audio response received');
-            })
-            .catch(error => {
-                console.error('Error:', error);
-                updateStatus('Error: ' + error.message);
-            });
-        };
-        audioChunks = [];
-    }
-
-    function playResponseAudio(audioUrl) {
-        const audio = new Audio(audioUrl);
-        audio.onloadedmetadata = () => {
-            const source = audioContext.createMediaElementSource(audio);
-            source.connect(analyser);
-            analyser.connect(audioContext.destination);
-        };
-        audio.onplay = () => {
-            isAudioPlaying = true;
-            updateStatus('Playing response...');
-        };
-        audio.onended = () => {
-            isAudioPlaying = false;
-            updateStatus('Idle');
-        };
-        audio.play();
-    }
-
-    function updateChatHistory(role, message) {
-        const chatContainer = document.getElementById('chat-container');
-        const messageElement = document.createElement('p');
-        messageElement.textContent = `${role}: ${message}`;
-        chatContainer.appendChild(messageElement);
-        chatContainer.scrollTop = chatContainer.scrollHeight;
-    }
-
-    function updateStatus(status) {
-        document.getElementById('status-message').textContent = status;
-        document.getElementById('current-status').textContent = 'Current status: ' + status;
-    }
-
-    // Event listeners
-    document.getElementById('start-button').addEventListener('click', startRecording);
-    document.getElementById('stop-button').addEventListener('click', stopRecording);
-
-    // Initialize
-    updateStatus('Idle');
-</script>
-</html>
diff --git a/webui/omni_html_demo.html b/webui/omni_html_demo.html
@@ -21,7 +21,7 @@
     <audio id="audioPlayback" controls style="display:none;"></audio>

     <script>
-        const API_URL = '
+        const API_URL = '/chat';
         const recordButton = document.getElementById('recordButton');
         const chatHistory = document.getElementById('chatHistory');
         const audioPlayback = document.getElementById('audioPlayback');
@@ -86,12 +86,13 @@
                 }
             });

-            const
-            const
-
-
-
-
+            const responseBlob = await new Response(stream).blob();
+            const audioUrl = URL.createObjectURL(responseBlob);
+            updateChatHistory('AI', audioUrl);
+
+            // Play the audio response
+            const audio = new Audio(audioUrl);
+            audio.play();
         } else {
             console.error('API response not ok:', response.status);
             updateChatHistory('AI', 'Error in API response');
@@ -99,7 +100,11 @@
         };
     } catch (error) {
         console.error('Error sending audio to API:', error);
-
+        if (error.name === 'TypeError' && error.message === 'Failed to fetch') {
+            updateChatHistory('AI', 'Error: Unable to connect to the server. Please ensure the server is running and accessible.');
+        } else {
+            updateChatHistory('AI', 'Error communicating with the server: ' + error.message);
+        }
     }
 }

diff --git a/webui/omni_streamlit.py b/webui/omni_streamlit.py
@@ -1,257 +1,134 @@
-import streamlit as st
-        stream.write(audio_data)
-    except Exception as e:
-        st.error(f"Error during audio streaming: {e}")
-
-    out_file = save_tmp_audio(output_audio_bytes)
-    with st.chat_message("assistant"):
-        st.audio(out_file, format="audio/wav", loop=False, autoplay=False)
-    st.session_state.messages.append(
-        {"role": "assistant", "content": out_file, "type": "audio"}
-    )
-
-    wf.close()
-    # Close PyAudio stream and terminate PyAudio
-    stream.stop_stream()
-    stream.close()
-    p.terminate()
-    st.session_state.speaking = False
-    st.session_state.recording = True
-
-
-def recording(status):
-    audio = pyaudio.PyAudio()
-
-    stream = audio.open(
-        format=IN_FORMAT,
-        channels=IN_CHANNELS,
-        rate=IN_RATE,
-        input=True,
-        frames_per_buffer=IN_CHUNK,
-    )
-
-    temp_audio = b""
-    vad_audio = b""
-
-    start_talking = False
-    last_temp_audio = None
-    st.session_state.frames = []
-
-    while st.session_state.recording:
-        status.success("Listening...")
-        audio_bytes = stream.read(IN_CHUNK)
-        temp_audio += audio_bytes
-
-        if len(temp_audio) > IN_SAMPLE_WIDTH * IN_RATE * IN_CHANNELS * VAD_STRIDE:
-            dur_vad, vad_audio_bytes, time_vad = run_vad(temp_audio, IN_RATE)
-
-            print(f"duration_after_vad: {dur_vad:.3f} s, time_vad: {time_vad:.3f} s")
-
-            if dur_vad > 0.2 and not start_talking:
-                if last_temp_audio is not None:
-                    st.session_state.frames.append(last_temp_audio)
-                start_talking = True
-            if start_talking:
-                st.session_state.frames.append(temp_audio)
-            if dur_vad < 0.1 and start_talking:
-                st.session_state.recording = False
-                print(f"speech end detected. excit")
-            last_temp_audio = temp_audio
-            temp_audio = b""
-
-    stream.stop_stream()
-    stream.close()
-
-    audio.terminate()
-
-
-def main():
-
-    st.title("Chat Mini-Omni Demo")
-    status = st.empty()
-
-    if "warm_up" not in st.session_state:
-        warm_up()
-        st.session_state.warm_up = True
-    if "start" not in st.session_state:
-        st.session_state.start = False
-    if "recording" not in st.session_state:
-        st.session_state.recording = False
-    if "speaking" not in st.session_state:
-        st.session_state.speaking = False
-    if "frames" not in st.session_state:
-        st.session_state.frames = []
-
-    if not st.session_state.start:
-        status.warning("Click Start to chat")
-
-    start_col, stop_col, _ = st.columns([0.2, 0.2, 0.6])
-    start_button = start_col.button("Start", key="start_button")
-    # stop_button = stop_col.button("Stop", key="stop_button")
-    if start_button:
-        time.sleep(1)
-        st.session_state.recording = True
-        st.session_state.start = True
-
-    for message in st.session_state.messages:
-        with st.chat_message(message["role"]):
-            if message["type"] == "msg":
-                st.markdown(message["content"])
-            elif message["type"] == "img":
-                st.image(message["content"], width=300)
-            elif message["type"] == "audio":
-                st.audio(
-                    message["content"], format="audio/wav", loop=False, autoplay=False
-                )
-
-    while st.session_state.start:
-        if st.session_state.recording:
-            recording(status)
-
-        if not st.session_state.recording and st.session_state.start:
-            st.session_state.speaking = True
-            speaking(status)
-
-        # if stop_button:
-        #     status.warning("Stopped, click Start to chat")
-        #     st.session_state.start = False
-        #     st.session_state.recording = False
-        #     st.session_state.frames = []
-        #     break
-
-
-if __name__ == "__main__":
-    main()
+import streamlit as st
+import numpy as np
+import requests
+import base64
+import tempfile
+import os
+import time
+import traceback
+import librosa
+from pydub import AudioSegment
+from streamlit_webrtc import webrtc_streamer, WebRtcMode, RTCConfiguration
+import av
+from utils.vad import get_speech_timestamps, collect_chunks, VadOptions
+
+API_URL = os.getenv("API_URL", "http://127.0.0.1:60808/chat")
+
+# Initialize chat history
+if "messages" not in st.session_state:
+    st.session_state.messages = []
+
+def run_vad(audio, sr):
+    _st = time.time()
+    try:
+        audio = audio.astype(np.float32) / 32768.0
+        sampling_rate = 16000
+        if sr != sampling_rate:
+            audio = librosa.resample(audio, orig_sr=sr, target_sr=sampling_rate)
+
+        vad_parameters = {}
+        vad_parameters = VadOptions(**vad_parameters)
+        speech_chunks = get_speech_timestamps(audio, vad_parameters)
+        audio = collect_chunks(audio, speech_chunks)
+        duration_after_vad = audio.shape[0] / sampling_rate
+
+        if sr != sampling_rate:
+            # resample to original sampling rate
+            vad_audio = librosa.resample(audio, orig_sr=sampling_rate, target_sr=sr)
+        else:
+            vad_audio = audio
+        vad_audio = np.round(vad_audio * 32768.0).astype(np.int16)
+        vad_audio_bytes = vad_audio.tobytes()
+
+        return duration_after_vad, vad_audio_bytes, round(time.time() - _st, 4)
+    except Exception as e:
+        msg = f"[asr vad error] audio_len: {len(audio)/(sr):.3f} s, trace: {traceback.format_exc()}"
+        print(msg)
+        return -1, audio.tobytes(), round(time.time() - _st, 4)
+
+def save_tmp_audio(audio_bytes):
+    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmpfile:
+        file_name = tmpfile.name
+        audio = AudioSegment(
+            data=audio_bytes,
+            sample_width=2,
+            frame_rate=16000,
+            channels=1,
+        )
+        audio.export(file_name, format="wav")
+    return file_name
+
+def main():
+    st.title("Chat Mini-Omni Demo")
+    status = st.empty()
+
+    if "audio_buffer" not in st.session_state:
+        st.session_state.audio_buffer = []
+
+    webrtc_ctx = webrtc_streamer(
+        key="speech-to-text",
+        mode=WebRtcMode.SENDONLY,
+        audio_receiver_size=1024,
+        rtc_configuration=RTCConfiguration(
+            {"iceServers": [{"urls": ["stun:stun.l.google.com:19302"]}]}
+        ),
+        media_stream_constraints={"video": False, "audio": True},
+    )
+
+    if webrtc_ctx.audio_receiver:
+        while True:
+            try:
+                audio_frame = webrtc_ctx.audio_receiver.get_frame(timeout=1)
+                sound_chunk = np.frombuffer(audio_frame.to_ndarray(), dtype="int16")
+                st.session_state.audio_buffer.extend(sound_chunk)
+
+                if len(st.session_state.audio_buffer) >= 16000:
+                    duration_after_vad, vad_audio_bytes, vad_time = run_vad(
+                        np.array(st.session_state.audio_buffer), 16000
+                    )
+                    st.session_state.audio_buffer = []
+                    if duration_after_vad > 0:
+                        st.session_state.messages.append(
+                            {"role": "user", "content": "User audio"}
+                        )
+                        file_name = save_tmp_audio(vad_audio_bytes)
+                        st.audio(file_name, format="audio/wav")
+
+                        response = requests.post(API_URL, data=vad_audio_bytes)
+                        assistant_audio_bytes = response.content
+                        assistant_file_name = save_tmp_audio(assistant_audio_bytes)
+                        st.audio(assistant_file_name, format="audio/wav")
+                        st.session_state.messages.append(
+                            {"role": "assistant", "content": "Assistant response"}
+                        )
+            except Exception as e:
+                print(f"Error in audio processing: {e}")
+                break
+
+    if st.button("Process Audio"):
+        if st.session_state.audio_buffer:
+            duration_after_vad, vad_audio_bytes, vad_time = run_vad(
+                np.array(st.session_state.audio_buffer), 16000
+            )
+            st.session_state.messages.append({"role": "user", "content": "User audio"})
+            file_name = save_tmp_audio(vad_audio_bytes)
+            st.audio(file_name, format="audio/wav")
+
+            response = requests.post(API_URL, data=vad_audio_bytes)
+            assistant_audio_bytes = response.content
+            assistant_file_name = save_tmp_audio(assistant_audio_bytes)
+            st.audio(assistant_file_name, format="audio/wav")
+            st.session_state.messages.append(
+                {"role": "assistant", "content": "Assistant response"}
+            )
+            st.session_state.audio_buffer = []
+
+    if st.session_state.messages:
+        for message in st.session_state.messages:
+            if message["role"] == "user":
+                st.write(f"User: {message['content']}")
+            else:
+                st.write(f"Assistant: {message['content']}")
+
+if __name__ == "__main__":
+    main()
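A standalone sketch of the VAD step used by `run_vad` above (assumes `utils/vad.py` from the repo is on `PYTHONPATH`; the synthetic input is just silence plus noise, so the exact amount of detected speech will vary):

```python
import numpy as np
from utils.vad import get_speech_timestamps, collect_chunks, VadOptions

sr = 16000
# One second of silence followed by one second of noise, as 16-bit PCM.
silence = np.zeros(sr, dtype=np.int16)
noise = (np.random.randn(sr) * 3000).astype(np.int16)
pcm = np.concatenate([silence, noise])

audio = pcm.astype(np.float32) / 32768.0           # same scaling as run_vad
chunks = get_speech_timestamps(audio, VadOptions())  # default VAD options, as above
kept = collect_chunks(audio, chunks)
print(f"speech kept: {kept.shape[0] / sr:.2f} s out of {audio.shape[0] / sr:.2f} s")
```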