Akis Giannoukos committed on
Commit ada1ece · 1 Parent(s): 5731404

changed heading and removed Apply model and restart button

Files changed (2)
  1. README.md +15 -12
  2. app.py +3 -3
README.md CHANGED
@@ -32,7 +32,7 @@ Stop automatically when confidence in all PHQ-9 items is sufficiently high.
 
 Produce a final PHQ-9 severity report.
 
-The system will use MedGemma-4B-IT (instruction-tuned medical LLM) as the base model for both:
+The system will use a configurable LLM (e.g., Gemma-2-2B-IT or MedGemma-4B-IT) as the base model for both:
 
 -A Recording Agent (conversational component)
 
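The "configurable LLM" is selected by model id at runtime. A minimal sketch of what that could mean, assuming the Hugging Face transformers pipeline API and the ids quoted later in app.py (google/gemma-2-2b-it, google/medgemma-4b-it); the actual loader in app.py is not part of this diff:

```python
# Hypothetical loader sketch -- not the app.py implementation.
from transformers import pipeline

DEFAULT_MODEL_ID = "google/gemma-2-2b-it"  # or "google/medgemma-4b-it"

def load_chat_model(model_id: str = DEFAULT_MODEL_ID):
    # One text-generation backbone serves both the Recording Agent and the
    # Scoring Agent; only the prompts differ.
    return pipeline("text-generation", model=model_id)
```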
@@ -44,11 +44,11 @@ The system will use MedGemma-4B-IT (instruction-tuned medical LLM) as the base m
 Component Description
 -Frontend Client: Handles user interaction, voice input/output, and UI display.
 -Speech I/O Module: Converts speech to text (ASR) and text to speech (TTS).
--Feature Extraction Module: Extracts acoustic and prosodic features via OpenSmile for emotional/speech analysis.
+-Feature Extraction Module: Extracts acoustic and prosodic features via librosa (lightweight prosody proxies) for emotional/speech analysis.
 -Recording Agent (Chatbot): Conducts clinician-like conversation with adaptive questioning.
 -Scoring Agent: Evaluates PHQ-9 symptom probabilities after each exchange and determines confidence in final diagnosis.
 Controller / Orchestrator: Manages communication between agents and triggers scoring cycles.
-Model Backend: Hosts MedGemma-4B-IT, fine-tuned or prompted for clinician reasoning.
+Model Backend: Hosts a configurable LLM (e.g., Gemma-2-2B-IT, MedGemma-4B-IT), prompted for clinician reasoning.
 
 2.2 Architecture Diagram (Text Description)
 ┌───────────────────────┐
@@ -68,8 +68,8 @@ Model Backend: Hosts MedGemma-4B-IT, fine-tuned or prompted for clinician reason
 │
 ▼
 ┌────────────────────────────┐
-│ Feature Extraction Module │
-│ - OpenSmile (prosody, pitch)│
+│ Feature Extraction Module │
+│ - librosa (prosody pitch, energy/loudness, timing/phonation)│
 └─────────┬──────────────────┘
 │
 ▼
@@ -106,7 +106,7 @@ Maintain conversational context.
 
 Adapt follow-up questions based on inferred patient state.
 
-Produce text responses using MedGemma-4B-IT with a clinician-style prompt template.
+Produce text responses using a configurable LLM (e.g., Gemma-2-2B-IT, MedGemma-4B-IT) with a clinician-style prompt template.
 
 After each user response, trigger the Scoring Agent to reassess.
 
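One way a "clinician-style prompt template" could be structured (illustrative only; the real template lives in app.py, outside this diff):

```python
# Hypothetical clinician-style prompt template (names and wording illustrative).
CLINICIAN_PROMPT = """You are an empathetic clinical interviewer. Ask one short,
open-ended follow-up question at a time, adapting to the patient's last answer.
Do not diagnose, give advice, or mention PHQ-9 scoring.

Conversation so far:
{history}

Patient: {utterance}
Clinician:"""

def build_prompt(history: str, utterance: str) -> str:
    # Fill the template with the running transcript and the latest turn.
    return CLINICIAN_PROMPT.format(history=history, utterance=utterance)
```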
@@ -143,7 +143,10 @@ Parse the full transcript and extract statements relevant to each PHQ-9 item.
 
 Combine textual cues + acoustic cues.
 
-Use MedGemma's reasoning chain to map features to PHQ-9 scores.
+Fusion mechanism: Acoustic features are summarized into a compact JSON and included in the scoring prompt alongside the transcript (early, prompt-level fusion).
+
+Use the LLM's reasoning chain to map features to PHQ-9 scores.
+
 
 When confidence for all ≥ threshold τ (e.g., 0.8), finalize results and signal termination.
 
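The added "prompt-level fusion" line is the key design change: a small JSON summary of librosa features rides along with the transcript in the scoring prompt instead of an OpenSmile feature vector. A sketch under that assumption, using the features named in the updated 5.1 table (RMS, ZCR, spectral centroid, f0, duration); function and key names are illustrative:

```python
# Sketch of prompt-level fusion: librosa features -> compact JSON for the prompt.
import json

import librosa
import numpy as np

def summarize_audio_features(wav_path: str) -> str:
    y, sr = librosa.load(wav_path, sr=16000)
    f0, voiced_flag, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr)  # pitch track
    summary = {
        "duration_s": round(len(y) / sr, 2),
        "rms_mean": float(np.mean(librosa.feature.rms(y=y))),  # energy/loudness
        "zcr_mean": float(np.mean(librosa.feature.zero_crossing_rate(y))),
        "spectral_centroid_hz": float(np.mean(librosa.feature.spectral_centroid(y=y, sr=sr))),
        "f0_median_hz": float(np.nanmedian(f0)),      # prosody proxy
        "voiced_ratio": float(np.mean(voiced_flag)),  # phonation/timing proxy
    }
    # The JSON string is appended to the scoring prompt next to the transcript.
    return json.dumps(summary)
```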
@@ -153,7 +156,7 @@ User speaks → Audio captured.
 
 ASR transcribes text.
 
-OpenSmile extracts voice features.
+librosa/OpenSmile extracts voice features (prosody proxies).
 
 Recording Agent uses transcript (and optionally summarized features) → next conversational message.
 
@@ -167,13 +170,13 @@ TTS module vocalizes clinician output.
 
 5.1 Models and Libraries
 Function Tool / Library
-Base LLM MedGemma-4B-IT (from Hugging Face)
+Base LLM Configurable (e.g., Gemma-2-2B-IT, MedGemma-4B-IT)
 ASR Whisper
 TTS gTTS (preferably), Coqui TTS, or Bark
-Audio Features OpenSmile (IS09/ComParE configs)
-Backend Python / FastAPI server
+Audio Features librosa (RMS, ZCR, spectral centroid, f0, energy, duration)
+Backend Python / Gradio (Spaces)
 Frontend Gradio
-Communication WebSocket or REST APIs
+Communication Gradio UI
 
 5.2 Confidence Computation
 
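The body of 5.2 Confidence Computation is unchanged by this commit and not shown; as a hedged illustration of the termination rule stated above (finalize once every PHQ-9 item's confidence clears τ):

```python
# Hypothetical termination check for the rule "all confidences >= tau".
TAU = 0.8  # example threshold from the README

def should_finalize(confidences: dict[str, float], tau: float = TAU) -> bool:
    # `confidences` maps each of the 9 PHQ-9 items to a confidence in [0, 1].
    return len(confidences) == 9 and all(c >= tau for c in confidences.values())
```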
app.py CHANGED
@@ -579,8 +579,8 @@ def create_demo():
     with gr.Blocks(theme=gr.themes.Soft()) as demo:
         gr.Markdown(
             """
-            ### PHQ-9 Conversational Clinician Agent
-            Engage in a brief, empathetic conversation. Your audio is transcribed, analyzed, and used to infer PHQ-9 scores.
+            ### Conversational Assessment for Responsive Engagement (CARE) Notes
+            Engage in a brief conversation. Your audio is transcribed, analyzed, and used to infer PHQ-9 scores.
             The system stops when confidence is high enough or any safety risk is detected. It does not provide therapy or emergency counseling.
             """
         )
@@ -596,7 +596,7 @@ def create_demo():
         model_id_tb = gr.Textbox(value=current_model_id, label="Chat Model ID", info="e.g., google/gemma-2-2b-it or google/medgemma-4b-it")
         with gr.Row():
             apply_model_btn = gr.Button("Apply model (no restart)")
-            apply_model_restart_btn = gr.Button("Apply model and restart")
+            # apply_model_restart_btn = gr.Button("Apply model and restart")
         model_status = gr.Markdown(value=f"Current model: `{current_model_id}`")
 
         with gr.Row():
 
579
  with gr.Blocks(theme=gr.themes.Soft()) as demo:
580
  gr.Markdown(
581
  """
582
+ ### Conversational Assessment for Responsive Engagement (CARE) Notes
583
+ Engage in a brief conversation. Your audio is transcribed, analyzed, and used to infer PHQ-9 scores.
584
  The system stops when confidence is high enough or any safety risk is detected. It does not provide therapy or emergency counseling.
585
  """
586
  )
 
596
  model_id_tb = gr.Textbox(value=current_model_id, label="Chat Model ID", info="e.g., google/gemma-2-2b-it or google/medgemma-4b-it")
597
  with gr.Row():
598
  apply_model_btn = gr.Button("Apply model (no restart)")
599
+ # apply_model_restart_btn = gr.Button("Apply model and restart")
600
  model_status = gr.Markdown(value=f"Current model: `{current_model_id}`")
601
 
602
  with gr.Row():
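Since only the "Apply model (no restart)" button survives, a sketch of how it might be wired in Gradio (hypothetical handler reusing the loader sketch above; the real callback in app.py is outside this diff):

```python
# Hypothetical click wiring for the remaining button (not shown in the diff).
def apply_model(model_id: str) -> str:
    global chat_pipeline
    chat_pipeline = load_chat_model(model_id)  # hot-swap in place, no restart
    return f"Current model: `{model_id}`"

apply_model_btn.click(apply_model, inputs=model_id_tb, outputs=model_status)
```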