---
title: Phq 9 Clinician Agent
emoji: π’
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
short_description: MedGemma clinician chatbot demo (research prototype)
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
Technical Design Document: MedGemma-Based PHQ-9 Conversational Assessment Agent
1. Overview
1.1 Project Goal
The goal of this project is to develop an AI-driven clinician simulation agent that conducts natural conversations with patients to assess depression severity based on the PHQ-9 (Patient Health Questionnaire-9) scale. Unlike simple questionnaire bots, this system aims to infer a patient's score implicitly through conversation and speech cues, mirroring a clinician's behavior in real-world interviews.
1.2 Core Concept
The system will:
Engage the user in a realistic, adaptive dialogue (clinician-style questioning).
Continuously analyze textual and vocal features to estimate PHQ-9 category scores.
Stop automatically when confidence in all PHQ-9 items is sufficiently high.
Produce a final PHQ-9 severity report.
The system will use a configurable LLM (e.g., Gemma-2-2B-IT or MedGemma-4B-IT) as the base model for both:
- A Recording Agent (conversational component)
- A Scoring Agent (PHQ-9 inference component)
2. System Architecture
2.1 High-Level Components
- Frontend Client: Handles user interaction, voice input/output, and UI display.
- Speech I/O Module: Converts speech to text (ASR) and text to speech (TTS).
- Feature Extraction Module: Extracts acoustic and prosodic features via librosa (lightweight prosody proxies) for emotional/speech analysis.
- Recording Agent (Chatbot): Conducts clinician-like conversation with adaptive questioning.
- Scoring Agent: Evaluates PHQ-9 symptom probabilities after each exchange and determines confidence in the final diagnosis.
- Controller / Orchestrator: Manages communication between agents and triggers scoring cycles.
- Model Backend: Hosts a configurable LLM (e.g., Gemma-2-2B-IT, MedGemma-4B-IT), prompted for clinician reasoning.
2.2 Architecture Diagram (Text Description)
Frontend Client → Speech I/O Module → Feature Extraction Module → Recording Agent → Scoring Agent → Controller / Orchestrator
- Frontend Client (Web / Desktop App): voice input/output and text display; streams audio downstream.
- Speech I/O Module: ASR (Whisper) and TTS (e.g., Coqui).
- Feature Extraction Module: librosa (prosody: pitch, energy/loudness, timing/phonation).
- Recording Agent (MedGemma): generates the next question; maintains conversational context.
- Scoring Agent (MedGemma): maps text + voice features → PHQ-9 dimension confidences; determines if the assessment is done.
- Controller / Orchestrator: loops until confidence ≥ τ; outputs the PHQ-9 report.
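A minimal sketch of the Controller / Orchestrator loop under this architecture; `recording_agent`, `scoring_agent`, `transcribe`, `extract_features`, `speak`, and `get_audio` are hypothetical callables standing in for the modules above, not a fixed interface.

```python
# Hypothetical controller loop; the callables passed in stand in for the
# modules described in the diagram above.
CONFIDENCE_THRESHOLD = 0.8  # tau

def run_assessment(recording_agent, scoring_agent, transcribe,
                   extract_features, speak, get_audio):
    history = []       # (speaker, text) turns
    feature_log = []   # per-turn acoustic feature summaries
    question = recording_agent(history)              # opening clinician question
    while True:
        speak(question)                              # TTS output
        audio = get_audio()                          # capture the patient reply
        text = transcribe(audio)                     # ASR (e.g., Whisper)
        feature_log.append(extract_features(audio))  # librosa prosody proxies
        history.append(("clinician", question))
        history.append(("patient", text))
        scores, confidences = scoring_agent(history, feature_log)
        if min(confidences) >= CONFIDENCE_THRESHOLD:
            return scores, confidences               # assessment complete
        question = recording_agent(history)          # otherwise keep talking
```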
3. Agent Design
3.1 Recording Agent
Role: Simulates a clinician conducting an empathetic, open-ended dialogue to elicit responses relevant to the PHQ-9 categories (mood, sleep, appetite, concentration, energy, self-worth, psychomotor changes, suicidal ideation).
Key Responsibilities:
Maintain conversational context.
Adapt follow-up questions based on inferred patient state.
Produce text responses using a configurable LLM (e.g. Gemma-2-2B-IT, MedGemma-4B-IT) with a clinician-style prompt template.
After each user response, trigger the Scoring Agent to reassess.
Prompt Skeleton Example:
System: You are a clinician conducting a conversational assessment to infer PHQ-9 symptoms without listing questions. Keep tone empathetic, natural, and human.
User: [transcribed patient input]
Assistant: [clinician-style response / next question]
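A minimal sketch of how the Recording Agent could be driven from this prompt skeleton using the Hugging Face `transformers` chat pipeline; it assumes a recent `transformers` release and Gemma-2-2B-IT as the configurable base model, and the helper name and generation settings are illustrative.

```python
from transformers import pipeline

# Illustrative: Gemma-2-2B-IT as the configurable base model.
generator = pipeline("text-generation", model="google/gemma-2-2b-it")

SYSTEM_PROMPT = (
    "You are a clinician conducting a conversational assessment to infer PHQ-9 "
    "symptoms without listing questions. Keep tone empathetic, natural, and human."
)

def next_clinician_turn(history):
    """history: alternating {'role': 'user'|'assistant', 'content': str} turns,
    with the patient as 'user' and the clinician agent as 'assistant'."""
    if history:
        # Gemma-2 has no dedicated system role, so the clinician instructions
        # are prepended to the first patient message (an assumption of this sketch).
        first = {"role": "user", "content": SYSTEM_PROMPT + "\n\n" + history[0]["content"]}
        messages = [first] + history[1:]
    else:
        messages = [{"role": "user", "content": SYSTEM_PROMPT}]  # opening question
    out = generator(messages, max_new_tokens=128, do_sample=True, temperature=0.7)
    # With chat-style input, the pipeline returns the full message list;
    # the last entry is the newly generated clinician turn.
    return out[0]["generated_text"][-1]["content"]
```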
3.2 Scoring Agent
Role: Evaluates the ongoing conversation to infer a PHQ-9 score distribution and confidence values for each symptom.
Input:
Conversation transcript (all turns)
Acoustic features (prosody, energy, speech rate) from the Feature Extraction Module (librosa, or openSMILE)
Optional: timestamped emotional embeddings (via pretrained affect model)
Output:
Vector of 9 PHQ-9 scores (0–3)
Confidence scores per question
Overall depression severity classification (Minimal, Mild, Moderate, Moderately Severe, Severe)
Operation Flow:
Parse the full transcript and extract statements relevant to each PHQ-9 item.
Combine textual cues + acoustic cues.
Fusion mechanism: Acoustic features are summarized into a compact JSON and included in the scoring prompt alongside the transcript (early, prompt-level fusion).
Use the LLM's reasoning chain to map features to PHQ-9 scores.
When confidence for all items ≥ threshold τ (e.g., 0.8), finalize results and signal termination.
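A sketch of the prompt-level fusion and termination check described in this flow; the JSON schema, the `generate` callable, and the exact prompt wording are assumptions of this sketch rather than a fixed interface.

```python
import json

PHQ9_ITEMS = ["interest", "mood", "sleep", "energy", "appetite",
              "self_worth", "concentration", "motor", "suicidal_thoughts"]
TAU = 0.8  # per-item confidence threshold

SCORING_PROMPT = """You are scoring a clinical conversation against the PHQ-9.
Transcript:
{transcript}

Acoustic feature summary (per patient turn):
{features_json}

For each PHQ-9 item give a score from 0-3 and a confidence from 0-1.
Respond with JSON: {{"scores": {{item: int}}, "confidences": {{item: float}}}}"""

def score_conversation(generate, transcript, feature_summaries):
    """generate: callable that sends a prompt to the configurable LLM and returns text."""
    # Early, prompt-level fusion: acoustic summaries are serialized to compact
    # JSON and placed in the prompt next to the transcript.
    prompt = SCORING_PROMPT.format(
        transcript=transcript,
        features_json=json.dumps(feature_summaries, separators=(",", ":")),
    )
    result = json.loads(generate(prompt))       # assumes well-formed JSON output
    scores = [result["scores"][item] for item in PHQ9_ITEMS]
    confidences = [result["confidences"][item] for item in PHQ9_ITEMS]
    finished = min(confidences) >= TAU          # stop condition from Section 5.2
    return scores, confidences, finished
```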
4. Data Flow
User speaks → audio is captured.
ASR transcribes the speech to text.
librosa (or openSMILE) extracts voice features (prosody proxies); see the sketch after this list.
Recording Agent uses the transcript (and optionally the summarized features) → next conversational message.
Scoring Agent evaluates the cumulative context → PHQ-9 score vector + confidences.
If confidence < τ → continue the conversation; else → output the final diagnosis.
TTS module vocalizes the clinician output.
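The prosody proxies in step 3 could be summarized roughly as follows with librosa; the field names and the 16 kHz sample rate are assumptions of this sketch, and the feature set follows Section 5.1.

```python
import librosa
import numpy as np

def extract_prosody_proxies(audio_path, sr=16000):
    """Summarize one patient turn into the compact feature dict that is later
    serialized into the scoring prompt (feature set follows Section 5.1)."""
    y, sr = librosa.load(audio_path, sr=sr)
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    return {
        "duration_s": round(float(librosa.get_duration(y=y, sr=sr)), 2),
        "rms_mean": round(float(np.mean(librosa.feature.rms(y=y))), 4),          # loudness proxy
        "zcr_mean": round(float(np.mean(librosa.feature.zero_crossing_rate(y))), 4),
        "spectral_centroid_mean": round(
            float(np.mean(librosa.feature.spectral_centroid(y=y, sr=sr))), 1
        ),
        "f0_mean_hz": round(float(np.nanmean(f0)) if np.any(voiced_flag) else 0.0, 1),  # pitch proxy
        "voiced_fraction": round(float(np.mean(voiced_flag)), 3),                # phonation proxy
    }
```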
5. Implementation Details
5.1 Models and Libraries
- Base LLM: Configurable (e.g., Gemma-2-2B-IT, MedGemma-4B-IT)
- ASR: Whisper
- TTS: gTTS (preferably), Coqui TTS, or Bark
- Audio Features: librosa (RMS, ZCR, spectral centroid, f0, energy, duration)
- Backend: Python / Gradio (Spaces)
- Frontend: Gradio
- Communication: Gradio UI
5.2 Confidence Computation
Each PHQ-9 item i has a confidence score c_i ∈ [0, 1].
c_i is estimated via secondary LLM reasoning (e.g., "How confident are you about this inference?").
Global confidence C = min_i c_i. Stop condition: C ≥ τ (e.g., τ = 0.8).
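In code, the stop condition reduces to a few lines (a sketch; the confidence list is the per-item output of the Scoring Agent):

```python
TAU = 0.8  # global confidence threshold

def assessment_finished(confidences):
    """confidences: the nine per-item values c_i produced by the Scoring Agent."""
    # Global confidence C = min_i c_i; stop once C >= tau.
    return min(confidences) >= TAU
```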
5.3 Example API Workflow
Request:

```
POST /api/message
{ "audio": ..., "transcript": "...", "features": { ... } }
```

Response:

```
{ "agent_response": "...", "phq9_scores": [1, 0, 2, ...], "confidences": [0.9, 0.85, ...], "finished": false }
```
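A hypothetical client call against this endpoint; the route is a design sketch (in the current Gradio Space the exchange happens through the UI), and the URL below is a placeholder.

```python
import base64
import requests

# Hypothetical endpoint from the workflow above; the host URL is a placeholder.
API_URL = "http://localhost:7860/api/message"

def send_turn(audio_path, transcript, features):
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("ascii")
    payload = {"audio": audio_b64, "transcript": transcript, "features": features}
    resp = requests.post(API_URL, json=payload, timeout=60)
    resp.raise_for_status()
    # Expected keys: agent_response, phq9_scores, confidences, finished
    return resp.json()
```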
6. Training and Fine-Tuning (future work; not implemented yet, as the required data is not currently available)
Supervised Fine-Tuning (SFT) using synthetic dialogues labeled with PHQ-9 scores.
Speech-text alignment: fuse openSMILE (or librosa) feature embeddings with conversation text embeddings before feeding them to the scoring prompts.
Possible multi-modal fusion via:
Feature concatenation → token embedding,
or a cross-attention adapter (if fine-tuning is allowed).
7. Output Specification
Final Output:
{ "PHQ9_Scores": { "interest": 2, "mood": 3, "sleep": 2, "energy": 2, "appetite": 1, "self_worth": 2, "concentration": 1, "motor": 1, "suicidal_thoughts": 0 }, "Total_Score": 14, "Severity": "Moderate Depression", "Confidence": 0.86 }
Displayed alongside a clinician-style summary:
"Based on our discussion, your responses suggest moderate depressive symptoms, with difficulties in mood and sleep being most prominent."
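The severity label follows the standard PHQ-9 total-score bands (0–4 minimal, 5–9 mild, 10–14 moderate, 15–19 moderately severe, 20–27 severe); a small helper such as the following could derive it (the function name is illustrative).

```python
def phq9_severity(total_score):
    """Map a PHQ-9 total (0-27) to the standard severity bands."""
    if total_score <= 4:
        return "Minimal"
    if total_score <= 9:
        return "Mild"
    if total_score <= 14:
        return "Moderate"
    if total_score <= 19:
        return "Moderately Severe"
    return "Severe"

# Example: the report above has Total_Score 14 -> "Moderate".
```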
8. Termination and Safety
The system will not offer therapy advice or emergency counseling.
If the patient mentions suicidal thoughts (item 9), the system:
Flags high risk,
Terminates the chat, and
Displays emergency contact information (e.g., "If you are in danger or need immediate help, call 988 in the U.S.").
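A sketch of this item-9 safety gate; treating any non-zero item-9 score as high risk, and the exact handling, are assumptions of this sketch.

```python
CRISIS_MESSAGE = "If you are in danger or need immediate help, call 988 in the U.S."

def check_item9_gate(phq9_scores):
    """phq9_scores: the nine inferred item scores; index 8 is suicidal ideation.
    Any non-zero item-9 score triggers the safety flow (an assumption of this sketch)."""
    if phq9_scores[8] > 0:
        return True, CRISIS_MESSAGE   # flag high risk, terminate the chat, show resources
    return False, ""
```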
9. Future Extensions (not implemented now)
Fine-tuned model jointly trained on PHQ-9 labeled conversations.
Multilingual support (via Whisper multilingual and TTS).
Confidence calibration using Bayesian reasoning or uncertainty quantification.
Integration with EHR systems for clinician verification.
10. Summary
This project creates an intelligent, conversational PHQ-9 assessment agent that blends:
- A configurable medical LLM (e.g., MedGemma-4B-IT or Gemma-2-2B-IT),
- Audio prosody and emotion analysis with librosa (openSMILE as a possible extension),
- A dual-agent architecture for conversation and scoring,
- and multimodal reasoning to deliver clinician-like mental health assessments.
The modular design enables local deployment on GPU servers, privacy-preserving operation, and future research extensions into multimodal diagnostic reasoning.