Spaces: Running on Zero
Update app.py
app.py CHANGED
@@ -446,6 +446,7 @@ model = AutoModel.from_pretrained(
     torch_dtype=torch.bfloat16,
     low_cpu_mem_usage=True,
     trust_remote_code=True,
+    use_flash_attn=False,
 ).eval().cuda()
 tokenizer = AutoTokenizer.from_pretrained("khang119966/Vintern-1B-v3_5-explainableAI", trust_remote_code=True, use_fast=False)
 
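For context, a minimal sketch of the patched loading code, assuming the `AutoModel` call uses the same repo id as the tokenizer line below it; `use_flash_attn=False` is a kwarg handled by the repo's `trust_remote_code` loader (InternVL-style models accept it), which lets the Space run without the flash-attn package:

```python
# Minimal sketch of the patched loading code. The AutoModel repo id is assumed
# to match the tokenizer's ("khang119966/Vintern-1B-v3_5-explainableAI").
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "khang119966/Vintern-1B-v3_5-explainableAI",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    use_flash_attn=False,  # forwarded to the repo's remote code; avoids a flash-attn dependency
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(
    "khang119966/Vintern-1B-v3_5-explainableAI",
    trust_remote_code=True,
    use_fast=False,
)
```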
@@ -559,18 +560,11 @@ with gr.Blocks() as demo:
     gr.Markdown("""## 🎥 Visualizing How Multimodal Models Think
 This tool generates a video to **visualize how a multimodal model (image + text)** attends to different parts of an image while generating text.
 ### 📌 What it does:
-- Takes an input image and a text prompt.
-- Shows how the model’s attention shifts on the image for each generated token.
-- Helps explain the model’s behavior and decision-making.
-### 🖼️ Video layout (per frame):
-Each frame in the video includes:
-1. 🔥 **Heatmap over image**: Shows which area the model focuses on.
-2. 📝 **Generated text**: With old context, current token highlighted.
-3. 📊 **Token prediction table**: Shows the model’s top next-token guesses.
+- Takes an input image and a text prompt. - Shows how the model’s attention shifts on the image for each generated token. - Helps explain the model’s behavior and decision-making.
+### 🖼️ Video layout (per frame):
+Each frame in the video includes: 1. 🔥 **Heatmap over image**: Shows which area the model focuses on. 2. 📝 **Generated text**: With old context, current token highlighted. 3. 📊 **Token prediction table**: Shows the model’s top next-token guesses.
 ### 🎯 Use cases:
-- Research explainability of vision-language models.
-- Debugging or interpreting model outputs.
-- Creating educational visualizations.
+- Research explainability of vision-language models. - Debugging or interpreting model outputs. - Creating educational visualizations.
 """)
 
     with gr.Row():
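The description above outlines the per-frame recipe: an attention heatmap, the generated text with the current token highlighted, and a top next-token table. As an illustration only (not the Space's actual code, and with hypothetical names), one common way to build such a heatmap is to average a generation step's attention over the image-patch positions and upsample it to the image size:

```python
# Illustrative sketch (not the Space's code): turn one generation step's
# attention row into a heatmap over the input image. All names are hypothetical.
import torch
import torch.nn.functional as F

def attention_heatmap(attn_row: torch.Tensor, img_start: int, img_end: int,
                      grid: int, image_hw: tuple[int, int]) -> torch.Tensor:
    """attn_row: (num_heads, seq_len) attention from the token being generated.
    Assumes the image occupies seq positions [img_start, img_end) as a grid x grid patch grid."""
    patch_attn = attn_row.mean(dim=0)[img_start:img_end]   # average over heads
    heat = patch_attn.reshape(1, 1, grid, grid)            # back to the patch grid
    heat = F.interpolate(heat, size=image_hw, mode="bilinear", align_corners=False)
    heat = heat.squeeze()
    return (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)  # normalize to [0, 1]
```

The normalized map can then be colorized and alpha-blended over the frame, and the token-prediction table would come from the same step's logits (e.g. `logits.topk(k)`).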