Spaces: Running on Zero
Update app.py
app.py CHANGED
@@ -446,6 +446,7 @@ model = AutoModel.from_pretrained(
     torch_dtype=torch.bfloat16,
     low_cpu_mem_usage=True,
     trust_remote_code=True,
+    use_flash_attn=False,
 ).eval().cuda()
 tokenizer = AutoTokenizer.from_pretrained("khang119966/Vintern-1B-v3_5-explainableAI", trust_remote_code=True, use_fast=False)
 
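For context, a minimal sketch of the patched loading code, assuming the `AutoModel` call uses the same repo id as the tokenizer line below it; `use_flash_attn=False` is a kwarg handled by the repo's `trust_remote_code` loader (InternVL-style models accept it), which lets the Space run without the flash-attn package:

```python
# Minimal sketch of the patched loading code. The AutoModel repo id is assumed
# to match the tokenizer's ("khang119966/Vintern-1B-v3_5-explainableAI").
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "khang119966/Vintern-1B-v3_5-explainableAI",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    use_flash_attn=False,  # forwarded to the repo's remote code; avoids a flash-attn dependency
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(
    "khang119966/Vintern-1B-v3_5-explainableAI",
    trust_remote_code=True,
    use_fast=False,
)
```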
@@ -559,18 +560,11 @@ with gr.Blocks() as demo:
     gr.Markdown("""## 🎥 Visualizing How Multimodal Models Think
 This tool generates a video to **visualize how a multimodal model (image + text)** attends to different parts of an image while generating text.
 ### 📌 What it does:
-- Takes an input image and a text prompt.
-- Shows how the model’s attention shifts on the image for each generated token.
-- Helps explain the model’s behavior and decision-making.
-### 🖼️ Video layout (per frame):
-Each frame in the video includes:
-1. 🔥 **Heatmap over image**: Shows which area the model focuses on.
-2. 📝 **Generated text**: With old context, current token highlighted.
-3. 📊 **Token prediction table**: Shows the model’s top next-token guesses.
+- Takes an input image and a text prompt. - Shows how the model’s attention shifts on the image for each generated token. - Helps explain the model’s behavior and decision-making.
+### 🖼️ Video layout (per frame):
+Each frame in the video includes: 1. 🔥 **Heatmap over image**: Shows which area the model focuses on. 2. 📝 **Generated text**: With old context, current token highlighted. 3. 📊 **Token prediction table**: Shows the model’s top next-token guesses.
 ### 🎯 Use cases:
-- Research explainability of vision-language models.
-- Debugging or interpreting model outputs.
-- Creating educational visualizations.
+- Research explainability of vision-language models. - Debugging or interpreting model outputs. - Creating educational visualizations.
 """)
 
     with gr.Row():
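The description above outlines the per-frame recipe: an attention heatmap, the generated text with the current token highlighted, and a top next-token table. As an illustration only (not the Space's actual code, and with hypothetical names), one common way to build such a heatmap is to average a generation step's attention over the image-patch positions and upsample it to the image size:

```python
# Illustrative sketch (not the Space's code): turn one generation step's
# attention row into a heatmap over the input image. All names are hypothetical.
import torch
import torch.nn.functional as F

def attention_heatmap(attn_row: torch.Tensor, img_start: int, img_end: int,
                      grid: int, image_hw: tuple[int, int]) -> torch.Tensor:
    """attn_row: (num_heads, seq_len) attention from the token being generated.
    Assumes the image occupies seq positions [img_start, img_end) as a grid x grid patch grid."""
    patch_attn = attn_row.mean(dim=0)[img_start:img_end]   # average over heads
    heat = patch_attn.reshape(1, 1, grid, grid)            # back to the patch grid
    heat = F.interpolate(heat, size=image_hw, mode="bilinear", align_corners=False)
    heat = heat.squeeze()
    return (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)  # normalize to [0, 1]
```

The normalized map can then be colorized and alpha-blended over the frame, and the token-prediction table would come from the same step's logits (e.g. `logits.topk(k)`).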