---
library_name: transformers
tags:
- multimodal
- multilingual
- vlm
- translation
language:
- en
- de
- nl
- es
- fr
- pt
- uk
- hi
- zh
- ru
- cs
- ko
- ja
- it
- pl
- ro
- nb
- nn
base_model:
- Unbabel/Tower-Plus-9B
pipeline_tag: image-text-to-text
license: cc-by-nc-sa-4.0
---
# Model Card for TowerVision
<p align="left">
<img src="Tower.png" alt="TowerVision Logo" width="200">
</p>
TowerVision is a family of open-source multilingual vision-language models optimized for a wide range of vision-language use cases, including image captioning, visual understanding, summarization, and question answering. **TowerVision is particularly strong on multimodal multilingual translation benchmarks and culturally-aware tasks**, with consistent performance across **20 languages and dialects**.
This model card covers the TowerVision family: the 2B and 9B parameter versions, each available in an instruction-tuned (it) variant and a pretrained (pt) variant that has not undergone instruction tuning.
- **Model Family**: TowerVision (2B, 9B variants)
- **Context length**: 8192 tokens
- **Languages**: 20+ languages including European, Asian, and other language families
<span style="font-size: 1.2em;"><strong>🌟 Try TowerVision</strong></span>: [Project Page](https://guilhermeviveiros.github.io/TowerVision.io/) | [Code Repository](https://github.com/GuilhermeViveiros/LLaVA-NeXT)
## Available Models
| Model | Parameters | HF Link |
|-------|------------|---------|
| TowerVision-2B | 2B | [🤗 utter-project/TowerVision-2B](https://huggingface.co/utter-project/TowerVision-2B) |
| TowerVision-9B | 9B | [🤗 utter-project/TowerVision-9B](https://huggingface.co/utter-project/TowerVision-9B) |
## How to Use TowerVision
When using the model, make sure your prompt is formatted correctly!
We also recommend loading the model in **bfloat16** rather than fp16 or fp32.
### Quick Start with Transformers
<details open>
<summary>Click to expand/collapse code</summary>
```python
from transformers import (
    LlavaNextProcessor,
    LlavaNextForConditionalGeneration
)
import requests
from PIL import Image

model_id = "utter-project/TowerVision-2B"  # or any other variant

def prepare_prompt(query):
    conversation = [
        {
            "role": "user",
            "content": f"<image>\n{query}"
        }
    ]
    # Format the message with the TowerVision chat template
    prompt = processor.apply_chat_template(
        conversation,
        tokenize=False,
        add_generation_prompt=True
    )
    return prompt

# we recommend using "bfloat16" as torch_dtype
kwargs = {
    "torch_dtype": "bfloat16",
    "device_map": "auto",
}
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(model_id, **kwargs)

# Load the image from a URL
img_url = "https://cms.mistral.ai/assets/a10b924e-56b3-4359-bf6c-571107811c8f"
image = Image.open(requests.get(img_url, stream=True).raw)

# Multilingual prompts - TowerVision supports 20+ languages!
prompt = prepare_prompt("Is this person really big, or is this building just super small?")

# Prepare inputs
inputs = processor(
    text=prompt, images=image, return_tensors="pt"
).to(model.device)

# Generate response ids
gen_tokens = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens
print(processor.tokenizer.decode(gen_tokens[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
</details>
### Batch Inference with Transformers
For processing multiple images and prompts simultaneously:
<details>
<summary>Click to expand/collapse code</summary>
```python
# (Continues from the Quick Start example above: same imports and model_id.)

def prepare_prompts(queries):
    prompts = []
    for query in queries:
        conversation = [
            {
                "role": "user",
                "content": f"<image>\n{query}"
            }
        ]
        # Format each message with the TowerVision chat template
        prompt = processor.apply_chat_template(
            conversation,
            tokenize=False,
            add_generation_prompt=True
        )
        prompts.append(prompt)
    return prompts

# we recommend using "bfloat16" as torch_dtype
kwargs = {
    "torch_dtype": "bfloat16",
    "device_map": "auto",
}
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(model_id, **kwargs)

# Left padding is recommended for batched generation with decoder-only models
processor.tokenizer.padding_side = "left"

# Sample images and queries for batch processing
batch_size = 2
img_urls = [
    "https://cms.mistral.ai/assets/a10b924e-56b3-4359-bf6c-571107811c8f",
    "https://cms.mistral.ai/assets/a10b924e-56b3-4359-bf6c-571107811c8f",
]
queries = [
    "Is this person really big, or is this building just super small?",
    "Where was this photo taken?"
]

# Load images
images = []
for url in img_urls[:batch_size]:
    image = Image.open(requests.get(url, stream=True).raw)
    images.append(image)

# Prepare prompts
prompts = prepare_prompts(queries[:batch_size])

# Prepare batch inputs
inputs = processor(
    text=prompts,
    images=images,
    return_tensors="pt",
    padding=True
).to(model.device)

# Generate response ids for the whole batch
gen_tokens = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Decode responses, skipping each prompt's tokens
print(f"Batch processing {len(images)} images:")
print("-" * 50)
for i in range(len(images)):
    input_length = inputs.input_ids[i].shape[0]
    response = processor.tokenizer.decode(
        gen_tokens[i][input_length:],
        skip_special_tokens=True
    )
    print(f"Response: {response}")
    print("-" * 50)
```
</details>
### Pipeline Usage
<details>
<summary>Click to expand/collapse code</summary>
```python
from transformers import pipeline
from PIL import Image
import requests

pipe = pipeline(
    model="utter-project/TowerVision-9B",
    task="image-text-to-text",
    device_map="auto",
    torch_dtype="bfloat16"
)

def prepare_prompt(query):
    conversation = [
        {
            "role": "user",
            "content": f"<image>\n{query}"
        }
    ]
    # Format the message with the TowerVision chat template
    return pipe.processor.apply_chat_template(
        conversation,
        tokenize=False,
        add_generation_prompt=True
    )

img_url = "https://cms.mistral.ai/assets/a10b924e-56b3-4359-bf6c-571107811c8f"
image = Image.open(requests.get(img_url, stream=True).raw)
text = prepare_prompt("Is this person really big, or is this building just super small?")

outputs = pipe(text=text, images=image, max_new_tokens=300, return_full_text=False)
print(outputs)
```
</details>
## Model Details
**Input**: The model accepts text and images as input.
**Output**: The model generates text in multiple languages.
**Model Architecture**: TowerVision pairs a multilingual language model based on [Tower-Plus](https://huggingface.co/Unbabel/Tower-Plus-9B) (2B and 9B parameters) with the [SigLIP2 so400m-patch14-384](https://huggingface.co/google/siglip2-so400m-patch14-384) vision encoder, connected through a multimodal adapter for vision-language understanding.
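For orientation, here is a minimal sketch of how these components surface in code, assuming the `LlavaNextForConditionalGeneration` class used in the examples above (attribute names follow the LLaVA-NeXT implementation in `transformers` and may vary across versions):
```python
from transformers import LlavaNextForConditionalGeneration

model = LlavaNextForConditionalGeneration.from_pretrained(
    "utter-project/TowerVision-2B", torch_dtype="bfloat16", device_map="auto"
)
# The three pieces described above: vision encoder, adapter, language model
print(type(model.vision_tower).__name__)           # SigLIP2 vision encoder
print(type(model.multi_modal_projector).__name__)  # multimodal adapter
print(type(model.language_model).__name__)         # Tower-Plus-based LM
```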
**Recommended Precision**: We recommend using `bfloat16` precision for optimal performance and memory efficiency when running TowerVision models.
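A quick way to check the footprint (`get_memory_footprint()` reports parameter memory in bytes; the exact number depends on the variant):
```python
import torch
from transformers import LlavaNextForConditionalGeneration

# Loading in bfloat16 roughly halves memory relative to fp32
model = LlavaNextForConditionalGeneration.from_pretrained(
    "utter-project/TowerVision-2B", torch_dtype=torch.bfloat16
)
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")
```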
**Languages Covered**: The model has been trained on **20 languages and dialects**:
- **European languages**: English, German, Dutch, Spanish, French, Portuguese, Italian, Polish, Czech, Romanian, Norwegian (Bokmål & Nynorsk)
- **Asian languages**: Chinese (Simplified & Traditional), Japanese, Korean, Hindi
- **Other languages**: Russian, Ukrainian
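As a minimal sketch of multilingual prompting, the snippet below reuses `prepare_prompt`, `processor`, `model`, and `image` from the Quick Start example; the queries themselves are illustrative:
```python
# Illustrative queries in three of the supported languages
queries = {
    "de": "Beschreibe dieses Bild in einem Satz.",
    "pt": "Traduza o texto nesta imagem para português.",
    "ja": "この画像を一文で説明してください。",
}
for lang, query in queries.items():
    prompt = prepare_prompt(query)
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    gen_tokens = model.generate(**inputs, max_new_tokens=128)
    answer = processor.tokenizer.decode(
        gen_tokens[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
    )
    print(f"[{lang}] {answer}")
```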
**Key Strengths**:
- **🏆 Exceptional performance on culturally-aware benchmarks** with deep understanding of cultural contexts and visual nuances
- **🌐 State-of-the-art results on multimodal multilingual translation benchmarks**, enabling seamless cross-lingual visual communication
- **📊 Strong cross-lingual transfer capabilities** across diverse vision-language tasks
## Training Data
TowerVision models are trained on **VisionBlocks**, a comprehensive multilingual vision-language dataset comprising **6.31M samples** across diverse categories:
| Dataset | Samples | HF Link | Status |
|---------|---------|---------|--------|
| VisionBlocks | 6.31M | [🤗 utter-project/VisionBlocks](https://huggingface.co/datasets/utter-project/VisionBlocks) | Coming Soon |
### Dataset Statistics
- **Total samples**: 6.31M
- **Created by our team**: 1.21M samples (~19%)
- **Human-collected/external**: 5.10M samples (~81%)
### Dataset Composition Overview
**VisionBlocks** contains samples across multiple categories with both English-only (63.1%) and multilingual (36.9%) data:
- **Chart/Plot Reasoning**: DVQA, ChartQA, PlotQA, TabMWP (~405K samples)
- **General VQA**: VQAv2, RLAIF-4V (~488K samples)
- **Document VQA**: DocVQA, TextVQA, ST-VQA, PixMo-Docs (~46K samples)
- **Reasoning/Knowledge**: A-OKVQA, OKVQA, AI2D, ScienceQA (~29K samples)
- **Multilingual/Cultural**: Pangea-Cultural, Pangea-Multi, PixMo-Cap-Translated, CulturalGround datasets (~1.6M samples)
- **Specialized VQA**: IconQA, InfographicVQA, Stratos (~34K samples)
- **Counting/Math**: TallyQA, PixMo-Count (~107K samples)
- **Vision/Text**: VBlocks-PixMo collections, EuroBlocks-SFT (~2.2M samples)
- **Video/Text**: LLaVA-Video collections (~1.4M samples)
**Collection Types**: Human-annotated, synthetically generated, and professionally translated data, ensuring high quality and cultural diversity across 20+ languages.
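Once the dataset is public, it should be loadable through the standard `datasets` API; the sketch below assumes the usual Hub layout, and the split name and streaming access are assumptions until release:
```python
from datasets import load_dataset

# Assumes a standard Hub dataset layout; split/column names are guesses
vb = load_dataset("utter-project/VisionBlocks", split="train", streaming=True)
print(next(iter(vb)).keys())
```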
## Evaluation
All evaluations were conducted using [lmms_eval](https://github.com/EvolvingLMMs-Lab/lmms-eval).
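For reference, a hypothetical lmms-eval invocation is sketched below; the flags follow the lmms-eval CLI, but the model wrapper name and task identifier are assumptions rather than our exact evaluation setup:
```python
import subprocess

# Hypothetical run; the "llava" wrapper and "mme" task are placeholders
subprocess.run([
    "python", "-m", "lmms_eval",
    "--model", "llava",
    "--model_args", "pretrained=utter-project/TowerVision-9B",
    "--tasks", "mme",
    "--batch_size", "1",
    "--output_path", "./logs/",
], check=True)
```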
### Multiple Purpose Multimodal Benchmarks
TowerVision demonstrates strong performance across diverse multimodal evaluation benchmarks:
<img src="mc-eval1.png" alt="Multiple Purpose Multimodal Benchmarks Results" width="600">
### Multimodal Multilingual Translation Tasks
TowerVision excels particularly in multimodal multilingual translation benchmarks, demonstrating state-of-the-art cross-lingual visual communication capabilities:
<img src="mc-eval2.png" alt="Multimodal Multilingual Translation Results" width="600">
### Supported Languages Performance
✅ **Fully Supported**: English, German, Dutch, Spanish, French, Portuguese, Italian, Polish, Czech, Romanian, Norwegian, Chinese, Japanese, Korean, Hindi, Russian, Ukrainian
📊 **Benchmark Coverage**: Our models are evaluated across diverse multilingual vision-language tasks, demonstrating strong cross-lingual transfer capabilities and exceptional performance in culturally-aware benchmarks.
## Citation
If you find TowerVision useful in your research, please consider citing the following paper:
```bibtex
@misc{viveiros2025towervisionunderstandingimprovingmultilinguality,
title={TowerVision: Understanding and Improving Multilinguality in Vision-Language Models},
author={André G. Viveiros and Patrick Fernandes and Saul Santos and Sonal Sannigrahi and Emmanouil Zaranis and Nuno M. Guerreiro and Amin Farajian and Pierre Colombo and Graham Neubig and André F. T. Martins},
year={2025},
eprint={2510.21849},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2510.21849},
}
```
## Model Card Contact
For errors or additional questions about details in this model card, contact the research team.
## Acknowledgments
TowerVision builds upon the excellent work of:
- **[LLaVA-NeXT](https://github.com/GuilhermeViveiros/LLaVA-NeXT)** for the foundational vision-language architecture
- **[Tower-Plus](https://huggingface.co/Unbabel/Tower-Plus-9B)** language models for multilingual capabilities
- **[SigLIP2](https://huggingface.co/google/siglip2-so400m-patch14-384)** for robust vision encoding
- The broader multilingual NLP and multimodal communities