GuilhermeNunes committed
Commit d7d74b1 · verified · 1 Parent(s): c5a68a2

README updated

![Tower](https://cdn-uploads.huggingface.co/production/uploads/6412130d6e51a8e21884a467/txPrPEb4VBujEL2x31CCj.png)

Files changed (1):
  1. README.md +309 -156
README.md CHANGED
@@ -1,199 +1,352 @@
  ---
  library_name: transformers
- tags: []
  ---

- # Model Card for Model ID
-
- <!-- Provide a quick summary of what the model is/does. -->
-

  ## Model Details

- ### Model Description
-
- <!-- Provide a longer summary of what this model is. -->
-
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-
- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]
-
- ### Model Sources [optional]
-
- <!-- Provide the basic links for the model. -->
-
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
-
- ## Uses
-
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
- ### Direct Use
-
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
- [More Information Needed]
-
- ### Downstream Use [optional]
-
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-
- [More Information Needed]
-
- ### Out-of-Scope Use
-
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-
- [More Information Needed]
-
- ## Bias, Risks, and Limitations
-
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
-
- [More Information Needed]
-
- ### Recommendations
-
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-
- ## How to Get Started with the Model
-
- Use the code below to get started with the model.
-
- [More Information Needed]
-
- ## Training Details

- ### Training Data

- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

- [More Information Needed]

- ### Training Procedure

- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

- #### Preprocessing [optional]

- [More Information Needed]

- #### Training Hyperparameters

- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

- #### Speeds, Sizes, Times [optional]

- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

- [More Information Needed]

  ## Evaluation

- <!-- This section describes the evaluation protocols and provides the results. -->

- ### Testing Data, Factors & Metrics

- #### Testing Data

- <!-- This should link to a Dataset Card if possible. -->

- [More Information Needed]

- #### Factors

- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

- [More Information Needed]

- #### Metrics

- <!-- These are the evaluation metrics being used, ideally with a description of why. -->

- [More Information Needed]

- ### Results

- [More Information Needed]

- #### Summary
-
-
- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- ## Environmental Impact
-
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]
-
- ## Technical Specifications [optional]
-
- ### Model Architecture and Objective
-
- [More Information Needed]
-
- ### Compute Infrastructure
-
- [More Information Needed]
-
- #### Hardware
-
- [More Information Needed]
-
- #### Software
-
- [More Information Needed]
-
- ## Citation [optional]
-
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
- **BibTeX:**
-
- [More Information Needed]
-
- **APA:**
-
- [More Information Needed]
-
- ## Glossary [optional]
-
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-
- [More Information Needed]

- ## More Information [optional]

- [More Information Needed]

- ## Model Card Authors [optional]

- [More Information Needed]

- ## Model Card Contact

- [More Information Needed]
  ---
  library_name: transformers
+ tags:
+ - multimodal
+ - multilingual
+ - llm
+ - vision
+ - vlm
+ - translation
+ language:
+ - en
+ - de
+ - nl
+ - es
+ - fr
+ - pt
+ - uk
+ - hi
+ - zh
+ - ru
+ - cs
+ - ko
+ - ja
+ - it
+ - pl
+ - ro
+ - nb
+ - nn
+ base_model:
+ - Unbabel/Tower-Plus-2B
+ pipeline_tag: image-text-to-text
  ---

+ # Model Card for TowerVision

+ <p align="center">
+   <img src="Tower.png" alt="TowerVision Logo" width="200">
+ </p>
+
+ TowerVision is a family of open-source multilingual vision-language models optimized for a variety of vision-language use cases, including image captioning, visual understanding, summarization, question answering, and more. **TowerVision excels particularly in multimodal multilingual translation benchmarks and culturally-aware tasks**, demonstrating exceptional performance across **20 languages and dialects**.
+
+ This model card covers the TowerVision family: the 2B and 9B parameter versions, each available as an instruct-tuned (it) variant and as a pretrained (pt) variant that has not undergone instruction tuning.
+
+ - **Point of Contact**: X (add some email here)
+ - **License**: Apache 2.0
+ - **Model Family**: TowerVision (2B, 9B variants)
+ - **Context length**: 8192 tokens
+ - **Languages**: 20+ languages across European, Asian, and other language families
+
+ <span style="font-size: 1.2em;"><strong>🌟 Try TowerVision</strong></span>: [Project Page](https://guilhermeviveiros.github.io/TowerVision.io/) | [Code Repository](https://github.com/GuilhermeViveiros/LLaVA-NeXT)
+
+ ## Available Models
+
+ | Model | Parameters | HF Link |
+ |-------|------------|---------|
+ | TowerVision-2B | 2B | [🤗 utter-project/TowerVision-2B](https://huggingface.co/utter-project/TowerVision-2B) |
+ | TowerVision-2B-pt | 2B | [🤗 utter-project/TowerVision-2B-pt](https://huggingface.co/utter-project/TowerVision-2B-pt) |
+ | TowerVision-9B | 9B | [🤗 utter-project/TowerVision-9B](https://huggingface.co/utter-project/TowerVision-9B) |
+ | TowerVision-9B-pt | 9B | [🤗 utter-project/TowerVision-9B-pt](https://huggingface.co/utter-project/TowerVision-9B-pt) |
+
+ ## How to Use TowerVision
+
+ ### Quick Start with Transformers
+
+ <details open>
+ <summary>Click to expand/collapse code</summary>
+
+ ```python
+ from transformers import (
+     LlavaNextProcessor,
+     LlavaNextForConditionalGeneration
+ )
+ import requests
+ from PIL import Image
+
+ model_id = "utter-project/TowerVision-2B"  # or any other variant
+
+ def prepare_prompt(query):
+     conversation = [
+         {
+             "role": "user",
+             "content": f"<image>\n{query}"
+         }
+     ]
+
+     # Format the message with the TowerVision chat template
+     prompt = processor.apply_chat_template(
+         conversation,
+         tokenize=False,
+         add_generation_prompt=True
+     )
+
+     return prompt
+
+ # we recommend using "bfloat16" as torch_dtype
+ kwargs = {
+     "torch_dtype": "bfloat16",
+     "device_map": "auto",
+ }
+ processor = LlavaNextProcessor.from_pretrained(model_id)
+ model = LlavaNextForConditionalGeneration.from_pretrained(model_id, **kwargs)
+
+ # Image URL
+ img_url = "https://cms.mistral.ai/assets/a10b924e-56b3-4359-bf6c-571107811c8f"
+ image = Image.open(requests.get(img_url, stream=True).raw)
+
+ # Multilingual prompts - TowerVision supports 20+ languages!
+ prompt = prepare_prompt("Is this person really big, or is this building just super small?")
+
+ # Prepare inputs
+ inputs = processor(
+     text=prompt, images=image, return_tensors="pt"
+ ).to(model.device)
+
+ # Generate response ids
+ gen_tokens = model.generate(**inputs, max_new_tokens=512)
+ # Decode only the newly generated tokens
+ print(processor.tokenizer.decode(gen_tokens[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
+ ```
+
+ </details>
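Because the chat template is language-agnostic, the same setup can be reused for non-English queries. The sketch below reuses `processor`, `model`, `image`, and `prepare_prompt` from the Quick Start example above; the prompt wording in each language is illustrative and not taken from the official examples.

```python
# Minimal multilingual sketch; reuses processor, model, image and prepare_prompt
# from the Quick Start example above. Prompt texts are illustrative examples.
multilingual_queries = [
    "Descreve esta imagem em detalhe.",       # Portuguese
    "Beschreibe dieses Bild in einem Satz.",  # German
    "この画像を日本語で説明してください。",        # Japanese
]

for query in multilingual_queries:
    prompt = prepare_prompt(query)
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    gen_tokens = model.generate(**inputs, max_new_tokens=256)
    # Decode only the newly generated tokens
    print(processor.tokenizer.decode(gen_tokens[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```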
+
+ ### Batch Inference with Transformers
+
+ For processing multiple images and prompts simultaneously:
+
+ <details>
+ <summary>Click to expand/collapse code</summary>
+
+ ```python
+ # Reuses the imports and model_id from the Quick Start example above
+ def prepare_prompts(queries):
+     prompts = []
+     for query in queries:
+         conversation = [
+             {
+                 "role": "user",
+                 "content": f"<image>\n{query}"
+             }
+         ]
+
+         # Format each message with the TowerVision chat template
+         prompt = processor.apply_chat_template(
+             conversation,
+             tokenize=False,
+             add_generation_prompt=True
+         )
+         prompts.append(prompt)
+     return prompts
+
+ # we recommend using "bfloat16" as torch_dtype
+ kwargs = {
+     "torch_dtype": "bfloat16",
+     "device_map": "auto",
+ }
+ processor = LlavaNextProcessor.from_pretrained(model_id)
+ model = LlavaNextForConditionalGeneration.from_pretrained(model_id, **kwargs)
+
+ # Sample images and queries for batch processing
+ img_urls = [
+     "https://cms.mistral.ai/assets/a10b924e-56b3-4359-bf6c-571107811c8f",
+     "https://cms.mistral.ai/assets/a10b924e-56b3-4359-bf6c-571107811c8f",
+ ]
+
+ queries = [
+     "Is this person really big, or is this building just super small?",
+     "Where was this photo taken?"
+ ]
+
+ # Process all sample images in a single batch
+ batch_size = len(img_urls)
+
+ # Load images
+ images = []
+ for url in img_urls[:batch_size]:
+     image = Image.open(requests.get(url, stream=True).raw)
+     images.append(image)
+
+ # Prepare prompts
+ prompts = prepare_prompts(queries[:batch_size])
+
+ # Prepare batch inputs
+ inputs = processor(
+     text=prompts,
+     images=images,
+     return_tensors="pt",
+     padding=True
+ ).to(model.device)
+
+ # Generate response ids for the batch
+ gen_tokens = model.generate(**inputs, max_new_tokens=512, do_sample=False)
+
+ # Decode responses
+ print(f"Batch processing {len(images)} images:")
+ print("-" * 50)
+
+ for i in range(len(images)):
+     input_length = inputs.input_ids[i].shape[0]
+     response = processor.tokenizer.decode(
+         gen_tokens[i][input_length:],
+         skip_special_tokens=True
+     )
+     print(f"Response: {response}")
+     print("-" * 50)
+ ```
+
+ </details>
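One practical note that is not stated in this card: for batched generation with decoder-only backbones, Transformers generally recommends left padding so that generated tokens are not separated from each prompt by pad tokens. A minimal adjustment to the batch example above, assuming the processor exposes the standard `tokenizer.padding_side` attribute:

```python
# Optional: left-pad batched inputs for decoder-only generation.
# This is a general Transformers recommendation, not TowerVision-specific documentation.
processor.tokenizer.padding_side = "left"

inputs = processor(text=prompts, images=images, return_tensors="pt", padding=True).to(model.device)
gen_tokens = model.generate(**inputs, max_new_tokens=512, do_sample=False)
```

The decoding loop above is unchanged, since the new tokens still start right after the padded input length.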
+
+ ### Pipeline Usage
+
+ <details>
+ <summary>Click to expand/collapse code</summary>
+
+ ```python
+ from transformers import pipeline
+ from PIL import Image
+ import requests
+
+ pipe = pipeline(
+     model="utter-project/TowerVision-9B",
+     task="image-text-to-text",
+     device_map="auto",
+     dtype="bfloat16"
+ )
+
+ def prepare_prompt(query):
+     conversation = [
+         {
+             "role": "user",
+             "content": f"<image>\n{query}"
+         }
+     ]
+
+     # Format the message with the TowerVision chat template
+     return pipe.processor.apply_chat_template(
+         conversation,
+         tokenize=False,
+         add_generation_prompt=True
+     )
+
+ img_url = "https://cms.mistral.ai/assets/a10b924e-56b3-4359-bf6c-571107811c8f"
+ image = Image.open(requests.get(img_url, stream=True).raw)
+ text = prepare_prompt("Is this person really big, or is this building just super small?")
+
+ outputs = pipe(text=text, images=image, max_new_tokens=300, return_full_text=False)
+ print(outputs)
+ ```
+
+ </details>

  ## Model Details

+ **Input**: The model accepts text and images as input.

+ **Output**: The model generates text in multiple languages.

+ **Model Architecture**: TowerVision pairs a multilingual language model based on [Tower-Plus](https://huggingface.co/Unbabel/Tower-Plus-9B) (2B and 9B parameters) with the [SigLIP2 so400m-patch14-384](https://huggingface.co/google/siglip2-so400m-patch14-384) vision encoder through a multimodal adapter for vision-language understanding.

+ **Recommended Precision**: We recommend using `bfloat16` precision for optimal performance and memory efficiency when running TowerVision models.
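The snippet below is a minimal sketch of how this composition can be inspected from a released checkpoint, assuming it follows the standard `LlavaNextForConditionalGeneration` layout used in the examples above; the attribute names (`vision_tower`, `multi_modal_projector`, `language_model`) come from the Transformers LLaVA-NeXT implementation, not from TowerVision-specific code.

```python
import torch
from transformers import LlavaNextForConditionalGeneration

# Load one of the released checkpoints in the recommended bfloat16 precision.
model = LlavaNextForConditionalGeneration.from_pretrained(
    "utter-project/TowerVision-2B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Inspect the three components described above (names from the Transformers
# LLaVA-NeXT implementation; assumed to apply to this checkpoint).
print(type(model.vision_tower).__name__)           # vision encoder
print(type(model.multi_modal_projector).__name__)  # multimodal adapter
print(type(model.language_model).__name__)         # language model backbone
print(model.config.vision_config.model_type, model.config.text_config.model_type)
```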

+ **Languages Covered**: The model has been trained on **20 languages and dialects**:
+ - **European languages**: English, German, Dutch, Spanish, French, Portuguese, Italian, Polish, Czech, Romanian, Norwegian (Bokmål & Nynorsk)
+ - **Asian languages**: Chinese (Simplified & Traditional), Japanese, Korean, Hindi
+ - **Other languages**: Russian, Ukrainian
+
+ **Key Strengths**:
+ - **🏆 Exceptional performance on culturally-aware benchmarks** with deep understanding of cultural contexts and visual nuances
+ - **🌐 State-of-the-art results on multimodal multilingual translation benchmarks**, enabling seamless cross-lingual visual communication
+ - **📊 Strong cross-lingual transfer capabilities** across diverse vision-language tasks
+
+ ## Training Data
+
+ TowerVision models are trained on **VisionBlocks**, a comprehensive multilingual vision-language dataset comprising **6.31M samples** across diverse categories:
+
+ | Dataset | Samples | HF Link | Status |
+ |---------|---------|---------|--------|
+ | VisionBlocks | 6.31M | [🤗 utter-project/VisionBlocks](https://huggingface.co/datasets/utter-project/VisionBlocks) | Coming Soon |
+
+ ### Dataset Statistics
+ - **Total samples**: 6.31M
+ - **Created by our team**: 1.21M samples (~19%)
+ - **Human-collected/external**: 5.10M samples (~81%)
+
+ ### Dataset Composition Overview
+
+ **VisionBlocks** contains samples across multiple categories, with both English-only (63.1%) and multilingual (36.9%) data:
+
+ - **Chart/Plot Reasoning**: DVQA, ChartQA, PlotQA, TabMWP (~405K samples)
+ - **General VQA**: VQAv2, RLAIF-4V (~488K samples)
+ - **Document VQA**: DocVQA, TextVQA, ST-VQA, PixMo-Docs (~46K samples)
+ - **Reasoning/Knowledge**: A-OKVQA, OKVQA, AI2D, ScienceQA (~29K samples)
+ - **Multilingual/Cultural**: Pangea-Cultural, Pangea-Multi, PixMo-Cap-Translated, CulturalGround datasets (~1.6M samples)
+ - **Specialized VQA**: IconQA, InfographicVQA, Stratos (~34K samples)
+ - **Counting/Math**: TallyQA, PixMo-Count (~107K samples)
+ - **Vision/Text**: VBlocks-PixMo collections, EuroBlocks-SFT (~2.2M samples)
+ - **Video/Text**: LLaVA-Video collections (~1.4M samples)
+
+ **Collection Types**: Human-annotated, synthetically generated, and professionally translated data ensure high quality and cultural diversity across 20+ languages.
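Once the dataset is published on the Hub (it is listed as "Coming Soon" above), it should be loadable with the standard `datasets` API. The snippet below is a hypothetical sketch: the repository id comes from the table above, but the split name and available fields are assumptions, not documented values.

```python
from datasets import load_dataset

# Hypothetical sketch: assumes the released VisionBlocks dataset exposes a
# default configuration with a "train" split under this repository id.
ds = load_dataset("utter-project/VisionBlocks", split="train", streaming=True)

# Peek at one sample to discover the available fields (not documented in this card).
first = next(iter(ds))
print(first.keys())
```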

  ## Evaluation

+ All evaluations were conducted using [lmms_eval](https://github.com/EvolvingLMMs-Lab/lmms-eval).
+
+ ### General-Purpose Multimodal Benchmarks
+
+ TowerVision demonstrates strong performance across diverse multimodal evaluation benchmarks:
+
+ <img src="mc-eval1.png" alt="General-purpose multimodal benchmark results" width="600">
+
+ ### Multimodal Multilingual Translation Tasks
+
+ TowerVision excels particularly in multimodal multilingual translation benchmarks, demonstrating state-of-the-art cross-lingual visual communication capabilities:
+
+ <img src="mc-eval2.png" alt="Multimodal multilingual translation results" width="600">
+
+ ### Supported Languages Performance
+
+ **Fully Supported**: English, German, Dutch, Spanish, French, Portuguese, Italian, Polish, Czech, Romanian, Norwegian, Chinese, Japanese, Korean, Hindi, Russian, Ukrainian
+
+ 📊 **Benchmark Coverage**: Our models are evaluated across diverse multilingual vision-language tasks, demonstrating strong cross-lingual transfer capabilities and exceptional performance on culturally-aware benchmarks.
+
+ ## Citation
+
+ If you find TowerVision useful in your research, please consider citing the following paper:
+
+ ```bibtex
+ @article{towervision2025,
+   title={Understanding and Improving Multilinguality in Vision-Language Models},
+   author={[Authors to be added]},
+   journal={[Journal to be added]},
+   year={2025},
+   note={Paper in preparation}
+ }
+ ```
+
+ ## Model Card Contact
+
+ For errors or additional questions about details in this model card, contact the research team.
+
+ ## Terms of Use
+
+ By releasing the weights of highly performant multilingual vision-language models, we hope to make community-based research efforts more accessible to researchers all over the world.
+
+ This model is governed by the Apache 2.0 License.
+
+ ## Acknowledgments
+
+ TowerVision builds upon the excellent work of:
+ - **[LLaVA-NeXT](https://github.com/GuilhermeViveiros/LLaVA-NeXT)** for the foundational vision-language architecture
+ - **[Tower-Plus](https://huggingface.co/Unbabel/Tower-Plus-9B)** language models for multilingual capabilities
+ - **[SigLIP2](https://huggingface.co/google/siglip2-so400m-patch14-384)** for robust vision encoding
+ - The broader multilingual NLP and multimodal communities