tags:
- colpali
- multimodal-embedding
---

# Ops-MM-embedding-v1-7B

**Ops-MM-embedding-v1-7B** is a dense, large-scale multimodal embedding model developed and open-sourced by the Alibaba Cloud OpenSearch-AI team, fine-tuned from Qwen2-VL.

## Key Features

### Unified Multimodal Embeddings

- Encodes text, images, text-image pairs, visual documents, and videos (by treating video frames as multiple image inputs) into a unified embedding space for cross-modal retrieval.
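
Because every modality lands in the same embedding space, cross-modal retrieval reduces to nearest-neighbor search over L2-normalized vectors, and a video can be represented by pooling its frame embeddings. A minimal sketch with mock vectors standing in for encoder outputs (the model's actual encoding API and embedding width are not shown here, and mean-pooling of frame embeddings is an illustrative assumption, not the documented method):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 1024  # illustrative embedding width, not the model's actual dimension

def normalize(x):
    # L2-normalize so dot products equal cosine similarities
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Mock embeddings standing in for encoder outputs.
text_query = rng.normal(size=DIM)          # e.g. an encoded text query
image_gallery = rng.normal(size=(4, DIM))  # e.g. encoded images / visual documents
frame_embs = rng.normal(size=(8, DIM))     # 8 sampled frames of one video

# Assumption: aggregate per-frame embeddings into one video vector by mean-pooling.
video_emb = frame_embs.mean(axis=0)

# Plant a near-duplicate of the query so retrieval has a known correct answer.
image_gallery[2] = text_query + 0.05 * rng.normal(size=DIM)

# Unified space: text, image, and video vectors are all directly comparable.
candidates = normalize(np.vstack([image_gallery, video_emb]))
scores = candidates @ normalize(text_query)  # cosine similarity per candidate
best = int(np.argmax(scores))                # index of the planted near-duplicate
```

The same cosine-similarity ranking applies whether the query or the candidates are text, images, or pooled video frames, which is what a unified embedding space buys you.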

### High Performance on MMEB

- Achieves **SOTA results** among models of similar scale on the **MMEB-V2** and **MMEB-Image** benchmarks (as of 2025-07-03).

### Multilingual Capabilities

- **Ops-MM-embedding-v1-7B** achieves SOTA performance among dense models on the ViDoRe-v2 benchmark, demonstrating strong cross-lingual generalization.

## Training Data

MMEB-train, CC-3M, and the ColPali training set.

## Performance

### MMEB-V2

| Model                    | Model Size (B) | Overall | Image-Overall | Video-Overall | Visdoc-Overall |
| ------------------------ | -------------- | ------- | ------------- | ------------- | -------------- |
| gme-Qwen2-VL-2B-Instruct | 2.21           | 54.37   | 51.89         | 33.86         | 73.47          |

### MMEB-Image

The table below compares performance on the MMEB-Image benchmark among models of similar size.

| UNITE-Instruct-7B | 8.29 | 70.3 | 68.3 | 65.1 | 71.6 | 84.8 |

### ViDoRe-v2

| Model | Avg | ESG Restaurant Human | MIT Bio | Econ. Macro | ESG Restaurant Synth. | MIT Bio Multi. | Econ Macro Multi. | ESG Restaurant Synth. Multi. |
| ---------------------- | -------- | -------------------- | ------- | ----------- | --------------------- | -------------- | ----------------- | ---------------------------- |