Update ProCLIP Model Card with Metadata and Comprehensive Content

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +71 -3
README.md CHANGED
@@ -1,9 +1,77 @@
  ---
- license: apache-2.0
  language:
  - en
- pipeline_tag: image-text-to-text
+ license: mit
+ pipeline_tag: zero-shot-image-classification
  tags:
  - Multi-modal
  - CLIP
- ---
+ ---
+
+ <p align="center">
+ <h1 align="center"><img src="https://github.com/VisionXLab/ProCLIP/raw/main/assets/logo.png" alt="ProCLIP Logo" width="35" style="vertical-align: -25px; margin-right: 5px"/>ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder</h1>
+ </p>
+
+ Official PyTorch implementation of [ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder](https://huggingface.co/papers/2510.18795).
+
+ ProCLIP addresses the limitations of the original CLIP text encoder with a curriculum learning-based progressive vision-language alignment framework that aligns the CLIP image encoder with an LLM-based embedder, enhancing long-text, multilingual, and fine-grained understanding.
+
+ <div align="center">
+ <img src="https://github.com/VisionXLab/ProCLIP/raw/main/assets/overview.png" width="100%"/>
+ </div>
+
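+ In practice, a ProCLIP checkpoint is used like any CLIP-style dual encoder: images and candidate texts are embedded into a shared space and compared by cosine similarity. The snippet below is a minimal, illustrative sketch of zero-shot image classification with such a dual encoder; `encode_image` and `encode_text` are placeholder method names rather than this repository's exact API, so refer to the [GitHub repository](https://github.com/VisionXLab/ProCLIP) for the official loading and inference code.
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ # Illustrative only: `model` stands for any CLIP-style dual encoder such as ProCLIP;
+ # `encode_image` / `encode_text` are placeholder method names, not the repo's exact API.
+ @torch.no_grad()
+ def zero_shot_classify(model, image, class_names):
+     prompts = [f"a photo of a {name}" for name in class_names]
+     # Embed the image and the candidate prompts into the shared space, then L2-normalize.
+     image_emb = F.normalize(model.encode_image(image), dim=-1)  # (1, D)
+     text_emb = F.normalize(model.encode_text(prompts), dim=-1)  # (C, D)
+     # Cosine similarities -> probabilities over the candidate classes.
+     probs = (100.0 * image_emb @ text_emb.T).softmax(dim=-1)
+     return {name: p.item() for name, p in zip(class_names, probs[0])}
+ ```
+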
+ ## Abstract
+ The original CLIP text encoder is limited by a maximum input length of 77 tokens, which hampers its ability to effectively process long texts and perform fine-grained semantic understanding. In addition, the CLIP text encoder lacks support for multilingual inputs. All these limitations significantly restrict its applicability across a broader range of tasks. Recent studies have attempted to replace the CLIP text encoder with an LLM-based embedder to enhance its ability in processing long texts, multilingual understanding, and fine-grained semantic comprehension. However, because the representation spaces of LLMs and the vision-language space of CLIP are pretrained independently without alignment priors, direct alignment using contrastive learning can disrupt the intrinsic vision-language alignment in the CLIP image encoder, leading to an underutilization of the knowledge acquired during pre-training. To address this challenge, we propose ProCLIP, a curriculum learning-based progressive vision-language alignment framework to effectively align the CLIP image encoder with an LLM-based embedder. Specifically, ProCLIP first distills knowledge from CLIP's text encoder into the LLM-based embedder to leverage CLIP's rich pretrained knowledge while establishing initial alignment between the LLM embedder and CLIP image encoder. Subsequently, ProCLIP further aligns the CLIP image encoder with the LLM-based embedder through image-text contrastive tuning, employing self-distillation regularization to avoid overfitting. To achieve a more effective alignment, instance semantic alignment loss and embedding structure alignment loss are employed during representation inheritance and contrastive tuning.
+
+ ## Paper and Code
+ - **Paper**: [ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder](https://huggingface.co/papers/2510.18795)
+ - **GitHub Repository**: [VisionXLab/ProCLIP](https://github.com/VisionXLab/ProCLIP)
+
+ ## Methodology
+
+ The ProCLIP framework employs a two-stage curriculum learning approach for progressive vision-language alignment:
+
+ <div align="center">
+ <img src="https://github.com/VisionXLab/ProCLIP/raw/main/assets/method.png" width="100%"/>
+ </div>
+
+ - **Stage 1**: Align the LLM-based embedder with the CLIP text encoder via Cross-Architecture Distillation.
+ - **Stage 2**: Align the LLM-based embedder with the CLIP image encoder through image-text contrastive tuning with Self-Distillation Regularization (see the schematic sketch below).
+
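+ For intuition, the sketch below shows one plausible way these two objectives could look in PyTorch. It is schematic only and not the official training code: loss weights and distillation targets are assumptions, and the instance semantic alignment and embedding structure alignment losses from the paper are omitted. The actual implementation is in the [GitHub repository](https://github.com/VisionXLab/ProCLIP).
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def stage1_distillation_loss(llm_emb, clip_txt_emb):
+     """Stage 1 (schematic): distill the frozen CLIP text-encoder representations
+     into the LLM-based embedder, here via a cosine-similarity objective."""
+     llm = F.normalize(llm_emb, dim=-1)        # (B, D) LLM-based embedder outputs
+     clip = F.normalize(clip_txt_emb, dim=-1)  # (B, D) frozen CLIP text-encoder outputs
+     return (1.0 - (llm * clip).sum(dim=-1)).mean()
+
+ def stage2_loss(img_emb, txt_emb, img_emb_frozen, temperature=0.07, lambda_sd=1.0):
+     """Stage 2 (schematic): image-text contrastive tuning plus a self-distillation
+     term that keeps the tuned image encoder close to its frozen pre-trained copy."""
+     img = F.normalize(img_emb, dim=-1)         # (B, D) tuned CLIP image embeddings
+     txt = F.normalize(txt_emb, dim=-1)         # (B, D) LLM-based text embeddings
+     ref = F.normalize(img_emb_frozen, dim=-1)  # (B, D) frozen image embeddings
+
+     # Symmetric InfoNCE over in-batch image-text pairs.
+     logits = img @ txt.T / temperature
+     targets = torch.arange(img.size(0), device=img.device)
+     contrastive = 0.5 * (F.cross_entropy(logits, targets)
+                          + F.cross_entropy(logits.T, targets))
+
+     # Self-distillation regularization against the frozen image embeddings.
+     self_distill = (1.0 - (img * ref).sum(dim=-1)).mean()
+     return contrastive + lambda_sd * self_distill
+ ```
+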
+ ## Results
+
+ ProCLIP demonstrates strong performance across various tasks, including:
+
+ ### Retrieval Results
+ ![Retrieval results](https://github.com/VisionXLab/ProCLIP/raw/main/assets/retrieval.png)
+
+ ### Classification Results
+ ![Classification results](https://github.com/VisionXLab/ProCLIP/raw/main/assets/classification.png)
+
+ ### Multilingual Retrieval Results
+ ![Multilingual retrieval results](https://github.com/VisionXLab/ProCLIP/raw/main/assets/xm3600.png)
+
+ ### Comparison with other LLM-embedder-based CLIP models
+ ![Comparison with other LLM-embedder-based CLIP models](https://github.com/VisionXLab/ProCLIP/raw/main/assets/comparision.png)
+
+ More detailed results and comparisons can be found in the [paper](https://huggingface.co/papers/2510.18795).
+
+ ## Citation
+
+ If you find our work helpful, please cite our paper:
+
+ ```bibtex
+ @misc{ProCLIP,
+   title={ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder},
+   author={Xiaoxing Hu and Kaicheng Yang and Ziyang Gong and Qi Ming and Zonghao Guo and Xiang An and Ziyong Feng and Junchi Yan and Xue Yang},
+   year={2025},
+   eprint={2510.18795},
+   archivePrefix={arXiv},
+   primaryClass={cs.CV},
+   url={https://arxiv.org/abs/2510.18795},
+ }
+ ```
+
+ ## License
+
+ This project is licensed under the MIT License. For more details, see the [LICENSE](https://github.com/VisionXLab/ProCLIP/blob/main/LICENSE) file in the GitHub repository.