Update ProCLIP Model Card with Metadata and Comprehensive Content
This PR significantly enhances the model card for ProCLIP, a model focused on progressive vision-language alignment.
Key updates include:
- **License Update**: Changed from `apache-2.0` to `mit`, aligning with the license specified in the official GitHub repository.
- **Pipeline Tag Refinement**: Updated from `image-text-to-text` to `zero-shot-image-classification` to more accurately reflect the model's capabilities, particularly its strong performance in classification tasks as indicated by the paper and GitHub results.
- **Comprehensive Content**: Populated the model card with a detailed overview, including the paper's abstract, methodology explanation (with an illustrative image), and demonstration of results (with images).
- **Links**: Added direct links to the Hugging Face paper page and the official GitHub repository for easy access to the source material and code.
These improvements aim to make the model more discoverable and understandable for users on the Hugging Face Hub.
@@ -1,9 +1,77 @@
 ---
-license: apache-2.0
 language:
 - en
-pipeline_tag: image-text-to-text
+license: mit
+pipeline_tag: zero-shot-image-classification
 tags:
 - Multi-modal
 - CLIP
 ---

<p align="center">
<h1 align="center"><img src="https://github.com/VisionXLab/ProCLIP/raw/main/assets/logo.png" alt="ProCLIP Logo" width="35" style="vertical-align: -25px; margin-right: 5px"/>ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder</h1>
</p>

Official PyTorch implementation of [ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder](https://huggingface.co/papers/2510.18795).

This model addresses limitations of the original CLIP text encoder by introducing a curriculum learning-based progressive vision-language alignment framework. It aims to effectively align the CLIP image encoder with an LLM-based embedder to enhance long-text, multilingual, and fine-grained understanding.

<div align="center">
<img src="https://github.com/VisionXLab/ProCLIP/raw/main/assets/overview.png" width="100%"/>
</div>
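
Below is a minimal, illustrative sketch of the zero-shot classification pattern this model targets. The `image_encoder` and `text_embedder` functions are hypothetical placeholders for ProCLIP's CLIP image encoder and LLM-based embedder, not the repository's actual API; refer to the [GitHub repository](https://github.com/VisionXLab/ProCLIP) for the official checkpoints, preprocessing, and loading code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
embed_dim = 512

def image_encoder(pixel_values: torch.Tensor) -> torch.Tensor:
    # Hypothetical placeholder: a real encoder maps preprocessed pixels
    # to an embedding in the shared vision-language space.
    return torch.randn(pixel_values.shape[0], embed_dim)

def text_embedder(prompts: list) -> torch.Tensor:
    # Hypothetical placeholder: the LLM-based embedder accepts long or
    # multilingual prompts, unlike the 77-token CLIP text encoder.
    return torch.randn(len(prompts), embed_dim)

class_prompts = [
    "a photo of a cat sleeping on a sofa",
    "a photo of a dog running on the beach",
    "a photo of a red sports car parked on a street",
]
images = torch.rand(2, 3, 224, 224)  # two dummy preprocessed images

image_feats = F.normalize(image_encoder(images), dim=-1)
text_feats = F.normalize(text_embedder(class_prompts), dim=-1)

# Cosine similarities between every image and every class prompt, turned into
# per-image class probabilities (the factor 100 mirrors CLIP's logit scale).
logits = 100.0 * image_feats @ text_feats.T
probs = logits.softmax(dim=-1)
print(probs)
```

With real ProCLIP weights, the same pattern also covers image-text retrieval: rank the similarity scores instead of softmaxing them.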

## Abstract

The original CLIP text encoder is limited by a maximum input length of 77 tokens, which hampers its ability to effectively process long texts and perform fine-grained semantic understanding. In addition, the CLIP text encoder lacks support for multilingual inputs. All these limitations significantly restrict its applicability across a broader range of tasks. Recent studies have attempted to replace the CLIP text encoder with an LLM-based embedder to enhance its ability in processing long texts, multilingual understanding, and fine-grained semantic comprehension. However, because the representation spaces of LLMs and the vision-language space of CLIP are pretrained independently without alignment priors, direct alignment using contrastive learning can disrupt the intrinsic vision-language alignment in the CLIP image encoder, leading to an underutilization of the knowledge acquired during pre-training. To address this challenge, we propose ProCLIP, a curriculum learning-based progressive vision-language alignment framework to effectively align the CLIP image encoder with an LLM-based embedder. Specifically, ProCLIP first distills knowledge from CLIP's text encoder into the LLM-based embedder to leverage CLIP's rich pretrained knowledge while establishing initial alignment between the LLM embedder and CLIP image encoder. Subsequently, ProCLIP further aligns the CLIP image encoder with the LLM-based embedder through image-text contrastive tuning, employing self-distillation regularization to avoid overfitting. To achieve a more effective alignment, instance semantic alignment loss and embedding structure alignment loss are employed during representation inheritance and contrastive tuning.
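
The following is a conceptual PyTorch sketch of the Stage 1 idea described above: distilling the frozen CLIP text encoder into the trainable LLM-based embedder. The two loss functions are one plausible reading of the "instance semantic alignment" and "embedding structure alignment" terms named in the abstract, written against dummy tensors; the exact formulations are defined in the paper, not here.

```python
import torch
import torch.nn.functional as F

def instance_semantic_alignment(student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    # Pull each student (LLM embedder) embedding toward the teacher (CLIP text
    # encoder) embedding of the same caption.
    return (1.0 - F.cosine_similarity(student, teacher, dim=-1)).mean()

def embedding_structure_alignment(student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    # Match the pairwise similarity structure of the batch so the relative
    # geometry of the teacher space is inherited, not just individual points.
    s_sim = F.normalize(student, dim=-1) @ F.normalize(student, dim=-1).T
    t_sim = F.normalize(teacher, dim=-1) @ F.normalize(teacher, dim=-1).T
    return F.mse_loss(s_sim, t_sim)

# Dummy batch: 8 captions embedded by the trainable student and the frozen teacher.
student_text = torch.randn(8, 512, requires_grad=True)
with torch.no_grad():
    teacher_text = torch.randn(8, 512)

stage1_loss = instance_semantic_alignment(student_text, teacher_text) \
    + embedding_structure_alignment(student_text, teacher_text)
stage1_loss.backward()
```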

## Paper and Code

- **Paper**: [ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder](https://huggingface.co/papers/2510.18795)
- **GitHub Repository**: [VisionXLab/ProCLIP](https://github.com/VisionXLab/ProCLIP)

## Methodology

The ProCLIP framework employs a two-stage curriculum learning approach for progressive vision-language alignment:

<div align="center">
<img src="https://github.com/VisionXLab/ProCLIP/raw/main/assets/method.png" width="100%"/>
</div>

- **Stage 1**: Align the LLM-based embedder with the CLIP text encoder via Cross-Architecture Distillation.
- **Stage 2**: Align the LLM-based embedder with the CLIP image encoder via contrastive tuning with Self-Distillation Regularization, as sketched below.
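
As a rough illustration of how Stage 2's objective could be composed, the sketch below combines a CLIP-style symmetric contrastive loss with a self-distillation term that keeps the tuned image features close to those of a frozen reference encoder. The term weights and the exact regularizer are assumptions for illustration only; consult the paper for the actual formulation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats: torch.Tensor, text_feats: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # Symmetric InfoNCE over the in-batch image-text similarity matrix, as in CLIP.
    logits = image_feats @ text_feats.T / temperature
    targets = torch.arange(logits.shape[0])
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

def self_distillation_reg(image_feats: torch.Tensor, frozen_feats: torch.Tensor) -> torch.Tensor:
    # Regularizer: keep tuned image features near those of a frozen copy of the
    # pretrained image encoder to avoid overfitting during contrastive tuning.
    return (1.0 - F.cosine_similarity(image_feats, frozen_feats, dim=-1)).mean()

batch, dim = 8, 512
image_feats = F.normalize(torch.randn(batch, dim, requires_grad=True), dim=-1)
text_feats = F.normalize(torch.randn(batch, dim, requires_grad=True), dim=-1)
with torch.no_grad():
    frozen_image_feats = F.normalize(torch.randn(batch, dim), dim=-1)

lambda_reg = 1.0  # illustrative weight, not taken from the paper
stage2_loss = clip_contrastive_loss(image_feats, text_feats) \
    + lambda_reg * self_distillation_reg(image_feats, frozen_image_feats)
stage2_loss.backward()
```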

## Results

ProCLIP demonstrates strong performance across various tasks, including:

### Retrieval Results



### Classification Results



### Multilingual Retrieval Results



### Comparison with other LLM-embedder-based CLIP models



More detailed results and comparisons can be found in the [paper](https://huggingface.co/papers/2510.18795).

## Citation

If you find our work helpful, please cite our paper:

```bibtex
@misc{ProCLIP,
      title={ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder},
      author={Xiaoxing Hu and Kaicheng Yang and Ziyang Gong and Qi Ming and Zonghao Guo and Xiang An and Ziyong Feng and Junchi Yan and Xue Yang},
      year={2025},
      eprint={2510.18795},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.18795},
}
```

## License

This project is licensed under the MIT License. For more details, see the [LICENSE](https://github.com/VisionXLab/ProCLIP/blob/main/LICENSE) file in the GitHub repository.