---
language:
- en
license: mit
pipeline_tag: zero-shot-image-classification
tags:
- Multi-modal
- CLIP
---
# ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder

Official PyTorch implementation of ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder.
This model addresses limitations of the original CLIP text encoder by introducing a curriculum learning-based progressive vision-language alignment framework that aligns the CLIP image encoder with an LLM-based embedder, improving long-text, multilingual, and fine-grained semantic understanding.
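Since the model card is tagged for zero-shot image classification, the snippet below sketches how CLIP-style zero-shot scoring works: encode the images and a set of class prompts, normalize both, and take a temperature-scaled softmax over their cosine similarities. The `encode_image`/`encode_text` functions here are placeholders returning random features, not ProCLIP's actual API; only the scoring logic is meant as a guide.

```python
import torch
import torch.nn.functional as F

# Placeholder encoders: swap in the actual ProCLIP image encoder and
# LLM-based text embedder. Random features are used here only so the
# scoring logic below runs stand-alone.
def encode_image(images: torch.Tensor) -> torch.Tensor:
    return torch.randn(images.shape[0], 768)   # (B, D) image features

def encode_text(prompts: list[str]) -> torch.Tensor:
    return torch.randn(len(prompts), 768)      # (C, D) prompt features

prompts = [f"a photo of a {c}" for c in ["cat", "dog", "car"]]
images = torch.zeros(4, 3, 224, 224)           # dummy image batch

img_feats = F.normalize(encode_image(images), dim=-1)
txt_feats = F.normalize(encode_text(prompts), dim=-1)

# Cosine similarity scaled by a temperature, then softmax over the class prompts.
logit_scale = 100.0
probs = (logit_scale * img_feats @ txt_feats.T).softmax(dim=-1)
print(probs.argmax(dim=-1))                    # predicted class index per image
```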
## Abstract
The original CLIP text encoder is limited to a maximum input length of 77 tokens, which hampers its ability to process long texts and perform fine-grained semantic understanding. In addition, the CLIP text encoder lacks support for multilingual inputs. These limitations significantly restrict its applicability across a broader range of tasks. Recent studies have attempted to replace the CLIP text encoder with an LLM-based embedder to improve long-text processing, multilingual understanding, and fine-grained semantic comprehension. However, because the representation spaces of LLMs and the vision-language space of CLIP are pretrained independently, without alignment priors, direct alignment via contrastive learning can disrupt the intrinsic vision-language alignment of the CLIP image encoder, leading to underutilization of the knowledge acquired during pre-training. To address this challenge, we propose ProCLIP, a curriculum learning-based progressive vision-language alignment framework that effectively aligns the CLIP image encoder with an LLM-based embedder. Specifically, ProCLIP first distills knowledge from CLIP's text encoder into the LLM-based embedder to leverage CLIP's rich pretrained knowledge while establishing an initial alignment between the LLM embedder and the CLIP image encoder. Subsequently, ProCLIP further aligns the CLIP image encoder with the LLM-based embedder through image-text contrastive tuning, employing self-distillation regularization to avoid overfitting. To achieve a more effective alignment, an instance semantic alignment loss and an embedding structure alignment loss are employed during both representation inheritance and contrastive tuning.
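The abstract names two auxiliary objectives, an instance semantic alignment loss and an embedding structure alignment loss, used during both representation inheritance and contrastive tuning. The sketch below shows one plausible reading of such objectives (per-instance cosine matching plus matching of the batch-level similarity structure); the function names, exact formulations, and the 0.5 weight are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def instance_semantic_alignment(student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    # Pull each student embedding toward its matching teacher embedding
    # (illustrative reading: mean of 1 - cosine similarity per instance).
    s, t = F.normalize(student, dim=-1), F.normalize(teacher, dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()

def embedding_structure_alignment(student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    # Match the pairwise similarity structure of the two embedding spaces
    # (illustrative reading: MSE between the in-batch similarity matrices).
    s, t = F.normalize(student, dim=-1), F.normalize(teacher, dim=-1)
    return F.mse_loss(s @ s.T, t @ t.T)

# Example: LLM-embedder outputs vs. frozen CLIP text-encoder outputs for one batch.
llm_emb = torch.randn(32, 768, requires_grad=True)
clip_emb = torch.randn(32, 768)
loss = instance_semantic_alignment(llm_emb, clip_emb) + 0.5 * embedding_structure_alignment(llm_emb, clip_emb)
loss.backward()
```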
## Paper and Code
- Paper: [ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder](https://arxiv.org/abs/2510.18795)
- GitHub Repository: [VisionXLab/ProCLIP](https://github.com/VisionXLab/ProCLIP)
## Methodology
The ProCLIP framework employs a two-stage curriculum learning approach for progressive vision-language alignment (a schematic sketch follows this list):
- Stage 1: Align the LLM-based embedder with the CLIP text encoder via Cross-Architecture Distillation.
- Stage 2: Align the CLIP image encoder with the LLM-based embedder via image-text contrastive tuning, with Self-Distillation Regularization to avoid overfitting.
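The loop below is a minimal, runnable caricature of the two stages, using `nn.Linear` stand-ins for the real encoders: Stage 1 distills the frozen CLIP text encoder into the LLM-based embedder (a simple MSE on normalized features stands in for the alignment losses sketched above), and Stage 2 contrastively tunes the image encoder against the embedder while a frozen copy regularizes it via self-distillation. The module names, frozen/trainable split, and the 0.5 regularization weight are assumptions for illustration, not the official training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def clip_contrastive_loss(img: torch.Tensor, txt: torch.Tensor, scale: float = 100.0) -> torch.Tensor:
    # Symmetric InfoNCE over a batch of paired image/text embeddings.
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = scale * img @ txt.T
    labels = torch.arange(img.shape[0])
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

# Stand-ins for the real encoders (in ProCLIP these are a CLIP image tower,
# the frozen CLIP text encoder, and an LLM-based embedder).
dim = 64
clip_text_encoder  = nn.Linear(dim, dim)   # frozen teacher in Stage 1
llm_embedder       = nn.Linear(dim, dim)   # trainable LLM-based embedder
clip_image_encoder = nn.Linear(dim, dim)   # tuned in Stage 2
frozen_image_copy  = nn.Linear(dim, dim)   # self-distillation teacher in Stage 2
opt = torch.optim.AdamW(list(llm_embedder.parameters()) + list(clip_image_encoder.parameters()), lr=1e-4)

text_in, image_in = torch.randn(8, dim), torch.randn(8, dim)

# Stage 1: distill the frozen CLIP text encoder into the LLM-based embedder.
with torch.no_grad():
    teacher = clip_text_encoder(text_in)
student = llm_embedder(text_in)
stage1_loss = F.mse_loss(F.normalize(student, dim=-1), F.normalize(teacher, dim=-1))
stage1_loss.backward(); opt.step(); opt.zero_grad()

# Stage 2: image-text contrastive tuning with self-distillation regularization,
# which keeps the image encoder close to its pretrained behaviour.
img = clip_image_encoder(image_in)
txt = llm_embedder(text_in)
with torch.no_grad():
    img_ref = frozen_image_copy(image_in)
stage2_loss = clip_contrastive_loss(img, txt) + 0.5 * F.mse_loss(img, img_ref)
stage2_loss.backward(); opt.step(); opt.zero_grad()
```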
## Results

ProCLIP demonstrates strong performance across a range of evaluations, including:
- Retrieval
- Classification
- Multilingual retrieval
- Comparison with other LLM-embedder-based CLIP models

More detailed results and comparisons can be found in the paper.
## Citation
If you find our work helpful, please cite our paper:
```bibtex
@misc{ProCLIP,
  title={ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder},
  author={Xiaoxing Hu and Kaicheng Yang and Ziyang Gong and Qi Ming and Zonghao Guo and Xiang An and Ziyong Feng and Junchi Yan and Xue Yang},
  year={2025},
  eprint={2510.18795},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.18795},
}
```
## License
This project is licensed under the MIT License. For more details, see the LICENSE file in the GitHub repository.



