---
language:
  - en
license: mit
pipeline_tag: zero-shot-image-classification
tags:
  - Multi-modal
  - CLIP
---

# ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder

Official PyTorch implementation of ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder.

This model addresses limitations of the original CLIP text encoder by introducing a curriculum learning-based progressive vision-language alignment framework. It aims to effectively align the CLIP image encoder with an LLM-based embedder to enhance long-text, multilingual, and fine-grained understanding.
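Below is a minimal zero-shot classification sketch in the usual CLIP style (image and text embeddings compared by cosine similarity). The loading call and the `encode_image` / `encode_text` / `tokenizer` names are assumptions for illustration; refer to the GitHub repository for the actual API.

```python
# Hypothetical usage sketch -- the real loading API may differ; see the official repo.
import torch
from PIL import Image

# `proclip` is a placeholder for the project's loading utilities:
# model, preprocess, tokenizer = proclip.load("ProCLIP-ViT-L-14")

def zero_shot_classify(model, preprocess, tokenizer, image_path, class_prompts, device="cuda"):
    """Score an image against free-form text prompts (CLIP-style zero-shot classification)."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    texts = tokenizer(class_prompts).to(device)

    with torch.no_grad():
        img_emb = model.encode_image(image)   # (1, d)
        txt_emb = model.encode_text(texts)    # (num_prompts, d), from the LLM-based embedder

    # Cosine similarity, then softmax over the candidate prompts.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_emb @ txt_emb.T).softmax(dim=-1)
    return probs.squeeze(0).tolist()

# Long or multilingual prompts are exactly what the LLM-based embedder is meant to handle, e.g.:
# probs = zero_shot_classify(model, preprocess, tokenizer, "cat.jpg",
#                            ["a photo of a cat", "a photo of a dog", "一张猫的照片"])
```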

## Abstract

The original CLIP text encoder is limited by a maximum input length of 77 tokens, which hampers its ability to effectively process long texts and perform fine-grained semantic understanding. In addition, the CLIP text encoder lacks support for multilingual inputs. All these limitations significantly restrict its applicability across a broader range of tasks. Recent studies have attempted to replace the CLIP text encoder with an LLM-based embedder to enhance its ability in processing long texts, multilingual understanding, and fine-grained semantic comprehension. However, because the representation spaces of LLMs and the vision-language space of CLIP are pretrained independently without alignment priors, direct alignment using contrastive learning can disrupt the intrinsic vision-language alignment in the CLIP image encoder, leading to an underutilization of the knowledge acquired during pre-training. To address this challenge, we propose ProCLIP, a curriculum learning-based progressive vision-language alignment framework to effectively align the CLIP image encoder with an LLM-based embedder. Specifically, ProCLIP first distills knowledge from CLIP's text encoder into the LLM-based embedder to leverage CLIP's rich pretrained knowledge while establishing initial alignment between the LLM embedder and CLIP image encoder. Subsequently, ProCLIP further aligns the CLIP image encoder with the LLM-based embedder through image-text contrastive tuning, employing self-distillation regularization to avoid overfitting. To achieve a more effective alignment, instance semantic alignment loss and embedding structure alignment loss are employed during representation inheritance and contrastive tuning.

## Paper and Code

- Paper: [ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder](https://arxiv.org/abs/2510.18795)
- Code: official PyTorch implementation, available in the project's GitHub repository.

## Methodology

The ProCLIP framework employs a two-stage curriculum learning approach for progressive vision-language alignment (a schematic loss sketch follows the list):

- **Stage 1:** Align the LLM-based embedder with the CLIP text encoder via Cross-Architecture Distillation.
- **Stage 2:** Align the LLM-based embedder with the CLIP image encoder via image-text contrastive tuning with Self-Distillation Regularization.
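
The following is a minimal, schematic PyTorch sketch of how the two stages and the losses named above (instance semantic alignment, embedding structure alignment, self-distillation) could be composed. It is an assumption-laden illustration of the description in this card, not the official training code; exact formulations, weights, and hyperparameters differ in the paper.

```python
# Schematic sketch of the two-stage training objective; not the official implementation.
import torch
import torch.nn.functional as F

def instance_semantic_alignment(student_emb, teacher_emb):
    """Pull each student embedding toward its matching teacher embedding (instance level)."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()

def embedding_structure_alignment(student_emb, teacher_emb):
    """Match the pairwise similarity structure of the student batch to the teacher batch."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    return F.mse_loss(s @ s.T, t @ t.T)

def stage1_distillation_loss(llm_text_emb, clip_text_emb, w_struct=1.0):
    """Stage 1: distill CLIP's text encoder into the LLM-based embedder."""
    return (instance_semantic_alignment(llm_text_emb, clip_text_emb)
            + w_struct * embedding_structure_alignment(llm_text_emb, clip_text_emb))

def stage2_loss(image_emb, llm_text_emb, frozen_text_emb, logit_scale, w_sd=1.0):
    """Stage 2: image-text contrastive tuning with self-distillation regularization."""
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(llm_text_emb, dim=-1)
    logits = logit_scale * img @ txt.T
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
    # Self-distillation: keep the tuned embedder close to a frozen copy from Stage 1.
    self_distill = instance_semantic_alignment(llm_text_emb, frozen_text_emb)
    return contrastive + w_sd * self_distill
```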

## Results

ProCLIP demonstrates strong performance across various tasks, including:

- Retrieval
- Classification
- Multilingual retrieval
- Comparison with other LLM-embedder-based CLIP models

More detailed results and comparisons can be found in the paper.

## Citation

If you find our work helpful, please cite our paper:

```bibtex
@misc{ProCLIP,
      title={ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder},
      author={Xiaoxing Hu and Kaicheng Yang and Ziyang Gong and Qi Ming and Zonghao Guo and Xiang An and Ziyong Feng and Junchi Yan and Xue Yang},
      year={2025},
      eprint={2510.18795},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.18795},
}
```

## License

This project is licensed under the MIT License. For more details, see the LICENSE file in the GitHub repository.