scvreadgo committed on
Commit 4e0e896 · verified · 1 Parent(s): e54c966

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -32,9 +32,9 @@ This repository provides the official models for the following paper:
 
 > **Abstract:** *Recent progress in text-to-image (T2I) diffusion models has greatly improved image quality and flexibility. However, a major challenge in personalized generation remains: preserving the subject’s identity (ID) while allowing diverse visual changes. We address this with a new framework for ID-preserving image generation. Instead of relying on adapter modules to inject identity features into pre-trained models, we propose a unified multi-modal encoding strategy that jointly captures identity and text information. Our method, called FaceCLIP, learns a shared embedding space for facial identity and textual semantics. Given a reference face image and a text prompt, FaceCLIP produces a joint representation that guides the generative model to synthesize images consistent with both the subject’s identity and the prompt. To train FaceCLIP, we introduce a multi-modal alignment loss that aligns features across face, text, and image domains. We then integrate FaceCLIP with existing UNet and Diffusion Transformer (DiT) architectures, forming a complete synthesis pipeline FaceCLIP-x. Compared to existing ID-preserving approaches, our method produces more photorealistic portraits with better identity retention and text alignment. Extensive experiments demonstrate that FaceCLIP-x outperforms prior methods in both qualitative and quantitative evaluations.*
 
-![teaser](../asset/readme/teaser.jpeg)
-![teaser](../asset/readme/demo.jpeg)
-![teaser](../asset/readme/arch.jpeg)
+![teaser](asset/readme/teaser.jpeg)
+![teaser](asset/readme/demo.jpeg)
+![teaser](asset/readme/arch.jpeg)
 
 ## 🔧 Installation and Usage
 