ByteDance
/

FaceCLIP

subject-personalization

Model card Files Files and versions

scvreadgo commited on 18 days ago

Commit

4e0e896

·

verified ·

1 Parent(s): e54c966

Update README.md

Files changed (1) hide show

README.md +3 -3

README.md CHANGED Viewed

@@ -32,9 +32,9 @@ This repository provides the official models for the following paper:
 > **Abstract:** *Recent progress in text-to-image (T2I) diffusion models has greatly improved image quality and flexibility. However, a major challenge in personalized generation remains: preserving the subject’s identity (ID) while allowing diverse visual changes. We address this with a new framework for ID-preserving image generation. Instead of relying on adapter modules to inject identity features into pre-trained models, we propose a unified multi-modal encoding strategy that jointly captures identity and text information. Our method, called FaceCLIP, learns a shared embedding space for facial identity and textual semantics. Given a reference face image and a text prompt, FaceCLIP produces a joint representation that guides the generative model to synthesize images consistent with both the subject’s identity and the prompt. To train FaceCLIP, we introduce a multi-modal alignment loss that aligns features across face, text, and image domains. We then integrate FaceCLIP with existing UNet and Diffusion Transformer (DiT) architectures, forming a complete synthesis pipeline FaceCLIP-x. Compared to existing ID-preserving approaches, our method produces more photorealistic portraits with better identity retention and text alignment. Extensive experiments demonstrate that FaceCLIP-x outperforms prior methods in both qualitative and quantitative evaluations.*
-![teaser](../asset/readme/teaser.jpeg)
-![teaser](../asset/readme/demo.jpeg)
-![teaser](../asset/readme/arch.jpeg)
 ## 🔧 Installation and Usage

 > **Abstract:** *Recent progress in text-to-image (T2I) diffusion models has greatly improved image quality and flexibility. However, a major challenge in personalized generation remains: preserving the subject’s identity (ID) while allowing diverse visual changes. We address this with a new framework for ID-preserving image generation. Instead of relying on adapter modules to inject identity features into pre-trained models, we propose a unified multi-modal encoding strategy that jointly captures identity and text information. Our method, called FaceCLIP, learns a shared embedding space for facial identity and textual semantics. Given a reference face image and a text prompt, FaceCLIP produces a joint representation that guides the generative model to synthesize images consistent with both the subject’s identity and the prompt. To train FaceCLIP, we introduce a multi-modal alignment loss that aligns features across face, text, and image domains. We then integrate FaceCLIP with existing UNet and Diffusion Transformer (DiT) architectures, forming a complete synthesis pipeline FaceCLIP-x. Compared to existing ID-preserving approaches, our method produces more photorealistic portraits with better identity retention and text alignment. Extensive experiments demonstrate that FaceCLIP-x outperforms prior methods in both qualitative and quantitative evaluations.*
+![teaser](asset/readme/teaser.jpeg)
+![teaser](asset/readme/demo.jpeg)
+![teaser](asset/readme/arch.jpeg)
 ## 🔧 Installation and Usage