---
license: mit
pipeline_tag: image-to-image
library_name: diffusers
---

<h1 align="center">
  πŸš€ REPA-E <em>for</em> T2I
</h1>

<p align="center">
  <em>End-to-End Tuned VAEs for Supercharging Text-to-Image Diffusion Transformers</em>
</p>

<p align="center">
  <a href="https://End2End-Diffusion.github.io/repa-e-t2i">🌐 Project Page</a> &ensp;
  <a href="https://huggingface.co/REPA-E/models">πŸ€— Models</a> &ensp;
  <a href="https://arxiv.org/abs/2504.10483">πŸ“ƒ Paper</a> &ensp;
  <br><br>
  <!-- <a href="https://paperswithcode.com/sota/image-generation-on-imagenet-256x256?p=repa-e-unlocking-vae-for-end-to-end-tuning-of"><img src="https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/repa-e-unlocking-vae-for-end-to-end-tuning-of/image-generation-on-imagenet-256x256" alt="PWC"></a> -->
</p>

<!-- <p align="center">
  <a href="https://scholar.google.com.au/citations?user=GQzvqS4AAAAJ" target="_blank">Xingjian&nbsp;Leng</a><sup>1,2*</sup> &ensp; <b>&middot;</b> &ensp;
  <a href="https://1jsingh.github.io/" target="_blank">Jaskirat&nbsp;Singh</a><sup>1</sup> &ensp; <b>&middot;</b> &ensp;
  <a href="https://rynmurdock.github.io/" target="_blank">Ryan&nbsp;Murdock</a><sup>2</sup> &ensp; <b>&middot;</b> &ensp;
  <a href="https://www.ethansmith2000.com/" target="_blank">Ethan&nbsp;Smith</a><sup>2</sup> &ensp; <b>&middot;</b> &ensp;
  <a href="https://xiaoyang-rebecca.github.io/cv/" target="_blank">Rebecca&nbsp;Li</a><sup>2</sup> &ensp; <b>&middot;</b> &ensp;
  <a href="https://www.sainingxie.com/" target="_blank">Saining&nbsp;Xie</a><sup>3</sup>&ensp; <b>&middot;</b> &ensp;
  <a href="https://zheng-lab-anu.github.io/" target="_blank">Liang&nbsp;Zheng</a><sup>1</sup>&ensp;
</p>

<p align="center">
  <sup>1</sup> Australian National University &emsp; <sup>2</sup>Canva &emsp; <sup>3</sup>New York University &emsp; <br>
  <sub><sup>*</sup>Done during internship at Canva &emsp;</sub>
</p>

<p align="center">
  <a href="https://arxiv.org/abs/2504.10483" target="_blank">πŸ“„ REPA-E Paper</a> &ensp; | &ensp;
  <a href="https://end2end-diffusion.github.io/repa-e-t2i/" target="_blank">🌐 Blog Post</a> &ensp; | &ensp;
  <a href="https://huggingface.co/REPA-E" target="_blank">πŸ€— Models</a>
</p> -->

---

## πŸš€ Overview

<p>
We present REPA-E for T2I, a family of end-to-end tuned VAEs designed to supercharge text-to-image generation training. These models consistently outperform FLUX-VAE across all benchmarks (COCO-30K, DPG-Bench, GenAI-Bench, GenEval, and MJHQ-30K) without requiring any additional representation alignment losses.
</p>

<p>
  For training, we adopt the <a href="https://github.com/End2End-Diffusion/REPA-E" target="_blank"><strong>official REPA-E training code</strong></a> to optimize the
  <a href="https://huggingface.co/black-forest-labs/FLUX.1-dev" target="_blank">FLUX-VAE</a> for <strong>80 epochs</strong> with a batch size of <strong>256</strong> on the <strong>ImageNet-256</strong> dataset.
  REPA-E training refines the VAE's latent-space structure, enabling faster convergence when training downstream text-to-image latent diffusion models.
</p>

<p>
  This repository provides <code>diffusers</code>-compatible weights for the <strong>end-to-end trained FLUX-VAE</strong>. In addition, we release <strong>end-to-end trained variants</strong> of several other widely used VAEs to facilitate research and integration within text-to-image diffusion frameworks.
</p>

## ⚑️ Quickstart 
```python
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-flux-vae").to("cuda")
```
> Use `vae.encode(...)` / `vae.decode(...)` in your pipeline. (A full example is provided below.)
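
The weights load in float32 by default. If you are embedding the VAE in a mixed-precision pipeline, `from_pretrained` also accepts a `torch_dtype` override; whether reduced precision is appropriate for your setup is a judgment call, so treat this as an optional sketch:

```python
import torch
from diffusers import AutoencoderKL

# Optional: load in bfloat16 for mixed-precision pipelines (float32 is the default)
vae = AutoencoderKL.from_pretrained(
    "REPA-E/e2e-flux-vae", torch_dtype=torch.bfloat16
).to("cuda")
```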

### 🧩 End-to-End Trained VAE Releases

| Model | Hugging Face Link |
|-------|-------------------|
| **E2E-FLUX-VAE** | πŸ€— [REPA-E/e2e-flux-vae](https://huggingface.co/REPA-E/e2e-flux-vae) |
| **E2E-SD-3.5-VAE** | πŸ€— [REPA-E/e2e-sd3.5-vae](https://huggingface.co/REPA-E/e2e-sd3.5-vae) |
| **E2E-Qwen-Image-VAE** | πŸ€— [REPA-E/e2e-qwenimage-vae](https://huggingface.co/REPA-E/e2e-qwenimage-vae) |
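
The other variants are loaded the same way by changing the repository id. As a quick illustration (assuming the SD-3.5 release follows the same `AutoencoderKL` layout as the FLUX release above; the Qwen-Image variant may instead use its corresponding `diffusers` autoencoder class):

```python
from diffusers import AutoencoderKL

# Same interface, different repo id
vae = AutoencoderKL.from_pretrained("REPA-E/e2e-sd3.5-vae").to("cuda")
```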

## πŸ“¦ Requirements
The following packages are required to load and run the REPA-E VAEs with the `diffusers` library:

```bash
pip install "diffusers>=0.33.0"
pip install "torch>=2.3.1"
```

## πŸš€ Example Usage
Below is a minimal example showing how to load and use the REPA-E end-to-end trained FLUX-VAE with `diffusers`:

```python
from io import BytesIO
import requests

from diffusers import AutoencoderKL
import numpy as np
import torch
from PIL import Image


# Download an example image
response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"

# Convert to an RGB tensor in NCHW layout and normalize to [-1, 1]
image = torch.from_numpy(
    np.array(Image.open(BytesIO(response.content)).convert("RGB"))
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)

# Load the end-to-end trained FLUX-VAE
vae = AutoencoderKL.from_pretrained("REPA-E/e2e-flux-vae").to(device)

with torch.no_grad():
    # Encode to latents (sampled from the posterior), then decode back to pixel space
    latents = vae.encode(image).latent_dist.sample()
    reconstructed = vae.decode(latents).sample  # values approximately in [-1, 1]
```
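
To inspect the result, the decoded tensor can be mapped from [-1, 1] back to an 8-bit image. A minimal sketch, continuing from the variables in the example above:

```python
# Map from [-1, 1] back to [0, 255] and save the reconstruction
out = (reconstructed / 2 + 0.5).clamp(0, 1)
out = (out[0].permute(1, 2, 0).cpu().numpy() * 255).round().astype("uint8")
Image.fromarray(out).save("reconstruction.png")
```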