Update README.md
README.md CHANGED
@@ -7,7 +7,7 @@ license_link: LICENSE
 <!-- ## **HunyuanVideo** -->
 
 <p align="center">
-<img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/main/assets/logo.png" height=100>
+<img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/refs/heads/main/assets/logo.png" height=100>
 </p>
 
 # HunyuanVideo: A Systematic Framework For Large Video Generation Model Training
@@ -71,7 +71,7 @@ using a large language model, and used as the condition. Gaussian noise and cond
 input, our generation model generates an output latent, which is decoded to images or videos through
 the 3D VAE decoder.
 <p align="center">
-<img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/main/assets/overall.png" height=300>
+<img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/refs/heads/main/assets/overall.png" height=300>
 </p>
 
 ## 🎉 **HunyuanVideo Key Features**
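The context above summarizes the generation flow: prompt features from a large language model plus Gaussian noise go into the generation model, which emits an output latent that the 3D VAE decoder turns back into images or videos. Below is a minimal, runnable stand-in sketch of that flow; every module, shape, and name is an illustrative placeholder, not the repository's API.

```python
# Stand-in sketch of the described flow: noise + text condition -> latent -> decode.
import torch
import torch.nn as nn

class ToyGenerator(nn.Module):
    """Placeholder for the generation model: maps (latent, text condition) -> latent."""
    def __init__(self, channels=16, text_dim=32):
        super().__init__()
        self.proj = nn.Linear(text_dim, channels)
        self.mix = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, latent, text_feat):
        cond = self.proj(text_feat).view(1, -1, 1, 1, 1)  # broadcast condition over T, H, W
        return self.mix(latent + cond)

class ToyVAEDecoder(nn.Module):
    """Placeholder "3D VAE decoder": latent (C=16, T/4, H/8, W/8) -> RGB video."""
    def __init__(self, channels=16):
        super().__init__()
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=(4, 8, 8), mode="trilinear"),
            nn.Conv3d(channels, 3, kernel_size=3, padding=1),
        )

    def forward(self, latent):
        return self.up(latent)

text_feat = torch.randn(1, 32)          # pretend prompt features from the text encoder
latent = torch.randn(1, 16, 8, 16, 16)  # Gaussian-noise latent as the starting point
video = ToyVAEDecoder()(ToyGenerator()(latent, text_feat))
print(video.shape)                      # torch.Size([1, 3, 32, 128, 128])
```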
@@ -83,7 +83,7 @@ tokens and feed them into subsequent Transformer blocks for effective multimodal
 This design captures complex interactions between visual and semantic information, enhancing
 overall model performance.
 <p align="center">
-<img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/main/assets/backbone.png" height=350>
+<img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/refs/heads/main/assets/backbone.png" height=350>
 </p>
 
 ### **MLLM Text Encoder**
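The multimodal fusion this hunk describes (visual and text tokens processed jointly by shared Transformer blocks) reduces, at its core, to concatenating the two token sequences and letting full attention run across both. A short sketch, with dimensions and token counts as illustrative assumptions rather than the model's actual configuration:

```python
# Concatenate video and text tokens into one sequence; attention then spans both modalities.
import torch
import torch.nn as nn

dim = 64
block = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)

video_tokens = torch.randn(1, 256, dim)  # e.g. patchified video-latent tokens
text_tokens = torch.randn(1, 32, dim)    # e.g. refined prompt tokens

fused = torch.cat([video_tokens, text_tokens], dim=1)  # one multimodal sequence
out = block(fused)  # every token can attend to every other, visual or textual
print(out.shape)    # torch.Size([1, 288, 64])
```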
@@ -91,13 +91,13 @@ Some previous text-to-video models typically use pretrained CLIP and T5-XXL as te
 Compared with CLIP, MLLM has demonstrated superior ability in image detail description
 and complex reasoning; (iii) MLLM can act as a zero-shot learner by following system instructions prepended to user prompts, helping text features pay more attention to key information. In addition, MLLM is based on causal attention while T5-XXL utilizes bidirectional attention, which produces better text guidance for diffusion models. Therefore, we introduce an extra bidirectional token refiner to enhance text features.
 <p align="center">
-<img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/main/assets/text_encoder.png" height=275>
+<img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/refs/heads/main/assets/text_encoder.png" height=275>
 </p>
 
 ### **3D VAE**
 HunyuanVideo trains a 3D VAE with CausalConv3D to compress pixel-space videos and images into a compact latent space. We set the compression ratios of video length, space and channel to 4, 8 and 16, respectively. This can significantly reduce the number of tokens for the subsequent diffusion transformer model, allowing us to train on videos at their original resolution and frame rate.
 <p align="center">
-<img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/main/assets/3dvae.png" height=150>
+<img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/refs/heads/main/assets/3dvae.png" height=150>
 </p>
 
 ### **Prompt Rewrite**
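The stated ratios make the token savings easy to check. With length, space, and channel compression of 4, 8, and 16, a back-of-the-envelope calculation (the frame count and the exact causal temporal rounding are simplified assumptions here) shows roughly a 48x reduction in the number of values the diffusion transformer has to handle:

```python
# Compression arithmetic for the stated ratios: length 4, space 8, channel 16.
T, H, W = 128, 720, 1280                      # assumed input video: frames x height x width
pixels = 3 * T * H * W                        # RGB values in pixel space
latent = 16 * (T // 4) * (H // 8) * (W // 8)  # 16-channel latent at T/4 x H/8 x W/8
print(latent, pixels / latent)                # 7372800 48.0 -> ~48x fewer values
```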