Update README.md
README.md
CHANGED
license_link: LICENSE
---

<div align="center">

<img src="./assets/logo.png" alt="HunyuanImage-3.0 Logo" width="400">

# 🎨 HunyuanImage-3.0: A Powerful Native Multimodal Model for Image Generation

</div>

<div align="center">
<a href=https://hunyuan.tencent.com/image target="_blank"><img src=https://img.shields.io/badge/Official%20Site-333399.svg?logo=homepage height=22px></a>
<a href=https://huggingface.co/tencent/HunyuanImage-3.0 target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Models-d96902.svg height=22px></a>
<a href=https://github.com/Tencent-Hunyuan/HunyuanImage-3.0 target="_blank"><img src=https://img.shields.io/badge/Page-bb8a2e.svg?logo=github height=22px></a>
<a href="coming soon" target="_blank"><img src=https://img.shields.io/badge/Report-b5212f.svg?logo=arxiv height=22px></a>
<a href=https://x.com/TencentHunyuan target="_blank"><img src=https://img.shields.io/badge/Hunyuan-black.svg?logo=x height=22px></a>
</div>

## 📑 Open-source Plan

- HunyuanImage-3.0 (Image Generation Model)
  - [x] Inference
  - [x] Pretrain Checkpoints
  - [ ] Instruct Checkpoints
  - [ ] VLLM Support
  - [ ] Distilled Checkpoints
  - [ ] Image-to-Image Generation
  - [ ] Multi-turn Interaction

- [🎨 Interactive Gradio Demo](#-interactive-gradio-demo)
- [🧱 Models Cards](#-models-cards)
- [📜 Prompt Guide](#-prompt-guide)
  - [Manually Writing Prompts](#manually-writing-prompts)
  - [System Prompt For Automatic Rewriting the Prompt](#system-prompt-for-automatic-rewriting-the-prompt)
  - [Advanced Tips](#advanced-tips)
  - [More Cases](#more-cases)
- [📊 Evaluation](#-evaluation)

## ✨ Key Features

* 🧠 **Unified Multimodal Architecture:** Moving beyond the prevalent DiT-based architectures, HunyuanImage-3.0 employs a unified autoregressive framework. This design enables a more direct and integrated modeling of text and image modalities, leading to surprisingly effective and contextually rich image generation.

* 🏆 **The Largest Image Generation MoE Model:** This is the largest open-source image generation Mixture of Experts (MoE) model to date. It features 64 experts and a total of 80 billion parameters, with 13 billion activated per token, significantly enhancing its capacity and performance.

* 🎨 **Superior Image Generation Performance:** Through rigorous dataset curation and advanced reinforcement learning post-training, we've achieved an optimal balance between semantic accuracy and visual excellence. The model demonstrates exceptional prompt adherence while delivering photorealistic imagery with stunning aesthetic quality and fine-grained details.

# FlashInfer for optimized MoE inference. v0.3.1 is tested.
pip install flashinfer-python
```

> 💡 **Installation Tips:** It is critical that the CUDA version used by PyTorch matches the system's CUDA version.
> FlashInfer relies on this compatibility when compiling kernels at runtime. PyTorch 2.7.1+cu128 is tested.
> GCC version >=9 is recommended for compiling FlashAttention and FlashInfer.

> ⚡ **Performance Tips:** These optimizations can significantly speed up your inference!

> 💡 **Note:** When FlashInfer is enabled, the first inference may be slower (about 10 minutes) due to kernel compilation. Subsequent inferences on the same machine will be much faster.

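Before relying on FlashAttention or FlashInfer, it can help to confirm the environment. The snippet below is an illustrative addition (not one of the repository's scripts); it prints the CUDA version the installed PyTorch build targets and probes whether the optional acceleration packages are importable:

```python
import importlib.util

import torch

# The CUDA version baked into the PyTorch build should match the system toolkit,
# e.g. 12.8 for the tested torch 2.7.1+cu128 wheel.
print("torch version:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())

# Probe the optional acceleration packages without importing them.
for pkg in ("flash_attn", "flashinfer"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'installed' if found else 'not installed'}")
```
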
## 🚀 Usage

### 🔥 Quick Start with Transformers

The easiest way to get started with HunyuanImage-3.0:

```python
from transformers import AutoModelForCausalLM

# Load the model
model_id = "tencent/HunyuanImage-3.0"

kwargs = dict(
    attn_implementation="sdpa",   # Use "flash_attention_2" if FlashAttention is installed
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
    moe_impl="eager",             # Use "flashinfer" if FlashInfer is installed
)

model = AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
model.load_tokenizer(model_id)

# Generate the image
prompt = "A brown and white dog is running on the grass"
image = model.generate_image(prompt=prompt, stream=True)
image.save("image.png")
```

### 📂 Local Installation & Usage

```bash
# Download from HuggingFace
hf download tencent/HunyuanImage-3.0 --local-dir ./HunyuanImage-3
```

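If you prefer not to use the `hf` CLI, the same snapshot can be fetched from Python with `huggingface_hub` (a minimal sketch; install it with `pip install huggingface_hub` if it is not already present):

```python
from huggingface_hub import snapshot_download

# Mirror of the CLI command above: download the whole model repository
# into ./HunyuanImage-3.
snapshot_download(
    repo_id="tencent/HunyuanImage-3.0",
    local_dir="./HunyuanImage-3",
)
```
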
#### 3️⃣ Run the Demo

```bash
python3 run_image_gen.py --model-id ./HunyuanImage-3 --verbose 1 --prompt "A brown and white dog is running on the grass"
```

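If you would rather drive generation from Python than through the demo script, a minimal sketch (assuming the local directory created by the download step) mirrors the Quick Start but loads from disk:

```python
from transformers import AutoModelForCausalLM

model_path = "./HunyuanImage-3"  # local directory from the download step

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    attn_implementation="sdpa",   # or "flash_attention_2" if FlashAttention is installed
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
    moe_impl="eager",             # or "flashinfer" if FlashInfer is installed
)
model.load_tokenizer(model_path)

image = model.generate_image(prompt="A brown and white dog is running on the grass", stream=True)
image.save("image.png")
```
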
### 🎨 Interactive Gradio Demo

## 📜 Prompt Guide

### Manually Writing Prompts

The Pretrain Checkpoint does not automatically rewrite or enhance input prompts, while the Instruct Checkpoint can rewrite or enhance input prompts with thinking. For optimal results at this stage, we recommend that community partners consult our official guide on how to write effective prompts.

Reference: [HunyuanImage 3.0 Prompt Handbook](https://docs.qq.com/doc/DUVVadmhCdG9qRXBU)

### System Prompt For Automatic Rewriting the Prompt

We've included two system prompts in the PE folder of this repository that leverage DeepSeek to automatically enhance user inputs:

* **system_prompt_universal**: This system prompt converts photographic-style and artistic prompts into detailed ones.
* **system_prompt_text_rendering**: This system prompt converts UI/poster/text-rendering prompts into detailed ones that suit the model.

Note that these system prompts are written in Chinese because DeepSeek works better with Chinese system prompts. If you want to use them with an English-oriented model, you can translate them into English or refer to the comments in the PE files as a guide.

We have also created a [Yuanqi workflow](https://yuanqi.tencent.com/agent/H69VgtJdj3Dz) that implements the universal system prompt, so you can try it directly.

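For a fully scripted pipeline, the rewriting step can be wired up through DeepSeek's OpenAI-compatible API. The sketch below is illustrative only: the file name under `PE/`, the `deepseek-chat` model name, and the endpoint are assumptions to adapt to your setup, and the generation step reuses the Quick Start code above.

```python
from openai import OpenAI

# Load the universal rewriting system prompt shipped in the PE folder
# (adjust the path/file name to match the repository; this one is assumed).
with open("PE/system_prompt_universal.txt", encoding="utf-8") as f:
    system_prompt = f.read()

# DeepSeek exposes an OpenAI-compatible API.
client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")

user_prompt = "A brown and white dog is running on the grass"
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ],
)
rewritten_prompt = response.choices[0].message.content
print(rewritten_prompt)

# Pass rewritten_prompt to model.generate_image(...) as shown in the Quick Start.
```
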
### Advanced Tips
- **Content Priority**: Focus on describing the main subject and action first, followed by details about the environment and style. A more general description framework is: **Main subject and scene + Image quality and style + Composition and perspective + Lighting and atmosphere + Technical parameters**. Keywords can be added both before and after this structure.

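As a worked illustration of this framework, the snippet below assembles a prompt from the five parts and feeds it to the model loaded as in the Quick Start; the wording and file name are illustrative examples, not prescribed by the guide:

```python
from transformers import AutoModelForCausalLM

model_id = "tencent/HunyuanImage-3.0"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="sdpa",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
    moe_impl="eager",
)
model.load_tokenizer(model_id)

# Main subject and scene + Image quality and style + Composition and perspective
# + Lighting and atmosphere + Technical parameters.
parts = [
    "A brown and white dog running across a dew-covered meadow",  # main subject and scene
    "photorealistic, highly detailed",                            # image quality and style
    "low-angle shot with rule-of-thirds composition",             # composition and perspective
    "soft golden-hour backlight and light morning mist",          # lighting and atmosphere
    "85mm lens, shallow depth of field",                          # technical parameters
]
prompt = ", ".join(parts)

image = model.generate_image(prompt=prompt, stream=True)
image.save("meadow_dog.png")
```
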
</p>

## 📚 Citation

If you find HunyuanImage-3.0 useful in your research, please cite our work:

* 🎨 [Diffusers](https://github.com/huggingface/diffusers) - Diffusion models library
* 🤗 [HuggingFace](https://huggingface.co/) - AI model hub and community
* ⚡ [FlashAttention](https://github.com/Dao-AILab/flash-attention) - Memory-efficient attention
* 🚀 [FlashInfer](https://github.com/flashinfer-ai/flashinfer) - Optimized inference engine