somos99 committed
Commit 7b49680 · verified · 1 Parent(s): b80ce60

Update README.md

Files changed (1)
  1. README.md +36 -24
README.md CHANGED
@@ -5,11 +5,12 @@ license: other
 license_link: LICENSE
 ---
 
+
 <div align="center">
 
 <img src="./assets/logo.png" alt="HunyuanImage-3.0 Logo" width="400">
 
-# 🎨 HunyuanImage-3.0: A Powerful Native Multimodal Model for Text-to-Image Generation
+# 🎨 HunyuanImage-3.0: A Powerful Native Multimodal Model for Image Generation
 
 </div>
 
@@ -22,10 +23,10 @@ license_link: LICENSE
 
 
 <div align="center">
-<a href=xxxx target="_blank"><img src=https://img.shields.io/badge/Official%20Site-333399.svg?logo=homepage height=22px></a>
+<a href=https://hunyuan.tencent.com/image target="_blank"><img src=https://img.shields.io/badge/Official%20Site-333399.svg?logo=homepage height=22px></a>
 <a href=https://huggingface.co/tencent/HunyuanImage-3.0 target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Models-d96902.svg height=22px></a>
 <a href=https://github.com/Tencent-Hunyuan/HunyuanImage-3.0 target="_blank"><img src= https://img.shields.io/badge/Page-bb8a2e.svg?logo=github height=22px></a>
-<a href=xxxx target="_blank"><img src=https://img.shields.io/badge/Report-b5212f.svg?logo=arxiv height=22px></a>
+<a href="come soon" target="_blank"><img src=https://img.shields.io/badge/Report-b5212f.svg?logo=arxiv height=22px></a>
 <a href=https://x.com/TencentHunyuan target="_blank"><img src=https://img.shields.io/badge/Hunyuan-black.svg?logo=x height=22px></a>
 </div>
 
@@ -40,11 +41,12 @@ If you develop/use HunyuanImage-3.0 in your projects, welcome to let us know.
 
 ## πŸ“‘ Open-source Plan
 
-- HunyuanImage-3.0 (Text-to-Image Model)
+- HunyuanImage-3.0 (Image Generation Model)
   - [x] Inference
-  - [x] Pretain Checkpoints
-  - [x] Instruct Checkpoints
+  - [x] Pretrain Checkpoints
+  - [ ] Instruct Checkpoints
   - [ ] VLLM Support
+  - [ ] Distilled Checkpoints
   - [ ] Image-to-Image Generation
   - [ ] Multi-turn Interaction
 
@@ -66,6 +68,8 @@ If you develop/use HunyuanImage-3.0 in your projects, welcome to let us know.
 - [🎨 Interactive Gradio Demo](#-interactive-gradio-demo)
 - [🧱 Models Cards](#-models-cards)
 - [πŸ“ Prompt Guide](#-prompt-guide)
+  - [Manually Writing Prompts](#manually-writing-prompts)
+  - [System Prompt For Automatic Rewriting the Prompt](#system-prompt-for-automatic-rewriting-the-prompt)
 - [Advanced Tips](#advanced-tips)
 - [More Cases](#more-cases)
 - [πŸ“Š Evaluation](#-evaluation)
@@ -86,9 +90,9 @@ If you develop/use HunyuanImage-3.0 in your projects, welcome to let us know.
 
 ## ✨ Key Features
 
-* 🧠 **Unified Multimodal Architecture:** Moving beyond the prevalent DiT-based architectures, HunyuanImage-3.0 employs a unified autoregressive framework. This design enables a more direct and integrated modeling of text and image modalities, leading to surprisingly effective and contextually rich text-to-image generation.
+* 🧠 **Unified Multimodal Architecture:** Moving beyond the prevalent DiT-based architectures, HunyuanImage-3.0 employs a unified autoregressive framework. This design enables a more direct and integrated modeling of text and image modalities, leading to surprisingly effective and contextually rich image generation.
 
-* πŸ† **The Largest Text-to-Image MoE Model:** This is the largest text-to-image Mixture of Experts (MoE) model to date. It features 64 experts and a total of 80 billion parameters, with 13 billion activated per token, significantly enhancing its capacity and performance.
+* πŸ† **The Largest Image Generation MoE Model:** This is the largest open-source image generation Mixture of Experts (MoE) model to date. It features 64 experts and a total of 80 billion parameters, with 13 billion activated per token, significantly enhancing its capacity and performance.
 
 * 🎨 **Superior Image Generation Performance:** Through rigorous dataset curation and advanced reinforcement learning post-training, we've achieved an optimal balance between semantic accuracy and visual excellence. The model demonstrates exceptional prompt adherence while delivering photorealistic imagery with stunning aesthetic quality and fine-grained details.
 
@@ -131,12 +135,14 @@ pip install flash-attn==2.8.3 --no-build-isolation
 # FlashInfer for optimized moe inference. v0.3.1 is tested.
 pip install flashinfer-python
 ```
-> πŸ’‘ **Installation Tips:** It is critical that the CUDA version used by PyTorch matches the system's CUDA version.
+> πŸ’‘**Installation Tips:** It is critical that the CUDA version used by PyTorch matches the system's CUDA version.
 > FlashInfer relies on this compatibility when compiling kernels at runtime. PyTorch 2.7.1+cu128 is tested.
 > GCC version >=9 is recommended for compiling FlashAttention and FlashInfer.
 
 > ⚑ **Performance Tips:** These optimizations can significantly speed up your inference!
 
+> πŸ’‘ **Note:** When FlashInfer is enabled, the first inference may be slower (about 10 minutes) due to kernel compilation. Subsequent inferences on the same machine will be much faster.
+
 ## πŸš€ Usage
 
 ### πŸ”₯ Quick Start with Transformers
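
As a quick sanity check for the CUDA match called out in the Installation Tips above, one can inspect the PyTorch build; a minimal sketch (illustrative, not part of this commit):

```python
# Sanity-check sketch: confirm the CUDA version PyTorch was built with
# before FlashInfer compiles its kernels at runtime.
import torch

print(torch.__version__)    # e.g. "2.7.1+cu128" (the tested build)
print(torch.version.cuda)   # CUDA version of the PyTorch build, e.g. "12.8"
# Compare against the system toolkit reported by `nvcc --version`;
# the major.minor versions should agree.
```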
@@ -144,26 +150,26 @@ pip install flashinfer-python
 The easiest way to get started with HunyuanImage-3.0:
 
 ```python
-from hunyuan_image_3.hunyuan import HunyuanImage3ForCausalMM
+from transformers import AutoModelForCausalLM
 
 # Load the model
 model_id = "tencent/HunyuanImage-3.0"
 
 kwargs = dict(
-    attn_implementation="sdpa",
+    attn_implementation="sdpa",  # Use "flash_attention_2" if FlashAttention is installed
     trust_remote_code=True,
     torch_dtype="auto",
     device_map="auto",
-    moe_impl="eager",
+    moe_impl="eager",  # Use "flashinfer" if FlashInfer is installed
 )
 
-model = HunyuanImage3ForCausalMM.from_pretrained(model_id, **kwargs)
+model = AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
 model.load_tokenizer(model_id)
 
 # generate the image
-prompt = "A brown and white dog is running on the grass, 9:16"
-image = model.generate_image(prompt=prompt, cot="think", stream=True)
-image.save("test.png")
+prompt = "A brown and white dog is running on the grass"
+image = model.generate_image(prompt=prompt, stream=True)
+image.save("image.png")
 ```
 
 ### 🏠 Local Installation & Usage
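
Per the comments added in this commit, the accelerated configuration would presumably swap in both optimized backends; a minimal sketch, assuming FlashAttention and FlashInfer are installed as described in the Installation section:

```python
# Sketch: the same loading kwargs with the optimized backends enabled,
# following the comments in the snippet above. Requires FlashAttention
# and FlashInfer to be installed.
kwargs = dict(
    attn_implementation="flash_attention_2",  # FlashAttention backend
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
    moe_impl="flashinfer",                    # FlashInfer MoE kernels
)
```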
@@ -179,13 +185,13 @@ cd HunyuanImage-3.0/
 
 ```bash
 # Download from HuggingFace
-huggingface-cli download tencent/HunyuanImage-3.0 --local-dir ./HunyuanImage-3
+hf download tencent/HunyuanImage-3.0 --local-dir ./HunyuanImage-3
 ```
 
 #### 3️⃣ Run the Demo
 
 ```bash
-python3 run_local.py --model-id ./HunyuanImage-3 --prompt "A brown and white dog is running on the grass, 9:16"
+python3 run_image_gen.py --model-id ./HunyuanImage-3 --verbose 1 --prompt "A brown and white dog is running on the grass"
 ```
 
 ### 🎨 Interactive Gradio Demo
@@ -247,15 +253,22 @@ Notes:
 ## πŸ“ Prompt Guide
 
 ### Manually Writing Prompts.
-The base model does not automatically rewrite or enhance input prompts. For optimal results, we recommend consulting our official guide on how to write effective prompts.
+The Pretrain Checkpoint does not automatically rewrite or enhance input prompts, while the Instruct Checkpoint can rewrite or enhance them with thinking. For optimal results, we currently recommend consulting our official guide on how to write effective prompts.
 
 Reference: [HunyuanImage 3.0 Prompt Handbook](
 https://docs.qq.com/doc/DUVVadmhCdG9qRXBU)
 
 
-### System Prompt to automatically rewrite the input.
+### System Prompt For Automatic Rewriting the Prompt.
+
+We've included two system prompts in the PE folder of this repository that leverage DeepSeek to automatically enhance user inputs:
+
+* **system_prompt_universal**: This system prompt converts photographic-style and artistic prompts into detailed ones.
+* **system_prompt_text_rendering**: This system prompt converts UI/Poster/Text Rendering prompts into a detailed one that suits the model.
 
-We've included two system prompts in the PE folder of this repository that leverage DeepSeek and Gemini 2.5 to automatically enhance user inputs. These prompts intelligently expand and refine simple descriptions into rich, detailed versions optimized for our model, resulting in significantly improved image generation quality.
+Note that these system prompts are in Chinese because DeepSeek works better with Chinese system prompts. If you want to use them with an English-oriented model, you may translate them into English or refer to the comments in the PE file as a guide.
+
+We also created a [Yuanqi workflow](https://yuanqi.tencent.com/agent/H69VgtJdj3Dz) implementing the universal one, which you can try directly.
 
 ### Advanced Tips
 - **Content Priority**: Focus on describing the main subject and action first, followed by details about the environment and style. A more general description framework is: **Main subject and scene + Image quality and style + Composition and perspective + Lighting and atmosphere + Technical parameters**. Keywords can be added both before and after this structure.
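
The commit doesn't show how to apply these system prompts; a minimal sketch, assuming an OpenAI-compatible DeepSeek endpoint and a hypothetical file name `PE/system_prompt_universal.txt` for the universal prompt:

```python
# Sketch: rewriting a user prompt with the universal system prompt via
# DeepSeek's OpenAI-compatible chat API. The file name below is assumed;
# check the PE folder for the actual file names.
from openai import OpenAI

with open("PE/system_prompt_universal.txt", encoding="utf-8") as f:
    system_prompt = f.read()

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "A brown and white dog is running on the grass"},
    ],
)
rewritten = response.choices[0].message.content
# Pass `rewritten` to model.generate_image(...) from the Quick Start section.
```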
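
To make the Content Priority framework concrete, one illustrative way (an editorial example, not from the guide) to assemble a prompt from its five parts:

```python
# Illustrative prompt assembly following the recommended framework:
# main subject and scene + quality/style + composition/perspective
# + lighting/atmosphere + technical parameters.
parts = [
    "A snow-white cat napping on a sunlit wooden windowsill",  # subject and scene
    "photorealistic, ultra-detailed",                          # quality and style
    "close-up, eye-level composition",                         # composition and perspective
    "soft golden-hour light, warm cozy atmosphere",            # lighting and atmosphere
    "shallow depth of field, 85mm lens",                       # technical parameters
]
prompt = ", ".join(parts)
```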
@@ -329,8 +342,6 @@ We adopted the GSB (Good/Same/Bad) evaluation method commonly used to assess the
 </p>
 
 
-* πŸ† **Open Leaderboard** - Coming Soon
-
 ## πŸ“š Citation
 
 If you find HunyuanImage-3.0 useful in your research, please cite our work:
@@ -352,4 +363,5 @@ We extend our heartfelt gratitude to the following open-source projects and comm
 * 🎨 [Diffusers](https://github.com/huggingface/diffusers) - Diffusion models library
 * 🌐 [HuggingFace](https://huggingface.co/) - AI model hub and community
 * ⚑ [FlashAttention](https://github.com/Dao-AILab/flash-attention) - Memory-efficient attention
-* πŸš€ [FlashInfer](https://github.com/flashinfer-ai/flashinfer) - Optimized inference engine
+* πŸš€ [FlashInfer](https://github.com/flashinfer-ai/flashinfer) - Optimized inference engine
+