	---
	license: apache-2.0
	---
	# Qwen-Image Precise Region Control Model

	![](./assets/title.png)

	## Model Introduction

	This model is a LoRA trained on top of [Qwen-Image](https://www.modelscope.cn/models/Qwen/Qwen-Image) for precise region control. It takes a textual description and a regional condition (a mask map) for each entity, and uses them to control the position and shape of that entity in the generated image. It was trained with the [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio) framework on the [DiffSynth-Studio/EliGenTrainSet](https://www.modelscope.cn/datasets/DiffSynth-Studio/EliGenTrainSet) dataset.
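To make the "regional condition" concrete, here is a minimal sketch of how a per-entity mask could be built with Pillow. It assumes a mask is an RGB image that is white inside the entity's region and black elsewhere; that convention, the helper name `rect_mask`, the prompts, and the box coordinates are illustrative assumptions, not something this model card specifies.

```python
from PIL import Image, ImageDraw


def rect_mask(box, size=(1328, 1328)):
    """Return an RGB mask: white inside `box` (left, top, right, bottom), black elsewhere.

    Assumption: white marks the region an entity should occupy.
    """
    mask = Image.new("RGB", size, (0, 0, 0))
    ImageDraw.Draw(mask).rectangle(box, fill=(255, 255, 255))
    return mask


# Hypothetical entities: one prompt and one rectangular region each.
entity_prompts = ["A red apple", "A green pear"]
boxes = [(100, 400, 600, 900), (700, 400, 1200, 900)]
masks = [rect_mask(box) for box in boxes]
```

Rectangles are only the simplest case. Since the model controls shape as well as position, a free-form mask drawn in any image editor should serve the same role.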

	## Results Demonstration

	|Entity Control Condition|Generated Image|
	|---|---|
	|![poster_region](./assets/samples/poster_region.png)|![poster](./assets/samples/poster.png)|
	|![eligen_example_1_mask](./assets/samples/eligen_example_1_mask.png)|![eligen_example_1](./assets/samples/eligen_example_1.png)|
	|![eligen_example_2_mask](./assets/samples/eligen_example_2_mask.png)|![eligen_example_2](./assets/samples/eligen_example_2.png)|
	|![eligen_example_3_mask](./assets/samples/eligen_example_3_mask.png)|![eligen_example_3](./assets/samples/eligen_example_3.png)|
	|![eligen_example_4_mask](./assets/samples/eligen_example_4_mask.png)|![eligen_example_4](./assets/samples/eligen_example_4.png)|
	|![eligen_example_5_mask](./assets/samples/eligen_example_5_mask.png)|![eligen_example_5](./assets/samples/eligen_example_5.png)|

	## Inference Code
	```shell
	git clone https://github.com/modelscope/DiffSynth-Studio.git
	cd DiffSynth-Studio
	pip install -e .
	```

	```python
	from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
	from modelscope import dataset_snapshot_download, snapshot_download
	import torch
	from PIL import Image
	```

	```python
	pipe = QwenImagePipeline.from_pretrained(
	    torch_dtype=torch.bfloat16,
	    device="cuda",
	    model_configs=[
	        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
	        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
	        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
	    ],
	    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
	)
	snapshot_download("DiffSynth-Studio/Qwen-Image-EliGen", local_dir="models/DiffSynth-Studio/Qwen-Image-EliGen", allow_file_pattern="model.safetensors")
	pipe.load_lora(pipe.dit, "models/DiffSynth-Studio/Qwen-Image-EliGen/model.safetensors")

	global_prompt = "Poster for the Qwen-Image-EliGen Magic Café, featuring two magical coffees—one emitting flames and the other emitting ice spikes—against a light blue misty background, with text reading 'Qwen-Image-EliGen Magic Café' and 'New Arrival'"
	entity_prompts = [
	    "A red magic coffee with flames rising from the cup",
	    "A red magic coffee surrounded by ice spikes",
	    "Text: 'New Arrival'",
	    "Text: 'Qwen-Image-EliGen Magic Café'",
	]

	# Download the example masks (0.png, 1.png, ...), one per entity prompt.
	dataset_snapshot_download(dataset_id="DiffSynth-Studio/examples_in_diffsynth", local_dir="./", allow_file_pattern="data/examples/eligen/qwen-image/example_6/*.png")
	# Load the masks in the same order as entity_prompts and resize them to 1328x1328.
	masks = [
	    Image.open(f"./data/examples/eligen/qwen-image/example_6/{i}.png").convert("RGB").resize((1328, 1328))
	    for i in range(len(entity_prompts))
	]

	image = pipe(
	    prompt=global_prompt,
	    seed=0,
	    eligen_entity_prompts=entity_prompts,
	    eligen_entity_masks=masks,
	)
	image.save("image.jpg")
	```
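The inference code pairs prompts and masks by list position, so a mismatch between the two lists is easy to miss. The sketch below is a small sanity check that could be run before calling the pipeline. The helper `check_eligen_inputs` is hypothetical and not part of DiffSynth-Studio, and the 1328×1328 default simply mirrors the resize used in the example; whether the pipeline itself requires that size is an assumption.

```python
from PIL import Image


def check_eligen_inputs(entity_prompts, entity_masks, size=(1328, 1328)):
    """Raise ValueError if prompts and masks do not pair up or a mask has an unexpected size."""
    if len(entity_prompts) != len(entity_masks):
        raise ValueError(
            f"{len(entity_prompts)} entity prompts but {len(entity_masks)} masks"
        )
    for i, mask in enumerate(entity_masks):
        if mask.size != size:
            raise ValueError(f"mask {i} has size {mask.size}, expected {size}")


# Usage with placeholder data (hypothetical prompts and blank masks).
demo_prompts = ["A red apple", "A green pear"]
demo_masks = [Image.new("RGB", (1328, 1328)) for _ in demo_prompts]
check_eligen_inputs(demo_prompts, demo_masks)  # passes silently
```

With the variables from the inference code, the call would be `check_eligen_inputs(entity_prompts, masks)`.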

	## Citation
	If you find our work helpful, please consider citing our research:
	```
	@article{zhang2025eligen,
	  title={EliGen: Entity-level Controlled Image Generation with Regional Attention},
	  author={Zhang, Hong and Duan, Zhongjie and Wang, Xingjun and Chen, Yingda and Zhang, Yu},
	  journal={arXiv preprint arXiv:2501.01097},
	  year={2025}
	}
	```