AMD Nitro-E

Introduction

Nitro-E is a family of text-to-image diffusion models focused on highly efficient training. With just 304M parameters, Nitro-E is designed to be resource-friendly for both training and inference. For training, it only takes 1.5 days on a single node with 8 AMD Instinct™ MI300X GPUs. On the inference side, Nitro-E delivers a throughput of 18.8 samples per second (batch size 32, 512px images) a single AMD Instinct MI300X GPU. The distilled version can further increase the throughput to 39.3 samples per second. The release consists of:

Nitro-E-512px: a EMMDiT-based 20-steps model train from scratch.
Nitro-E-512px-dist: a EMMDiT-based model distilled from Nitro-E-512px.
Nitro-E-512px-GRPO: a post-training model fine-tuned from Nitro-E-512px using Group Relative Policy Optimization (GRPO) strategy.

⚡️ Open-source code! ⚡️ technical blog!

Details

Model architecture: We propose Efficient Multimodal Diffusion Transformer (E-MMDiT), an efficient and lightweight multimodal diffusion model with only 304M parameters for fast image synthesis requiring low training resources. Our design philosophy centers on token reduction as the computational cost scales significantly with the token count. We adopt a highly compressive visual tokenizer to produce a more compact representation and propose a novel multi-path compression module for further compression of tokens. To enhance our design, we introduce Position Reinforcement, which strengthens positional information to maintain spatial coherence, and Alternating Subregion Attention (ASA), which performs attention within subregions to further reduce computational cost. In addition, we propose AdaLN-affine, an efficient lightweight module for computing modulation parameters in transformer blocks. See our technical blog post for more details.
Dataset: Our models were trained on a dataset of ~25M images consisting of both real and synthetic data sources that are openly available on the internet. We make use of the following datasets for training: Segment-Anything-1B, JourneyDB, DiffusionDB and DataComp as prompt of the generated data.
Training cost: The Nitro-E-512px model requires only 1.5 days of training from scratch on a single node with 8 AMD Instinct™ MI300X GPUs.

Quickstart

Image generation with 20 steps:

import torch
from core.tools.inference_pipe import init_pipe

device = torch.device('cuda:0')
dtype = torch.bfloat16
repo_name = "amd/Nitro-E"

resolution = 512
ckpt_name = 'Nitro-E-512px.safetensors'

# for 1024px model
# resolution = 1024
# ckpt_name = 'Nitro-E-1024px.safetensors'

use_grpo = True

if use_grpo: 
    pipe = init_pipe(device, dtype, resolution, repo_name=repo_name, ckpt_name=ckpt_name, ckpt_path_grpo='ckpt_grpo_512px')
else:
    pipe = init_pipe(device, dtype, resolution, repo_name=repo_name, ckpt_name=ckpt_name)
prompt = 'A hot air balloon in the shape of a heart grand canyon'
images = pipe(prompt=prompt, width=resolution, height=resolution, num_inference_steps=20, guidance_scale=4.5).images

Image generation with 4 steps:

import torch
from core.tools.inference_pipe import init_pipe

device = torch.device('cuda:0')
dtype = torch.bfloat16
resolution = 512
repo_name = "amd/Nitro-E"
ckpt_name = 'Nitro-E-512px-dist.safetensors'

pipe = init_pipe(device, dtype, resolution, repo_name=repo_name, ckpt_name=ckpt_name)
prompt = 'A hot air balloon in the shape of a heart grand canyon'

images = pipe(prompt=prompt, width=resolution, height=resolution, num_inference_steps=4, guidance_scale=0).images

License

This project is licensed under the MIT License.

Downloads last month: 334

Collection including amd/Nitro-E

Nitro Diffusion 💥

Collection

Nitro Diffusion is a series of efficient text-to-image diffusion models built on AMD Instinct™ GPUs. • 5 items • Updated about 3 hours ago • 4