---
license: mit
pipeline_tag: text-to-image
library_name: diffusers
tags:
- diffusion
- multi-expert
- dit
- laion
- distributed
- decentralized
- flow-matching
---

# Paris: A Decentralized Trained Open-Weight Diffusion Model
The world's first open-weight diffusion model trained entirely through decentralized computation. Paris consists of 8 expert diffusion models (129M-605M parameters each) trained in complete isolation, with no gradient, parameter, or intermediate-activation synchronization, and achieves higher parallelism efficiency than traditional distributed training while using 14× less data and 16× less compute than prior decentralized baselines. Read our technical report to learn more.
## Key Characteristics
- 8 independently trained expert diffusion models (605M parameters each, 4.84B total)
- No gradient synchronization, parameter sharing, or activation exchange among nodes during training
- Lightweight transformer router (~129M parameters) for dynamic expert selection
- 11M LAION-Aesthetic images across 120 A40 GPU-days
- 14× less training data than prior decentralized baselines
- 16× less compute than prior decentralized baselines
- Competitive generation quality (FID 12.45 with DiT-XL/2 experts)
- Open weights for research and commercial use under MIT license
## Examples

Text-conditioned image generation samples using Paris across diverse prompts and visual styles

## Architecture Details
| Component | Specification |
|---|---|
| Model Scale | DiT-XL/2 |
| Parameters per Expert | 605M |
| Total Expert Parameters | 4.84B (8 experts) |
| Router Parameters | ~129M |
| Hidden Dimensions | 1152 |
| Transformer Layers | 28 |
| Attention Heads | 16 |
| Patch Size | 2×2 (latent space) |
| Latent Resolution | 32×32×4 |
| Image Resolution | 256×256 |
| Text Conditioning | CLIP ViT-L/14 |
| VAE | sd-vae-ft-mse (8× downsampling) |
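For quick reference, the table above can be captured in a small Python dataclass. The `ParisExpertConfig` name and its fields below are illustrative placeholders, not the repository's actual configuration API.

```python
from dataclasses import dataclass

# Illustrative record of the DiT-XL/2 expert specification listed above.
@dataclass(frozen=True)
class ParisExpertConfig:
    model_scale: str = "DiT-XL/2"
    hidden_dim: int = 1152              # transformer width
    num_layers: int = 28                # transformer depth
    num_heads: int = 16                 # attention heads per layer
    patch_size: int = 2                 # 2x2 patches over the latent grid
    latent_shape: tuple = (4, 32, 32)   # sd-vae-ft-mse latent: 4 channels, 32x32
    image_resolution: int = 256         # pixel resolution (8x VAE downsampling)
    text_encoder: str = "CLIP ViT-L/14"

cfg = ParisExpertConfig()
# A 32x32 latent split into 2x2 patches yields (32 / 2)**2 = 256 tokens per image.
tokens_per_image = (cfg.latent_shape[1] // cfg.patch_size) ** 2
print(tokens_per_image)  # 256
```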
## Training Approach
Paris implements fully decentralized training where:
- Each expert trains independently on a semantically coherent data partition (DINOv2-based clustering)
- No gradient synchronization, parameter sharing, or activation exchange between experts during training
- Experts trained asynchronously across AWS, GCP, local clusters, and Runpod instances at different speeds
- Router trained post-hoc on full dataset for expert selection during inference
- Complete computational independence eliminates requirements for specialized interconnects (InfiniBand, NVLink)
Paris training phase showing complete asynchronous isolation across heterogeneous compute clusters. Unlike traditional parallelization strategies (Data/Pipeline/Model Parallelism), Paris requires zero communication during training.
This zero-communication approach enables training on fragmented compute resources without specialized interconnects, eliminating the dedicated GPU cluster requirement of traditional diffusion model training.
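To make the data-partitioning step concrete, here is a minimal sketch, assuming DINOv2 features and k-means clustering into 8 groups, of how a dataset could be split into the semantically coherent shards each expert trains on. The hub model name, the random stand-in batch, and the overall flow are illustrative assumptions rather than the released training code.

```python
import torch
from sklearn.cluster import KMeans

device = "cuda" if torch.cuda.is_available() else "cpu"

# DINOv2 backbone for semantic features (ViT-B/14 used here as an example).
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").to(device).eval()

# Stand-in for a batch of preprocessed LAION-Aesthetic images (N, 3, 224, 224).
image_batch = torch.randn(64, 3, 224, 224)

with torch.no_grad():
    features = dinov2(image_batch.to(device)).cpu().numpy()  # (N, 768) embeddings

# Partition the embeddings into 8 clusters; expert k trains only on cluster k,
# with no gradient or parameter exchange with the other experts.
cluster_ids = KMeans(n_clusters=8, random_state=0).fit_predict(features)
print(cluster_ids[:10])
```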
## Comparison with Traditional Parallelization
| Strategy | Synchronization | Straggler Impact | Topology Requirements |
|---|---|---|---|
| Data Parallel | Periodic all-reduce | Slowest worker blocks iteration | Latency-sensitive cluster |
| Model Parallel | Sequential layer transfers | Slowest layer blocks pipeline | Linear pipeline |
| Pipeline Parallel | Stage-to-stage per microbatch | Bubble overhead from slowest stage | Linear pipeline |
| Paris | No synchronization | No blocking | Arbitrary |
## Routing Strategies

- `top-1` (default): Single best expert per step. Fastest inference, competitive quality.
- `top-2`: Weighted ensemble of top-2 experts. Often best quality, 2× inference cost.
- `full-ensemble`: All 8 experts weighted by router. Highest compute (8× cost).
Multi-expert inference pipeline showing router-based expert selection and three different routing strategies: Top-1 (fastest), Top-2 (best quality), and Full Ensemble (highest compute).
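The three strategies can be sketched as follows for a single sample per denoising step; `router`, `experts`, and their call signatures are placeholders for illustration, not the actual inference code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def routed_noise_pred(x_t, t, text_emb, router, experts, mode="top-1"):
    """Combine expert noise predictions for one sample according to the routing mode."""
    logits = router(x_t, t, text_emb)            # (8,) one score per expert
    probs = F.softmax(logits, dim=-1)

    if mode == "top-1":                          # single best expert, 1x cost
        return experts[int(probs.argmax())](x_t, t, text_emb)

    k = 2 if mode == "top-2" else len(experts)   # "full-ensemble" uses all 8
    weights, indices = probs.topk(k)
    weights = weights / weights.sum()            # renormalize over selected experts
    return sum(w * experts[int(i)](x_t, t, text_emb)
               for w, i in zip(weights, indices))
```

Renormalizing the selected probabilities keeps the ensemble weights summing to one, which is what makes top-2 a weighted ensemble rather than a simple average.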
## Performance Metrics

### Multi-Expert vs. Monolithic on LAION-Art (DiT-B/2)
| Inference Strategy | FID-50K ↓ |
|---|---|
| Monolithic (single model) | 29.64 |
| Paris Top-1 | 30.60 |
| Paris Top-2 | 22.60 |
| Paris Full Ensemble | 47.89 |
Top-2 routing achieves a 7.04-point FID improvement over the monolithic baseline, validating that targeted expert collaboration outperforms both single models and naive ensemble averaging.
## Training Details

### Hyperparameters (DiT-XL/2)
| Parameter | Value |
|---|---|
| Dataset | LAION-Aesthetic (11M images) |
| Clustering | DINOv2 semantic features |
| Batch Size | 16 per expert (effective 32 with 2-step accumulation) |
| Learning Rate | 2e-5 (AdamW, no scheduling) |
| Training Steps | ~120k total across experts (asynchronous) |
| EMA Decay | 0.9999 |
| Mixed Precision | FP16 with automatic loss scaling |
| Conditioning | AdaLN-Single (23% parameter reduction) |
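A minimal sketch of this per-expert optimization setup, assuming placeholder `expert`, `loader`, and `diffusion_loss` objects, might look like the following. It mirrors the table (AdamW at 2e-5 with no schedule, 2-step accumulation, FP16 with loss scaling, EMA decay 0.9999) but is not the released training script.

```python
import torch

optimizer = torch.optim.AdamW(expert.parameters(), lr=2e-5)   # no LR schedule
scaler = torch.cuda.amp.GradScaler()                          # automatic loss scaling
ema = torch.optim.swa_utils.AveragedModel(
    expert, avg_fn=lambda ema_p, p, n: 0.9999 * ema_p + (1 - 0.9999) * p)
accum_steps = 2                                               # batch 16 -> effective 32

for step, (latents, text_emb) in enumerate(loader):
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = diffusion_loss(expert, latents, text_emb) / accum_steps
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
        ema.update_parameters(expert)                         # EMA of expert weights
```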
### Router Training
| Parameter | Value |
|---|---|
| Architecture | DiT-B (smaller than experts) |
| Batch Size | 64 with 4-step accumulation (effective 256) |
| Learning Rate | 5e-5 with cosine annealing (25 epochs) |
| Loss | Cross-entropy on cluster assignments |
| Training | Post-hoc on full dataset |
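As a sketch of this post-hoc stage, assuming placeholder `router` and `loader` objects, the router is trained as a classifier with cross-entropy against each image's cluster assignment, using the settings in the table above:

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(router.parameters(), lr=5e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=25)
accum_steps = 4                                    # batch 64 -> effective 256

for epoch in range(25):
    for step, (latents, t, text_emb, cluster_label) in enumerate(loader):
        logits = router(latents, t, text_emb)      # (B, 8) scores over experts
        loss = F.cross_entropy(logits, cluster_label) / accum_steps
        loss.backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
    scheduler.step()                               # cosine annealing, stepped per epoch
```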
## Citation
@misc{jiang2025paris,
title={Paris: A Decentralized Trained Open-Weight Diffusion Model},
author={Jiang, Zhiying and Seraj, Raihan and Villagra, Marcos and Roy, Bidhan},
year={2025},
eprint={2510.03434},
archivePrefix={arXiv},
primaryClass={cs.GR},
url={https://arxiv.org/abs/2510.03434}
}
## License

MIT License – Open for research and commercial use.


