---
license: mit
pipeline_tag: text-to-image
library_name: diffusers
tags:
- diffusion
- multi-expert
- dit
- laion
- distributed
- decentralized
- flow-matching
---

# Paris: A Decentralized Trained Open-Weight Diffusion Model
The world's first open-weight diffusion model trained entirely through decentralized computation. Paris consists of 8 expert diffusion models (129M-605M parameters each) trained in complete isolation, with no gradient, parameter, or intermediate-activation synchronization, and achieves higher parallelism efficiency than traditional distributed training while using 14× less data and 16× less compute than prior decentralized baselines. Read our technical report to learn more.
## Key Characteristics
- 8 independently trained expert diffusion models (605M parameters each, 4.84B total)
- No gradient synchronization, parameter sharing, or activation exchange among nodes during training
- Lightweight transformer router (~129M parameters) for dynamic expert selection
- 11M LAION-Aesthetic images across 120 A40 GPU-days
- 14× less training data than prior decentralized baselines
- 16× less compute than prior decentralized baselines
- Competitive generation quality (FID 12.45 with DiT-XL/2 experts)
- Open weights for research and commercial use under MIT license
## Examples

Text-conditioned image generation samples using Paris across diverse prompts and visual styles

## Architecture Details
| Component | Specification |
|---|---|
| Model Scale | DiT-XL/2 |
| Parameters per Expert | 605M |
| Total Expert Parameters | 4.84B (8 experts) |
| Router Parameters | ~129M |
| Hidden Dimensions | 1152 |
| Transformer Layers | 28 |
| Attention Heads | 16 |
| Patch Size | 2×2 (latent space) |
| Latent Resolution | 32×32×4 |
| Image Resolution | 256×256 |
| Text Conditioning | CLIP ViT-L/14 |
| VAE | sd-vae-ft-mse (8× downsampling) |
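For quick reference, the table above can be captured in a small Python dataclass. The `ParisExpertConfig` name and its fields below are illustrative placeholders, not the repository's actual configuration API.

```python
from dataclasses import dataclass

# Illustrative record of the DiT-XL/2 expert specification listed above.
@dataclass(frozen=True)
class ParisExpertConfig:
    model_scale: str = "DiT-XL/2"
    hidden_dim: int = 1152              # transformer width
    num_layers: int = 28                # transformer depth
    num_heads: int = 16                 # attention heads per layer
    patch_size: int = 2                 # 2x2 patches over the latent grid
    latent_shape: tuple = (4, 32, 32)   # sd-vae-ft-mse latent: 4 channels, 32x32
    image_resolution: int = 256         # pixel resolution (8x VAE downsampling)
    text_encoder: str = "CLIP ViT-L/14"

cfg = ParisExpertConfig()
# A 32x32 latent split into 2x2 patches yields (32 / 2)**2 = 256 tokens per image.
tokens_per_image = (cfg.latent_shape[1] // cfg.patch_size) ** 2
print(tokens_per_image)  # 256
```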
## Training Approach
Paris implements fully decentralized training where:
- Each expert trains independently on a semantically coherent data partition (DINOv2-based clustering)
- No gradient synchronization, parameter sharing, or activation exchange between experts during training
- Experts trained asynchronously across AWS, GCP, local clusters, and Runpod instances at different speeds
- Router trained post-hoc on full dataset for expert selection during inference
- Complete computational independence eliminates requirements for specialized interconnects (InfiniBand, NVLink)
Paris training phase showing complete asynchronous isolation across heterogeneous compute clusters. Unlike traditional parallelization strategies (Data/Pipeline/Model Parallelism), Paris requires zero communication during training.
This zero-communication approach enables training on fragmented compute resources without specialized interconnects, eliminating the dedicated GPU cluster requirement of traditional diffusion model training.
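To make the data-partitioning step concrete, here is a minimal sketch, assuming DINOv2 features and k-means clustering into 8 groups, of how a dataset could be split into the semantically coherent shards each expert trains on. The hub model name, the random stand-in batch, and the overall flow are illustrative assumptions rather than the released training code.

```python
import torch
from sklearn.cluster import KMeans

device = "cuda" if torch.cuda.is_available() else "cpu"

# DINOv2 backbone for semantic features (ViT-B/14 used here as an example).
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").to(device).eval()

# Stand-in for a batch of preprocessed LAION-Aesthetic images (N, 3, 224, 224).
image_batch = torch.randn(64, 3, 224, 224)

with torch.no_grad():
    features = dinov2(image_batch.to(device)).cpu().numpy()  # (N, 768) embeddings

# Partition the embeddings into 8 clusters; expert k trains only on cluster k,
# with no gradient or parameter exchange with the other experts.
cluster_ids = KMeans(n_clusters=8, random_state=0).fit_predict(features)
print(cluster_ids[:10])
```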
## Comparison with Traditional Parallelization
| Strategy | Synchronization | Straggler Impact | Topology Requirements |
|---|---|---|---|
| Data Parallel | Periodic all-reduce | Slowest worker blocks iteration | Latency-sensitive cluster |
| Model Parallel | Sequential layer transfers | Slowest layer blocks pipeline | Linear pipeline |
| Pipeline Parallel | Stage-to-stage per microbatch | Bubble overhead from slowest stage | Linear pipeline |
| Paris | No synchronization | No blocking | Arbitrary |
## Routing Strategies

- `top-1` (default): Single best expert per step. Fastest inference, competitive quality.
- `top-2`: Weighted ensemble of top-2 experts. Often best quality, 2× inference cost.
- `full-ensemble`: All 8 experts weighted by router. Highest compute (8× cost).
Multi-expert inference pipeline showing router-based expert selection and three different routing strategies: Top-1 (fastest), Top-2 (best quality), and Full Ensemble (highest compute).
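The three strategies can be sketched as follows for a single sample per denoising step; `router`, `experts`, and their call signatures are placeholders for illustration, not the actual inference code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def routed_noise_pred(x_t, t, text_emb, router, experts, mode="top-1"):
    """Combine expert noise predictions for one sample according to the routing mode."""
    logits = router(x_t, t, text_emb)            # (8,) one score per expert
    probs = F.softmax(logits, dim=-1)

    if mode == "top-1":                          # single best expert, 1x cost
        return experts[int(probs.argmax())](x_t, t, text_emb)

    k = 2 if mode == "top-2" else len(experts)   # "full-ensemble" uses all 8
    weights, indices = probs.topk(k)
    weights = weights / weights.sum()            # renormalize over selected experts
    return sum(w * experts[int(i)](x_t, t, text_emb)
               for w, i in zip(weights, indices))
```

Renormalizing the selected probabilities keeps the ensemble weights summing to one, which is what makes top-2 a weighted ensemble rather than a simple average.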
## Performance Metrics

### Multi-Expert vs. Monolithic on LAION-Art (DiT-B/2)
| Inference Strategy | FID-50K ↓ |
|---|---|
| Monolithic (single model) | 29.64 |
| Paris Top-1 | 30.60 |
| Paris Top-2 | 22.60 |
| Paris Full Ensemble | 47.89 |
Top-2 routing achieves a 7.04-point FID improvement over the monolithic baseline, validating that targeted expert collaboration outperforms both single models and naive ensemble averaging.
## Training Details

### Hyperparameters (DiT-XL/2)
| Parameter | Value |
|---|---|
| Dataset | LAION-Aesthetic (11M images) |
| Clustering | DINOv2 semantic features |
| Batch Size | 16 per expert (effective 32 with 2-step accumulation) |
| Learning Rate | 2e-5 (AdamW, no scheduling) |
| Training Steps | ~120k total across experts (asynchronous) |
| EMA Decay | 0.9999 |
| Mixed Precision | FP16 with automatic loss scaling |
| Conditioning | AdaLN-Single (23% parameter reduction) |
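A minimal sketch of this per-expert optimization setup, assuming placeholder `expert`, `loader`, and `diffusion_loss` objects, might look like the following. It mirrors the table (AdamW at 2e-5 with no schedule, 2-step accumulation, FP16 with loss scaling, EMA decay 0.9999) but is not the released training script.

```python
import torch

optimizer = torch.optim.AdamW(expert.parameters(), lr=2e-5)   # no LR schedule
scaler = torch.cuda.amp.GradScaler()                          # automatic loss scaling
ema = torch.optim.swa_utils.AveragedModel(
    expert, avg_fn=lambda ema_p, p, n: 0.9999 * ema_p + (1 - 0.9999) * p)
accum_steps = 2                                               # batch 16 -> effective 32

for step, (latents, text_emb) in enumerate(loader):
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = diffusion_loss(expert, latents, text_emb) / accum_steps
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
        ema.update_parameters(expert)                         # EMA of expert weights
```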
### Router Training
| Parameter | Value |
|---|---|
| Architecture | DiT-B (smaller than experts) |
| Batch Size | 64 with 4-step accumulation (effective 256) |
| Learning Rate | 5e-5 with cosine annealing (25 epochs) |
| Loss | Cross-entropy on cluster assignments |
| Training | Post-hoc on full dataset |
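As a sketch of this post-hoc stage, assuming placeholder `router` and `loader` objects, the router is trained as a classifier with cross-entropy against each image's cluster assignment, using the settings in the table above:

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(router.parameters(), lr=5e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=25)
accum_steps = 4                                    # batch 64 -> effective 256

for epoch in range(25):
    for step, (latents, t, text_emb, cluster_label) in enumerate(loader):
        logits = router(latents, t, text_emb)      # (B, 8) scores over experts
        loss = F.cross_entropy(logits, cluster_label) / accum_steps
        loss.backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
    scheduler.step()                               # cosine annealing, stepped per epoch
```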
## Citation
@misc{jiang2025paris,
title={Paris: A Decentralized Trained Open-Weight Diffusion Model},
author={Jiang, Zhiying and Seraj, Raihan and Villagra, Marcos and Roy, Bidhan},
year={2025},
eprint={2510.03434},
archivePrefix={arXiv},
primaryClass={cs.GR},
url={https://arxiv.org/abs/2510.03434}
}
## License

MIT License – Open for research and commercial use.


