Bidhan Roy committed
Commit 0b94e29 · 1 Parent(s): 73872ef

Add README updates and images with Git LFS
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -6,108 +6,198 @@ tags:
  - multi-expert
  - dit
  - laion
  ---
  
- # Paris
- 
- A multi-expert diffusion model trained with dynamic expert routing on LAION-Aesthetic.
- 
- ## Model Description
- 
- This model uses **8 specialized DiT experts** with a learned router that dynamically selects the best expert for each generation based on the noisy latent and timestep.
- 
- - **Architecture**: dit-XL/2 with 8 experts
- - **Router**: dit-based routing network
- - **Hidden Size**: 1152
- - **Layers**: 28
- - **Attention Heads**: 16
- - **Parameters per Expert**: ~0M
- - **Total Parameters**: ~3M
- - **Text Conditioning**: ✓ (CLIP ViT-L/14)
- - **Training Dataset**: LAION-Aesthetic
- 
- ## Model Structure
- 
- ```
- Paris/
- ├── config.json               # High-level model configuration
- ├── model_index.json          # Pipeline component index
- ├── expert_0/                 # Specialized expert models
- │   ├── config.json
- │   └── diffusion_pytorch_model.safetensors
- ├── expert_1/ ... expert_7/
- ├── router/                   # Dynamic routing network
- │   ├── config.json
- │   └── pytorch_model.safetensors
- ├── vae/                      # VAE (sd-vae-ft-mse)
- ├── text_encoder/             # CLIP text encoder
- ├── tokenizer/                # CLIP tokenizer
- └── inference_pipeline.py     # Custom inference code
- ```
- 
- ## Usage
- 
  ```python
- from inference_pipeline import DDMPipeline
  
  # Load the pipeline
- pipeline = DDMPipeline.from_pretrained("paris")
  
  # Generate images
  images = pipeline(
      prompt="A beautiful sunset over Paris, oil painting style",
      num_inference_steps=50,
      guidance_scale=7.5,
-     num_images=4
- )
- 
- # Save images
- for i, img in enumerate(images):
-     img.save(f"output_{i}.png")
  ```
  
- ## Training Details
- 
- - **Base Model**: DiT-XL/2 pretrained on ImageNet
- - **Batch Size**: 16 per expert
- - **Learning Rate**: 2e-05
- - **Image Size**: 256x256 (32x32 latent space)
- - **VAE**: SD VAE (8x downsampling)
- - **Text Encoder**: CLIP ViT-L/14
- - **EMA**: True
- - **Mixed Precision**: True
- 
- ### Multi-Expert Architecture
- 
- Each expert specializes in different visual styles/content through dynamic routing:
- - The router network analyzes the noisy latent and timestep
- - Selects the most appropriate expert for denoising
- - Enables better quality and diversity compared to single models
- 
- ## Examples
- 
- Coming soon! Check back for generated examples.
- 
- ## Limitations
- 
- - Trained on LAION-Aesthetic which may contain biases
- - Best results at 256x256 resolution
- - Requires GPU for inference (8GB+ VRAM recommended)
- 
- ## Citation
- 
  ```bibtex
- @misc{paris,
-   author = {Your Name},
-   title = {Paris: Multi-Expert Diffusion Model},
-   year = {2024},
-   publisher = {HuggingFace},
-   url = {https://huggingface.co/paris}
  }
  ```
  
- ## License
- 
- MIT License
  - multi-expert
  - dit
  - laion
+ - distributed
+ - decentralized
+ - flow-matching
  ---
  
+ <div align="center">
+
+ <img src="bagel_labs_logo.png" alt="Bagel Labs" width="120"/>
+
+ # Paris: A Decentralized Trained Open-Weight Diffusion Model
+
+ <a href="https://huggingface.co/bageldotcom/paris">
+   <img src="https://img.shields.io/badge/%F0%9F%A4%97%20Like%20this-model-yellow?style=for-the-badge" alt="Like on Hugging Face">
+ </a>
+ <a href="https://github.com/bageldotcom/paris">
+   <img src="https://img.shields.io/github/stars/bageldotcom/paris?style=for-the-badge&logo=github&label=Star%20on%20GitHub" alt="Star on GitHub">
+ </a>
+ <a href="https://github.com/bageldotcom/Paris/blob/main/paper.pdf">
+   <img src="https://img.shields.io/badge/📄%20Read-Technical%20Report-red?style=for-the-badge" alt="Read Technical Report">
+ </a>
+
+ </div>
+
+ <br>
+
+ Paris is the world's first diffusion model trained entirely through decentralized computation. It consists of 8 expert diffusion models (129M-605M parameters each) trained in complete isolation, with no gradient, parameter, or intermediate-activation synchronization, achieving higher parallelism efficiency than traditional distributed training while using 14× less data and 16× less compute than prior baselines. [Read our technical report](https://github.com/bageldotcom/Paris/blob/main/paper.pdf) to learn more.
+
+ # Key Characteristics
+
+ - 8 independently trained expert diffusion models (605M parameters each, 4.84B total)
+ - No gradient synchronization, parameter sharing, or activation exchange among nodes during training
+ - Lightweight transformer router (~158M parameters) for dynamic expert selection
+ - 11M LAION-Aesthetic images across 120 A40 GPU-days
+ - 14× less training data than prior decentralized baselines
+ - 16× less compute than prior decentralized baselines
+ - Competitive generation quality (FID 12.45)
+ - Open weights for research and commercial use under the MIT license
+
+ ---
+
+ # Examples
+
+ ![Paris Generation Examples](generated_images.png)
+
+ *Text-conditioned image generation samples from Paris across diverse prompts and visual styles*
+
+ ---
+
+ # Architecture Details
+
+ | Component | Specification |
+ |-----------|---------------|
+ | **Model Scale** | DiT-XL/2 |
+ | **Parameters per Expert** | 605M |
+ | **Total Expert Parameters** | 4.84B (8 experts) |
+ | **Router Parameters** | ~158M |
+ | **Hidden Dimension** | 1152 |
+ | **Transformer Layers** | 28 |
+ | **Attention Heads** | 16 |
+ | **Patch Size** | 2×2 (latent space) |
+ | **Latent Resolution** | 32×32×4 |
+ | **Image Resolution** | 256×256 |
+ | **Text Conditioning** | CLIP ViT-L/14 |
+ | **VAE** | sd-vae-ft-mse (8× downsampling) |
+
+ ---
+
+ # Training Approach
+
+ Paris implements fully decentralized training in which:
+
+ - Each expert trains independently on a semantically coherent data partition (DINOv2-based clustering)
+ - No gradients, parameters, or activations are exchanged between experts during training
+ - Experts train asynchronously, at different speeds, across AWS, GCP, local clusters, and Runpod instances
+ - The router is trained post hoc on the full dataset to select experts at inference
+ - Complete computational independence eliminates the need for specialized interconnects (InfiniBand, NVLink)
+
+ ![Training Architecture](training_architecture.png)
+
+ *The Paris training phase, showing complete asynchronous isolation across heterogeneous compute clusters. Unlike traditional parallelization strategies (data, pipeline, or model parallelism), Paris requires zero communication during training.*
+
+ This zero-communication approach enables training on fragmented compute resources without specialized interconnects, eliminating the dedicated GPU cluster requirement of traditional diffusion model training.
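+
+ To make the isolation concrete, here is a minimal sketch of a single expert's training loop, assuming a DiT-style `model(noisy_latents, timesteps, text_emb)` interface and a dataloader restricted to that expert's DINOv2 cluster (both names are illustrative, not the repo's actual API, and a standard denoising objective is shown for concreteness). Nothing in the loop ever communicates with another expert.
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def train_expert(model, cluster_dataloader, num_steps, device="cuda"):
+     """Train one expert fully in isolation: no all-reduce, no parameter
+     exchange, no shared state with the other seven experts."""
+     model.to(device).train()
+     optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
+     scaler = torch.cuda.amp.GradScaler()              # FP16 with loss scaling
+     # Cosine noise schedule (illustrative; any standard schedule works)
+     alpha_bar = torch.cos(torch.linspace(0, 1, 1000) * torch.pi / 2).to(device) ** 2
+     step = 0
+     while step < num_steps:
+         for latents, text_emb in cluster_dataloader:  # this expert's partition only
+             latents, text_emb = latents.to(device), text_emb.to(device)
+             t = torch.randint(0, 1000, (latents.size(0),), device=device)
+             noise = torch.randn_like(latents)
+             a = alpha_bar[t][:, None, None, None]
+             noisy = a.sqrt() * latents + (1 - a).sqrt() * noise
+             with torch.cuda.amp.autocast():
+                 loss = F.mse_loss(model(noisy, t, text_emb), noise)
+             optimizer.zero_grad()
+             scaler.scale(loss).backward()
+             scaler.step(optimizer)                    # purely local update
+             scaler.update()
+             step += 1
+             if step >= num_steps:
+                 break
+     return model  # published as an expert_<i>/ checkpoint when done
+ ```
+
+ Because each such loop is self-contained, a straggling expert delays only itself; the others finish and publish their checkpoints independently.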
  
+ **Comparison with Traditional Parallelization**
+
+ | **Strategy** | **Synchronization** | **Straggler Impact** | **Topology Requirements** |
+ |--------------|---------------------|----------------------|---------------------------|
+ | Data Parallel | Periodic all-reduce | Slowest worker blocks iteration | Latency-sensitive cluster |
+ | Model Parallel | Sequential layer transfers | Slowest layer blocks pipeline | Linear pipeline |
+ | Pipeline Parallel | Stage-to-stage per microbatch | Bubble overhead from slowest stage | Linear pipeline |
+ | **Paris** | **No synchronization** | **No blocking** | **Arbitrary** |
+
+ ---
+
+ # Usage
+
  ```python
+ from diffusers import DiffusionPipeline
+ import torch
  
  # Load the pipeline
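+ # NOTE (assumption, not stated on the card): if the repo ships a custom
+ # multi-expert pipeline class, this call may also need trust_remote_code=True
+ # or the custom_pipeline argument, depending on your diffusers version.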
+ pipeline = DiffusionPipeline.from_pretrained(
+     "bageldotcom/paris",
+     torch_dtype=torch.float16,
+     use_safetensors=True
+ )
+ pipeline.to("cuda")
  
  # Generate images
  images = pipeline(
      prompt="A beautiful sunset over Paris, oil painting style",
      num_inference_steps=50,
      guidance_scale=7.5,
+     height=256,
+     width=256
+ ).images
  
+ images[0].save("output.png")
  ```
  
+ ### Routing Strategies
+
+ Paris supports three inference-time routing strategies (sketched below):
+
+ - **`top-1`** (default): Single best expert per step. Fastest inference, competitive quality.
+ - **`top-2`**: Weighted ensemble of the top-2 experts. Often the best quality, at 2× inference cost.
+ - **`full-ensemble`**: All 8 experts weighted by the router. Highest compute (8× cost).
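+
+ A minimal sketch of how the three strategies combine expert predictions at a single denoising step (batch size 1 for clarity; the `router` and `experts` call signatures are assumptions, not the shipped API):
+
+ ```python
+ import torch
+
+ def route_and_denoise(router, experts, noisy, t, text_emb, strategy="top-1"):
+     """One denoising step under the three routing strategies."""
+     weights = torch.softmax(router(noisy, t, text_emb), dim=-1)  # (1, 8)
+     if strategy == "top-1":            # single best expert
+         best = int(weights.argmax(dim=-1))
+         return experts[best](noisy, t, text_emb)
+     if strategy == "top-2":            # renormalized weighted pair
+         w, idx = weights.topk(2, dim=-1)
+         w = w / w.sum(dim=-1, keepdim=True)
+         outs = [experts[int(i)](noisy, t, text_emb) for i in idx[0]]
+         return w[0, 0] * outs[0] + w[0, 1] * outs[1]
+     # full-ensemble: every expert runs, weighted by the router
+     outs = torch.stack([e(noisy, t, text_emb) for e in experts])  # (8, 1, C, H, W)
+     return (weights[0][:, None, None, None, None] * outs).sum(dim=0)
+ ```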
+
+ ---
+
+ # Performance Metrics
+
+ **Multi-Expert vs. Monolithic on LAION-Art (DiT-B/2)**
+
+ | **Inference Strategy** | **FID-50K ↓** |
+ |------------------------|---------------|
+ | Monolithic (single model) | 29.64 |
+ | Paris Top-1 | 30.60 |
+ | **Paris Top-2** | **22.60** |
+ | Paris Full Ensemble | 47.89 |
+
+ *Top-2 routing achieves a 7.04-point FID improvement over the monolithic baseline, validating that targeted expert collaboration outperforms both single models and naive ensemble averaging.*
+
+ ---
+
+ # Training Details
+
+ **Hyperparameters (DiT-XL/2)**
+
+ | **Parameter** | **Value** |
+ |---------------|-----------|
+ | Dataset | LAION-Aesthetic (11M images) |
+ | Clustering | DINOv2 semantic features |
+ | Batch Size | 16 per expert (effective 32 with 2-step accumulation) |
+ | Learning Rate | 2e-5 (AdamW, no scheduling) |
+ | Training Steps | ~120k total across experts (asynchronous) |
+ | EMA Decay | 0.9999 |
+ | Mixed Precision | FP16 with automatic loss scaling |
+ | Initialization | ImageNet-pretrained DiT-XL/2 |
+ | Conditioning | AdaLN-Single (23% parameter reduction) |
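+
+ The data partition behind the "Clustering" row can be reproduced with ordinary k-means over frozen DINOv2 features. The card does not specify the exact clustering algorithm, so the k-means choice and all names below are assumptions:
+
+ ```python
+ import torch
+
+ @torch.no_grad()
+ def partition_dataset(dino, images, k=8, iters=50):
+     """Cluster DINOv2 embeddings into k groups, one per expert."""
+     feats = torch.nn.functional.normalize(dino(images), dim=-1)  # (N, D)
+     centroids = feats[torch.randperm(len(feats))[:k]].clone()
+     for _ in range(iters):
+         assign = torch.cdist(feats, centroids).argmin(dim=-1)    # (N,)
+         for j in range(k):
+             if (assign == j).any():
+                 centroids[j] = feats[assign == j].mean(dim=0)
+     return assign  # assign[i] = expert that trains on image i
+ ```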
+
+ **Router Training**
+
+ | **Parameter** | **Value** |
+ |---------------|-----------|
+ | Architecture | DiT-B (smaller than the experts) |
+ | Batch Size | 64 with 4-step accumulation (effective 256) |
+ | Learning Rate | 5e-5 with cosine annealing (25 epochs) |
+ | Loss | Cross-entropy on cluster assignments |
+ | Training | Post-hoc on full dataset |
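+
+ In code, the router objective is plain classification: predict each training example's DINOv2 cluster ID from the same noisy-latent, timestep, and text inputs the experts see. A minimal sketch, with `router` and `full_loader` as illustrative names and gradient accumulation omitted for brevity:
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def train_router(router, full_loader, epochs=25):
+     """Post-hoc router training: classify which cluster (expert) fits each example."""
+     optimizer = torch.optim.AdamW(router.parameters(), lr=5e-5)
+     scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
+     for _ in range(epochs):
+         for noisy, t, text_emb, cluster_id in full_loader:  # full dataset
+             logits = router(noisy, t, text_emb)             # (batch, 8)
+             loss = F.cross_entropy(logits, cluster_id)      # match the data partition
+             loss.backward()
+             optimizer.step()
+             optimizer.zero_grad()
+         scheduler.step()                                    # cosine annealing per epoch
+     return router
+ ```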
+
+ ---
+
+ # Citation
  
  ```bibtex
+ @misc{paris2025,
+   title={Paris: A Decentralized Trained Open-Weight Diffusion Model},
+   author={Jiang, Zhiying and Seraj, Raihan and Villagra, Marcos and Roy, Bidhan},
+   year={2025},
+   publisher={Bagel Labs},
+   url={https://huggingface.co/bageldotcom/paris}
  }
  ```
  
+ ---
+
+ # License
+
+ MIT License – Open for research and commercial use.
+
+ <div align="center">
+
+ Made with ❤️ by [Bagel Labs](https://bagel.com)
+
+ </div>
bagel_labs_logo.png ADDED

Git LFS Details

  • SHA256: 05c05aac89f9eab593cda699e6e35ccc6461ef9c7c58316ae36533f545309042
  • Pointer size: 130 Bytes
  • Size of remote file: 90.6 kB
generated_images.png ADDED

Git LFS Details

  • SHA256: d1ccbd5a554519c8c3e0356afc2b14637cd43aaee3d5b96877ce1b2fc8856c93
  • Pointer size: 132 Bytes
  • Size of remote file: 2.46 MB
training_architecture.png ADDED

Git LFS Details

  • SHA256: a8ff5f814d3ae1688ba31b2cee77135c0401d46574f480745d17856fb3da467f
  • Pointer size: 131 Bytes
  • Size of remote file: 462 kB