Update readme.

README.md
@@ -6,58 +6,63 @@ This project provides tools for emulating and visualizing pipeline parallelism strategies.

Pipeline parallelism is a technique used to train large models by partitioning the model across multiple devices and processing data in a pipelined fashion. This project allows you to:

- Simulate different pipeline parallelism strategies (1F1B, Interleaved, Zero-Bubble, etc.)
- Visualize the execution schedule on multiple devices
- Compare different strategies for efficiency

## Features

- **Supported Pipeline Strategies**:
  - 1F1B (One-Forward-One-Backward)
  - Interleaved 1F1B
  - Zero-Bubble 1F1B (ZB-1P)
  - 1F1B with computation-communication overlap
  - Interleaved 1F1B with computation-communication overlap

- **Visualization**:
  - Interactive visualization dashboard using Plotly/Dash

- **Configuration**:
  - Configurable simulation parameters through Hydra
  - Customizable stage latency and communication costs
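For a sense of what these schedules trade off: with `p` pipeline stages and `m` microbatches, the classic 1F1B schedule idles each device for a bubble fraction of roughly `(p - 1) / (m + p - 1)` (see the Megatron-LM paper in the references below). For the default example settings used throughout this README (`num_stages=4`, `num_batches=8`) that is `3/11 ≈ 27%` of each step; interleaved and zero-bubble schedules exist precisely to shrink this term.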
## Installation

This project uses [uv](https://github.com/astral-sh/uv) for dependency management.

Set up `uv` if it is not already installed on your machine:

```bash
# On macOS and Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
```
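After installing, you can check that `uv` is on your `PATH` and, optionally, resolve the project's dependencies up front; otherwise `uv run` does this automatically on first use. This sketch assumes the repository ships a standard `pyproject.toml`:

```bash
# Confirm uv is installed and on PATH
uv --version

# Optional: create the virtual environment and install dependencies now
uv sync
```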
## Usage

### Running the 1F1B strategy

```bash
uv run python main.py strategy=1f1b num_devices=4 num_stages=4 num_batches=8
```

### Running the interleaved strategy

With `num_stages=8` on 4 devices, each device hosts two virtual stages:

```bash
uv run python main.py strategy=interleave num_devices=4 num_stages=8 num_batches=8
```

### Running the ZB-1P strategy

```bash
uv run python main.py strategy=zb1p num_devices=4 num_stages=4 num_batches=8
```

### Running the 1F1B-batch-overlap strategy

```bash
uv run python main.py strategy=1f1b_overlap num_devices=4 num_stages=4 num_batches=8
```

### Running the 1F1B-interleave-overlap strategy

```bash
uv run python main.py strategy=1f1b_interleave_overlap num_devices=4 num_stages=8 num_batches=8
```
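Since each of these commands is a Hydra run, you can also sweep several strategies in one invocation with Hydra's standard multirun flag. This is a sketch and assumes the emulator behaves sensibly when launched once per configuration:

```bash
# Hydra multirun: one emulation per comma-separated strategy value
uv run python main.py -m strategy=1f1b,zb1p num_devices=4 num_stages=4 num_batches=8
```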
@@ -77,7 +82,7 @@ You can use different configuration files with Hydra in several ways:

```
conf/
├── config.yaml    # Default configuration
└── model_A.yaml   # Create your own config with stage-specific latency for performance projection
```
2. Run with your desired configuration using the `--config-name` flag:
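A sketch of such a run, assuming `model_A.yaml` sits in `conf/` as shown above (`--config-name` is standard Hydra; the override values are the same ones used in the examples earlier):

```bash
# Load conf/model_A.yaml instead of the default conf/config.yaml
uv run python main.py --config-name model_A strategy=1f1b num_devices=4 num_stages=4 num_batches=8
```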
@@ -108,11 +113,12 @@ PP-Emulation/

```
...
└── README.md          # This file
```

## References

1. _PipeDream: Fast and Efficient Pipeline Parallel DNN Training_. [arxiv](https://arxiv.org/abs/1806.03377)
2. _Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM_. [arxiv](https://arxiv.org/abs/2104.04473)
3. _Zero Bubble Pipeline Parallelism_. [arxiv](https://arxiv.org/abs/2401.10241)
4. _Communication-Computation Overlap in MoE Training with 1F1B Pipeline Parallelism_. [blog](https://zhuanlan.zhihu.com/p/28463368206)

## License