# Pipeline of Pre-Training RDT
First, install the prerequisites for RDT (see [README](../README.md#installation)). Then, install the prerequisites for TensorFlow Dataset in a separate Conda environment.
## Installation for TensorFlow Dataset
```bash
# Under the root directory of this repo
conda create -n rdt-data python=3.10
conda activate rdt-data
# Install all the prerequisites
pip install -r requirements_data.txt
# Or you can manually install each package (please refer to requirements_data.txt for specific versions)
pip install tfds-nightly gsutil tensorflow Pillow pyyaml opencv-python tensorflow-graphics imageio[ffmpeg]
# If the download is too slow, you can specify an alternative mirror (tfds-nightly is not available in the Tsinghua mirror)
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple gsutil tensorflow Pillow pyyaml opencv-python tensorflow-graphics imageio[ffmpeg]
```
## Download and Prepare Datasets
Below, we describe how to download each of our pre-training datasets. If you plan to pre-train on a subset of them, just download the ones you need. You can also fine-tune RDT through this pipeline, but only if your target dataset is listed below or available in Google Cloud Storage.
| Dataset | Sample Percentage (%) |
| ---- | ---- |
| RT-1 Dataset | 9.00 |
| TACO Dataset | 1.99 |
| JACO Play Dataset | 1.10 |
| Cable Routing Dataset | 0.27 |
| NYU Door Opening | 0.33 |
| Viola | 0.40 |
| Berkeley UR5 | 1.06 |
| TOTO | 1.06 |
| Kuka | 1.66 |
| Language Table | 3.32 |
| Columbia Cairlab Pusht Real | 0.40 |
| Stanford Kuka Multimodal Dataset | 1.83 |
| Stanford Hydra Dataset | 0.80 |
| Austin Buds Dataset | 0.23 |
| Maniskill Dataset | 5.78 |
| Furniture Bench Dataset | 2.36 |
| UCSD Kitchen Dataset | 0.40 |
| UCSD Pick And Place Dataset | 1.23 |
| Austin Sailor Dataset | 0.50 |
| Austin Sirius Dataset | 0.80 |
| BC Z | 6.91 |
| UTokyo PR2 Opening Fridge | 0.30 |
| UTokyo PR2 Tabletop Manipulation | 0.50 |
| UTokyo Xarm Pick And Place | 0.33 |
| UTokyo Xarm Bimanual | 0.03 |
| Berkeley MVP | 0.73 |
| Berkeley RPT | 1.00 |
| KAIST Nonprehensile | 0.46 |
| Tokyo U LSMO | 0.23 |
| DLR Sara Grid Clamp | 0.03 |
| Robocook | 1.66 |
| Imperialcollege Sawyer Wrist Cam | 0.43 |
| Iamlab CMU Pickup Insert | 0.83 |
| UTAustin Mutex | 1.29 |
| Fanuc Manipulation | 0.66 |
| Play Fusion | 0.80 |
| Droid | 10.06 |
| FMB | 1.39 |
| Dobb·E | 1.20 |
| QUT Dexterous Manipulation | 0.46 |
| Aloha Dataset | 4.98 |
| Mobile Aloha Dataset | 4.98 |
| Roboset | 4.48 |
| RH20T | 10.99 |
| Calvin Dataset | 3.32 |
| Bridgev2 | 7.44 |
Before anything else, link the dataset directory on your disk to a subfolder of this repo:
```bash
ln -s /path/to/dataset /path/to/repo/RoboticsDiffusionTransformer/data/datasets
```
### Open X-Embodiment
Specify the correct path to the `gsutil` of your Conda environment in [this file](../data/openx_embod/download.sh#L72).
Run the following commands to download our selected datasets from the Open X-Embodiment:
```bash
# Under the root directory of this repo
cd data/openx_embod
# Download all datasets
bash download_openx_embod.sh
```
Note: By modifying `download_openx_embod.sh`, you can download any dataset hosted on Google Cloud (as long as it can be downloaded with `gsutil` and is stored in `TFRecord` format), not just the ones we have listed.
### Mobile ALOHA Dataset
Download the Mobile ALOHA Dataset from the [official website](https://mobile-aloha.github.io) to `data/datasets/aloha`, then run:
```bash
cd data/aloha
# Convert the dataset to TFRecord
python hdf5totfrecords.py
```
### Bridgev2
Run:
```bash
cd data/bridgev2
# Download and preprocess the dataset
sh download.sh
```
### Calvin
Run:
```bash
cd data/calvin
# Download and preprocess the dataset
sh download.sh
# Convert the dataset to TFRecord format
python hdf5totfrecords.py
```
### RH20T
Download the RH20T Dataset from the [official website](https://rh20t.github.io/#download) to `data/datasets/rh20t`, then run:
```bash
cd data/rh20t
# Convert the dataset to TFRecord
python hdf5totfrecords.py
```
### RoboSet
Run:
```bash
cd data/roboset
# Download and preprocess the dataset
sh download.sh
```
## If You Want to Train on a New Dataset
If you want to train on a new dataset (e.g., `my_pretrain_dataset`) through this pre-training pipeline, you need to modify several files as follows:
##### 1. `configs/dataset_control_freq.json`
Add the control frequency of your dataset, i.e., a key-value pair mapping the dataset name to its control frequency in Hz (e.g., a hypothetical entry `"my_pretrain_dataset": 25` for a robot controlled at 25 Hz).
##### 2. `data/preprocess_scripts/my_pretrain_dataset.py`
If your dataset can be loaded by `tfds.builder_from_directory()`, you only need to download it into the Open X-Embodiment folder `data/datasets/openx_embod` and implement the `process_step()` function. You may need to specify the tfds loading path at L78 (see [this file](../data/vla_dataset.py#L78)). See `data/preprocess_scripts/droid.py` for an example.
If not, you need to first convert your dataset into TFRecords and then implement both `load_dataset()` and `process_step()`. See `data/agilex/hdf5totfrecords.py` and `data/preprocess_scripts/agilex.py` for examples.
Here are descriptions of these two functions:
##### `load_dataset(seed: int)`
- Returns a dataset that supports iteration and a `repeat` method driven by the given random seed.
- Suggested implementation: use `tf.data.Dataset.from_generator` and `tf.data.TFRecordDataset`.
- Each iteration should yield an iterable sub-dataset representing one episode, with the following structure (a minimal sketch follows this list):
  - `step`: an iterable dataset object containing the frames of one episode.
    - `observation`: a dictionary containing your images.
      - `your_first_image_key`: your observation RGB image keys.
      - ...
    - `other_attribute`: any other relevant attributes.
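Below is a minimal sketch of what `load_dataset()` might look like. The TFRecord path, the per-frame feature spec, and the one-file-per-episode layout are all hypothetical placeholders, not the repo's actual implementation; refer to `data/preprocess_scripts/agilex.py` for a real example.
```python
import random

import tensorflow as tf

# Hypothetical location of your converted TFRecords
TFRECORD_GLOB = "data/datasets/my_pretrain_dataset/tfrecords/*.tfrecord"

def _parse_step(raw_record):
    # Hypothetical per-frame feature spec; replace with your own keys
    return tf.io.parse_single_example(raw_record, {
        "observation/your_first_image_key": tf.io.FixedLenFeature([], tf.string),
        "arm_joint_pos": tf.io.FixedLenFeature([7], tf.float32),
        "terminate": tf.io.FixedLenFeature([], tf.int64),
    })

class _EpisodeDataset:
    """Iterable over episodes; `repeat` reshuffles episodes with the seed."""

    def __init__(self, filenames, seed):
        self._filenames = list(filenames)
        self._rng = random.Random(seed)

    def __iter__(self):
        # Assumes one TFRecord file per episode
        for filename in self._filenames:
            yield {"step": tf.data.TFRecordDataset(filename).map(_parse_step)}

    def repeat(self):
        while True:
            self._rng.shuffle(self._filenames)
            yield from self

def load_dataset(seed: int):
    return _EpisodeDataset(tf.io.gfile.glob(TFRECORD_GLOB), seed)
```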
##### `process_step(step: dict) -> dict`
Processes a single frame and returns a dictionary with the following keys:
- `observation`:
  - `your_first_view_image: tf.Tensor`: your first-view image.
  - `arm_concat: tf.Tensor`: concatenation of physical states.
  - `format: tf.constant(string)`: format of `arm_concat` (e.g., `arm_joint_pos_0,arm_joint_pos_1,arm_joint_pos_2`).
- `action`: the frame's action (leave empty if there is none).
  - `arm_concat`: same as in `observation`.
  - `format`: same as in `observation`.
- `terminate: tf.Tensor`: a Boolean tensor indicating whether the episode ends.
**IMPORTANT**: You should only use TensorFlow functions for any branch or loop operations. For example, use `tf.cond` instead of `if`.
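A minimal sketch, using the same hypothetical key names as the `load_dataset()` sketch above (single arm, seven joint positions); it is not the implementation of any of the listed datasets:
```python
import tensorflow as tf

def process_step(step: dict) -> dict:
    state_format = tf.constant(
        "arm_joint_pos_0,arm_joint_pos_1,arm_joint_pos_2,arm_joint_pos_3,"
        "arm_joint_pos_4,arm_joint_pos_5,arm_joint_pos_6")
    # Use tf.cond (not a Python `if`) for any data-dependent branch
    terminate = tf.cond(
        tf.equal(step["terminate"], 1),
        lambda: tf.constant(True),
        lambda: tf.constant(False))
    return {
        "observation": {
            "your_first_view_image": tf.io.decode_jpeg(
                step["observation/your_first_image_key"]),
            "arm_concat": step["arm_joint_pos"],
            "format": state_format,
        },
        "action": {
            "arm_concat": step["arm_joint_pos"],  # placeholder; put your action here
            "format": state_format,
        },
        "terminate": terminate,
    }
```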
##### 3. `configs/dataset_img_keys.json`
Add the image keys of your dataset. For example:
```json
"my_pretrain_dataset": {
    "image_keys": [
        "exterior-cam",
        "right-wrist-cam",
        "left-wrist-cam",
        "left-wrist-cam"
    ],
    "image_mask": [1, 1, 1, 0]
}
```
- To make TensorFlow happy, you have to specify four images in this order: `exterior-cam, right-wrist-cam, left-wrist-cam, any-cam`. Each key should correspond to an observation-image attribute key in your `step`.
- If you only have a single wrist camera, make it the *right* wrist.
- The `image_mask` indicates whether each image is valid (1) or not (0).
- What if you don't have four images? Simply repeat existing images in the remaining positions and set their masks to 0 (invalid).
- The key order is *strict*. If you don't have the exterior camera but have both wrists, pad the exterior position with a repeated image and use the following:
```json
"my_pretrain_dataset": {
    "image_keys": [
        "right-wrist-cam",
        "right-wrist-cam",
        "left-wrist-cam",
        "left-wrist-cam"
    ],
    "image_mask": [0, 1, 1, 0]
}
```
- During training, only the first *three* cameras will be used.
##### 4. `configs/dataset_stat.json`
Compute the statistics (min, max, mean, and std) for your dataset:
```bash
# Use -h to see the full usage
python -m data.compute_dataset_stat --skip_exist
```
This will update the `dataset_stat.json` file with your dataset's statistics.
##### 5. `data/vla_dataset.py`
- Add your dataset to `DATASET_NAMES_NOOPENX` if it cannot be loaded by `tfds.builder_from_directory()`.
- If your dataset only contains actions but no proprioception (i.e., robot states), add it to `DATASET_NAMES_NO_STATE` in [this file](../data/preprocess.py).
- Normally, we consider the future state to be the action of the current timestep (illustrated below). If you want to use different actions, you need to implement additional functions. See `flatten_episode_agilex()` in [this file](../data/episode_transform.py) and `_generate_json_state_agilex()` in [this file](../data/preprocess.py) for examples. You may also refer to L318 in [this file](../data/preprocess.py) and L128 in [this file](../data/vla_dataset.py) for how to select your dataset and preprocess it differently.
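As a rough illustration of the default convention (a sketch, not the repo's code): given the proprioceptive states of an episode, the action at timestep `t` is simply the state at `t+1`:
```python
import tensorflow as tf

# Hypothetical episode of T=10 states, each of dimension 7
states = tf.random.uniform([10, 7])
actions = states[1:]      # action_t := state_{t+1}
obs_states = states[:-1]  # the last state has no successor and is dropped
```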
## Start Pre-Training
We employ a producer-consumer framework built on TensorFlow Dataset for fast data loading. Since most of the datasets in the Open X-Embodiment are stored as `TFRecord` files, we convert all pre-training datasets into `TFRecord` for storage. During pre-training, the producer process decompresses the data from `TFRecord` files and stores it in a buffer on the hard disk, while the consumer process reads data from the buffer in random order and feeds it to model training. This not only decouples the `TensorFlow` and `PyTorch` environments but also alleviates the training performance loss caused by an in-memory shuffling buffer that is too small.
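Here is a toy sketch of the disk-buffer pattern (illustrative only; the buffer layout and file format are hypothetical, and the actual producer lives in `data/producer.py`):
```python
import os
import pickle
import random

BUF_PATH = "/path/to/buffer"  # corresponds to buf_path in configs/base.yaml

def produce(sample, idx):
    """Producer side: write one decompressed sample into the disk buffer."""
    with open(os.path.join(BUF_PATH, f"{idx}.pkl"), "wb") as f:
        pickle.dump(sample, f)

def consume():
    """Consumer side: read buffered samples back in random order."""
    filenames = os.listdir(BUF_PATH)
    random.shuffle(filenames)  # disk-level shuffling replaces a huge in-memory buffer
    for name in filenames:
        with open(os.path.join(BUF_PATH, name), "rb") as f:
            yield pickle.load(f)
```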
[This file](../configs/base.yaml) includes configurations relevant to the model architecture (number of heads, hidden dimension, and so on) and data processing. You may need to modify `buf_path` (L22) to your actual buffer path. This buffer serves as the disk shuffling buffer for data loading.
Configurations relevant to training are passed through *command-line arguments*. Use `python main.py -h` to see their descriptions. We provide an example pre-training script in [this file](../pretrain.sh) (`pretrain.sh`). You may need to modify some of the parameters in this file, such as `CUTLASS_PATH` and `WANDB_PROJECT`.
You may need to modify the list of pre-training datasets in [this file](../configs/pretrain_datasets.json) and their corresponding sampling weights in [this file](../configs/pretrain_sample_weights.json). If you want to fine-tune RDT through this pipeline, you may need to remove the redundant datasets from the list.
Before starting pre-training, we first launch the data producer process (if you use multiple nodes, run this command on each node):
```bash
# Under the root directory of this repo
conda activate rdt-data
# Use -h to see the full usage
python -m data.producer --fill_up
# Please proceed to the next step AFTER the filling-up process has finished
```
Then, we run the pre-training script:
```bash
source pretrain.sh
```
Note: You can monitor the training process by observing `loss` (through a long-window moving average), `overall_avg_sample_mse`, and the sampling MSE of each dataset in [Wandb](https://wandb.ai/site) or [TensorBoard](https://www.tensorflow.org/tensorboard). We have empirically found that the lower the `overall_avg_sample_mse`, the better the model performs.