ACT: Condiment Handover (s0101_act_condiment_handover)

Action Chunking Transformer model trained on ASGARD condiment handover demonstrations

This is an ACT (Action Chunking Transformer) model trained on 40 episodes of condiment manipulation and handover demonstrations collected on the ASGARD so101_follower robot arm.

Model Details

Model Type

  • Architecture: ACT (Action Chunking with Transformers)
  • Parameters: ~52M
  • Task: Condiment manipulation and handover
  • Checkpoint: Step 1860 (final)

Training Configuration

  • Dataset: asgard-robot/asgard_training_data_condiment
  • Episodes: 40 demonstrations
  • Total Frames: 31,522
  • Robot: ASGARD so101_follower (single-arm 6 DOF)
  • Training Steps: 1,860
  • Logged Epochs: ~120 (as logged; figure affected by a known MetricsTracker accounting bug)

Hardware

  • GPUs: 4x NVIDIA H100 PCIe (80GB VRAM each)
  • Total VRAM: ~320GB
  • Effective Batch Size: 512 (128 per GPU × 4 GPUs)
  • Training Time: ~27 minutes

Hyperparameters

  • Learning Rate: 1e-5
  • Learning Rate (Backbone): 1e-5
  • Weight Decay: 1e-4
  • Batch Size: 128 per GPU
  • Optimizer: AdamW (betas: 0.9, 0.999, eps: 1e-8)
  • Gradient Clipping: 10.0
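
The settings above map onto a standard PyTorch AdamW configuration with a separate parameter group for the vision backbone (both groups use 1e-5 here) plus global-norm gradient clipping. A minimal sketch, not the actual training code; the placeholder module and the name-based parameter split are illustrative assumptions:

import torch

# Placeholder module with a named "backbone" submodule standing in for
# the actual ACT policy.
policy = torch.nn.ModuleDict({
    "backbone": torch.nn.Linear(6, 6),
    "head": torch.nn.Linear(6, 6),
})

backbone_params = [p for n, p in policy.named_parameters() if n.startswith("backbone")]
other_params = [p for n, p in policy.named_parameters() if not n.startswith("backbone")]

optimizer = torch.optim.AdamW(
    [
        {"params": other_params, "lr": 1e-5},    # main learning rate
        {"params": backbone_params, "lr": 1e-5}, # backbone learning rate
    ],
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=1e-4,
)

# Applied before each optimizer step: clip the global gradient norm to 10.0.
torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=10.0)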

ACT-Specific Parameters

  • Chunk Size: 100
  • Action Steps: 100
  • VAE Training: Yes
  • KL Weight: 10.0
  • Dropout: 0.1
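
With the chunk size equal to the number of executed action steps, each forward pass predicts 100 actions that are played back open-loop before the policy is queried again, with no temporal ensembling across overlapping chunks. A minimal sketch of that control pattern; predict_action_chunk is a hypothetical stand-in for one ACT forward pass:

import numpy as np

CHUNK_SIZE = 100      # actions predicted per forward pass
N_ACTION_STEPS = 100  # actions executed before re-querying the policy

def predict_action_chunk(observation):
    # Hypothetical stand-in for one ACT forward pass: returns a
    # (CHUNK_SIZE, 6) array of joint-position targets.
    return np.zeros((CHUNK_SIZE, 6))

def control_loop(get_observation, send_action, total_steps=300):
    # Because N_ACTION_STEPS == CHUNK_SIZE, each chunk is executed
    # in full before the policy sees a new observation.
    executed = 0
    while executed < total_steps:
        chunk = predict_action_chunk(get_observation())
        for action in chunk[:N_ACTION_STEPS]:
            send_action(action)
            executed += 1
            if executed >= total_steps:
                break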

Architecture Details

  • Vision Backbone: ResNet-18 (pretrained on ImageNet)
  • Hidden Dimension: 512
  • Feedforward Dimension: 3,200
  • Attention Heads: 8
  • Encoder Layers: 4
  • Decoder Layers: 1
  • VAE Encoder Layers: 4
  • Latent Dimension: 32
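
For reference, the values above expressed as a LeRobot ACTConfig. The import path varies between LeRobot versions, so treat this as a sketch rather than a drop-in snippet:

from lerobot.common.policies.act.configuration_act import ACTConfig

config = ACTConfig(
    vision_backbone="resnet18",
    pretrained_backbone_weights="ResNet18_Weights.IMAGENET1K_V1",
    dim_model=512,
    dim_feedforward=3200,
    n_heads=8,
    n_encoder_layers=4,
    n_decoder_layers=1,
    n_vae_encoder_layers=4,
    latent_dim=32,
    chunk_size=100,
    n_action_steps=100,
    use_vae=True,
    kl_weight=10.0,
    dropout=0.1,
)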

Training Results

  • Initial Loss: 12.852
  • Final Loss: 0.262 (at step 1860)
  • Loss Reduction: 98%
  • Training Speed: ~0.64 steps/second
  • Memory Usage: ~40-50GB per GPU

Model Files

  • model.safetensors: 198MB (model weights)
  • config.json: ACT configuration
  • train_config.json: Training configuration
  • policy_preprocessor.json: Input preprocessing
  • policy_postprocessor.json: Output postprocessing
  • Normalizer weights: 7.5KB each
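
Individual files can be fetched from the repo with huggingface_hub; a minimal sketch:

from huggingface_hub import hf_hub_download

# Download a single file from the model repo, e.g. the ACT configuration.
config_path = hf_hub_download(
    repo_id="asgard-robot/s0101-act-condiment-handover",
    filename="config.json",
)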

Intended Use

This model is designed for:

  • Condiment manipulation: Picking up, moving, and handling condiment bottles
  • Handover operations: Coordinated handoff to humans
  • Fine-grained manipulation: Precise gripper control
  • Home assistant applications: Domestic robot tasks

Performance Characteristics

Training Metrics

  • Smooth loss convergence
  • No overfitting observed
  • Stable gradient norms (~12-28)
  • Consistent learning throughout training
  • Faster convergence than the potato task

Expected Performance

  • High success rate on similar condiment manipulation tasks
  • Smooth action generation
  • Proper force control during handover
  • Robust to slight variations in setup

Limitations

  • Limited demonstrations: Trained on 40 episodes only
  • Overfitting risk: May not generalize to drastically different scenarios
  • Camera dependency: Requires similar camera setup (wrist + external)
  • Robot-specific: Designed for ASGARD so101_follower robot
  • Task-specific: Optimized for condiment handover task

Usage Example

import torch
from lerobot.common.policies.act.modeling_act import ACTPolicy

# Load the trained model (the import path may differ across LeRobot versions)
policy = ACTPolicy.from_pretrained("asgard-robot/s0101-act-condiment-handover")
policy.to("cuda")
policy.eval()

# Run inference
# observation should be a batched dict containing:
# - Images from the wrist and external cameras
# - Current joint positions
action = policy.select_action(observation)
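
A sketch of what the observation batch might look like, assuming LeRobot's observation.state / observation.images.<camera> key convention. The camera key names ("wrist", "external") and the 480x640 resolution are assumptions and must match the training dataset:

import torch

observation = {
    # 6 joint positions, batched: shape (1, 6)
    "observation.state": torch.zeros(1, 6, device="cuda"),
    # RGB frames, batched, channel-first, in [0, 1]: shape (1, 3, H, W)
    "observation.images.wrist": torch.zeros(1, 3, 480, 640, device="cuda"),
    "observation.images.external": torch.zeros(1, 3, 480, 640, device="cuda"),
}

action = policy.select_action(observation)  # (1, 6) joint-position target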

Dataset

  • Source: ASGARD teleoperation demonstrations
  • Format: LeRobot v3.0
  • Cameras: Dual RGB (wrist + external)
  • Control: 6 DOF joint positions
  • Frequency: 30 FPS
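
To inspect the data, the dataset can be loaded with LeRobot's dataset class; a minimal sketch (import path and attribute names vary by LeRobot version):

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("asgard-robot/asgard_training_data_condiment")
print(dataset.num_episodes, dataset.num_frames)  # expect 40 episodes, 31,522 frames
frame = dataset[0]  # dict of camera frames, state, and action for one timestep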

Training Environment

  • Framework: LeRobot (Hugging Face)
  • Branch: ASGARD teleop control
  • Python: 3.10
  • PyTorch: With CUDA support
  • Multi-GPU: Distributed training with Accelerate
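
The multi-GPU setup follows the standard Accelerate pattern: prepare the model, optimizer, and dataloader, then let the library handle gradient synchronization across processes. A self-contained sketch with placeholder model and data (the real run would be started with accelerate launch --num_processes 4 for the 4-GPU case):

import torch
from accelerate import Accelerator

accelerator = Accelerator()  # picks up the multi-GPU setup from accelerate launch

model = torch.nn.Linear(6, 6)  # placeholder for the ACT policy
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
dataloader = torch.utils.data.DataLoader(torch.zeros(512, 6), batch_size=128)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    loss = model(batch).pow(2).mean()  # placeholder loss
    accelerator.backward(loss)         # syncs gradients across GPUs
    optimizer.step()
    optimizer.zero_grad()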

Citation

If you use this model, please cite:

@misc{s0101_act_condiment_handover_2024,
  title={ACT: Condiment Handover for ASGARD Robot},
  author={ASGARD Team},
  year={2024},
  url={https://huggingface.co/asgard-robot/s0101-act-condiment-handover}
}

Model Card Author

ASGARD Robot Team

Acknowledgments

  • Base architecture: ACT (Zhao et al., 2023) - Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
  • Training framework: LeRobot by Hugging Face
  • Hardware: Shadeform H100 Multi-GPU Cluster
  • Dataset: Collected via ASGARD teleoperation system