ACT: Condiment Handover (s0101_act_condiment_handover)

Action Chunking Transformer model trained on ASGARD condiment handover demonstrations

This is an ACT (Action Chunking Transformer) model trained on 40 episodes of condiment manipulation and handover demonstrations collected on the ASGARD so101_follower robot arm.

Model Details

Model Type

  • Architecture: ACT (Action Chunking with Transformers)
  • Parameters: ~52M
  • Task: Condiment manipulation and handover
  • Checkpoint: Step 1860 (final)

Training Configuration

  • Dataset: asgard-robot/asgard_training_data_condiment
  • Episodes: 40 demonstrations
  • Total Frames: 31,522
  • Robot: ASGARD so101_follower (single-arm 6 DOF)
  • Training Steps: 1,860
  • Logged Epochs: ~120 (as logged; figure affected by a known MetricsTracker accounting bug)

Hardware

  • GPUs: 4x NVIDIA H100 PCIe (80GB VRAM each)
  • Total VRAM: ~320GB
  • Effective Batch Size: 512 (128 per GPU × 4 GPUs)
  • Training Time: ~27 minutes

Hyperparameters

  • Learning Rate: 1e-5
  • Learning Rate (Backbone): 1e-5
  • Weight Decay: 1e-4
  • Batch Size: 128 per GPU
  • Optimizer: AdamW (betas: 0.9, 0.999, eps: 1e-8)
  • Gradient Clipping: 10.0
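
The settings above map onto a standard PyTorch AdamW configuration with a separate parameter group for the vision backbone (both groups use 1e-5 here) plus global-norm gradient clipping. A minimal sketch, not the actual training code; the placeholder module and the name-based parameter split are illustrative assumptions:

import torch

# Placeholder module with a named "backbone" submodule standing in for
# the actual ACT policy.
policy = torch.nn.ModuleDict({
    "backbone": torch.nn.Linear(6, 6),
    "head": torch.nn.Linear(6, 6),
})

backbone_params = [p for n, p in policy.named_parameters() if n.startswith("backbone")]
other_params = [p for n, p in policy.named_parameters() if not n.startswith("backbone")]

optimizer = torch.optim.AdamW(
    [
        {"params": other_params, "lr": 1e-5},    # main learning rate
        {"params": backbone_params, "lr": 1e-5}, # backbone learning rate
    ],
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=1e-4,
)

# Applied before each optimizer step: clip the global gradient norm to 10.0.
torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=10.0)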

ACT-Specific Parameters

  • Chunk Size: 100
  • Action Steps: 100
  • VAE Training: Yes
  • KL Weight: 10.0
  • Dropout: 0.1
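
With the chunk size equal to the number of executed action steps, each forward pass predicts 100 actions that are played back open-loop before the policy is queried again, with no temporal ensembling across overlapping chunks. A minimal sketch of that control pattern; predict_action_chunk is a hypothetical stand-in for one ACT forward pass:

import numpy as np

CHUNK_SIZE = 100      # actions predicted per forward pass
N_ACTION_STEPS = 100  # actions executed before re-querying the policy

def predict_action_chunk(observation):
    # Hypothetical stand-in for one ACT forward pass: returns a
    # (CHUNK_SIZE, 6) array of joint-position targets.
    return np.zeros((CHUNK_SIZE, 6))

def control_loop(get_observation, send_action, total_steps=300):
    # Because N_ACTION_STEPS == CHUNK_SIZE, each chunk is executed
    # in full before the policy sees a new observation.
    executed = 0
    while executed < total_steps:
        chunk = predict_action_chunk(get_observation())
        for action in chunk[:N_ACTION_STEPS]:
            send_action(action)
            executed += 1
            if executed >= total_steps:
                break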

Architecture Details

  • Vision Backbone: ResNet-18 (pretrained on ImageNet)
  • Hidden Dimension: 512
  • Feedforward Dimension: 3,200
  • Attention Heads: 8
  • Encoder Layers: 4
  • Decoder Layers: 1
  • VAE Encoder Layers: 4
  • Latent Dimension: 32
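
For reference, the values above expressed as a LeRobot ACTConfig. The import path varies between LeRobot versions, so treat this as a sketch rather than a drop-in snippet:

from lerobot.common.policies.act.configuration_act import ACTConfig

config = ACTConfig(
    vision_backbone="resnet18",
    pretrained_backbone_weights="ResNet18_Weights.IMAGENET1K_V1",
    dim_model=512,
    dim_feedforward=3200,
    n_heads=8,
    n_encoder_layers=4,
    n_decoder_layers=1,
    n_vae_encoder_layers=4,
    latent_dim=32,
    chunk_size=100,
    n_action_steps=100,
    use_vae=True,
    kl_weight=10.0,
    dropout=0.1,
)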

Training Results

  • Initial Loss: 12.852
  • Final Loss: 0.262 (at step 1860)
  • Loss Reduction: 98%
  • Training Speed: ~0.64 steps/second
  • Memory Usage: ~40-50GB per GPU

Model Files

  • model.safetensors: 198MB (model weights)
  • config.json: ACT configuration
  • train_config.json: Training configuration
  • policy_preprocessor.json: Input preprocessing
  • policy_postprocessor.json: Output postprocessing
  • Normalizer weights: 7.5KB each
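
Individual files can be fetched from the repo with huggingface_hub; a minimal sketch:

from huggingface_hub import hf_hub_download

# Download a single file from the model repo, e.g. the ACT configuration.
config_path = hf_hub_download(
    repo_id="asgard-robot/s0101-act-condiment-handover",
    filename="config.json",
)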

Intended Use

This model is designed for:

  • Condiment manipulation: Picking up, moving, and handling condiment bottles
  • Handover operations: Coordinated handoff to humans
  • Fine-grained manipulation: Precise gripper control
  • Home assistant applications: Domestic robot tasks

Performance Characteristics

Training Metrics

  • Smooth loss convergence
  • No overfitting observed
  • Stable gradient norms (~12-28)
  • Consistent learning throughout training
  • Faster convergence than the potato task

Expected Performance

  • High success rate on similar condiment manipulation tasks
  • Smooth action generation
  • Proper force control during handover
  • Robust to slight variations in setup

Limitations

  • Limited demonstrations: Trained on 40 episodes only
  • Overfitting risk: May not generalize to drastically different scenarios
  • Camera dependency: Requires similar camera setup (wrist + external)
  • Robot-specific: Designed for ASGARD so101_follower robot
  • Task-specific: Optimized for condiment handover task

Usage Example

import torch
from lerobot.common.policies.act.modeling_act import ACTPolicy

# Load the trained model (the import path may differ across LeRobot versions)
policy = ACTPolicy.from_pretrained("asgard-robot/s0101-act-condiment-handover")
policy.to("cuda")
policy.eval()

# Run inference
# observation should be a batched dict containing:
# - Images from the wrist and external cameras
# - Current joint positions
action = policy.select_action(observation)
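
A sketch of what the observation batch might look like, assuming LeRobot's observation.state / observation.images.<camera> key convention. The camera key names ("wrist", "external") and the 480x640 resolution are assumptions and must match the training dataset:

import torch

observation = {
    # 6 joint positions, batched: shape (1, 6)
    "observation.state": torch.zeros(1, 6, device="cuda"),
    # RGB frames, batched, channel-first, in [0, 1]: shape (1, 3, H, W)
    "observation.images.wrist": torch.zeros(1, 3, 480, 640, device="cuda"),
    "observation.images.external": torch.zeros(1, 3, 480, 640, device="cuda"),
}

action = policy.select_action(observation)  # (1, 6) joint-position target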

Dataset

  • Source: ASGARD teleoperation demonstrations
  • Format: LeRobot v3.0
  • Cameras: Dual RGB (wrist + external)
  • Control: 6 DOF joint positions
  • Frequency: 30 FPS
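
To inspect the data, the dataset can be loaded with LeRobot's dataset class; a minimal sketch (import path and attribute names vary by LeRobot version):

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("asgard-robot/asgard_training_data_condiment")
print(dataset.num_episodes, dataset.num_frames)  # expect 40 episodes, 31,522 frames
frame = dataset[0]  # dict of camera frames, state, and action for one timestep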

Training Environment

  • Framework: LeRobot (Hugging Face)
  • Branch: ASGARD teleop control
  • Python: 3.10
  • PyTorch: With CUDA support
  • Multi-GPU: Distributed training with Accelerate
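
The multi-GPU setup follows the standard Accelerate pattern: prepare the model, optimizer, and dataloader, then let the library handle gradient synchronization across processes. A self-contained sketch with placeholder model and data (the real run would be started with accelerate launch --num_processes 4 for the 4-GPU case):

import torch
from accelerate import Accelerator

accelerator = Accelerator()  # picks up the multi-GPU setup from accelerate launch

model = torch.nn.Linear(6, 6)  # placeholder for the ACT policy
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
dataloader = torch.utils.data.DataLoader(torch.zeros(512, 6), batch_size=128)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    loss = model(batch).pow(2).mean()  # placeholder loss
    accelerator.backward(loss)         # syncs gradients across GPUs
    optimizer.step()
    optimizer.zero_grad()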

Citation

If you use this model, please cite:

@misc{s0101_act_condiment_handover_2024,
  title={ACT: Condiment Handover for ASGARD Robot},
  author={ASGARD Team},
  year={2024},
  url={https://huggingface.co/asgard-robot/s0101-act-condiment-handover}
}

Model Card Author

ASGARD Robot Team

Acknowledgments

  • Base architecture: ACT (Zhao et al., 2023) - Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
  • Training framework: LeRobot by Hugging Face
  • Hardware: Shadeform H100 Multi-GPU Cluster
  • Dataset: Collected via ASGARD teleoperation system