μ0: A Scalable 3D Interaction-Trace World Model

*Equal contribution Equal advising
¹University of Maryland, College Park ²Seoul National University

Teaser Video (with Audio Narration 🔊)

TL;DR: μ0 is a trace world model that predicts 3D interaction traces instead of pixels or low-level actions, enabling transfer across embodiments from video-only pretraining.

Why 3D Interaction Traces?

Pixel-space video models burn capacity on appearance, while action-labeled VLAs are bound to specific embodiments. μ0 occupies the middle ground: it predicts 3D traces of semantic interaction points — objects, tools, hands, and contact regions — that compactly describe what must move, regardless of the robot used.
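To make the idea of a trace concrete, here is a minimal sketch of how one tracked interaction point could be stored as a sequence of 3D positions. The `InteractionTrace` class, its fields, and the toy coordinates are hypothetical illustrations; the page does not specify μ0's actual trace representation.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class InteractionTrace:
    """Hypothetical container: one semantic interaction point tracked over time."""
    label: str                # illustrative tag, e.g. "hand", "object", "tool"
    positions: List[Point3D]  # one (x, y, z) position per time step

    def displacement(self) -> Point3D:
        """Net 3D motion from the first to the last time step."""
        (x0, y0, z0), (x1, y1, z1) = self.positions[0], self.positions[-1]
        return (x1 - x0, y1 - y0, z1 - z0)


# Toy example (made-up numbers): an object point lifted along z over three steps.
cup = InteractionTrace(
    label="object",
    positions=[(0.4, 0.0, 0.0), (0.4, 0.0, 0.05), (0.4, 0.0, 0.10)],
)
lift = cup.displacement()
```

Nothing in this structure refers to joints or gripper commands, which is the sense in which a trace describes what must move independently of the robot.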

Overview

Results

1. Trace Prediction Quality

1.1 Baseline Comparison

Qualitative comparison of predicted traces from μ0 against trace-prediction baselines.

[Figure panels: Input (original input frame) | Ground Truth (ground-truth trace) | Prediction (baseline prediction)]

1.2 Interactive 3D Visualization

Interactive 3D trace predictions from μ0. Drag to orbit, scroll to zoom. The input frame is shown in the top-left; predicted trajectories are colored by time (purple → red). Pick a sample below.
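The time-based coloring can be approximated with a simple linear blend. This is only a sketch: the viewer's real colormap is not documented here, so the RGB endpoints, the `time_color` helper, and the choice of 32 steps are assumptions for illustration.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]

# Assumed endpoint colors; the viewer's exact palette is unknown.
PURPLE: RGB = (128, 0, 128)
RED: RGB = (255, 0, 0)


def time_color(step: int, num_steps: int) -> RGB:
    """Linearly blend purple -> red for step in [0, num_steps - 1]."""
    t = step / (num_steps - 1) if num_steps > 1 else 0.0
    r, g, b = (round(p + t * (q - p)) for p, q in zip(PURPLE, RED))
    return (r, g, b)


# One color per time step of a trajectory (32 steps is an illustrative value).
colors: List[RGB] = [time_color(s, 32) for s in range(32)]
```

Early points along a trajectory come out purple and late points red, so the direction of motion can be read from a static rendering.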


2. Simulation (RoboCasa365)

μ0 + action expert reaches 30.25% average success across 8 RoboCasa365 tasks, outperforming π0 by 5.0 points and TraceGen + action expert by 7.25 points, despite relying on video-only pretraining.
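The stated margins imply the baselines' average success rates. The short calculation below derives them from the three numbers quoted above. The derived values are inferred from those margins and not read from the results table, and the variable names are illustrative.

```python
# Figures quoted in the text (average success over 8 tasks, in percent).
mu0_avg = 30.25
margin_over_pi0 = 5.0        # percentage points
margin_over_tracegen = 7.25  # percentage points

# Implied baseline averages (derived, not reported directly in the text).
pi0_avg = mu0_avg - margin_over_pi0
tracegen_avg = mu0_avg - margin_over_tracegen

print(f"pi0 (implied):      {pi0_avg:.2f}%")
print(f"TraceGen (implied): {tracegen_avg:.2f}%")
```

Under these figures the implied averages are 25.25% for π0 and 23.00% for TraceGen + action expert.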

Simulation results in RoboCasa365. Success rates (%) on 8 representative RoboCasa365 tasks.



3. Real-World Evaluation

Real-world experimental setup and task visualizations. The setup includes a UR3 robot arm with a two-finger gripper and the three real-world manipulation tasks used for evaluation.

Sample Execution Videos by Task

Real-robot rollouts, sped up 5×.

VLM + action expert

π0

π0.5

TraceGen + action expert

Ours (μ0) + action expert

Real-world evaluation results. Bar charts show average success rates (%) for three in-distribution UR3 manipulation tasks. Pick & Place and Pour are averaged over multiple objects.