μ₀: A Scalable 3D Interaction-Trace World Model

Seungjae Lee¹*, Yoonkyo Jung¹*, Jusuk Lee², Jonghun Shin², Amir Hossein Shahidzadeh¹, Yao-Chih Lee¹, H. Jin Kim², Jia-Bin Huang¹†, Furong Huang¹†

*Equal contribution, †Equal advising

¹University of Maryland, College Park ²Seoul National University

Teaser Video (with Audio Narration 🔊)

TL;DR: μ₀ is a trace world model that predicts 3D interaction traces instead of pixels or low-level actions, enabling transfer across embodiments from video-only pretraining.

Why 3D Interaction Traces?

Pixel-space video models burn capacity on appearance, while action-labeled VLAs are bound to specific embodiments. μ₀ occupies the middle ground: it predicts 3D traces of semantic interaction points (objects, tools, hands, and contact regions) that compactly describe what must move, regardless of the robot used.
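As a rough illustration of the representation described above, a 3D interaction trace can be thought of as a set of named keypoints, each carrying a short sequence of 3D positions. The sketch below is hypothetical: the `InteractionTrace` class, its fields, the movement threshold, and the example coordinates are all assumptions made for illustration and are not taken from the μ₀ implementation.

```python
from dataclasses import dataclass
from math import dist
from typing import Dict, List, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class InteractionTrace:
    """Hypothetical container: keypoint name -> 3D positions over time (meters)."""
    points: Dict[str, List[Point3D]]

    def horizon(self) -> int:
        # Number of time steps; all keypoints are assumed to share it.
        lengths = {len(path) for path in self.points.values()}
        if len(lengths) != 1:
            raise ValueError("all keypoints must have the same number of steps")
        return lengths.pop()

    def path_length(self, name: str) -> float:
        # Total distance travelled by one keypoint along its trace.
        path = self.points[name]
        return sum(dist(a, b) for a, b in zip(path, path[1:]))

    def moving_points(self, threshold: float = 0.01) -> List[str]:
        # Keypoints whose trace is longer than `threshold`: "what must move".
        return [n for n in self.points if self.path_length(n) > threshold]


# Made-up example: a cup handle lifted 10 cm while a table corner stays put.
trace = InteractionTrace(points={
    "cup_handle": [(0.0, 0.0, 0.0), (0.0, 0.0, 0.05), (0.0, 0.0, 0.10)],
    "table_corner": [(0.5, 0.5, 0.0)] * 3,
})
print(trace.horizon(), trace.moving_points())
```

Nothing in this structure refers to joints, grippers, or any other robot-specific quantity, which is the property that makes such a representation a candidate for sharing across embodiments.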

Overview

Results

1. Trace Prediction Quality

1.1 Baseline Comparison

Qualitative comparison of predicted traces from μ₀ against trace-prediction baselines.

[Video grid columns: Task, Baseline, Input, Ground Truth, Prediction]

1.2 Interactive 3D Visualization

Interactive 3D trace predictions from μ₀. The input frame is shown in the top-left; predicted trajectories are colored by time (purple → red).

[Interactive viewer: prediction, ground truth, keypoints, and point cloud layers over 32 time steps]

2. Simulation (RoboCasa365)

μ₀ + action expert reaches 30.25% average success across 8 RoboCasa365 tasks, outperforming π₀ by 5.0 points and TraceGen + action expert by 7.25 points despite being pretrained on video only.

**Simulation results in RoboCasa365.** Success rates (%) on 8 representative RoboCasa365 tasks.
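The baseline averages implied by the margins quoted above can be recovered with a short calculation. This is only a sketch: it assumes each margin is a plain difference between 8-task average success rates, and the variable names are illustrative.

```python
# Figures quoted in the text (percent success, averaged over 8 tasks).
mu0_avg = 30.25
margin_over_pi0 = 5.0
margin_over_tracegen = 7.25

# Assumption: each margin is a plain difference of 8-task averages.
pi0_avg = mu0_avg - margin_over_pi0
tracegen_avg = mu0_avg - margin_over_tracegen

print(f"π₀ average: {pi0_avg:.2f}%")
print(f"TraceGen + action expert average: {tracegen_avg:.2f}%")
```

Under that assumption, π₀ averages 25.25% and TraceGen + action expert averages 23.00%.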

3. Real-World Evaluation

**Real-world experimental setup and task visualizations.** The setup includes a UR3 robot arm with a two-finger gripper and the three real-world manipulation tasks used for evaluation.

Sample Execution Videos by Task

Real-robot rollouts, sped up 5×.

[Video grid columns: Task, VLM + action expert, π₀, π₀.₅, TraceGen + action expert, Ours (μ₀) + action expert]

**Real-world evaluation results.** Bar charts show average success rates (%) for three in-distribution UR3 manipulation tasks. Pick & Place and Pour are averaged over multiple objects.

Concurrent Work
(Additional works to be added)

MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction.
Predicts goal-conditioned 3D point trajectories and transfers to robot manipulation and video generation.

How it relates

Our work shares the view that 3D trajectories are a useful intermediate representation, but focuses on trace-space world modeling for actionable robot control through an Action Expert.

UMA: Unified Motion-Action Modeling for Heterogeneous Robot Learning.
Uses 3D object motion as a shared interface to jointly model visuomotor control and dynamics under a masked generative objective, learning from action-free video, real, and simulated robot data.

How it relates

Like ours, UMA treats embodiment-agnostic 3D object motion as the bridge between action-free video and robot control. Our work focuses on video-only pretraining of a trace-space world model, decoding actionable control through a dedicated Action Expert.

LUCID: Learning Embodiment-Agnostic Intent Models from Unstructured Human Videos for Scalable Dexterous Robot Skill Acquisition.
Learns short-horizon manipulation intent from human video as object flow and palm-pose references, executed by embodiment-specific controllers trained in simulation.

How it relates

Like ours, LUCID bridges action-free human video and robot control through an embodiment-agnostic 3D motion interface. LUCID hand-designs an explicit flow-and-palm-pose interface between a separately trained intent predictor and a simulation-RL controller, whereas our work learns a unified trace-space world model from video-only pretraining and decodes control through an Action Expert.

BibTeX

@article{lee2026mu0,
  title={$\mu_0$: A Scalable 3D Interaction-Trace World Model},
  author={Lee, Seungjae and Jung, Yoonkyo and Lee, Jusuk and Shin, Jonghun and Shahidzadeh, Amir Hossein and Lee, Yao-Chih and Kim, H. Jin and Huang, Jia-Bin and Huang, Furong},
  journal={arXiv preprint arXiv:2606.13769},
  year={2026}
}
