I'm a hierarchical VLA model — accepted at ICLR 2025 · NVIDIA & University of Washington

I am the model that moves the machines.

I'm not a monolithic vision-language-action (VLA) model. I split the problem: a high-level vision-language model plans the path, and a compact 3D policy executes it. I learn from cheap off-domain data and generalize to semantics, appearances, and geometries no one showed me.

+20% over OpenVLA across seven generalization axes. 72.6% best success rate.



+20%

Over OpenVLA

7

Generalization axes

Off-domain

No on-robot data needed to train the VLM

01 — The Thesis

The gap scales with the model. So must the architecture.

Every leap in foundation model capability — GPT, CLIP, VILA — is a leap in what robots could do. The same semantic understanding that captions your image can plan a robot arm's motion path. The gap isn't capability. It's architecture.

Monolithic VLA models fine-tune giant vision-language models end-to-end to predict motor actions. They need massive amounts of expensive on-robot teleoperation data and still fail to generalize. HAMSTER (Hierarchical Action Models for Open-World Robot Manipulation) takes a different path: a hierarchy in which the vision-language model plans and a separate low-level policy acts.

Autonomy scales with hierarchy. Hierarchy enables generalization.

02 — The Model

I don't scan. I understand.

I run the full loop a roboticist would: observe, plan, execute. Give me an RGB image and a language instruction, and I plan the path and carry it out with no teleoperator in the loop.

Observe & Plan

VILA-1.5-13B, fine-tuned on off-domain simulation data and action-free video, takes an RGB image and a language instruction and outputs a coarse 2D trajectory of gripper waypoints.
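As a rough illustration of consuming the planner's output, the sketch below assumes the VLM replies with text containing normalized `(x, y)` pairs in [0, 1]. That text format, and the names `parse_waypoints` and `to_pixels`, are hypothetical and not taken from HAMSTER itself. The 640×480 frame size matches the demo log further down.

```python
import re
from typing import List, Tuple

Point = Tuple[float, float]

# Hypothetical reply format: "(0.25, 0.40) (0.31, 0.38) ..." with x, y in [0, 1].
_PAIR = re.compile(r"\(\s*([0-9]*\.?[0-9]+)\s*,\s*([0-9]*\.?[0-9]+)\s*\)")


def parse_waypoints(text: str) -> List[Point]:
    """Extract normalized (x, y) pairs, dropping any outside [0, 1]."""
    points = []
    for x_str, y_str in _PAIR.findall(text):
        x, y = float(x_str), float(y_str)
        if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:
            points.append((x, y))
    return points


def to_pixels(points: List[Point], width: int, height: int) -> List[Tuple[int, int]]:
    """Scale normalized waypoints to integer (col, row) pixel coordinates."""
    return [
        (min(width - 1, int(x * width)), min(height - 1, int(y * height)))
        for x, y in points
    ]


# The last pair is out of range and is discarded by the parser.
reply = "path: (0.25, 0.50) (0.50, 0.25) (1.0, 1.0) (1.7, 0.2)"
waypoints = parse_waypoints(reply)
pixels = to_pixels(waypoints, width=640, height=480)
print(pixels)  # [(160, 240), (320, 120), (639, 479)]
```

Validating and clamping at this boundary keeps a malformed waypoint from reaching the drawing step.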

Bridge

The predicted 2D path is drawn directly onto the raw observation frames. This path-drawn image is the interface — carrying semantic intent without symbolic abstraction.
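Drawing a path onto a frame can be sketched in a few lines. The version below is a minimal, dependency-free illustration that treats a frame as a nested list of RGB tuples and rasterizes each segment with Bresenham's line algorithm. The single flat color and the names `line_pixels` and `draw_path` are assumptions for the example, not a description of how HAMSTER renders its paths.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Frame = List[List[Pixel]]  # frame[row][col] -> (r, g, b)


def line_pixels(x0: int, y0: int, x1: int, y1: int) -> List[Tuple[int, int]]:
    """Bresenham's line algorithm: integer pixels from (x0, y0) to (x1, y1)."""
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    out = []
    while True:
        out.append((x0, y0))
        if x0 == x1 and y0 == y1:
            return out
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy


def draw_path(frame: Frame, waypoints: List[Tuple[int, int]],
              color: Pixel = (255, 0, 0)) -> Frame:
    """Return a copy of `frame` with the waypoint polyline painted in `color`."""
    out = [row[:] for row in frame]
    height, width = len(out), len(out[0])
    for (x0, y0), (x1, y1) in zip(waypoints, waypoints[1:]):
        for x, y in line_pixels(x0, y0, x1, y1):
            if 0 <= x < width and 0 <= y < height:  # clip to the frame
                out[y][x] = color
    return out


# Toy 6x4 black frame with an L-shaped path of (col, row) waypoints.
blank = [[(0, 0, 0)] * 6 for _ in range(4)]
drawn = draw_path(blank, [(0, 0), (3, 0), (3, 3)])
painted = sum(px == (255, 0, 0) for row in drawn for px in row)
print(painted)  # 7
```

Because the result is still an ordinary image, any policy that accepts images can consume it without a new input type.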

Execute

A compact 3D-input policy, trained on minimal in-domain data, receives path-drawn images and produces precise motor actions — correctly grasping objects of different heights from the same 2D path signal.
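Why depth matters here can be seen from plain camera geometry. The sketch below back-projects one 2D waypoint through a pinhole camera model at two different depths. The low-level policy is learned rather than hand-coded like this, so this is only an illustration of the ambiguity that 3D input resolves. The intrinsics (`FX`, `FY`, `CX`, `CY`) and the depth values are made-up numbers for the example.

```python
from typing import Tuple


def backproject(u: float, v: float, depth_m: float,
                fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float, float]:
    """Pinhole model: pixel (u, v) at `depth_m` metres -> camera-frame (X, Y, Z)."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


# Hypothetical intrinsics for a 640x480 camera (not taken from the paper).
FX = FY = 500.0
CX, CY = 320.0, 240.0

pixel = (420.0, 240.0)  # one 2D waypoint shared by both scenes
tall = backproject(*pixel, depth_m=0.50, fx=FX, fy=FY, cx=CX, cy=CY)
short = backproject(*pixel, depth_m=0.75, fx=FX, fy=FY, cx=CX, cy=CY)
print(tall)
print(short)
```

The same pixel lands on two different 3D points, so a policy that sees depth can tell a tall object from a short one even when the drawn path is identical.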

hamster · agent.log

$ hamster run --target "put the coke on Jensen Huang"

› observe ........ RGB frame captured · 640×480
› vlm ............ VILA-13B generating 2D path
› waypoints ..... 7 points predicted · drawn to frame
› policy ........ 3D policy conditioned on path image
› depth ......... object height inferred · 12.4 cm
› result ........ task completed · success ✓
› note .......... novel semantic target · unseen at train
› status ........ resuming · next task queued
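The stages in the log can be read as a two-level control loop. The sketch below wires that loop together from plain callables, with stubs standing in for the camera, the VLM, the drawing step, and the 3D policy. Calling the planner once per episode is an assumption made for the example, and `run_episode` and the `fake_*` helpers are illustrative names, not part of any HAMSTER release.

```python
from typing import Callable, List, Sequence, Tuple

Waypoint = Tuple[int, int]
Action = Tuple[float, ...]


def run_episode(
    observe: Callable[[], object],
    plan_path: Callable[[object, str], List[Waypoint]],
    draw_path: Callable[[object, Sequence[Waypoint]], object],
    act: Callable[[object], Action],
    instruction: str,
    steps: int,
) -> List[Action]:
    """Plan with the high-level model, then let the low-level policy act."""
    first_frame = observe()
    path = plan_path(first_frame, instruction)  # high level: called once (assumed)
    actions = []
    for _ in range(steps):
        frame = observe()
        conditioned = draw_path(frame, path)    # interface: path-drawn image
        actions.append(act(conditioned))        # low level: called every step
    return actions


# Stub components so the sketch runs end to end.
calls = {"plan": 0, "act": 0}


def fake_observe() -> object:
    return "frame"


def fake_plan(frame: object, instruction: str) -> List[Waypoint]:
    calls["plan"] += 1
    return [(10, 10), (20, 15)]


def fake_draw(frame: object, path: Sequence[Waypoint]) -> object:
    return (frame, tuple(path))


def fake_act(conditioned: object) -> Action:
    calls["act"] += 1
    return (0.0, 0.0, 0.0)


actions = run_episode(fake_observe, fake_plan, fake_draw, fake_act,
                      "put the coke on Jensen Huang", steps=3)
print(calls)  # {'plan': 1, 'act': 3}
```

Keeping the two levels behind separate callables means either one can be swapped without touching the other, as long as the path-drawn image stays the shared interface.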

03 — The Results

I outperform every baseline.

Observe

The VLM trains on simulation and action-free video — zero expensive on-robot teleoperation. The domain gap is bridged by the hierarchical split itself.

→

Plan

The VLM generates coarse 2D paths. These paths, drawn onto observation frames, condition the low-level policy — a portable interface across embodiments.

→

Generalize

The 3D policy reasons about depth and object geometry from identical 2D path signals. One hierarchy — semantic, visual, and geometric generalization.

↺ Hierarchy · Off-domain data · Real-world generalization

Benchmark Results

HAMSTER — Our Architecture

Paired with the hierarchy, each 3D policy more than triples its average success rate.

HAMSTER + 3D-DA

72.55% average success rate, up from 16.70% for 3D-DA alone. Best result across all methods and policies.

HAMSTER + RVT2

66.56% average success rate, up from 22.11% for RVT2 standalone.

+50% Relative Gain

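Over the monolithic OpenVLA baseline: +20% is the absolute improvement in average success rate, and +50% is that improvement measured relative to OpenVLA. Hierarchy is not a marginal improvement.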

Baselines — Monolithic & 3D IL

The monolithic and 3D imitation-learning baselines the hierarchy is measured against.

OpenVLA

~28% average success rate. Monolithic VLA trained end-to-end on robot data.

RVT2 (base)

22.11% average success rate. State-of-the-art 3D imitation learning, no VLM.

3D-DA (base)

16.70% average success rate. 3D Diffusion Actor without HAMSTER guidance.
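To make the size of these gaps concrete, the short calculation below derives absolute gains (in percentage points) and relative gains (in percent) from the four average success rates listed above. It uses only those figures. `results` and `gains` are illustrative names.

```python
# Average success rates (%) exactly as listed above.
results = {
    "3D-DA": {"base": 16.70, "with_hamster": 72.55},
    "RVT2": {"base": 22.11, "with_hamster": 66.56},
}


def gains(base: float, improved: float):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = improved - base
    relative = 100.0 * absolute / base
    return absolute, relative


for name, r in results.items():
    absolute, relative = gains(r["base"], r["with_hamster"])
    print(f"{name}: +{absolute:.2f} points, +{relative:.0f}% relative")
# 3D-DA: +55.85 points, +334% relative
# RVT2: +44.45 points, +201% relative
```

Absolute and relative gains answer different questions, so a headline percentage should always say which of the two it is.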


High-Level

VILA-1.5-13B

A 13-billion-parameter vision-language model fine-tuned on off-domain simulation and action-free video to predict 2D gripper trajectories — encoding semantic and spatial intent.

Interface

Path-Drawn Image

The VLM's predicted 2D trajectory drawn directly onto the observation frame. A structured conditioning signal that bridges high-level semantics and low-level motor control.

Low-Level

3D Control Policy

A compact 3D-input policy trained on minimal in-domain data. It reasons about object geometry and depth, grasping objects of different heights from identical 2D path signals.

04 — Benchmarks

7

Generalization axes

+20%

Over OpenVLA

24/7

Closed-loop execution

<24h

In-domain training time

ICLR 2025 · Evaluated on real robots across semantic, visual, geometric, long-horizon, dexterous, viewpoint, and prompt variation axes.

Citation


BibTeX

@misc{li2025hamster,
  title   = {HAMSTER: Hierarchical Action Models
             For Open-World Robot Manipulation},
  author  = {Yi Li and Yuquan Deng and Jesse Zhang
             and Joel Jang and Marius Memmel and Raymond Yu
             and Caelan Reed Garrett and Fabio Ramos
             and Dieter Fox and Anqi Li and Abhishek Gupta
             and Ankit Goyal},
  year    = {2025},
  eprint  = {2502.05485},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
}