We present a vision-language-action policy that achieved first place in the 2025 BEHAVIOR Challenge, a large-scale benchmark of 50 long-horizon household tasks in photo-realistic simulation requiring bimanual manipulation, navigation, and object interaction. Building on the Pi0.5 architecture, we introduce several innovations, including correlated noise for flow matching to improve training efficiency and produce smoother action sequences, as well as learnable mixed-layer attention and System 2 stage tracking for ambiguity resolution. Training uses multi-sample flow matching to reduce variance, while inference applies action compression and heuristics to overcome dataset limitations. Our method achieves a 26% q-score on both the public and private leaderboards, and this report provides a detailed analysis of observed failure modes.
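
Of these ingredients, correlated noise is the easiest to illustrate in isolation. Below is a minimal sketch of how temporally correlated noise can replace the usual i.i.d. Gaussian noise in a flow-matching objective; the AR(1) correlation, the value of rho, and the model interface are illustrative assumptions rather than our exact recipe.

```python
import torch
import torch.nn.functional as F

def correlated_noise(batch, horizon, dim, rho=0.9):
    """Temporally correlated Gaussian noise over an action chunk.

    An AR(1) process along the time axis (rho is an assumed hyperparameter);
    rho = 0 recovers the i.i.d. noise of standard flow matching.
    """
    eps = torch.randn(batch, horizon, dim)
    noise = torch.empty_like(eps)
    noise[:, 0] = eps[:, 0]
    scale = (1.0 - rho ** 2) ** 0.5  # keeps the marginal variance at 1
    for t in range(1, horizon):
        noise[:, t] = rho * noise[:, t - 1] + scale * eps[:, t]
    return noise

def flow_matching_loss(model, obs, actions, rho=0.9):
    """One flow-matching training step with correlated noise.

    actions: (B, H, D) ground-truth action chunk. The linear path and
    velocity target follow the common rectified-flow formulation; only
    the noise sampler differs from the standard setup.
    """
    b, h, d = actions.shape
    noise = correlated_noise(b, h, d, rho).to(actions)
    t = torch.rand(b, 1, 1, device=actions.device)   # flow time in [0, 1]
    x_t = (1.0 - t) * noise + t * actions            # point on the interpolation path
    v_target = actions - noise                       # constant velocity along the path
    v_pred = model(obs, x_t, t.view(b))              # model predicts the velocity
    return F.mse_loss(v_pred, v_target)
```

Smoother noise yields smoother interpolation paths, which matches the smoother action sequences mentioned above; one natural reading of multi-sample flow matching is to average this loss over several noise draws per example to reduce gradient variance.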

Challenge

Training Examples

Each task comes with 200 demonstrations recorded via teleoperation. Video: robot head camera, 8× speed.

The Challenge includes 50 tasks set in multiple house-like environments. The main difficulties are the long task horizons, bimanual manipulation, navigation, and interaction with a wide variety of objects.

Model

VLA Foundation

Our policy is based on the Pi0.5 vision-language-action (VLA) architecture, combining a SigLIP vision encoder with a Gemma LLM backbone.
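
As a rough sketch of how these components fit together (the module names, dimensions, and interface below are illustrative assumptions, not the actual Pi0.5 implementation):

```python
import torch
import torch.nn as nn

class VLAPolicy(nn.Module):
    """Illustrative Pi0.5-style composition: a SigLIP vision encoder and a
    Gemma-style LLM backbone produce a multimodal context, and a small
    action expert denoises an action chunk against that context."""

    def __init__(self, vision_encoder: nn.Module, llm_backbone: nn.Module,
                 action_head: nn.Module, vocab_size: int = 256_000, d_model: int = 2048):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a SigLIP ViT returning (B, N_img, d_model)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.llm_backbone = llm_backbone       # e.g. a Gemma decoder stack
        self.action_head = action_head         # flow-matching action expert

    def forward(self, images, text_tokens, noisy_actions, flow_time):
        img_tokens = self.vision_encoder(images)    # (B, N_img, d_model)
        txt_tokens = self.text_embed(text_tokens)   # (B, N_txt, d_model)
        context = self.llm_backbone(torch.cat([img_tokens, txt_tokens], dim=1))
        # The action expert attends to the fused VLM context and predicts
        # the flow-matching velocity for the noisy action chunk.
        return self.action_head(noisy_actions, flow_time, context)
```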


Results

On the held-out evaluation, our approach achieves a q-score of ~0.26 (including partial successes) with a minimal public–private gap.

Rank  Team                              Affiliation                   Task Success (private)  Q-Score (private)
1     Robot Learning Collective (ours)  Independent                   0.124                   0.260
2     Comet                             NVIDIA Research               0.114                   0.251
3     SimpleAI Robot                    Beijing Simple AI Technology  0.108                   0.159
4     The North Star                    Huawei CRI EAI Team           0.076                   0.120
5     Embodied Intelligence             Independent                   0.052                   0.095

Top 5 teams on the held-out test set (leaderboard)

Per-task results

  • Task success rates vary significantly across tasks. Some tasks are almost solved, failing only under particularly tricky initial conditions, while on others the model was unsuccessful across all 10 trials.
  • Tasks with zero successes do not appear to be fundamentally impossible; they usually contain one tricky step, either a very high-precision manipulation (with a low success rate even for human teleoperators) or a carefully ordered sequence that is slightly beyond the current model's limits.
  • Task duration does not appear to be a fundamental obstacle: longer tasks simply have many more steps, which makes full success harder, but partial success remains very achievable.
Per-task and per-episode scores sorted by task duration. Green = success; light green = partial success; red = failure.

Examples of 100% Successful Episodes

Select an episode to show a 10×-speed clip

Failure Analysis

To analyze why the robot fails, we labeled failures on a subset of tasks (15/50). Select a reason to see the explanation and video examples:


Examples of Failure Episodes

Select an episode to show a 5×-speed clip

Please note that failure-reason labeling is subjective and is intended to give a big-picture view. Refer to all evaluation videos and scores here.

Recovery from cross-task learning

Training on all 50 tasks leads to emergent behaviors. Because the teleoperated demonstrations contain no recovery behavior, single-task models never recover from mistakes; a multi-task model trained for long enough learns to reuse skills from other tasks to recover (e.g., to pick up a dropped item).

Single-task model: no attempt to recover
Multi-task model: non-trivial recovery
Single-task model: no attempt to recover
Multi-task model: non-trivial recovery

Runtime heuristics

Unfortunately, cross-task learning alone was insufficient to overcome the bias toward ideal demonstrations in the data. As a result, the robot often missed a grasp, closed the gripper, and failed to retry, since the training data lacked recovery demonstrations that include reopening the gripper. In practice, however, a fully closed gripper usually indicates that the robot has grasped nothing. By introducing a simple heuristic that automatically opens the gripper in such cases, we observed robust recovery behavior, surprisingly similar to what emerges when recovery is properly represented during data collection.
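
A minimal sketch of such a heuristic is shown below; the threshold, the hold duration, and the controller interface are illustrative assumptions, not the exact values we used.

```python
# Illustrative gripper-reopen heuristic; the constants are placeholders.
CLOSED_THRESHOLD = 0.02   # gripper opening below which we assume an empty grasp
HOLD_STEPS = 10           # consecutive steps the gripper must stay closed

def maybe_reopen_gripper(width_history, send_open_command):
    """Reopen the gripper if it has been (nearly) fully closed for a while.

    A fully closed gripper usually means the grasp missed, since any held
    object would keep the fingers apart; reopening lets the policy retry.
    """
    recent = width_history[-HOLD_STEPS:]
    if len(recent) == HOLD_STEPS and all(w < CLOSED_THRESHOLD for w in recent):
        send_open_command()
```

The key design choice is to intervene only on a state that the training data never showed being corrected, leaving everything else to the learned policy.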

Without heuristics: fails to recover
With heuristics: robust recovery

Summary

Building on the Pi0.5 architecture with correlated-noise flow matching, multi-task training across all 50 tasks, and simple runtime heuristics, our policy took first place in the 2025 BEHAVIOR Challenge with a q-score of 0.26. The failure analysis suggests that the remaining gap is driven less by task length than by a handful of high-precision steps and by the absence of recovery behavior in the demonstrations.

Acknowledgments


This work was made possible by the generous support of Nebius, who provided the high-performance cloud GPU compute resources required to train our models.

We would also like to thank the following people for their help and support: Vladimir Ershov, Justyna Ilczuk, Andrey Mulenkov.

Interested in Robot Learning? Join our Discord to discuss the Challenge results and potential collaborations.