We present a vision-action policy that achieved first place in the 2025 BEHAVIOR Challenge, a large-scale benchmark of 50 long-horizon household tasks in photo-realistic simulation requiring bimanual manipulation, navigation, and object interaction. Building on the Pi0.5 architecture, we introduce several innovations: correlated noise for flow matching, which improves training efficiency and produces smoother action sequences; learnable mixed-layer attention; and System 2 stage tracking for ambiguity resolution. Training uses multi-sample flow matching to reduce gradient variance, while inference applies action compression and heuristics to overcome dataset limitations. Our method achieves a Q-score of 0.26 on both the public and private leaderboards, and this report provides a detailed analysis of observed failure modes.
Challenge
Training Examples
Each task comes with 200 demonstrations recorded via teleoperation. (Video: robot head camera, 8× speed.)
The Challenge includes 50 tasks set in multiple house-like environments. The main difficulties are:
- Long-horizon execution: Tasks run for an average of 6.6 minutes.
- Bimanual manipulation: Coordinated control of two 7-DOF arms with parallel jaw grippers.
- Mobile navigation: Operation in realistic indoor and outdoor environments.
- Non-Markovian states: Many states are visually ambiguous. Without memory of past actions or explicit stage tracking (see the sketch after this list), the policy cannot distinguish between these states and may take incorrect actions.
- No recovery demonstrations: The training set contains only successful demonstrations, spanning a very narrow manifold of possible trajectories.
- Data: Each task provides 200 demonstrations. Public evaluation uses 10 additional instances; the private leaderboard uses a separate held-out set of 10 more.
- High compute requirements: Training on all 1200 hours of data takes weeks, and full evaluation across all tasks takes several days.
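The System 2 stage tracking mentioned in the abstract targets exactly this non-Markovian ambiguity. The report does not detail its mechanism, so the following is only a minimal sketch of explicit stage tracking under our own assumptions; `StageTracker`, its monotonic-update rule, and the policy-conditioning interface are all hypothetical.

```python
# Hypothetical sketch of explicit stage tracking for non-Markovian tasks.
# Not the report's actual System 2 implementation; the names and the
# monotonic-update rule are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class StageTracker:
    """Tracks which sub-stage of a long-horizon task the robot is in."""
    num_stages: int
    stage: int = 0

    def update(self, stage_logits) -> int:
        # A slow "System 2" head predicts the current stage from observations.
        # Only allow staying put or advancing by one step, so a visually
        # ambiguous frame cannot jump the tracker backwards or far forwards.
        predicted = int(stage_logits.argmax())
        if predicted in (self.stage, self.stage + 1):
            self.stage = min(predicted, self.num_stages - 1)
        return self.stage
```

The tracked index can then be fed to the policy as extra conditioning (e.g., an additional token), so that visually identical observations from different stages map to different actions.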
Model
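Two of the training-time ideas named in the abstract, correlated noise for flow matching and multi-sample flow matching, can be illustrated concretely. The sketch below is our minimal reading, not the exact implementation: the AR(1) correlation structure, the coefficient `rho`, the sample count, and the `policy(x_t, t, obs)` interface are all assumptions.

```python
# Hedged sketch of flow-matching training with temporally correlated noise
# and multi-sample variance reduction. Hyperparameters and the policy
# interface are illustrative assumptions.
import torch

def correlated_noise(batch, horizon, dim, rho=0.9, device="cpu"):
    """AR(1) noise across the action horizon:
    eps_t = rho * eps_{t-1} + sqrt(1 - rho^2) * w_t.
    Marginals stay N(0, 1), but adjacent chunk steps are correlated, which
    biases sampled action sequences toward smoothness."""
    eps = torch.empty(batch, horizon, dim, device=device)
    eps[:, 0] = torch.randn(batch, dim, device=device)
    scale = (1 - rho**2) ** 0.5
    for t in range(1, horizon):
        eps[:, t] = rho * eps[:, t - 1] + scale * torch.randn(batch, dim, device=device)
    return eps

def flow_matching_loss(policy, actions, obs, num_samples=4):
    """Average the flow-matching loss over several noise draws per example
    (multi-sample flow matching) to reduce gradient variance."""
    b, h, d = actions.shape
    loss = 0.0
    for _ in range(num_samples):
        t = torch.rand(b, 1, 1, device=actions.device)   # interpolation time
        x0 = correlated_noise(b, h, d, device=actions.device)
        x_t = (1 - t) * x0 + t * actions                 # noisy action chunk
        v_target = actions - x0                          # straight-line velocity
        v_pred = policy(x_t, t.squeeze(), obs)
        loss = loss + torch.nn.functional.mse_loss(v_pred, v_target)
    return loss / num_samples
```

Correlating the noise across the action horizon keeps each marginal standard normal while biasing sampled action chunks toward smoothness; averaging the loss over several noise draws per example reduces gradient variance at the cost of extra forward passes.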
Results
On the held-out evaluation, our approach achieves a Q-score of approximately 0.26 (including partial successes), with a minimal public–private gap.
| Rank | Team | Affiliation | Task Success (private) | Q-Score (private) |
|---|---|---|---|---|
| 1 | Robot Learning Collective (ours) | Independent | 0.124 | 0.260 |
| 2 | Comet | NVIDIA Research | 0.114 | 0.251 |
| 3 | SimpleAI Robot | Beijing Simple AI Technology | 0.108 | 0.159 |
| 4 | The North Star | Huawei CRI EAI Team | 0.076 | 0.120 |
| 5 | Embodied Intelligence | Independent | 0.052 | 0.095 |
Top 5 teams on the held-out test set (leaderboard)
Per-task results
- Task success rates vary widely: some tasks are almost solved, failing only under particularly tricky initial conditions, while others fail in all 10 trials.
- Tasks with zero successes do not appear fundamentally impossible; they usually hinge on a single tricky step, either very high-precision manipulation (with a low success rate even for human teleoperators) or a carefully ordered sequence slightly beyond the current model's limits.
- Task duration does not appear to be a fundamental obstacle: longer tasks simply have more steps, which makes full success harder, but partial success remains highly achievable.
Examples of 100% Successful Episodes
Episode clips are shown at 10× speed.
Failure Analysis
To analyze why the robot fails, we labeled failure reasons on a subset of 15 of the 50 tasks. Each reason below comes with an explanation and video examples:
Examples of Failure Episodes
Episode clips are shown at 5× speed.
Note that failure-reason labeling is subjective and is meant to give a big-picture view. Refer to all evaluation videos and scores here.
Recovery from cross-task learning
Training on all 50 tasks leads to emergent behaviors. Because of how the data were collected, single-task models never recover from mistakes; the multi-task model, trained sufficiently long, learns to use skills from other tasks to recover (e.g., to pick up a dropped item).
Runtime heuristics
Unfortunately, cross-task learning alone was insufficient to overcome the bias toward ideal demonstrations in the data. As a result, the robot often missed the grasp, closed the gripper, and failed to retry, since the training data lacked recovery demonstrations that include reopening the gripper. In practice, however, a fully closed gripper usually indicates that the robot has grasped nothing. By introducing a simple heuristic that automatically opens the gripper in such cases, we observed robust recovery behavior that was surprisingly similar to what emerges when recovery is properly represented during data collection.
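As a concrete illustration, a minimal version of this heuristic fits in a few lines. The threshold value and the controller interface below are assumptions, not the exact values used.

```python
# Minimal sketch of the gripper-reopen heuristic described above. The
# threshold and interface are illustrative assumptions; the idea is only
# that a fully closed parallel-jaw gripper implies an empty grasp, so we
# force it back open and let the policy retry.
CLOSED_EPS = 0.002   # gripper width (m) below which we assume nothing is held

def maybe_reopen_gripper(gripper_width, close_commanded, open_command):
    """If the policy commanded a close but the jaws shut completely,
    the grasp missed: override with an open command so the policy can retry."""
    if close_commanded and gripper_width < CLOSED_EPS:
        return open_command   # recovery: reopen instead of staying closed
    return None               # no override; execute the policy's action
```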
Summary
- The dominant failure modes closely reflect real-world challenges faced by imitation-learning-based robotics models, supporting the role of the BEHAVIOR benchmark as a valuable test bed for evaluating new approaches.
- Evidence from cross-task learning reinforces the view that imitation-learning-based models benefit significantly from training on diverse datasets.
- Due to limited resources for thorough ablation studies, we cannot precisely identify which components were critical. Nonetheless, the combined approach proves robust and outperforms comparable Pi0.5-based methods despite having a smaller training budget.
Acknowledgments
This work was made possible by the generous support of Nebius, who provided the high-performance cloud GPU compute resources required to train our models.
We would also like to thank the following people for their help and support: Vladimir Ershov, Justyna Ilczuk, Andrey Mulenkov.
Interested in Robot Learning? Join our Discord to discuss the Challenge results and collaboration.