RoboPapers

Chris Paxton and Michael Cho

Chris Paxton & Michael Cho geek out over robotic papers with paper authors. robopapers.substack.com

  1. 3 DAYS AGO

    Ep#69: MolmoSpaces, an Open Ecosystem for Embodied AI

    Benchmarking, evaluating, and developing robotics code is difficult, in part because no simulator really reflects the diversity and scale of real embodiments. Enter MolmoSpaces from AI2: a massive open ecosystem with 230,000 handcrafted and procedurally generated home environments, including 48,000 manipulable objects. Crucially, MolmoSpaces provides simulation environments that work for both navigation and manipulation. The team, Yejin Kim, Omar Rayyan, and Max Argus, joined us to tell us more. Watch Episode 69 of RoboPapers, with Michael Cho and Jiafei Duan, now!

    Abstract: Deploying robots at scale demands robustness to the long tail of everyday situations. The countless variations in scene layout, object geometry, and task specifications that characterize real environments are vast and underrepresented in existing robot benchmarks. Measuring this level of generalization requires infrastructure at a scale and diversity that physical evaluation alone cannot provide. We introduce MolmoSpaces, a fully open ecosystem to support large-scale benchmarking of robot policies. MolmoSpaces consists of over 230k diverse indoor environments, ranging from handcrafted household scenes to procedurally generated multiroom houses, populated with 130k richly annotated object assets, including 48k manipulable objects with 42M stable grasps. Crucially, these environments are simulator-agnostic, supporting popular options such as MuJoCo, Isaac, and ManiSkill. The ecosystem supports the full spectrum of embodied tasks: static and mobile manipulation, navigation, and multiroom long-horizon tasks requiring coordinated perception, planning, and interaction across entire indoor environments. We also design MolmoSpaces-Bench, a benchmark suite of 8 tasks in which robots interact with our diverse scenes and richly annotated objects.
    Our experiments show MolmoSpaces-Bench exhibits strong sim-to-real correlation (R = 0.96, ρ = 0.98), confirm that newer and stronger zero-shot policies outperform earlier versions on our benchmarks, and identify key sensitivities to prompt phrasing, initial joint positions, and camera occlusion. Through MolmoSpaces and its open-source assets and tooling, we provide a foundation for scalable data generation, policy training, and benchmark creation for robot learning research.

    Learn more:
    Project page: https://allenai.org/blog/molmospaces
    Technical report: https://allenai.org/papers/molmospaces
    Code: https://github.com/allenai/molmospaces

    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com
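    The sim-to-real correlation numbers quoted above (Pearson R and Spearman ρ) are standard statistics over paired per-policy success rates. The sketch below shows how such numbers are computed; the `sim` and `real` values are made up for illustration, not results from the paper, and the rank helper ignores ties.

```python
# Sketch: Pearson R and Spearman rho between simulated and real success
# rates. The numbers below are hypothetical, NOT from the paper.
sim = [0.82, 0.65, 0.91, 0.40, 0.73]   # illustrative sim success rates
real = [0.78, 0.60, 0.88, 0.35, 0.70]  # illustrative real success rates

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(x):
    """Rank positions of x (no tie handling; enough for this sketch)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r

def spearman(x, y):
    """Spearman rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

print(f"R = {pearson(sim, real):.2f}, rho = {spearman(sim, real):.2f}")
```

    A high Pearson R says sim and real success rates are nearly linearly related; a high Spearman ρ says sim preserves the real-world ranking of policies, which is what matters when using simulation to pick the better policy.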

    1hr 11min
  2. 20 MAR

    Ep#68: DreamZero: World Action Models are Zero-Shot Policies

    Achieving generalizable manipulation is the north star of robot learning, and while we have previously seen incredible results on specific tasks using fine-tuned VLAs, this north star has remained elusive. Perhaps what is needed is a different approach. DreamZero proposes World Action Models (WAMs), which jointly model both action and video in order to achieve state-of-the-art performance on benchmarks like MolmoSpaces and RoboArena. Seonghyeon Ye of NVIDIA Robotics joins us to talk about building a 14B-parameter autoregressive diffusion model that achieves state-of-the-art generalization on real-world tasks and on the best available benchmarks. Watch Episode #68 of RoboPapers, with Michael Cho and Chris Paxton, now!

    Abstract: State-of-the-art Vision-Language-Action (VLA) models excel at semantic generalization but struggle to generalize to unseen physical motions in novel environments. We introduce DreamZero, a World Action Model (WAM) built upon a pretrained video diffusion backbone. Unlike VLAs, WAMs learn physical dynamics by predicting future world states and actions, using video as a dense representation of how the world evolves. By jointly modeling video and action, DreamZero learns diverse skills effectively from heterogeneous robot data without relying on repetitive demonstrations. This results in over a 2x improvement in generalization to new tasks and environments compared to state-of-the-art VLAs in real robot experiments. Crucially, through model and system optimizations, we enable a 14B autoregressive video diffusion model to perform real-time closed-loop control at 7 Hz. Finally, we demonstrate two forms of cross-embodiment transfer: video-only demonstrations from other robots or humans yield a relative improvement of over 42% on unseen task performance with just 10-20 minutes of data.
    More surprisingly, DreamZero enables few-shot embodiment adaptation, transferring to a new embodiment with only 30 minutes of play data while retaining zero-shot generalization.

    Learn more:
    Project page: https://dreamzero0.github.io/
    arXiv: https://arxiv.org/abs/2602.15922
    GitHub: https://github.com/dreamzero0/dreamzero

    You can also read Chris Paxton’s previous post on DreamZero.

    43 min
  3. 11 MAR

    Ep#66: Ordered Action Tokenization

    How should we represent robot actions for autoregressive transformers? Most robot policies use diffusion or flow to generate continuous action sequences, but this isn’t how large language models work; they predict output tokens, which has many advantages. But coming up with a set of useful action tokens, so we can skip the slow and expensive diffusion steps, is difficult. Chaoqi Liu says action tokens need three qualities: reasonable compression, universal decodability, and a left-to-right causally ordered token space, and he proposes Ordered Action Tokenization as a solution to all three. Watch Episode 66 of RoboPapers now, with Michael Cho and Chris Paxton, to learn more!

    Abstract: Autoregressive policies offer a compelling foundation for scalable robot learning by enabling discrete abstraction, token-level reasoning, and flexible inference. However, applying autoregressive modeling to continuous robot actions requires an effective action tokenization scheme. Existing approaches either rely on analytical discretization methods that produce prohibitively long token sequences, or on learned latent tokenizers that lack structure, limiting their compatibility with next-token prediction. In this work, we identify three desiderata for action tokenization — reasonable compression, universal decodability, and a left-to-right causally ordered token space — and introduce Ordered Action Tokenization (OAT), a learned action tokenizer that satisfies all three. OAT discretizes action chunks into an ordered sequence of tokens using a transformer with register tokens, finite scalar quantization, and ordering-inducing training mechanisms. The resulting token space aligns naturally with autoregressive generation and enables prefix-based detokenization, yielding an anytime trade-off between inference cost and action fidelity.
    Across more than 20 tasks spanning four simulation benchmarks and real-world settings, autoregressive policies equipped with OAT consistently outperform prior tokenization schemes and diffusion-based baselines, while offering significantly greater flexibility at inference time.

    Learn more:
    Project site: https://ordered-action-tokenization.github.io/
    arXiv: https://arxiv.org/abs/2602.04215
    Blog post on X
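    To make "finite scalar quantization" concrete: each dimension of a latent vector is bounded to [-1, 1] and rounded to one of a small fixed number of levels, and the per-dimension indices combine into a single discrete token. The sketch below illustrates that generic primitive only; it is not OAT's actual tokenizer, which is learned, transformer-based, and ordering-aware, and all names here are hypothetical.

```python
# Generic finite scalar quantization (FSQ) sketch, illustrating the
# quantization primitive OAT builds on. Not OAT's learned tokenizer.

def fsq_quantize(z, levels):
    """Clamp each latent dim to [-1, 1], round to one of levels[d] values."""
    codes = []
    for zd, L in zip(z, levels):
        zd = max(-1.0, min(1.0, zd))                      # bound the latent
        codes.append(round((zd + 1.0) / 2.0 * (L - 1)))   # nearest level index
    return codes

def fsq_dequantize(codes, levels):
    """Map level indices back to continuous values in [-1, 1]."""
    return [idx / (L - 1) * 2.0 - 1.0 for idx, L in zip(codes, levels)]

def codes_to_token(codes, levels):
    """Combine per-dimension indices into one integer token (mixed radix)."""
    token = 0
    for idx, L in zip(codes, levels):
        token = token * L + idx
    return token

def token_to_codes(token, levels):
    """Invert codes_to_token: every integer in range decodes to a valid
    action, one way to get the 'universal decodability' property."""
    codes = []
    for L in reversed(levels):
        codes.append(token % L)
        token //= L
    return codes[::-1]

codes = fsq_quantize([0.0, 1.0], [3, 5])       # -> [1, 4]
token = codes_to_token(codes, [3, 5])          # -> 9
assert token_to_codes(token, [3, 5]) == codes  # round-trip
```

    Because every token decodes to some valid quantized action, an autoregressive policy can never emit an "illegal" token, which is one reason discrete action spaces of this kind pair well with next-token prediction.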

    52 min
  4. 5 MAR

    Ep#65: VLM4VLA: Revisiting Vision-Language Models in Vision-Language-Action Models

    Pretraining is essential for good performance on a wide variety of robotics tasks, and so most vision-language-action models build off of a vision-language model (VLM) trained on a wide variety of image-language data. But how does the choice of VLM translate to downstream robotics performance? Jianke Zhang and Yanjiang Guo join us to talk about this key part of the robot policy, looking at a wide variety of different VLMs and how they perform. Interestingly, they find that performance on auxiliary tasks like question answering did not lead to downstream improvements in control. To learn more, watch Episode 65 of RoboPapers now, with Chris Paxton and Jiafei Duan.

    Abstract: Vision-Language-Action (VLA) models, which integrate pretrained large Vision-Language Models (VLMs) into their policy backbone, are gaining significant attention for their promising generalization capabilities. This paper revisits a fundamental yet seldom systematically studied question: how do VLM choice and competence translate to downstream VLA policy performance? We introduce VLM4VLA, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters, enabling fair and efficient comparison. Despite its simplicity, VLM4VLA proves surprisingly competitive with more sophisticated network designs. Through extensive empirical studies on various downstream tasks across three benchmarks, we find that while VLM initialization offers a consistent benefit over training from scratch, a VLM's general capabilities are poor predictors of its downstream task performance. This challenges common assumptions, indicating that standard VLM competence is necessary but insufficient for effective embodied control. We further investigate the impact of specific embodied capabilities by fine-tuning VLMs on seven auxiliary embodied tasks (e.g., embodied QA, visual pointing, depth estimation).
    Contrary to intuition, improving a VLM's performance on specific embodied skills does not guarantee better downstream control performance. Finally, modality-level ablations identify the visual module of the VLM, rather than the language component, as the primary performance bottleneck. We demonstrate that injecting control-relevant supervision into the vision encoder of the VLM yields consistent gains, even when the encoder remains frozen during downstream fine-tuning. This isolates a persistent domain gap between current VLM pretraining objectives and the requirements of embodied action planning.

    Learn more:
    Project page: https://cladernyjorn.github.io/VLM4VLA.github.io/
    arXiv: https://arxiv.org/abs/2601.03309
    Code: https://github.com/CladernyJorn/VLM4VLA

    1hr 4min
  5. 26 FEB

    Ep#64: Project Instinct

    Human motion is instinctual. We know how to interact with the world around us, almost without thinking about it at all. Ziwen and Shaoting joined us on RoboPapers to talk about their ambitious Project Instinct, which provides the tools, algorithms, and environments necessary to build humanoid whole-body control that can handle contact with the environment. Watch Episode #64 of RoboPapers with Michael Cho and Jiafei Duan now!

    Abstract: We present a unified framework spanning algorithms, environments, dataset curation, and deployment for instinct-level intelligence on humanoid robots.

    Project site: https://project-instinct.github.io/
    GitHub for InstinctLab: https://github.com/project-instinct/instinctlab

    Embrace Collisions: Perform contact-rich humanoid robot tasks like getting up from the ground.

    Abstract: Previous humanoid robot research treats the robot as a bipedal mobile manipulation platform, where only the feet and hands contact the environment. However, we humans use all body parts to interact with the world, e.g., we sit in chairs, get up from the ground, or roll on the floor. Contacting the environment with body parts other than feet and hands brings significant challenges to both model-predictive control and reinforcement learning-based methods. An unpredictable contact sequence makes it almost impossible for model-predictive control to plan ahead in real time. The success of zero-shot sim-to-real reinforcement learning for humanoids depends heavily on GPU-accelerated rigid-body physics simulation and simplified collision detection. The lack of extreme torso movement in prior humanoid research makes all the other components non-trivial to design, such as termination conditions, motion commands, and reward design. To address these challenges, we propose a general humanoid motion framework that takes discrete motion commands and controls the robot's motor actions in real time.
    Using a GPU-accelerated rigid-body simulator, we train a humanoid whole-body control policy that follows high-level motion commands in the real world in real time, even with stochastic contacts, extremely large robot base rotations, and not-quite-feasible motion commands.

    Project site: https://project-instinct.github.io/embrace-collisions/
    arXiv: https://arxiv.org/abs/2502.01465

    Deep Whole-Body Parkour

    Abstract: Current approaches to humanoid control generally fall into two paradigms: perceptive locomotion, which handles terrain well but is limited to pedal gaits, and general motion tracking, which reproduces complex skills but ignores environmental capabilities. This work unites these paradigms to achieve perceptive general motion control. We present a framework where exteroceptive sensing is integrated into whole-body motion tracking, permitting a humanoid to perform highly dynamic, non-locomotion tasks on uneven terrain. By training a single policy to perform multiple distinct motions across varied terrain features, we demonstrate the non-trivial benefit of integrating perception into the control loop. Our results show that this framework enables robust, highly dynamic multi-contact motions, such as vaulting and dive-rolling, on unstructured terrain, significantly expanding the robot's traversability beyond simple walking or running.

    Project site: https://project-instinct.github.io/deep-whole-body-parkour/
    arXiv: https://arxiv.org/abs/2601.07701

    Hiking in the Wild

    Abstract: Achieving robust humanoid hiking in complex, unstructured environments requires transitioning from reactive proprioception to proactive perception. However, integrating exteroception remains a significant challenge: mapping-based methods suffer from state estimation drift, and LiDAR-based methods, for instance, do not handle torso jitter well. Existing end-to-end approaches often struggle with scalability and training complexity.
    Specifically, some previous works using virtual obstacles are implemented case by case. In this work, we present Hiking in the Wild, a scalable, end-to-end perceptive parkour framework designed for robust humanoid hiking. To ensure safety and training stability, we introduce two key mechanisms: a foothold safety mechanism combining scalable Terrain Edge Detection with Foot Volume Points to prevent catastrophic slippage on edges, and a Flat Patch Sampling strategy that eliminates reward hacking by generating feasible navigation targets. Our approach uses a single-stage reinforcement learning scheme, mapping raw depth inputs and proprioception directly to joint actions, without relying on external state estimation. Extensive field experiments on a full-size humanoid demonstrate that our policy enables robust traversal of complex terrain at speeds up to 2.5 m/s. The training and deployment code is open-sourced to facilitate reproducible research and deployment on real robots with minimal hardware modifications.

    Project site: https://project-instinct.github.io/hiking-in-the-wild/
    arXiv: https://arxiv.org/abs/2601.07718

    56 min
  6. 19 FEB

    Ep#63: NovaFlow: Zero-Shot Manipulation via Actionable Flow from Generated Videos

    The holy grail of robotics is the ability to perform previously unseen, out-of-distribution manipulation tasks “zero-shot” in a new environment. NovaFlow proposes an approach that (1) generates a video, (2) computes predicted flow, i.e. how points move through the scene, and (3) uses this flow as an objective to generate a motion. Using this procedure, NovaFlow generates motions in unseen scenes, for unseen tasks, and can transfer across embodiments. We are joined by Hongyu Li and Jiahui Fu from RAI. Watch Episode #63 of RoboPapers now to learn more!

    Abstract: Enabling robots to execute novel manipulation tasks zero-shot is a central goal in robotics. Most existing methods assume in-distribution tasks or rely on fine-tuning with embodiment-matched data, limiting transfer across platforms. We present NovaFlow, an autonomous manipulation framework that converts a task description into an actionable plan for a target robot without any demonstrations. Given a task description, NovaFlow synthesizes a video using a video generation model and distills it into 3D actionable object flow using off-the-shelf perception modules. From the object flow, it computes relative poses for rigid objects and realizes them as robot actions via grasp proposals and trajectory optimization. For deformable objects, this flow serves as a tracking objective for model-based planning with a particle-based dynamics model. By decoupling task understanding from low-level control, NovaFlow naturally transfers across embodiments. We validate it on rigid, articulated, and deformable object manipulation tasks using a table-top Franka arm and a Spot quadrupedal mobile robot, achieving effective zero-shot execution without demonstrations or embodiment-specific training.

    Learn more:
    Project site: https://novaflow.lhy.xyz/
    arXiv: https://arxiv.org/abs/2510.08568
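    The step of computing relative poses for rigid objects from flow can be illustrated with least-squares rigid alignment: given where object points started and where the flow says they ended up, fit the rotation and translation that best maps one onto the other. Below is a minimal planar (2D) sketch of that idea; NovaFlow itself works with 3D flow from off-the-shelf perception, so this simplified example and its names are illustrative only.

```python
import math

def rigid_pose_from_flow(points, flowed):
    """Least-squares 2D rotation + translation mapping `points` onto
    `flowed` (a planar analogue of the Kabsch algorithm)."""
    n = len(points)
    cpx = sum(p[0] for p in points) / n   # source centroid
    cpy = sum(p[1] for p in points) / n
    cqx = sum(q[0] for q in flowed) / n   # target centroid
    cqy = sum(q[1] for q in flowed) / n
    s_cos = s_sin = 0.0
    for (px, py), (qx, qy) in zip(points, flowed):
        px, py = px - cpx, py - cpy       # center both point sets
        qx, qy = qx - cqx, qy - cqy
        s_cos += px * qx + py * qy
        s_sin += px * qy - py * qx
    theta = math.atan2(s_sin, s_cos)      # least-squares rotation angle
    # Translation carries the rotated source centroid onto the target centroid.
    tx = cqx - (math.cos(theta) * cpx - math.sin(theta) * cpy)
    ty = cqy - (math.sin(theta) * cpx + math.cos(theta) * cpy)
    return theta, (tx, ty)

# Flow that rotates a square 90 degrees about the origin: recover theta.
pts = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
flow = [(0.0, 1.0), (-1.0, 0.0), (0.0, -1.0), (1.0, 0.0)]
theta, t = rigid_pose_from_flow(pts, flow)
```

    The recovered relative pose is what a downstream planner would hand to grasp proposal and trajectory optimization; in 3D the same least-squares fit is solved with an SVD.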

    1hr 9min
  7. 11 FEB

    Ep#62: PolaRiS: Scalable Real-to-Sim Evaluations for Generalist Robot Policies

    Evaluating robot policies is hard. Ideally, instead of testing every new policy on a real robot, you could test in simulation; but simulations rarely correlate well with real-world performance, and making good, useful simulations takes a great deal of time and effort. That’s where PolaRiS comes in: it’s a toolkit that lets you take a short video of a real scene and turn it into a high-fidelity simulation. It provides what you need to build a good evaluation environment, and it “ships” with off-the-shelf environments that already show strong sim-to-real correlation, meaning they can be used to gauge policy performance. Arhan Jain and Karl Pertsch join us to talk about what they have built, why, and how you can use it. Watch Episode #62 of RoboPapers, with Chris Paxton and Jiafei Duan, now!

    Abstract: A significant challenge for robot learning research is our ability to accurately measure and compare the performance of robot policies. Benchmarking in robotics has historically been challenging due to the stochasticity, poor reproducibility, and time-consuming nature of real-world rollouts. This challenge is exacerbated for recent generalist policies, which have to be evaluated across a wide variety of scenes and tasks. Evaluation in simulation offers a scalable complement to real-world evaluations, but the visual and physical domain gap between existing simulation benchmarks and the real world has made them an unreliable signal for policy improvement. Furthermore, building realistic and diverse simulated environments has traditionally required significant human effort and expertise. To bridge this gap, we introduce Policy Evaluation and Environment Reconstruction in Simulation (PolaRiS), a scalable real-to-sim framework for high-fidelity simulated robot evaluation. PolaRiS utilizes neural reconstruction methods to turn short video scans of real-world scenes into interactive simulation environments.
    Additionally, we develop a simple simulation data co-training recipe that bridges the remaining real-to-sim gaps and enables zero-shot evaluation in unseen simulation environments. Through extensive paired evaluations between simulation and the real world, we demonstrate that PolaRiS evaluations correlate much more strongly with real-world generalist policy performance than existing simulated benchmarks. Its simplicity also enables rapid creation of diverse simulated environments. As such, this work takes a step toward distributed and democratized evaluation for the next generation of robotic foundation models.

    Learn more:
    Project page: https://polaris-evals.github.io/
    arXiv: https://arxiv.org/abs/2512.16881
    This post on X

    1 hr
