RoboPapers

Chris Paxton and Michael Cho

Chris Paxton & Michael Cho geek out over robotics papers with paper authors. robopapers.substack.com

  1. Ep#76: OmniXtreme: Breaking the Generality Barrier in High-Dynamic Humanoid Control

    1 day ago

    We’ve seen plenty of incredible videos of humanoid robots dancing, doing martial arts, and running up walls, but these extreme behaviors usually come from individual, highly specialized policies. OmniXtreme shows how to achieve behaviors that push the limits of humanoid motion by (1) training a flow-based motion generative model and (2) doing residual RL post-training to handle complex real-world dynamics. Yunsheng Wang and Shaohang Zhu join us to talk about their work towards general-purpose, high-performance humanoid robot control. Watch Episode #76 of RoboPapers, with Michael Cho and Jiafei Duan, now! A schematic flow-matching training step is sketched after this entry.

    Abstract

    High-fidelity motion tracking serves as the ultimate litmus test for generalizable, human-level motor skills. However, current policies often hit a "generality barrier": as motion libraries scale in diversity, tracking fidelity inevitably collapses, especially for real-world deployment of high-dynamic motions. We identify this failure as the result of two compounding factors: the learning bottleneck in scaling multi-motion optimization and the physical executability constraints that arise in real-world actuation. To overcome these challenges, we introduce OmniXtreme, a scalable framework that decouples general motor skill learning from sim-to-real physical skill refinement. Our approach uses a flow-matching policy with high-capacity architectures to scale representation capacity without interference-intensive multi-motion RL optimization, followed by an actuation-aware refinement phase that ensures robust performance on physical hardware. Extensive experiments demonstrate that OmniXtreme maintains high-fidelity tracking across diverse, high-difficulty datasets. On real robots, the unified policy successfully executes multiple extreme motions, effectively breaking the long-standing fidelity-scalability trade-off in high-dynamic humanoid control.

    Learn More

    Project Page: https://extreme-humanoid.github.io/
    GitHub: https://github.com/Perkins729/OmniXtreme
    arXiv: https://arxiv.org/abs/2602.23843
    Original thread on X:

    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com

    48 min
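
    For readers new to flow-matching policies, the sketch below shows the basic training step in generic form: a velocity network is regressed toward the straight-line velocity that carries Gaussian noise to a target action vector, conditioned on the observation. This is a minimal illustration under assumed names and dimensions (VelocityNet, OBS_DIM, ACT_DIM), not OmniXtreme's actual architecture or training code.

    ```python
    # Minimal conditional flow-matching training step (generic sketch; not the
    # OmniXtreme implementation). The network v(x_t, t, obs) is trained so that
    # integrating it from noise reproduces target motion/action vectors.
    import torch
    import torch.nn as nn

    OBS_DIM, ACT_DIM = 64, 29          # illustrative sizes (e.g. a 29-DoF humanoid)

    class VelocityNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(OBS_DIM + ACT_DIM + 1, 256), nn.SiLU(),
                nn.Linear(256, 256), nn.SiLU(),
                nn.Linear(256, ACT_DIM),
            )

        def forward(self, x_t, t, obs):
            return self.net(torch.cat([x_t, t, obs], dim=-1))

    model = VelocityNet()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)

    def flow_matching_step(obs, target_action):
        """One training step: regress the straight-line velocity field."""
        noise = torch.randn_like(target_action)
        t = torch.rand(target_action.shape[0], 1)       # random interpolation time
        x_t = (1 - t) * noise + t * target_action       # point on the noise->data path
        v_target = target_action - noise                # constant velocity of that path
        loss = ((model(x_t, t, obs) - v_target) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()

    # One step on a dummy batch
    print(flow_matching_step(torch.randn(32, OBS_DIM), torch.randn(32, ACT_DIM)))
    ```

    At inference, the learned velocity field is integrated from noise over a few Euler steps to produce an action; the residual RL post-training discussed in the episode would then refine those actions for real hardware.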
  2. Ep#75: TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics

    5 days ago

    Reinforcement learning on robots is limited by our ability to design good reward functions, so strong, generalizable reward functions are a key enabler of progress on real-world reinforcement learning. We already have a very general class of models: VLMs. Wouldn’t it be great if you could just use a VLM to generate rewards? TOPReward derives rewards directly from the probability of the “True” token in a VLM question-answering response, which makes it easy to implement, incredibly general, and surprisingly powerful (a sketch of the token-probability trick follows this entry). We talked to Shirui Chen and Cole Harrison to learn more. Watch Episode #75 of RoboPapers, with Chris Paxton and Jiafei Duan, now!

    Abstract

    While Vision-Language-Action (VLA) models have seen rapid progress in pretraining, their advancement in Reinforcement Learning (RL) remains hampered by low sample efficiency and sparse rewards in real-world settings. Developing generalizable process reward models is essential for providing the fine-grained feedback necessary to bridge this gap, yet existing temporal value functions often fail to generalize beyond their training domains. We introduce TOPReward, a novel, probabilistically grounded temporal value function that leverages the latent world knowledge of pretrained video Vision-Language Models (VLMs) to estimate robotic task progress. Unlike prior methods that prompt VLMs to directly output progress values, which are prone to numerical misrepresentation, TOPReward extracts task progress directly from the VLM's internal token logits. In zero-shot evaluations across 130+ distinct real-world tasks and multiple robot platforms (e.g., Franka, YAM, SO-100/101), TOPReward achieves 0.947 mean Value-Order Correlation (VOC) on Qwen3-VL, dramatically outperforming the state-of-the-art GVL baseline which achieves near-zero correlation on the same open-source model. We further demonstrate that TOPReward serves as a versatile tool for downstream applications, including success detection and reward-aligned behavior cloning.

    Learn More

    Project Page: https://topreward.github.io/webpage/
    arXiv: https://arxiv.org/abs/2602.19313

    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com

    1 hr 1 min
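
    The core trick, turning the probability a model assigns to an affirmative answer token into a scalar reward, can be shown in a few lines. The sketch below uses a small text-only language model purely to illustrate the logit extraction; TOPReward itself prompts a video VLM (e.g. Qwen3-VL) conditioned on robot observations, and its exact prompt and aggregation scheme are described in the paper, not reproduced here.

    ```python
    # Sketch of "token probability as reward" (illustrative; not the TOPReward code).
    # Ask a yes/no question about task progress and read off the normalized
    # probability mass placed on the affirmative token.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Stand-in model: a tiny text-only LM. The real method conditions a video VLM
    # on observations of the robot; the logit-extraction step is the same.
    tok = AutoTokenizer.from_pretrained("gpt2")
    lm = AutoModelForCausalLM.from_pretrained("gpt2")

    def affirmative_token_reward(prompt: str) -> float:
        """Return renormalized P(" True") vs P(" False") for the next token."""
        ids = tok(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = lm(ids).logits[0, -1]            # next-token logits
        true_id = tok.encode(" True")[0]
        false_id = tok.encode(" False")[0]
        pair = torch.softmax(logits[[true_id, false_id]], dim=0)
        return pair[0].item()

    reward = affirmative_token_reward(
        "Question: Has the robot finished stacking the blocks? Answer:"
    )
    print(f"reward = {reward:.3f}")
    ```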
  3. Ep#73: VideoManip: Dexterous Manipulation Policies from RGB Human Videos via 3D Hand-Object Trajectory Reconstruction

    Apr 18

    Teaching robots to perform dexterous manipulation tasks currently requires teleoperation, which limits demonstration quality, speed, and scalability. Instead, why not use human videos? The problem is that a human hand isn’t a robot hand, so the data must be retargeted, using simulation to resolve issues like collisions and interpenetration when controlling the hand (a schematic retargeting objective is sketched after this entry). In VideoManip, Hongyi Chen and co-authors built a system to solve this problem, taking in RGB videos of humans performing manipulation tasks and using them to create accurate simulations with which to learn robot policies. Watch Episode #73 of RoboPapers, hosted by Michael Cho and Chris Paxton, now to learn more!

    Abstract

    Multi-finger robotic hand manipulation and grasping are challenging due to the high-dimensional action space and the difficulty of acquiring large-scale training data. Existing approaches largely rely on human teleoperation with wearable devices or specialized sensing equipment to capture hand-object interactions, which limits scalability. In this work, we propose VIDEOMANIP, a device-free framework that learns dexterous manipulation directly from RGB human videos. Leveraging recent advances in computer vision, VIDEOMANIP reconstructs explicit 3D robot-object trajectories from monocular videos by estimating human hand poses and object meshes, and retargets the reconstructed human motions to robotic hands for manipulation learning. To make the reconstructed robot data suitable for dexterous manipulation training, we introduce hand-object contact optimization with interaction-centric grasp modeling, as well as a demonstration synthesis strategy that generates diverse training trajectories from a single video, enabling generalizable policy learning without additional robot demonstrations. In simulation, the learned grasping model achieves a 70.25% success rate across 20 diverse objects using the Inspire Hand. In the real world, manipulation policies trained from RGB videos achieve an average 62.86% success rate across seven tasks using the LEAP Hand, outperforming retargeting-based methods by 15.87%. Project videos are available at this http URL.

    Learn More

    Project Page: https://videomanip.github.io/
    arXiv: https://arxiv.org/abs/2602.09013

    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com

    44 min
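
    Retargeting a reconstructed human hand pose to a robot hand is often posed as a small optimization: find joint angles whose fingertips match the estimated human fingertips. The sketch below shows that common formulation; the forward-kinematics placeholder, joint count, and smoothness term are assumptions for illustration, and VideoManip's actual pipeline additionally handles hand-object contacts, penetration, and object meshes.

    ```python
    # Schematic fingertip retargeting (a common formulation; not VideoManip's
    # exact optimization, which also performs hand-object contact optimization).
    import numpy as np
    from scipy.optimize import minimize

    N_JOINTS = 16                                   # illustrative robot-hand DoF
    rng = np.random.default_rng(0)
    _A = rng.normal(size=(15, N_JOINTS))            # placeholder for real FK

    def robot_fingertips(q: np.ndarray) -> np.ndarray:
        """Placeholder forward kinematics: joint angles -> 5 fingertip positions.
        A real implementation would use the hand's kinematic model (e.g. a URDF)."""
        return (_A @ q).reshape(5, 3)

    def retarget_frame(human_tips: np.ndarray, q_init: np.ndarray) -> np.ndarray:
        """Joint angles whose fingertips best match the estimated human fingertips."""
        def cost(q):
            err = robot_fingertips(q) - human_tips          # (5, 3) position error
            smooth = 1e-3 * np.sum((q - q_init) ** 2)       # stay close to last frame
            return np.sum(err ** 2) + smooth
        return minimize(cost, q_init, method="L-BFGS-B").x

    human_tips = rng.normal(size=(5, 3))   # e.g. from an off-the-shelf hand-pose estimator
    q = retarget_frame(human_tips, np.zeros(N_JOINTS))
    print(q.round(3))
    ```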
  4. Ep#72: SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control

    Apr 15

    How can we build a general-purpose “foundation model” for robot motion? Zhengyi Luo joins us to talk about SONIC, which uses motion tracking as a foundational task for humanoid robot control and scales humanoid control training to 21k GPU hours and more than 100 million frames of data. The result is a model with a generally useful embedding space that can be driven by a VLA, or from human video, to perform a wide variety of humanoid whole-body control tasks, including zero-shot transfer to previously unseen motions (a generic form of motion-tracking reward is sketched after this entry). Watch Episode #72 of RoboPapers, with Michael Cho and Jiafei Duan, now!

    Abstract

    Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited set of behaviors, and are trained on a handful of GPUs. We show that scaling model capacity, data, and compute yields a generalist humanoid controller capable of natural, robust whole-body movements. We position motion tracking as a scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (1.2M to 42M parameters), dataset volume (100M+ frames from 700 hours of motion capture), and compute (21k GPU hours). Beyond demonstrating the benefits of scale, we further show downstream utility through: (1) a real-time kinematic planner bridging motion tracking to tasks such as navigation, enabling natural and interactive control, and (2) a unified token space supporting VR teleoperation and vision-language-action (VLA) models with a single policy. Through this interface, we demonstrate autonomous VLA-driven whole-body loco-manipulation requiring coordinated hand and foot placement. Scaling motion tracking exhibits favorable properties: performance improves steadily with compute and data diversity, and learned policies generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.

    Learn More

    Project Page: https://nvlabs.github.io/GEAR-SONIC/
    arXiv: https://arxiv.org/abs/2511.07820
    Paper PDF: https://nvlabs.github.io/GEAR-SONIC/static/pdf/sonic_paper.pdf

    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com

    1 hr
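
    "Motion tracking as a task" means the reward comes directly from how closely the robot matches a reference motion-capture frame, with no hand-designed task reward. Below is a common exponentiated-error form of such a reward, shown purely as an illustration; SONIC's exact reward terms, weights, and observation design are described in the paper and are not reproduced here.

    ```python
    # Generic motion-tracking reward (illustrative form only, not SONIC's exact terms):
    # exponentiated tracking errors against the reference mocap frame.
    import numpy as np

    def tracking_reward(joint_pos, ref_joint_pos, root_pos, ref_root_pos,
                        w_joint=0.7, w_root=0.3, k_joint=2.0, k_root=5.0):
        joint_err = np.mean(np.sum((joint_pos - ref_joint_pos) ** 2, axis=-1))
        root_err = np.sum((root_pos - ref_root_pos) ** 2)
        return w_joint * np.exp(-k_joint * joint_err) + w_root * np.exp(-k_root * root_err)

    # Perfect tracking gives the maximum reward of 1.0
    J = 29
    print(tracking_reward(np.zeros((J, 3)), np.zeros((J, 3)), np.zeros(3), np.zeros(3)))
    ```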
  5. Ep#71: Build Your Own Robot

    Apr 8

    Robots, unfortunately, tend to be expensive. Finding a robot that is capable of a wide variety of mobile manipulation tasks while also being affordable and “hackable” is extremely difficult. Many different problems need to be addressed, from arm control to navigation to integrating your data collection strategy into the hardware design. This can make it difficult for all but the most well-funded teams to “scale” real-world robotics research. Fortunately, the team behind Build Your Own Robot has a solution. Manan Anjaria, Mahi Shafiullah, Jeff Cui, and Enes Erciyes joined us to talk about how they built a fully open-source mobile manipulator out of off-the-shelf parts, with a humanlike range of motion and the ability to perform a wide variety of tasks, for roughly $10,000. Watch Episode #71 of RoboPapers, with Michael Cho and Chris Paxton, today to learn more!

    Abstract

    Recent advances in robot learning have generated significant interest in capable platforms that may eventually approach human-level competence. This interest, combined with the commoditization of actuators, has propelled growth in low-cost robotic platforms. However, the optimal form factor for mobile manipulation, especially on a budget, remains an open question. We introduce YOR, an open-source, low-cost mobile manipulator that integrates an omnidirectional base, a telescopic vertical lift, and two arms with grippers to achieve whole-body mobility and manipulation. Our design emphasizes modularity, ease of assembly using off-the-shelf components, and affordability, with a bill-of-materials cost under 10,000 USD. We demonstrate YOR's capability by completing tasks that require coordinated whole-body control, bimanual manipulation, and autonomous navigation. Overall, YOR offers competitive functionality for mobile manipulation research at a fraction of the cost of existing platforms. Project website: this https URL

    Learn More

    Project Page: https://yourownrobot.ai/
    arXiv: https://arxiv.org/abs/2602.11150

    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com

    1 hr 1 min
  6. Ep#70: A Systematic Study of Data Modalities and Strategies for Co-training Large Behavior Models for Robot Manipulation

    Apr 1

    Co-training has become a key part of the recipe for training large robotics models: you mix some proportion of real robot data with other data sources, such as simulation or egocentric human video data. This is especially important because robot data tends to lack diversity, which these other modalities can partly compensate for. And yet, until now, there had been no sizable study of what constitutes good practice for co-training! We talk to Fanqi Lin and Jose Barreiros about their new work, a massive study that evaluated 89 policies over thousands of rollouts to determine which forms of co-training are most useful for robotics (a minimal data-mixing sketch follows this entry). Watch Episode #70 of RoboPapers, with Michael Cho and Chris Paxton, now!

    Abstract

    Large behavior models have shown strong dexterous manipulation capabilities by extending imitation learning to large-scale training on multi-task robot data, yet their generalization remains limited by the insufficient robot data coverage. To expand this coverage without costly additional data collection, recent work relies on co-training: jointly learning from target robot data and heterogeneous data modalities. However, how different co-training data modalities and strategies affect policy performance remains poorly understood. We present a large-scale empirical study examining five co-training data modalities: standard vision-language data, dense language annotations for robot trajectories, cross-embodiment robot data, human videos, and discrete robot action tokens across single- and multi-phase training strategies. Our study leverages 4,000 hours of robot and human manipulation data and 50M vision-language samples to train vision-language-action policies. We evaluate 89 policies over 58,000 simulation rollouts and 2,835 real-world rollouts. Our results show that co-training with forms of vision-language and cross-embodiment robot data substantially improves generalization to distribution shifts, unseen tasks, and language following, while discrete action token variants yield no significant benefits. Combining effective modalities produces cumulative gains and enables rapid adaptation to unseen long-horizon dexterous tasks via fine-tuning. Training exclusively on robot data degrades the visiolinguistic understanding of the vision-language model backbone, while co-training with effective modalities restores these capabilities. Explicitly conditioning action generation on chain-of-thought traces learned from co-training data does not improve performance in our simulation benchmark. Together, these results provide practical guidance for building scalable generalist robot policies.

    Learn More

    Project Page: https://co-training-lbm.github.io
    arXiv: https://arxiv.org/abs/2602.01067

    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com

    1 hr 25 min
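
    Mechanically, co-training usually just means drawing each training example from one of several data sources according to fixed mixing weights. The sketch below shows that idea with made-up dataset names and ratios; the study's actual modalities, data loaders, and sampling ratios are described in the paper.

    ```python
    # Minimal co-training data mixer (illustrative; dataset names and weights are
    # made up, not the ratios used in the paper).
    import random

    def make_cotraining_sampler(datasets: dict, weights: dict):
        """datasets: name -> iterable of samples; weights: name -> mixing probability."""
        names = list(datasets)
        probs = [weights[n] for n in names]
        iters = {n: iter(datasets[n]) for n in names}

        def sample():
            name = random.choices(names, weights=probs, k=1)[0]
            try:
                return name, next(iters[name])
            except StopIteration:                    # restart an exhausted stream
                iters[name] = iter(datasets[name])
                return name, next(iters[name])

        return sample

    sampler = make_cotraining_sampler(
        datasets={"robot": range(1000), "vision_language": range(5000),
                  "cross_embodiment": range(2000), "human_video": range(3000)},
        weights={"robot": 0.5, "vision_language": 0.2,
                 "cross_embodiment": 0.2, "human_video": 0.1},
    )
    print([sampler()[0] for _ in range(8)])          # which source each draw came from
    ```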
  7. Ep#69: MolmoSpaces, an Open Ecosystem for Embodied AI

    Mar 25

    Benchmarking, evaluating, and developing robotics code is difficult, partly because no simulator really reflects the diversity and scale of real embodiments. Enter MolmoSpaces from AI2: a massive open ecosystem with more than 230,000 handcrafted and procedurally generated home environments, including 48,000 manipulable objects. Crucially, MolmoSpaces provides simulation environments that work for both navigation and manipulation. We talked to team members Yejin Kim, Omar Rayyan, and Max Argus to learn more (a sketch of how sim-to-real correlation is computed follows this entry). Watch Episode #69 of RoboPapers, with Michael Cho and Jiafei Duan, now!

    Abstract

    Deploying robots at scale demands robustness to the long tail of everyday situations. The countless variations in scene layout, object geometry, and task specifications that characterize real environments are vast and underrepresented in existing robot benchmarks. Measuring this level of generalization requires infrastructure at a scale and diversity that physical evaluation alone cannot provide. We introduce MolmoSpaces, a fully open ecosystem to support large-scale benchmarking of robot policies. MolmoSpaces consists of over 230k diverse indoor environments, ranging from handcrafted household scenes to procedurally generated multiroom houses, populated with 130k richly annotated object assets, including 48k manipulable objects with 42M stable grasps. Crucially, these environments are simulator-agnostic, supporting popular options such as MuJoCo, Isaac, and ManiSkill. The ecosystem supports the full spectrum of embodied tasks: static and mobile manipulation, navigation, and multiroom long-horizon tasks requiring coordinated perception, planning, and interaction across entire indoor environments. We also design MolmoSpaces-Bench, a benchmark suite of 8 tasks in which robots interact with our diverse scenes and richly annotated objects. Our experiments show MolmoSpaces-Bench exhibits strong sim-to-real correlation (R = 0.96, ρ = 0.98), confirm newer and stronger zero-shot policies outperform earlier versions in our benchmarks, and identify key sensitivities to prompt phrasing, initial joint positions, and camera occlusion. Through MolmoSpaces and its open-source assets and tooling, we provide a foundation for scalable data generation, policy training, and benchmark creation for robot learning research.

    Learn More

    Project Page: https://allenai.org/blog/molmospaces
    Technical report: https://allenai.org/papers/molmospaces
    Code: https://github.com/allenai/molmospaces

    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com

    1 hr 11 min
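
    The sim-to-real correlation figures quoted in the abstract (R = 0.96, ρ = 0.98) are the standard Pearson and Spearman correlations between per-policy success rates measured in simulation and on real hardware. The sketch below shows that computation on made-up numbers; it is not the benchmark's evaluation code.

    ```python
    # Computing sim-to-real correlation between per-policy success rates
    # (numbers are made up for illustration).
    from scipy.stats import pearsonr, spearmanr

    sim_success  = [0.82, 0.55, 0.91, 0.34, 0.70]
    real_success = [0.78, 0.50, 0.88, 0.30, 0.65]

    r, _ = pearsonr(sim_success, real_success)       # linear correlation (R)
    rho, _ = spearmanr(sim_success, real_success)    # rank correlation (rho)
    print(f"Pearson R = {r:.2f}, Spearman rho = {rho:.2f}")
    ```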
