RoboPapers

Chris Paxton and Michael Cho

Chris Paxton & Michael Cho geek out over robotic papers with paper authors. robopapers.substack.com

  1. Ep#81: mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs

    5 days ago

    Robotics fundamentally involves understanding the dynamics of how things change in the world in response to action and force. This is impossible to learn from static images; it is far more effective and more data-efficient to learn from video. Elvis Nava joins us to talk about mimic-video and Mimic Robotics. mimic-video is part of a new class of video-action models capable of complex, dexterous bimanual robotic manipulation with relatively little robot data. One of the key findings from mimic-video is that pretraining on web-scale video allows robots to learn physics priors; as a result, policies train faster, generalize better, and are capable of more impressive dexterity than policies trained on static images or image-language pairs, as in a VLM. Watch Episode #81 of RoboPapers with Michael Cho and Chris Paxton to learn more!

    Abstract

    Prevailing Vision-Language-Action Models (VLAs) for robotic manipulation are built upon vision-language backbones pretrained on large-scale, but disconnected static web data. As a result, despite improved semantic generalization, the policy must implicitly infer complex physical dynamics and temporal dependencies solely from robot trajectories. This reliance creates an unsustainable data burden, necessitating continuous, large-scale expert data collection to compensate for the lack of innate physical understanding. We contend that while vision-language pretraining effectively captures semantic priors, it remains blind to physical causality. A more effective paradigm leverages video to jointly capture semantics and visual dynamics during pretraining, thereby isolating the remaining task of low-level control. To this end, we introduce mimic-video, a novel Video-Action Model (VAM) that pairs a pretrained Internet-scale video model with a flow matching-based action decoder conditioned on its latent representations. The decoder serves as an Inverse Dynamics Model (IDM), generating low-level robot actions from the latent representation of video-space action plans. Our extensive evaluation shows that our approach achieves state-of-the-art performance on simulated and real-world robotic manipulation tasks, improving sample efficiency by 10x and convergence speed by 2x compared to traditional VLA architectures.

    Learn More

    Project page: https://mimic-video.github.io/
    arXiv: https://arxiv.org/abs/2512.15692

    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com
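    The abstract describes a flow matching-based action decoder conditioned on video latents. The toy sketch below shows only the sampling side of that idea: start from noise and integrate a latent-conditioned velocity field with Euler steps. It is not the architecture from the paper. The names `target_action`, `velocity`, and `decode_action` are hypothetical, and the hand-written velocity field stands in for what would be a trained network.

```python
import random


def target_action(latent):
    """Hypothetical stand-in for the action implied by a latent video plan.

    A real decoder would learn this mapping; here it is a fixed linear map.
    """
    return [2.0 * latent[0] - latent[1], 0.5 * latent[1]]


def velocity(x, t, latent):
    """Velocity field for the linear path x_t = (1 - t) * noise + t * action.

    With a single target action per latent, the ideal field is
    (action - x) / (1 - t). A trained network would approximate this.
    """
    action = target_action(latent)
    return [(a - xi) / (1.0 - t) for a, xi in zip(action, x)]


def decode_action(latent, steps=10, seed=0):
    """Sample an action by Euler-integrating the velocity field from t=0 to t=1."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(2)]  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so (1 - t) is never zero
        v = velocity(x, t, latent)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


action = decode_action([0.4, -0.2])
```

    In this toy setting the integration lands on the target action regardless of the starting noise, because the field always points from the current sample toward the target.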

    1 hr 9 min
  2. Ep#80: LATENT: Learning Athletic Humanoid Tennis Skills from Imperfect Human Motion Data

    May 14

    Sports like tennis are great examples of the sort of dynamic whole-body interaction that’s possible with humanoid robots. But capturing examples of fast, dynamic interactions from humans is really difficult. Enter LATENT, which uses lower-quality human data plus reinforcement learning to teach a robot to play tennis well enough to complete back-and-forth volleys at a human level. LATENT has three steps: (1) collecting imperfect human data, like a backswing, (2) using this data to learn a latent action space, and (3) training a high-level policy in simulation that composes these actions and executes tennis skills on a robot. Haofei Lu and Yunrui Lian join us to tell us about their method. Watch Episode #80 of RoboPapers, with Chris Paxton and Jiafei Duan, now to learn more!

    Abstract

    Human athletes demonstrate versatile and highly-dynamic tennis skills to successfully conduct competitive rallies with a high-speed tennis ball. However, reproducing such behaviors on humanoid robots is difficult, partially due to the lack of perfect humanoid action data or human kinematic motion data in tennis scenarios as reference. In this work, we propose LATENT, a system that Learns Athletic humanoid TEnnis skills from imperfect human motioN daTa. The imperfect human motion data consist only of motion fragments that capture the primitive skills used when playing tennis rather than precise and complete human-tennis motion sequences from real-world tennis matches, thereby significantly reducing the difficulty of data collection. Our key insight is that, despite being imperfect, such quasi-realistic data still provide priors about human primitive skills in tennis scenarios. With further correction and composition, we learn a humanoid policy that can consistently strike incoming balls under a wide range of conditions and return them to target locations, while preserving natural motion styles. We also propose a series of designs for robust sim-to-real transfer and deploy our policy on the Unitree G1 humanoid robot. Our method achieves surprising results in the real world and can stably sustain multi-shot rallies with human players.

    Learn More

    Project page: https://zzk273.github.io/LATENT/
    arXiv: https://arxiv.org/pdf/2603.12686
    Code: https://github.com/GalaxyGeneralRobotics/LATENT

    56 min
  3. Ep#78: Three Eras of Robot Learning

    May 5

    Robotics has changed dramatically over the last eight years. Ted has been at the cutting edge of robot learning throughout this period, spending those eight years at Google Brain/Google DeepMind, and he has identified three eras of robot learning:

    * The Era of Existence Proofs - trying different methods like QT-Opt and on-robot RL
    * The Era of Foundation Models - transitioning to data collection and clean objectives (i.e., supervised learning)
    * The Era of Scaling - orders of magnitude more data and larger models, enabling reasoning, long-horizon actions, and cross-embodiment transfer

    Something succeeds only if everything goes right. Behavior cloning, for example, seemed stuck at a 60-70% success rate on key tasks until his team rewrote their learning stack, at which point it hit 95-99%+ success rates. For most of those eight years, something was wrong: the stack wasn’t quite right, the learning algorithms were wrong, the data didn’t exist, or hardware and operations were not mature enough. But they kept working on these problems, over and over, until they finally arrived at amazing breakthroughs.

    Some key trends now:

    * Reasoning models for robotics
    * Long-horizon, precision-oriented tasks, like making coffee from Physical Intelligence or GPU assembly from Skild
    * Cross-embodiment transfer
    * Hardware and model co-design
    * Results are nice, but capabilities matter even more, and academics are going to have trouble keeping up with the compute and resources available to companies

    Watch Episode #78 of RoboPapers, with Michael Cho and Jiafei Duan, to learn more!

    1 hr 11 min
  4. Ep#77: DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

    April 29

    World models have many different uses, from evaluation to training data generation to robot planning. DreamDojo is a new foundation world model that allows for impressively general and long-horizon interaction, generating coherent videos for interaction sequences over a minute long. It works in a wide range of environments and even generalizes to previously unseen ones. We talked to Shenyuan Gao and William Liang about how they built DreamDojo and about the tricks needed to scale world model learning on data with sparse action labels: pretraining on 44,000 hours of human data, then adapting to a wide variety of robots, environments, and skills. Watch Episode #77 of RoboPapers with Michael Cho and Chris Paxton now to learn more!

    Abstract

    Being able to simulate the outcomes of actions in varied environments will revolutionize the development of generalist agents at scale. However, modeling these world dynamics, especially for dexterous robotics tasks, poses significant challenges due to limited data coverage and scarce action labels. As an endeavor towards this end, we introduce DreamDojo, a foundation world model that learns diverse interactions and dexterous controls from 44k hours of egocentric human videos. Our data mixture represents the largest video dataset to date for world model pretraining, spanning a wide range of daily scenarios with diverse objects and skills. To address the scarcity of action labels, we introduce continuous latent actions as unified proxy actions, enhancing interaction knowledge transfer from unlabeled videos. After post-training on small-scale target robot data, DreamDojo demonstrates a strong understanding of physics and precise action controllability. We also devise a distillation pipeline that accelerates DreamDojo to a real-time speed of 10.81 FPS and further improves context consistency. Our work enables several important applications based on generative world models, including live teleoperation, policy evaluation, and model-based planning. Systematic evaluation on multiple challenging out-of-distribution (OOD) benchmarks verifies the significance of our method for simulating open-world, contact-rich tasks, paving the way for general-purpose robot world models.

    Learn More

    Project page: https://dreamdojo-world.github.io/
    arXiv: https://arxiv.org/abs/2602.06949
    GitHub: https://github.com/NVIDIA/DreamDojo

    1 hr 3 min
  5. Ep#76: OmniXtreme: Breaking the Generality Barrier in High-Dynamic Humanoid Control

    April 27

    We’ve seen lots of incredible videos of humanoid robots dancing, doing martial arts, and running up walls, but these extreme behaviors usually come from individual, highly specialized policies. OmniXtreme shows how to achieve behaviors that push the limits of humanoid motion with a single policy by (1) training a flow-based generative motion model and (2) doing residual RL post-training to handle complex real-world dynamics. Yunsheng Wang and Shaohang Zhu join us to talk about their work towards general-purpose, high-performance humanoid robot control. Watch Episode #76 of RoboPapers, with Michael Cho and Jiafei Duan, now!

    Abstract

    High-fidelity motion tracking serves as the ultimate litmus test for generalizable, human-level motor skills. However, current policies often hit a "generality barrier": as motion libraries scale in diversity, tracking fidelity inevitably collapses, especially for real-world deployment of high-dynamic motions. We identify this failure as the result of two compounding factors: the learning bottleneck in scaling multi-motion optimization and the physical executability constraints that arise in real-world actuation. To overcome these challenges, we introduce OmniXtreme, a scalable framework that decouples general motor skill learning from sim-to-real physical skill refinement. Our approach uses a flow-matching policy with high-capacity architectures to scale representation capacity without interference-intensive multi-motion RL optimization, followed by an actuation-aware refinement phase that ensures robust performance on physical hardware. Extensive experiments demonstrate that OmniXtreme maintains high-fidelity tracking across diverse, high-difficulty datasets. On real robots, the unified policy successfully executes multiple extreme motions, effectively breaking the long-standing fidelity-scalability trade-off in high-dynamic humanoid control.

    Learn More

    Project page: https://extreme-humanoid.github.io/
    GitHub: https://github.com/Perkins729/OmniXtreme
    arXiv: https://arxiv.org/abs/2602.23843
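    The second stage described above, residual RL post-training, can be pictured as a small learned correction added on top of the base policy's output. The sketch below is a generic illustration of residual action composition, not OmniXtreme's implementation. The function names, the residual scale, and the joint limit are all hypothetical values chosen for the example.

```python
def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def compose_action(base_action, residual, residual_scale=0.2, joint_limit=1.0):
    """Add a bounded correction to each base joint target, then clamp to limits.

    residual_scale and joint_limit are illustrative assumptions, not values
    from the paper.
    """
    combined = []
    for base, delta in zip(base_action, residual):
        correction = residual_scale * clamp(delta, -1.0, 1.0)
        combined.append(clamp(base + correction, -joint_limit, joint_limit))
    return combined


base = [0.50, -0.95, 0.10]   # hypothetical output of the generative motion policy
residual = [0.5, -3.0, 0.0]  # hypothetical output of the RL-trained residual policy
action = compose_action(base, residual)
```

    Bounding the residual keeps the correction small relative to the base motion, so the refinement stage adjusts the base behavior rather than replacing it.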

    48 min
  6. Ep#75: TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics

    April 23

    Reinforcement learning on robots is highly limited by our ability to design good reward functions, so designing strong, generalizable reward functions is a key enabler of progress in real-world reinforcement learning. But we already have a very general class of models: VLMs. Wouldn’t it be great if you could just use a VLM to generate rewards? TOPReward generates rewards directly from the probability of the “True” token in a VLM question-answering response, which makes it easy to implement, incredibly general, and surprisingly powerful. We talked to Shirui Chen and Cole Harrison to learn more. Watch Episode #75 of RoboPapers, with Chris Paxton and Jiafei Duan, now to learn more!

    Abstract

    While Vision-Language-Action (VLA) models have seen rapid progress in pretraining, their advancement in Reinforcement Learning (RL) remains hampered by low sample efficiency and sparse rewards in real-world settings. Developing generalizable process reward models is essential for providing the fine-grained feedback necessary to bridge this gap, yet existing temporal value functions often fail to generalize beyond their training domains. We introduce TOPReward, a novel, probabilistically grounded temporal value function that leverages the latent world knowledge of pretrained video Vision-Language Models (VLMs) to estimate robotic task progress. Unlike prior methods that prompt VLMs to directly output progress values, which are prone to numerical misrepresentation, TOPReward extracts task progress directly from the VLM's internal token logits. In zero-shot evaluations across 130+ distinct real-world tasks and multiple robot platforms (e.g., Franka, YAM, SO-100/101), TOPReward achieves 0.947 mean Value-Order Correlation (VOC) on Qwen3-VL, dramatically outperforming the state-of-the-art GVL baseline which achieves near-zero correlation on the same open-source model. We further demonstrate that TOPReward serves as a versatile tool for downstream applications, including success detection and reward-aligned behavior cloning.

    Learn More

    Project page: https://topreward.github.io/webpage/
    arXiv: https://arxiv.org/abs/2602.19313
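    The core idea, reading task progress off the “True” token probability instead of asking the VLM to print a number, can be sketched with plain Python. This is a minimal illustration under stated assumptions, not the authors' code. The logits are made up rather than taken from a real VLM, normalizing over only the “True” and “False” tokens is an assumption, and Value-Order Correlation is computed here as a Spearman rank correlation between the predicted values and the frame order.

```python
import math


def true_token_probability(logit_true, logit_false):
    """Two-way softmax over the 'True' and 'False' answer-token logits."""
    m = max(logit_true, logit_false)  # subtract the max for numerical stability
    e_true = math.exp(logit_true - m)
    e_false = math.exp(logit_false - m)
    return e_true / (e_true + e_false)


def ranks(values):
    """Average 1-based ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def value_order_correlation(values):
    """Spearman rank correlation between values and their time index."""
    n = len(values)
    rank_v = ranks(values)
    rank_t = [float(i + 1) for i in range(n)]
    mean_v = sum(rank_v) / n
    mean_t = sum(rank_t) / n
    cov = sum((a - mean_v) * (b - mean_t) for a, b in zip(rank_v, rank_t))
    var_v = sum((a - mean_v) ** 2 for a in rank_v)
    var_t = sum((b - mean_t) ** 2 for b in rank_t)
    if var_v == 0 or var_t == 0:
        return 0.0  # correlation is undefined for constant input
    return cov / math.sqrt(var_v * var_t)


# Hypothetical (logit_true, logit_false) pairs for five growing video prefixes,
# each answering a question like "Is the task complete? True or False".
logits = [(-2.0, 1.5), (-1.0, 1.0), (0.2, 0.1), (1.5, -0.5), (3.0, -2.0)]
progress = [true_token_probability(t, f) for t, f in logits]
voc = value_order_correlation(progress)
```

    With these made-up logits the progress estimates rise monotonically over time, so the rank correlation with frame order is 1. A value function whose estimates were unrelated to time would score near zero.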

    1 hr 1 min
