RoboPapers

Chris Paxton and Michael Cho

Chris Paxton & Michael Cho geek out over robotics papers with their authors. robopapers.substack.com

  1. Ep#63: NovaFlow: Zero-Shot Manipulation via Actionable Flow from Generated Videos

    1D AGO

    The holy grail of robotics is to perform previously unseen, out-of-distribution manipulation tasks “zero shot” in a new environment. NovaFlow proposes an approach which (1) generates a video, (2) computes predicted flow — how points move through the scene — and (3) uses this flow as an objective to generate a motion (a toy sketch of the flow-to-pose step follows this entry). Using this procedure, NovaFlow generates motions in unseen scenes, for unseen tasks, and can transfer across embodiments. We are joined by Hongyu Li and Jiahui Fu from RAI. Watch Episode #63 of RoboPapers now to learn more! Abstract: Enabling robots to execute novel manipulation tasks zero-shot is a central goal in robotics. Most existing methods assume in-distribution tasks or rely on fine-tuning with embodiment-matched data, limiting transfer across platforms. We present NovaFlow, an autonomous manipulation framework that converts a task description into an actionable plan for a target robot without any demonstrations. Given a task description, NovaFlow synthesizes a video using a video generation model and distills it into 3D actionable object flow using off-the-shelf perception modules. From the object flow, it computes relative poses for rigid objects and realizes them as robot actions via grasp proposals and trajectory optimization. For deformable objects, this flow serves as a tracking objective for model-based planning with a particle-based dynamics model. By decoupling task understanding from low-level control, NovaFlow naturally transfers across embodiments. We validate on rigid, articulated, and deformable object manipulation tasks using a table-top Franka arm and a Spot quadrupedal mobile robot, and achieve effective zero-shot execution without demonstrations or embodiment-specific training. Learn more: Project site: https://novaflow.lhy.xyz/ arXiv: https://arxiv.org/abs/2510.08568 This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com

    1h 9m
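
For the rigid-object case, the step from 3D object flow to a relative pose is a classic least-squares rigid alignment. Below is a minimal sketch using the Kabsch algorithm, assuming the flow provides corresponding 3D points before and after motion; it illustrates the idea, not NovaFlow's actual implementation, and the variable names are our own.

```python
# Minimal sketch: recover a rigid relative pose from 3D object flow via the
# Kabsch algorithm (least-squares rigid alignment). Illustrative only; not
# NovaFlow's implementation, and all variable names are assumptions.
import numpy as np

def relative_pose_from_flow(p0: np.ndarray, p1: np.ndarray):
    """p0, p1: (N, 3) corresponding object points before/after the flow step.
    Returns (R, t) such that p1 ~= (R @ p0.T).T + t."""
    c0, c1 = p0.mean(axis=0), p1.mean(axis=0)      # centroids
    H = (p0 - c0).T @ (p1 - c1)                    # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c1 - R @ c0
    return R, t

# Toy check: a known rotation and translation is recovered from noisy flow.
rng = np.random.default_rng(0)
p0 = rng.normal(size=(100, 3))
th = np.pi / 6
R_true = np.array([[np.cos(th), -np.sin(th), 0.0],
                   [np.sin(th),  np.cos(th), 0.0],
                   [0.0,         0.0,        1.0]])
p1 = p0 @ R_true.T + np.array([0.1, -0.2, 0.3]) + 1e-3 * rng.normal(size=(100, 3))
R, t = relative_pose_from_flow(p0, p1)
assert np.allclose(R, R_true, atol=1e-2) and np.allclose(t, [0.1, -0.2, 0.3], atol=1e-2)
```

In a full pipeline, a sequence of such relative poses would then be handed to grasp proposal and trajectory optimization, as the abstract describes.
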
  2. Ep#62: PolaRiS: Scalable Real-to-Sim Evaluations for Generalist Robot Policies

    FEB 11

    Evaluating robot policies is hard. Ideally, instead of testing every new policy on a real robot, you could test in simulation; but simulations rarely correlate well with real-world performance, and making good, useful simulations takes a great deal of time and effort. That’s where PolaRiS comes in: it’s a toolkit that lets you take a short video of a real scene and turn it into a high-fidelity simulation. It provides what you need to build a good evaluation environment, and it “ships” with off-the-shelf environments that already show strong sim-to-real correlation, meaning that simulated results can be used to predict real-world policy performance (a minimal correlation check is sketched after this entry). Arhan Jain and Karl Pertsch join us to talk about what they have built, why, and how you can use it. Watch Episode #62 of RoboPapers, with Chris Paxton and Jiafei Duan, now! Abstract: A significant challenge for robot learning research is our ability to accurately measure and compare the performance of robot policies. Benchmarking in robotics is historically challenging due to the stochasticity, reproducibility issues, and time-consuming nature of real-world rollouts. This challenge is exacerbated for recent generalist policies, which have to be evaluated across a wide variety of scenes and tasks. Evaluation in simulation offers a scalable complement to real-world evaluations, but the visual and physical domain gap between existing simulation benchmarks and the real world has made them an unreliable signal for policy improvement. Furthermore, building realistic and diverse simulated environments has traditionally required significant human effort and expertise. To bridge the gap, we introduce Policy Evaluation and Environment Reconstruction in Simulation (PolaRiS), a scalable real-to-sim framework for high-fidelity simulated robot evaluation. PolaRiS utilizes neural reconstruction methods to turn short video scans of real-world scenes into interactive simulation environments. Additionally, we develop a simple simulation data co-training recipe that bridges remaining real-to-sim gaps and enables zero-shot evaluation in unseen simulation environments. Through extensive paired evaluations between simulation and the real world, we demonstrate that PolaRiS evaluations provide a much stronger correlation to real-world generalist policy performance than existing simulated benchmarks. Its simplicity also enables rapid creation of diverse simulated environments. As such, this work takes a step towards distributed and democratized evaluation for the next generation of robotic foundation models. Learn more: Project Page: https://polaris-evals.github.io/ arXiv: https://arxiv.org/abs/2512.16881 This post on X This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com

    1 hr
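
The headline claim, that PolaRiS evaluations track real-world performance, boils down to comparing paired success rates across policies. Here is a minimal sketch of such a correlation check with made-up numbers; the paper's own analysis may use different statistics.

```python
# Sketch: checking sim-to-real correlation across policy checkpoints.
# The success rates below are made-up numbers, not results from the paper.
import numpy as np
from scipy import stats

sim_success  = np.array([0.15, 0.30, 0.45, 0.55, 0.70])  # simulated eval
real_success = np.array([0.10, 0.35, 0.40, 0.60, 0.65])  # paired real-robot eval

r, _ = stats.pearsonr(sim_success, real_success)     # linear correlation
rho, _ = stats.spearmanr(sim_success, real_success)  # rank (ordering) correlation
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Rank correlation is arguably the more useful number, since the practical question is whether simulation preserves the ordering of candidate policies.
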
  3. Ep#61: 1x World Model

    FEB 4

    Every home is different. That means that to build a useful home robot, we must be able to perform a wide range of manipulation tasks zero-shot — which is a real challenge for robotics, since so many cutting-edge approaches require expert fine-tuning on a small set of in-domain data. Humanoid company 1X has a solution: world models. The internet is filled with human videos, and this has resulted in incredible performance in video models. Why not leverage the semantic and spatial knowledge captured by those video models to tell robots like the 1X NEO what to do? 1X Director of Evaluations Daniel Ho joins us on RoboPapers to talk about the new work the company is doing in world models, why this is the future, and how to use video models to control a home robot to perform any task. Watch Episode #61 of RoboPapers, with Michael Cho and Chris Paxton, now! In their words, from the official 1X blog post: Many robot foundation models today are vision-language-action models (VLAs), which take a pretrained VLM and add an output head to predict robot actions (PI0.6, Helix, Groot N1.5). VLMs benefit from internet-scale knowledge, but are trained on objectives that emphasize visual and semantic understanding over prediction of physical dynamics. Tens of thousands of hours of costly robot data are needed to teach a model how to solve tasks considered simple for a human. Additionally, auxiliary objectives are often used to further coax spatial reasoning of physical interactions (MolmoAct, Gemini-Robotics 1.5). (A generic sketch of this VLA pattern follows this entry.) Learn more: Project Page: https://www.1x.tech/discover/world-model-self-learning This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com

    55 min
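
For context on the VLA recipe the 1X post critiques (a pretrained VLM with a bolted-on action head), here is a generic minimal sketch. The backbone is a stand-in MLP rather than a real VLM, and all names and sizes are illustrative assumptions, not the architecture of 1X or any model named above.

```python
# Generic sketch of the VLA pattern described above: frozen pretrained
# backbone plus a small action head. The "backbone" is a stand-in MLP, not a
# real VLM; all names and sizes are illustrative only.
import torch
import torch.nn as nn

class TinyVLA(nn.Module):
    def __init__(self, obs_dim=512, feat_dim=256, action_dim=7, chunk=16):
        super().__init__()
        # Stand-in for a pretrained VLM encoder, frozen during head training.
        self.backbone = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.GELU())
        for p in self.backbone.parameters():
            p.requires_grad = False
        # Output head: predicts a chunk of future actions from the features.
        self.head = nn.Linear(feat_dim, action_dim * chunk)
        self.action_dim, self.chunk = action_dim, chunk

    def forward(self, obs_emb: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(obs_emb)).view(-1, self.chunk, self.action_dim)

model = TinyVLA()
print(model(torch.randn(2, 512)).shape)  # torch.Size([2, 16, 7]): an action chunk
```

The post's argument is that even with internet-scale pretraining in the backbone, this recipe still needs tens of thousands of hours of robot data for the action head, which is what world models aim to sidestep.
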
  4. Ep#60: Sim-to-Real Manipulation with VIRAL and DoorMan

    JAN 28

    For robots to be useful, they must be able to interact with a wide variety of environments; and yet, scaling interaction data is difficult, expensive, and time-consuming. Instead, much research revolves around sim-to-real manipulation — but until recently, this has mostly not included mobile manipulation. That has begun to change: two recent papers from Tairan He and Haoru Xue show us how to unlock the potential of this technique, building policies which, without any real data at all, can move objects around and open doors in the real world with a humanoid robot (a toy sketch of the teacher-student distillation recipe follows this entry). Watch Episode #60 of RoboPapers now to learn more, hosted by Chris Paxton and Jiafei Duan. In this episode, we cover two papers. First is VIRAL: Visual Sim-to-Real at Scale for Humanoid Loco-Manipulation; second is DoorMan: Opening the Sim-to-Real Door for Humanoid Pixel-to-Action Policy Transfer. Paper #1: VIRAL Abstract: A key barrier to the real-world deployment of humanoid robots is the lack of autonomous loco-manipulation skills. We introduce VIRAL, a visual sim-to-real framework that learns humanoid loco-manipulation entirely in simulation and deploys it zero-shot to real hardware. VIRAL follows a teacher-student design: a privileged RL teacher, operating on full state, learns long-horizon loco-manipulation using a delta action space and reference state initialization. A vision-based student policy is then distilled from the teacher via large-scale simulation with tiled rendering, trained with a mixture of online DAgger and behavior cloning. We find that compute scale is critical: scaling simulation to tens of GPUs (up to 64) makes both teacher and student training reliable, while low-compute regimes often fail. To bridge the sim-to-real gap, VIRAL combines large-scale visual domain randomization (over lighting, materials, camera parameters, image quality, and sensor delays) with real-to-sim alignment of the dexterous hands and cameras. Deployed on a Unitree G1 humanoid, the resulting RGB-based policy performs continuous loco-manipulation for up to 54 cycles, generalizing to diverse spatial and appearance variations without any real-world fine-tuning, and approaching expert-level teleoperation performance. Extensive ablations dissect the key design choices required to make RGB-based humanoid loco-manipulation work in practice. Project page: https://viral-humanoid.github.io/ arXiv: https://arxiv.org/abs/2511.15200 Original thread on X: Paper #2: DoorMan Abstract: Recent progress in GPU-accelerated, photorealistic simulation has opened a scalable data-generation path for robot learning, where massive physics and visual randomization allow policies to generalize beyond curated environments. Building on these advances, we develop a teacher-student-bootstrap learning framework for vision-based humanoid loco-manipulation, using articulated-object interaction as a representative high-difficulty benchmark. Our approach introduces a staged-reset exploration strategy that stabilizes long-horizon privileged-policy training, and a GRPO-based fine-tuning procedure that mitigates partial observability and improves closed-loop consistency in sim-to-real RL. Trained entirely on simulation data, the resulting policy achieves robust zero-shot performance across diverse door types and outperforms human teleoperators by up to 31.7% in task completion time under the same whole-body control stack. This represents the first humanoid sim-to-real policy capable of diverse articulated loco-manipulation using pure RGB perception. Project page: https://doorman-humanoid.github.io/ arXiv: https://arxiv.org/abs/2512.01061 This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com

    1h 14m
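
VIRAL's student is distilled with a mixture of online DAgger and behavior cloning. The sketch below shows that general pattern under toy stand-ins: a frozen "privileged teacher" labels both pre-collected rollouts (the BC term) and states visited by the student (the DAgger term). The networks, data, and 50/50 mixing ratio are all assumptions, not the paper's settings.

```python
# Toy sketch of teacher-student distillation mixing online DAgger with
# behavior cloning, as in VIRAL's student training. All networks, data, and
# the 50/50 mixing ratio are stand-in assumptions, not the paper's settings.
import torch
import torch.nn as nn

state_dim, obs_dim, act_dim = 16, 32, 7
teacher = nn.Linear(state_dim, act_dim)  # stand-in for the privileged RL teacher
student = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

# Pre-collected teacher rollouts for the BC term (random stand-in tensors).
bc_obs, bc_state = torch.randn(256, obs_dim), torch.randn(256, state_dim)

for step in range(100):
    # DAgger term: states visited by rolling out the *student* (random
    # stand-ins here), relabeled with the teacher's actions.
    da_obs, da_state = torch.randn(64, obs_dim), torch.randn(64, state_dim)
    with torch.no_grad():
        da_labels, bc_labels = teacher(da_state), teacher(bc_state)
    loss = 0.5 * nn.functional.mse_loss(student(da_obs), da_labels) \
         + 0.5 * nn.functional.mse_loss(student(bc_obs), bc_labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The DAgger term matters because pure behavior cloning only covers states the teacher visits; querying the teacher on the student's own visitation distribution corrects compounding errors.
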
  5. Ep#59: SAIL: Faster-than-Demonstration Execution of Imitation Learning Policies

    JAN 21

    Teleoperating a robot is hard. This means that when performing a robot task via teleoperation — say, to collect examples for training a robot policy — it’s almost unavoidably slower than you would like, falling short of what either the human expert alone or the robot hardware could achieve. Wouldn’t it be great if there were a way to fix this? Unfortunately, it’s harder than it looks. You can’t just execute faster, as this alters the distribution of environment states the policy will encounter. Nadun Ranawaka Arachchige and Zhenyang Chen propose Speed-Adaptive Imitation Learning (SAIL), which adds error-adaptive guidance, adapts execution speed according to task structure (an illustrative speed-modulation heuristic is sketched after this entry), predicts controller-invariant action targets to ensure robustness across execution speeds, and explicitly models delays from, for example, sensor latency. Watch Episode #59 of RoboPapers, with Chris Paxton and Michael Cho, to learn more! Abstract: Offline Imitation Learning (IL) methods such as Behavior Cloning are effective at acquiring complex robotic manipulation skills. However, existing IL-trained policies are confined to executing the task at the same speed as shown in demonstration data. This limits the task throughput of a robotic system, a critical requirement for applications such as industrial automation. In this paper, we introduce and formalize the novel problem of enabling faster-than-demonstration execution of visuomotor policies and identify fundamental challenges in robot dynamics and state-action distribution shifts. We instantiate the key insights as SAIL (Speed Adaptation for Imitation Learning), a full-stack system integrating four tightly-connected components: (1) a consistency-preserving action inference algorithm for smooth motion at high speed, (2) high-fidelity tracking of controller-invariant motion targets, (3) adaptive speed modulation that dynamically adjusts execution speed based on motion complexity, and (4) action scheduling to handle real-world system latencies. Experiments on 12 tasks across simulation and two distinct real robot platforms show that SAIL achieves up to a 4x speedup over demonstration speed in simulation and up to 3.2x speedup in the real world. Project site: https://nadunranawaka1.github.io/sail-policy/ arXiv: https://arxiv.org/abs/2506.11948 This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com

    49 min
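
To give a feel for adaptive speed modulation, here is an illustrative heuristic: retime a demonstrated path so low-curvature (simple) segments play back faster while high-curvature segments stay near demo speed. This is a toy curvature rule invented for illustration, not SAIL's actual modulation algorithm.

```python
# Illustrative sketch of speed modulation by motion complexity: straight
# segments of a demo path play faster, sharp bends stay near demo speed.
# A toy heuristic, not SAIL's algorithm.
import numpy as np

def speedup_by_curvature(path: np.ndarray, max_speedup: float = 4.0) -> np.ndarray:
    """path: (T, 3) demo waypoints at uniform demo timesteps.
    Returns per-step speedup factors in [1, max_speedup]."""
    v = np.gradient(path, axis=0)                  # velocity estimate
    a = np.gradient(v, axis=0)                     # acceleration estimate
    speed = np.linalg.norm(v, axis=1) + 1e-8
    # Curvature of a space curve: |v x a| / |v|^3
    kappa = np.linalg.norm(np.cross(v, a), axis=1) / speed**3
    kappa = kappa / (kappa.max() + 1e-8)           # normalize to [0, 1]
    return 1.0 + (max_speedup - 1.0) * (1.0 - kappa)   # straight => fast

# Toy usage: an L-shaped path keeps near-demo speed only around its corner.
t = np.linspace(0, 1, 50)
path = np.stack([np.minimum(t, 0.5), np.maximum(t - 0.5, 0.0), np.zeros_like(t)], axis=1)
factors = speedup_by_curvature(path)
print(factors.min(), factors.max())
```

SAIL's full system also handles controller-invariant targets and real-world latency, which a pure retiming rule like this ignores.
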
  6. Ep#58: RL-100: Performant Robotic Manipulation with Real-World Reinforcement Learning

    JAN 14

    In order for robots to be deployed in the real world, performing tasks of real value, they must be reliable. Unfortunately, most robot demos work maybe 70-80% of the time at best. The way to get better reliability is real-world reinforcement learning: having the robot teach itself how to perform the task up to a high level of success. The key is to start with a core of expert human data, use that to train a policy, then iteratively improve it, finishing with on-policy reinforcement learning. Kun Lei talks through a unified framework for imitation and reinforcement learning based on PPO (the underlying clipped objective is sketched after this entry), which enables this improvement process. In this episode, Kun Lei explains the theory behind his reinforcement learning method and how it allowed his robot to run in a shopping mall juicing oranges for seven hours at a time, alongside experiments on a wide variety of tasks and embodiments. Watch Episode #58 of RoboPapers now, hosted by Michael Cho and Chris Paxton! Abstract: Real-world robotic manipulation in homes and factories demands reliability, efficiency, and robustness that approach or surpass the performance of skilled human operators. We present RL-100, a real-world reinforcement learning framework built on diffusion-based visuomotor policies. RL-100 unifies imitation and reinforcement learning under a single PPO-style objective applied within the denoising process, yielding conservative and stable policy improvements across both offline and online stages. To meet deployment latency constraints, we employ a lightweight consistency distillation procedure that compresses multi-step diffusion into a one-step controller for high-frequency control. The framework is task-, embodiment-, and representation-agnostic, and supports both single-action outputs and action-chunking control. We evaluate RL-100 on seven diverse real-robot manipulation tasks, ranging from dynamic pushing and agile bowling to pouring, cloth folding, unscrewing, and multi-stage juicing. RL-100 attains 100% success across evaluated trials, achieving 900 out of 900 successful episodes, including up to 250 out of 250 consecutive trials on one task, and matches or surpasses expert teleoperators in time-to-completion. Without retraining, a single policy attains approximately 90% zero-shot success under environmental and dynamics shifts, adapts in a few-shot regime to significant task variations (86.7%), and remains robust to aggressive human perturbations (about 95%). In a public shopping-mall deployment, the juicing robot served random customers continuously for roughly seven hours without failure. Together, these results suggest a practical path toward deployment-ready robot learning: start from human priors, align training objectives with human-grounded metrics, and reliably extend performance beyond human demonstrations. Learn more: Project Page: https://lei-kun.github.io/RL-100/ arXiv: https://arxiv.org/abs/2510.14830 Original thread on X: This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com

    1h 12m
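
RL-100's unifying objective is PPO-style. For reference, the standard clipped surrogate it builds on looks like this in code; the paper applies a modified form inside the diffusion denoising steps, which this generic sketch does not capture.

```python
# Generic PPO clipped surrogate loss, the standard objective that RL-100
# adapts for use within the denoising process. This is the textbook version,
# not the paper's denoising-specific variant.
import torch

def ppo_clip_loss(logp_new: torch.Tensor,
                  logp_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate: conservative, stable policy improvement."""
    ratio = torch.exp(logp_new - logp_old)                   # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()             # maximize surrogate

# Toy usage with made-up log-probs and advantages.
loss = ppo_clip_loss(torch.randn(32, requires_grad=True), torch.randn(32), torch.randn(32))
loss.backward()
```

The clipping is what makes the improvement "conservative": updates that move the policy too far from the data-collecting policy get no extra credit, which matches the paper's emphasis on stable offline-to-online improvement.
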
  7. Ep#57: Learning Dexterity from Human Videos with Gen2Act and SPIDER

    JAN 6

    Teaching robots from human video is an important part of overcoming the “data gap” in robotics, but many of the details still need to be worked out. Homanga Bharadhwaj tells us about two recent research papers, Gen2Act and SPIDER, which cover different aspects of the problem: Gen2Act uses generative video models to create a reference for how a task should be performed given a language prompt; then, it uses a multi-purpose policy that can “translate” from human video to robot motion. However, Gen2Act has its limitations, in particular when it comes to dexterous, contact-rich tasks. That’s where SPIDER comes in: it uses human data together with simulation to train policies across many different humanoid hands and datasets (a toy sketch of SPIDER-style sampling-based refinement follows this entry). Also of note: this is our first episode with our new rotating co-host, Jiafei Duan. To learn more, watch Episode #57 of RoboPapers now, with Chris Paxton and Jiafei Duan! Abstract for Gen2Act: How can robot manipulation policies generalize to novel tasks involving unseen object types and new motions? In this paper, we provide a solution in terms of predicting motion information from web data through human video generation and conditioning a robot policy on the generated video. Instead of attempting to scale robot data collection, which is expensive, we show how we can leverage video generation models trained on easily available web data to enable generalization. Our approach Gen2Act casts language-conditioned manipulation as zero-shot human video generation followed by execution with a single policy conditioned on the generated video. To train the policy, we use an order of magnitude less robot interaction data compared to what the video prediction model was trained on. Gen2Act doesn’t require fine-tuning the video model at all, and we directly use a pre-trained model for generating human videos. Our results on diverse real-world scenarios show how Gen2Act enables manipulating unseen object types and performing novel motions for tasks not present in the robot data. Abstract for SPIDER: Learning dexterous and agile policies for humanoid and dexterous hand control requires large-scale demonstrations, but collecting robot-specific data is prohibitively expensive. In contrast, abundant human motion data is readily available from motion capture, videos, and virtual reality. Due to the embodiment gap and missing dynamic information like force and torque, these demonstrations cannot be directly executed on robots. We propose Scalable Physics-Informed DExterous Retargeting (SPIDER), a physics-based retargeting framework to transform and augment kinematic-only human demonstrations into dynamically feasible robot trajectories at scale. Our key insight is that human demonstrations should provide global task structure and objectives, while large-scale physics-based sampling with curriculum-style virtual contact guidance should refine trajectories to ensure dynamical feasibility and correct contact sequences. SPIDER scales across 9 diverse humanoid/dexterous hand embodiments and 6 datasets, improving success rates by 18% compared to standard sampling, while being 10× faster than reinforcement learning (RL) baselines, and enabling the generation of a 2.4M-frame dynamically feasible robot dataset for policy learning. By aligning human motion and robot feasibility at scale, SPIDER offers a general, embodiment-agnostic foundation for humanoid and dexterous hand control. As a universal retargeting method, SPIDER works with data of diverse quality, including single-RGB-camera video, and can be applied to real-robot deployment and other downstream learning methods like RL to enable efficient closed-loop policy learning. Learn more: Project Page for Gen2Act: https://homangab.github.io/gen2act/ arXiv: https://arxiv.org/pdf/2409.16283 Project Page for SPIDER: https://jc-bao.github.io/spider-project/ arXiv: https://arxiv.org/abs/2511.09484 This post on X This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com

    51 min
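
SPIDER's refinement of kinematic human demos rests on large-scale sampling against physics-informed costs. The general pattern (sample perturbations of the reference trajectory, score them, refit to the elites) is a cross-entropy-method-style loop. The sketch below uses a stand-in cost with tracking and smoothness terms in place of SPIDER's contact-guided physics rollouts; nothing here is the paper's implementation.

```python
# CEM-style sketch of the sampling pattern behind physics-based retargeting:
# perturb a human-derived reference trajectory, score samples, refit to the
# elites. The cost is a stand-in, not SPIDER's physics-informed objective.
import numpy as np

def stand_in_cost(traj: np.ndarray, reference: np.ndarray) -> float:
    # Stay near the human reference (task structure) plus a smoothness term
    # standing in for dynamical-feasibility and contact costs.
    tracking = np.sum((traj - reference) ** 2)
    smoothness = np.sum(np.diff(traj, n=2, axis=0) ** 2)
    return tracking + 10.0 * smoothness

def refine(reference: np.ndarray, iters=20, pop=256, elite_frac=0.1, sigma=0.05):
    rng = np.random.default_rng(0)
    mean, std = reference.copy(), np.full_like(reference, sigma)
    n_elite = int(pop * elite_frac)
    for _ in range(iters):
        samples = mean + std * rng.standard_normal((pop, *reference.shape))
        costs = np.array([stand_in_cost(s, reference) for s in samples])
        elites = samples[np.argsort(costs)[:n_elite]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-4
    return mean

reference = np.cumsum(np.random.default_rng(1).normal(size=(50, 7)), axis=0) * 0.01
refined = refine(reference)   # (50, 7) smoother trajectory near the reference
```

In SPIDER, the scoring would come from physics simulation with curriculum-style virtual contact guidance, which is what enforces feasible contact sequences rather than mere smoothness.
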
  8. Ep#56: GSWorld: Closed-Loop Photo-Realistic Simulation Suite for Robotic Manipulation

    DEC 22, 2025

    It’s long been a dream of roboticists to be able to teach a robot in simulation, so as to skip the long and expensive process of collecting large amounts of real-world training data. However, building simulations for robot tasks is extremely hard. Ideally, we could go from real data to a useful simulation. This is exactly what Guangqi Jiang and his co-authors do: they use 3D Gaussian splatting to reconstruct scenes, which lets them create interactive environments that, when combined with a physics engine, allow for training robot policies that show zero-shot sim-to-real transfer (i.e., using no real-world demonstrations). (A hypothetical sketch of the GSDF asset idea follows this entry.) To learn more, watch Episode #56 of RoboPapers with Michael Cho and Chris Paxton now! Abstract: This paper presents GSWorld, a robust, photo-realistic simulator for robotics manipulation that combines 3D Gaussian Splatting with physics engines. Our framework advocates “closing the loop” of developing manipulation policies with reproducible evaluation of policies learned from real-robot data and sim2real policy training without using real robots. To enable photo-realistic rendering of diverse scenes, we propose a new asset format, which we term GSDF (Gaussian Scene Description File), that infuses Gaussian-on-Mesh representation with robot URDF and other objects. With a streamlined reconstruction pipeline, we curate a database of GSDF that contains 3 robot embodiments for single-arm and bimanual manipulation, as well as more than 40 objects. Combining GSDF with physics engines, we demonstrate several immediate interesting applications: (1) learning zero-shot sim2real pixel-to-action manipulation policy with photo-realistic rendering, (2) automated high-quality DAgger data collection for adapting policies to deployment environments, (3) reproducible benchmarking of real-robot manipulation policies in simulation, (4) simulation data collection by virtual teleoperation, and (5) zero-shot sim2real visual reinforcement learning. Learn more: Project Page: https://3dgsworld.github.io/ arXiv: https://arxiv.org/abs/2510.20813 Authors’ Original Thread on X This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com

    46 min
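
GSWorld's GSDF asset format pairs a Gaussian-on-Mesh appearance representation with the URDF and geometry a physics engine needs. One can picture it as a manifest like the following; this schema is hypothetical, invented for illustration, and is not the actual GSDF specification.

```python
# Hypothetical illustration of a GSDF-style asset manifest: pairing a Gaussian
# splat (appearance) with a URDF (physics). This schema is invented for
# illustration and is NOT the actual GSDF file format.
from dataclasses import dataclass, field

@dataclass
class GSDFAsset:
    name: str
    splat_path: str          # Gaussian splat file -> photo-realistic rendering
    urdf_path: str           # kinematics + collision geometry -> physics engine
    scale: float = 1.0
    # Per-link binding so splats move with the articulated physics bodies.
    link_to_splat_group: dict[str, int] = field(default_factory=dict)

franka = GSDFAsset(
    name="franka_panda",
    splat_path="assets/franka/appearance.splat",   # hypothetical paths
    urdf_path="assets/franka/panda.urdf",
    link_to_splat_group={"panda_link0": 0, "panda_hand": 7},
)
```

The key design point the paper describes is exactly this pairing: the physics engine simulates the URDF side while the splat side renders each pose photo-realistically, closing the loop between evaluation and sim-to-real training.
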
