RoboPapers

Chris Paxton and Michael Cho

Chris Paxton & Michael Cho geek out over robotic papers with paper authors. robopapers.substack.com

  1. Ep#61: 1x World Model

    2 DAYS AGO

    Every home is different. That means a useful home robot must be able to perform a wide range of manipulation tasks zero-shot, which is a real challenge for robotics, since so many cutting-edge approaches require expert fine-tuning on a small set of in-domain data. Humanoid company 1X has a solution: world models. The internet is filled with human video, which has driven incredible performance in video generation models. Why not leverage the semantic and spatial knowledge captured by those video models to tell robots like the 1X NEO what to do? 1X Director of Evaluations Daniel Ho joins us on RoboPapers to talk about the company's new work on world models, why this is the future, and how to use video models to control a home robot to perform any task. (A rough sketch of the planning-with-a-world-model idea follows this entry.) Watch Episode #61 of RoboPapers, with Michael Cho and Chris Paxton, now!

    In their words, from the official 1X blog post: Many robot foundation models today are vision-language-action models (VLAs), which take a pretrained VLM and add an output head to predict robot actions (PI0.6, Helix, Groot N1.5). VLMs benefit from internet-scale knowledge, but are trained on objectives that emphasize visual and semantic understanding over prediction of physical dynamics. Tens of thousands of hours of costly robot data are needed to teach a model how to solve tasks considered simple for a human. Additionally, auxiliary objectives are often used to further coax spatial reasoning of physical interactions (MolmoAct, Gemini-Robotics 1.5).

    Learn more: Project Page: https://www.1x.tech/discover/world-model-self-learning

    55 min
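
The episode notes above don't specify 1X's architecture, but one common way to use a learned world model for control is sampling-based planning: propose candidate action sequences, imagine each one with the world model, and execute the first action of the best-scoring plan. Here is a minimal sketch of that idea; `world_model`, `score_fn`, and the Gaussian action sampling are illustrative assumptions, not 1X's method.

```python
import numpy as np

# Hypothetical interfaces: a learned world model that predicts a future
# observation sequence from (observation, action sequence), and a goal
# scorer. Neither is 1X's actual API; this only illustrates the idea.

def plan_with_world_model(world_model, score_fn, obs, horizon=16,
                          num_candidates=64, action_dim=7, rng=None):
    """Pick the first action of the best-scoring sampled action sequence."""
    rng = rng or np.random.default_rng(0)
    # Sample candidate action sequences (here: simple Gaussian noise).
    candidates = rng.normal(0.0, 1.0, size=(num_candidates, horizon, action_dim))
    best_score, best_actions = -np.inf, None
    for actions in candidates:
        # The world model imagines the rollout (e.g. future video frames) for this plan.
        predicted_future = world_model(obs, actions)
        score = score_fn(predicted_future)  # e.g. similarity to a goal image
        if score > best_score:
            best_score, best_actions = score, actions
    return best_actions[0]  # execute the first action, then replan
```
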
  2. Ep#60: Sim-to-Real Manipulation with VIRAL and Doorman

    28 JAN 2026

    For robots to be useful, they must be able to interact with a wide variety of environments; and yet, scaling interaction data is difficult, expensive, and time-consuming. Instead, much research revolves around sim-to-real manipulation, but until recently very little of it has tackled mobile manipulation. That has begun to change. Two recent papers from Tairan He and Haoru Xue show us how to unlock the potential of this technique, building policies which, without any real-world data at all, can move objects and open doors in the real world with a humanoid robot. (A rough sketch of the teacher-student distillation idea follows this entry.) Watch Episode #60 of RoboPapers now to learn more, hosted by Chris Paxton and Jiafei Duan. In this episode, we cover two papers: first, VIRAL: Visual Sim-to-Real at Scale for Humanoid Loco-Manipulation; and second, DoorMan: Opening the Sim-to-Real Door for Humanoid Pixel-to-Action Policy Transfer.

    Paper #1: VIRAL. Abstract: A key barrier to the real-world deployment of humanoid robots is the lack of autonomous loco-manipulation skills. We introduce VIRAL, a visual sim-to-real framework that learns humanoid loco-manipulation entirely in simulation and deploys it zero-shot to real hardware. VIRAL follows a teacher-student design: a privileged RL teacher, operating on full state, learns long-horizon loco-manipulation using a delta action space and reference state initialization. A vision-based student policy is then distilled from the teacher via large-scale simulation with tiled rendering, trained with a mixture of online DAgger and behavior cloning. We find that compute scale is critical: scaling simulation to tens of GPUs (up to 64) makes both teacher and student training reliable, while low-compute regimes often fail. To bridge the sim-to-real gap, VIRAL combines large-scale visual domain randomization (over lighting, materials, camera parameters, image quality, and sensor delays) with real-to-sim alignment of the dexterous hands and cameras. Deployed on a Unitree G1 humanoid, the resulting RGB-based policy performs continuous loco-manipulation for up to 54 cycles, generalizing to diverse spatial and appearance variations without any real-world fine-tuning, and approaching expert-level teleoperation performance. Extensive ablations dissect the key design choices required to make RGB-based humanoid loco-manipulation work in practice.

    Project page: https://viral-humanoid.github.io/ arXiv: https://arxiv.org/abs/2511.15200 Original thread on X

    Paper #2: DoorMan. Abstract: Recent progress in GPU-accelerated, photorealistic simulation has opened a scalable data-generation path for robot learning, where massive physics and visual randomization allow policies to generalize beyond curated environments. Building on these advances, we develop a teacher-student-bootstrap learning framework for vision-based humanoid loco-manipulation, using articulated-object interaction as a representative high-difficulty benchmark. Our approach introduces a staged-reset exploration strategy that stabilizes long-horizon privileged-policy training, and a GRPO-based fine-tuning procedure that mitigates partial observability and improves closed-loop consistency in sim-to-real RL. Trained entirely on simulation data, the resulting policy achieves robust zero-shot performance across diverse door types and outperforms human teleoperators by up to 31.7% in task completion time under the same whole-body control stack. This represents the first humanoid sim-to-real policy capable of diverse articulated loco-manipulation using pure RGB perception.

    Project page: https://doorman-humanoid.github.io/ arXiv: https://arxiv.org/abs/2512.01061

    1h 14m
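
The VIRAL abstract above describes distilling a vision-based student from a privileged RL teacher using a mixture of online DAgger and behavior cloning. The sketch below shows the shape of that loop; the `teacher`, `student`, and `env` interfaces are invented placeholders, not the paper's code.

```python
import random

def distill_student(teacher, student, env, iters=1000, dagger_prob=0.5):
    """Toy DAgger + behavior-cloning mixture for teacher->student distillation.

    `teacher` maps privileged state -> action, `student` maps images -> action,
    and `env` exposes reset()/step() returning both views. All interfaces are
    illustrative placeholders, not the paper's actual code.
    """
    dataset = []  # (image_obs, teacher_action) pairs
    for it in range(iters):
        state, image = env.reset()
        done = False
        while not done:
            expert_action = teacher.act(state)           # privileged label
            dataset.append((image, expert_action))
            # With some probability follow the student (DAgger-style),
            # otherwise follow the teacher (plain behavior cloning rollouts).
            action = student.act(image) if random.random() < dagger_prob else expert_action
            (state, image), done = env.step(action)
        student.update(dataset)                          # supervised regression step
    return student
```

Mixing in the student's own rollouts matters because pure behavior cloning only ever sees states the teacher visits, so small student errors compound once it drifts off that distribution.
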
  3. Ep#59: SAIL: Faster-than-Demonstration Execution of Imitation Learning Policies

    21 JAN 2026

    Teleoperating a robot is hard. This means that when performing a robot task via teleoperation — say, to collect examples for training a robot policy — execution is almost unavoidably slower than you would like, below the capabilities of either the human expert on their own or the robot performing the task. Wouldn’t it be great if there were a way to fix this? Unfortunately, it’s harder than it looks: you can’t just execute faster, as this alters the distribution of environment states the policy will encounter. Nadun Ranawaka Arachchige and Zhenyang Chen propose SAIL (Speed Adaptation for Imitation Learning), which adds error-adaptive guidance, adapts execution speed according to task structure, predicts controller-invariant action targets to ensure robustness across execution speeds, and explicitly models delays from, for example, sensor latency. (A rough sketch of the speed-modulation idea follows this entry.) Watch episode #59 of RoboPapers, with Chris Paxton and Michael Cho, to learn more!

    Abstract: Offline Imitation Learning (IL) methods such as Behavior Cloning are effective at acquiring complex robotic manipulation skills. However, existing IL-trained policies are confined to executing the task at the same speed as shown in demonstration data. This limits the task throughput of a robotic system, a critical requirement for applications such as industrial automation. In this paper, we introduce and formalize the novel problem of enabling faster-than-demonstration execution of visuomotor policies and identify fundamental challenges in robot dynamics and state-action distribution shifts. We instantiate the key insights as SAIL (Speed Adaptation for Imitation Learning), a full-stack system integrating four tightly connected components: (1) a consistency-preserving action inference algorithm for smooth motion at high speed, (2) high-fidelity tracking of controller-invariant motion targets, (3) adaptive speed modulation that dynamically adjusts execution speed based on motion complexity, and (4) action scheduling to handle real-world system latencies. Experiments on 12 tasks across simulation and two real, distinct robot platforms show that SAIL achieves up to a 4x speedup over demonstration speed in simulation and up to 3.2x speedup in the real world.

    Project site: https://nadunranawaka1.github.io/sail-policy/ arXiv: https://arxiv.org/abs/2506.11948

    49 min
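
Of SAIL's four components, adaptive speed modulation is the easiest to picture: resample an action trajectory onto a faster timeline while slowing down where the motion is complex. The sketch below does this with a crude curvature heuristic and linear interpolation; it is only an illustration of the idea, not SAIL's actual algorithm.

```python
import numpy as np

def retime_trajectory(waypoints, dt, max_speedup=3.0):
    """Resample a trajectory faster, slowing down where the motion is complex.

    `waypoints` is a (T, D) array of action targets sampled every `dt` seconds.
    The curvature-based complexity measure below is an invented stand-in for
    SAIL's speed-modulation rule.
    """
    waypoints = np.asarray(waypoints, dtype=float)
    # Crude "motion complexity": magnitude of the discrete second derivative.
    accel = np.linalg.norm(np.diff(waypoints, n=2, axis=0), axis=1)
    accel = np.concatenate([[accel[0]], accel, [accel[-1]]])        # pad back to length T
    complexity = accel / (accel.max() + 1e-8)
    # Per-step speedup: full speedup on simple segments, ~1x on complex ones.
    speedup = max_speedup - (max_speedup - 1.0) * complexity
    # Warp the time axis by the per-step speedup, then resample at the control rate.
    new_durations = dt / speedup
    new_t_axis = np.concatenate([[0.0], new_durations[:-1].cumsum()])
    query_t = np.arange(0.0, new_t_axis[-1], dt)
    resampled = np.stack(
        [np.interp(query_t, new_t_axis, waypoints[:, d]) for d in range(waypoints.shape[1])],
        axis=1,
    )
    return resampled
```
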
  4. Ep#58: RL-100: Performant Robotic Manipulation with Real-World Reinforcement Learning

    14 JAN 2026

    In order for robots to be deployed in the real world, performing tasks of real value, they must be reliable. Unfortunately, most robotic demos work maybe 70-80% of the time at best. The way to get better reliability is real-world reinforcement learning: having the robot teach itself how to perform the task up to a high level of success. The key is to start with a core of expert human data, use that to train a policy, then iteratively improve it, finishing with on-policy reinforcement learning. (A rough sketch of this staged pipeline follows this entry.) Kun Lei talks through a unified framework for imitation and reinforcement learning based on PPO, which enables this improvement process. In this episode, Kun Lei explains the theory behind his reinforcement learning method and how it allowed his robot to run in a shopping mall juicing oranges for seven hours at a time, among experiments on a wide variety of tasks and embodiments. Watch episode 58 of RoboPapers now, hosted by Michael Cho and Chris Paxton!

    Abstract: Real-world robotic manipulation in homes and factories demands reliability, efficiency, and robustness that approach or surpass the performance of skilled human operators. We present RL-100, a real-world reinforcement learning framework built on diffusion-based visuomotor policies. RL-100 unifies imitation and reinforcement learning under a single PPO-style objective applied within the denoising process, yielding conservative and stable policy improvements across both offline and online stages. To meet deployment latency constraints, we employ a lightweight consistency distillation procedure that compresses multi-step diffusion into a one-step controller for high-frequency control. The framework is task-, embodiment-, and representation-agnostic, and supports both single-action outputs and action-chunking control. We evaluate RL-100 on seven diverse real-robot manipulation tasks, ranging from dynamic pushing and agile bowling to pouring, cloth folding, unscrewing, and multi-stage juicing. RL-100 attains 100% success across evaluated trials, achieving 900 out of 900 successful episodes, including up to 250 out of 250 consecutive trials on one task, and matches or surpasses expert teleoperators in time-to-completion. Without retraining, a single policy attains approximately 90% zero-shot success under environmental and dynamics shifts, adapts in a few-shot regime to significant task variations (86.7%), and remains robust to aggressive human perturbations (about 95%). In a public shopping-mall deployment, the juicing robot served random customers continuously for roughly seven hours without failure. Together, these results suggest a practical path toward deployment-ready robot learning: start from human priors, align training objectives with human-grounded metrics, and reliably extend performance beyond human demonstrations.

    Learn more: Project Page: https://lei-kun.github.io/RL-100/ arXiv: https://arxiv.org/abs/2510.14830 Original thread on X

    1h 12m
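
The RL-100 recipe described above is a staged pipeline: behavior cloning on human demonstrations, iterative offline improvement on the robot's own rollouts, and finally on-policy PPO-style fine-tuning (which the paper applies inside the diffusion denoising process). The skeleton below captures only that outer loop; every method call is a hypothetical placeholder rather than the authors' implementation.

```python
def train_rl100_style(policy, demos, env, offline_iters=3, online_iters=10):
    """Outline of a demos -> offline RL -> online RL pipeline (illustrative only)."""
    # Stage 1: imitation learning on expert demonstrations.
    policy.behavior_clone(demos)

    # Stage 2: iterative offline improvement on the robot's own rollouts.
    replay = list(demos)
    for _ in range(offline_iters):
        rollouts = [collect_rollout(policy, env) for _ in range(20)]
        replay.extend(rollouts)
        policy.offline_rl_update(replay)      # conservative, off-policy update

    # Stage 3: on-policy (PPO-style) fine-tuning for the last bit of reliability.
    for _ in range(online_iters):
        batch = [collect_rollout(policy, env) for _ in range(10)]
        policy.ppo_update(batch)
    return policy


def collect_rollout(policy, env):
    """Run one episode and return its (obs, action, reward) trace."""
    obs, done, trace = env.reset(), False, []
    while not done:
        action = policy.act(obs)
        next_obs, reward, done = env.step(action)
        trace.append((obs, action, reward))
        obs = next_obs
    return trace
```
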
  5. Ep#57: Learning Dexterity from Human Videos with Gen2Act and SPIDER

    6 JAN 2026

    Teaching robots from human video is an important part of overcoming the “data gap” in robotics, but many of the details still need to be worked out. Homanga Bharadwaj tells us about two recent research papers, Gen2Act and SPIDER, which tackle different aspects of the problem. Gen2Act uses generative video models to create a reference for how a task should be performed given a language prompt; it then uses a multi-purpose policy that can “translate” from human video to robot motion. (A rough sketch of this pipeline follows this entry.) However, Gen2Act has its limitations, in particular when it comes to dexterous, contact-rich tasks. That’s where SPIDER comes in: it uses human data together with simulation to train policies across many different humanoid hands and datasets. Also of note: this is our first episode with our new rotating co-host, Jiafei Duan. To learn more, watch Episode #57 of RoboPapers now, with Chris Paxton and Jiafei Duan!

    Abstract for Gen2Act: How can robot manipulation policies generalize to novel tasks involving unseen object types and new motions? In this paper, we provide a solution in terms of predicting motion information from web data through human video generation and conditioning a robot policy on the generated video. Instead of attempting to scale robot data collection, which is expensive, we show how we can leverage video generation models trained on easily available web data to enable generalization. Our approach Gen2Act casts language-conditioned manipulation as zero-shot human video generation followed by execution with a single policy conditioned on the generated video. To train the policy, we use an order of magnitude less robot interaction data compared to what the video prediction model was trained on. Gen2Act doesn’t require fine-tuning the video model at all, and we directly use a pre-trained model for generating human videos. Our results on diverse real-world scenarios show how Gen2Act enables manipulating unseen object types and performing novel motions for tasks not present in the robot data.

    Abstract for SPIDER: Learning dexterous and agile policies for humanoid and dexterous hand control requires large-scale demonstrations, but collecting robot-specific data is prohibitively expensive. In contrast, abundant human motion data is readily available from motion capture, videos, and virtual reality. Due to the embodiment gap and missing dynamic information like force and torque, these demonstrations cannot be directly executed on robots. We propose Scalable Physics-Informed DExterous Retargeting (SPIDER), a physics-based retargeting framework to transform and augment kinematic-only human demonstrations into dynamically feasible robot trajectories at scale. Our key insight is that human demonstrations should provide global task structure and objective, while large-scale physics-based sampling with curriculum-style virtual contact guidance should refine trajectories to ensure dynamical feasibility and correct contact sequences. SPIDER scales across 9 diverse humanoid/dexterous hand embodiments and 6 datasets, improving success rates by 18% compared to standard sampling, while being 10× faster than reinforcement learning (RL) baselines, and enabling the generation of a 2.4M-frame dynamically feasible robot dataset for policy learning. By aligning human motion and robot feasibility at scale, SPIDER offers a general, embodiment-agnostic foundation for humanoid and dexterous hand control. As a universal retargeting method, SPIDER can work with diverse-quality data, including single RGB camera video, and can be applied to real-robot deployment and other downstream learning methods like RL to enable efficient closed-loop policy learning.

    Learn more: Project Page for Gen2Act: https://homangab.github.io/gen2act/ arXiv: https://arxiv.org/pdf/2409.16283 Project Page for SPIDER: https://jc-bao.github.io/spider-project/ arXiv: https://arxiv.org/abs/2511.09484 This post on X

    51 min
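
Gen2Act's inference loop, as described above, is: generate a human video of the task with a pretrained video model, then run a policy conditioned on that video closed-loop. A minimal sketch of that pipeline follows; `video_model`, `policy`, and `env` are stand-in interfaces, not the released system.

```python
def gen2act_style_rollout(video_model, policy, env, instruction, max_steps=200):
    """Zero-shot task execution via human-video generation + a video-conditioned policy.

    `video_model` and `policy` are stand-ins for a pretrained text+image-to-video
    model and a video-conditioned visuomotor policy; this is not the released code.
    """
    obs = env.reset()
    # One-time generation: imagine a human performing the task in this scene.
    human_video = video_model.generate(image=obs["rgb"], prompt=instruction)
    for _ in range(max_steps):
        # The policy "translates" the generated human video into robot actions,
        # conditioned on the live camera feed.
        action = policy.act(current_image=obs["rgb"], reference_video=human_video)
        obs, done = env.step(action)
        if done:
            break
    return obs
```

Keeping the video model frozen and generating only once per task is the design choice that lets the approach lean on internet-scale video pretraining without any robot-specific fine-tuning of that model.
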
  6. Ep#56: GSWorld: Closed-Loop Photo-Realistic Simulation Suite for Robotic Manipulation

    22 DEC 2025

    It’s long been a dream of roboticists to be able to teach a robot in simulation, so as to skip the long and expensive process of collecting large amounts of real-world training data. However, building simulations for robot tasks is extremely hard. Ideally, we could go from real data to a useful simulation. This is exactly what Guangqi Jiang and his co-authors do: they use 3D Gaussian splatting to reconstruct scenes, letting them create interactive environments that, when combined with a physics engine, allow for training robot policies that show zero-shot sim-to-real transfer (i.e., using no real-world demonstrations). (A rough sketch of the kind of scene description this requires follows this entry.) To learn more, watch Episode 56 of RoboPapers with Michael Cho and Chris Paxton now!

    Abstract: This paper presents GSWorld, a robust, photo-realistic simulator for robotics manipulation that combines 3D Gaussian Splatting with physics engines. Our framework advocates "closing the loop" of developing manipulation policies with reproducible evaluation of policies learned from real-robot data and sim2real policy training without using real robots. To enable photo-realistic rendering of diverse scenes, we propose a new asset format, which we term GSDF (Gaussian Scene Description File), that infuses Gaussian-on-Mesh representation with robot URDF and other objects. With a streamlined reconstruction pipeline, we curate a database of GSDF that contains 3 robot embodiments for single-arm and bimanual manipulation, as well as more than 40 objects. Combining GSDF with physics engines, we demonstrate several immediate interesting applications: (1) learning zero-shot sim2real pixel-to-action manipulation policy with photo-realistic rendering, (2) automated high-quality DAgger data collection for adapting policies to deployment environments, (3) reproducible benchmarking of real-robot manipulation policies in simulation, (4) simulation data collection by virtual teleoperation, and (5) zero-shot sim2real visual reinforcement learning.

    Learn more: Project Page: https://3dgsworld.github.io/ arXiv: https://arxiv.org/abs/2510.20813 Authors’ Original Thread on X

    46 min
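
The GSWorld abstract introduces GSDF, an asset format that ties Gaussian-on-Mesh appearance to robot URDFs and object geometry so a physics engine and a splat renderer can share one scene. The exact schema isn't given here, so the dataclasses below are only a guess at the kind of bookkeeping such a format needs, not the real GSDF specification.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class GSDFAsset:
    """One object or robot: physics geometry plus its Gaussian-splat appearance."""
    name: str
    urdf_or_mesh_path: str       # what the physics engine simulates
    gaussian_splat_path: str     # what the photo-realistic renderer draws
    pose_xyz_rpy: List[float] = field(default_factory=lambda: [0, 0, 0, 0, 0, 0])

@dataclass
class GSDFScene:
    """A hypothetical GSDF-style scene: robots + objects + a background splat."""
    robots: List[GSDFAsset]
    objects: List[GSDFAsset]
    background_splat_path: str

    def physics_assets(self):
        # Everything the physics engine needs to load.
        return [a.urdf_or_mesh_path for a in self.robots + self.objects]

    def render_assets(self):
        # Everything the Gaussian-splat renderer needs to load.
        return [a.gaussian_splat_path for a in self.robots + self.objects] + [
            self.background_splat_path
        ]
```
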
  7. Ep#55: Trace Anything: Representing Any Video in 4D via Trajectory Fields

    19 DEC 2025

    Modeling how worlds evolve over time is an important aspect of interacting with them. Video world models have become an exciting area of research in robotics over the past year in part for this reason. What if there were a better way to represent changes over time, though? Trace Anything represents a video as a trajectory field: every pixel in every frame is assigned a continuous trajectory through 3D space. This provides a unique foundation for all kinds of downstream tasks, like goal-conditioned manipulation and motion forecasting. We talked to Xinhang Liu to learn more. (A rough sketch of evaluating such a per-pixel B-spline trajectory follows this entry.) Watch Episode 55 of RoboPapers with Michael Cho and Chris Paxton now!

    Abstract: Effective spatio-temporal representation is fundamental to modeling, understanding, and predicting dynamics in videos. The atomic unit of a video, the pixel, traces a continuous 3D trajectory over time, serving as the primitive element of dynamics. Based on this principle, we propose representing any video as a Trajectory Field: a dense mapping that assigns a continuous 3D trajectory function of time to each pixel in every frame. With this representation, we introduce Trace Anything, a neural network that predicts the entire trajectory field in a single feed-forward pass. Specifically, for each pixel in each frame, our model predicts a set of control points that parameterizes a trajectory (i.e., a B-spline), yielding its 3D position at arbitrary query time instants. We trained the Trace Anything model on large-scale 4D data, including data from our new platform, and our experiments demonstrate that: (i) Trace Anything achieves state-of-the-art performance on our new benchmark for trajectory field estimation and performs competitively on established point-tracking benchmarks; (ii) it offers significant efficiency gains thanks to its one-pass paradigm, without requiring iterative optimization or auxiliary estimators; and (iii) it exhibits emergent abilities, including goal-conditioned manipulation, motion forecasting, and spatio-temporal fusion.

    Project Page: https://trace-anything.github.io/ arXiv: https://arxiv.org/abs/2510.13802 This Post on X

    54 min
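
Trace Anything parameterizes each pixel's 3D trajectory with B-spline control points, so recovering a position at an arbitrary query time is just spline evaluation. The sketch below shows that step with SciPy; the clamped-uniform knot vector and cubic degree are assumptions made for illustration, not necessarily the paper's exact parameterization.

```python
import numpy as np
from scipy.interpolate import BSpline

def trajectory_from_control_points(control_points, degree=3):
    """Build a clamped B-spline trajectory p(t), t in [0, 1], from 3D control points.

    `control_points` is an (N, 3) array, mimicking how a per-pixel trajectory
    could be parameterized; the knot/degree choice here is a generic assumption.
    """
    control_points = np.asarray(control_points, dtype=float)
    n = len(control_points)
    # Clamped uniform knots so the curve starts/ends at the first/last control point.
    interior = np.linspace(0.0, 1.0, n - degree + 1)
    knots = np.concatenate([np.zeros(degree), interior, np.ones(degree)])
    return BSpline(knots, control_points, degree)

# Usage: query one pixel's 3D position at arbitrary times in [0, 1].
ctrl = np.array([[0, 0, 0], [0.1, 0, 0.2], [0.3, 0.1, 0.3], [0.5, 0.1, 0.2], [0.6, 0, 0]])
traj = trajectory_from_control_points(ctrl)
positions = traj(np.linspace(0.0, 1.0, 5))   # (5, 3) array of xyz positions
```
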
  8. Ep#54: MemER: Scaling Up Memory for Robot Control via Experience Retrieval

    17 DEC 2025

    Most robot policies today still largely lack memory: they make all their decisions based on what they can see right now. MemER aims to change that by learning which frames are important; this lets it handle tasks like object search. Ajay Sridhar, Jenny Pan, and Satvik Sharma tell us how to achieve this fundamental capability for long-horizon task execution. (A schematic sketch of the keyframe-memory loop follows this entry.) Watch Episode #54 of RoboPapers with Michael Cho and Chris Paxton to learn more!

    Abstract: Humans rely on memory to perform tasks; our goal is to endow robot policies with the same ability. Naively conditioning on long observation histories is computationally expensive and brittle under covariate shift, while indiscriminate subsampling of history leads to irrelevant or redundant information. We propose a hierarchical policy framework, where the high-level policy is trained to select and track previous task-relevant keyframes from its experience. The high-level policy uses selected keyframes and the most recent frames when generating text instructions for a low-level policy to execute. This design is compatible with existing vision-language-action (VLA) models and enables the system to efficiently reason over long-horizon dependencies. In our experiments, we fine-tune Qwen2.5-VL-3B-Instruct and as the high-level and low-level policies respectively, using demonstrations supplemented with minimal language annotations. Our approach, MemER, outperforms prior methods on three real-world long-horizon robotic manipulation tasks that require minutes of memory.

    Project page: https://jen-pan.github.io/memer/ arXiv: https://arxiv.org/abs/2510.20328

    51 min
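
MemER's hierarchy, as described above: a high-level model looks at remembered keyframes plus the most recent frames, decides which frames to keep as memory, and emits a short text instruction that a low-level VLA executes. The loop below is a schematic sketch with invented `high_level_vlm` and `low_level_vla` wrappers, not the released MemER code.

```python
from collections import deque

def memer_style_loop(high_level_vlm, low_level_vla, env, task, max_steps=500,
                     max_keyframes=8, recent_window=4):
    """Hierarchical control with selected keyframes as memory (illustrative only)."""
    keyframes = []                               # frames the high level chose to remember
    recent = deque(maxlen=recent_window)         # short window of the latest observations
    obs = env.reset()
    for step in range(max_steps):
        recent.append(obs["rgb"])
        # High level: given memory + recent frames, pick an instruction and
        # decide which of the candidate frames to keep as memory.
        instruction, keep_indices = high_level_vlm.plan(
            task=task, keyframes=keyframes, recent_frames=list(recent)
        )
        candidates = keyframes + list(recent)
        keyframes = [candidates[i] for i in keep_indices][:max_keyframes]
        # Low level: a VLA executes the short-horizon language instruction.
        action = low_level_vla.act(image=obs["rgb"], instruction=instruction)
        obs, done = env.step(action)
        if done:
            break
    return obs
```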
