RoboPapers

Chris Paxton and Michael Cho

Chris Paxton & Michael Cho geek out over robotics papers with paper authors. robopapers.substack.com

  1. Ep#56: GSWorld: Closed-Loop Photo-Realistic Simulation Suite for Robotic Manipulation

    5 DAYS AGO

    Ep#56: GSWorld: Closed-Loop Photo-Realistic Simulation Suite for Robotic Manipulation

    It’s long been a dream of roboticists to teach robots in simulation, skipping the long and expensive process of collecting large amounts of real-world training data. However, building simulations for robot tasks is extremely hard. Ideally, we could go from real data to a useful simulation. This is exactly what Guangqi Jiang and his co-authors do: they use 3D Gaussian splatting to reconstruct scenes, creating interactive environments that, when combined with a physics engine, allow training robot policies that show zero-shot sim-to-real transfer (i.e., using no real-world demonstrations). To learn more, watch Episode 56 of RoboPapers with Michael Cho and Chris Paxton now!

    Abstract: This paper presents GSWorld, a robust, photo-realistic simulator for robotics manipulation that combines 3D Gaussian Splatting with physics engines. Our framework advocates "closing the loop" of developing manipulation policies with reproducible evaluation of policies learned from real-robot data and sim2real policy training without using real robots. To enable photo-realistic rendering of diverse scenes, we propose a new asset format, which we term GSDF (Gaussian Scene Description File), that infuses Gaussian-on-Mesh representation with robot URDF and other objects. With a streamlined reconstruction pipeline, we curate a database of GSDF that contains 3 robot embodiments for single-arm and bimanual manipulation, as well as more than 40 objects. Combining GSDF with physics engines, we demonstrate several immediate interesting applications: (1) learning zero-shot sim2real pixel-to-action manipulation policy with photo-realistic rendering, (2) automated high-quality DAgger data collection for adapting policies to deployment environments, (3) reproducible benchmarking of real-robot manipulation policies in simulation, (4) simulation data collection by virtual teleoperation, and (5) zero-shot sim2real visual reinforcement learning.

    Learn more:
    Project Page: https://3dgsworld.github.io/
    arXiv: https://arxiv.org/abs/2510.20813
    Authors’ Original Thread on X
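
    The abstract doesn’t spell out the GSDF format itself, so here is a purely hypothetical Python sketch of the kind of information such an asset file bundles: a Gaussian-splat appearance model tied to a collision mesh, plus a robot URDF for the physics engine. All names and fields are illustrative assumptions, not the actual GSDF schema.

    ```python
    # Hypothetical sketch only: GSDF is described as infusing a Gaussian-on-Mesh
    # representation with a robot URDF and object assets. Field names and
    # structure below are illustrative, not the real format.
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class GaussianOnMeshAsset:
        splat_path: str   # 3D Gaussian splat used for photo-realistic rendering
        mesh_path: str    # mesh the Gaussians are attached to, used for physics

    @dataclass
    class SceneDescription:  # stand-in for a GSDF-style scene file
        robot_urdf: str                       # articulated robot model for the physics engine
        objects: List[GaussianOnMeshAsset] = field(default_factory=list)
        background: Optional[GaussianOnMeshAsset] = None

    scene = SceneDescription(
        robot_urdf="franka_panda.urdf",
        objects=[GaussianOnMeshAsset("mug.splat", "mug_collision.obj")],
        background=GaussianOnMeshAsset("tabletop.splat", "tabletop.obj"),
    )
    print(len(scene.objects), "object(s) plus robot", scene.robot_urdf)
    ```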

    46 min
  2. Ep#55: Trace Anything: Representing Any Video in 4D via Trajectory Fields

    19 DEC

    Ep#55: Trace Anything: Representing Any Video in 4D via Trajectory Fields

    Modeling how worlds evolve over time is an important aspect of interacting with them. Video world models have become an exciting area of research in robotics over the past year in part for this reason. What if there were a better way to represent changes over time, though? Trace Anything represents a video as a trajectory field: every pixel in every frame is assigned a continuous trajectory through 3D space. This provides a unique foundation for all kinds of downstream tasks like goal-conditioned manipulation and motion forecasting. We talked to Xinhang Liu to learn more. Watch Episode 55 of RoboPapers with Michael Cho and Chris Paxton now!

    Abstract: Effective spatio-temporal representation is fundamental to modeling, understanding, and predicting dynamics in videos. The atomic unit of a video, the pixel, traces a continuous 3D trajectory over time, serving as the primitive element of dynamics. Based on this principle, we propose representing any video as a Trajectory Field: a dense mapping that assigns a continuous 3D trajectory function of time to each pixel in every frame. With this representation, we introduce Trace Anything, a neural network that predicts the entire trajectory field in a single feed-forward pass. Specifically, for each pixel in each frame, our model predicts a set of control points that parameterizes a trajectory (i.e., a B-spline), yielding its 3D position at arbitrary query time instants. We trained the Trace Anything model on large-scale 4D data, including data from our new platform, and our experiments demonstrate that: (i) Trace Anything achieves state-of-the-art performance on our new benchmark for trajectory field estimation and performs competitively on established point-tracking benchmarks; (ii) it offers significant efficiency gains thanks to its one-pass paradigm, without requiring iterative optimization or auxiliary estimators; and (iii) it exhibits emergent abilities, including goal-conditioned manipulation, motion forecasting, and spatio-temporal fusion.

    Project Page: https://trace-anything.github.io/
    arXiv: https://arxiv.org/abs/2510.13802
    This Post on X
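
    To make the representation concrete: each pixel’s motion is a B-spline defined by a few predicted control points, which can then be queried at arbitrary times. Below is a minimal sketch (not the authors’ code) of evaluating one such trajectory with SciPy; the control-point count, spline degree, and knot placement are assumptions for illustration.

    ```python
    # Minimal sketch: evaluate a 3D trajectory parameterized by B-spline control
    # points, as the abstract describes for each pixel's motion over time.
    import numpy as np
    from scipy.interpolate import BSpline

    k = 3                                    # cubic B-spline
    ctrl = np.array([[0.0, 0.0, 0.0],        # (num_ctrl, 3) control points in 3D
                     [0.1, 0.0, 0.2],
                     [0.3, 0.1, 0.3],
                     [0.5, 0.1, 0.2],
                     [0.6, 0.2, 0.0]])
    n = len(ctrl)
    # clamped knot vector so the curve spans normalized time t in [0, 1]
    knots = np.concatenate([np.zeros(k), np.linspace(0, 1, n - k + 1), np.ones(k)])
    trajectory = BSpline(knots, ctrl, k)

    # query the pixel's 3D position at arbitrary time instants
    times = np.array([0.0, 0.25, 0.5, 1.0])
    print(trajectory(times))                 # -> (4, 3) array of xyz positions
    ```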

    54 min
  3. Ep#54: MemER: Scaling Up Memory for Robot Control via Experience Retrieval

    17 DEC

    Ep#54: MemER: Scaling Up Memory for Robot Control via Experience Retrieval

    Most robot policies today still largely lack memory: they make all their decisions based on what they can see right now. MemER aims to change that by learning which frames are important; this lets it handle tasks like object search. Ajay Sridhar, Jenny Pan, and Satvik Sharma tell us how they achieve this fundamental capability for long-horizon task execution. Watch Episode #54 of RoboPapers with Michael Cho and Chris Paxton to learn more!

    Abstract: Humans rely on memory to perform tasks; our goal is to endow robot policies with the same ability. Naively conditioning on long observation histories is computationally expensive and brittle under covariate shift, while indiscriminate subsampling of history leads to irrelevant or redundant information. We propose a hierarchical policy framework, where the high-level policy is trained to select and track previous task-relevant keyframes from its experience. The high-level policy uses selected keyframes and the most recent frames when generating text instructions for a low-level policy to execute. This design is compatible with existing vision-language-action (VLA) models and enables the system to efficiently reason over long-horizon dependencies. In our experiments, we fine-tune Qwen2.5-VL-3B-Instruct and a pretrained VLA as the high-level and low-level policies, respectively, using demonstrations supplemented with minimal language annotations. Our approach, MemER, outperforms prior methods on three real-world long-horizon robotic manipulation tasks that require minutes of memory.

    Project page: https://jen-pan.github.io/memer/
    arXiv: https://arxiv.org/abs/2510.20328
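
    The hierarchy the abstract describes can be pictured as a simple control loop: a high-level model sees its chosen keyframes plus the latest frames and emits a text instruction for a low-level VLA to execute. The class below is a rough, hypothetical illustration of that interface, with made-up function signatures, not the authors’ implementation.

    ```python
    # Hypothetical control-flow sketch of a keyframe-memory hierarchy:
    # the high-level policy picks task-relevant keyframes and issues a text
    # instruction; the low-level VLA acts on it. Interfaces are illustrative.
    from collections import deque

    class KeyframeMemoryAgent:
        def __init__(self, high_level_policy, low_level_vla, num_recent=4):
            self.high = high_level_policy           # e.g. a finetuned VLM
            self.low = low_level_vla                # an off-the-shelf VLA policy
            self.keyframes = []                     # frames the high level chose to keep
            self.recent = deque(maxlen=num_recent)  # short window of latest frames

        def step(self, frame, task):
            self.recent.append(frame)
            # high level reasons over selected keyframes + recent frames,
            # never the full observation history
            instruction, keep = self.high(self.keyframes, list(self.recent), task)
            if keep:                                # track this frame as task-relevant
                self.keyframes.append(frame)
            return self.low(frame, instruction)     # low level executes the instruction
    ```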

    51 min
  4. Ep#53: Semantic World Models

    15 DEC

    Ep#53: Semantic World Models

    World models — action-conditioned predictive models of the environment — are an exciting area of research for robots that can be useful both for training and for test-time compute. But video-based world models waste a lot of predictive power on reconstructing pixels, which makes model and data requirements much higher and limits how far out into the future their predictions remain viable. Instead, what if we learned a purely semantic world model, one which predicts which properties will be true about the world after a sequence of actions, without reconstructing whole images? Jacob Berg tells us more. Watch Episode #53 of RoboPapers now, with Michael Cho and Chris Paxton!

    Abstract: Planning with world models offers a powerful paradigm for robotic control. Conventional approaches train a model to predict future frames conditioned on current frames and actions, which can then be used for planning. However, the objective of predicting future pixels is often at odds with the actual planning objective; strong pixel reconstruction does not always correlate with good planning decisions. This paper posits that instead of reconstructing future frames as pixels, world models only need to predict task-relevant semantic information about the future. For such prediction, the paper poses world modeling as a visual question answering problem about semantic information in future frames. This perspective allows world modeling to be approached with the same tools underlying vision language models. Thus vision language models can be trained as “semantic” world models through a supervised finetuning process on image-action-text data, enabling planning for decision-making while inheriting many of the generalization and robustness properties from the pretrained vision-language models. The paper demonstrates how such a semantic world model can be used for policy improvement on open-ended robotics tasks, leading to significant generalization improvements over typical paradigms of reconstruction-based action-conditional world modeling.

    Project Page: https://weirdlabuw.github.io/swm/
    arXiv: https://arxiv.org/abs/2510.19818
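
    One way to picture “world modeling as visual question answering” for planning: ask the model yes/no questions about what will hold after each candidate action sequence, then pick the candidate whose predicted answers best satisfy the goal. The sketch below is an illustrative assumption about that usage; `semantic_wm`, `score_plan`, and `plan` are hypothetical names, not the paper’s actual interface.

    ```python
    # Illustrative sketch under stated assumptions: a finetuned VLM acts as a
    # "semantic world model" that answers questions about the post-action future,
    # and candidate plans are ranked by how many goal predicates it predicts hold.
    def score_plan(semantic_wm, image, actions, goal_questions):
        """Fraction of goal predicates predicted to hold after `actions`."""
        answers = [
            semantic_wm(image=image, actions=actions, question=q)  # -> "yes"/"no"
            for q in goal_questions
        ]
        return sum(a == "yes" for a in answers) / len(answers)

    def plan(semantic_wm, image, candidate_action_seqs, goal_questions):
        # choose the candidate whose predicted future best satisfies the goal
        return max(candidate_action_seqs,
                   key=lambda acts: score_plan(semantic_wm, image, acts, goal_questions))
    ```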

    1h 5m
  5. Ep#52: Probe, Learn, Distill: Self-improving Vision-Language-Action Models

    12 DEC

    Ep#52: Probe, Learn, Distill: Self-improving Vision-Language-Action Models

    On their own, vision-language-action models are powerful tools for general robot skills that show impressive generalization. However, they often don’t achieve useful levels of reliability on valuable manipulation tasks. Wenli Xiao teaches us one way to achieve this reliability: Probe, Learn, Distill. The idea is to freeze the VLA and learn residual actors, specialized policies that predict corrective actions on top of what the underlying VLA predicts. Rollouts from these residual actors can then be distilled back into the generalist VLA. Watch Episode #52 of RoboPapers with Michael Cho and Chris Paxton to learn more!

    Abstract: Supervised fine-tuning (SFT) has become the de facto post-training strategy for large vision-language-action (VLA) models, but its reliance on costly human demonstrations limits scalability and generalization. We propose Probe, Learn, Distill (PLD), a three-stage plug-and-play framework that improves VLAs through residual reinforcement learning (RL) and distribution-aware data collection. In Stage 1 (specialist acquisition), we freeze the VLA backbone and train lightweight residual actors via off-policy RL. These specialists take over in states where the base policy fails, thereby probing failure regions of the VLA generalist. In Stage 2 (data collection), we employ a hybrid rollout scheme that biases residual interventions toward states frequently visited by the base policy, aligning collected trajectories with the generalist’s deployment distribution while capturing recovery behaviors. In Stage 3 (fine-tuning), these curated trajectories are distilled back into the generalist with standard SFT, applicable to both flow-matching and autoregressive heads. We evaluate PLD across diverse settings: it achieves a near-saturated 99% task success rate on the LIBERO benchmark, delivers over 50% performance gains in SimplerEnv, and demonstrates a 100% success rate on real-world Franka arm and YAM arm dexterous manipulation tasks. We further provide ablations showing that residual policy probing and distribution-aware replay are key to collecting deployment-aligned data that improves VLAs’ capabilities on both seen and unseen tasks. Our results demonstrate that RL-generated, policy-aligned data can surpass teleoperation-only demonstrations, offering a scalable path toward self-improving VLA models.

    Project Page: https://www.wenlixiao.com/self-improve-VLA-PLD
    arXiv: https://arxiv.org/abs/2511.00091
    Thread on X: https://x.com/_wenlixiao/status/1984307913247375428
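
    The “residual actor on a frozen VLA” idea from Stage 1 is easy to sketch: keep the generalist’s action and add a small learned correction on top of it. The PyTorch snippet below is a rough sketch under stated assumptions (the module sizes, the 0.1 residual scale, and the interfaces are illustrative), not the authors’ architecture.

    ```python
    # Minimal sketch of a residual actor: the VLA backbone stays frozen and a
    # lightweight head, trained with off-policy RL, adds a bounded correction
    # to its action. Sizes, scale, and interfaces are illustrative assumptions.
    import torch
    import torch.nn as nn

    class ResidualActor(nn.Module):
        def __init__(self, frozen_vla, obs_dim, action_dim, hidden=256):
            super().__init__()
            self.vla = frozen_vla
            for p in self.vla.parameters():          # freeze the generalist backbone
                p.requires_grad = False
            self.residual = nn.Sequential(           # small head trained with RL
                nn.Linear(obs_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, action_dim), nn.Tanh(),
            )

        def forward(self, obs_features, obs_raw):
            base = self.vla(obs_raw)                 # frozen VLA action
            delta = self.residual(torch.cat([obs_features, base], dim=-1))
            return base + 0.1 * delta                # bounded correction around the base policy
    ```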

    46 min
  6. Ep#51: Humanoid Everyday

    10 DEC

    Ep#51: Humanoid Everyday

    Robotics, as we know, has a data problem. Many workarounds have been proposed, but one of the most important things is just to collect a large amount of real-robot data — something very difficult, especially for mobile humanoids. Enter Humanoid Everyday, which provides a large, diverse dataset of humanoid mobile manipulation examples. With 260 tasks across 7 different categories, this is the largest humanoid robot dataset we’ve ever seen — and, most importantly, the authors have provided clear evidence that it works for robot learning. Zhenyu Zhao, Hongyi Jing, Xiawei Liu, Jiageng Mao, and Yue Wang all join us to tell us more about their thought process, their dataset, and the future of humanoid robot evaluation. Watch Episode #51 of RoboPapers, with Michael Cho and Chris Paxton, now!

    Abstract: From locomotion to dexterous manipulation, humanoid robots have made remarkable strides in demonstrating complex full-body capabilities. However, the majority of current robot learning datasets and benchmarks mainly focus on stationary robot arms, and the few existing humanoid datasets are either confined to fixed environments or limited in task diversity, often lacking human-humanoid interaction and lower-body locomotion. Moreover, there are few standardized evaluation platforms for benchmarking learning-based policies on humanoid data. In this work, we present Humanoid Everyday, a large-scale and diverse humanoid manipulation dataset characterized by extensive task variety involving dexterous object manipulation, human-humanoid interaction, locomotion-integrated actions, and more. Leveraging a highly efficient human-supervised teleoperation pipeline, Humanoid Everyday aggregates high-quality multimodal sensory data, including RGB, depth, LiDAR, and tactile inputs, together with natural language annotations, comprising 10.3k trajectories and over 3 million frames of data across 260 tasks in 7 broad categories. In addition, we conduct an analysis of representative policy learning methods on our dataset, providing insights into their strengths and limitations across different task categories. For standardized evaluation, we introduce a cloud-based evaluation platform that allows researchers to seamlessly deploy their policies in our controlled setting and receive performance feedback. By releasing Humanoid Everyday along with our policy learning analysis and a standardized cloud-based evaluation platform, we intend to advance research in general-purpose humanoid manipulation and lay the groundwork for more capable and embodied robotic agents in real-world scenarios.

    Project Page: https://humanoideveryday.github.io/
    arXiv: https://arxiv.org/abs/2510.08807

    54 min
  7. Ep#50: EMMA: Scaling Mobile Manipulation via Egocentric Human Data

    8 DEC

    Ep#50: EMMA: Scaling Mobile Manipulation via Egocentric Human Data

    Collecting robot teleoperation data for mobile manipulation is incredibly time consuming, even more so than collecting teleoperation data for a stationary manipulator. Fortunately, Lawrence and Pranav have a solution: EMMA, or Egocentric Mobile MAnipulation. In short, they find that they can skip mobile teleoperation entirely, using only static arms for manipulation tasks and co-training with egocentric human video. This is enough to show generalization to more complex scenes and tasks. To learn more, watch Episode #50 of RoboPapers now, hosted by Michael Cho and Chris Paxton!

    Abstract: Scaling mobile manipulation imitation learning is bottlenecked by expensive mobile robot teleoperation. We present Egocentric Mobile MAnipulation (EMMA), an end-to-end framework training mobile manipulation policies from human mobile manipulation data with static robot data, sidestepping mobile teleoperation. To accomplish this, we co-train human full-body motion data with static robot data. In our experiments across three real-world tasks, EMMA demonstrates comparable performance to baselines trained on teleoperated mobile robot data (Mobile ALOHA), achieving higher or equivalent task performance in full task success. We find that EMMA is able to generalize to new spatial configurations and scenes, and we observe positive performance scaling as we increase the hours of human data, opening new avenues for scalable robotic learning in real-world environments.

    Project Page: https://ego-moma.github.io/
    arXiv: https://arxiv.org/abs/2509.04443
    Original Thread on X
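
    The core recipe here is co-training: training batches mix egocentric human data with static-robot teleoperation data. The tiny sketch below illustrates that kind of mixed sampling under assumptions; the 50/50 ratio and dataset interfaces are made up for illustration and are not EMMA’s actual settings.

    ```python
    # Illustrative co-training batch sampler (assumed interfaces and mixing ratio,
    # not the EMMA training code): each batch mixes human egocentric examples
    # with static-robot teleoperation examples.
    import random

    def cotraining_batches(human_data, robot_data, batch_size=32, human_frac=0.5, steps=1000):
        n_human = int(batch_size * human_frac)
        for _ in range(steps):
            batch = random.sample(human_data, n_human) \
                  + random.sample(robot_data, batch_size - n_human)
            random.shuffle(batch)  # avoid ordering effects within the batch
            yield batch
    ```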

    1h 4m
  8. Ep#49: Learning a Unified Policy for Position and Force Control in Legged Loco-Manipulation

    5 DEC

    Ep#49: Learning a Unified Policy for Position and Force Control in Legged Loco-Manipulation

    Robots need to be able to apply pressure and make contact with objects as needed in order to accomplish their tasks. From compliance to working safely around humans to whole-body manipulation of heavy objects, combining force and position control can dramatically expand the capabilities of robots. This is especially true for legged robots, which have so much ability to exert forces on the world around them. But how do we train robots that can do this? Baoxiong Jia tells us more in our discussion of his team’s recent Best Paper Award-winning work on learning a unified policy for position and force control, called UniFP. To learn more, watch Episode #49 of RoboPapers, hosted by Michael Cho and Chris Paxton.

    Abstract: Robotic loco-manipulation tasks often involve contact-rich interactions with the environment, requiring the joint modeling of contact force and robot position. However, recent visuomotor policies often focus solely on learning position or force control, overlooking their co-learning. In this work, we propose the first unified policy for legged robots that jointly models force and position control learned without reliance on force sensors. By simulating diverse combinations of position and force commands alongside external disturbance forces, we use reinforcement learning to learn a policy that estimates forces from historical robot states and compensates for them through position and velocity adjustments. This policy enables a wide range of manipulation behaviors under varying force and position inputs, including position tracking, force application, force tracking, and compliant interactions. Furthermore, we demonstrate that the learned policy enhances trajectory-based imitation learning pipelines by incorporating essential contact information through its force estimation module, achieving approximately 39.5% higher success rates across four challenging contact-rich manipulation tasks compared to position-control policies. Extensive experiments on both a quadrupedal manipulator and a humanoid robot validate the versatility and robustness of the proposed policy across diverse scenarios.

    Project Page: https://unified-force.github.io/
    arXiv: https://arxiv.org/abs/2505.20829
    Post on X
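
    For intuition on how a force command can be realized through position and velocity adjustments (the mapping the abstract describes), here is a classical one-dimensional admittance-control step. This is a textbook illustration with made-up gains, not UniFP’s learned policy, which replaces both the force estimator and these fixed gains.

    ```python
    # Classical 1-DoF admittance control, shown only to illustrate how a force
    # error can be turned into position/velocity targets. Gains and interfaces
    # are illustrative; UniFP learns this mapping (and the force estimate) with RL.
    def admittance_step(x_ref, dx, v, f_des, f_est, dt, m=1.0, d=20.0, k=100.0):
        """Advance a virtual mass-spring-damper driven by the force error."""
        f_err = f_des - f_est             # commanded minus estimated contact force
        a = (f_err - d * v - k * dx) / m  # acceleration of the virtual dynamics
        v = v + a * dt
        dx = dx + v * dt                  # deviation from the nominal reference
        return x_ref + dx, dx, v          # position target sent to the robot, plus state
    ```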

    50 min

About

Chris Paxton & Michael Cho geek out over robotics papers with paper authors. robopapers.substack.com