RoboPapers

Chris Paxton and Michael Cho

5.0 (2)
Technology
Updated weekly

Chris Paxton & Michael Cho geek out over robotic papers with paper authors. robopapers.substack.com

5 days ago

Ep#88: DexSkin: High-Coverage Conformable Robotic Skin for Learning Contact-Rich Manipulation

Human skin plays an important role in how we interact with the world and robustly manipulate objects. It’s not just important when we can’t see things with our eyes, but also when we want to pick up something heavy, or apply a very specific amount of force. So, it makes sense to want to give robots skin. Enter DexSkin: a soft, deformable electronic skin which can be applied across different surfaces and used to cover robot hands or fingers. Suzannah Wistreich and Baiyu Shi talk to us about their work building DexSkin, showing how it’s useful for policy learning, including online reinforcement learning, and how it can be calibrated and policies transferred across sensors. They also open-sourced their code and methods for building the sensors. To learn more, watch Episode #88 of RoboPapers now, hosted by Chris Paxton and Jiafei Duan! Abstract Human skin provides a rich tactile sensing stream, localizing intentional and unintentional contact events over a large and contoured region. Replicating these tactile sensing capabilities for dexterous robotic manipulation systems remains a longstanding challenge. In this work, we take a step towards this goal by introducing DexSkin. DexSkin is a soft, conformable capacitive electronic skin that enables sensitive, localized, and calibratable tactile sensing, and can be tailored to varying geometries. We demonstrate its efficacy for learning downstream robotic manipulation by sensorizing a pair of parallel jaw gripper fingers, providing tactile coverage across almost the entire finger surfaces. We empirically evaluate DexSkin's capabilities in learning challenging manipulation tasks that require sensing coverage across the entire surface of the fingers, such as reorienting objects in hand and wrapping elastic bands around boxes, in a learning-from-demonstration framework. 
We then show that, critically for data-driven approaches, DexSkin can be calibrated to enable model transfer across sensor instances, and demonstrate its applicability to online reinforcement learning on real robots. Our results highlight DexSkin's suitability and practicality for learning real-world, contact-rich manipulation. Please see our project webpage for videos and visualizations. Learn More ArXiV: https://arxiv.org/abs/2509.18830 Project Page: https://dex-skin.github.io/ Github: https://github.com/sdwistreich/dexskin Datasets: https://huggingface.co/datasets/swistreich/dexskin This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com

55 min
June 18

Ep#87: MolmoAct 2: An open foundation for robots that work in the real world

There are few truly open models in the world, including both weights and data. However, these models are crucial for research and development of new systems: they help us learn which data is important and help develop new capabilities for deploying robots in the real world. MolmoAct2 provides a foundation for open research into robotics. It is associated with its own open dataset, an open-data action tokenizer, and a reasoning variant which predicts depth tokens. And people have actually been using it across the community, running experiments in their own labs or homes. Haoquan Fang and Jiafei Duan tell us more. Watch Episode #87 of RoboPapers, with Michael Cho and Chris Paxton, now! Abstract Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today’s systems fall short for real-world deployment. Frontier models are closed; open-weight alternatives are tied to expensive hardware; reasoning-augmented policies pay prohibitive latency for their grounding; and fine-tuned success rates remain below the threshold for dependable use. We present MolmoAct2, a fully open action reasoning model built for practical deployment, advancing its predecessor, MolmoAct, along five axes. (1) MolmoAct2 is built on top of our new Molmo2-ER, a VLM backbone specialized for spatial and embodied reasoning, trained on a 3.3M-sample corpus with a specialize-then-rehearse recipe. (2) We release three new robot datasets spanning low-to-medium cost platforms: MolmoAct2-BimanualYAM Dataset, 720 hours of teleoperated bimanual trajectories that constitute the largest open bimanual dataset to date; MolmoAct2-DROID Dataset, a quality-filtered Franka subset of DROID; and MolmoAct2-SO100/101 Dataset, a quality-filtered SO-100/101 subset. (3) We train and release MolmoAct2-FAST Tokenizer, an open-weight, open-data action tokenizer trained on millions of trajectories across five embodiments. 
(4) We design a new VLA architecture to graft the discrete-token VLM into the flow-matching continuous-action expert via per-layer key-value (KV) conditioning. (5) We propose MolmoAct2-Think, an adaptive-depth reasoning variant that re-predicts depth tokens only for scene regions that change between timesteps, retaining geometric grounding at a fraction of prior latency. In the most extensive empirical study of any open VLA to date, spanning 7 simulation and real-world benchmarks, MolmoAct2 outperforms strong baselines including π0.5, while Molmo2-ER surpasses GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning benchmarks. We release model weights, training code, and complete training data. Learn More Project page: https://allenai.org/blog/molmoact2 Code: https://github.com/allenai/molmoact2 ArXiV: https://arxiv.org/pdf/2605.02881v1 And check out our episode on the original MolmoAct. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com

1 hr 2 min
June 12

Ep#86: RISE: Self-Improving Robot Policy with Compositional World Model

Robot policies must be both reliable and highly capable to be useful; the best way to achieve this level of performance is with reinforcement learning. However, for reinforcement learning you are usually stuck between two difficult options: reinforcement in the real world is often risky and expensive, while reinforcement learning in a traditional simulator takes a lot of engineering work and has a persistent sim-to-real gap. What if instead you could train your robot purely in a world model? RISE by Jiazhi Yang et al. uses a compositional world model to predict the future and evaluate progress. This allows for a self-improving pipeline, which learns a world model from real data and then learns how the robot should perform different tasks. This pipeline results in a data-driven way to improve policy performance from real data but without real-world reinforcement learning. Watch Episode #86 of RoboPapers, with Chris Paxton and Jiafei Duan, to learn more! Abstract Despite the sustained scaling on model capacity and data acquisition, Vision-Language-Action (VLA) models remain brittle in contact-rich and dynamic manipulation tasks, where minor execution deviations can compound into failures. While reinforcement learning (RL) offers a principled path to robustness, on-policy RL in the physical world is constrained by safety risk, hardware cost, and environment reset. To bridge this gap, we present RISE, a scalable framework of robotic reinforcement learning via imagination. At its core is a Compositional World Model that (i) predicts multi-view future via a controllable dynamics model, and (ii) evaluates imagined outcomes with a progress value model, producing informative advantages for the policy improvement. Such compositional design allows state and value to be tailored by best-suited yet distinct architectures and objectives. 
These components are integrated into a closed-loop self-improving pipeline that continuously generates imaginary rollouts, estimates advantages, and updates the policy in imaginary space without costly physical interaction. Across three challenging real-world tasks, RISE yields significant improvement over prior art, with more than +35% absolute performance increase in dynamic brick sorting, +45% for backpack packing, and +35% for box closing, respectively. Learn More Project Page: https://opendrivelab.com/RISE/ ArXiV: https://arxiv.org/abs/2602.11075 This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com

54 min
June 4

Ep#85: Tutor Intelligence

Collecting robot data at scale is key to deploying working manipulation policies, and the team from Tutor Intelligence is here to tell us about how to accomplish it. Their new announcement: a massive, 100-robot “data factory,” with a behind-the-scenes look at how to build a teleoperation platform and how to make robots and policies that are useful for their customers. Tutor Intelligence is a full-stack robotics company: they build robot arms, they sell robot arms, they write the software and they train neural networks. Josh Gruenstein, Jesse Michel, Shiraz Khan, and Joe McCalmon join us to tell us more about how they scale both teleop data and human interventions from their teleoperators in order to train the policies they need. Watch Episode #85 of RoboPapers, with Chris Paxton and Jiafei Duan, to learn more! Learn More Blog post: https://tutorintelligence.com/blog/building-a-100-robot-data-factory-toward-factory-ready-ai This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com

1 hr 1 min
June 2

Ep#84: Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons

Learning robust, general-purpose reward functions for robotics unlocks many potential applications, like on-robot reinforcement learning or dataset validation. However, there’s a question of how to actually train such reward functions. Training success/failure prediction leads to ambiguous signals partway through a demonstration — it’s hard to measure progress — making the method unsuitable for reinforcement learning, among other things. Predicting progress, on the other hand, does not give a good way of using failure data. So why not do both? Robometer combines both progress and preference supervision, resulting in a stable, scalable, and highly general reward learning approach. Anthony Liang, Yigit Korkmaz, and Jesse Zhang join us to tell us more. Watch Episode #84 of RoboPapers, with Chris Paxton and Jiafei Duan, to learn more! Abstract General-purpose robot reward models are typically trained to predict absolute task progress from expert demonstrations, providing only local, frame-level supervision. While effective for expert demonstrations, this paradigm scales poorly to large-scale robotics datasets where failed and suboptimal trajectories are abundant and assigning dense progress labels is ambiguous. We introduce Robometer, a scalable reward modeling framework that combines intra-trajectory progress supervision with inter-trajectory preference supervision. Robometer is trained with a dual objective: a frame-level progress loss that anchors reward magnitude on expert data, and a trajectory-comparison preference loss that imposes global ordering constraints across trajectories of the same task, enabling effective learning from both real and augmented failed trajectories. To support this formulation at scale, we curate RBM-1M, a reward-learning dataset comprising over one million trajectories spanning diverse robot embodiments and tasks, including substantial suboptimal and failure data. 
Across benchmarks and real-world evaluations, Robometer learns more generalizable reward functions than prior methods and improves robot learning performance across a diverse set of downstream applications. Code, model weights, and videos are available on the project page. Learn More Project page: https://robometer.github.io/ ArXiV: https://arxiv.org/abs/2603.02115 Code on Github: https://github.com/robometer/robometer This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com
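The dual objective described in the Robometer abstract (a frame-level progress loss plus a trajectory-comparison preference loss) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the mean-squared-error progress term, the Bradley-Terry style preference term, and the `weight` parameter are all assumptions chosen for illustration.

```python
import math

def progress_loss(predicted, target):
    """Mean squared error between predicted and labeled per-frame progress (assumed form)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)

def preference_loss(score_preferred, score_rejected):
    """Bradley-Terry style loss, -log(sigmoid(margin)), written in a numerically stable way."""
    x = -(score_preferred - score_rejected)
    # softplus(x) = max(x, 0) + log(1 + exp(-|x|))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dual_objective(predicted, target, score_preferred, score_rejected, weight=1.0):
    """Hypothetical combination: progress term plus weighted preference term."""
    return progress_loss(predicted, target) + weight * preference_loss(
        score_preferred, score_rejected
    )

# Toy usage: a perfect progress prediction and a correctly ordered trajectory pair.
loss = dual_objective([0.0, 0.5, 1.0], [0.0, 0.5, 1.0], score_preferred=2.0, score_rejected=0.0)
```

In this sketch the progress term only needs labeled expert frames, while the preference term only needs an ordering between two trajectories, which is one way failed or suboptimal rollouts could contribute a training signal without dense progress labels.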

59 min
May 29

Ep#83: PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation

Spatial understanding is important to moving around in complex environments and is a huge part of the challenge of generalizing to new scenes. Most world models, however, largely ignore this spatial dimension, focusing on 2D images. Not PointWorld, though. PointWorld is a 3D world model trained from real and simulated data which can perform a wide variety of manipulation tasks on a real robot, including grasping or handling articulated objects, all without any additional fine tuning. Wenlong Huang joins us to tell us more about what makes this work and how it’s different from other world models. Watch Episode #83 of RoboPapers, with Chris Paxton and Jiafei Duan, to learn more! Abstract Humans anticipate, from a glance and a contemplated action of their bodies, how the 3D world will respond, a capability that is equally vital for robotic manipulation. We introduce PointWorld, a large pre-trained 3D world model that unifies state and action in a shared 3D space as 3D point flows: given one or few RGB-D images and a sequence of low-level robot action commands, PointWorld forecasts per-pixel displacements in 3D that respond to the given actions. By representing actions as 3D point flows instead of embodiment-specific action spaces (e.g., joint positions), this formulation directly conditions on physical geometries of robots while seamlessly integrating learning across embodiments. To train our 3D world model, we curate a large-scale dataset spanning real and simulated robotic manipulation in open-world environments, enabled by recent advances in 3D vision and simulated environments, totaling about 2M trajectories and 500 hours across a single-arm Franka and a bimanual humanoid. Through rigorous, large-scale empirical studies of backbones, action representations, learning objectives, partial observability, data mixtures, domain transfers, and scaling, we distill design principles for large-scale 3D world modeling. 
With a real-time (0.1s) inference speed, PointWorld can be efficiently integrated in the model-predictive control (MPC) framework for manipulation. We demonstrate that a single pre-trained checkpoint enables a real-world Franka robot to perform rigid-body pushing, deformable and articulated object manipulation, and tool use, without requiring any demonstrations or post-training and all from a single image captured in-the-wild. References Project page: https://point-world.github.io/ ArXiV: https://arxiv.org/abs/2601.03782 This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com
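To make the idea of using a world model inside model-predictive control more concrete, here is a minimal random-shooting sketch. It is not PointWorld's planner: `toy_world_model` is a hypothetical one-dimensional stand-in for a learned model, and the cost function, horizon, and sample count are assumptions for illustration only.

```python
import random

def toy_world_model(position, action):
    """Hypothetical stand-in for a learned world model: the point moves by the action."""
    return position + action

def plan_first_action(position, goal, horizon=5, num_samples=200, seed=0):
    """Random-shooting MPC: sample action sequences, roll each out in the model,
    and return the first action of the sequence whose final state is closest to the goal."""
    rng = random.Random(seed)
    best_cost, best_first_action = float("inf"), 0.0
    for _ in range(num_samples):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        state = position
        for action in actions:
            state = toy_world_model(state, action)
        cost = abs(goal - state)  # distance of the predicted final state from the goal
        if cost < best_cost:
            best_cost, best_first_action = cost, actions[0]
    return best_first_action, best_cost

first_action, predicted_cost = plan_first_action(position=0.0, goal=3.0)
```

In a closed loop, only the returned first action would be executed before replanning from the new observation, which is why fast model inference matters for this style of control.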

1 hr 23 min
May 27

Ep#82: SimToolReal: An Object-Centric Policy for Zero-Shot Dexterous Tool Manipulation

Humans use tools to perform almost all of the physical work that we do from day to day. However, tools come in many different sizes and shapes, and it’s very difficult to collect human data for them in general. What about training in simulation? SimToolReal aims to address this through, unsurprisingly, sim-to-real learning. Kushal Kedia and Tyler Lum talk about how it works: they procedurally generate tool-like objects, and then train with the universal objective of moving objects around to different locations. This creates a general-purpose model which can manipulate various tools to perform a variety of tasks in the real world. Watch Episode #82 of RoboPapers, hosted by Michael Cho and Jiafei Duan, now to learn more! Abstract The ability to manipulate tools significantly expands the set of tasks a robot can perform. Yet, tool manipulation represents a challenging class of dexterity, requiring grasping thin objects, in-hand object rotations, and forceful interactions. Since collecting teleoperation data for these behaviors is challenging, sim-to-real reinforcement learning (RL) is a promising alternative. However, prior approaches typically require substantial engineering effort to model objects and tune reward functions for each task. In this work, we propose SimToolReal, taking a step towards generalizing sim-to-real RL policies for tool manipulation. Instead of focusing on a single object and task, we procedurally generate a large variety of tool-like object primitives in simulation and train a single RL policy with the universal goal of manipulating each object to random goal poses. This approach enables SimToolReal to perform general dexterous tool manipulation at test-time without any object or task-specific training. We demonstrate that SimToolReal outperforms prior retargeting and fixed-grasp methods by 37% while matching the performance of specialist RL policies trained on specific target objects and tasks. 
Finally, we show that SimToolReal generalizes across a diverse set of everyday tools, achieving strong zero-shot performance over 120 real-world rollouts spanning 24 tasks, 12 object instances, and 6 tool categories. Learn More Project page: https://simtoolreal.github.io/ ArXiV: https://arxiv.org/abs/2602.16863 This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com

55 min
May 20

Ep#81: mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs

Robotics fundamentally involves understanding the dynamics of how things change in the world in response to action and force. This is impossible to learn from static images; instead, it’s far more effective and more data-efficient to learn from video. Elvis Nava joins us to talk about mimic-video and Mimic Robotics. mimic-video is part of a new class of video-action models, capable of achieving complex, dexterous bimanual robotic manipulation with relatively little robot data. One of the key findings from mimic-video is that pretraining on web-scale video allows robots to learn physics priors; as a result, policies train faster, generalize better, and are capable of more impressive dexterity, versus training on static images or image-language pairs as per a VLM. Watch Episode #81 of RoboPapers with Michael Cho and Chris Paxton to learn more! Abstract Prevailing Vision-Language-Action Models (VLAs) for robotic manipulation are built upon vision-language backbones pretrained on large-scale, but disconnected static web data. As a result, despite improved semantic generalization, the policy must implicitly infer complex physical dynamics and temporal dependencies solely from robot trajectories. This reliance creates an unsustainable data burden, necessitating continuous, large-scale expert data collection to compensate for the lack of innate physical understanding. We contend that while vision-language pretraining effectively captures semantic priors, it remains blind to physical causality. A more effective paradigm leverages video to jointly capture semantics and visual dynamics during pretraining, thereby isolating the remaining task of low-level control. To this end, we introduce mimic-video, a novel Video-Action Model (VAM) that pairs a pretrained Internet-scale video model with a flow matching-based action decoder conditioned on its latent representations. 
The decoder serves as an Inverse Dynamics Model (IDM), generating low-level robot actions from the latent representation of video-space action plans. Our extensive evaluation shows that our approach achieves state-of-the-art performance on simulated and real-world robotic manipulation tasks, improving sample efficiency by 10x and convergence speed by 2x compared to traditional VLA architectures. Learn More Project page: https://mimic-video.github.io/ ArXiV: https://arxiv.org/abs/2512.15692 This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com

1 hr 9 min


5.0 out of 5, 2 ratings

Creators: Chris Paxton and Michael Cho
Years Active: 2025 - 2026
Episodes: 88
Rating: Clean
Show Website: RoboPapers
