AI Illuminated

The AI Illuminators

A new way to keep up with AI research. Delivered to your ears. Illuminated by AI. Part of the GenAI4Good initiative.

  1. MegaSaM: Accurate, Fast, and Robust Structure and Motion from Casual Dynamic Videos

    07/12/2024

    [00:00] Introduction
    [00:20] Limitations of traditional SfM and SLAM techniques
    [00:57] Shortcomings of existing neural network methods
    [01:07] MegaSaM's approach: balance of accuracy, speed, and robustness
    [01:31] Differentiable bundle adjustment (BA) layer
    [02:03] Integration of monocular depth priors and motion probability maps
    [02:37] Uncertainty-aware global BA scheme
    [03:14] Two-stage training scheme
    [03:45] Consistent video depth estimation without test-time fine-tuning
    [04:16] Key quantitative and qualitative improvements
    [04:49] Limitations of MegaSaM and future research avenues
    [05:15] Synthetic data for training and generalization to real-world videos
    [05:49] Datasets used for evaluation
    [06:26] DepthAnything and UniDepth for monocular depth estimation
    [07:02] Summary of MegaSaM's advancements

    Authors: Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, Noah Snavely

    Affiliations: Google DeepMind, UC Berkeley, University of Michigan

    Abstract: We present a system that allows for accurate, fast, and robust estimation of camera parameters and depth maps from casual monocular videos of dynamic scenes. Most conventional structure from motion and monocular SLAM techniques assume input videos that feature predominantly static scenes with large amounts of parallax. Such methods tend to produce erroneous estimates in the absence of these conditions. Recent neural network-based approaches attempt to overcome these challenges; however, such methods are either computationally expensive or brittle when run on dynamic videos with uncontrolled camera motion or unknown field of view. We demonstrate the surprising effectiveness of a deep visual SLAM framework: with careful modifications to its training and inference schemes, this system can scale to real-world videos of complex dynamic scenes with unconstrained camera paths, including videos with little camera parallax. Extensive experiments on both synthetic and real videos demonstrate that our system is significantly more accurate and robust at camera pose and depth estimation when compared with prior and concurrent work, with faster or comparable running times. See interactive results on our project page: this https URL

    Link: https://mega-sam.github.io/

    An unofficial code sketch of the uncertainty-weighted bundle adjustment idea follows this entry.

    7 min
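
The chapters on the differentiable BA layer and motion probability maps describe an uncertainty-weighted objective. Below is a minimal, unofficial NumPy sketch of that idea; the function and variable names (weighted_ba_objective, motion_prob, depth_prior, lambda_depth) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def weighted_ba_objective(reproj_residuals, depth, depth_prior,
                          motion_prob, lambda_depth=0.1):
    """Sketch of an uncertainty-aware BA-style cost for one frame.

    reproj_residuals: (N, 2) reprojection errors of N tracked points.
    depth, depth_prior: (N,) estimated depths and monocular-prior depths.
    motion_prob: (N,) probability that each point lies on a moving object.
    """
    # Downweight likely-dynamic points so they do not corrupt the camera estimate.
    w = 1.0 - motion_prob
    reproj_term = np.sum(w[:, None] * reproj_residuals ** 2)
    # Pull depths toward the monocular prior, in log space for scale robustness.
    depth_term = np.sum(w * (np.log(depth) - np.log(depth_prior)) ** 2)
    return reproj_term + lambda_depth * depth_term

# Toy usage with random data.
N = 100
cost = weighted_ba_objective(
    reproj_residuals=np.random.randn(N, 2),
    depth=np.random.uniform(1, 10, N),
    depth_prior=np.random.uniform(1, 10, N),
    motion_prob=np.random.uniform(0, 1, N),
)
print(cost)
```
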
  2. SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

    12/11/2024

    [00:00] SVDQuant: 4-bit diffusion model quantization
    [00:27] Challenge: Outlier sensitivity in 4-bit quantization
    [00:59] Solution: Smoothing + SVD approach
    [01:37] Technical: SVD's role in low-rank approximation
    [02:08] Nunchaku: New inference engine with kernel fusion
    [02:35] Comparison: INT4 vs FP4 quantization methods
    [03:00] Results: 3.5x memory reduction on FLUX.1
    [03:44] Feature: Seamless LoRA compatibility
    [04:06] Study: Validating combined approach effectiveness
    [04:40] Future: Hardware compatibility and improvements
    [06:12] Methods: Image quality assessment metrics
    [06:53] Impact: Open-source deployment benefits

    Authors: Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, Song Han

    Affiliations: MIT, NVIDIA, CMU, Princeton, UC Berkeley, SJTU, Pika Labs

    Abstract: Diffusion models have been proven highly effective at generating high-quality images. However, as these models grow larger, they require significantly more memory and suffer from higher latency, posing substantial challenges for deployment. In this work, we aim to accelerate diffusion models by quantizing their weights and activations to 4 bits. At such an aggressive level, both weights and activations are highly sensitive, where conventional post-training quantization methods for large language models like smoothing become insufficient. To overcome this limitation, we propose SVDQuant, a new 4-bit quantization paradigm. Different from smoothing which redistributes outliers between weights and activations, our approach absorbs these outliers using a low-rank branch. We first consolidate the outliers by shifting them from activations to weights, then employ a high-precision low-rank branch to take in the weight outliers with Singular Value Decomposition (SVD). This process eases the quantization on both sides. However, naïvely running the low-rank branch independently incurs significant overhead due to extra data movement of activations, negating the quantization speedup. To address this, we co-design an inference engine Nunchaku that fuses the kernels of the low-rank branch into those of the low-bit branch to cut off redundant memory access. It can also seamlessly support off-the-shelf low-rank adapters (LoRAs) without the need for re-quantization. Extensive experiments on SDXL, PixArt-Σ, and FLUX.1 validate the effectiveness of SVDQuant in preserving image quality. We reduce the memory usage for the 12B FLUX.1 models by 3.5×, achieving 3.0× speedup over the 4-bit weight-only quantized baseline on the 16GB laptop 4090 GPU, paving the way for more interactive applications on PCs. Our quantization library and inference engine are open-sourced.

    Link: https://hanlab.mit.edu/projects/svdquant

    An unofficial code sketch of the low-rank-plus-4-bit decomposition follows this entry.

    7 min
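
A minimal, unofficial NumPy sketch of the decomposition idea described in the abstract: a high-precision low-rank branch from SVD absorbs the weight outliers, and the residual is quantized to 4 bits. The smoothing step (migrating activation outliers into weights) and the Nunchaku kernel fusion are omitted, and the rank and the simulated INT4 scheme below are assumptions.

```python
import numpy as np

def fake_quant_int4(x):
    # Simulated symmetric per-tensor INT4 quantization: values snap to 16 levels.
    scale = np.abs(x).max() / 7.0 + 1e-12
    return np.clip(np.round(x / scale), -8, 7) * scale

def svdquant_style_decompose(W, rank=32):
    # High-precision low-rank branch built from the top singular components...
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    L = (U[:, :rank] * S[:rank]) @ Vt[:rank]
    # ...plus a 4-bit residual branch, which is now easier to quantize.
    R_q = fake_quant_int4(W - L)
    return L, R_q

W = np.random.randn(512, 512)
L, R_q = svdquant_style_decompose(W)
rel_err = np.linalg.norm(W - (L + R_q)) / np.linalg.norm(W)
print(f"relative reconstruction error: {rel_err:.4f}")
```
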
  3. D3RoMa: Disparity Diffusion-based Depth Sensing for Material-Agnostic Robotic Manipulation

    11/11/2024

    [00:00] Intro
    [00:18] Current limitations in depth-sensing technology
    [00:56] D3RoMa's diffusion model approach to depth estimation
    [01:47] Integration of geometric constraints in the model
    [02:27] HiSS: New dataset for transparent/specular objects
    [03:18] Benchmark results showing major accuracy improvements
    [04:02] Current limitations and future development areas
    [05:34] Technical details of HiSS dataset creation
    [06:30] Real-world testing with robotic systems
    [07:15] Why diffusion models outperform GANs
    [08:54] Implementation of consistency loss functions
    [12:00] Solving simulation-to-real-world transfer
    [13:25] Potential expansion to single-camera systems

    Authors: Songlin Wei, Haoran Geng, Jiayi Chen, Congyue Deng, Wenbo Cui, Chengyang Zhao, Xiaomeng Fang, Leonidas Guibas, He Wang

    Affiliations: Peking University, UC Berkeley, Stanford, Galbot, University of Chinese Academy of Sciences, Beijing Academy of Artificial Intelligence

    Abstract: Depth sensing is an important problem for 3D vision-based robotics. Yet, a real-world active stereo or ToF depth camera often produces noisy and incomplete depth which bottlenecks robot performances. In this work, we propose D3RoMa, a learning-based depth estimation framework on stereo image pairs that predicts clean and accurate depth in diverse indoor scenes, even in the most challenging scenarios with translucent or specular surfaces where classical depth sensing completely fails. Key to our method is that we unify depth estimation and restoration into an image-to-image translation problem by predicting the disparity map with a denoising diffusion probabilistic model. At inference time, we further incorporated a left-right consistency constraint as classifier guidance to the diffusion process. Our framework combines recently advanced learning-based approaches and geometric constraints from traditional stereo vision. For model training, we create a large scene-level synthetic dataset with diverse transparent and specular objects to compensate for existing tabletop datasets. The trained model can be directly applied to real-world in-the-wild scenes and achieve state-of-the-art performance in multiple public depth estimation benchmarks. Further experiments in real environments show that accurate depth prediction significantly improves robotic manipulation in various scenarios.

    Link: https://arxiv.org/abs/2409.14365

    An unofficial code sketch of the left-right consistency guidance follows this entry.

    15 min
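
A rough, unofficial PyTorch sketch of the left-right consistency guidance mentioned in the abstract: at each denoising step, the gradient of a photometric consistency loss nudges the predicted disparity. The denoiser, the warping function, and the single-step update below are simplified stand-ins, not the authors' sampler.

```python
import torch
import torch.nn.functional as F

def warp_right_to_left(right, disparity):
    # Warp the right image horizontally by the predicted disparity (simplified).
    B, C, H, W = right.shape
    xs = torch.arange(W, device=right.device).view(1, 1, 1, W).expand(B, 1, H, W)
    ys = torch.arange(H, device=right.device).view(1, 1, H, 1).expand(B, 1, H, W)
    grid_x = (xs - disparity) / (W - 1) * 2 - 1
    grid_y = ys / (H - 1) * 2 - 1
    grid = torch.cat([grid_x, grid_y], dim=1).permute(0, 2, 3, 1)  # (B, H, W, 2)
    return F.grid_sample(right, grid, align_corners=True)

def guided_denoise_step(x_t, t, left, right, denoiser, guidance_scale=1.0):
    # Classifier-style guidance: the gradient of a left-right photometric loss
    # with respect to the current disparity estimate steers the denoising update.
    x_t = x_t.detach().requires_grad_(True)
    photo_loss = ((warp_right_to_left(right, x_t) - left) ** 2).mean()
    grad = torch.autograd.grad(photo_loss, x_t)[0]
    x0_pred = denoiser(x_t, t)  # network's denoised disparity prediction
    # In a full DDPM sampler the gradient would be folded into the posterior mean;
    # here a single corrected estimate is returned for illustration.
    return x0_pred - guidance_scale * grad
```
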
  4. Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers

    09/11/2024

    [00:00] Intro
    [00:21] Key problem: Poor generalization in robotic learning
    [00:51] HPT: New transformer architecture for robotics
    [00:59] Core components of HPT architecture
    [01:44] Scale analysis: Data and model size impacts
    [02:16] Training data: Real robots, simulations, human videos
    [02:54] Results: 20% improvement on new tasks
    [04:04] Real-world testing limitations
    [05:18] Future additions: Tactile and 3D data
    [05:57] Requirements for better robotics datasets
    [06:48] Weight sampling in heterogeneous data
    [08:55] Benefits of modular architecture
    [10:30] Scaling challenges and trade-offs

    Authors: Lirui Wang, Xinlei Chen, Jialiang Zhao, Kaiming He

    Affiliations: MIT CSAIL, Meta FAIR

    Abstract: One of the roadblocks for training generalist robotic models today is heterogeneity. Previous robot learning methods often collect data to train with one specific embodiment for one task, which is expensive and prone to overfitting. This work studies the problem of learning policy representations through heterogeneous pre-training on robot data across different embodiments and tasks at scale. We propose Heterogeneous Pre-trained Transformers (HPT), which pre-train a large, shareable trunk of a policy neural network to learn a task and embodiment agnostic shared representation. This general architecture aligns the specific proprioception and vision inputs from distinct embodiments to a short sequence of tokens and then processes such tokens to map to control robots for different tasks. Leveraging the recent large-scale multi-embodiment real-world robotic datasets as well as simulation, deployed robots, and human video datasets, we investigate pre-training policies across heterogeneity. We conduct experiments to investigate the scaling behaviors of training objectives, to the extent of 52 datasets. HPTs outperform several baselines and enhance the fine-tuned policy performance by over 20% on unseen tasks in multiple simulator benchmarks and real-world settings. See the project website (this https URL) for code and videos.

    Link: https://arxiv.org/abs/2409.20537

    An unofficial code sketch of the stem-trunk-head architecture follows this entry.

    11 min
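
A compact, unofficial PyTorch sketch of the architecture pattern the abstract describes: embodiment-specific stems align proprioception and vision features to a short token sequence, a shared trunk processes the tokens, and a task-specific head maps to actions. All dimensions and module choices below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class HPTSketch(nn.Module):
    def __init__(self, proprio_dim, vision_dim, action_dim, d_model=256, n_tokens=8):
        super().__init__()
        self.d_model = d_model
        # Embodiment-specific stems: align heterogeneous inputs to fixed-size tokens.
        self.proprio_stem = nn.Linear(proprio_dim, d_model)
        self.vision_stem = nn.Linear(vision_dim, n_tokens * d_model)
        # Shared, embodiment-agnostic trunk.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=4)
        # Task/embodiment-specific head.
        self.head = nn.Linear(d_model, action_dim)

    def forward(self, proprio, vision_feat):
        B = proprio.shape[0]
        p_tok = self.proprio_stem(proprio).unsqueeze(1)            # (B, 1, d)
        v_tok = self.vision_stem(vision_feat).view(B, -1, self.d_model)
        tokens = self.trunk(torch.cat([p_tok, v_tok], dim=1))      # (B, 1+n, d)
        return self.head(tokens.mean(dim=1))                       # pooled -> action

# Toy usage: 14-D proprioception, 512-D vision features, 7-D action.
policy = HPTSketch(proprio_dim=14, vision_dim=512, action_dim=7)
action = policy(torch.randn(2, 14), torch.randn(2, 512))
print(action.shape)  # torch.Size([2, 7])
```
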
  5. HOVER: Versatile Neural Whole-Body Controller for Humanoid Robots

    04/11/2024

    [00:00] Introduction to HOVER: Neural Whole-Body Controller for Humanoids
    [00:15] Problem: Current controllers lack versatility across tasks
    [00:50] Human motion imitation as a unified control approach
    [01:23] Policy distillation: Learning from an oracle policy
    [02:01] Command space: Kinematic, joint angle, and root tracking modes
    [02:34] Motion retargeting: From human data to robot movements
    [03:09] Performance comparison with specialist policies
    [03:43] Real-world testing on Unitree H1 robot
    [04:15] Comparison with MHC and Masked Mimic approaches
    [04:49] Future work and current limitations
    [05:18] Reward function design and components
    [06:02] DAgger advantages in policy learning
    [06:33] Domain randomization for sim-to-real transfer
    [07:06] Conclusions on HOVER's contributions

    Authors: Tairan He, Wenli Xiao, Toru Lin, Zhengyi Luo, Zhenjia Xu, Zhenyu Jiang, Jan Kautz, Changliu Liu, Guanya Shi, Xiaolong Wang, Linxi Fan, Yuke Zhu

    Affiliations: NVIDIA, CMU, UC Berkeley, UT Austin, UC San Diego

    Abstract: Humanoid whole-body control requires adapting to diverse tasks such as navigation, loco-manipulation, and tabletop manipulation, each demanding a different mode of control. For example, navigation relies on root velocity tracking, while tabletop manipulation prioritizes upper-body joint angle tracking. Existing approaches typically train individual policies tailored to a specific command space, limiting their transferability across modes. We present the key insight that full-body kinematic motion imitation can serve as a common abstraction for all these tasks and provide general-purpose motor skills for learning multiple modes of whole-body control. Building on this, we propose HOVER (Humanoid Versatile Controller), a multi-mode policy distillation framework that consolidates diverse control modes into a unified policy. HOVER enables seamless transitions between control modes while preserving the distinct advantages of each, offering a robust and scalable solution for humanoid control across a wide range of modes. By eliminating the need for policy retraining for each control mode, our approach improves efficiency and flexibility for future humanoid applications.

    Link: https://hover-versatile-humanoid.github.io/

    An unofficial code sketch of the masked-command distillation idea follows this entry.

    7 min
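
An unofficial sketch of the multi-mode distillation idea from the episode: the student policy sees a command vector plus a binary mode mask (kinematic position, joint angle, or root-tracking entries switched on or off) and is regressed onto an oracle teacher's action, DAgger-style. The dimensions, the mask layout, and the MSE loss are illustrative assumptions.

```python
import torch
import torch.nn as nn

CMD_DIM, MASK_DIM, OBS_DIM, ACT_DIM = 64, 64, 48, 19

student = nn.Sequential(
    nn.Linear(OBS_DIM + CMD_DIM + MASK_DIM, 512), nn.ELU(),
    nn.Linear(512, 512), nn.ELU(),
    nn.Linear(512, ACT_DIM),
)

def distill_step(obs, command, mode_mask, teacher_action, optimizer):
    # Inactive command entries are zeroed; the mask itself is also an input so the
    # policy can tell "zero target" apart from "not tracked in this mode".
    masked_cmd = command * mode_mask
    inp = torch.cat([obs, masked_cmd, mode_mask], dim=-1)
    loss = ((student(inp) - teacher_action) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with a random batch and a random mode mask.
opt = torch.optim.Adam(student.parameters(), lr=3e-4)
loss = distill_step(
    obs=torch.randn(32, OBS_DIM),
    command=torch.randn(32, CMD_DIM),
    mode_mask=torch.randint(0, 2, (32, MASK_DIM)).float(),
    teacher_action=torch.randn(32, ACT_DIM),
    optimizer=opt,
)
print(loss)
```
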
  6. Precise and Dexterous Robotic Manipulation via Human-in-the-Loop Reinforcement Learning

    03/11/2024

    [00:00] Intro
    [00:24] Tackles RL challenges using a visual backbone, efficient RL, and human feedback.
    [01:20] Pretrained backbone boosts stability and exploration efficiency.
    [02:06] RLPD combines offline data and human corrections effectively.
    [02:57] Human-guided interventions reduce errors, enabling gradual autonomy.
    [03:42] System choices aid spatial generalization and safe exploration.
    [04:40] RL outperforms imitation learning in success and speed.
    [05:29] Funnel model shows reliable, focused policy improvement.
    [06:07] Learns both reactive and predictive tasks, enhancing flexibility.
    [06:57] HIL-SERL excels over baselines in integrating human data.
    [07:27] Outperforms diffusion policy on reactive tasks.
    [08:04] Future work: longer tasks, pretraining, unstructured testing.
    [08:57] Key takeaway: human-in-the-loop RL enables adaptable, efficient robotic policies.

    Authors: Jianlan Luo, Charles Xu, Jeffrey Wu, Sergey Levine

    Affiliations: UC Berkeley

    Abstract: Reinforcement learning (RL) holds great promise for enabling autonomous acquisition of complex robotic manipulation skills, but realizing this potential in real-world settings has been challenging. We present a human-in-the-loop vision-based RL system that demonstrates impressive performance on a diverse set of dexterous manipulation tasks, including dynamic manipulation, precision assembly, and dual-arm coordination. Our approach integrates demonstrations and human corrections, efficient RL algorithms, and other system-level design choices to learn policies that achieve near-perfect success rates and fast cycle times within just 1 to 2.5 hours of training. We show that our method significantly outperforms imitation learning baselines and prior RL approaches, with an average 2x improvement in success rate and 1.8x faster execution. Through extensive experiments and analysis, we provide insights into the effectiveness of our approach, demonstrating how it learns robust, adaptive policies for both reactive and predictive control strategies. Our results suggest that RL can indeed learn a wide range of complex vision-based manipulation policies directly in the real world within practical training times. We hope this work will inspire a new generation of learned robotic manipulation techniques, benefiting both industrial applications and research advancements. Videos and code are available at our project website this https URL.

    Link: https://hil-serl.github.io/

    An unofficial code sketch of the mixed offline/online sampling follows this entry.

    9 min
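
A small, unofficial sketch of the human-in-the-loop data flow described above: the operator can override the policy's action at any step, interventions are stored alongside demonstrations, and each update batch mixes offline (demo/correction) and online transitions roughly 50/50, in the spirit of RLPD. Buffer handling and the function names are assumptions, not the authors' implementation.

```python
import random

def sample_mixed_batch(demo_buffer, online_buffer, batch_size=256):
    # Symmetric sampling: half from demonstrations + human corrections,
    # half from the online replay buffer.
    half = batch_size // 2
    batch = random.sample(demo_buffer, min(half, len(demo_buffer)))
    batch += random.sample(online_buffer, min(batch_size - len(batch), len(online_buffer)))
    random.shuffle(batch)
    return batch

def select_action(policy_action, human_action=None):
    # If the operator intervenes, their corrective action replaces the policy's;
    # the caller routes that transition into the demo/correction buffer.
    intervened = human_action is not None
    return (human_action if intervened else policy_action), intervened

# Toy usage with placeholder transitions.
demo = [{"src": "demo", "i": i} for i in range(300)]
online = [{"src": "online", "i": i} for i in range(1000)]
batch = sample_mixed_batch(demo, online)
print(sum(t["src"] == "demo" for t in batch), "demo transitions in batch")
```
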
  7. Local Policies Enable Zero-shot Long Horizon Manipulation

    02/11/2024

    [00:00] Paper intro: Zero-shot robotic manipulation via local policies
    [00:26] Key challenges: Limited generalization and sim-to-real transfer
    [01:03] Local policies: Task decomposition through localized focus regions
    [01:38] Foundation models: VLMs for task understanding
    [02:07] Training approach: Simulation-based RL + visuomotor policy distillation
    [02:46] Implementation: Depth maps and impedance control system
    [03:25] Results: 97% simulation success, 76% real-world success
    [04:02] Challenges: Vision errors and collision handling
    [04:32] Limitations: Issues with reflective objects and complex contacts
    [05:48] Impact: Advancing autonomous robotic manipulation
    [06:36] Design: Modular system for continuous improvement
    [07:21] Dependencies: VLM and motion planner requirements

    Authors: Murtaza Dalal, Min Liu, Walter Talbott, Chen Chen, Deepak Pathak, Jian Zhang, Ruslan Salakhutdinov

    Affiliations: Carnegie Mellon University, Apple

    Abstract: Sim2real for robotic manipulation is difficult due to the challenges of simulating complex contacts and generating realistic task distributions. To tackle the latter problem, we introduce ManipGen, which leverages a new class of policies for sim2real transfer: local policies. Locality enables a variety of appealing properties including invariances to absolute robot and object pose, skill ordering, and global scene configuration. We combine these policies with foundation models for vision, language and motion planning and demonstrate SOTA zero-shot performance of our method to Robosuite benchmark tasks in simulation (97%). We transfer our local policies from simulation to reality and observe they can solve unseen long-horizon manipulation tasks with up to 8 stages with significant pose, object and scene configuration variation. ManipGen outperforms SOTA approaches such as SayCan, OpenVLA, LLMTrajGen and VoxPoser across 50 real-world manipulation tasks by 36%, 76%, 62% and 60% respectively. Video results at this https URL

    Link: https://mihdalal.github.io/manipgen/

    An unofficial code sketch of the VLM-planner-local-policy pipeline follows this entry.

    8 min
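
An unofficial, high-level sketch of the pipeline the abstract outlines: a VLM decomposes the instruction into (skill, target) stages, a motion planner drives the arm into the local region around each target, and a sim-trained local policy executes the contact-rich portion. Every object below (vlm, motion_planner, local_policies, robot) is a hypothetical placeholder, not an API from the release.

```python
def run_long_horizon_task(instruction, obs, vlm, motion_planner, local_policies, robot):
    # 1) Task decomposition with a vision-language model,
    #    e.g. [("pick", "mug"), ("place", "shelf")].
    stages = vlm.plan(instruction, obs.rgb)
    for skill, target in stages:
        # 2) Global motion: plan a collision-free approach to the target region.
        target_pose = vlm.locate(target, obs.rgb, obs.depth)
        motion_planner.move_to(robot, target_pose)
        # 3) Local policy: a pose-invariant, depth-conditioned skill runs
        #    closed-loop only within the local region around the target.
        policy = local_policies[skill]
        for _ in range(policy.max_steps):
            action = policy.act(obs.local_depth, robot.proprio)
            obs = robot.step(action)
    return obs
```
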
  8. MENTOR: Mixture-of-Experts Network with Task-Oriented Perturbation for Visual Reinforcement Learning

    30/10/2024

    [00:00] Introduction to MENTOR system for visual RL
    [00:29] Problem: Sample inefficiency in robotic learning
    [00:59] Innovation: Mixture of Experts (MoE) architecture
    [01:55] Results: MoE achieves 100% success in multi-task testing
    [02:33] Feature: Task-oriented perturbation for exploration
    [03:55] Real-world testing: 83% success in robotic tasks
    [04:33] Study: MoE and perturbation each boost performance by 30%
    [05:14] Future work: Optimizing MoE implementation
    [05:59] Challenge: Bridging simulation-to-real-world gap
    [06:45] Impact: Advancing practical robotics applications

    Authors: Suning Huang, Zheyu Zhang, Tianhai Liang, Yihan Xu, Zhehao Kou, Chenhao Lu, Guowei Xu, Zhengrong Xue, Huazhe Xu

    Affiliations: Tsinghua University, Shanghai Qi Zhi Institute, Shanghai AI Lab

    Abstract: Visual deep reinforcement learning (RL) enables robots to acquire skills from visual input for unstructured tasks. However, current algorithms suffer from low sample efficiency, limiting their practical applicability. In this work, we present MENTOR, a method that improves both the architecture and optimization of RL agents. Specifically, MENTOR replaces the standard multi-layer perceptron (MLP) with a mixture-of-experts (MoE) backbone, enhancing the agent's ability to handle complex tasks by leveraging modular expert learning to avoid gradient conflicts. Furthermore, MENTOR introduces a task-oriented perturbation mechanism, which heuristically samples perturbation candidates containing task-relevant information, leading to more targeted and effective optimization. MENTOR outperforms state-of-the-art methods across three simulation domains -- DeepMind Control Suite, Meta-World, and Adroit. Additionally, MENTOR achieves an average of 83% success rate on three challenging real-world robotic manipulation tasks including peg insertion, cable routing, and tabletop golf, which significantly surpasses the success rate of 32% from the current strongest model-free visual RL algorithm. These results underscore the importance of sample efficiency in advancing visual RL for real-world robotics. Experimental videos are available at this https URL.

    Link: https://arxiv.org/abs/2410.14972

    An unofficial code sketch of a mixture-of-experts layer follows this entry.

    9 min
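
To make the "MoE instead of MLP" idea concrete, here is an unofficial PyTorch sketch of a mixture-of-experts layer with a learned router and top-k expert selection, the kind of backbone swap the abstract describes. Expert count, top-k, and layer sizes are assumptions, and the task-oriented perturbation mechanism is not shown.

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, dim=256, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                                   # x: (B, dim)
        # Router picks the top-k experts per sample; output is their weighted sum.
        weights, idx = torch.topk(self.router(x).softmax(dim=-1), self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] = out[mask] + weights[mask, k:k + 1] * expert(x[mask])
        return out

# Toy usage: route a batch of 4 feature vectors through the MoE backbone layer.
features = MoELayer()(torch.randn(4, 256))
print(features.shape)                                       # torch.Size([4, 256])
```
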
