HuggingFace 每日AI论文速递

duan

In 10 minutes a day, catch up on the day's trending AI papers on HuggingFace. Updated every weekday — subscribe! 📢 Find the podcast by searching 【HuggingFace 每日AI论文速递】 on 小宇宙 and Apple Podcasts. 🖼 A text-and-image edition is also available: search for and follow 【AI速递】 on 小红书.

  1. 1D AGO

    2026.03.13 | A streaming spatial-memory 2B model punches above its weight; AI's "brute-force" page-flipping loses to human strategy

    [Sponsor] Listen to AI每周谈 on your commute — each week it recaps the past week's big AI news. Link 🔗https://www.xiaoyuzhoufm.com/podcast/688a34636f5a275f1cba40fd
    [Contents] The 15 papers in this episode:
    [00:32] 🧠 Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training
    [01:17] 🤔 Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
    [02:11] ⚡ IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
    [02:54] 🎬 Video-Based Reward Modeling for Computer-Use Agents
    [03:55] 🎬 DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning
    [04:46] 🎯 Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation
    [05:40] 🎬 DVD: Deterministic Video Depth Estimation with Generative Priors
    [06:29] 🖼 WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing
    [07:29] 🎬 ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation
    [08:24] 🧠 GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing
    [09:08] 🎬 EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation
    [09:55] ⚡ One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers
    [10:46] 🤖 OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams
    [11:29] 🧠 EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models
    [12:37] 🧠 XSkill: Continual Learning from Experience and Skills in Multimodal Agents
    [Follow us] You can also find us on the following platform for more beyond the podcast — 小红书: AI速递

    14 min
  2. 2D AGO

    2026.03.12 | Train agents just by talking to them; GPUs crack hundred-million-point k-means in seconds

    [Sponsor] Listen to AI每周谈 on your commute — each week it recaps the past week's big AI news. Link 🔗https://www.xiaoyuzhoufm.com/podcast/688a34636f5a275f1cba40fd
    [Contents] The 15 papers in this episode:
    [00:29] 🤖 OpenClaw-RL: Train Any Agent Simply by Talking
    [01:17] ⚡ Flash-KMeans: Fast and Memory-Efficient Exact K-Means
    [02:01] 👁 MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents
    [02:43] 🧠 In-Context Reinforcement Learning for Tool Use in Large Language Models
    [03:19] 🧠 ReMix: Reinforcement routing for mixtures of LoRAs in LLM finetuning
    [04:10] 📊 Can Large Language Models Keep Up? Benchmarking Online Adaptation to Continual Knowledge Streams
    [05:00] 🧠 RetroAgent: From Solving to Evolving via Retrospective Dual Intrinsic Feedback
    [05:50] 🔬 CodePercept: Code-Grounded Visual STEM Perception for MLLMs
    [06:44] 🎯 Prism-$Δ$: Differential Subspace Steering for Prompt Highlighting in Large Language Models
    [07:31] 🧠 LLM2Vec-Gen: Generative Embeddings from Large Language Models
    [08:22] ⚖ $V_{0.5}$: Generalist Value Model as a Prior for Sparse RL Rollouts
    [09:05] ⚡ Just-in-Time: Training-Free Spatial Acceleration for Diffusion Transformers
    [09:47] 🧠 Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning
    [10:39] 💬 RbtAct: Rebuttal as Supervision for Actionable Review Feedback Generation
    [11:14] 🧠 Hindsight Credit Assignment for Long-Horizon LLM Agents
    [Follow us] You can also find us on the following platform for more beyond the podcast — 小红书: AI速递

    12 min
  3. 3D AGO

    2026.03.11 | Geometry-guided RL for 3D editing; masked diffusion goes multimodal

    [Sponsor] Listen to AI每周谈 on your commute — each week it recaps the past week's big AI news. Link 🔗https://www.xiaoyuzhoufm.com/podcast/688a34636f5a275f1cba40fd
    [Contents] The 15 papers in this episode:
    [00:32] 🎨 Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing
    [01:11] 🔄 Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
    [02:06] 🧠 Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs
    [02:55] 🚀 MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data
    [03:41] 🧠 InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing
    [04:34] 🏸 Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports
    [05:15] 🔍 Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs
    [06:01] 🗣 Fish Audio S2 Technical Report
    [06:48] 🎧 Are Audio-Language Models Listening? Audio-Specialist Heads for Adaptive Audio Steering
    [07:45] 📱 MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants
    [08:48] 🔍 VLM-SubtleBench: How Far Are VLMs from Human-Level Subtle Comparative Reasoning?
    [09:34] 🗣 Do What I Say: A Spoken Prompt Dataset for Instruction-Following
    [10:20] 🎬 Streaming Autoregressive Video Generation via Diagonal Distillation
    [11:08] 🧪 Test-Driven AI Agent Definition (TDAD): Compiling Tool-Using Agents from Behavioral Specifications
    [11:58] ⚖ Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards
    [Follow us] You can also find us on the following platform for more beyond the podcast — 小红书: AI速递

    13 min
  4. 4D AGO

    2026.03.10 | Scanning long stories for consistency bugs; 3D spatial-intelligence annotation with zero human labeling

    [Sponsor] Listen to AI每周谈 on your commute — each week it recaps the past week's big AI news. Link 🔗https://www.xiaoyuzhoufm.com/podcast/688a34636f5a275f1cba40fd
    [Contents] The 15 papers in this episode:
    [00:32] 📖 Lost in Stories: Consistency Bugs in Long Story Generation by LLMs
    [01:16] 🧠 Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence
    [02:17] 📈 How Far Can Unsupervised RLVR Scale LLM Training?
    [03:11] 📊 Believe Your Model: Distribution-Guided Confidence Calibration
    [04:12] 🧠 LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory
    [05:07] 🎨 CARE-Edit: Condition-Aware Routing of Experts for Contextual Image Editing
    [05:51] 💻 CoCo: Code as CoT for Text-to-Image Preview and Rare Concept Generation
    [06:30] 🎬 HiAR: Efficient Autoregressive Long Video Generation via Hierarchical Denoising
    [07:36] 📊 $OneMillion-Bench: How Far are Language Agents from Human Experts?
    [08:24] ⚡ NLE: Non-autoregressive LLM-based ASR by Transcript Editing
    [09:17] 🧠 Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness
    [10:03] 🚀 TDM-R1: Reinforcing Few-Step Diffusion Models with Non-Differentiable Reward
    [11:02] 📈 Unlocking Data Value in Finance: A Study on Distillation and Difficulty-Aware Training
    [11:40] 🤖 Scaling Agentic Capabilities, Not Context: Efficient Reinforcement Finetuning for Large Toolspaces
    [12:36] 🔍 PIRA-Bench: A Transition from Reactive GUI Agents to GUI-based Proactive Intent Recommendation Agents
    [Follow us] You can also find us on the following platform for more beyond the podcast — 小红书: AI速递

    14 min
  5. 5D AGO

    2026.03.09 | LLMs as vision encoders; BandPO clips smarter

    [Sponsor] Listen to AI每周谈 on your commute — each week it recaps the past week's big AI news. Link 🔗https://www.xiaoyuzhoufm.com/podcast/688a34636f5a275f1cba40fd
    [Contents] The 15 papers in this episode:
    [00:34] 🐧 Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
    [01:16] 🚀 BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning
    [02:02] ⚡ Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model
    [02:43] 🚀 Progressive Residual Warmup for Language Model Pretraining
    [03:41] 🎬 WildActor: Unconstrained Identity-Preserving Video Generation
    [04:38] 🧠 RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies
    [05:31] 🤔 Reasoning Models Struggle to Control their Chains of Thought
    [06:13] 🧭 HiMAP-Travel: Hierarchical Multi-Agent Planning for Long-Horizon Constrained Travel
    [06:59] ⚡ FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling
    [07:49] 🚀 $π$-StepNFT: Wider Space Needs Finer Steps in Online RL for Flow-based VLAs
    [08:32] 🧠 Mario: Multimodal Graph Reasoning with Large Language Models
    [09:22] 🎬 Physical Simulator In-the-Loop Video Generation
    [10:14] 🧩 Dynamic Chunking Diffusion Transformer
    [11:05] 🔄 SLER-IR: Spherical Layer-wise Expert Routing for All-in-One Image Restoration
    [11:50] 🧊 PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction
    [Follow us] You can also find us on the following platform for more beyond the podcast — 小红书: AI速递

    13 min
  6. MAR 7

    [Month-End Special] February's hottest AI papers | VBVR forges visual reasoning from a million videos; OPUS picks pretraining data every iteration to save compute

    [Sponsor] Listen to AI每周谈 on your commute — each week it recaps the past week's big AI news. Link 🔗https://www.xiaoyuzhoufm.com/podcast/688a34636f5a275f1cba40fd
    [Contents] The 10 papers in this episode:
    [00:44] TOP1(🔥508) | 🧠 A Very Big Video Reasoning Suite
    [03:11] TOP2(🔥343) | 🚀 OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration
    [05:13] TOP3(🔥312) | 🤖 Green-VLA: Staged Vision-Language-Action Model for Generalist Robots
    [07:12] TOP4(🔥278) | 📈 Weak-Driven Learning: How Weak Agents make Strong Agents Stronger
    [09:36] TOP5(🔥262) | 🧠 ERNIE 5.0 Technical Report
    [11:23] TOP6(🔥259) | 💭 Does Your Reasoning Model Implicitly Know When to Stop Thinking?
    [13:34] TOP7(🔥254) | 🤖 Kimi K2.5: Visual Agentic Intelligence
    [15:14] TOP8(🔥240) | 🧠 Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs
    [17:24] TOP9(🔥216) | ⚖ VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training
    [19:30] TOP10(🔥213) | 🍌 PaperBanana: Automating Academic Illustration for AI Scientists
    [Follow us] You can also find us on the following platform for more beyond the podcast — 小红书: AI速递

    23 min
  7. MAR 6

    2026.03.06 | MOOSE-Star breaks the training barrier for scientific discovery; DARE turns LLMs into rigorous statistics assistants

    [Sponsor] Listen to AI每周谈 on your commute — each week it recaps the past week's big AI news. Link 🔗https://www.xiaoyuzhoufm.com/podcast/688a34636f5a275f1cba40fd
    [Contents] The 15 papers in this episode:
    [00:32] 🚀 MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier
    [01:50] 📊 DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval
    [02:39] 🧠 SkillNet: Create, Evaluate, and Connect AI Skills
    [03:28] 📱 RoboPocket: Improve Robot Policies Instantly with Your Phone
    [04:15] 🎨 HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images
    [04:59] 🔍 AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios
    [05:37] 🔬 SageBwd: A Trainable Low-bit Attention
    [06:21] 🧠 Large Multimodal Models as General In-Context Classifiers
    [07:01] ⚖ MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models
    [07:54] 🌍 DreamWorld: Unified World Modeling in Video Generation
    [08:34] 🎬 RealWonder: Real-Time Physical Action-Conditioned Video Generation
    [09:35] 🧠 Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline
    [10:14] 🧠 On-Policy Self-Distillation for Reasoning Compression
    [10:55] 🤖 KARL: Knowledge Agents via Reinforcement Learning
    [11:39] 🔍 Locality-Attending Vision Transformer
    [Follow us] You can also find us on the following platform for more beyond the podcast — 小红书: AI速递

    13 min

Ratings & Reviews

5 out of 5 (2 Ratings)

