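HuggingFace 每日AI论文速递 (HuggingFace Daily AI Paper Digest)

duan

5.0 (3 ratings)
Technology
Updated daily

Spend 10 minutes a day catching up on the day's trending AI papers on HuggingFace. New episodes every weekday; subscriptions welcome. 📢 To find the podcast, search for 【HuggingFace 每日AI论文速递】 on Xiaoyuzhou or Apple Podcasts. 🖼 An illustrated text edition is also available: search for and follow 【AI速递】 on Xiaohongshu.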

1 day ago

[Weekend Special] Hottest AI Papers, Week 2 of November | Video generation as reasoning; SVG sketches become code

The 5 papers in this episode:
[00:31] TOP1(🔥137) | 🎬 Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
[02:43] TOP2(🔥95) | 🖼 VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation
[05:12] TOP3(🔥90) | 🚀 Diffusion Language Models are Super Data Learners
[07:18] TOP4(🔥88) | 👁 Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization
[09:24] TOP5(🔥79) | 🧠 Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation
[Follow us] You can also find us on the following platform for more than what is in the podcast. Xiaohongshu: AI速递

12 min
2 days ago

2025.11.07 | A new paradigm for video reasoning; interacting with images to aid thinking

The 12 papers in this episode:
[00:21] 🎬 Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
[00:58] 🧠 V-Thinker: Interactive Thinking with Images
[01:39] 🧠 Scaling Agent Learning via Experience Synthesis
[02:23] 🧠 Cambrian-S: Towards Spatial Supersensing in Video
[03:06] 🖥 GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents
[03:51] 📄 NVIDIA Nemotron Nano V2 VL (an efficient vision-language model for document and long-video understanding)
[04:28] 🎟 The Strong Lottery Ticket Hypothesis for Multi-Head Attention Mechanisms
[05:12] 🕵 Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts
[05:48] ⚽ Learning Vision-Driven Reactive Soccer Skills for Humanoid Robots
[06:18] 🔍 Contamination Detection for VLMs using Multi-Modal Semantic Perturbation
[06:53] 🎧 How to Evaluate Speech Translation with Source-Aware Neural MT Metrics
[07:32] 🚀 RDMA Point-to-Point Communication for LLM Systems

8 min
3 days ago

2025.11.06 | Diffusion models are data-efficient; audio-video lip sync

The 9 papers in this episode:
[00:17] 🚀 Diffusion Language Models are Super Data Learners
[01:06] 🎬 UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions
[01:42] 🧩 LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation
[02:25] 📊 Orion-MSP: Multi-Scale Sparse Attention for Tabular In-Context Learning
[03:15] 📊 TabTune: A Unified Library for Inference and Fine-Tuning Tabular Foundation Models
[03:46] 🦾 Kinematify: Open-Vocabulary Synthesis of High-DoF Articulated Objects
[04:30] 🧠 MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity
[05:06] 📈 LiveTradeBench: Seeking Real-World Alpha with Large Language Models
[05:55] 🔍 Let Multimodal Embedders Learn When to Augment Query via Adaptive Query Augmentation (the show notes call this the "Solomon" method)

7 min
4 days ago

2025.11.05 | Vector sketches test coding; draw first, then think, to make up for vision

The 15 papers in this episode:
[00:21] 🖼 VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation
[01:12] 🧠 When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought
[01:48] ⚖ When Modalities Conflict: How Unimodal Reasoning Uncertainty Governs Preference Dynamics in MLLMs
[02:36] 🪙 Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR
[03:11] 🧠 Brain-IT: Image Reconstruction from fMRI via Brain-Interaction Transformer
[03:49] 👁 Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization
[04:33] 🎨 LTD-Bench: Evaluating Large Language Models by Letting Them Draw (drawing as a test of spatial reasoning)
[05:15] 🤖 TWIST2: Scalable, Portable, and Holistic Humanoid Data Collection System
[06:01] 🗜 Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models
[06:46] 🏆 CodeClash: Benchmarking Goal-Oriented Software Engineering
[07:29] 🎭 VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models
[08:03] 🧠 BRAINS: A Retrieval-Augmented System for Alzheimer's Detection and Monitoring
[08:42] 📊 ChartM$^3$: A Multi-Stage Code-Driven Pipeline for Constructing Multi-Dimensional and Multi-Step Visual Reasoning Data in Chart Comprehension
[09:45] 📊 TabDSR: Decompose, Sanitize, and Reason for Complex Numerical Reasoning in Tabular Data
[10:17] 🤖 iFlyBot-VLA Technical Report (a new framework for large-scale vision-language-action models)

12 min
5 days ago

2025.11.04 | Ultra-sparse MoE activates a trillion parameters; vision models beat GNNs by looking at graphs

The 15 papers in this episode:
[00:23] 🧠 Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation
[01:03] 👁 The Underappreciated Power of Vision Models for Graph Structural Understanding
[01:38] 💡 UniLumos: Fast and Unified Image and Video Relighting with Physics-Plausible Feedback
[02:37] 🕸 Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph
[03:11] 🤖 PHUMA: Physically-Grounded Humanoid Locomotion Dataset
[03:48] 🔭 ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool Use
[04:30] 🧠 UniREditBench: A Unified Reasoning-based Image Editing Benchmark
[05:23] 🔄 ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation
[06:04] 🌍 Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum
[06:44] 🌍 World Simulation with Video Foundation Models for Physical AI
[07:20] 🧠 TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning
[08:03] 🧭 NaviTrace: Evaluating Embodied Navigation of Vision-Language Models
[08:45] 📏 Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench
[09:23] 🧭 Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models
[10:07] 🐱 LongCat-Flash-Omni Technical Report (a 560-billion-parameter open-source omni-modal model for real-time audio-visual interaction)

11 min
6 days ago

2025.11.03 | OS-Sentinel guards mobile operation safety in real time; ThinkMorph lets small models draw while they think

The 15 papers in this episode:
[00:21] 🛡 OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows
[01:13] 🧠 ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
[01:49] ⚔ INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
[02:38] 🤖 $π_\texttt{RL}$: Online RL Fine-tuning for Flow-based Vision-Language-Action Models
[03:26] 🚀 Continuous Autoregressive Language Models
[03:54] 🧭 Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning
[04:37] 🎯 HyperClick: Advancing Reliable GUI Grounding via Uncertainty Calibration
[05:15] 🎯 Defeating the Training-Inference Mismatch via FP16
[05:52] 🪜 Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals
[06:28] 🧭 Revisiting Multimodal Positional Encoding in Vision-Language Models
[07:09] ⚡ Higher-order Linear Attention
[07:55] 🌐 Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
[08:36] 🔬 The Denario project: Deep knowledge AI agents for scientific discovery
[09:14] 🎯 Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning
[09:51] 🏙 Mask-to-Height: A YOLOv11-Based Architecture for Joint Building Instance Segmentation and Height Classification from Satellite Imagery

11 min
November 2

[Month-End Special] Hottest AI Papers of October | Dragon Hatchling (BDH) is sparse and interpretable; a tiny 7M recursive model crushes large models

The 10 papers in this episode:
[00:30] TOP1(🔥522) | 🐣 The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain
[02:31] TOP2(🔥462) | 🧠 Less is More: Recursive Reasoning with Tiny Networks
[04:48] TOP3(🔥255) | 🌱 Agent Learning via Early Experience
[07:04] TOP4(🔥182) | 🔄 Scaling Latent Reasoning via Looped Language Models
[09:11] TOP5(🔥170) | 🔥 MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use
[11:18] TOP6(🔥169) | 🚀 QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs
[13:10] TOP7(🔥167) | 🎼 Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations
[15:38] TOP8(🔥160) | 🧠 Diffusion Transformers with Representation Autoencoders
[17:59] TOP9(🔥144) | 🧠 A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning
[20:09] TOP10(🔥142) | 🎯 Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model

23 min
November 1

[Weekend Special] Hottest AI Papers, Week 1 of November | Looped models save parameters and reason better; Concerto's 2D-3D self-supervision boosts scores

The 5 papers in this episode:
[00:35] TOP1(🔥174) | 🔄 Scaling Latent Reasoning via Looped Language Models
[02:30] TOP2(🔥166) | 🎼 Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations
[05:17] TOP3(🔥115) | 🧩 ReCode: Unify Plan and Action for Universal Granularity Control (unifying planning and action with recursive code)
[07:02] TOP4(🔥94) | 🗣 InteractComp: Evaluating Search Agents With Ambiguous Queries
[09:14] TOP5(🔥90) | 🧠 DeepAgent: A General Reasoning Agent with Scalable Toolsets

12 min

Rating: 5 out of 5 (3 ratings)

Creator: duan
Years active: 2024 - 2025
Episodes: 433
Content rating: Suitable for children
Show website: HuggingFace 每日AI论文速递

