HuggingFace Daily AI Paper Digest (HuggingFace 每日AI论文速递)

duan

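A 10-minute daily rundown of the day's trending AI papers on HuggingFace. Updated every weekday; subscriptions welcome. 📢 To find the podcast, search for 【HuggingFace 每日AI论文速递】 on Xiaoyuzhou or Apple Podcasts. 🖼 An illustrated text version is also available: search for and follow 【AI速递】 on Xiaohongshu.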

  1. 10 HOURS AGO

    2025.10.31 | Emu3.5 unifies spatiotemporal prediction; diffusion prompts drive robots

    The 15 papers in this episode:

    - [00:26] 🌍 Emu3.5: Native Multimodal Models are World Learners
    - [01:04] 🤖 Exploring Conditions for Diffusion models in Robotic Control
    - [01:42] 🎬 Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
    - [02:22] ⚡ Kimi Linear: An Expressive, Efficient Attention Architecture
    - [02:55] 🧮 AMO-Bench: Large Language Models Still Struggle in High School Math Competitions
    - [03:35] 🕺 The Quest for Generalizable Motion Generation: Data, Model, and Evaluation
    - [04:17] 🌐 Surfer 2: The Next Generation of Cross-Platform Computer Use Agents
    - [04:42] 🌍 OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes
    - [05:21] 🤝 The Era of Agentic Organization: Learning to Organize with Language Models
    - [05:57] 🧠 Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
    - [06:32] 🕹 Can Agent Conquer Web? Exploring the Frontiers of ChatGPT Atlas Agent in Web Games
    - [07:10] 🏥 EHR-R1: A Reasoning-Enhanced Foundational Language Model for Electronic Health Record Analysis
    - [07:55] 📄 OmniLayout: Enabling Coarse-to-Fine Learning with LLMs for Universal Document Layout Generation
    - [08:38] 🎯 MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency
    - [09:09] 🤖 Magentic Marketplace: An Open-Source Environment for Studying Agentic Markets

    Follow us: you can also find us on the following platform for more beyond the podcast. Xiaohongshu: AI速递

    10min
  2. 1 DAY AGO

    2025.10.30 | A 7B model punches above its weight at writing code from images; RL breaks through on thinking with videos

    The 15 papers in this episode:

    - [00:22] 👁 JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence
    - [01:00] 🧠 Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning
    - [01:55] 🔄 ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization
    - [02:42] 🔄 Scaling Latent Reasoning via Looped Language Models
    - [03:22] 🧠 Reasoning-Aware GRPO using Process Mining
    - [03:52] 🎬 VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning
    - [04:33] 🏆 The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
    - [05:11] 🖼 RegionE: Adaptive Region-Aware Generation for Efficient Image Editing
    - [06:22] 🎮 ChronoPlay: A Framework for Modeling Dual Dynamics and Authenticity in Game RAG Benchmarks
    - [06:58] 🧭 Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks
    - [07:44] 🔗 PairUni: Pairwise Training for Unified Multimodal Language Models
    - [08:33] ⚡ Parallel Loop Transformer for Efficient Test-Time Computation Scaling
    - [09:08] 🚗 Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks
    - [09:55] 🧬 ODesign: A World Model for Biomolecular Interaction Design
    - [10:31] 🧬 Evolving Diagnostic Agents in a Virtual Clinical Environment

    11min
  3. 2 DAYS AGO

    2025.10.29 | Tongyi DeepResearch technical report; a small model folds its memory to beat a 671B giant

    The 10 papers in this episode:

    - [00:23] 🔍 Tongyi DeepResearch Technical Report
    - [01:00] 🧠 AgentFold: Long-Horizon Web Agents with Proactive Context Management
    - [01:36] 🤖 RoboOmni: Proactive Robot Manipulation in Omni-modal Context
    - [02:33] 🎮 Game-TARS: Pretrained Foundation Models for Scalable Generalist Multimodal Game Agents
    - [03:05] 🎬 Uniform Discrete Diffusion with Metric Path for Video Generation
    - [03:42] 🛠 OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents
    - [04:28] 🎨 Group Relative Attention Guidance for Image Editing
    - [05:14] 🚀 WebLeaper: Empowering Efficiency and Efficacy in WebAgent via Enabling Info-Rich Seeking
    - [06:04] 🧭 Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance
    - [07:01] 🧠 ParallelMuse: Agentic Parallel Thinking for Deep Information Seeking

    8min
  4. 3 DAYS AGO

    2025.10.28 | Point Transformer grows spatial representations through label-free alignment; recursive code unifies coarse and fine granularity

    The 15 papers in this episode:

    - [00:23] 🎼 Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations
    - [01:06] 🧩 ReCode: Unify Plan and Action for Universal Granularity Control
    - [01:44] 🤖 A Survey of Data Agents: Emerging Paradigm or Overstated Hype?
    - [02:23] 🌾 FARMER: Flow AutoRegressive Transformer over Pixels
    - [03:07] 🤖 VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting
    - [03:45] 🎭 Lookahead Anchoring: Preserving Character Identity in Audio-Driven Human Animation
    - [04:17] 🤖 ACG: Action Coherence Guidance for Flow-based VLA models
    - [04:56] 🔍 $\text{E}^2\text{Rank}$: Your Text Embedding can Also be an Effective and Efficient Listwise Reranker
    - [05:40] 🌐 Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences
    - [06:30] 🔍 PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity
    - [07:06] 🧠 Knocking-Heads Attention
    - [07:42] 🧩 IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction
    - [08:30] 🎯 The Best of N Worlds: Aligning Reinforcement Learning with Best-of-N Sampling via max@k Optimisation
    - [09:14] 🥯 LightBagel: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation
    - [09:51] 🧠 LimRank: Less is More for Reasoning-Intensive Information Reranking

    11min
  5. 4 DAYS AGO

    2025.10.27 | DeepAgent: single-pass reasoning plus ToolPO; a video-as-prompt DiT instantly controls a hundred kinds of semantics

    The 15 papers in this episode:

    - [00:27] 🧠 DeepAgent: A General Reasoning Agent with Scalable Toolsets
    - [01:01] 🎬 Video-As-Prompt: Unified Semantic Control for Video Generation
    - [01:35] 🔧 From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model
    - [02:14] 🧩 Sample By Step, Optimize By Chunk: Chunk-Level GRPO For Text-to-Image Generation
    - [02:51] 🧠 A Definition of AGI
    - [03:23] 🧩 Sparser Block-Sparse Attention via Token Permutation
    - [04:14] 🧭 UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning
    - [04:57] 🧠 Reasoning with Sampling: Your Base Model is Smarter Than You Think
    - [05:30] 🧠 RECALL: REpresentation-aligned Catastrophic-forgetting ALLeviation via Hierarchical Model Merging
    - [06:08] 📐 Visual Diffusion Models are Geometric Solvers
    - [06:56] 🌍 WorldGrow: Generating Infinite 3D World
    - [07:35] 🎬 RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling
    - [08:14] 🔗 Model Merging with Functional Dual Anchors
    - [08:49] 🧭 Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs
    - [09:34] 📊 Document Understanding, Measurement, and Manipulation Using Category Theory

    10min
  6. OCT 24

    2025.10.24 | AdaSPEC selects 40% of tokens for a 20% speedup; AutoPage generates interactive web pages for 10 cents

    The 15 papers in this episode:

    - [00:23] 🎯 AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders
    - [00:57] 🤖 Human-Agent Collaborative Paper-to-Page Crafting for Under $0.1
    - [01:35] 🔍 Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence
    - [02:06] 🎬 HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
    - [02:52] 🌀 Loopholing Discrete Diffusion: Deterministic Bypass of the Sampling Wall
    - [03:33] 💎 Every Question Has Its Own Value: Reinforcement Learning with Explicit Human Values
    - [04:06] ⚖ The Massive Legal Embedding Benchmark (MLEB)
    - [04:48] 🔍 DyPE: Dynamic Position Extrapolation for Ultra High Resolution Diffusion
    - [05:33] 🕵 Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence
    - [06:12] 🤖 Search Self-play: Pushing the Frontier of Agent Capability without Supervision
    - [06:56] 🎭 Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations
    - [07:42] 🖼 LayerComposer: Interactive Personalized T2I via Spatially-Aware Layered Canvas
    - [08:10] 🎧 SAKE: Towards Editing Auditory Attribute Knowledge of Large Audio-Language Models
    - [08:51] 🖼 ARGenSeg: Image Segmentation with Autoregressive Image Generation Model
    - [09:39] 🧩 Seed3D 1.0: From Images to High-Fidelity Simulation-Ready 3D Assets

    11min
  7. OCT 23

    2025.10.23 | Linear attention cuts GPU memory tenfold; dynamically clipped PPO raises scores stably

    The 15 papers in this episode:

    - [00:19] 🧠 Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning
    - [00:59] ⚖ BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping
    - [01:40] 🧠 LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts
    - [02:18] 🌍 GigaBrain-0: A World Model-Powered Vision-Language-Action Model
    - [02:49] 🔄 Language Models are Injective and Hence Invertible
    - [03:25] 📹 VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos
    - [04:01] 📲 DaMo: Data Mixing Optimizer in Fine-tuning Multimodal LLMs for Mobile Phone Agents
    - [04:55] 🚀 Unified Reinforcement and Imitation Learning for Vision-Language Models
    - [05:28] 🖼 Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing
    - [06:17] 📊 FinSight: Towards Real-World Financial Deep Research
    - [07:06] 🧠 Are they lovers or friends? Evaluating LLMs' Social Reasoning in English and Korean Dialogues
    - [07:43] 🌍 OmniNWM: Omniscient Driving Navigation World Models
    - [08:28] 🕳 Attention Sinks in Diffusion Language Models
    - [09:04] 📄 olmOCR 2: Unit Test Rewards for Document OCR
    - [09:42] 🧠 KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Augmentations and Constraints

    11min

Ratings and reviews

5 out of 5 (2 ratings)

