HuggingFace 每日AI论文速递 (HuggingFace Daily AI Paper Digest)

duan

Ten minutes a day to catch up on the day's trending AI papers on HuggingFace. Updated every weekday; subscriptions welcome. 📢 Find the podcast by searching for 【HuggingFace 每日AI论文速递】 on Xiaoyuzhou (小宇宙) and Apple Podcasts. 🖼 An illustrated text edition is also available: search for and follow 【AI速递】 on Xiaohongshu (小红书).

  1. 15 HR AGO

    2025.08.18 | Thinking Beyond Images; Self-Search Reinforcement Learning

    The 13 papers in this episode:
    [00:19] 💡 Thyme: Think Beyond Images
    [00:48] 🧠 SSRL: Self-Search Reinforcement Learning
    [01:16] 🚀 DINOv3 (a new milestone for vision foundation models)
    [01:42] 🔍 PaperRegister: Boosting Flexible-grained Paper Search via Hierarchical Register Indexing
    [02:13] 🚀 XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
    [02:40] 🚀 BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining
    [03:09] 🎨 StyleMM: Stylized 3D Morphable Face Model via Text-Driven Aligned Image Translation
    [03:35] 🌌 TexVerse: A Universe of 3D Objects with High-Resolution Textures
    [03:59] 🗣 FantasyTalking2: Timestep-Layer Adaptive Preference Optimization for Audio-Driven Portrait Animation
    [04:32] 💡 X-Node: Self-Explanation is All We Need
    [04:57] ⚙ Controlling Multimodal LLMs via Reward-guided Decoding
    [05:21] ✨ SPARSE Data, Rich Results: Few-Shot Semi-Supervised Learning via Class-Conditioned Image Translation
    [05:52] 🌍 MAESTRO: Masked AutoEncoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data
    【Follow us】 You can also find us on the platform below for more beyond the podcast. Xiaohongshu (小红书): AI速递

    7 min
  2. 3 DAYS AGO

    2025.08.15 | A MathBook System Boosts Mathematical Reasoning; Image Generation with Continuous Tokens

    The 12 papers in this episode:
    [00:23] 📚 We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning
    [00:50] 🚀 NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
    [01:17] 🎨 ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing
    [01:43] 🤔 PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts
    [02:14] 🚀 UI-Venus Technical Report: Building High-performance UI Agents with RFT
    [02:42] 🚀 STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer
    [03:11] ⚖ Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models
    [03:37] 🤔 HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs
    [04:08] 📚 A Survey on Diffusion Language Models
    [04:39] 💡 From Black Box to Transparency: Enhancing Automated Interpreting Assessment with Explainable AI in College Classrooms
    [05:03] 📸 Processing and acquisition traces in visual encoders: What does CLIP know about your camera?
    [05:30] ⚖ When Explainability Meets Privacy: An Investigation at the Intersection of Post-hoc Explainability and Differential Privacy in the Context of Natural Language Processing
    【Follow us】 You can also find us on the platform below for more beyond the podcast. Xiaohongshu (小红书): AI速递

    6 min
  3. 4 DAYS AGO

    2025.08.14 | A Molecular Reasoning Framework Improves Performance; Lightweight, Efficient Identity Control for Video

    The 15 papers in this episode:
    [00:17] 🧪 Mol-R1: Towards Explicit Long-CoT Reasoning in Molecule Discovery
    [00:38] ✨ Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation
    [01:06] 🎥 Story2Board: A Training-Free Approach for Expressive Storyboard Generation
    [01:32] 🛡 AWorld: Dynamic Multi-Agent System with Stable Maneuvering for Robust GAIA Problem Solving
    [01:59] ⚡ Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing
    [02:21] 🪄 Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation
    [02:51] 🧠 Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory
    [03:21] 🤝 Learning to Align, Aligning to Learn: A Unified Approach for Self-Optimized Alignment
    [03:48] 🚧 MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math Reasoning in Multimodal Large Language Models
    [04:12] 💡 Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models
    [04:32] 👻 IAG: Input-aware Backdoor Attack on VLMs for Visual Grounding
    [04:59] 💡 Noise Hypernetworks: Amortizing Test-Time Compute in Diffusion Models
    [05:21] 💻 VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models
    [05:47] ✨ GSFixer: Improving 3D Gaussian Splatting with Reference-Guided Video Diffusion Priors
    [06:13] ✨ CannyEdit: Selective Canny Control and Dual-Prompt Guidance for Training-Free Image Editing
    【Follow us】 You can also find us on the platform below for more beyond the podcast. Xiaohongshu (小红书): AI速递

    7 min
  4. 5 DAYS AGO

    2025.08.13 | Multimodal AI Breakthroughs; 3D World Generation

    The 15 papers in this episode:
    [00:22] 🤖 WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
    [00:45] 🌎 Matrix-3D: Omnidirectional Explorable 3D World Generation
    [01:17] 🚀 Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL
    [01:43] 🕺 CharacterShot: Controllable and Consistent 4D Character Animation
    [02:05] ⏳ Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models
    [02:29] 🔍 HierSearch: A Hierarchical Enterprise Deep Search Framework Integrating Local and Web Searches
    [02:55] 🧊 VertexRegen: Mesh Generation with Continuous Level of Detail
    [03:16] 🎯 Test-Time Reinforcement Learning for GUI Grounding via Region Consistency
    [03:43] ⏱ Train Long, Think Short: Curriculum Learning for Efficient Reasoning
    [04:05] 🎓 Aryabhata: An exam-focused language model for JEE Math
    [04:30] 🖼 UNCAGE: Contrastive Attention Guidance for Masked Generative Transformers in Text-to-Image Generation
    [04:52] 🧠 Democratizing Diplomacy: A Harness for Evaluating Any Large Language Model on Full-Press Diplomacy
    [05:20] 👋 Towards Affordance-Aware Robotic Dexterous Grasping with Human-like Priors
    [05:45] 📈 Adversarial Video Promotion Against Text-to-Video Retrieval
    [06:10] 🎬 Cut2Next: Generating Next Shot via In-Context Tuning
    【Follow us】 You can also find us on the platform below for more beyond the podcast. Xiaohongshu (小红书): AI速递

    7 min
  5. 6 DAYS AGO

    2025.08.12 | ReasonRank Boosts Passage-Ranking Reasoning; WideSearch Benchmarks Agentic Broad Info-Seeking

    The 15 papers in this episode:
    [00:18] 🧠 ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability
    [00:41] 🔍 WideSearch: Benchmarking Agentic Broad Info-Seeking
    [01:01] ✨ Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation
    [01:26] 🧠 Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization
    [01:59] 💬 UserBench: An Interactive Gym Environment for User-Centric Agents
    [02:22] 💡 SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens
    [02:50] 🌱 A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
    [03:15] 🔬 BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent
    [03:45] 🤖 MolmoAct: Action Reasoning Models that can Reason in Space
    [04:11] 🤖 OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks
    [04:38] 💡 Grove MoE: Towards Efficient and Superior MoE LLMs with Adjugate Experts
    [05:05] ⏳ Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future
    [05:29] 🗺 Reinforcement Learning in Vision: A Survey
    [05:59] 🔍 Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning
    [06:23] 🖌 Follow-Your-Shape: Shape-Aware Image Editing via Trajectory-Guided Region Control
    【Follow us】 You can also find us on the platform below for more beyond the podcast. Xiaohongshu (小红书): AI速递

    7 min
  6. 12 AUG

    2025.08.11 | GLM-4.5 Unifies Agentic Reasoning and Coding; Voost for High-Fidelity Virtual Try-On and Try-Off

    The 11 papers in this episode:
    [00:20] 🚀 GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
    [00:47] 👕 Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off
    [01:11] 🎯 InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization
    [01:34] 🧠 Memp: Exploring Agent Procedural Memory
    [02:03] ✂ Pruning the Unsurprising: Efficient Code Reasoning via First-Token Surprisal
    [02:29] 🪄 GENIE: Gaussian Encoding for Neural Radiance Fields Interactive Editing
    [02:50] 📚 Adapting Vision-Language Models Without Labels: A Comprehensive Survey
    [03:15] 🌍 MELLA: Bridging Linguistic Capability and Cultural Groundedness for Low-Resource Language MLLMs
    [03:37] 🧱 MeshLLM: Empowering Large Language Models to Progressively Understand and Generate 3D Mesh
    [04:02] 🎯 UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding
    [04:30] ✨ LightSwitch: Multi-view Relighting with Material-guided Diffusion
    【Follow us】 You can also find us on the platform below for more beyond the podcast. Xiaohongshu (小红书): AI速递

    5 min
