HuggingFace 每日AI论文速递 (HuggingFace Daily AI Paper Digest)

duan

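Ten minutes a day to get you up to speed on the day's trending AI papers on HuggingFace. Updated every weekday; subscriptions welcome. 📢 To find the podcast, search for 【HuggingFace 每日AI论文速递】 on Xiaoyuzhou or Apple Podcasts. 🖼 An illustrated text edition is also available: search for and follow 【AI速递】 on Xiaohongshu.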

  1. 11 hours ago

    2025.08.22 | Scientific multimodal model narrows the gap; GUI automation tackles challenges

    This episode covers the following 15 papers:

    - [00:22] 🧪 Intern-S1: A Scientific Multimodal Foundation Model
    - [00:46] 🤖 Mobile-Agent-v3: Foundamental Agents for GUI Automation
    - [01:10] ✅ Deep Think with Confidence
    - [01:31] 🤔 LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
    - [02:01] 🎬 Waver: Wave Your Way to Lifelike Video Generation
    - [02:25] 🏞 SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass
    - [02:56] 📚 A Survey on Large Language Model Benchmarks
    - [03:20] 🤸 ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling
    - [03:46] 🎨 Visual Autoregressive Modeling for Instruction-Guided Image Editing
    - [04:15] 🤖 aiXiv: A Next-Generation Open Access Ecosystem for Scientific Discovery Generated by AI Scientists
    - [04:40] 🗺 "Does the cafe entrance look accessible? Where is the door?" Towards Geospatial AI Agents for Visual Inquiries
    - [05:12] 🔍 When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding
    - [05:44] 💰 Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models
    - [06:08] ⚡ Snap-Snap: Taking Two Images to Reconstruct 3D Human Gaussians in Milliseconds
    - [06:37] 🫂 INTIMA: A Benchmark for Human-AI Companionship Behavior

    Follow us: you can also find us on the following platform for more beyond the podcast. Xiaohongshu: AI速递

    7 min
  2. 1 day ago

    2025.08.21 | Cognitive diagnosis for financial LLMs; DuPO optimizes self-verification

    This episode covers the following 15 papers:

    - [00:22] 🧠 From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models
    - [00:49] ✅ DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization
    - [01:17] 🔮 FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction
    - [01:44] 🏗 MeshCoder: LLM-Powered Structured Mesh Code Generation from Point Clouds
    - [02:14] 🪄 Tinker: Diffusion's Gift to 3D--Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization
    - [02:40] 🤖 From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery
    - [03:06] ⚙ Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs
    - [03:37] 🛠 MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers
    - [04:12] ⚡ NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model
    - [04:45] 🤖 RynnEC: Bringing MLLMs into Embodied World
    - [05:12] ⚖ On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting
    - [05:41] 🧐 ViExam: Are Vision Language Models Better than Humans on Vietnamese Multimodal Exam Questions?
    - [06:08] ⚡ Leuvenshtein: Efficient FHE-based Edit Distance Computation with Single Bootstrap per Cell
    - [06:40] 📏 Local Scale Equivariance with Latent Deep Equilibrium Canonicalizer
    - [07:06] 🤔 mSCoRe: a $M$ultilingual and Scalable Benchmark for $S$kill-based $Co$mmonsense $Re$asoning

    Follow us: you can also find us on the following platform for more beyond the podcast. Xiaohongshu: AI速递

    8 min
  3. 2 days ago

    2025.08.20 | Chain-of-Agents boosts efficiency; improved 3D reconstruction for long videos

    This episode covers the following 15 papers:

    - [00:23] 🤖 Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL
    - [00:52] 🎥 LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos
    - [01:13] 🛠 Prompt Orchestration Markup Language
    - [01:33] 🎨 MultiRef: Controllable Image Generation with Multiple Visual References
    - [02:00] 🤖 Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge
    - [02:29] 🦾 Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation
    - [02:59] ✅ Mind the Generation Process: Fine-Grained Confidence Estimation During LLM Generation
    - [03:22] 🎨 Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer
    - [03:45] 🪄 OmniTry: Virtual Try-On Anything without Masks
    - [04:08] ⏰ A Stitch in Time Saves Nine: Proactive Self-Refinement for Language Models
    - [04:32] 👂 Advances in Speech Separation: Techniques, Challenges, and Future Trends
    - [05:04] 😥 Leveraging Large Language Models for Predictive Analysis of Human Misery
    - [05:27] ⏳ TempFlow-GRPO: When Timing Matters for GRPO in Flow Models
    - [05:58] 🗺 CAMAR: Continuous Actions Multi-Agent Routing
    - [06:25] 🔒 Copyright Protection for Large Language Models: A Survey of Methods, Challenges, and Trends

    Follow us: you can also find us on the following platform for more beyond the podcast. Xiaohongshu: AI速递

    7 min
  4. 3 days ago

    2025.08.19 | Ovis2.5 advances multimodal capability; ComoRAG improves long-narrative reasoning

    This episode covers the following 15 papers:

    - [00:20] ✨ Ovis2.5 Technical Report
    - [00:51] 🧠 ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning
    - [01:14] 🎥 4DNeX: Feed-Forward 4D Generative Modeling Made Easy
    - [01:38] ✨ Next Visual Granularity Generation
    - [01:57] ⚡ Speed Always Wins: A Survey on Efficient Architectures for Large Language Models
    - [02:30] 🤔 Has GPT-5 Achieved Spatial Intelligence? An Empirical Study
    - [03:00] 🎮 HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds
    - [03:26] ❗ When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs
    - [03:56] 🎮 Matrix-Game 2.0: An Open-Source, Real-Time, and Streaming Interactive World Model
    - [04:21] 💡 Lumen: Consistent Video Relighting and Harmonious Background Replacement with Video Generative Models
    - [04:47] 🌐 G-CUT3R: Guided 3D Reconstruction with Camera and Depth Prior Integration
    - [05:15] ✨ S^2-Guidance: Stochastic Self Guidance for Training-Free Enhancement of Diffusion Models
    - [05:49] 👂 Representing Speech Through Autoregressive Prediction of Cochlear Tokens
    - [06:09] 💡 Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping
    - [06:40] 🎬 Precise Action-to-Video Generation Through Visual Action Prompts

    Follow us: you can also find us on the following platform for more beyond the podcast. Xiaohongshu: AI速递

    8 min
  5. 4 days ago

    2025.08.18 | Thinking beyond images; self-search reinforcement learning

    This episode covers the following 13 papers:

    - [00:19] 💡 Thyme: Think Beyond Images
    - [00:48] 🧠 SSRL: Self-Search Reinforcement Learning
    - [01:16] 🚀 DINOv3 (a new milestone for visual foundation models)
    - [01:42] 🔍 PaperRegister: Boosting Flexible-grained Paper Search via Hierarchical Register Indexing
    - [02:13] 🚀 XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
    - [02:40] 🚀 BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining
    - [03:09] 🎨 StyleMM: Stylized 3D Morphable Face Model via Text-Driven Aligned Image Translation
    - [03:35] 🌌 TexVerse: A Universe of 3D Objects with High-Resolution Textures
    - [03:59] 🗣 FantasyTalking2: Timestep-Layer Adaptive Preference Optimization for Audio-Driven Portrait Animation
    - [04:32] 💡 X-Node: Self-Explanation is All We Need
    - [04:57] ⚙ Controlling Multimodal LLMs via Reward-guided Decoding
    - [05:21] ✨ SPARSE Data, Rich Results: Few-Shot Semi-Supervised Learning via Class-Conditioned Image Translation
    - [05:52] 🌍 MAESTRO: Masked AutoEncoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data

    Follow us: you can also find us on the following platform for more beyond the podcast. Xiaohongshu: AI速递

    7 min
  6. August 16

    2025.08.15 | A math-reasoning MathBook system boosts model capability; image generation with continuous tokens

    This episode covers the following 12 papers:

    - [00:23] 📚 We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning
    - [00:50] 🚀 NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
    - [01:17] 🎨 ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing
    - [01:43] 🤔 PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts
    - [02:14] 🚀 UI-Venus Technical Report: Building High-performance UI Agents with RFT
    - [02:42] 🚀 STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer
    - [03:11] ⚖ Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models
    - [03:37] 🤔 HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs
    - [04:08] 📚 A Survey on Diffusion Language Models
    - [04:39] 💡 From Black Box to Transparency: Enhancing Automated Interpreting Assessment with Explainable AI in College Classrooms
    - [05:03] 📸 Processing and acquisition traces in visual encoders: What does CLIP know about your camera?
    - [05:30] ⚖ When Explainability Meets Privacy: An Investigation at the Intersection of Post-hoc Explainability and Differential Privacy in the Context of Natural Language Processing

    Follow us: you can also find us on the following platform for more beyond the podcast. Xiaohongshu: AI速递

    6 min
  7. August 14

    2025.08.14 | Molecular reasoning framework improves performance; lightweight, efficient identity control for video

    This episode covers the following 15 papers:

    - [00:17] 🧪 Mol-R1: Towards Explicit Long-CoT Reasoning in Molecule Discovery
    - [00:38] ✨ Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation
    - [01:06] 🎥 Story2Board: A Training-Free Approach for Expressive Storyboard Generation
    - [01:32] 🛡 AWorld: Dynamic Multi-Agent System with Stable Maneuvering for Robust GAIA Problem Solving
    - [01:59] ⚡ Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing
    - [02:21] 🪄 Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation
    - [02:51] 🧠 Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory
    - [03:21] 🤝 Learning to Align, Aligning to Learn: A Unified Approach for Self-Optimized Alignment
    - [03:48] 🚧 MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math Reasoning in Multimodal Large Language Models
    - [04:12] 💡 Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models
    - [04:32] 👻 IAG: Input-aware Backdoor Attack on VLMs for Visual Grounding
    - [04:59] 💡 Noise Hypernetworks: Amortizing Test-Time Compute in Diffusion Models
    - [05:21] 💻 VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models
    - [05:47] ✨ GSFixer: Improving 3D Gaussian Splatting with Reference-Guided Video Diffusion Priors
    - [06:13] ✨ CannyEdit: Selective Canny Control and Dual-Prompt Guidance for Training-Free Image Editing

    Follow us: you can also find us on the following platform for more beyond the podcast. Xiaohongshu: AI速递

    7 min

Ratings & Reviews

5
out of 5
2 ratings

