HuggingFace 每日AI论文速递 (HuggingFace Daily AI Paper Digest)

duan

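Ten minutes a day to catch up on the day's trending AI papers on HuggingFace. Updated every weekday; subscriptions welcome. 📢 To find the podcast, search for 【HuggingFace 每日AI论文速递】 on Xiaoyuzhou (小宇宙) or Apple Podcasts. 🖼 An illustrated text edition is also available: search for and follow 【AI速递】 on Xiaohongshu (小红书).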

  1. 3 HR AGO

    2025.09.29 | Real-time long video you can steer as it plays; quantile baselines keep reasoning entropy in check

    The 15 papers in this episode:
    [00:20] 🎬 LongLive: Real-time Interactive Long Video Generation
    [00:56] 🎯 Quantile Advantage Estimation for Entropy-Safe Reasoning
    [01:34] 📄 MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
    [02:11] 🧠 EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning
    [03:08] 🧠 Variational Reasoning for Language Models
    [03:37] 💬 Language Models Can Learn from Verbal Feedback Without Scalar Rewards
    [04:32] 🔍 ReviewScore: Misinformed Peer Review Detection with Large Language Models
    [05:12] 🎯 CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning
    [05:49] 🪄 MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning
    [06:32] 🎯 No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping
    [07:14] 🗣 VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing
    [07:58] 🧭 UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios
    [08:29] 🖼 LucidFlux: Caption-Free Universal Image Restoration via a Large-Scale Diffusion Transformer
    [09:16] 🌐 WebGen-Agent: Enhancing Interactive Website Generation with Multi-Level Feedback and Step-Level Reinforcement Learning
    [09:49] 🔄 SPARK: Synergistic Policy And Reward Co-Evolving Framework (a reinforcement learning framework)

    Follow us: you can also find us on the following platform for more beyond the podcast. Xiaohongshu (小红书): AI速递

    11 min
  2. 3 DAYS AGO

    2025.09.26 | SciReasoner, an eight-event all-rounder; MMR1 trains open-source multimodal reasoning on the ambiguous zone

    The 15 papers in this episode:
    [00:20] 🔬 SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines
    [01:00] 🧠 MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources
    [01:41] 📈 VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models
    [02:26] 🌳 Tree Search for LLM Agent Reinforcement Learning
    [03:06] 🖼 Seedream 4.0: Toward Next-generation Multimodal Image Generation
    [03:40] 🎯 Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets
    [04:29] 🤖 AutoIntent: AutoML for Text Classification
    [05:10] ⚖ TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them
    [05:43] 🎢 CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning
    [06:30] 🖼 Does FLUX Already Know How to Perform Physically Plausible Image Composition?
    [07:31] ✂ CHARM: Control-point-based 3D Anime Hairstyle Auto-Regressive Modeling
    [08:26] 🧠 Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web Reconnaissance, Tool Generation, and Task Execution
    [09:12] 🎮 V-GameGym: Visual Game Generation for Code Large Language Models (a benchmark)
    [09:49] 🗣 Interactive Recommendation Agent with Active User Commands
    [10:22] 🔍 BESPOKE: Benchmark for Search-Augmented Large Language Model Personalization via Diagnostic Feedback

    11 min
  3. 4 DAYS AGO

    2025.09.25 | Video models as zero-shot all-rounders; implicit chain-of-thought saves tokens and improves efficiency

    The 10 papers in this episode:
    [00:22] 🎥 Video models are zero-shot learners and reasoners
    [01:09] 🧠 SIM-CoT: Supervised Implicit Chain-of-Thought (for efficient reasoning)
    [01:55] 🪶 EmbeddingGemma: Powerful and Lightweight Text Representations
    [02:29] 🗣 Advancing Speech Understanding in Speech-Aware Language Models with GRPO (targeting open-domain understanding)
    [03:06] 🌍 LLMs4All: A Review on Large Language Models for Research and Applications in Academic Disciplines
    [03:52] 🎬 EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning
    [04:29] 🌀 Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation
    [05:19] 🎬 PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video Generation
    [05:58] 📄 Logics-Parsing Technical Report (end-to-end document parsing with large models, based on reinforcement learning)
    [06:44] 🤖 On the Use of Agentic Coding: An Empirical Study of Pull Requests on GitHub (an analysis of pull requests opened by AI agents)

    8 min
  4. 5 DAYS AGO

    2025.09.24 | Arabic OCR sets new benchmark highs; annotation-free RL lifts scores

    The 15 papers in this episode:
    [00:24] 📜 Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR
    [00:58] 🚀 Reinforcement Learning on Pre-Training Data
    [01:37] 👁 Do You Need Proprioceptive States in Visuomotor Policies?
    [02:36] 🚀 MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
    [03:24] 🎯 MAPO: Mixed Advantage Policy Optimization (addressing the advantage allocation problem in GRPO)
    [04:06] 🚀 Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation
    [04:44] 🎯 VolSplat: Rethinking Feed-Forward 3D Gaussian Splatting with Voxel-Aligned Prediction
    [05:31] 🌌 Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation
    [06:08] 🧩 What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT
    [06:41] 🗣 Large Language Models Discriminate Against Speakers of German Dialects
    [07:32] 📊 OpenGVL - Benchmarking Visual Temporal Progress for Data Curation
    [08:19] 🪄 HyRF: Hybrid Radiance Fields for Memory-efficient and High-quality Novel View Synthesis
    [09:07] 🛠 CAR-Flow: Condition-Aware Reparameterization Aligns Source and Target for Better Flow Matching
    [09:41] 🛰 Zero-Shot Multi-Spectral Learning: Reimagining a Generalist Multimodal Gemini 2.5 Model for Remote Sensing Applications
    [10:28] 🌍 VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction

    12 min
  5. 6 DAYS AGO

    2025.09.23 | A mere 78 demonstrations take AI to 73.5%; mask-free video subject insertion beats Pika

    The 15 papers in this episode:
    [00:21] 🚀 LIMI: Less is More for Agency
    [00:55] 🎬 OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models
    [01:28] 🧩 OnePiece: Bringing Context Engineering and Reasoning to Industrial Cascade Ranking System
    [02:19] 🌐 Qwen3-Omni Technical Report (billed in the show notes as the first omni-modal large model with no performance loss)
    [02:55] 🎬 TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs (an off-policy framework for video temporal grounding)
    [03:28] 📐 GeoPQA: Bridging the Visual Perception Gap in MLLMs for Geometric Reasoning
    [04:15] 🎯 DiffusionNFT: Online Diffusion Reinforcement with Forward Process
    [05:05] 🤖 ByteWrist: A Parallel Robotic Wrist Enabling Flexible and Anthropomorphic Motion for Confined Spaces
    [05:42] 💬 EpiCache: Episodic KV Cache Management for Long Conversational Question Answering
    [06:24] 🧠 SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
    [07:01] 🧠 FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions
    [08:05] 🎬 VideoFrom3D: 3D Scene Video Generation via Complementary Image and Video Diffusion Models
    [08:53] 🧪 ARE: Scaling Up Agent Environments and Evaluations
    [09:28] 🧩 QWHA: Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models
    [10:17] 🔍 Analyzing the Effects of Supervised Fine-Tuning on Model Knowledge from Token and Parameter Levels

    11 min
  6. 22 SEPT

    2025.09.22 | Directed-graph-driven code generation; a unified model with dual-channel vision

    The 13 papers in this episode:
    [00:25] 🗺 RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation
    [01:00] 🌉 MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
    [01:42] 🧩 Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification
    [02:25] 🎯 BaseReward: A Strong Baseline for Multimodal Reward Model
    [02:56] 🏠 SPATIALGEN: Layout-guided 3D Indoor Scene Generation
    [03:46] 🧠 BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent (a brain-inspired model)
    [04:30] 🎭 Lynx: Towards High-Fidelity Personalized Video Generation
    [05:20] 🤖 A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning
    [05:54] 📹 RGB-Only Supervised Camera Parameter Optimization in Dynamic Scenes
    [06:21] 🗣 Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-To-Speech Systems
    [07:07] 🎬 Video2Roleplay: A Multimodal Dataset and Framework for Video-Guided Role-playing Agents
    [07:50] 🗣 WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers
    [08:30] 🗣 Ask-to-Clarify: Resolving Instruction Ambiguity through Multi-turn Dialogue

    10 min
