HuggingFace Daily AI Paper Digest

2025.10.16 | UniMoE unifies speech and music; attention illuminates LLM reasoning

This episode covers the following 15 papers:

[00:21] 🎧 UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE

[00:57] 🔍 Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization

[01:38] ⚡ FlashWorld: High-quality 3D Scene Generation within Seconds

[02:06] 🐝 Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs

[02:37] 🗣 InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue

[03:24] 🌍 PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning

[04:00] 🧪 LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models

[04:43] 🚗 CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving

[05:21] 🔍 Generative Universal Verifier as Multimodal Meta-Reasoner (a reflection engine for multimodal meta-reasoning)

[06:07] ⚖ ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs

[06:43] 🎞 Trace Anything: Representing Any Video in 4D via Trajectory Fields (a single feed-forward pass recovers a continuous spatio-temporal trajectory for every pixel)

[07:27] 🌍 Reasoning in Space via Grounding in the World

[07:54] 🧠 The Role of Computing Resources in Publishing Foundation Model Research

[08:28] ⚖ UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning

[09:05] 🤖 InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

【Follow Us】

You can also find us on the following platforms for more information beyond the podcast:

Xiaohongshu (小红书): AI速递