2025.06.18 | MultiFinBen reveals the limitations of financial models; test-time compute improves LLM agent performance.

HuggingFace Daily AI Paper Digest

The 15 papers in this episode are as follows:

[00:23] 📊 MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation

[01:03] 🤖 Scaling Test-time Compute for LLM Agents

[01:38] 🎼 CMI-Bench: A Comprehensive Benchmark for Evaluating Music Instruction Following

[02:16] 💬 LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs

[02:57] 🤔 Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

[03:40] 🧠 Xolver: Multi-Agent Reasoning with Holistic Experience Learning Just Like an Olympiad Team

[04:20] 🗣 Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model

[05:02] ⚕ Efficient Medical VIE via Reinforcement Learning

[05:40] 🤔 Reasoning with Exploration: An Entropy Perspective

[06:18] 🧠 QFFT, Question-Free Fine-Tuning for Adaptive Reasoning

[06:52] 🎨 Align Your Flow: Scaling Continuous-Time Flow Map Distillation

[07:27] 🧪 Can LLMs Generate High-Quality Test Cases for Algorithm Problems? TestCase-Eval: A Systematic Evaluation of Fault Coverage and Exposure

[08:07] 🤖 Guaranteed Guess: A Language Modeling Approach for CISC-to-RISC Transpilation with Testing Guarantees

[08:58] 🛠 CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios

[09:38] 📊 xbench: Tracking Agents Productivity Scaling with Profession-Aligned Real-World Evaluations

【Follow Us】

You can also find us on the following platforms for more information beyond the podcast:

Xiaohongshu: AI速递
