This research introduces an active exploration algorithm that improves the efficiency of preference alignment in large language models (LLMs) by strategically selecting which human feedback to collect. The authors frame the problem as an active contextual dueling bandit, in which the system chooses both the "contexts" (prompts) and the "actions" (LLM responses) presented to human evaluators. Their proposed method, AE-Borda, combines uncertainty estimation with a generalized Borda function to identify the most informative data points for training, yielding faster learning and lower data-collection costs. The paper validates its theoretical guarantees with synthetic experiments and demonstrates practical gains in LLM performance across several datasets, including two newly introduced ones: Jeopardy!, for factual correctness, and Haikus, for creative writing.
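To make the selection idea concrete, here is a minimal Python sketch of uncertainty-driven Borda selection. It assumes a finite set of candidate prompts and responses, and it uses disagreement across an ensemble of preference models as a stand-in for the confidence intervals AE-Borda actually builds; all names (`borda_scores`, `select_query`, `make_model`) and the ensemble heuristic are illustrative, not taken from the paper.

```python
import numpy as np

def borda_scores(model, context):
    """Generalized Borda score of each action: its average probability
    of beating a uniformly random opponent action in this context."""
    P = model(context)  # (n_actions, n_actions), P[i, j] = P(action i beats action j)
    return P.mean(axis=1)

def select_query(contexts, ensemble):
    """Pick the prompt and response pair to show an annotator by maximizing
    ensemble disagreement about Borda scores (a rough proxy for the
    confidence-interval rule in the paper)."""
    best = None
    for x in contexts:
        scores = np.stack([borda_scores(m, x) for m in ensemble])  # (n_models, n_actions)
        width = scores.max(axis=0) - scores.min(axis=0)            # per-action uncertainty
        a1 = int(scores.max(axis=0).argmax())  # optimistically best response
        a2 = int(width.argmax())               # most uncertain challenger
        gain = float(width[a1] + width[a2])
        if best is None or gain > best[0]:
            best = (gain, x, a1, a2)
    _, x, a1, a2 = best
    return x, a1, a2

if __name__ == "__main__":
    # Toy demo: 3 candidate prompts (encoded as scalars), 4 responses each,
    # and 5 randomly perturbed preference models as the "ensemble".
    def make_model(seed):
        raw = np.random.default_rng(seed).normal(size=(4, 4))
        def model(x):
            logits = (raw - raw.T) * (1.0 + x)  # antisymmetric, so P + P.T == 1
            return 1.0 / (1.0 + np.exp(-logits))
        return model

    ensemble = [make_model(s) for s in range(5)]
    x, a1, a2 = select_query([0.0, 0.5, 1.0], ensemble)
    print(f"ask the annotator to compare responses {a1} and {a2} for prompt {x}")
```

The key design choice is that the query targets pairs where the learner's Borda estimates are least settled, so each human comparison resolves as much uncertainty as possible rather than confirming what the model already knows.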
Information
- Show
- Frequency: Updated weekly
- Published: September 6, 2025 at 04:32 UTC
- Length: 15 minutes
- Rating: Suitable for children