1 day ago
Episode 1.2K
24 min

Language Models Can Learn from Verbal Feedback Without Scalar Rewards

Daily Paper Cast

🤗 Upvotes: 48 | cs.CL, cs.AI, cs.LG

Authors:
Renjie Luo, Zichen Liu, Xiangyan Liu, Chao Du, Min Lin, Wenhu Chen, Wei Lu, Tianyu Pang

Title:
Language Models Can Learn from Verbal Feedback Without Scalar Rewards

Arxiv:
http://arxiv.org/abs/2509.22638v1

Abstract:
LLMs are often trained with RL from human or AI feedback, yet such methods typically compress nuanced feedback into scalar rewards, discarding much of their richness and inducing scale imbalance. We propose treating verbal feedback as a conditioning signal. Inspired by language priors in text-to-image generation, which enable novel outputs from unseen prompts, we introduce the feedback-conditional policy (FCP). FCP learns directly from response-feedback pairs, approximating the feedback-conditional posterior through maximum likelihood training on offline data. We further develop an online bootstrapping stage where the policy generates under positive conditions and receives fresh feedback to refine itself. This reframes feedback-driven learning as conditional generation rather than reward optimization, offering a more expressive way for LLMs to directly learn from verbal feedback. Our code is available at https://github.com/sail-sg/feedback-conditional-policy.
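
To make the abstract's core idea concrete, here is a minimal sketch of how response-feedback pairs could be turned into conditional-generation training data. The `Sample` class, the `TEMPLATE` string, and both helper functions are hypothetical names for illustration and are not taken from the authors' repository. The sketch assumes the feedback text is simply prepended to the prompt, and that the maximum-likelihood loss would be computed on the response tokens only.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    prompt: str
    response: str
    feedback: str  # free-form verbal critique, not a scalar score


# Hypothetical template: the paper's actual formatting may differ.
TEMPLATE = "Feedback: {feedback}\nPrompt: {prompt}\nResponse: "


def build_training_example(sample: Sample) -> tuple[str, str]:
    """Return (context, target) for conditional maximum-likelihood training.

    The context holds the feedback and prompt; the target is the response.
    A trainer would mask the loss so only target tokens contribute.
    """
    context = TEMPLATE.format(feedback=sample.feedback, prompt=sample.prompt)
    return context, sample.response


def build_inference_context(prompt: str, positive_feedback: str) -> str:
    """At generation time, condition on the feedback we would like to receive."""
    return TEMPLATE.format(feedback=positive_feedback, prompt=prompt)


# Offline data: both good and bad responses are kept, each with its own feedback.
offline = [
    Sample("What is 2 + 2?", "5", "The arithmetic is wrong."),
    Sample("What is 2 + 2?", "4", "Correct and concise."),
]
examples = [build_training_example(s) for s in offline]

# Online bootstrapping (as described in the abstract) would generate under a
# positive condition like this, collect fresh feedback, and add new samples.
ctx = build_inference_context("What is 3 + 3?", "Correct and concise.")
```

Note that negative feedback is not discarded in this sketch: a poor response paired with a critical comment is still a valid training pair, which is the contrast with collapsing feedback into a single reward value.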


Show:
Daily Paper Cast

Frequency:
Updated daily

Published:
September 30, 2025, 4:11 AM UTC

Length:
24 min

Episode:
1.2K

Rating:
Suitable for all ages
