
Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF
This paper investigates two major drawbacks in the reward learning phase of RLHF: reward overfitting and reward overoptimization, which often arise because the standard cross-entropy loss handles imbalanced preference datasets poorly. To address these issues, the paper introduces a novel algorithm, Iterative Data Smoothing (IDS), which mitigates both problems by iteratively replacing hard comparison labels with softer, model-predicted labels during training. Theoretical analysis and empirical results in both multi-armed bandit and neural network settings show that IDS outperforms standard Maximum Likelihood Estimation (MLE), offering a more robust approach to reward training.
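To make the label-smoothing idea concrete, here is a minimal sketch of how iteratively smoothed preference labels could be used when training a pairwise reward model. The Bradley-Terry formulation, the toy MLP architecture, the smoothing coefficient `beta`, and all function and variable names are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward network mapping response features to a scalar reward."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def train_ids(chosen, rejected, epochs=50, beta=0.7, lr=1e-3):
    """Train a reward model with iteratively smoothed preference labels.

    chosen, rejected: (N, dim) feature tensors for the preferred and
    dispreferred responses in each comparison pair (hypothetical inputs).
    """
    model = RewardModel(chosen.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    # Start from the hard labels: "chosen" beats "rejected" with probability 1.
    soft_labels = torch.ones(chosen.shape[0])

    for _ in range(epochs):
        # Bradley-Terry probability that "chosen" is preferred.
        probs = torch.sigmoid(model(chosen) - model(rejected))

        # Cross-entropy against the current (smoothed) labels.
        loss = nn.functional.binary_cross_entropy(probs, soft_labels)
        opt.zero_grad()
        loss.backward()
        opt.step()

        # Smoothing step: blend the labels toward the model's own predictions,
        # so rarely compared or noisy pairs stop pulling the model toward
        # extreme rewards.
        with torch.no_grad():
            preds = torch.sigmoid(model(chosen) - model(rejected))
            soft_labels = beta * soft_labels + (1 - beta) * preds

    return model
```

The key contrast with plain MLE is the final step of each epoch: instead of fitting hard 0/1 comparison labels throughout training, the targets themselves are updated toward the model's current predictions, which is the smoothing behavior the abstract describes.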
Information
- Frequency: Updated Weekly
- Published: October 9, 2025 at 4:35 PM UTC
- Length: 17 min
- Rating: Clean