This paper studies how reinforcement learning (RL) compute scales for large language models (LLMs) and introduces a principled framework for predicting performance. The authors develop ScaleRL, a best-practice recipe distilled from ablations over algorithmic design choices, and show that its compute-performance curves are well fit by a sigmoidal function, making its scaling trajectory predictable. Accompanying figures plot validation performance against GPU hours (log scale) for different RL configurations, showing that ScaleRL reaches higher asymptotic performance and better compute efficiency than prevalent methods while remaining stable across scaling axes such as model size and batch size. The work establishes that predictable scaling laws, familiar from LLM pre-training, also apply to the RL fine-tuning stage.
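The sigmoidal compute-performance fit is the core quantitative tool described above. As a rough illustration of how such a curve can be fit and extrapolated, the sketch below uses SciPy to fit a saturating sigmoid to hypothetical (GPU-hours, pass-rate) points. The functional form, parameter names, and data are assumptions for illustration, not the paper's exact formulation.

```python
# Minimal sketch (not the authors' code): fit a saturating sigmoid to
# RL compute-vs-performance observations, assuming a form like
#   perf(C) = A / (1 + (C_mid / C) ** B)
# where A is the asymptotic pass rate, C_mid the compute midpoint, and
# B the scaling exponent. All data and parameter values are illustrative.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid_scaling(compute, asymptote, c_mid, exponent):
    """Saturating sigmoid in compute: approaches `asymptote` as compute grows."""
    return asymptote / (1.0 + (c_mid / compute) ** exponent)

# Hypothetical (GPU-hours, validation pass-rate) measurements.
gpu_hours = np.array([1e2, 3e2, 1e3, 3e3, 1e4, 3e4])
pass_rate = np.array([0.18, 0.30, 0.42, 0.51, 0.56, 0.585])

params, _ = curve_fit(
    sigmoid_scaling, gpu_hours, pass_rate,
    p0=[0.6, 1e3, 1.0], maxfev=10_000,
)
asymptote, c_mid, exponent = params
print(f"A (asymptote) = {asymptote:.3f}")
print(f"C_mid         = {c_mid:.1f} GPU-hours")
print(f"B (exponent)  = {exponent:.2f}")

# Extrapolate the fitted curve to ~10x more compute than observed.
print(f"predicted pass rate @ 3e5 GPU-hours: {sigmoid_scaling(3e5, *params):.3f}")
```

Because the curve saturates at the fitted asymptote, comparing recipes by their fitted parameters (asymptote and efficiency) rather than by a single training run is what makes the scaling comparison in the paper predictable.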
Information
- Program
- Frequency: Updated weekly
- Published: October 16, 2025, 10:54 PM UTC
- Length: 14 minutes
- Rating: All ages
