This paper introduces DELTA, a controlled benchmark of synthetic programming tasks (such as Manufactoria puzzles and BouncingSim physics simulations) designed to isolate and evaluate whether reinforcement learning (RL) can teach large language models (LLMs) genuinely new reasoning procedures. The study demonstrates that RL can achieve **learnability beyond pretraining** on tasks where reference models previously failed completely, whereas naive training with a binary reward fails. The key enabler is a **two-stage training strategy**: dense, per-test-case rewards during a warm-up phase, followed by a switch to strict binary rewards, which triggers an abrupt **grokking transition** from exploration to mastery. The transferability analysis shows that the learned skills generalize robustly under **exploratory shifts** and **compose effectively** when skills are combined, but performance remains poor under **transformative shifts**, which require qualitatively novel solution schemas.
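As a rough illustration of the two-stage reward schedule described above, here is a minimal sketch; the function names, signatures, and the warm-up cutoff are assumptions for illustration, not details taken from the paper.

```python
from typing import List

def dense_reward(test_results: List[bool]) -> float:
    """Warm-up shaping reward: fraction of per-test cases passed (assumed form)."""
    return sum(test_results) / len(test_results)

def binary_reward(test_results: List[bool]) -> float:
    """Strict reward: 1.0 only if every test case passes, else 0.0."""
    return 1.0 if all(test_results) else 0.0

def two_stage_reward(test_results: List[bool], step: int,
                     warmup_steps: int = 1000) -> float:
    """Switch from dense shaping to the strict binary reward after warm-up.
    warmup_steps is a hypothetical hyperparameter, not a value from the paper."""
    if step < warmup_steps:
        return dense_reward(test_results)
    return binary_reward(test_results)
```

The ordering matters: the dense phase supplies a learning signal while full success is still essentially unreachable, and the binary phase then enforces exact correctness, which is when the grokking transition is reported to occur.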
Information
- Frequency: Weekly
- Published: 28 November 2025 at 00:43 UTC
- Duration: 11 min
- Rating: All audiences
