I want to highlight this paper (from Sept 29, 2025) proposing an alternative to RL (for fine-tuning pre-trained LLMs) which:
- Performs better
- Requires less data
- Consistent across seeds
- Robust (i.e., no need for a grid search over your hyperparameters)
- Less "reward hacking" (e.g., when optimizing for conciseness, it naturally stays close to the original model, i.e., low KL divergence)
They claim the magic sauce behind all this is that the evolutionary strategy optimizes over distributions of model parameters. Surprisingly, they've scaled this up to billion-parameter models.
Let's get into their method.
Evolutionary Strategy (ES) Algorithm
They start with a "Basic ES Algorithm":
In other words, we sample noise around the original model's weights N times (i.e., we explore around the model weights, where the covariance is the identity matrix I).
[Below is an example explaining this in more depth; feel free to skip it. [...]
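To make the sampling step concrete, here is a minimal sketch of one basic ES update in NumPy. This is illustrative only: the function names, hyperparameter values, and the reward-normalization detail are my assumptions, not taken from the paper, which applies this idea at billion-parameter scale with a more sophisticated implementation.

```python
import numpy as np

def es_step(theta, reward_fn, n=30, sigma=0.1, lr=0.02):
    """One basic ES update (illustrative sketch, not the paper's code):
    sample N Gaussian perturbations of the current parameters (identity
    covariance), score each perturbed copy, and move the mean toward
    the high-reward directions."""
    eps = np.random.randn(n, theta.size)                        # N noise samples
    rewards = np.array([reward_fn(theta + sigma * e) for e in eps])
    # Normalize rewards so the update is invariant to reward scale
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Reward-weighted average of the noise estimates the gradient
    grad = (adv @ eps) / (n * sigma)
    return theta + lr * grad

# Toy usage: maximize a quadratic reward peaked at all-ones
np.random.seed(0)
reward_fn = lambda x: -np.sum((x - 1.0) ** 2)
theta = np.zeros(5)
for _ in range(300):
    theta = es_step(theta, reward_fn)
```

Note that only reward values are needed, never gradients of the model itself, which is what makes the approach attractive for fine-tuning against non-differentiable objectives.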
---
Outline:
(00:54) Evolutionary Strategy (ES) Algorithm
(02:41) New ES Implementation
(03:28) Task 1: Countdown task
(05:05) Task 2: Conciseness
(06:00) Future Work
---
First published:
October 8th, 2025
Source:
https://www.lesswrong.com/posts/282Sv9JePpNpQktKP/replacing-rl-w-parameter-based-evolutionary-strategies
---
Narrated by TYPE III AUDIO.
---