LessWrong (30+ Karma)

“Replacing RL w/ Parameter-based Evolutionary Strategies” by Logan Riggs

I want to highlight this paper (from Sept 29, 2025) on an alternative to RL for fine-tuning pre-trained LLMs, which:

  • Performs better
  • Requires less data
  • Is consistent across seeds
  • Is robust to hyperparameters (ie you don't need to do a grid search)
  • Shows less "reward hacking" (eg when optimizing for conciseness, it naturally stays close to the original model, ie low KL-divergence; see the sketch after this list)

They claim the magic sauce behind all this is that the evolutionary strategy optimizes over distributions of model parameters. Surprisingly, they've scaled this to billion-parameter models.

Let's get into their method.

Evolutionary Strategy (ES) Algorithm

They start w/ a "Basic ES Algorithm": sample noise around the original model's weights N times, score each perturbed copy with the reward function, and step the weights toward the better-scoring samples.

In other words, we're gonna explore around the model weights by sampling Gaussian noise N times, where the covariance is the identity matrix I.

[Below is an example explaining more in depth, feel free to skip] [...]
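Since that worked example is elided here, the following is my own minimal numpy sketch of one step of a basic ES update of this kind (not the post's code). The names `reward`, `N`, `sigma`, and `alpha` are placeholder choices, and the reward normalization is a common implementation detail rather than something the post specifies:

```python
import numpy as np

def es_step(params, reward, N=30, sigma=0.02, alpha=0.01):
    """One basic-ES update on a flat parameter vector.

    `reward` is a black-box function params -> scalar; no gradients needed.
    """
    # Sample N noise vectors from a Gaussian with identity covariance.
    eps = np.random.randn(N, params.shape[0])
    # Score each perturbed copy of the weights.
    rewards = np.array([reward(params + sigma * e) for e in eps])
    # Normalize rewards so the step size doesn't depend on reward scale
    # (a common implementation detail, not necessarily the post's).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Estimate the gradient of expected reward and step toward it.
    grad_est = eps.T @ adv / (N * sigma)
    return params + alpha * grad_est
```

Notice there's no backprop anywhere: each sample only needs a forward evaluation of the reward, which is what makes ES easy to parallelize across many perturbed copies of the weights.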

---

Outline:

(00:54) Evolutionary Strategy (ES) Algorithm

(02:41) New ES Implementation

(03:28) Task 1: Countdown task

(05:05) Task 2: Conciseness

(06:00) Future Work

---

First published:
October 8th, 2025

Source:
https://www.lesswrong.com/posts/282Sv9JePpNpQktKP/replacing-rl-w-parameter-based-evolutionary-strategies

---

Narrated by TYPE III AUDIO.
