Utility Maximization = Description Length Minimization by johnswentworth

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Utility Maximization = Description Length Minimization, published by johnswentworth on the AI Alignment Forum.
There’s a useful intuitive notion of “optimization” as pushing the world into a small set of states, starting from any of a large number of states. Visually:
Yudkowsky and Flint both have notable formalizations of this “optimization as compression” idea.
This post presents a formalization of optimization-as-compression grounded in information theory. Specifically: to “optimize” a system is to reduce the number of bits required to represent the system state using a particular encoding. In other words, “optimizing” a system means making it compressible (in the information-theoretic sense) by a particular model.
This formalization turns out to be equivalent to expected utility maximization, and allows us to interpret any expected utility maximizer as “trying to make the world look like a particular model”.
Conceptual Example: Building A House
Before diving into the formalism, we’ll walk through a conceptual example, taken directly from Flint’s Ground of Optimization: building a house. Here’s Flint’s diagram:
The key idea here is that there’s a wide variety of initial states (piles of lumber, etc) which all end up in the same target configuration set (finished house). The “perturbation” indicates that the initial state could change to some other state - e.g. someone could move all the lumber ten feet to the left - and we’d still end up with the house.
In terms of information-theoretic compression: we could imagine a model which says there is probably a house. Efficiently encoding samples from this model will mean using shorter bit-strings for world-states with a house, and longer bit-strings for world-states without a house. World-states with piles of lumber will therefore generally require more bits than world-states with a house. By turning the piles of lumber into a house, we reduce the number of bits required to represent the world-state using this particular encoding/model.
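To make this concrete, here is a minimal Python sketch of the idea. The state labels and probabilities are invented purely for illustration: a model that assigns high probability to house-states gives those states short codes.

```python
import math

# Hypothetical model: the world probably contains a house.
# State labels and probabilities are made up for illustration.
model = {
    "finished house": 0.90,
    "pile of lumber": 0.06,
    "half-built frame": 0.04,
}

# Under an optimal code for this model, a state with probability p
# costs about -log2(p) bits to encode.
for state, p in model.items():
    print(f"{state:17s} -> {-math.log2(p):5.2f} bits")

# The house state needs ~0.15 bits, the lumber state ~4.06 bits.
# Turning lumber into a house shortens the description under this model.
```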
If that seems kind of trivial and obvious, then you’ve probably understood the idea; later sections will talk about how it ties into other things. If not, then the next section is probably for you.
Background Concepts From Information Theory
The basic motivating idea of information theory is that we can represent information using fewer bits, on average, if we use shorter representations for states which occur more often. For instance, Morse code uses only a single bit (“.”) to represent the letter “e”, but four bits (“- - . -”) to represent “q”. This creates a strong connection between probabilistic models/distributions and optimal codes: a code which requires minimal average bits for one distribution (e.g. with lots of e’s and few q’s) will not be optimal for another distribution (e.g. with few e’s and lots of q’s).
For any random variable $X$ generated by a probabilistic model $M$, we can compute the minimum average number of bits required to represent $X$. This is Shannon’s famous entropy formula:

$$-\sum_X P[X|M] \log_2 P[X|M]$$
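As a sanity check of the formula above, here is a short Python sketch computing this entropy directly; the toy distribution is invented for the example.

```python
import math

def entropy_bits(dist):
    """Shannon entropy -sum_x P[x] * log2 P[x]: the minimum average
    number of bits per sample under an optimal code for this model."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Toy model M: made-up letter frequencies, 'e' common, 'q' rare.
M = {"e": 0.7, "t": 0.2, "q": 0.1}
print(entropy_bits(M))  # ~1.16 bits per letter on average
```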
Assuming we’re using an optimal encoding for model $M$, the number of bits used to encode a particular value $x$ is $-\log_2 P[X=x|M]$. (Note that this is sometimes not an integer! Today we have algorithms which encode many samples at once, potentially even from different models/distributions, to achieve asymptotically minimal bit-usage. The “rounding error” only happens once for the whole collection of samples, so as the number of samples grows, the rounding error per sample goes to zero.)
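Continuing with the same invented toy distribution, this sketch shows the per-value code lengths $-\log_2 P[X=x|M]$ (generally non-integer), and that averaging them under $M$ recovers the entropy above.

```python
import math

M = {"e": 0.7, "t": 0.2, "q": 0.1}  # same toy model as above

# Per-value code length under an optimal code for M: -log2 P[X=x|M].
lengths = {x: -math.log2(p) for x, p in M.items()}
print(lengths)  # e: ~0.51, t: ~2.32, q: ~3.32 bits -- not integers

# Averaging these lengths weighted by M recovers the entropy above.
avg = sum(M[x] * lengths[x] for x in M)
print(avg)  # ~1.16 bits, matching the entropy computed earlier
```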
Of course, we could be wrong about the distribution - we could use a code optimized for a model $M_2$ which is different from the “true” model $M_1$. In this case, the average number of bits used will be

$$-\sum_X P[X|M_1] \log_2 P[X|M_2]$$
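To illustrate the cost of the mismatch, here is a small sketch with two invented distributions: encoding samples from $M_1$ with a code optimized for $M_2$ takes more bits on average than the entropy of $M_1$.

```python
import math

def cross_entropy_bits(p_true, p_code):
    """Average bits per sample when X ~ p_true but the code is
    optimized for p_code: -sum_x p_true[x] * log2 p_code[x]."""
    return -sum(p * math.log2(p_code[x]) for x, p in p_true.items() if p > 0)

# Invented example: M1 is the "true" model, M2 the model the code assumes.
M1 = {"e": 0.7, "t": 0.2, "q": 0.1}
M2 = {"e": 0.1, "t": 0.2, "q": 0.7}

print(cross_entropy_bits(M1, M1))  # ~1.16 bits: optimal code, equals entropy of M1
print(cross_entropy_bits(M1, M2))  # ~2.84 bits: the mismatched code costs extra bits
```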