13
Curiosity, some intuition

2026-02-13 · 2246 words

I figure I’ll publish this long post in a couple parts, over today and tomorrow or perhaps the day after (as tomorrow is going to be a busy day). In today’s post I plan to go over curiosity and the Noisy TV Problem, to help build some practical and mathematical intuition. Tomorrow, I’d like to talk about some information theory, specifically aleatoric and epistemic uncertainty, and how this framing can be used to improve learning through exploration.

One of my favorite ML papers of all time was published in 2018. It’s called Exploration by Random Network Distillation (RND), by Burda et al. Some other favorites are PILCO and DLGN. Let’s start with some background:

RL and rewards

Reinforcement Learning (RL) is concerned with learning a policy (rules about how to act in a situation) to complete some general objective in an environment. The environment can be anything, like chess or a video game or chatting with a human or writing code or working on a math problem. The objective is traditionally given using a reward function, which assigns a numeric value to each possible state of an environment. For example, +1 if the policy did something good, -1 if it did not.

To train an RL policy, we traditionally “roll out” many episodes—or recordings—of a policy acting in a given environment. We then calculate the total reward for each episode by applying the reward function to each state in the episode and summing. Policies are then iteratively trained to maximize the total reward collected over the course of an episode, using techniques like PPO, GRPO, etc.
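
As a minimal sketch of that rollout-and-sum step (assuming a hypothetical environment object with a gym-style `reset`/`step` interface; the names are mine, purely for illustration):

```python
def episode_return(env, policy, max_steps=1_000):
    """Roll out one episode and sum the rewards collected along the way."""
    obs = env.reset()                         # first observation
    total = 0.0
    for _ in range(max_steps):
        action = policy(obs)                  # the policy picks an action
        obs, reward, done = env.step(action)  # environment transitions, emits a reward
        total += reward
        if done:
            break
    return total
```

Training then nudges the policy so that, averaged over many such episodes, this number goes up.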

As you can imagine, it’s hard to come up with a good reward function. If a new policy is learning chess, initialized to make random moves, it is very unlikely that the policy will randomly checkmate its opponent. A reward function that assigns +100 to winning, -100 to losing, and 0 reward otherwise is considered to be a very sparse reward signal. Sparse rewards are few and far between, so it’s hard for the policy to learn stepping-stone milestones and make partial progress. To overcome this problem, we can enrich reward functions with added heuristics—e.g. which side has more pieces, whether a piece was taken on the last turn, whether a king was put in check, etc.—and these heuristics can provide enough of an intermediate dense reward signal to guide the policy towards discovering the large rewards we know exist at the end.
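
To make the sparse-versus-dense contrast concrete, here is a toy sketch for chess; every helper method and weight below is hypothetical, chosen just for illustration:

```python
def sparse_reward(state):
    # Only the terminal outcome is rewarded; every intermediate move gets 0.
    if state.is_checkmate():
        return 100.0 if state.we_won() else -100.0
    return 0.0

def dense_reward(state):
    # Hand-crafted heuristics add intermediate signal on every move.
    r = sparse_reward(state)
    r += 0.1 * (state.our_material() - state.their_material())  # material advantage
    r += 1.0 if state.captured_piece_last_turn() else 0.0       # capture bonus
    r += 0.5 if state.opponent_in_check() else 0.0               # check bonus
    return r
```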

Human curiosity

Why do we know things exist? Side-stepping existentialism, we know things exist because people are naturally curious. When we don’t know what the reward for doing something is, we generally try it out. We like to learn and make discoveries; this pioneering spirit has pushed us to the four corners of the globe. You could say that humans are intrinsically motivated to explore; we find boredom to be boring, and boredom in excess quantities to be bad. We feel a thrill of surprise when we experience something new and unexpected.

There is this idea of the “expected free energy” model of intelligence. (This is more of a “nice model to think about” than an empirical claim about how the brain works.) It goes something like this: if you accept that intelligence is prediction, then it follows that an intelligent system, like the brain, is a prediction-error-minimization machine. To improve its models of the world, the brain seeks out states that, when observed, will decrease prediction error.

Talking about humans, you can take the expected free energy model pretty far: there might be some predictions wired into us, like “feeling full”. Then, hunger is the feeling of the prediction error we intuit when our body is not full. Likewise, when you pick up an orange, your brain predicts that you will be holding an orange, and your body, being acutely aware of the fact that you are not holding an orange, moves to correct that error. There’s this wave of prediction crashing into this wave of sensory input, and the differences are noted in one direction to learn and in the other direction to act.

Returning to the land of RL: there’s a current school of thought that instead of hand-crafting bespoke dense reward functions, we should emulate expected free energy when training policies. We decompose the reward function into two parts: the sparse reward we ultimately care about (e.g. win/loss), and the dense reward tracking the “exploration” of new states. We can combine these two parts together to derive a reward function that better guides the policy towards interesting and rewarding states.
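
In code, this combination is usually just a per-step weighted sum; a sketch (the coefficient name `beta` is my own label for the trade-off, not something from the post):

```python
import numpy as np

def shaped_rewards(extrinsic, intrinsic, beta=0.5):
    """Combine the sparse task reward with a weighted exploration bonus, per step.

    extrinsic, intrinsic: arrays of per-step rewards for one episode.
    beta trades off exploiting the task reward against exploring new states.
    """
    return np.asarray(extrinsic) + beta * np.asarray(intrinsic)
```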

Curiosity and RND

How can you quantify which states are new and exciting to explore, and which states are routine and boring?

Usually we treat an environment like something of a black box. The environment has an internal state, which we’ll denote with $s$. This state $s$ is then mapped to some observation $o$ (which may not contain all the information present in $s$). Our policy $\pi$ receives a history of observations $o^*$, and from it we sample an action $a$. This action is then sent to the environment, and the next state $s’$ is stochastically determined from $(s, a)$, the state and the action. In the literature, this is called a partially observable Markov decision process, or POMDP for short (let’s not worry about reward for now).
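
Spelled out as an interaction loop (a sketch with hypothetical `env` and `policy` objects; the point is that only observations, never the hidden state $s$, ever reach the policy):

```python
def run_pomdp_episode(env, policy, max_steps=1_000):
    """One episode of a POMDP: the policy conditions on the observation history."""
    history = []                              # o*: every observation seen so far
    obs = env.reset()                         # env maps its hidden state s to an observation o
    for _ in range(max_steps):
        history.append(obs)
        action = policy.sample(history)       # a ~ pi(. | o*)
        obs, _reward, done = env.step(action) # env samples s' from (s, a), emits the next o
        if done:
            break
    return history
```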

As a simplifying assumption, let’s say observations map fairly directly to states. Suppose we had an embedding of all observations in a well-structured vector space $O$. Then, for any pair of observations $a, b \in O$, we can directly measure how far apart they are using the squared distance $|a - b|^2$.

Imagine we collected episodes using a policy $\pi$, and recorded all observation vectors from each episode in a large database $D \subseteq O$. Upon receiving a new observation $o’$, one way to determine how “novel” the observation was would be to see how far it is from each of the previously seen observation vectors in the database, and return the smallest distance:

$$\text{novelty}(o’) = \min_{d \in D} |o’ - d|^2$$

This is a sound approach in theory, but in practice, if $D$ grows to be quite large, this process could get quite slow: calculating novelty is linear in $|D|$.
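
A quick numpy sketch of this nearest-neighbor novelty (assuming observations have already been embedded as vectors):

```python
import numpy as np

def novelty(o_new, D):
    """Smallest squared distance from a new observation embedding to the database.

    o_new: embedding of the new observation, shape (dim,)
    D:     previously seen embeddings stacked into an array, shape (num_seen, dim)

    Note the cost: every stored embedding gets compared to o_new, so this is O(|D|).
    """
    diffs = D - o_new                          # broadcasts to (num_seen, dim)
    return np.min(np.sum(diffs ** 2, axis=1))  # min over all stored embeddings
```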

The Random Network Distillation paper proposes a clever hack. The key idea is to approximate $D$, the set of seen observations, using two neural networks. In more detail (a code sketch follows the list):

  1. We start with two neural networks mapping from observation space $O$ to a smaller embedding space $V$: call the first the target network $T$ and the second the predictor network $\hat{T}$. Initialize $T$ and $\hat{T}$ using different initial random weights, but freeze the weights of the target.
  2. In each training step, using the current policy $\pi$, collect an episode of observations $o_{1:n}$. For each observation $o_i \in o_{1:n}$, use the target network $T$ to embed this observation: $v_{i} = T(o_{i})$. Then use the predictor $\hat{T}$ to estimate the target’s embedding: ${\hat{v}}_i = \hat{T}(o_i)$.
  3. We can calculate the “surprise” for any individual observation $o_i$ to be $\text{surprise}(o_i) = |T(o_i) - \hat{T}(o_i)|^2 = |v_i - \hat{v}_i|^2$, the MSE between the embeddings. If the target and the predictor agree, surprise is low. Otherwise, surprise is high.
  4. This is the important part: As we train, keep the target $T$ fixed, but train the predictor $\hat{T}$ to predict the output of $T$ for the previous observation and embedding pairs $(o, v)^*$ we have seen. This means that, for observations similar to previously-seen observations, the surprise $|v - \hat{v}|^2$ will be very small, but for new observations, the surprise will be very large.
  5. Label each observation with surprise in lieu of (or in combination with) a reward function, to intrinsically motivate the behavior of “seeking out surprising states” in the policy. (Otherwise train the policy $\pi$ along with the predictor $\hat{T}$ as you normally would.)
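
Here is a minimal PyTorch sketch of steps 1–4 (the network sizes, optimizer, and learning rate are arbitrary choices of mine, not the paper’s, and the policy update itself is omitted):

```python
import torch
import torch.nn as nn

def make_net(obs_dim, emb_dim=64):
    # A small MLP from observation space O to embedding space V.
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim))

obs_dim = 32                                   # placeholder observation size
target = make_net(obs_dim)                     # T: randomly initialized, then frozen
predictor = make_net(obs_dim)                  # T_hat: trained to imitate T
for p in target.parameters():                  # step 1: freeze the target's weights
    p.requires_grad_(False)

opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def surprise(obs):
    """Step 3: per-observation intrinsic reward |T(o) - T_hat(o)|^2."""
    with torch.no_grad():
        v = target(obs)
    v_hat = predictor(obs)
    return ((v - v_hat) ** 2).sum(dim=-1)      # squared error over embedding dims

def update_predictor(obs_batch):
    """Step 4: train the predictor on observations we have already seen,
    so that familiar observations stop being surprising."""
    loss = surprise(obs_batch).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Steps 2 and 5, in miniature: collect observations with the current policy,
# label each with surprise(...) as an intrinsic reward, then fit the predictor.
obs_batch = torch.randn(16, obs_dim)           # stand-in for a real episode
intrinsic_rewards = surprise(obs_batch).detach()
update_predictor(obs_batch)
```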

This intrinsic exploration reward is often called curiosity, and I like that name quite a bit. (What was revolutionary about RND was that it made great progress on Montezuma’s Revenge, an old Atari game that is very exploration-driven and thus provides very sparse rewards. It’s hard to overstate how plain cool this result was at the time.)

RND and TV

Let’s say you’re a very curious fellow. I show you a coin. Will it land heads or tails?

I flip the coin. Tails! Were you right? I flip it again and again. For whatever reason, you can only seem to predict when it lands heads-up about half the time. Stochastic? Sounds pretty unfair. I can hardly pronounce the word. How are you ever supposed to learn my trick!?

Unlike RND, previous exploration approaches used forward prediction to estimate surprise. Surprise from prediction is very intuitive: “how different is what I expected from what I observed?” Forward prediction, however, falls apart if state transitions are random: it’s impossible to predict a coin flip, for instance. A TV playing static is a coin flip per pixel. Agents getting stuck trying to predict noise is called, quite unsurprisingly, the Noisy TV Problem.
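
For contrast, a forward-prediction flavor of surprise might look like the sketch below (my own simplified version, not any specific paper’s method): a dynamics model is trained to predict the next observation from the current one plus the action, and surprise is its prediction error. On a noisy TV, that error never goes to zero, no matter how long the model trains.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 32, 4                        # placeholder sizes
dynamics = nn.Sequential(                       # f: predicts o' from (o, a)
    nn.Linear(obs_dim + act_dim, 128), nn.ReLU(), nn.Linear(128, obs_dim)
)

def forward_surprise(obs, act, next_obs):
    """Prediction-error curiosity: how far was the predicted next observation
    from the one that actually arrived? Irreducibly high for random transitions."""
    pred_next = dynamics(torch.cat([obs, act], dim=-1))
    return ((pred_next - next_obs) ** 2).sum(dim=-1)
```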

RND, by virtue of not using forward prediction, is not vulnerable to stochastic state transitions like earlier approaches were, and this is one of the key contributions of the paper. There are a couple issues with RND as I’ve formulated it in this post, though:

First, while RND overcame a temporal version of the Noisy TV Problem, policies can get distracted. Imagine an environment with a TV that changed to a random, unpredictable picture every frame. The target $T$ would always produce some unique random embedding $v_{\text{wow}}$, and our predictor $\hat{T}$ would produce a completely different embedding $\hat{v}_{\text{hmm}}$. As these embeddings are different (because the picture is new!), the policy receives a very high reward for staying near and observing the TV. The policy becomes glued to the TV, addicted to a random stream of meaningless information that is impossible to learn.

Second, as a brief aside, we assumed that states could be mostly recovered from observations. This means we assumed (mostly-)total information, which is true of games like chess, but not true in general. If you’re playing a video game, there are a lot of things the computer is keeping track of in $s$ (like the location of enemies offscreen), that cannot necessarily be deduced from observation alone. In the real world, we’d like some way to model the underlying environment state, and not just our observations of this unobservable state.

Taking a step back, though: why does RND work at all?

Is it possible to solve the Noisy TV problem generally, in a way that is theoretically sound and computationally tractable? The Noisy TV Problem goes far beyond literal noise in the environment: anything that is random, or otherwise too complicated for the model to realistically learn, may become an unwanted distraction. You can’t fit a line to a parabola, no matter how much data you have. How do we seek out the signal in the noise?

You’ll have to stay tuned because it’s way too late and I’m going to bed

Surprise! I’m sure you saw that coming. I’m *so good* at writing exposition I didn’t even get to the fun part!

Tomorrow I’ll write a little primer on information theory, including entropy, surprisal, aleatoric uncertainty, epistemic uncertainty, total uncertainty (spoiler: it’s just entropy!), and learnable uncertainty. The key point is that we want curiosity to map to the “number of learnable bits collected throughout the course of an episode”.

If I have time tomorrow (and if not, sometime the day after or next week), I will write about how to compute and approximate these quantities, along with some state-of-the-art approaches and other tricks (ensembles, SWAG, BALD, KFAC, BAIT, etc.). I’ll then wrap things up by talking about some approaches I find fun and/or promising, like Decision Transformers with “learnable bits of information to go” in place of reward-to-go. Let’s make mid-training a thing!

To tide you over until next time, you might enjoy this thematically-appropriate post:
