Epistemic and Aleatoric Uncertainty

2026-02-17 · 2932 words

This is a follow-on post to yesterday’s, um, last Friday’s post about curiosity-driven approaches to reinforcement learning. I intended to publish this post on the 14th, but it was Valentine’s Day weekend and $\text{irl} > \text{blog}$. To summarize:

Imagine we have some black-box environment with states $s$. At each timestep, we receive an observation $o$ (derived from $s$), and can pick an action $a$. The environment evolves according to a stochastic state transition dynamic $s' \sim P(\cdot \mid s, a)$ depending on the current state and our action, producing a new observation $o'$.

Suppose we don’t have a goal or reward function for our environment; we simply would like to train a neural network to model $P(s’ \mid s, a)$. How can we sample sequences of actions, or episodes, that contain all the information we need to build an accurate model of state transition dynamics?

The solution looks something like curiosity: construct a reward signal along the lines of “the number of learnable bits of information per observation”, and train a policy $\pi$ using traditional RL techniques like PPO to sample diverse, information-dense trajectories.

Today I’m going to try to explain the following relation:

$$\text{uncertainty} = \text{aleatoric} + \text{epistemic}$$

If that doesn’t make sense now, don’t worry. It should be clear by the end of the post.

Our goal is to learn a policy that is capable of exploring an arbitrary environment. In essence, we want to seek out states that are surprising, but not totally random; in other words, we want to find states where we are uncertain, but capable of learning through interaction and observation.

Surprisal

Let’s start by quantifying this idea of surprise. In information theory, there is this concept of surprisal, which is defined as:

$$h(y) = -\log P(y)$$

Here, $y \sim Y$ is some specific outcome; an event sampled from a (discrete) random variable $Y$. Surprisal $h(y)$ is measured in bits when the log is taken base 2. A likely outcome can be encoded in few bits, and likewise tells you little; an unlikely outcome takes many bits to encode, and conversely tells you a lot.

To make surprisal concrete, rolling a 4 on a fair die has probability $\frac{1}{6}$. The surprisal is $h(⚃) = \log_2 6 \approx 2.58$ bits of information.
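If you want to poke at it, a quick sanity check in Python (just the formula above, log base 2):

```python
import math

def surprisal(p: float) -> float:
    """Surprisal in bits of an outcome with probability p."""
    return -math.log2(p)

print(surprisal(1 / 6))  # rolling a 4 on a fair die: ~2.58 bits
print(surprisal(1 / 2))  # calling a fair coin flip: exactly 1 bit
```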

Entropy, total uncertainty

Surprisal is concerned with a single event. If we have a distribution $Y$ over multiple events, is there a way to describe how much information we expect to learn upon measuring the outcome?

Yes, and this is exactly entropy $H[Y]$, or “expected surprisal”:

$$H[Y] = \mathbb{E}[-\log P(Y)]$$

We use lowercase $h(y)$ for the surprisal of a single event $y$, and uppercase $H[Y]$ for the expected surprisal over a distribution of events $Y$. The expectation $\mathbb{E}$ here just measures the surprisal of each event, weighted by how probable it is. Mathematically speaking:

$$H[Y] = \sum_y P(y) \cdot h(y)$$

You can think about entropy as, “in expectation, how many bits of information will we gain by making an observation?”
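In code, that expectation is just a probability-weighted sum of surprisals. A tiny sketch:

```python
import math

def entropy(probs) -> float:
    """Entropy in bits: expected surprisal over a discrete distribution."""
    return sum(p * -math.log2(p) for p in probs if p > 0)

print(entropy([1 / 6] * 6))         # fair die: ~2.58 bits per roll
print(entropy([0.9] + [0.02] * 5))  # loaded die: ~0.70 bits, much less surprising on average
```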

Conditional entropy

Entropy measures how much information we expect to learn about $Y$… in isolation. Oftentimes, we’ve observed some correlated event $x$. In that case, how much information do we expect to learn from $Y$, given we already know $x$? This is precisely conditional entropy $H[Y \mid x]$:

$$H[Y \mid x] = \mathbb{E}[-\log P(Y \mid x)]$$

We write lowercase $x$ because we’re conditioning on a specific input, rather than averaging over all possible inputs, which is what you normally see.

Note that this is exactly the same thing as entropy, just over the conditional distribution $P(Y \mid x)$ instead of $P(Y)$. If knowing $x$ makes $Y$ easier to predict, the conditional entropy is lower. If $x$ is independent of $Y$, then $H[Y \mid x] = H[Y]$. Like entropy, we can also write conditional entropy as a weighted sum, but conditioned on $x$:

$$H[Y \mid x] = \sum_y P(y \mid x) \cdot h(y \mid x)$$

In our environment, if $S$ is the distribution over next states, and $(s, a)$ is our state-action pair, then $H[S \mid s, a]$ is an exact measure of how unpredictable the next state is from a given state-action pair. This measures, in essence, how random state transitions are. For a deterministic environment, $H[S \mid s, a] = 0$. In practice, we observe $o$, not $s$. Reconstructing state from observations is a different filtery, hidden-Markov-modelly can of worms. We’ll gloss over this for now, and condition on $s$ directly, but the intractable component of epistemic uncertainty I touch on later comes partly from this gap.
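Here’s a toy illustration with a made-up transition distribution for a single state-action pair (the state names are placeholders, not from any real environment):

```python
import math

def entropy(probs) -> float:
    """Entropy in bits of a discrete distribution."""
    return sum(p * -math.log2(p) for p in probs if p > 0)

# Hypothetical P(s' | s, a) for one state-action pair: the action usually
# works, but occasionally the agent slips to either side.
p_next = {"s_forward": 0.9, "s_slip_left": 0.05, "s_slip_right": 0.05}

print(entropy(p_next.values()))  # H[S | s, a] ~= 0.57 bits of transition noise
print(entropy([1.0]))            # a deterministic transition: exactly 0 bits
```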

Conditional entropy tells us how uncertain $Y$ is given a specific event $x$. But in general, if we know the value of a correlated random variable $Z$, we often want to ask how much of $Y$ it tells us. If measured in bits, this quantity is called mutual information.

Mutual information

A coin flip has high conditional entropy regardless of context; you can never learn to predict it. For curiosity, we want our policy to seek states that it finds unpredictable because it has not yet seen them, not because the environment is random and can’t be predicted.

Mutual information between $Y$ and some other variable $Z$ measures how many bits of the expected surprisal of $Y$ are accounted for if you already know $Z$. It’s precisely the gap between entropy and conditional entropy:

$$I[Y; Z] = H[Y] - H[Y \mid Z]$$

We can draw this as a Venn diagram, which is nicely symmetric:

[Venn diagram: two overlapping circles, $H[Y]$ and $H[Z]$. The part of $H[Y]$ outside the overlap is $H[Y \mid Z]$, the part of $H[Z]$ outside the overlap is $H[Z \mid Y]$, and the overlap is $I[Y; Z]$.]
Consider the limiting cases: if $Z$ tells us everything about $Y$, then $H[Y \mid Z] = 0$ and $I[Y; Z] = H[Y]$. All of $Y$’s information is already contained in $Z$; the $Y$ circle sits entirely inside the $Z$ circle. On the other hand, if $Y$ and $Z$ are independent, $Z$ tells us nothing about $Y$, so $I[Y; Z] = 0$, and the circles are disjoint.
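If you like watching the bookkeeping, here’s the identity checked on a small made-up joint distribution:

```python
import math
from collections import defaultdict

def entropy(probs) -> float:
    return sum(p * -math.log2(p) for p in probs if p > 0)

# A made-up joint distribution P(y, z) over two correlated binary variables.
joint = {("y0", "z0"): 0.4, ("y0", "z1"): 0.1,
         ("y1", "z0"): 0.1, ("y1", "z1"): 0.4}

# Marginals P(y) and P(z).
p_y, p_z = defaultdict(float), defaultdict(float)
for (y, z), p in joint.items():
    p_y[y] += p
    p_z[z] += p

h_y = entropy(p_y.values())  # H[Y] = 1.0 bit

# H[Y | Z] = sum_z P(z) * H[Y | Z = z]
h_y_given_z = sum(pz * entropy([joint[(y, z)] / pz for y in p_y])
                  for z, pz in p_z.items())  # ~0.72 bits

print(h_y - h_y_given_z)  # I[Y; Z] = H[Y] - H[Y | Z] ~= 0.28 bits
```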

Aleatoric uncertainty

Up until this point, we have been talking about information abstractly. With surprisal, (conditional) entropy, and mutual information, we have the tools to quantify how much we learn from an observation, and what it means for a state transition to be “random”.

In statistics, we talk about models. Models have parameters, often denoted $\theta$, and let us make predictions through inference. (Note that the model $\theta$ is distinct from the policy $\pi$. The model predicts state transitions, and the policy decides actions. We train $\pi$ to seek states where $\theta$ can learn.) In practice, you often train them together, so the world model is contained either implicitly or explicitly in the policy.

Concretely, with a model $\theta$ we can ask: Given an input $x$, what distribution does the model assign to the outcome $Y$? We can write this as $P(Y = y \mid x, \theta)$, for the probability of a specific outcome $y$.

We can also talk about uncertainty. Given specific parameters $\theta$ and an input $x$, we can quantify the uncertainty of the outcome $Y$ using conditional entropy:

$$\text{prediction uncertainty} = H[Y \mid x, \theta]$$

This “prediction uncertainty” is the noise that the model can never explain. If we had a perfect model of a deterministic process, $H[Y \mid x, \theta] = 0$. However, even if we have a perfect model $\theta^*$ for a random, or stochastic, process, by virtue of being stochastic, $H[Y \mid x, \theta^*] > 0$. We can never predict a random outcome with 100% certainty.
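Take a fair coin with the best possible model $\theta^*$: the model can only ever say $P(\text{heads} \mid x, \theta^*) = P(\text{tails} \mid x, \theta^*) = \frac{1}{2}$, so

$$H[Y \mid x, \theta^*] = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \text{ bit}$$

of uncertainty remains, no matter how good the model gets.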

In real life, we don’t have perfect models like $\theta^*$. We derive parameters $\theta$ for models from data, $D$, a set of observations. In practice, we often find $\theta$ by randomly initializing a model and training it on $D$ through, for example, stochastic gradient descent. We could end up with many different parameters $\theta$ depending on how the model is initialized. Given a model class, training on $D$ gives us a posterior distribution over all models, which we can write $P(\theta \mid D)$. In English: what is the distribution over model parameters $\theta$, given the training data $D$ we have?

Different models make different predictions, but no model can explain away noise. We want to quantify this irreducible noise. Since we don’t know which model is correct, the natural way is to average prediction uncertainty over the posterior $P(\theta \mid D)$:

$$\text{aleatoric} = \mathbb{E}_{P(\theta \mid D)}[H[Y \mid x, \theta]]$$

This is the noise floor, the uncertainty that remains no matter how much data we collect or how well we train. This quantity is called aleatoric uncertainty. In Portuguese, the word for “random” is aleatório, so that’s how I remember “aleatoric”.

If we marginalize $\theta$ out of the predictive distribution, $P(Y \mid x) = \mathbb{E}_{P(\theta \mid D)}[P(Y \mid x, \theta)]$, its entropy is $H[Y \mid x]$, our total uncertainty over $Y$ given $x$. There is this gap, then, between total uncertainty $H[Y \mid x]$ and aleatoric uncertainty $\mathbb{E}[H[Y \mid x, \theta]]$. The gap is the uncertainty that comes from not knowing which model is correct. This is the uncertainty in the model that will shrink with more of the right data.

Epistemic uncertainty

Uncertainty can come from two places: our environment can be noisy, or our model can be incorrect. We just derived aleatoric uncertainty, which quantifies how noisy our environment is. How do we know when our model is uncertain?

If we take our total uncertainty $H[Y \mid x]$ and we subtract out the aleatoric uncertainty $\mathbb{E}[H[Y \mid x, \theta]]$, we are left with a gap. This gap is all the uncertainty that is not explained by noise. This gap is precisely how uncertain our model is:

$$\text{epistemic} = H[Y \mid x] - \mathbb{E}_{P(\theta \mid D)}[H[Y \mid x, \theta]]$$

Those who were paying attention earlier might have noticed that this quantity, epistemic uncertainty, is exactly the mutual information between the outcome $Y$ and model parameters $\theta$, conditioned on our training data $D$ and our input $x$:

$$\begin{align} I[Y; \theta \mid x, D] &= H[Y \mid x] - \mathbb{E}_{P(\theta \mid D)}[H[Y \mid x, \theta]] \\ \text{epistemic} &= \text{total} - \text{aleatoric} \end{align}$$

Another way to think about this: if the models in the posterior $P(\theta \mid D)$ disagree on what $Y$ will be given the training data $D$ we have, epistemic uncertainty is high.
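As a sketch of how you might estimate all three terms in practice: treat a handful of independently trained models as samples from the posterior $P(\theta \mid D)$. This is the ensemble / BALD flavor of things, one common approximation among several, not the only way to do it:

```python
import numpy as np

def entropy_bits(probs, axis=-1):
    """Entropy in bits along the last axis."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log2(p)).sum(axis=axis)

def decompose(ensemble_probs):
    """ensemble_probs: (n_models, n_classes) predictions for one input x.
    Treats ensemble members as samples from the posterior P(theta | D)."""
    total = entropy_bits(ensemble_probs.mean(axis=0))  # H[Y | x]: entropy of the averaged prediction
    aleatoric = entropy_bits(ensemble_probs).mean()    # E_theta[ H[Y | x, theta] ]
    epistemic = total - aleatoric                      # I[Y; theta | x, D]
    return total, aleatoric, epistemic

# Three hypothetical models that agree: all the uncertainty is noise.
agree = np.array([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])
# Three hypothetical models that disagree: most of the uncertainty is epistemic.
disagree = np.array([[0.99, 0.01], [0.01, 0.99], [0.50, 0.50]])

for name, probs in [("agree", agree), ("disagree", disagree)]:
    total, aleatoric, epistemic = decompose(probs)
    print(f"{name}: total={total:.2f} aleatoric={aleatoric:.2f} epistemic={epistemic:.2f}")
# agree: total=1.00 aleatoric=1.00 epistemic=0.00
# disagree: total=1.00 aleatoric=0.39 epistemic=0.61
```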

If we collect more training data in states where epistemic uncertainty is high, we can reduce it. More data shrinks the disagreement of the posterior. As the distribution of possible models converges, $I[Y; \theta \mid x, D] \to 0$, epistemic uncertainty decreases.

Curiosity should target states of high epistemic uncertainty. Our goal is to build a dataset $D$ on which we can train a model $\theta$ that accurately models the state transition dynamics of the environment. To do this, we want to learn a policy $\pi$ that can navigate to regions of state space that are hard to reach, and explore all there is to learn.

If we train a policy to maximize epistemic uncertainty over an episode, we are training the policy to maximize the time it spends in states where the model can still learn.

Decomposition

Now we have all the tools to understand the original equation:

$$\text{uncertainty} = \text{aleatoric} + \text{epistemic}$$

Where uncertainty $H[Y \mid x]$ is the total uncertainty over outcomes given the input, measured using conditional entropy. Aleatoric uncertainty $\mathbb{E}_{P(\theta \mid D)}[H[Y \mid x, \theta]]$ is the irreducible noise floor, what we can never learn through observation. Epistemic uncertainty $I[Y; \theta \mid x, D]$ is the mutual information between outcomes and model parameters.

We can make this relationship explicit by drawing it on the mutual information Venn diagram from earlier:

[Venn diagram: two overlapping circles, $H[Y \mid x]$ on the left and $H[\theta \mid D]$ on the right. The part of $H[Y \mid x]$ outside the overlap is the aleatoric term $\mathbb{E}[H[Y \mid x, \theta]]$, the overlap is the epistemic term $I[Y; \theta \mid x, D]$, and the part of $H[\theta \mid D]$ outside the overlap is $H[\theta \mid Y, x, D]$.]

The left circle is the total uncertainty over outcomes $H[Y \mid x]$; the right circle is the posterior uncertainty over model parameters $H[\theta \mid D]$.

To minimize uncertainty in our model $\theta$, we must train a policy $\pi$ that maximizes this overlap, epistemic uncertainty, by seeking states where knowing $\theta$ would tell us the most about $Y$. The policy needs to go places where the model we have matters.

Concretely, if within our posterior $P(\theta \mid D)$, some models say $P(Y = a \mid x, \theta) > P(Y = b \mid x, \theta)$, and other models say the converse, we want to collect more instances of $x$ to determine whether $Y$ is $a$ or $b$. This marginal data helps us shrink parameter space $\theta$ as quickly as possible; this marginal data helps us learn as quickly as possible.

But what can be learned in the first place?

Learnable vs Intractable

You can’t fit a line to a parabola; sometimes even the best parameters are wrong if the model doesn’t make sense.

Some deterministic processes, like hash functions, are designed to be computationally intractable to invert. Hash functions aren’t random, but they certainly don’t count as a process we can learn with more data.

What is randomness, anyway? Here is a completely deterministic simulation of 15 balls bouncing around in a 16×16 box, always starting from the same state. (Note that $n$ is the number of particles in the right half of the box, and $H$ is the entropy of the empirical distribution.)

Is $n$ random?

[Interactive simulation: reset / pause controls, the current $n$ and $H$, and a running histogram of $n$ to the side.]

Well, no! Despite being chaotic, this simulation is not random! (Try resetting the simulation, and stepping it forward a couple of times to convince yourself of this fact.) For any given state $s$, we can deterministically step the system and observe $n$. There is no aleatoric uncertainty, no randomness. But this doesn’t feel fair.
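If it helps to see the determinism spelled out, here’s a minimal sketch of a system like this one. The widget’s exact rules aren’t written down here, so assume axis-aligned motion, wall bounces, and no ball-ball collisions:

```python
import random

GRID = 16
DIRS = [(1, 0), (-1, 0), (0, 1), (0, -1)]

# Fixed seed, fixed initial state; everything after setup is deterministic.
rng = random.Random(0)
balls = [((rng.randrange(GRID), rng.randrange(GRID)), rng.choice(DIRS))
         for _ in range(15)]

def step(balls):
    """Advance every ball one cell, bouncing off walls. Balls pass through
    each other (the real widget may differ). No randomness anywhere."""
    out = []
    for (x, y), (dx, dy) in balls:
        if not (0 <= x + dx < GRID):
            dx = -dx
        if not (0 <= y + dy < GRID):
            dy = -dy
        out.append(((x + dx, y + dy), (dx, dy)))
    return out

def n_right_half(balls):
    """The only thing we get to observe: how many balls are in the right half."""
    return sum(x >= GRID // 2 for (x, _), _ in balls)

for _ in range(20):
    balls = step(balls)
    print(n_right_half(balls), end=" ")  # looks noisy; is perfectly deterministic
```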

Let’s say we can only observe $n$, and want to predict how it evolves over time. Since this system is completely deterministic, if we had perfect information about the position and direction of each particle, we could build a model that perfectly captures $n$.

However, we can only observe $n$. We don’t know the state of the box. Even if we knew the dimensions of the box, the rules of the game, and the number of balls, it would be very hard to infer the position and direction of each particle purely from observing how $n$ changes over time.

N.B. Concretely, there are $\binom{256}{15}$ ways to place 15 particles on a 16×16 grid (~79 bits) and 4 directions each particle can take (30 bits). That’s 109 bits of state total. From $n$ we get $\leq 4$ bits of information per observation, so it would take at least 28 observations to gather 109 bits. We could only reconstruct the state at that point if we could use the information from each $n$ perfectly, and I would be surprised if there’s a computationally tractable way to do that.
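If you want to check that arithmetic yourself:

```python
import math

positions = math.log2(math.comb(256, 15))  # ways to place 15 balls on a 16x16 grid
directions = 15 * math.log2(4)             # 4 possible headings per ball
bits_per_obs = math.log2(16)               # n takes one of 16 values: 0..15

print(positions)                                # ~79.1 bits
print(positions + directions)                   # ~109.1 bits of state
print((positions + directions) / bits_per_obs)  # ~27.3, so at least 28 observations
```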

The key insight is that although $n$ is not random, it appears random to an outside observer. In essence, this is what statistics is: how do we model processes that we cannot observe perfectly?

Even though the evolution of $n$ is deterministic, it is not learnable; it will look like a random variable sampled from some distribution. The distribution is approximated with the histogram we draw to the side.

If a latent variable is not learnable, it might as well be random. So we might as well divide our equation for uncertainty further:

$$\begin{align} \text{uncertainty} &= \text{aleatoric} + \text{epistemic} \\ &= \text{aleatoric} + (\text{learnable} + \text{intractable}) \end{align}$$

This split is a practical or computational one, but there are competing definitions. In general, a portion of epistemic uncertainty is intractable for one of two reasons:

  1. If our model class cannot represent the true dynamics, no amount of data will close the gap; you can’t fit a line to a parabola.
  2. If observations never contain information present in some part of state space, that information is inaccessible through the information channel we have.

In practice, we can train a policy $\pi$ to collect episodes in a way that sidesteps this intractable component. If epistemic uncertainty $I[Y; \theta \mid x, D]$ decreases as $D$ grows, those bits are learnable; if it plateaus, they are intractable. So instead of maximizing epistemic uncertainty, we maximize learning progress: the rate at which epistemic uncertainty decreases as we collect more data.
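A sketch of what that reward might look like, assuming you already have some estimator of epistemic uncertainty to feed it (the ensemble gap from earlier, or anything else); the trailing-average smoothing here is illustrative, not lifted from any particular method:

```python
from collections import deque

class LearningProgress:
    """Reward = how quickly estimated epistemic uncertainty is dropping."""

    def __init__(self, window: int = 32):
        self.history = deque(maxlen=window)  # recent epistemic-uncertainty estimates

    def reward(self, epistemic_now: float) -> float:
        # Compare the current estimate to a trailing average. Positive reward
        # means uncertainty is still shrinking (learnable bits remain); near
        # zero means we've plateaued (already learned, or intractable).
        baseline = (sum(self.history) / len(self.history)
                    if self.history else epistemic_now)
        self.history.append(epistemic_now)
        return baseline - epistemic_now
```

A policy rewarded this way gets nothing for sitting in noise (the aleatoric term never moves), nothing for revisiting what it has already learned (epistemic is already near zero), and nothing for staring at intractable bits (epistemic is high but flat).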

Closing thoughts

In theory, practice is theory. In practice, theory isn’t. Going from theory to practice requires some good engineering.

As you can imagine, measuring learning progress by taking the derivative of an estimated epistemic uncertainty value is… quite unstable. Modern research applies a whole slew of tricks to go from theory to practice. This engineering is where the excitement lies.

I care about this because I want to build data-efficient models of physical systems; the kind where episodes are expensive and you can’t afford to waste trajectories on noise. Flapping airplanes, for instance.

The next time I write about this, curiosity, I’ll talk about learning progress, stable exploration signals, RND’s practical success, how to compute epistemic and aleatoric uncertainty in practice (ensembles, BALD, SWAG, etc.), and other directions for sussing out learnable structure.

Until then, you might enjoy:
