Diffusion Models Explained: The Guide I Wish I Had (Part 2)

Illustration generated with Google Nano Banana

Part 2: Background, The Mathematical Foundations for Diffusion Models

In the first part of this series, we built an intuition for generative modeling and placed diffusion models in the broader landscape of VAEs, GANs, and likelihood-based methods. We ended with a high-level idea: diffusion models gradually add Gaussian noise to data, step by step, and then learn to reverse that process.

The motivation for going all the way to noise is simple. If we can start from pure noise and reliably transform it into a realistic image, then we have a generative model. The forward noising process gives us a clean, well-defined starting point for sampling.

What’s less obvious is why this particular construction works.
Why does the noising happen gradually, instead of in a single step?
How does the model actually learn to denoise, and what training signal tells it that it is doing the right thing?
How does this correspond to learning a probability distribution at all?
Do we have mathematical guarantees that this process should work at all, rather than being a heuristic that just happens to work in practice?

These questions will be answered across the next two parts of the series. This second part is about setting up the necessary background. The third part will then focus entirely on diffusion models themselves. There, we will derive the forward noising process, the learned reverse dynamics, and the training objective in full detail. The goal here is not to memorize equations or reproduce derivations, but to build enough intuition to see the structure behind diffusion models. This is the level of understanding that lets you read papers, make sense of new variants as they appear, and reason about why certain design choices are made in the first place.

By the end of this part and the next, you should have a clear mental model for:

  • why the forward noising process is constructed the way it is,

  • why the reverse process can be learned at all,

  • and how all of this connects to likelihood-based training.

With that foundation in place, we’ll be ready to implement our first diffusion model from scratch.

Much of the intuition and probabilistic framing in this part and the next is adapted from [1] Calvin Luo, Understanding Diffusion Models: A Unified Perspective. This article selectively borrows from that work and focuses on building intuition rather than presenting full mathematical rigor. If you’re looking for a more complete and formal treatment, the original paper is exceptionally well written and highly recommended.

1. Background: ELBO

In Part 1, we explained that the goal of a generative model is to estimate the data density p(x). We also introduced latent variable models, where the observed data x is assumed to be generated with the help of unobserved latent variables z. In this setting, we can think of the model as described by a joint distribution: p(x,z).

Our goal is to learn the parameters of this distribution using the observed data. A natural approach is Maximum Likelihood Estimation (MLE), which we also presented in Part 1: we choose the parameters that maximize the likelihood of the observed data. The likelihood function measures how well a statistical model fits a sample of data and is formed from the joint probability distribution of the sample. Thus we want to learn a model that maximizes the likelihood p(x) of all observed x.
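To make the MLE idea concrete, here is a tiny illustrative sketch in Python (my own toy example, not part of the original derivation): if we assume the data comes from a Gaussian with unit variance, the mean that maximizes the likelihood is simply the sample mean.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=1_000)   # observed samples x

# Log-likelihood of the whole dataset under N(mu, 1), as a function of mu.
def log_likelihood(mu):
    return norm.logpdf(data, loc=mu, scale=1.0).sum()

candidates = np.linspace(0.0, 4.0, 401)
mu_mle = candidates[np.argmax([log_likelihood(mu) for mu in candidates])]
print(mu_mle, data.mean())   # the MLE of mu matches the sample mean (up to grid resolution)
```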

To connect the latent-variable model to the likelihood of the observed data, we marginalize out the latent variables:

\begin{equation} p(x) = \int p(x,z) \,dz \end{equation}

We can also use the chain rule of probability:

\begin{equation} p(x) = \frac{p(x,z)}{p(z|x)} \end{equation}

Directly computing and maximizing this likelihood is difficult, because it requires either integrating over all possible latent variables z or having access to the true posterior distribution p(z|x). As discussed in Part 1, instead of maximizing the likelihood directly, we work with the log-likelihood, also called the evidence. The log-likelihood is easier to work with mathematically, but the core problem remains: the marginalization over z is still intractable.
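To see why the marginalization is the sticking point, consider a toy model where the latent variable is discrete and binary (an illustrative assumption of this sketch, not something from the article). There the integral collapses to a two-term sum and p(x) is trivial to evaluate; the difficulty arises precisely when z is continuous and high-dimensional.

```python
import numpy as np
from scipy.stats import norm

# Toy latent-variable model (illustrative assumption):
#   z in {0, 1} with p(z=0) = p(z=1) = 0.5
#   p(x | z=0) = N(-2, 1),  p(x | z=1) = N(+2, 1)
prior = {0: 0.5, 1: 0.5}
means = {0: -2.0, 1: +2.0}

def p_x(x):
    # Because z is discrete and tiny, p(x) = sum_z p(z) p(x|z) is a cheap sum.
    return sum(prior[z] * norm.pdf(x, loc=means[z], scale=1.0) for z in prior)

print(p_x(0.0))   # marginal likelihood of x = 0
# With a continuous, high-dimensional z this sum becomes an integral over all
# of latent space, which is exactly the intractable quantity discussed above.
```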

To address this, we introduce a quantity called the Evidence Lower Bound (ELBO). The equation of the ELBO is:

\begin{equation} \mathbb{E}_{q_{\phi}(z|x)}\Biggl[\log{\frac{p(x,z)}{q_{\phi}(z|x)}}\Biggr] \end{equation}

As its name suggests, the ELBO is a lower bound on the evidence:

\begin{equation} \log{p(x)} \geq \mathbb{E}_{q_{\phi}(z|x)}\Biggl[\log{\frac{p(x,z)}{q_{\phi}(z|x)}}\Biggr] \end{equation}
Here, q_ϕ(z|x) is a parameterizable model that is trained to estimate the distribution over the latent variables z given an observation x. In other words, it is an approximation of the true posterior p(z|x).

Instead of optimizing the log-likelihood directly, we optimize a bound that we can actually compute. Maximizing the ELBO is therefore a practical proxy objective for training a latent variable model.

Why is the Evidence Lower Bound (ELBO) a meaningful objective to optimize?

  • Starting from Equation (1):
    \begin{equation} \begin{split} \log p(x) & = \log \int p(x,z) \,dz \\ & = \log \int \frac{p(x,z)\, q_{\phi}(z|x)}{q_{\phi}(z|x)} \,dz\\ & = \log{\mathbb{E}_{q_{\phi}(z|x)}\Biggl[\frac{p(x,z)}{q_{\phi}(z|x)}\Biggr]} \text{\quad \quad \quad \quad \quad (Definition of Expectation)}\\ & \geq \mathbb{E}_{q_{\phi}(z|x)}\Biggl[\log{\frac{p(x,z)}{q_{\phi}(z|x)}}\Biggr] \text{\quad \quad \quad \quad \quad (Jensen's Inequality)} \end{split} \end{equation}
  • Using Equation 2:
    \begin{equation} \begin{split} \log{p(x)} & = \log{p(x)} \int q_{\phi}(z|x) \,dz \\ & = \int q_{\phi}(z|x) (\log{p(x)})\,dz \\ & = \mathbb{E}_{q_{\phi}(z|x)}[\log{p(x)}] \text{\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad (Definition of Expectation)}\\ & = \mathbb{E}_{q_{\phi}(z|x)}\Biggl[\log{\frac{p(x,z)}{p(z|x)}}\Biggr] \text{\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad (Equation 2)}\\ & = \mathbb{E}_{q_{\phi}(z|x)}\Biggl[\log{\frac{p(x,z)q_{\phi}(z|x)}{p(z|x)q_{\phi}(z|x)}}\Biggr]\\ & = \mathbb{E}_{q_{\phi}(z|x)}\Biggl[\log{\frac{p(x,z)}{q_{\phi}(z|x)}}\Biggr] + \mathbb{E}_{q_{\phi}(z|x)}\Biggl[\log{\frac{q_{\phi}(z|x)}{p(z|x)}}\Biggr]\\ & = \mathbb{E}_{q_{\phi}(z|x)}\Biggl[\log{\frac{p(x,z)}{q_{\phi}(z|x)}}\Biggr] + D_{KL}(q_{\phi}(z|x) \parallel p(z|x)) \text{\quad \quad (Definition of KL Divergence)}\\ & \geq \mathbb{E}_{q_{\phi}(z|x)}\Biggl[\log{\frac{p(x,z)}{q_{\phi}(z|x)}}\Biggr] \text{\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \, (KL Divergence always $\geq 0$)} \end{split} \end{equation}

The last equation tells us that the evidence (the log-likelihood) can be decomposed into two terms: the Evidence Lower Bound (ELBO) and the KL divergence between the approximate posterior and the true posterior.
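We can sanity-check this decomposition numerically on a toy conjugate Gaussian model in which the evidence, the true posterior, and therefore the KL term are all available in closed form. The model below (the prior, the likelihood, and the particular choice of q) is an illustrative assumption for this sketch, not something taken from the article.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy conjugate model (illustrative assumption):
#   p(z) = N(0, 1),  p(x|z) = N(z, 1)
# Then p(x) = N(0, 2) and p(z|x) = N(x/2, 1/2) in closed form.
x = 1.3
log_evidence = norm.logpdf(x, loc=0.0, scale=np.sqrt(2.0))

# Pick an arbitrary approximate posterior q(z|x) = N(m, s^2).
m, s = 0.9, 0.8

# Monte Carlo estimate of the ELBO: E_q[log p(x, z) - log q(z|x)].
z = rng.normal(m, s, size=200_000)
log_joint = norm.logpdf(x, loc=z, scale=1.0) + norm.logpdf(z, loc=0.0, scale=1.0)
elbo = np.mean(log_joint - norm.logpdf(z, loc=m, scale=s))

# Closed-form KL(q(z|x) || p(z|x)) between two univariate Gaussians.
mu_post, sd_post = x / 2.0, np.sqrt(0.5)
kl = np.log(sd_post / s) + (s**2 + (m - mu_post) ** 2) / (2 * sd_post**2) - 0.5

print(f"log p(x)  = {log_evidence:.4f}")
print(f"ELBO + KL = {elbo + kl:.4f}")   # matches log p(x) up to Monte Carlo noise
print(f"gap (KL)  = {kl:.4f}")          # always >= 0, so ELBO <= log p(x)
```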

What does this mean in practice?

Recall that we introduced the latent variable z to capture the underlying structure that explains the observed data x. Our original objective is to find the parameters of the generative model, denoted by θ, that maximize the log-likelihood log p(x), but this is difficult since we don’t have access to the true posterior distribution p(z|x). To make progress, we introduced an approximation parameterized by ϕ. We want to optimize the parameters of this approximation so that it matches the true posterior distribution as closely as possible.
The KL divergence measures how close this approximation is to the true posterior distribution. Minimizing the KL divergence corresponds exactly to making this approximation better. In the idealized case where our approximation is expressive enough and optimization succeeds perfectly, the KL divergence goes to zero.

The key point is that maximizing the ELBO plays two different roles. With respect to the variational parameters ϕ, the evidence term is a constant. Since the ELBO and the KL divergence sum to this constant, any increase in the ELBO must be accompanied by a corresponding decrease in the KL divergence. As a result, maximizing the ELBO with respect to ϕ implicitly minimizes the KL divergence. With respect to the model parameters θ, however, the situation is different: the ELBO is a lower bound on the evidence, so maximizing the ELBO with respect to θ serves to maximize the evidence itself.
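Continuing the toy Gaussian model from the previous sketch (still an illustrative assumption), we can write the ELBO analytically as a function of the parameters of q and maximize it by brute-force grid search. The maximizer turns out to be exactly the true posterior N(x/2, 1/2), i.e. the point at which the KL term vanishes.

```python
import numpy as np

# Same toy model as above: p(z) = N(0, 1), p(x|z) = N(z, 1), observed x.
# Analytic ELBO for q(z|x) = N(m, s^2), via standard Gaussian expectations.
def elbo(x, m, s):
    e_log_lik   = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s**2)
    e_log_prior = -0.5 * np.log(2 * np.pi) - 0.5 * (m**2 + s**2)
    entropy_q   =  0.5 * np.log(2 * np.pi * np.e * s**2)
    return e_log_lik + e_log_prior + entropy_q

x = 1.3
grid = [(m, s) for m in np.linspace(-2, 2, 401) for s in np.linspace(0.05, 2, 200)]
m_best, s_best = max(grid, key=lambda ms: elbo(x, *ms))
print(m_best, s_best)   # ~ (0.65, 0.71): the true posterior mean x/2 and std sqrt(1/2)
```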

This variational perspective, introducing latent variables, approximating their posterior, and optimizing a lower bound on the data likelihood, is the conceptual foundation that we will build on in what follows.

2. Background: Variational Autoencoders (VAEs)

At the end of the previous section, we reached an important conclusion: one way to build a generative model with latent variables is to maximize the Evidence Lower Bound (ELBO) with respect to both the generative model parameters θ and the variational parameters ϕ. One model that relies on this idea is the variational autoencoder (VAE).
To understand VAEs, we need to address a subtle point that may not be intuitive to everyone: the posterior distribution is different for each data point x.

Consider a simple example. You observe the sum of two dice, which we call X, and the latent variable Z is the pair of individual dice values. If the observed sum is 2, the posterior distribution is trivial: the probability that both dice are 1 is 1, and the probability of any other outcome is 0. If the observed sum is 7, the posterior distribution is very different: several combinations such as (1,6), (2,5), and (3,4) all have non-zero probability. The shape of the distribution over latent variables depends entirely on the observed value of X.
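A few lines of Python make the dice example concrete (a toy sketch for intuition only, obviously not how a VAE computes posteriors):

```python
from itertools import product

def posterior_over_dice(observed_sum):
    """Posterior p(z | X = observed_sum) over pairs of fair dice z = (d1, d2)."""
    pairs = [z for z in product(range(1, 7), repeat=2) if sum(z) == observed_sum]
    # All consistent pairs are equally likely a priori, so the posterior is uniform over them.
    return {z: 1 / len(pairs) for z in pairs}

print(posterior_over_dice(2))   # {(1, 1): 1.0}: a single certain outcome
print(posterior_over_dice(7))   # six equally likely pairs, each with probability 1/6
```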

Because the posterior is different for every data point, in principle we would need a separate set of variational parameters ϕ for each individual x. The key idea behind variational autoencoders is to avoid this problem by learning an inference network. Instead of directly optimizing variational parameters for each data point, we train a neural network that takes x as input and outputs the parameters of the variational posterior. This network is trained jointly with the generative model. The generative model (often called the decoder) has parameters θ, while the inference network (which can be interpreted as an encoder) has parameters ϕ. Both are optimized simultaneously by maximizing the ELBO.

Illustration of a VAE.[2]
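As a rough sketch of what such an inference network might look like (assuming PyTorch; the class name Encoder, the layer sizes, and the latent dimension are arbitrary choices of mine, not from the article), the encoder maps x to the parameters of a Gaussian q_ϕ(z|x), exactly the parameterization formalized a bit further below.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Amortized inference network: maps x to the parameters of q_phi(z|x)."""
    def __init__(self, x_dim=784, hidden=400, z_dim=20):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)        # mean of q_phi(z|x)
        self.log_var = nn.Linear(hidden, z_dim)   # log-variance of q_phi(z|x)

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.log_var(h)

encoder = Encoder()
x = torch.rand(8, 784)        # a dummy batch (e.g. flattened 28x28 images)
mu, log_var = encoder(x)      # posterior parameters for each example in the batch
```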

Equation 7 dissects the ELBO further to make the encoder and decoder roles explicit:

\begin{equation} \begin{split} \mathbb{E}_{q_{\phi}(z|x)}\Biggl[\log{\frac{p(x,z)}{q_{\phi}(z|x)}}\Biggr] & = \mathbb{E}_{q_{\phi}(z|x)}\Biggl[\log{\frac{p_{\theta}(x|z)p(z)}{q_{\phi}(z|x)}}\Biggr] \\ & = \mathbb{E}_{q_{\phi}(z|x)}\Biggl[\log{p_{\theta}(x|z)}\Biggr] + \mathbb{E}_{q_{\phi}(z|x)}\Biggl[\log{\frac{p(z)}{q_{\phi}(z|x)}}\Biggr] \\ & = \mathbb{E}_{q_{\phi}(z|x)}\Biggl[\log{p_{\theta}(x|z)}\Biggr] - D_{KL}(q_{\phi}(z|x) \parallel p(z)) \end{split} \end{equation}

The two terms in this decomposition play complementary roles. The first term encourages the model to learn latent variables that retain the information necessary to reconstruct the observed data; in other words, it ensures that the learned latent representation is meaningful. The second term measures how close the learned variational distribution is to a prior belief held over the latent variables. Maximizing the ELBO therefore amounts to maximizing the first (reconstruction) term while minimizing the KL divergence between the approximate posterior and the prior.

Once the inference network is trained, computing the posterior for a new data point is straightforward: we simply pass the data through the encoder network.

This architecture is called a variational autoencoder. It is autoencoder-like because data is mapped to a latent representation and then reconstructed back to itself through a bottleneck. It is variational because the latent representation is learned using variational inference. This probabilistic formulation encourages a smooth and continuous latent space, which is fundamentally different from the deterministic representations learned by standard autoencoders.

In practice, the encoder of a VAE is commonly chosen to model a multivariate Gaussian distribution with diagonal covariance, while the prior over latent variables is taken to be a standard multivariate Gaussian:

\begin{equation} q_{\phi}(z|x) = \mathcal{N}(z; \mu_{\phi}(x), \sigma_{\phi}^{2}(x)\mathcal{I}) \end{equation}
\begin{equation} p(z) = \mathcal{N}(z; 0,\mathcal{I}) \end{equation}
A representation of a Gaussian Variational Autoencoder.
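For this choice of encoder and prior, the KL term of the ELBO has a well-known closed form, D_KL(N(μ, σ²I) ∥ N(0, I)) = ½ Σ_i (μ_i² + σ_i² − 1 − log σ_i²). Here is a minimal sketch of it in code (assuming PyTorch; the function name and tensor shapes are illustrative):

```python
import torch

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1)

# Sanity check: when q already equals the prior, the KL is zero.
mu = torch.zeros(4, 20)        # batch of 4, latent dimension 20 (arbitrary)
log_var = torch.zeros(4, 20)   # log sigma^2 = 0  <=>  sigma = 1
print(kl_to_standard_normal(mu, log_var))   # tensor of zeros
```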

The KL divergence term of the ELBO can be computed analytically, and the first term can be approximated using a Monte Carlo estimate. Thus our objective can be rewritten as:

\begin{equation} \underset{\phi, \theta}{\arg\max}\, \mathbb{E}_{q_{\phi}(z|x)}\Biggl[\log{p_{\theta}(x|z)}\Biggr] - D_{KL}(q_{\phi}(z|x) \parallel p(z)) \approx \underset{\phi, \theta}{\arg\max}\, \sum_{l=1}^{L}\log{p_{\theta}(x|z^{(l)})} - D_{KL}(q_{\phi}(z|x) \parallel p(z)) \end{equation}

For each data point x in the dataset, we sample latent variables \{z^{(l)}\}_{l=1}^{L} from the approximate posterior q_{\phi}(z|x) and use them to form the Monte Carlo estimate of the reconstruction term.
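Putting the pieces together, here is a minimal end-to-end sketch of how the Monte Carlo objective in Equation 10 might look in code (assuming PyTorch; the names TinyVAE and negative_elbo, the layer sizes, the Bernoulli decoder, and the choice L = 1 are all illustrative assumptions of mine rather than choices made in this article). To keep the sampling step differentiable, it uses the standard reparameterization z = μ + σ·ε, a common VAE ingredient we will return to when implementing models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """A toy Gaussian-encoder VAE; sizes are arbitrary illustrative choices."""
    def __init__(self, x_dim=784, hidden=400, z_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.log_var = nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, log_var = self.mu(h), self.log_var(h)
        # Reparameterization: z = mu + sigma * eps keeps sampling differentiable.
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        return self.dec(z), mu, log_var

def negative_elbo(model, x):
    """Single-sample (L = 1) Monte Carlo estimate of -ELBO, as in Equation 10."""
    logits, mu, log_var = model(x)
    # Reconstruction term: log p_theta(x|z) for a Bernoulli decoder.
    recon = -F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(-1)
    # Analytic prior-matching term: KL( q_phi(z|x) || N(0, I) ).
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1)
    return (kl - recon).mean()

model = TinyVAE()
x = torch.rand(8, 784)             # a dummy batch with values in [0, 1]
loss = negative_elbo(model, x)     # minimizing this loss maximizes the ELBO
loss.backward()
```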