Diffusion Models, Explained the Way I Wish They Were - Part 1: Intuition and the Big Picture


Illustration generated with Google Nano Banana.

When I first set out to understand diffusion models, I kept running into the same frustrating pattern: short articles saying “they add noise to an image and then learn to denoise it,” with no real explanation of why that works, why it’s done step by step instead of once, or what the math is actually saying. On the other side of the spectrum, research papers dove into dense equations with shifting notations and an assumption that you already spoke the language fluently. There was nothing in between, nothing for someone who wanted to go beyond just downloading a pretrained model from Hugging Face, but wasn’t yet ready to reinvent the field.

After a year and a half of working with diffusion models, breaking them apart, implementing them, reading the papers in circles until they clicked, I want to write the series I wish I’d had back then. This will be a guided journey from the theory that makes diffusion models tick, through real implementations you can run and tinker with, and finally into the improvements that shaped the field, things like DDIM, classifier guidance, and beyond.

I’ll keep each article short enough to read in a sitting, but together they’ll form a complete roadmap: not just what diffusion models do, but why they work, and how to build on them yourself.

Roadmap:

  1. Diffusion Models: Intuition and the Big Picture
    In this part, we’ll start with an intuition for what it means to generate data. We’ll also look at the difference between discriminative and generative models, explore the main families of generative approaches, and conclude with a first look at diffusion models.
  2. Why It Works: The Math Behind the Magic
    In the second article, we’ll go deeper into the theory that explains why diffusion models work. We’ll start with latent variable models, the ELBO, and variational autoencoders, and build on that step by step until we reach diffusion models themselves.
  3. Building a Diffusion Model from Scratch
    Once the theory is clear, we’ll implement a basic diffusion model from scratch. We’ll train it on a small car dataset and observe how it gradually learns to reconstruct structure from pure noise. This part will focus on practical understanding.
  4. Beyond the Basics: Modern Improvements
    Finally, we’ll cover the main improvements and extensions that shaped modern diffusion models. This includes Conditional Diffusion, DDIM, Classifier Guidance, and Classifier-Free Guidance.

Part 1: Intuition and the Big Picture

1. Generating Samples: A Dice Analogy

Before we get into the mechanics of diffusion models, let’s take a step back and talk about what generative models actually try to do.
At their core, generative models have one main job: to generate new samples that look like they came from some real distribution of data.

Let’s start with something simple: rolling a fair six-sided die.
If you’ve ever played a board game, you already know what this means: each face (1 through 6) has an equal chance of showing up.
Mathematically, we can write this as:

$$P(x) = \frac{1}{6} \quad \text{for } x \in \{1, 2, 3, 4, 5, 6\}$$

So if you wanted to generate samples from this distribution (in other words, simulate dice rolls), that would be easy, right?
You could just write a small program that picks a number between 1 and 6 at random, each with equal probability.
Run it a hundred times, and you’ll get a list of numbers that looks just like the outcomes of rolling a real fair die.
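
To make that concrete, here is a minimal Python sketch of such a program. The function name and roll count are just illustrative choices:

```python
import random

def roll_fair_die(num_rolls=100):
    """Simulate rolls of a fair six-sided die: each face has probability 1/6."""
    return [random.randint(1, 6) for _ in range(num_rolls)]

print(roll_fair_die(10))  # prints ten simulated rolls
```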

But now imagine someone hands you a loaded die.
This one doesn’t behave the same way: some numbers come up more often than others, and you have no idea how it’s rigged.
You can’t just assume each face has the same probability anymore. So what do you do?
To generate samples from this die, you first need to understand how it behaves. So you start rolling it, over and over again, and record what you see. After enough rolls, you start to notice a pattern: maybe 6 comes up more often than 1. Based on these observations, you estimate the underlying probability distribution P(x). Then, using that estimate, you can simulate new rolls of your loaded die without ever touching the physical one again.
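
Here is a small Python sketch of that estimate-then-sample loop. The helper names and the observed rolls are made up purely for illustration:

```python
import random
from collections import Counter

def estimate_distribution(observed_rolls):
    """Estimate P(x) for each face from observed rolls (empirical frequencies)."""
    counts = Counter(observed_rolls)
    total = len(observed_rolls)
    return {face: counts.get(face, 0) / total for face in range(1, 7)}

def sample_loaded_die(p, num_rolls=10):
    """Generate new rolls from the estimated distribution p."""
    faces = list(p.keys())
    weights = list(p.values())
    return random.choices(faces, weights=weights, k=num_rolls)

# Pretend these outcomes came from rolling the physical loaded die.
observed = [6, 6, 3, 6, 2, 5, 6, 1, 6, 4, 6, 6, 3, 6, 5]
p_hat = estimate_distribution(observed)
print(p_hat)                     # estimated probability of each face
print(sample_loaded_die(p_hat))  # new simulated rolls, no physical die needed
```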

This, in essence, is what generative models do.

Given observed samples x from a distribution of interest, the goal of a generative model is to learn the true underlying data distribution p(x).
Once that distribution is learned, we can generate new samples that follow the same statistical patterns as the real data.

The key idea is to learn the hidden distribution behind real-world data, whether that data represents dice rolls, images of cats, or snippets of human speech.

Image generated with Google Nano Banana.

2. Discriminative vs Generative Models

In machine learning, models are often classified as either discriminative or generative, depending on what they aim to learn. This distinction comes from the probabilistic formulations used to build and train these models.

  • Discriminative Models:
    Discriminative models learn to predict a label y given an input data point x.
    In other words, they learn the conditional probability distribution p(y|x).
    The goal is to map data points to their correct labels. For example, a discriminative model trained to recognize digits in images learns how likely each digit is, given the image pixels.
  • Generative Models:
    Generative models try to learn a probability distribution over the data points themselves, without any external labels. They aim to learn p(x).
    Our loaded die analogy is an example: we observed outcomes x and tried to estimate the underlying probability distribution p(x) so we could generate new samples.
  • Conditional Generative Models:
    Conditional generative models are still generative models. The difference is that they learn to generate data conditioned on additional information such as class labels, text prompts, or other context. They try to learn the probability distribution of the data x conditioned on the labels y, denoted p(x|y). Here, y acts as a guiding signal: for example, generating an image of a cat when y = “cat” or a dog when y = “dog”.

Discriminative vs Generative models [1].

3. Generative Models

The goal of generative models is to learn the probability density function of our data p(x). This probability density describes the behavior of our training data and allows us to generate new data by sampling from it. Ideally, we want our model to learn a density that matches the true data distribution.

There are two broad classes of generative models:

  1. Explicit Density Models
    These models can compute the density function p(x) explicitly.
    After training, if we feed a data point x into the model, it can return its likelihood under the learned distribution.
    Explicit models can be:
    – Tractable:
    These models define a density that is computationally tractable, meaning we can directly calculate the likelihood for any given data point x. Examples include Autoregressive Models and Normalizing Flows.
    – Approximate: 
    These models still define an explicit density but parts of it are intractable to compute or optimize directly.
    They rely on approximation techniques to make training feasible. A common example is the Variational Autoencoder (VAE), which uses latent variables and optimizes a lower bound on the likelihood instead of the exact value.
  2. Implicit Density Models
    Implicit density models do not compute p(x) directly. Instead, they are able to generate realistic samples from the data distribution without calculating the exact probability of each sample. The most common example is the Generative Adversarial Network (GAN), which learns to transform random noise into realistic data points through an adversarial training process.

Taxonomy of Generative Models [1].

4. Diffusion Models

Diffusion models are a class of generative models that have set the state of the art for generating diverse, high-resolution images. They solve a task similar to other generative model types: they attempt to approximate the probability distribution q(x) of a given domain and, most importantly, provide a way to sample from that distribution, x ∼ q(x).
The basic idea behind diffusion models is rather simple. They take an input image x0 and gradually add Gaussian noise to it through a series of T steps. We will call this the forward process. Afterward, a neural network is trained to recover the original data by reversing the noising process. By being able to model the reverse process, we can generate new data. This is the so-called reverse diffusion process or, more generally, the sampling process of a generative model.
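
To give the forward process a concrete shape, here is a minimal PyTorch sketch using the standard DDPM closed form for q(x_t | x_0) with a linear noise schedule. The schedule values and helper names are illustrative; Part 2 derives where this formula comes from:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # linear noise schedule (illustrative values)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative products of (1 - beta)

def forward_diffuse(x0, t):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(a_bar_t) * x0, (1 - a_bar_t) * I)."""
    noise = torch.randn_like(x0)
    a_bar = alpha_bars[t]
    return torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * noise, noise

x0 = torch.randn(1, 3, 64, 64)             # stand-in for a normalized image
xt, noise = forward_diffuse(x0, t=500)     # heavily noised version of x0
```

The reverse (denoising) direction is the part a neural network has to learn, and it is what we will implement from scratch in Part 3.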

Overview of different types of generative models [2].

The figure above shows a high-level comparison between the architectures of GANs, VAEs, and Diffusion Models.
If you’re already familiar with GANs or VAEs, this might help you form an initial intuition for how diffusion models operate. If not, don’t worry. In the next articles, we’ll explore how diffusion models actually work in detail, step by step.

5. Where We Go from Here

In this first part, we built the groundwork for understanding diffusion models: what generative models do, how they differ from discriminative ones, and where diffusion models fit among VAEs and GANs.

Next, we’ll dig into the actual mechanics: how diffusion models connect to latent variable models, what the ELBO is doing under the hood, and why the “add noise, then denoise” idea works mathematically. From there, we’ll move on to building a simple diffusion model from scratch and explore the improvements that make modern variants so effective.

If you want to prepare, refresh some basic probability concepts and make sure you’re comfortable with the basics of Python and PyTorch.