Diffusion Models, Explained the Way I Wish They Were Part 1: Intuition and the Big Picture

When I first set out to understand diffusion models, I kept running into the same frustrating pattern: short articles saying “they add noise to an image and then learn to denoise it,” with no real explanation of why that works, why it’s done step by step instead of once, or what the math is actually saying. On the other side of the spectrum, research papers dove into dense equations with shifting notations and an assumption that you already spoke the language fluently. There was nothing in between, nothing for someone who wanted to go beyond just downloading a pretrained model from Hugging Face, but wasn’t yet ready to reinvent the field.
After a year and a half of working with diffusion models, breaking them apart, implementing them, reading the papers in circles until they clicked, I want to write the series I wish I’d had back then. This will be a guided journey from the theory that makes diffusion models tick, through real implementations you can run and tinker with, and finally into the improvements that shaped the field, things like DDIM, classifier guidance, and beyond.
I’ll keep each article short enough to read in a sitting, but together they’ll form a complete roadmap: not just what diffusion models do, but why they work, and how to build on them yourself.
Roadmap:
- Diffusion Models: Intuition and the Big Picture
In this part, we’ll start with an intuition for what it means to generate data. We’ll also look at the difference between discriminative and generative models, explore the main families of generative approaches, and conclude with a first look at diffusion models.
- Why It Works: The Math Behind the Magic
In the second article, we’ll go deeper into the theory that explains why diffusion models work. We’ll start with latent variable models, the ELBO, and variational autoencoders, and build on that step by step until we reach diffusion models themselves.
- Building a Diffusion Model from Scratch
Once the theory is clear, we’ll implement a basic diffusion model from scratch. We’ll train it on a small car dataset and observe how it gradually learns to reconstruct structure from pure noise. This part will focus on practical understanding.
- Beyond the Basics: Modern Improvements
Finally, we’ll cover the main improvements and extensions that shaped modern diffusion models. This includes Conditional Diffusion, DDIM, Classifier Guidance, and Classifier-Free Guidance.
Part 1: Intuition and the Big Picture
1. Generating Samples: A Dice Analogy
Before we get into the mechanics of diffusion models, let’s take a step back and talk about what generative models actually try to do.
At their core, generative models have one main job: to generate new samples that look like they came from some real distribution of data.
Let’s start with something simple: rolling a fair six-sided dice.
If you’ve ever played a board game, you already know what this means: each face (1 through 6) has an equal chance of showing up.
Mathematically, we can write this as: p(x = k) = 1/6 for each face k in {1, 2, 3, 4, 5, 6}.
So if you wanted to generate samples from this distribution (in other words, simulate dice rolls), that would be easy, right?
You could just write a small program that picks a number between 1 and 6 at random, each with equal probability.
Run it a hundred times, and you’ll get a list of numbers that look just like the outcomes of rolling a real fair dice.
But now imagine someone hands you a loaded dice.
This one doesn’t behave the same way: some numbers come up more often than others, and you have no idea how it’s rigged.
You can’t just assume each face has the same probability anymore. So what do you do?
To generate samples from this dice, you first need to understand how it behaves. So you start rolling it, over and over again, and record what you see. After enough rolls, you start to observe a pattern. Maybe 6 comes up more often than 1. Now, based on these observations, you try to estimate the underlying probability distribution p(x). Then, using that estimated probability, you can simulate new rolls of your loaded dice without ever touching the physical one again.
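To make this concrete, here is a tiny Python sketch of the same idea. The hidden weights below are invented purely for illustration; in a real setting we would only ever see the rolls, never the weights.

```python
import random
from collections import Counter

# A hypothetical loaded dice: these weights stand in for the unknown mechanism.
faces = [1, 2, 3, 4, 5, 6]
hidden_weights = [0.05, 0.10, 0.15, 0.15, 0.20, 0.35]

# Step 1: roll the dice many times and record what we see.
rolls = random.choices(faces, weights=hidden_weights, k=10_000)

# Step 2: estimate p(x) from the observations (empirical frequencies).
counts = Counter(rolls)
p_hat = {face: counts[face] / len(rolls) for face in faces}
print(p_hat)  # should be close to the hidden weights

# Step 3: sample new "rolls" from the estimated distribution,
# without ever touching the physical dice again.
new_rolls = random.choices(faces, weights=[p_hat[f] for f in faces], k=10)
print(new_rolls)
```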
This, in essence, is what generative models do.
Given observed samples x from a distribution of interest, the goal of a generative model is to learn the true underlying data distribution p(x).
Once that distribution is learned, we can generate new samples that follow the same statistical patterns as the real data.
The key idea is to learn the hidden distribution behind real-world data, whether that data represents dice rolls, images of cats, or snippets of human speech.

2. Discriminative vs Generative Models
In machine learning, models are often classified as either discriminative or generative, depending on what they aim to learn. This distinction comes from the probabilistic formulations used to build and train these models.
- Discriminative Models:
Discriminative models learn to predict a label y given an input data point x.
In other words, they learn the conditional probability distribution p(y|x).
The goal is to map data points to their correct labels. For example, a discriminative model trained to recognize digits in images learns how likely each digit is, given the image pixels.
- Generative Models:
Generative models try to learn a probability distribution over the data points without external labels. They aim to learn p(x). Our loaded dice analogy is an example: we observed outcomes x and tried to estimate the underlying probability distribution p(x) to generate new samples.
- Conditional Generative Models:
Conditional generative models are still generative models. The difference is that they learn to generate data conditioned on additional information such as class labels, text prompts, or other context. They try to learn the probability distribution of the data x conditioned on the labels y. This is denoted as p(x|y). Here, y acts as a guiding signal, for example, generating an image of a “cat” when y = cat or a “dog” when y = dog.

3. Generative Models
The goal of generative models is to learn the probability density function of our data p(x). This probability density describes the behavior of our training data and allows us to generate new data by sampling from it. Ideally, we want our model to learn a density that matches the true data distribution.
There are two broad classes of generative models:
- Explicit Density Models
These models can compute the density function p(x) explicitly.
After training, if we feed a data point x into the model, it can return its likelihood under the learned distribution.
Explicit models can be:
– Tractable: These models define a density that is computationally tractable, meaning we can directly calculate the likelihood for any given data point x. Examples include Autoregressive Models and Normalizing Flows.
– Approximate: These models still define an explicit density, but parts of it are intractable to compute or optimize directly. They rely on approximation techniques to make training feasible. A common example is the Variational Autoencoder (VAE), which uses latent variables and optimizes a lower bound on the likelihood instead of the exact value.
- Implicit Density Models
Implicit density models do not compute p(x) directly. Instead, they are able to generate realistic samples from the data distribution without calculating the exact probability of each sample. The most common example is the Generative Adversarial Network (GAN), which learns to transform random noise into realistic data points through an adversarial training process.

4. Diffusion Models
Diffusion models are a class of state-of-the-art generative models that generate diverse, high-resolution images. They solve a task similar to other generative model types: they attempt to approximate some probability distribution q(x) of a given domain and, most importantly, provide a way to sample from that distribution, x ∼ q(x).
The basic idea behind diffusion models is rather simple. They take the input image x0 and gradually add Gaussian noise to it through a series of T steps. We will call this the forward process. Afterward, a neural network is trained to recover the original data by reversing the noising process. By being able to model the reverse process, we can generate new data. This is the so-called reverse diffusion process or, in general, the sampling process of a generative model.
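As a first taste, here is a minimal sketch of the forward (noising) process only, assuming a simple linear schedule for the per-step noise variance. This is not a full diffusion model; the choice of schedule and the learned reverse process are exactly what the next parts of the series will cover.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)  # assumed linear noise schedule

def forward_step(x_prev, t):
    """One forward step: produce x_t from x_{t-1} by adding a bit of Gaussian noise."""
    noise = torch.randn_like(x_prev)
    return torch.sqrt(1.0 - betas[t]) * x_prev + torch.sqrt(betas[t]) * noise

x = torch.rand(3, 64, 64)  # stand-in for an input image x0
for t in range(T):
    x = forward_step(x, t)  # after T steps, x is close to pure Gaussian noise
```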

The figure above shows a high-level comparison between the architectures of GANs, VAEs, and Diffusion Models.
If you’re already familiar with GANs or VAEs, this might help you form an initial intuition for how diffusion models operate. If not, don’t worry. In the next articles, we’ll explore how diffusion models actually work in detail, step by step.
5. Where We Go from Here
In this first part, we built the groundwork for understanding diffusion models: what generative models do, how they differ from discriminative ones, and where diffusion models fit among VAEs and GANs.
Next, we’ll dig into the actual mechanics: how diffusion models connect to latent variable models, what the ELBO is doing under the hood, and why the “add noise, then denoise” idea works mathematically. From there, we’ll move on to building a simple diffusion model from scratch and explore the improvements that make modern variants so effective.
If you want to prepare, refresh some basic probability concepts and make sure you’re comfortable with the basics of Python and PyTorch.
A Refined Training Recipe for Fine-Grained Visual Classification

For the past year, my research at Multitel has focused on fine-grained visual classification (FGVC). Specifically, I worked on building a robust car classifier that can work in real-time on edge devices. This post is part of what may become a small series of reflections on this experience. I’m writing to share some of the lessons I learned but also to organize and compound what I’ve learned. At the same time, I hope this gives a sense of the kind of high-level engineering and applied research we do at Multitel, work that blends academic rigor with real-world constraints. Whether you’re a fellow researcher, a curious engineer, or someone considering joining our team, I hope this post offers both insight and inspiration.
1. The problem:
We needed a system that could identify specific car models, not just “this is a BMW,” but which BMW model and year. And it needed to run in real time on resource-constrained edge devices alongside other models. This kind of task falls under what’s known as fine-grained visual classification (FGVC).

FGVC aims to recognize images belonging to multiple subordinate categories of a super-category (e.g. species of animals or plants, models of cars, etc.). The difficulty lies in understanding fine-grained visual differences that sufficiently discriminate between objects that are highly similar in overall appearance but differ in fine-grained features [2].

What makes FGVC particularly tricky?
- Small inter-class variation: The visual differences between classes can be extremely subtle.
- Large intra-class variation: At the same time, instances within the same class may vary greatly due to changes in lighting, pose, background, or other environmental factors. The subtle visual differences that matter can easily be overwhelmed by these factors, such as pose and viewpoint.
- Long-tailed distributions: Datasets typically have a few classes with many samples and many classes with very few examples. For example, you might have only a couple of images of a rare spider species found in a remote region, while common species have thousands of images. This imbalance makes it difficult for models to learn equally well across all categories.

2. The landscape:
When we first started tackling this problem, we naturally turned to literature. We dove into academic papers, examined benchmark datasets, and explored state-of-the-art FGVC methods. And at first, the problem seemed far more complicated than it actually turned out to be, at least in our specific context.
FGVC has been actively researched for years, and there’s no shortage of approaches that introduce increasingly complex architectures and pipelines. Many early works, for example, proposed two-stage models: a localization subnetwork would first identify discriminative object parts, and then a second network would classify based on those parts. Others focused on custom loss functions, high-order feature interactions, or label dependency modeling using hierarchical structures.
All of these methods were designed to tackle the subtle visual distinctions that make FGVC so challenging. If you’re curious about the evolution of these approaches, Wei et al. [2] provide a solid survey that covers many of them in depth.

When we looked closer at recent benchmark results (archived from Papers with Code), many of the top-performing solutions were based on transformer architectures. These models often reached state-of-the-art accuracy, but with little to no discussion of inference time or deployment constraints. Given our requirements, we were fairly certain that these models wouldn’t hold up in real-time on an edge device already running multiple models in parallel.
At the time of this work, the best reported result on Stanford Cars was 97.1% accuracy, achieved by CMAL-Net.
3. Our approach:
Instead of starting with the most complex or specialized solutions, we took the opposite approach: Could a model that we already knew would meet our real-time and deployment constraints perform well enough on the task? Specifically, we asked whether a solid general-purpose architecture could get us close to the performance of more recent, heavier models, if trained properly.
That line of thinking led us to a paper by Ross Wightman et al., “ResNet Strikes Back: An Improved Training Procedure in Timm.” In it, Wightman makes a compelling argument: most new architectures are trained using the latest advancements and techniques but then compared against older baselines trained with outdated recipes. Wightman argues that ResNet-50, which is frequently used as a benchmark, is often not given the benefit of these modern improvements. His paper proposes a refined training procedure and shows that, when trained properly, even a vanilla ResNet-50 can achieve surprisingly strong results, including on several FGVC benchmarks.
With these constraints and goals in mind, we set out to build our own strong, reusable training procedure, one that could deliver high performance on FGVC tasks without relying on architecture-specific tricks. The idea was simple: start with a known, efficient backbone like ResNet-50 and focus entirely on improving the training pipeline rather than modifying the model itself. That way, the same recipe could later be applied to other architectures with minimal adjustments.
We began collecting ideas, techniques, and training refinements from across several sources, compounding best practices into a single, cohesive pipeline. In particular, we drew from four key resources:
- Bag of Tricks for Image Classification with Convolutional Neural Networks (He et al.)
- Compounding the Performance Improvements of Assembled Techniques in a Convolutional Neural Network (Lee et al.)
- ResNet Strikes Back: An Improved Training Procedure in Timm (Wightman et al.)
- How to Train State-of-the-Art Models Using TorchVision’s Latest Primitives (Vryniotis)
Our goal was to create a robust training pipeline that didn’t rely on model-specific tweaks. That meant focusing on techniques that are broadly applicable across architectures.
To test and validate our training pipeline, we used the Stanford Cars dataset [9], a widely used fine-grained classification benchmark that closely aligns with our real-world use case. The dataset contains 196 car categories and 16,185 images, with classes that often differ only in subtle details. The data is nearly evenly split between 8,144 training images and 8,041 testing images. To simulate our deployment scenario, where the classification model operates downstream of an object detection system, we crop each image to its annotated bounding box before training and evaluation.
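As a rough illustration of that preprocessing step, here is a small sketch of cropping an image to its bounding box. The annotation entry shown is hypothetical; the dataset’s native annotation format differs.

```python
from PIL import Image

def crop_to_bbox(image_path, bbox):
    """Crop an image to its annotated bounding box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = bbox
    return Image.open(image_path).convert("RGB").crop((x1, y1, x2, y2))

# Hypothetical usage:
# example = {"path": "cars_train/00001.jpg", "bbox": (39, 116, 569, 375), "label": 13}
# cropped = crop_to_bbox(example["path"], example["bbox"])
```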
While the original hosting site for the dataset is no longer available, it remains accessible via curated repositories such as Kaggle, and Huggingface. The dataset is distributed under the BSD-3-Clause license, which permits both commercial and non-commercial use. In this work, it was used solely in a research context to produce the results presented here.

Building the Recipe
What follows is the distilled training recipe we arrived at, built through experimentation, iteration, and careful aggregation of ideas from the works mentioned above. The idea is to show that by simply applying modern training best practices, without any architecture-specific hacks, we could get a general-purpose model like ResNet-50 to perform competitively on a fine-grained benchmark.
We’ll start with a vanilla ResNet-50 trained using a basic setup and progressively introduce improvements, one step at a time.
With each technique, we’ll report:
- The individual performance gain
- The cumulative gain when added to the pipeline
While many of the techniques used are likely familiar, our intent is to highlight how powerful they can be when compounded intentionally. Benchmarks often obscure this by comparing new architectures trained with the latest advancements to old baselines trained with outdated recipes. Here, we want to flip that and show what’s possible with a carefully tuned recipe applied to a widely available, efficient backbone.
We also recognize that many of these techniques interact with each other. So, in practice, we tuned some combinations through greedy or grid search to account for synergies and interdependencies.
The Base Recipe:
Before diving into optimizations, we start with a clean, simple baseline.
We train a ResNet-50 model pretrained on ImageNet using the Stanford Cars dataset. Each model is trained for 600 epochs on a single RTX 4090 GPU, with early stopping based on validation accuracy using a patience of 200 epochs.
We use:
- Nesterov Accelerated Gradient (NAG) for optimization
- Learning rate: 0.01
- Batch size: 32
- Momentum: 0.9
- Loss function: Cross-entropy
All training and validation images are cropped to their bounding boxes and resized to 224×224 pixels. We start with the same standard augmentation policy as in [5].
Here’s a summary of the base training configuration and its performance:
Model | Pretrain | Optimizer | Learning rate | Momentum | Batch size | Loss function | Image size | Epochs | Patience | Augmentation | Accuracy |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet50 | ImageNet | NAG | 0.01 | 0.9 | 32 | Crossentropy Loss | 224x224 | 600 | 200 | Standard | 88.22% |
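For readers who prefer code, here is a minimal PyTorch sketch of this baseline configuration. The dataloader (bounding-box crops resized to 224×224 with the standard augmentation) and the early-stopping bookkeeping are assumed and not shown.

```python
import torch
import torchvision
from torchvision.models import ResNet50_Weights

model = torchvision.models.resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
model.fc = torch.nn.Linear(model.fc.in_features, 196)  # 196 Stanford Cars classes

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
criterion = torch.nn.CrossEntropyLoss()

def train_one_epoch(model, loader, optimizer, criterion):
    model.train()
    for images, labels in loader:  # 224x224 bounding-box crops
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```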
We fix the random seed across runs to ensure reproducibility and reduce variance between experiments. To assess the true effect of a change in the recipe, we follow best practices and average results over multiple runs (typically 3 to 5).
We’ll now build on top of this baseline step-by-step, introducing one technique at a time and tracking its impact on accuracy. The goal is to isolate what each component contributes and how they compound when applied together.
Large batch training:
In mini-batch SGD, the gradient estimate is stochastic because the examples in each batch are selected at random. Increasing the batch size does not change the expectation of the stochastic gradient but reduces its variance. Using a large batch size, however, may slow down training progress: for the same number of epochs, training with a large batch size results in a model with degraded validation accuracy compared to one trained with smaller batch sizes.
He et al. [5] argue that linearly scaling the learning rate with the batch size works empirically for ResNet-50 training.
To improve both the accuracy and the speed of our training, we change the batch size to 128 and the learning rate to 0.1. We also add a StepLR scheduler that decays the learning rate of each parameter group by 0.1 every 30 epochs.
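In code, this change amounts to the optimizer and scheduler definitions below; `model` is the baseline ResNet-50 from before, and the dataloader is assumed to be rebuilt with a batch size of 128.

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# scheduler.step() is called once per epoch, after train_one_epoch(...)
```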
Learning rate warmup:
Since all parameters typically start from random values at the beginning of training, using too large a learning rate may result in numerical instability.
With the warmup heuristic, we use a small learning rate at the start and switch back to the initial learning rate once the training process is stable. We use a gradual warmup strategy that increases the learning rate from 0 to the initial value linearly.
We add a linear warmup strategy for 5 epochs.
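One way to implement this, assuming the `optimizer` defined above, is to chain a linear warmup (starting at 1% of the base learning rate, matching the warmup decay of 0.01 in the table below) with the StepLR decay:

```python
import torch

warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=5)
step_decay = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, step_decay], milestones=[5]
)
```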

Model | Pretrain | Optimizer | Learning rate | Momentum | Batch size | Loss function | Image size | Epochs | Patience | |
---|---|---|---|---|---|---|---|---|---|---|
ResNet50 | ImageNet | NAG | 0.1 | 0.9 | 128 | Crossentropy Loss | 224x224 | 600 | 200 | |
Augmentation | Scheduler | Scheduler step size | Scheduler Gamma | Warmup Method | Warmup epochs | Warmup decay | Accuracy | Incremental Improvement | Absolute Improvement | |
Standard | StepLR | 30 | 0.1 | Linear | 5 | 0.01 | 89.21 | +0.99 | +0.99 |
Trivial Augment:
To explore the impact of stronger data augmentation, we replaced the baseline augmentation with TrivialAugment. TrivialAugment works as follows: it takes an image x and a set of augmentations A as input, samples an augmentation from A uniformly at random, applies it to the given image x with a strength m sampled uniformly at random from the set of possible strengths {0, ..., 30}, and returns the augmented image.
What makes TrivialAugment especially attractive is that it’s completely parameter-free: it requires no search or tuning, making it a simple yet effective drop-in replacement that reduces experimental complexity.
While it may seem counterintuitive that such a generic and randomized strategy would outperform augmentations specifically tailored to the dataset or more sophisticated automated augmentation methods, we tried a variety of alternatives, and TrivialAugment consistently delivered strong results across runs. Its simplicity, stability, and surprisingly high effectiveness make it a compelling default choice.
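In torchvision, this is essentially a one-line change to the training transforms. The ordering of the remaining transforms below and the ImageNet normalization constants are our assumption of a typical setup, not a prescription:

```python
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.TrivialAugmentWide(),   # parameter-free augmentation policy
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```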

Model | Pretrain | Optimizer | Learning rate | Momentum | Batch size | Loss function | Image size | Epochs | Patience |
---|---|---|---|---|---|---|---|---|---|
ResNet50 | ImageNet | NAG | 0.1 | 0.9 | 128 | Crossentropy Loss | 224x224 | 600 | 200 |
Scheduler | Scheduler step size | Scheduler Gamma | Warmup Method | Warmup epochs | Warmup decay | Augmentation | Accuracy | Incremental Improvement | Absolute Improvement |
StepLR | 30 | 0.1 | Linear | 5 | 0.01 | TrivialAugment | 92.66 | +3.45 | +4.44 |
Cosine Learning Rate Decay:
Next, we explored modifying the learning rate schedule. We switched to a cosine annealing strategy, which decreases the learning rate from the initial value to 0 by following the cosine function. A big advantage of cosine annealing is that it has no hyper-parameters to optimize, which again cuts down our search space.
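Keeping the warmup from before (and the `optimizer` defined earlier), the schedule change looks roughly like this:

```python
import torch

warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=600 - 5)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[5]
)
```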

Model | Pretrain | Optimizer | Learning rate | Momentum | Batch size | Loss function | Image size | Epochs | Patience |
---|---|---|---|---|---|---|---|---|---|
ResNet50 | ImageNet | NAG | 0.1 | 0.9 | 128 | Crossentropy Loss | 224x224 | 600 | 200 |
Scheduler | Scheduler step size | Scheduler Gamma | Warmup Method | Warmup epochs | Warmup decay | Augmentation | Accuracy | Incremental Improvement | Absolute Improvement |
Cosine | - | - | Linear | 5 | 0.01 | TrivialAugment | 93.22 | +0.56 | +5 |
Label Smoothing:
A good technique to reduce overfitting is to stop the model from becoming overconfident. This can be achieved by softening the ground truth using Label Smoothing. The idea is to change the construction of the true label so that the correct class no longer receives all of the probability mass: q_i = 1 - ε if i is the true label, and q_i = ε / (K - 1) otherwise, where K is the number of classes.
A single parameter ε controls the degree of smoothing (the higher, the stronger). We used a smoothing factor of ε = 0.1, which is the standard value proposed in the original paper and widely adopted in the literature.
Interestingly, we found empirically that adding label smoothing reduced gradient variance during training. This allowed us to safely increase the learning rate without destabilizing training. As a result, we increased the initial learning rate from 0.1 to 0.4.
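In PyTorch (1.10 and later), label smoothing is built into the cross-entropy loss, so the change amounts to the two lines below; `model` is the same ResNet-50 as before:

```python
import torch

criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.4, momentum=0.9, nesterov=True)
```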
Model | Pretrain | Optimizer | Learning rate | Momentum | Batch size | Loss function | Image size |
---|---|---|---|---|---|---|---|
ResNet50 | ImageNet | NAG | 0.4 | 0.9 | 128 | Crossentropy Loss | 224x224 |
Epochs | Patience | Scheduler | Scheduler step size | Scheduler Gamma | Warmup Method | Warmup epochs | Warmup decay |
600 | 200 | Cosine | - | - | Linear | 5 | 0.01 |
Augmentation | Label Smoothing | Accuracy | Incremental Improvement | Absolute Improvement | |||
TrivialAugment | 0.1 | 94.5 | +1.28 | +6.28 |
Random Erasing:
As an additional form of regularization, we introduced Random Erasing into the training pipeline. This technique randomly selects a rectangular region within an image and replaces its pixels with random values, with a fixed probability.
Often paired with Automatic Augmentation methods, it usually yields additional improvements in accuracy due to its regularization effect. We added Random Erasing with a probability of 0.1.
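Since Random Erasing operates on tensors, it slots in after ToTensor and Normalize in the transform pipeline sketched earlier:

```python
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.TrivialAugmentWide(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.1),   # erase a random rectangle in 10% of images
])
```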

Model | Pretrain | Optimizer | Learning rate | Momentum | Batch size | Loss function | Image size |
---|---|---|---|---|---|---|---|
ResNet50 | ImageNet | NAG | 0.4 | 0.9 | 128 | Crossentropy Loss | 224x224 |
Epochs | Patience | Scheduler | Scheduler step size | Scheduler Gamma | Warmup Method | Warmup epochs | Warmup decay |
600 | 200 | Cosine | - | - | Linear | 5 | 0.01 |
Augmentation | Label Smoothing | Random Erasing | Accuracy | Incremental Improvement | Absolute Improvement | ||
TrivialAugment | 0.1 | 0.1 | 94.93 | +0.43 | +6.71 |
Exponential Moving Average (EMA):
Training a neural network with mini-batches introduces noise into the gradient estimates used to update the model parameters. Exponential moving average (EMA) is used when training deep neural networks to improve their stability and generalization.
Instead of just using the raw weights that are directly learned during training, EMA maintains a running average of the model weights, which is updated at each training step using a weighted average of the current weights and the previous EMA values.
Specifically, at each training step, the EMA weights are updated using θ_EMA ← α · θ_EMA + (1 - α) · θ, where θ are the current model weights and α is a decay factor controlling how much weight is given to the past.
By evaluating the EMA weights rather than the raw ones at test time, we found improved consistency in performance across runs, especially in the later stages of training.
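A minimal sketch of the idea is below; this is a generic EMA helper, not our exact implementation. Following the table below, it would be updated every 32 optimizer steps with a decay of 0.994, and the EMA copy is the one evaluated at test time.

```python
import copy
import torch

class EMA:
    """Maintain an exponential moving average of a model's weights."""

    def __init__(self, model, decay=0.994):
        self.decay = decay
        self.ema_model = copy.deepcopy(model).eval()
        for p in self.ema_model.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        # theta_EMA <- alpha * theta_EMA + (1 - alpha) * theta
        for ema_p, p in zip(self.ema_model.parameters(), model.parameters()):
            ema_p.mul_(self.decay).add_(p, alpha=1.0 - self.decay)
        # Copy buffers (e.g. BatchNorm running stats) directly from the live model.
        for ema_b, b in zip(self.ema_model.buffers(), model.buffers()):
            ema_b.copy_(b)
```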
Model | Pretrain | Optimizer | Learning rate | Momentum | Batch size | Loss function | Image size |
---|---|---|---|---|---|---|---|
ResNet50 | ImageNet | NAG | 0.4 | 0.9 | 128 | Crossentropy Loss | 224x224 |
Epochs | Patience | Scheduler | Scheduler step size | Scheduler Gamma | Warmup Method | Warmup epochs | Warmup decay |
600 | 200 | Cosine | - | - | Linear | 5 | 0.01 |
Augmentation | Label Smoothing | Random Erasing | EMA Steps | EMA Decay | Accuracy | Incremental Improvement | Absolute Improvement |
TrivialAugment | 0.1 | 0.1 | 32 | 0.994 | 94.93 | 0 | +6.71 |
We tested EMA in isolation, and found that it led to notable improvements in both training stability and validation performance. But when we integrated EMA into the full recipe alongside other techniques, it did not provide further improvement. The results appeared to plateau, suggesting that most of the gains had already been captured by the other components.
Because our goal is to develop a general-purpose training recipe rather than one overly tailored to a single dataset, we chose to keep EMA in the final setup. Its benefits may be more pronounced in other conditions, and its low overhead makes it a safe inclusion.
Optimizations we tested but didn’t adopt:
We also explored a range of additional techniques that are commonly effective in other image classification tasks, but found that they either did not lead to significant improvements or, in some cases, slightly regressed performance on the Stanford Cars dataset:
- Weight Decay: Adds L2 regularization to discourage large weights during training. We experimented extensively with weight decay in our use case, but it consistently regressed performance.
- Cutmix/Mixup: Cutmix replaces random patches between images and mixes the corresponding labels. Mixup creates new training samples by linearly combining pairs of images and labels. We tried applying either CutMix or MixUp randomly with equal probability during training, but this approach regressed results.
- AutoAugment: Delivered strong results and competitive accuracy, but we found TrivialAugment to be better. More importantly, TrivialAugment is completely parameter-free, which cuts down our search space and simplifies tuning.
- Alternative Optimizers and Schedulers: We experimented with a wide range of optimizers and learning rate schedules. Nesterov Accelerated Gradient (NAG) consistently gave us the best performance among optimizers, and Cosine Annealing stood out as the best scheduler, delivering strong results with no additional hyperparameters to tune.
4. Conclusion:
The graph below summarizes the improvements as we progressively built up our training recipe:

Using just a standard ResNet-50, we were able to achieve strong performance on the Stanford Cars dataset, demonstrating that careful tuning of a few simple techniques can go a long way in fine-grained classification.
However, it’s important to keep this in perspective. These results mainly show that we can train a model to distinguish between fine-grained, well-represented classes in a clean, curated dataset. The Stanford Cars dataset is nearly class-balanced, with high-quality, mostly frontal images and no major occlusion or real-world noise. It does not address challenges like long-tailed distributions, domain shift, or recognition of unseen classes.
In practice, you’ll never have a dataset that covers every car model—especially one that’s updated daily as new models appear. Real-world systems need to handle distributional shifts, open-set recognition, and imperfect inputs.
So while this served as a strong baseline and proof of concept, there was still significant work to be done to build something robust and production-ready.
References:
[1] Krause, Deng, et al. Collecting a Large-Scale Dataset of Fine-Grained Cars.
[2] Wei, et al. Fine-Grained Image Analysis with Deep Learning: A Survey.
[3] Reslan, Farou. Automatic Fine-grained Classification of Bird Species Using Deep Learning.
[4] Zhao, et al. A survey on deep learning-based fine-grained object classification and semantic segmentation.
[5] He, et al. Bag of Tricks for Image Classification with Convolutional Neural Networks.
[6] Lee, et al. Compounding the Performance Improvements of Assembled Techniques in a Convolutional Neural Network.
[7] Wightman, et al. ResNet Strikes Back: An Improved Training Procedure in Timm.
[8] Vryniotis. How to Train State-of-the-Art Models Using TorchVision’s Latest Primitives.
[9] Krause, et al. 3D Object Representations for Fine-Grained Categorization.
[10] Müller, Hutter. TrivialAugment: Tuning-free Yet State-of-the-Art Data Augmentation.
[11] Zhong, et al. Random Erasing Data Augmentation.