DDPM: The Paper That Made Diffusion Models Actually Work

Denoising Diffusion Probabilistic Models trains a network to undo gradual Gaussian noise step by step, hitting FID 3.17 on CIFAR-10 — and laying the groundwork that Stable Diffusion and DALL-E 2 later built on.

Quick answer

Denoising Diffusion Probabilistic Models (DDPM), by Ho, Jain, and Abbeel at UC Berkeley, showed that you can generate high-quality images by training one neural network to reverse a slow Gaussian-noising process — and that this beats most GANs on sample quality. On unconditional CIFAR-10 it reaches an Inception score of 9.46 and a then state-of-the-art FID of 3.17, and on 256x256 LSUN it matches ProgressiveGAN. The diffusion idea existed before 2020; this is the paper that made it train well and look good.

The forward and reverse diffusion process

The setup has two directions. The forward process takes a real image and adds a small amount of Gaussian noise at each of many steps (the paper uses T = 1000), so that by the last step the image is indistinguishable from pure noise. This direction is fixed: there is nothing to learn, just a noise schedule.
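A minimal sketch of the forward direction, assuming the linear variance schedule reported in the paper (beta rising from 1e-4 to 0.02 over T = 1000 steps). The function name forward_step and the random stand-in "image" are illustrative and are not taken from the authors' code.

```python
import numpy as np

T = 1000
# Linear variance schedule; the endpoints are the values reported in the paper.
betas = np.linspace(1e-4, 0.02, T)

def forward_step(x_prev, t, rng):
    """One noising step q(x_t | x_{t-1}); t is a 0-based index into betas."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - betas[t]) * x_prev + np.sqrt(betas[t]) * noise

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(3, 32, 32))  # stand-in for an image scaled to [-1, 1]
for t in range(T):
    x = forward_step(x, t, rng)

# Factor by which the original image has been scaled down after all T steps.
signal_scale = np.sqrt(np.prod(1.0 - betas))
```

With this schedule signal_scale comes out well below 0.01, so what remains after the last step is essentially unit-variance Gaussian noise.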

The reverse process is where the model lives. A network is trained to undo one noising step at a time: given a noisy image at step t, predict how to step back toward step t-1. Chain those predictions from pure noise all the way back, and you get a fresh sample. Crucially, because each forward step is a small Gaussian, each reverse step can also be modeled as a Gaussian, which is what makes the whole thing tractable.
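The chained reverse steps can be sketched roughly as follows. This assumes the noise-prediction form of the reverse step and the fixed variance sigma_t^2 = beta_t, which is one of the two choices the paper tries. eps_model is a hypothetical placeholder that returns zeros, standing in for the trained network, so the output here is not a meaningful image.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear schedule, as reported in the paper
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)      # running product of (1 - beta_s)

def eps_model(x, t):
    """Hypothetical placeholder for the trained noise-prediction network."""
    return np.zeros_like(x)

def p_sample_loop(shape, rng, model=eps_model):
    """Run the reverse chain from pure noise down to step 0."""
    x = rng.standard_normal(shape)   # x_T: pure Gaussian noise
    for t in reversed(range(T)):
        eps = model(x, t)
        # Mean of the Gaussian reverse step, written in terms of the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alphas_bar[t]) * eps) / np.sqrt(alphas[t])
        # Fresh noise on every step except the last one.
        z = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * z
    return x

sample = p_sample_loop((3, 32, 32), np.random.default_rng(0))
```

The loop calls the model once per step, so a single sample costs T network evaluations.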

A neat property: thanks to the Gaussian math, you can jump to any noise level t in closed form without simulating every intermediate step. That is what makes training affordable — you sample a random t, corrupt the image directly to that level, and train on it.
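As a sketch of that shortcut: with alpha_bar_t defined as the running product of (1 - beta_s) up to step t, a noisy image at level t is sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise. The name q_sample and the random stand-in image below are illustrative assumptions. The schedule endpoints are the ones reported in the paper.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # linear schedule, as reported in the paper
alphas_bar = np.cumprod(1.0 - betas)   # running product of (1 - beta_s)

def q_sample(x0, t, noise):
    """Corrupt x0 directly to noise level t (0-based), with no intermediate steps."""
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 32, 32))   # stand-in for an image in [-1, 1]
t = int(rng.integers(0, T))                     # random noise level
x_t = q_sample(x0, t, rng.standard_normal(x0.shape))
```

One call replaces up to 1000 sequential noising steps, which is why a training batch can mix arbitrary noise levels cheaply.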

The simplified objective that made it click

The principled way to train this is a variational bound on log-likelihood, full of weighting terms and KL divergences. DDPM’s key practical move was to rewrite the reverse step so the network predicts the noise that was added rather than the denoised image, and then to throw away the fancy per-step weights and just minimize a plain mean-squared error between the true noise and the predicted noise.

This L_simple objective is almost embarrassingly simple (predict the noise, take the MSE), and yet the paper reports that it is easier to implement and produces better samples than the full weighted bound. The authors also draw a connection that the paper highlights as one of its contributions: this objective is equivalent to denoising score matching across multiple noise scales, and sampling resembles annealed Langevin dynamics. Diffusion and score-based models are two views of the same thing.
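The noise-prediction loss described above can be sketched in a few lines. This is a hedged illustration, not the authors' training code: eps_model is a hypothetical placeholder that always predicts zero noise, the batch is random data, and no gradient step is taken.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # linear schedule, as reported in the paper
alphas_bar = np.cumprod(1.0 - betas)

def eps_model(x_t, t):
    """Hypothetical placeholder for the noise-prediction network."""
    return np.zeros_like(x_t)

def l_simple(x0, rng, model=eps_model):
    """One Monte Carlo estimate of the simplified loss for a batch x0 of shape [B, C, H, W]."""
    batch = x0.shape[0]
    t = rng.integers(0, T, size=batch)                  # uniform random step per example
    noise = rng.standard_normal(x0.shape)               # the true noise
    a = alphas_bar[t].reshape(batch, 1, 1, 1)
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise    # closed-form corruption
    pred = model(x_t, t)
    return float(np.mean((noise - pred) ** 2))          # plain, unweighted MSE

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(8, 3, 32, 32))
loss = l_simple(x0, rng)
```

With the zero placeholder the loss sits near 1, the variance of the noise. A trained network would be expected to push it lower.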

Key results

  • CIFAR-10 (unconditional): Inception score 9.46 and FID 3.17 — the FID was state-of-the-art at publication, better than the strong GANs of the day.
  • LSUN 256x256: sample quality comparable to ProgressiveGAN on bedrooms and churches.
  • Architecture: a U-Net backbone with self-attention and a shared time-step embedding; no GAN discriminator, no adversarial training, no mode-collapse headaches.
  • Progressive decompression: because generation goes coarse-to-fine, the model naturally admits a lossy decompression view that the paper frames as a generalization of autoregressive decoding.
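For the time-step embedding mentioned in the architecture bullet, the paper points to Transformer-style sinusoidal position embeddings. A minimal sketch under that assumption follows. The function name, the embedding width, and the exact frequency spacing are illustrative choices and may differ from the official implementation.

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal embedding of integer steps t (shape [batch]) into [batch, dim].

    Assumes dim is even; the first half holds sines, the second half cosines.
    """
    half = dim // 2
    # Geometrically spaced frequencies from 1 down toward 1/10000.
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=1)

emb = timestep_embedding(np.array([0, 1, 999]), dim=128)
```

Because the same embedding function and the same network weights serve every step, one model can handle all T noise levels.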

Limits and open questions

The honest weakness is speed. Sampling requires running the network sequentially for all T steps (1000 forward passes for one image in the original setup), which is orders of magnitude slower than a GAN’s single pass. The paper also reports that DDPM’s strong sample quality does not come with competitive log-likelihoods: on density estimation it trails the best likelihood-based models, so it is a great sampler, not a great density estimator. And the results here are unconditional: there are no class labels, no text conditioning, no classifier-free guidance, none of the machinery that later turned diffusion into a text-to-image engine. Those came in follow-up work (DDIM for fast sampling, latent diffusion for scale, guidance for control).

FAQ

What is a Denoising Diffusion Probabilistic Model?

A DDPM is a generative model that creates images by reversing a gradual noising process: it learns to remove a little Gaussian noise at each of many steps, starting from pure noise and ending at a clean sample. The 2020 paper by Ho, Jain, and Abbeel is the canonical reference.

Why is the DDPM paper important?

It is the paper that made diffusion models practical for high-quality image synthesis, reaching FID 3.17 on unconditional CIFAR-10 and beating most GANs of the time on sample quality. The text-to-image systems that followed (Stable Diffusion, DALL-E 2, Imagen) all build on the diffusion training recipe DDPM established.

How does DDPM differ from a GAN?

DDPM has no discriminator and no adversarial game. It trains one network with a simple mean-squared-error noise-prediction loss, which sidesteps GAN instability and mode collapse — at the cost of slow, multi-step sampling rather than a GAN’s single forward pass.

What is the simplified training objective in DDPM?

Instead of optimizing the full variational lower bound, DDPM trains the network to predict the noise added to an image and minimizes the mean squared error against the true noise (the L_simple loss). Dropping the per-step weights made training simpler and the samples better.

One line: predict the noise, reverse it step by step, and you get a generator that finally beat GANs — the seed of the modern image-generation era. Read the original paper on arXiv.