Diffusion Models
Generating images and video by learning to reverse a noise-adding process.

The Engine Behind AI Image Generation
Diffusion models power DALL-E, Stable Diffusion, Midjourney, and Sora. The core idea is elegantly simple: teach a model to remove noise, then start from pure noise and let it "denoise" its way to a coherent image.
It's like a sculptor chipping away at marble — starting from chaos and gradually revealing structure.
Forward process: progressively add Gaussian noise to real images over ~1000 timesteps until they become pure noise.
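To make the forward process concrete, here is a minimal PyTorch sketch of the closed-form noising step together with the usual training objective (predict the noise that was added). The linear beta schedule and the tiny stand-in "model" are illustrative assumptions, not any particular system's exact settings.

```python
import torch

T = 1000                                   # number of diffusion timesteps
betas = torch.linspace(1e-4, 0.02, T)      # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # abar_t = product of alpha_s for s <= t

def noise_image(x0, t):
    """Jump straight to timestep t: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = torch.randn_like(x0)
    abar = alpha_bars[t].view(-1, 1, 1, 1)
    xt = abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps
    return xt, eps

# Training step: the network learns to predict the noise that was added.
model = torch.nn.Conv2d(3, 3, 3, padding=1)      # stand-in for a real U-Net
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

x0 = torch.rand(8, 3, 32, 32) * 2 - 1            # fake batch of images in [-1, 1]
t = torch.randint(0, T, (8,))                    # random timestep per image
xt, eps = noise_image(x0, t)
loss = torch.nn.functional.mse_loss(model(xt), eps)
loss.backward()
optimizer.step()
```

A real denoiser (typically a U-Net or a transformer) would also take the timestep t, and the text embedding, as additional inputs.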
Reverse process: learn to undo the noising. Starting from pure noise, the model predicts and removes a little noise at each step, guided by a text prompt.
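Sampling runs that prediction in a loop. Below is a minimal sketch of DDPM-style ancestral sampling, reusing the noise schedule and placeholder model from the sketch above; a production sampler would use far fewer steps and a better solver.

```python
import torch

@torch.no_grad()
def sample(model, shape=(1, 3, 32, 32)):
    x = torch.randn(shape)                        # t = T: pure Gaussian noise
    for t in reversed(range(T)):
        eps_hat = model(x)                        # predicted noise at step t
        abar = alpha_bars[t]
        # Estimate the mean of x_{t-1} from x_t and the predicted noise.
        x = (x - betas[t] / (1 - abar).sqrt() * eps_hat) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn(shape)   # re-inject a little noise
    return x                                      # approximate sample from the data distribution

image = sample(model)
```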
Text conditioning: a text encoder (like CLIP) converts your prompt into embeddings that steer the denoising process.
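In practice the prompt usually steers generation through classifier-free guidance: the denoiser is queried with and without the text embedding, and the two noise predictions are blended. Here is a small sketch; `text_model`, `prompt_emb`, and `null_emb` are hypothetical placeholders, and the guidance scale of 7.5 is just a commonly used default.

```python
import torch

def guided_noise_prediction(text_model, x, prompt_emb, null_emb, scale=7.5):
    eps_cond = text_model(x, prompt_emb)     # noise prediction given the prompt
    eps_uncond = text_model(x, null_emb)     # noise prediction with an empty prompt
    # Push the prediction toward the prompt-conditioned direction.
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Tiny demo with a fake denoiser that ignores its inputs:
fake_denoiser = lambda x, emb: torch.randn_like(x)
x = torch.randn(1, 3, 32, 32)
eps = guided_noise_prediction(fake_denoiser, x,
                              torch.randn(1, 77, 768),   # stand-in prompt embedding
                              torch.zeros(1, 77, 768))   # stand-in "empty prompt" embedding
```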
Step Through Denoising
Click "Denoise" to remove noise step by step and watch an image emerge from pure randomness:
Pure Noise (t=1000)
How Diffusion Works
Diffusion Models in the Wild
DALL-E: OpenAI's image generator. A diffusion model with improved text understanding from training directly on rich, descriptive image captions.
Stable Diffusion: an open-source diffusion model that runs in latent space (images are first compressed by a VAE encoder). It powers thousands of apps.
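The latent-space idea can be sketched as follows. All of the functions here are hypothetical placeholders standing in for the real VAE encoder/decoder and the sampling loop shown earlier; the shapes mirror the common 512x512 image to 64x64x4 latent setup.

```python
import torch

def vae_encode(image):                     # (1, 3, 512, 512) -> (1, 4, 64, 64)
    return torch.randn(image.shape[0], 4, 64, 64)        # placeholder encoder

def vae_decode(latent):                    # (1, 4, 64, 64) -> (1, 3, 512, 512)
    return torch.nn.functional.interpolate(latent[:, :3], scale_factor=8)  # placeholder decoder

def run_diffusion_sampler(latent, prompt_emb):
    return latent                          # placeholder for the sampling loop sketched earlier

prompt_emb = torch.randn(1, 77, 768)       # stand-in for CLIP text embeddings
z = torch.randn(1, 4, 64, 64)              # pure noise, but in the small latent space
z = run_diffusion_sampler(z, prompt_emb)   # all the denoising happens on the tiny latent
image = vae_decode(z)                      # only the final decode touches full-resolution pixels
```

Running the expensive denoising loop on the small latent, rather than on full-resolution pixels, is what makes this approach cheap enough for consumer GPUs.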
Sora: video generation. A VAE compresses the video frames into a latent space, where a diffusion transformer generates coherent video.
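As a rough, shape-only sketch of that pipeline (all sizes are made up for illustration and are not Sora's actual configuration), a compressed video latent can be cut into spacetime patches, each of which becomes one token for a diffusion transformer to denoise:

```python
import torch

# Latent volume for a 16-frame clip after a (not shown) VAE compresses each frame.
latent = torch.randn(1, 16, 4, 32, 32)    # (batch, frames, channels, height, width)

# Cut the volume into 2x4x4 (frames x height x width) spacetime patches.
patches = latent.unfold(1, 2, 2).unfold(3, 4, 4).unfold(4, 4, 4)
# dims are now: (batch, t_win, channels, h_win, w_win, t_patch, h_patch, w_patch)
patches = patches.permute(0, 1, 3, 4, 2, 5, 6, 7)        # group channels with each patch
tokens = patches.reshape(1, 8 * 8 * 8, 4 * 2 * 4 * 4)    # (batch, 512 tokens, 128 dims each)
print(tokens.shape)                                      # torch.Size([1, 512, 128])
```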
Test Your Understanding
Q1. What does a diffusion model learn to do during training?
Q2. During inference, what does the model start from?
Q3. How does a text prompt influence the generated image?
Q4. What is the relationship between VAEs and diffusion models in Stable Diffusion?