mirror of https://github.com/vladmandic/automatic
SD Pipeline How it Works
vladmandic edited this page 2026-02-11 11:15:50 +01:00
Stable Diffusion Pipeline
This is probably the best end-to-end semi-technical article:
https://stable-diffusion-art.com/how-stable-diffusion-work/
And a detailed look at the diffusion process: https://towardsdatascience.com/understanding-diffusion-probabilistic-models-dpms-1940329d6048
But this is a short look at the pipeline:

1. Encoder / Conditioning
   Maps text (via a tokenizer) or an image (via a vision model) to a semantic map
   (e.g. the CLIP text encoder)
2. Sampler
   Generates the noise that is the starting point for mapping to content
   (e.g. k_lms)
3. Diffuser
   Creates vector content based on the resolved noise + semantic map
   (e.g. the actual Stable Diffusion checkpoint)
4. Autoencoder
   Maps between latent and pixel space (actually creates images from vectors)
   (e.g. typically an image-database-trained GAN)
5. Denoising
   Gets meaningful images from pixel signatures;
   basically, blends what the autoencoder inserted using information from the diffuser
   (e.g. U-Net)
6. Loop and repeat from step 3, with cross-attention to blend results
7. Run additional models as needed
8. Upscale (e.g. ESRGAN)
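The steps above can be sketched in code. This is a toy, dependency-free sketch that only illustrates the data flow (conditioning → starting noise → denoising loop → decode); every function here is a stand-in assumption, not a real model call or any actual SD.Next / diffusers API:

```python
import math
import random

def encode_prompt(prompt: str) -> list:
    """Stand-in for the text encoder (CLIP in real SD): prompt -> embedding."""
    rng = random.Random(sum(map(ord, prompt)))  # deterministic toy 'semantics'
    return [rng.gauss(0.0, 1.0) for _ in range(8)]

def initial_noise(seed: int, size: int = 8) -> list:
    """Stand-in for the sampler's starting latent noise (k_lms etc. in real SD)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]

def predict_noise(latent, embedding):
    """Stand-in for the diffuser (the checkpoint's U-Net): predicts noise from
    the current latent plus the conditioning embedding."""
    return [0.5 * l + 0.1 * e for l, e in zip(latent, embedding)]

def scheduler_step(latent, noise_pred, t, steps):
    """Stand-in scheduler step: subtract a decreasing fraction of predicted noise."""
    scale = (steps - t) / steps
    return [l - scale * n for l, n in zip(latent, noise_pred)]

def decode(latent):
    """Stand-in autoencoder decode: latent -> 'pixels' (values squashed to (-1, 1))."""
    return [math.tanh(l) for l in latent]

def generate(prompt: str, seed: int = 42, steps: int = 20):
    embedding = encode_prompt(prompt)    # step 1: conditioning
    latent = initial_noise(seed)         # step 2: sampler noise
    for t in range(steps):               # step 6: loop back to step 3
        noise_pred = predict_noise(latent, embedding)      # step 3: diffuser
        latent = scheduler_step(latent, noise_pred, t, steps)
    return decode(latent)                # step 4: autoencoder decode

print(len(generate("a photo of a cat")))  # 8 toy 'pixels'
```

In a real pipeline the latent is a tensor, the noise prediction comes from a U-Net conditioned via cross-attention, and the scheduler math depends on the chosen sampler; the loop structure, however, is exactly this.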