mirror of https://github.com/vladmandic/automatic
SD Pipeline How it Works
vladmandic edited this page 2026-02-11 11:15:50 +01:00
Stable Diffusion Pipeline
This is probably the best end-to-end semi-technical article:
https://stable-diffusion-art.com/how-stable-diffusion-work/
And a detailed look at the diffusion process: https://towardsdatascience.com/understanding-diffusion-probabilistic-models-dpms-1940329d6048
But this is a short look at the pipeline:

1. Encoder / Conditioning
   Maps text (via a tokenizer) or an image (via a vision model) to a semantic map
   (e.g. the CLIP text encoder)
2. Sampler
   Generates the noise that is the starting point for mapping to content
   (e.g. k_lms)
3. Diffuser
   Creates vector content based on the resolved noise + semantic map
   (e.g. the actual Stable Diffusion checkpoint)
4. Autoencoder
   Maps between latent and pixel space (actually creates images from vectors)
   (e.g. typically an image-database-trained GAN)
5. Denoising
   Gets meaningful images from pixel signatures;
   basically, blends what the autoencoder inserted using information from the diffuser
   (e.g. U-Net)
6. Loop and repeat from step 3, with cross-attention to blend results
7. Run additional models as needed
8. Upscale (e.g. ESRGAN)
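The steps above can be sketched in code. This is a toy, dependency-free sketch that only illustrates the data flow (conditioning → starting noise → denoising loop → decode); every function here is a stand-in assumption, not a real model call or any actual SD.Next / diffusers API:

```python
import math
import random

def encode_prompt(prompt: str) -> list:
    """Stand-in for the text encoder (CLIP in real SD): prompt -> embedding."""
    rng = random.Random(sum(map(ord, prompt)))  # deterministic toy 'semantics'
    return [rng.gauss(0.0, 1.0) for _ in range(8)]

def initial_noise(seed: int, size: int = 8) -> list:
    """Stand-in for the sampler's starting latent noise (k_lms etc. in real SD)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]

def predict_noise(latent, embedding):
    """Stand-in for the diffuser (the checkpoint's U-Net): predicts noise from
    the current latent plus the conditioning embedding."""
    return [0.5 * l + 0.1 * e for l, e in zip(latent, embedding)]

def scheduler_step(latent, noise_pred, t, steps):
    """Stand-in scheduler step: subtract a decreasing fraction of predicted noise."""
    scale = (steps - t) / steps
    return [l - scale * n for l, n in zip(latent, noise_pred)]

def decode(latent):
    """Stand-in autoencoder decode: latent -> 'pixels' (values squashed to (-1, 1))."""
    return [math.tanh(l) for l in latent]

def generate(prompt: str, seed: int = 42, steps: int = 20):
    embedding = encode_prompt(prompt)    # step 1: conditioning
    latent = initial_noise(seed)         # step 2: sampler noise
    for t in range(steps):               # step 6: loop back to step 3
        noise_pred = predict_noise(latent, embedding)      # step 3: diffuser
        latent = scheduler_step(latent, noise_pred, t, steps)
    return decode(latent)                # step 4: autoencoder decode

print(len(generate("a photo of a cat")))  # 8 toy 'pixels'
```

In a real pipeline the latent is a tensor, the noise prediction comes from a U-Net conditioned via cross-attention, and the scheduler math depends on the chosen sampler; the loop structure, however, is exactly this.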