add wan asymmetric vae upscaler

Signed-off-by: Vladimir Mandic <mandic00@live.com>
pull/4322/head
Vladimir Mandic 2025-10-28 13:55:46 -04:00
parent df4571588b
commit bc775f0530
11 changed files with 145 additions and 35 deletions


@@ -1,15 +1,6 @@
# To use:
#
# pre-commit run -a
#
# Or:
#
# pre-commit install # (runs every time you commit in git)
#
# To update this file:
#
# pre-commit autoupdate
#
# To use: pre-commit run -a
# Or: pre-commit install # (runs every time you commit in git)
# To update this file: pre-commit autoupdate
# See https://github.com/pre-commit/pre-commit
ci:
@@ -19,7 +10,7 @@ ci:
repos:
  # Standard hooks
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    rev: v6.0.0
    hooks:
      - id: check-added-large-files
      - id: check-case-conflict
@@ -35,6 +26,7 @@ repos:
      - id: check-json
      - id: check-toml
      - id: check-xml
      - id: debug-statements
      - id: end-of-file-fixer
      - id: mixed-line-ending
      - id: check-executables-have-shebangs


@@ -5,9 +5,11 @@
### Highlights for 2025-10-28
- Reorganization of **Reference Models** into *Base, Quantized, Distilled and Community* sections for easier navigation
- New models: **HunyuanImage 2.1** capable of generating 2K images natively, **Pony 7** based on AuraFlow architecture and **Kandinsky 5** 10s video models
- New models: **HunyuanImage 2.1** capable of generating 2K images natively, **Pony 7** based on AuraFlow architecture,
  **Kandinsky 5** 10s video models, **Krea Realtime** autoregressive variant of WAN-2.1
- New **offline mode** to use previously downloaded models without internet connection
- New SOTA model loader using **Run:ai streamer**
- Optimizations to **WAN-2.2** given its popularity plus addition of native **VAE Upscaler** and optimized **pre-quantized** variants
- Updates to `rocm` and `xpu` backends
- Fixes, fixes, fixes... too many to list here!
@@ -29,11 +31,14 @@
  second series of models in the *Kandinsky5* family is a T2V model optimized for 10s videos and uses the Qwen2.5 text encoder
- [Pony 7](https://huggingface.co/purplesmartai/pony-v7-base)
  Pony 7 steps in a different direction from previous Pony models and is based on the AuraFlow architecture and UMT5 encoder
- **Models Auxiliary**
  - add **Qwen 3-VL** VLM for interrogate and prompt enhance, thanks @CalamitousFelicitousness
- **Models Auxiliary**
  - [Qwen 3-VL](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) VLM for interrogate and prompt enhance, thanks @CalamitousFelicitousness
    this includes *2B, 4B and 8B* variants
  - add **Apple DepthPro** controlnet processor, thanks @nolbert82
  - add **LibreFlux** segmentation controlnet for FLUX.1
  - [WAN Asymmetric Upscale](https://huggingface.co/spacepxl/Wan2.1-VAE-upscale2x)
    available as a general-purpose upscaler that can be used in the standard workflow or the process tab
    available as a VAE for compatible video models: *WAN-2.x-14B, SkyReels-v2*
  - [Apple DepthPro](https://huggingface.co/apple/DepthPro) controlnet processor, thanks @nolbert82
  - [LibreFlux controlnet](https://huggingface.co/neuralvfx/LibreFlux-ControlNet) segmentation controlnet for FLUX.1
- **Features**
  - **offline mode**: enable in *settings -> huggingface*
    enables fully offline mode where previously downloaded models can be used as-is
@@ -75,6 +80,7 @@
- fix `wan-2.2-14b-vace` single-stage execution
- fix `wan-2.2-5b` tiled vae decode
- fix `controlnet` loading with quantization
- video: use pre-quantized text-encoder if the selected model is pre-quantized
- handle sparse `controlnet` models
- catch `xet` warnings
- validate pipelines on import

TODO.md

@@ -4,15 +4,17 @@ Main ToDo list can be found at [GitHub projects](https://github.com/users/vladma
## Future Candidates
- [Kanvas](https://github.com/vladmandic/kanvas)
- Transformers unified cache handler
- Remote TE
- Core: New inpaint/outpaint interface
  [Kanvas](https://github.com/vladmandic/kanvas)
- Core: Create executable for SD.Next
- Feature: Transformers unified cache handler
- Remote Text-Encoder support
- Refactor: [Modular pipelines and guiders](https://github.com/huggingface/diffusers/issues/11915)
- Refactor: Sampler options
- Refactor: move sampler options from settings to config
- Refactor: [GGUF](https://huggingface.co/docs/diffusers/main/en/quantization/gguf)
- Feature: LoRA add OMI format support for SD35/FLUX.1
- Video Core: API
- Video LTX: TeaCache and others, API, conditioning preprocess
- Video: LTX API
- Video tab: add full API support
- Control tab: add overrides handling
### Under Consideration
@@ -26,13 +28,19 @@ Main ToDo list can be found at [GitHub projects](https://github.com/users/vladma
- [Dream0 guidance](https://huggingface.co/ByteDance/DreamO)
- [ByteDance OneReward](https://github.com/bytedance/OneReward)
- [ByteDance USO](https://github.com/bytedance/USO)
- [Video Inpaint Pipeline](https://github.com/huggingface/diffusers/pull/12506)
- Remove: `CodeFormer`
- Remove: `GFPGAN`
- ModernUI: Lite vs Expert mode
- Engine: TensorRT acceleration
### New models
### New models / Pipelines
- [Krea Realtime Video](https://huggingface.co/krea/krea-realtime-video)
- [Wan-2.2 Animate](https://github.com/huggingface/diffusers/pull/12526)
- [Wan-2.2 S2V](https://github.com/huggingface/diffusers/pull/12258)
- [LongCat-Video](https://huggingface.co/meituan-longcat/LongCat-Video)
- [MUG-V 10B](https://huggingface.co/MUG-V/MUG-V-inference)
- [Chroma1 Radiance](https://huggingface.co/lodestones/Chroma1-Radiance)
- [Ovi](https://github.com/character-ai/Ovi)
- [Bytedance Lynx](https://github.com/bytedance/lynx)


@@ -692,10 +692,7 @@ class DCSolverMultistepScheduler(SchedulerMixin, ConfigMixin):
        rhos_c = torch.linalg.solve(R, b)
        if self.predict_x0:
            try:
                x_t_ = sigma_t / sigma_s0 * x - alpha_t * h_phi_1 * m0
            except Exception as e:
                import pdb; pdb.set_trace()
            x_t_ = sigma_t / sigma_s0 * x - alpha_t * h_phi_1 * m0
            if D1s is not None:
                corr_res = torch.einsum("k,bkc...->bc...", rhos_c[:-1], D1s)
            else:


@@ -7,6 +7,14 @@ from modules import shared, sd_models, devices, timer, errors
debug = shared.log.trace if os.environ.get('SD_VIDEO_DEBUG', None) is not None else lambda *args, **kwargs: None


def hijack_vae_upscale(*args, **kwargs):
    import torch.nn.functional as F
    tensor = shared.sd_model.vae.orig_decode(*args, **kwargs)[0]
    tensor = F.pixel_shuffle(tensor.movedim(2, 1), upscale_factor=2).movedim(1, 2)  # vae returns 16-channel latents; pixel-shuffle them into 4-channel images at 2x resolution
    tensor = tensor.unsqueeze(0)  # add batch dimension
    return tensor


def hijack_vae_decode(*args, **kwargs):
    jobid = shared.state.begin('VAE Decode')
    t0 = time.time()
@@ -16,7 +24,10 @@ def hijack_vae_decode(*args, **kwargs):
    sd_models.move_model(shared.sd_model.vae, devices.device)
    if torch.is_tensor(args[0]):
        latents = args[0].to(device=devices.device, dtype=shared.sd_model.vae.dtype)  # upcast to vae dtype
        res = shared.sd_model.vae.orig_decode(latents, *args[1:], **kwargs)
        if hasattr(shared.sd_model.vae, '_asymmetric_upscale_vae'):
            res = hijack_vae_upscale(latents, *args[1:], **kwargs)
        else:
            res = shared.sd_model.vae.orig_decode(latents, *args[1:], **kwargs)
        t1 = time.time()
        shared.log.debug(f'Decode: vae={shared.sd_model.vae.__class__.__name__} slicing={getattr(shared.sd_model.vae, "use_slicing", None)} tiling={getattr(shared.sd_model.vae, "use_tiling", None)} latents={list(latents.shape)}:{latents.device} dtype={latents.dtype} time={t1-t0:.3f}')
    else:
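The pixel-shuffle step in `hijack_vae_upscale` trades channels for spatial resolution. A minimal shape-only sketch (the latent dimensions here are chosen for illustration, not taken from the actual VAE):

```python
import torch
import torch.nn.functional as F

# illustrative decoded latent in [B, C=16, T, H, W] layout
lat = torch.randn(1, 16, 4, 32, 32)
# pixel_shuffle treats the last three dims as [C, H, W], so move the
# time dim out of position 2 first, shuffle, then move it back
out = F.pixel_shuffle(lat.movedim(2, 1), upscale_factor=2).movedim(1, 2)
# 16 channels become 16 / 2**2 = 4, while H and W double
print(out.shape)
```

With `upscale_factor=2`, the channel count drops by a factor of 4 and each spatial dimension doubles, which is how a 16-channel output maps onto a 2x-larger 4-channel image.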


@@ -617,7 +617,10 @@ class SDNQQuantizer(DiffusersQuantizer, HfQuantizer):
    def _process_model_after_weight_loading(self, model, **kwargs):  # pylint: disable=unused-argument
        if shared.opts.diffusers_offload_mode != "none":
            model = model.to(devices.cpu)
            try:
                model = model.to(device=devices.cpu)
            except Exception:
                model = model.to_empty(device=devices.cpu)
        devices.torch_gc(force=True, reason="sdnq")
        return model
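The try/except fallback guards against modules that still hold meta tensors (shapes without storage), where a plain `.to()` raises; `to_empty()` allocates fresh uninitialized storage instead. A minimal reproduction, independent of SDNQ:

```python
import torch

# a module built under the meta device has shapes but no data,
# so copying it to CPU with .to() raises
with torch.device("meta"):
    m = torch.nn.Linear(4, 4)
try:
    m = m.to(device="cpu")
except Exception:
    # to_empty() allocates uninitialized CPU storage of the right shapes
    m = m.to_empty(device="cpu")
print(m.weight.device)
```

The weights after `to_empty()` are garbage until loaded, which is acceptable here since the model's weights are re-materialized by the loader.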


@@ -112,16 +112,17 @@ class UpscalerAsymmetricVAE(Upscaler):
        import torchvision.transforms.functional as F
        import diffusers
        from modules import shared, devices
        if self.vae is None or selected_model != self.selected:
        if self.vae is None or (selected_model != self.selected):
            if 'v1' in selected_model:
                repo_id = 'Heasterian/AsymmetricAutoencoderKLUpscaler'
            else:
                repo_id = 'Heasterian/AsymmetricAutoencoderKLUpscaler_v2'
            self.vae = diffusers.AsymmetricAutoencoderKL.from_pretrained(repo_id, cache_dir=shared.opts.hfcache_dir)
            shared.log.debug(f'Upscaler load: vae="{repo_id}"')
            self.vae.requires_grad_(False)
            self.vae = self.vae.to(device=devices.device, dtype=devices.dtype)
            self.vae.eval()
            self.selected = selected_model
            shared.log.debug(f'Upscaler load: selected="{self.selected}" vae="{repo_id}"')
        img = img.resize((8 * (img.width // 8), 8 * (img.height // 8)), resample=Image.Resampling.LANCZOS).convert('RGB')
        tensor = (F.pil_to_tensor(img).unsqueeze(0) / 255.0).to(device=devices.device, dtype=devices.dtype)
        self.vae = self.vae.to(device=devices.device)
@@ -131,6 +132,54 @@ class UpscalerAsymmetricVAE(Upscaler):
        return upscaled


class UpscalerWanUpscale(Upscaler):
    def __init__(self, dirname=None):  # pylint: disable=unused-argument
        super().__init__(False)
        self.name = "WAN Upscale"
        self.vae_encode = None
        self.vae_decode = None
        self.selected = None
        self.scalers = [
            UpscalerData("WAN Asymmetric Upscale", None, self),
        ]

    def do_upscale(self, img: Image, selected_model=None):
        if selected_model is None:
            return img
        import torchvision.transforms.functional as F
        import torch.nn.functional as FN
        import diffusers
        from modules import shared, devices
        if (self.vae_encode is None) or (self.vae_decode is None) or (selected_model != self.selected):
            repo_encode = 'Qwen/Qwen-Image-Edit-2509'
            subfolder_encode = 'vae'
            self.vae_encode = diffusers.AutoencoderKLWan.from_pretrained(repo_encode, subfolder=subfolder_encode, cache_dir=shared.opts.hfcache_dir)
            self.vae_encode.requires_grad_(False)
            self.vae_encode = self.vae_encode.to(device=devices.device, dtype=devices.dtype)
            self.vae_encode.eval()
            repo_decode = 'spacepxl/Wan2.1-VAE-upscale2x'
            subfolder_decode = "diffusers/Wan2.1_VAE_upscale2x_imageonly_real_v1"
            self.vae_decode = diffusers.AutoencoderKLWan.from_pretrained(repo_decode, subfolder=subfolder_decode, cache_dir=shared.opts.hfcache_dir)
            self.vae_decode.requires_grad_(False)
            self.vae_decode = self.vae_decode.to(device=devices.device, dtype=devices.dtype)
            self.vae_decode.eval()
            self.selected = selected_model
            shared.log.debug(f'Upscaler load: selected="{self.selected}" encode="{repo_encode}" decode="{repo_decode}"')
        self.vae_encode = self.vae_encode.to(device=devices.device)
        tensor = (F.pil_to_tensor(img).unsqueeze(0).unsqueeze(2) / 255.0).to(device=devices.device, dtype=devices.dtype)
        tensor = self.vae_encode.encode(tensor).latent_dist.mode()
        self.vae_encode.to(device=devices.cpu)
        self.vae_decode = self.vae_decode.to(device=devices.device)
        tensor = self.vae_decode.decode(tensor).sample
        tensor = FN.pixel_shuffle(tensor.movedim(2, 1), upscale_factor=2).movedim(1, 2)  # pixel shuffle needs [..., C, H, W] format
        self.vae_decode.to(device=devices.cpu)
        upscaled = F.to_pil_image(tensor.squeeze().clamp(0.0, 1.0).float().cpu())
        return upscaled


class UpscalerDCC(Upscaler):
    def __init__(self, dirname=None):  # pylint: disable=unused-argument
        super().__init__(False)


@@ -281,6 +281,17 @@ try:
                  te_cls=getattr(transformers, 'UMT5EncoderModel', None),
                  dit_cls=getattr(diffusers, 'SkyReelsV2Transformer3DModel', None)),
        ],
        """
        'Krea': [
            Model(name='Krea Realtime WAN-2.1 14B T2V',
                  url='https://huggingface.co/krea/krea-realtime-video',
                  repo='krea/krea-realtime-video',
                  repo_cls=getattr(diffusers, 'WanPipeline', None),
                  te='Wan-AI/Wan2.1-T2V-14B-Diffusers',
                  te_cls=getattr(transformers, 'UMT5EncoderModel', None),
                  dit_cls=getattr(diffusers, 'WanTransformer3DModel', None)),
        ],
        """
        'Mochi Video': [
            Model(name='None'),
            Model(name='Mochi 1 T2V',


@@ -43,7 +43,10 @@ def load_model(selected: models_def.Model):
        selected.te_folder = ''
        selected.te_revision = None
    if selected.te_cls.__name__ == 'UMT5EncoderModel' and shared.opts.te_shared_t5:
        selected.te = 'Wan-AI/Wan2.2-TI2V-5B-Diffusers'
        if 'SDNQ' in selected.name:
            selected.te = 'Disty0/Wan2.2-T2V-A14B-SDNQ-uint4-svd-r32'
        else:
            selected.te = 'Wan-AI/Wan2.2-TI2V-5B-Diffusers'
        selected.te_folder = 'text_encoder'
        selected.te_revision = None
    if selected.te_cls.__name__ == 'LlamaModel' and shared.opts.te_shared_t5:
@@ -154,3 +157,28 @@ def load_model(selected: models_def.Model):
    shared.log.debug(f'Video hijacks: decode={decode} text={text} image={image} slicing={slicing} tiling={tiling} framewise={framewise}')
    shared.state.end(jobid)
    return msg


def load_upscale_vae():
    if not hasattr(shared.sd_model, 'vae'):
        return
    if hasattr(shared.sd_model.vae, '_asymmetric_upscale_vae'):
        return  # already loaded
    cls = shared.sd_model.vae.__class__.__name__
    if cls != 'AutoencoderKLWan':
        shared.log.warning('Video decode: upscale VAE unsupported')
        return
    import diffusers
    repo_id = 'spacepxl/Wan2.1-VAE-upscale2x'
    subfolder = "diffusers/Wan2.1_VAE_upscale2x_imageonly_real_v1"
    vae_decode = diffusers.AutoencoderKLWan.from_pretrained(repo_id, subfolder=subfolder, cache_dir=shared.opts.hfcache_dir)
    vae_decode.requires_grad_(False)
    vae_decode = vae_decode.to(device=devices.device, dtype=devices.dtype)
    vae_decode.eval()
    shared.log.debug(f'Decode: load={repo_id}')
    shared.sd_model.orig_vae = shared.sd_model.vae
    shared.sd_model.vae = vae_decode
    shared.sd_model.vae._asymmetric_upscale_vae = True  # pylint: disable=protected-access
    sd_hijack_vae.init_hijack(shared.sd_model)
    sd_models.apply_balanced_offload(shared.sd_model, force=True)  # reapply offload


@@ -113,6 +113,11 @@ def generate(*args, **kwargs):
    video_overrides.set_overrides(p, selected)
    debug(f'Video: task_args={p.task_args}')
    if p.vae_type == 'Upscale':
        video_load.load_upscale_vae()
    elif hasattr(shared.sd_model, 'orig_vae'):
        shared.sd_model.vae = shared.sd_model.orig_vae
    # run processing
    shared.state.disable_preview = True
    shared.log.debug(f'Video: cls={shared.sd_model.__class__.__name__} width={p.width} height={p.height} frames={p.frames} steps={p.steps}')


@@ -141,7 +141,7 @@ def create_ui(prompt, negative, styles, overrides, init_image, init_strength, la
                guidance_true = gr.Slider(label='True guidance', minimum=-1.0, maximum=14.0, step=0.1, value=-1.0, elem_id="video_guidance_true")
        with gr.Accordion(open=False, label="Decode", elem_id='video_decode_accordion'):
            with gr.Row():
                vae_type = gr.Dropdown(label='VAE decode', choices=['Default', 'Tiny', 'Remote'], value='Default', elem_id="video_vae_type")
                vae_type = gr.Dropdown(label='VAE decode', choices=['Default', 'Tiny', 'Remote', 'Upscale'], value='Default', elem_id="video_vae_type")
                vae_tile_frames = gr.Slider(label='Tile frames', minimum=1, maximum=64, step=1, value=16, elem_id="video_vae_tile_frames")
    vlm_enhance, vlm_model, vlm_system_prompt = ui_video_vlm.create_ui(prompt_element=prompt, image_element=init_image)