Commit Graph

2157 Commits (cc6f9500a1b972e9dca14e769f4b70a8927ffa43)

Author SHA1 Message Date
Octopus cc6f9500a1
fix: use Parameter assignment for Stable_Zero123 cc_projection weights (fixes #13492) (#13518)
On Windows with aimdo enabled, disable_weight_init.Linear uses lazy
initialization that sets weight and bias to None to avoid unnecessary
memory allocation. This caused a crash when copy_() was called on the
None weight attribute in Stable_Zero123.__init__.

Replace copy_() with direct torch.nn.Parameter assignment, which works
correctly on both Windows (aimdo enabled) and other platforms.
2026-04-22 15:05:43 -07:00
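The fix above can be sketched as follows. `LazyLinear` is a hypothetical stand-in for `disable_weight_init.Linear` under aimdo-style lazy initialization (weight starts as `None`), not the actual ComfyUI class:

```python
import torch
import torch.nn as nn

class LazyLinear(nn.Module):
    """Hypothetical stand-in for a lazily initialized Linear layer:
    weight/bias start as None so no memory is allocated up front."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = None
        self.bias = None

pretrained = torch.randn(4, 8)
layer = LazyLinear(8, 4)

# The old approach crashes when the weight was never allocated:
#   layer.weight.copy_(pretrained)  # AttributeError: 'NoneType' has no copy_

# Direct Parameter assignment works whether or not the weight exists yet;
# nn.Module.__setattr__ registers it as a parameter either way:
layer.weight = nn.Parameter(pretrained, requires_grad=False)
```

Assigning an `nn.Parameter` goes through module attribute registration, so the layer still reports the weight via `named_parameters()` as a normally constructed layer would.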
Jukka Seppänen eb22225387
Support standalone LTXV audio VAEs (#13499) 2026-04-21 10:46:37 -07:00
comfyanonymous ad94d47221
Make the ltx audio vae more native. (#13486) 2026-04-21 11:02:42 -04:00
comfyanonymous 3d816db07f
Some optimizations to make Ernie inference a bit faster. (#13472) 2026-04-18 23:02:29 -04:00
Jukka Seppänen b9dedea57d
feat: SUPIR model support (CORE-17) (#13250) 2026-04-18 23:02:01 -04:00
Bedovyy b41ab53b6f
Use `ErnieTEModel_` not `ErnieTEModel`. (#13431) 2026-04-16 10:11:58 -04:00
Jun Yamog 1de83f91c3
Fix OOM regression in _apply() for quantized models during inference (#13372)
Skip unnecessary clone of inference-mode tensors when already inside
torch.inference_mode(), matching the existing guard in set_attr_param.
The unconditional clone introduced in 20561aa9 caused transient VRAM
doubling during model movement for FP8/quantized models.
2026-04-15 02:10:36 -07:00
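The guard described above can be sketched like this; `move_weight` is a hypothetical helper, not the actual `_apply()` code:

```python
import torch

def move_weight(tensor, device):
    """Hypothetical sketch of the guard: inside torch.inference_mode()
    the defensive clone is skipped, avoiding a transient second copy of
    the weight during model movement."""
    moved = tensor.to(device)
    if not torch.is_inference_mode_enabled():
        moved = moved.clone()  # only clone outside inference mode
    return moved

w = torch.ones(2, 2)
with torch.inference_mode():
    out = move_weight(w, "cpu")   # no clone; same storage when device matches
out2 = move_weight(w, "cpu")      # outside inference mode: clone happens
```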
comfyanonymous cb0bbde402
Fix ernie on devices that don't support fp64. (#13414) 2026-04-14 22:54:47 -04:00
comfyanonymous 722bc73319
Make text generation work with ministral model. (#13395)
Needs template before it works properly.
2026-04-13 20:43:57 -04:00
comfyanonymous 402ff1cdb7
Fix issue with ernie image. (#13393) 2026-04-13 16:38:42 -04:00
comfyanonymous c2657d5fb9
Fix typo. (#13382) 2026-04-12 23:37:13 -04:00
comfyanonymous 31283d2892
Implement Ernie Image model. (#13369) 2026-04-11 22:29:31 -04:00
comfyanonymous 55ebd287ee
Add a supports_fp64 function. (#13368) 2026-04-11 21:06:36 -04:00
Jukka Seppänen a134423890
SDPose: resize input always (#13349) 2026-04-10 11:26:55 -10:00
huemin b615af1c65
Add support for small flux.2 decoder (#13314) 2026-04-07 03:44:18 -04:00
comfyanonymous 40862c0776
Support Ace Step 1.5 XL model. (#13317) 2026-04-07 03:13:47 -04:00
comfyanonymous 0c63b4f6e3
Remove dead code. (#13251) 2026-04-01 20:22:06 -04:00
comfyanonymous e2ddf28d78
Fix some fp8 scaled checkpoints no longer working. (#13239) 2026-03-31 14:27:17 -07:00
rattus 8d723d2caa
Fix/tweak pinned memory accounting (#13221)
* mm: Lower windows pin threshold

Some workflows make more extraneous use of shared GPU memory than is
accounted for in the 5% pin headroom. Lower this for safety.

* mm: Remove pin count clearing threshold.

TOTAL_PINNED_MEMORY is shared between the legacy and aimdo pinning
systems; however, this catch-all assumes only the legacy system exists.
Remove the catch-all as the PINNED_MEMORY buffer is coherent already.
2026-03-29 16:43:24 -07:00
Jukka Seppänen a500f1edac
CORE-13 feat: Support RT-DETRv4 detection model (#12748) 2026-03-28 23:34:10 -04:00
comfyanonymous 3f77450ef1
Fix #13214 (#13216) 2026-03-28 22:35:59 -04:00
rattus b353a7c863
Integrate RAM cache with model RAM management (#13173) 2026-03-27 21:34:16 -04:00
comfyanonymous 3a56201da5
Allow flux conditioning without a pooled output. (#13198) 2026-03-27 20:36:26 -04:00
Jukka Seppänen b0fd65e884
fix: regression in text generate with LTXAV model (#13170) 2026-03-26 09:55:05 -07:00
comfyanonymous 2a1f402601
Make Qwen 8B work with TextGenerate node. (#13160) 2026-03-25 23:21:44 -04:00
Jukka Seppänen 404d7b9978
feat: Support Qwen3.5 text generation models (#12771) 2026-03-25 22:48:28 -04:00
Kohaku-Blueleaf 5ebb0c2e0b
FP8 bwd training (#13121) 2026-03-24 20:39:04 -04:00
Jukka Seppänen e87858e974
feat: LTX2: Support reference audio (ID-LoRA) (#13111) 2026-03-23 18:22:24 -04:00
Talmaj d49420b3c7
LongCat-Image edit (#13003) 2026-03-21 23:51:05 -04:00
rattus 25b6d1d629
wan: vae: Fix light/color change (#13101)
There was an issue where the resample split was too early and dropped one
of the rolling convolutions a frame early. This is most noticeable as a
lighting/color change between pixel frames 5->6 (latent 2->3), or as a
lighting change between the first and last frame in an FLF wan flow.
2026-03-21 18:44:35 -04:00
comfyanonymous 11c15d8832
Fix fp16 intermediates giving different results. (#13100) 2026-03-21 17:53:25 -04:00
comfyanonymous b5d32e6ad2
Fix sampling issue with fp16 intermediates. (#13099) 2026-03-21 17:47:42 -04:00
Jedrzej Kosinski 87cda1fc25
Move inline comfy.context_windows imports to top-level in model_base.py (#13083)
The recent PR that added resize_cond_for_context_window methods to
model classes used inline 'import comfy.context_windows' in each
method body. This moves that import to the top-level import section,
replacing 4 duplicate inline imports with a single top-level one.
2026-03-20 20:03:42 -04:00
drozbay 589228e671
Add slice_cond and per-model context window cond resizing (#12645)
* Add slice_cond and per-model context window cond resizing

* Fix cond_value.size() call in context window cond resizing

* Expose additional advanced inputs for ContextWindowsManualNode

Necessary for WanAnimate context windows workflow, which needs cond_retain_index_list = 0 to work properly with its reference input.

---------
2026-03-19 20:42:42 -07:00
rattus f49856af57
ltx: vae: Fix missing init variable (#13074)
Forgot to push this amendment. Previous test results apply to this.
2026-03-19 22:34:58 -04:00
rattus 82b868a45a
Fix VRAM leak in tiler fallback in video VAEs (#13073)
* sd: soft_empty_cache on tiler fallback

This doesn't cost a lot and creates the expected VRAM reduction in
resource monitors when you fallback to tiler.

* wan: vae: Don't recurse in local fns (move run_up)

Moved Decoder3d’s recursive run_up out of forward into a class
method to avoid nested closure self-reference cycles. This avoids
cyclic garbage that delays collection of tensors, which in turn delays
VRAM release before tiled fallback.

* ltx: vae: Don't recurse in local fns (move run_up)

Move the recursive run_up out of forward into a class
method to avoid nested closure self-reference cycles. This avoids
cyclic garbage that delays collection of tensors, which in turn delays
VRAM release before tiled fallback.
2026-03-19 22:30:27 -04:00
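The closure-cycle problem both `run_up` moves above address can be illustrated with a toy class (names hypothetical, not the actual VAE code): a recursive helper defined inside `forward` references itself through its own closure cell, forming a reference cycle, so anything the closure captures only dies when the cyclic GC runs; a plain method has no such cycle and frees intermediates by reference counting.

```python
class Decoder:
    """Toy illustration of the refactor; all names are hypothetical."""

    def forward_with_closure(self, x, depth):
        # run_up references itself via its closure cell; function and cell
        # form a reference cycle, so captured locals (e.g. large tensors)
        # are only freed when the cyclic GC eventually runs.
        def run_up(v, d):
            return v if d == 0 else run_up(v + 1, d - 1)
        return run_up(x, depth)

    def _run_up(self, v, d):
        # Class method: no self-referential closure cell, so intermediates
        # are freed promptly by plain reference counting.
        return v if d == 0 else self._run_up(v + 1, d - 1)

    def forward_with_method(self, x, depth):
        return self._run_up(x, depth)
```

Both variants compute the same result; only the lifetime of the captured state differs, which is what delays VRAM release before the tiled fallback.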
comfyanonymous 8458ae2686
Revert "fix: run text encoders on MPS GPU instead of CPU for Apple Silicon (#…" (#13070)
This reverts commit b941913f1d.
2026-03-19 15:27:55 -04:00
Jukka Seppänen fd0261d2bc
Reduce tiled decode peak memory (#13050) 2026-03-19 13:29:34 -04:00
rattus ab14541ef7
memory: Add more exclusion criteria to pinned read (#13067) 2026-03-19 10:03:20 -07:00
rattus fabed694a2
ltx: vae: implement chunked encoder + CPU IO chunking (Big VRAM reductions) (#13062)
* ltx: vae: add cache state to downsample block

* ltx: vae: Add time stride awareness to causal_conv_3d

* ltx: vae: Automate truncation for encoder

Other VAEs just truncate without error. Do the same.

* sd/ltx: Make chunked_io a flag in its own right

Taking this bi-directional, so make it a flag named for its purpose.

* ltx: vae: implement chunked encoder + CPU IO chunking

People are doing things with big frame counts in LTX including V2V
flows. Implement the time-chunked encoder to keep the VRAM down, with
the converse of the new CPU pre-allocation technique, where the chunks
are brought in from the CPU just-in-time.

* ltx: vae-encode: round chunk sizes more strictly

Only powers of 2 that are also multiples of 8 are valid due to cache slicing.
2026-03-19 09:58:47 -07:00
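The stricter rounding in the last bullet might look like this hypothetical helper (the real implementation is not shown here); valid chunk sizes are taken to be powers of two that are also multiples of 8:

```python
def round_chunk_size(requested):
    """Hypothetical sketch: round a requested chunk size down to the
    largest valid size, where valid sizes are powers of two that are
    also multiples of 8 (8, 16, 32, ...), as cache slicing requires."""
    if requested < 8:
        return None  # no valid chunk size fits
    size = 8
    while size * 2 <= requested:
        size *= 2
    return size
```

For example, a requested size of 20 would round down to 16, and anything below 8 has no valid size at all.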
comfyanonymous f6b869d7d3
fp16 intermediates don't work for some text enc models. (#13056) 2026-03-18 19:42:28 -04:00
comfyanonymous 56ff88f951
Fix regression. (#13053) 2026-03-18 18:35:25 -04:00
Jukka Seppänen 9fff091f35
Further Reduce LTX VAE decode peak RAM usage (#13052) 2026-03-18 18:32:26 -04:00
comfyanonymous dcd659590f
Make more intermediate values follow the intermediate dtype. (#13051) 2026-03-18 18:14:18 -04:00
Anton Bukov b941913f1d
fix: run text encoders on MPS GPU instead of CPU for Apple Silicon (#12809)
On Apple Silicon, `vram_state` is set to `VRAMState.SHARED` because
CPU and GPU share unified memory. However, `text_encoder_device()`
only checked for `HIGH_VRAM` and `NORMAL_VRAM`, causing all text
encoders to fall back to CPU on MPS devices.

Adding `VRAMState.SHARED` to the condition allows non-quantized text
encoders (e.g. bf16 Gemma 3 12B) to run on the MPS GPU, providing
significant speedup for text encoding and prompt generation.

Note: quantized models (fp4/fp8) that use float8_e4m3fn internally
will still fall back to CPU via the `supports_cast()` check in
`CLIP.__init__()`, since MPS does not support fp8 dtypes.
2026-03-17 21:21:32 -04:00
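The condition change described above can be sketched with a toy version of the device selection (note this commit was later reverted higher up in this history). The enum members and the `text_encoder_device` body here are simplified illustrations, not the actual ComfyUI code, which returns a `torch.device` and checks more conditions:

```python
from enum import Enum

class VRAMState(Enum):
    # Hypothetical subset of the VRAMState enum, for illustration only.
    LOW_VRAM = 0
    NORMAL_VRAM = 1
    HIGH_VRAM = 2
    SHARED = 3  # Apple Silicon unified memory

def text_encoder_device(vram_state):
    """Sketch of the fixed selection: SHARED now also routes text
    encoders to the GPU instead of falling back to CPU."""
    if vram_state in (VRAMState.HIGH_VRAM, VRAMState.NORMAL_VRAM,
                      VRAMState.SHARED):
        return "gpu"
    return "cpu"
```

With only the original two members in the tuple, `SHARED` fell through to the CPU branch, which is the bug the commit describes.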
rattus cad24ce262
cascade: remove dead weight init code (#13026)
This weight init process is fully shadowed by the weight load and
doesn't work in dynamic_vram, where the weight allocation is deferred.
2026-03-17 20:59:10 -04:00
comfyanonymous 68d542cc06
Fix case where pixel space VAE could cause issues. (#13030) 2026-03-17 20:46:22 -04:00
Jukka Seppänen 735a0465e5
Inplace VAE output processing to reduce peak RAM consumption. (#13028) 2026-03-17 20:20:49 -04:00
rattus 035414ede4
Reduce WAN VAE VRAM, Save use cases for OOM/Tiler (#13014)
* wan: vae: encoder: Add feature cache layer that corks singles

If a downsample only gives you a single frame, save it to the feature
cache and return nothing to the top level. This improves cacheability,
but also prepares support for going two
by two rather than four by four on the frames.

* wan: remove all concatenation with the feature cache

The loopers are now responsible for ensuring that non-final frames are
processed at least two-by-two, eliminating the need for this cat case.

* wan: vae: recurse and chunk for 2+2 frames on decode

Avoid having to clone off slices of 4 frame chunks and reduce the size
of the big 6 frame convolutions down to 4. Save the VRAMs.

* wan: encode frames 2x2.

Reduce VRAM usage greatly by encoding frames 2 at a time rather than
4.

* wan: vae: remove cloning

The loopers now control the chunking such that there are never more than 2
frames, so just cache these slices directly and avoid the clone
allocations completely.

* wan: vae: free consumer caller tensors on recursion

* wan: vae: restyle a little to match LTX
2026-03-17 17:34:39 -04:00
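The two-at-a-time looping described above can be sketched with a hypothetical helper (not the actual looper code), which splits a frame count into index ranges so non-final frames are always handled two by two:

```python
def chunks_of_two(num_frames):
    """Hypothetical sketch: yield (start, end) frame index ranges so
    non-final frames are processed two at a time instead of four,
    shrinking the working set the convolutions see."""
    out = []
    start = 0
    while start < num_frames:
        end = min(start + 2, num_frames)
        out.append((start, end))
        start = end
    return out
```

Only the final chunk can be shorter than two frames, matching the "at least two-by-two" requirement for non-final frames.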
rattus 1a157e1f97
Reduce LTX VAE VRAM usage and save use cases from OOMs/Tiler (#13013)
* ltx: vae: scale the chunk size with the user's VRAM

Scale this linearly down for users with low VRAM.

* ltx: vae: free non-chunking recursive intermediates

* ltx: vae: cleanup some intermediates

The conv layer can be the VRAM peak and it does a torch.cat, so clean up
the pieces of the cat. Also clear out the cache ASAP as each layer detects
its end, because this VAE surges in VRAM at the end: the end padding
increases the size of the final frame convolutions off the books of
the chunker. If all the earlier layers free up their cache, it can
offset that surge.

It's a fragmentation nightmare, and the chance of having to re-cache the
pytorch allocator is very high, but you won't OOM.
2026-03-17 17:32:43 -04:00