ComfyUI

Commit Graph

Author	SHA1	Message	Date
Jun Yamog	1de83f91c3	Fix OOM regression in _apply() for quantized models during inference (#13372) Skip unnecessary clone of inference-mode tensors when already inside torch.inference_mode(), matching the existing guard in set_attr_param. The unconditional clone introduced in `20561aa9` caused transient VRAM doubling during model movement for FP8/quantized models.	2026-04-15 02:10:36 -07:00
comfyanonymous	2a1f402601	Make Qwen 8B work with TextGenerate node. (#13160)	2026-03-25 23:21:44 -04:00
Kohaku-Blueleaf	5ebb0c2e0b	FP8 bwd training (#13121)	2026-03-24 20:39:04 -04:00
Kohaku-Blueleaf	20561aa919	[Trainer] FP4, 8, 16 training by native dtype support and quant linear autograd function (#12681)	2026-03-16 21:31:50 -04:00
rattus	e84a200a3c	ops: opt out of deferred weight init if subclassed (#12967) If a subclass brings its own _load_from_state_dict and doesn't call super(), the needed default init of these weights is missed, which can lead to problems with uninitialized weights.	2026-03-15 11:49:49 -07:00
Jukka Seppänen	1c5db7397d	feat: Support mxfp8 (#12907)	2026-03-14 18:36:29 -04:00
rattus	7810f49702	comfy aimdo 0.2.11 + Improved RAM Pressure release strategies - Windows speedups (#12925) * Implement seek and read for pins Sourcing pins from an mmap is bad because it's a CPU->CPU copy that attempts to fully buffer the same data twice. Instead, use seek and read, which avoids the mmap buffering while usually being a faster read in the first place (avoiding mmap faulting etc). * pinned_memory: Use Aimdo pinner The aimdo pinner bypasses PyTorch's CPU allocator, which can leak Windows commit charge. * ops: bypass init() of weight for embedding layer This similarly consumes large commit charge, especially for TEs. It can cause a permanent leaked commit charge which can destabilize systems close to the commit ceiling and generally confuses the RAM stats. * model_patcher: implement pinned memory counter Implement a pinned memory counter for better accounting of what volume of memory pins have. * implement touch accounting Implement accounting of touching mmapped tensors. * mm+mp: add residency mmap getter * utils: use the aimdo mmap to load sft files * model_management: Implement tighter RAM pressure semantics Implement a pressure release on entire MMAPs as Windows does perform faster when mmaps are unloaded and model loads free ramp into fully unallocated RAM. Make the concept of freeing for pins a completely separate concept. Now that pins are loadable directly from the original file and don't touch the mmap, tighten the freeing budget to just the current loaded model - what you have left over. This still over-frees pins, but it's a lot better than before. So after the pins are freed with that algorithm, bounce entire MMAPs to free RAM based on what the model needs, deducting off any known resident-in-mmap tensors to the free quota to keep it as tight as possible. * comfy-aimdo 0.2.11 Comfy aimdo 0.2.11 * mm: Implement file_slice path for QT * ruff * ops: put meta-tensors in place to allow custom nodes to check geo	2026-03-13 22:18:08 -04:00
comfyanonymous	1c3b651c0a	Refactor. (#12794)	2026-03-05 13:35:56 -05:00
rattus	42e0e023ee	ops: Handle CPU weight in VBAR caster (#12792) This shouldn't happen, but custom nodes get there. Handle it as best we can.	2026-03-05 10:22:17 -08:00
comfyanonymous	f2ee7f2d36	Fix cublas ops on dynamic vram. (#12776)	2026-03-05 01:21:55 -05:00
rattus	9b85cf9558	Comfy Aimdo 0.2.5 + Fix offload performance in DynamicVram (#12754) * ops: don't unpin nothing This was calling into aimdo in the none case (offloaded weight). What's worse, aimdo syncs for unpinning an offloaded weight, as that is the corner case of a weight getting evicted by its own use, which does require a sync. But this was happening for every offloaded weight, causing slowdown. * mp: fix get_free_memory policy The ModelPatcherDynamic get_free_memory was deducting the model from to try and estimate the conceptual free memory with doing any offloading. This is kind of what the old memory_memory_required was estimating in ModelPatcher load logic; however, in practical reality, between over-estimates and padding, the loader usually underloaded models enough such that sampling could send CFG +/- through together even when partially loaded. So don't regress from the status quo and instead go all in on the idea that offloading is less of an issue than debatching. Tell the sampler it can use everything.	2026-03-04 07:49:13 -08:00
rattus	e721e24136	ops: implement lora requanting for non QuantizedTensor fp8 (#12668) Allow a non QuantizedTensor layer to set want_requant to get the post lora calculation stochastically cast down to the original input dtype. This is then used by the legacy fp8 Linear implementation to set the compute_dtype to the preferred lora dtype but then want_requant it back down to fp8. This fixes the issue when --fast fp8_matrix_mult is combined with --fast dynamic_vram while doing a lora on an fp8_ non QT model.	2026-02-27 19:05:51 -05:00
rattus	4f5b7dbf1f	Fix Aimdo fallback on probe to not use zero-copy sft (#12634) * utils: don't use comfy sft loader in aimdo fallback This was going to the raw command line switch and should respect the main.py probe of whether aimdo actually loaded successfully. * ops: don't use deferred linear load in Aimdo fallback Avoid changes of behaviour on --fast dynamic_vram when aimdo doesn't work.	2026-02-25 16:49:48 -05:00
comfyanonymous	599f9c5010	Don't crash right away if op is uninitialized. (#12615)	2026-02-24 12:28:25 -05:00
rattus	58dcc97dcf	ops: limit return of requants (#12506) This check was far too broad and the dtype is not a reliable indicator of wanting the requant (as QT returns the compute dtype as the dtype). So explicitly plumb whether fp8mm wants the requant or not.	2026-02-17 15:32:27 -05:00
comfyanonymous	4454fab7f0	Remove code to support RMSNorm on old pytorch. (#12499)	2026-02-16 20:09:24 -05:00
rattus	d297a749a2	dynamic_vram: Fix windows Aimdo crash + Fix LLM performance (#12408) * model_management: lazy-cache aimdo_tensor These tensors constructed from aimdo-allocations are CPU expensive to make on the pytorch side. Add a cached version that will be valid with signature match to fast path past whatever torch is doing. * dynamic_vram: Minimize fast path CPU work Move as much as possible inside the not resident if block and cache the formed weight and bias rather than the flat intermediates. In extreme layer weight rates this adds up.	2026-02-11 14:50:16 -05:00
rattus	123a7874a9	ops: Fix vanilla-fp8 loaded lora quality (#12390) This was missing the stochastic rounding required for fp8 downcast to be consistent with model_patcher.patch_weight_to_device. Missed in testing as I spend too much time with quantized tensors and overlooked the simpler ones.	2026-02-10 13:38:28 -05:00
rattus	62315fbb15	Dynamic VRAM fixes - Ace 1.5 performance + a VRAM leak (#12368) * revert threaded model loader change This change was only needed to get around the pytorch 2.7 mempool bugs, and should have been reverted along with #12260. This fixes a different memory leak where pytorch gets confused about cache emptying. * load non comfy weights * MPDynamic: Pre-generate the tensors for vbars Apparently this is an expensive operation that slows down things. * bump to aimdo 1.8 New features: watermark limit feature logging enhancements -O2 build on linux	2026-02-09 16:16:08 -05:00
comfyanonymous	c8fcbd66ee	Try to fix ace text encoder slowness on some configs. (#12290)	2026-02-04 19:37:05 -05:00
rattus	361b9a82a3	fix pinning with model defined dtype (#12208) pinned memory was converted back to pinning the CPU side weight without any changes. Fix the pinner to use the CPU weight and not the model defined geometry. This will either save RAM or stop buffer overruns when the types mismatch. Fix the model defined weight caster to use the [ s.weight, s.bias ] interpretation, as xfer_dest might be the flattened pin now. Fix the detection of needing to cast to not be conditional on !pin.	2026-02-01 08:42:32 -08:00
rattus	f8acd9c402	Reduce RAM usage, fix VRAM OOMs, and fix Windows shared memory spilling with adaptive model loading (#11845)	2026-02-01 01:01:11 -05:00
rattus	4e6a1b66a9	speed up and reduce VRAM of QWEN VAE and WAN (less so) (#12036) * ops: introduce autopad for conv3d This works around pytorch's missing ability to causal pad as part of the kernel and avoids massive weight duplications for padding. * wan-vae: rework causal padding This currently uses F.pad, which takes a full deep copy and is liable to be the VRAM peak. Instead, kick spatial padding back to the op and consolidate the temporal padding with the cat for the cache. * wan-vae: implement zero pad fast path The WAN VAE is also QWEN where it is used single-image. These convolutions are however zero padded 3d convolutions, which means the VAE is actually just 2D down the last element of the conv weight in the temporal dimension. Fast path this, to avoid adding zeros that then just evaporate in convolution math but cost computation.	2026-01-23 19:56:14 -05:00
comfyanonymous	b3c0e4de57	Make loras work on nvfp4 models. (#11837) The initial applying is a bit slow but will probably be sped up in the future.	2026-01-12 22:33:54 -05:00
comfyanonymous	dc202a2e51	Properly save mixed ops. (#11772)	2026-01-10 02:03:57 -05:00
comfyanonymous	bd0e6825e8	Be less strict when loading mixed ops weights. (#11769)	2026-01-09 14:21:06 -05:00
rattus	b6c79a648a	ops: Fix offloading with FP8MM performance (#11697) This logic was checking comfy_cast_weights, and going straight to the forward_comfy_cast_weights implementation without attempting to downscale input to fp8 in the event comfy_cast_weights is set. The main reason comfy_cast_weights would be set would be for async offload, which is not a good reason to nix FP8MM. So instead, AND together the underlying exclusions for FP8MM, which are: * having a weight_function (usually LowVramPatch) * force_cast_weights (compute dtype override) * the weight is not Quantized * the input is already quantized * the model or layer has MM explicitly disabled. If you get past all of those exclusions, quantize the input tensor. Then hand the new input, quantized or not, off to forward_comfy_cast_weights to handle it. If the weight is offloaded but the input is quantized, you will get an offloaded MM8.	2026-01-07 21:01:16 -05:00
comfyanonymous	b7d7cc1d49	Fix fp8 fast issue. (#11688)	2026-01-07 01:39:06 -05:00
comfyanonymous	2c03884f5f	Skip fp4 matrix mult on devices that don't support it. (#11677)	2026-01-06 18:07:26 -05:00
comfyanonymous	6da00dd899	Initial ops changes to use comfy_kitchen: Initial nvfp4 checkpoint support. (#11635) --------- Co-authored-by: Jedrzej Kosinski <kosinkadink1@gmail.com>	2026-01-05 21:48:58 -05:00
comfyanonymous	971cefe7d4	Fix pytorch warnings. (#11314)	2025-12-13 18:45:23 -05:00
comfyanonymous	c5a47a1692	Fix bias dtype issue in mixed ops. (#11293)	2025-12-12 11:49:35 -05:00
comfyanonymous	5495589db3	Respect the dtype the op was initialized in for non quant mixed op. (#11282)	2025-12-11 23:32:27 -05:00
rattus	9d252f3b70	ops: delete dead code (#11204) This became dead code in https://github.com/comfyanonymous/ComfyUI/pull/11069	2025-12-09 00:55:13 -05:00
comfyanonymous	6fd463aec9	Fix regression when text encoder loaded directly on GPU. (#11129)	2025-12-05 15:33:16 -05:00
comfyanonymous	43071e3de3	Make old scaled fp8 format use the new mixed quant ops system. (#11000)	2025-12-05 14:35:42 -05:00
rattus	519c941165	Prs/lora reservations (reduce massive Lora reservations especially on Flux2) (#11069) * mp: only count the offload cost of math once This was previously bundling the combined weight storage and computation cost * ops: put all post async transfer compute on the main stream Some models have massive weights that need either complex dequantization or lora patching. Don't do these patchings on the offload stream; instead do them on the main stream to synchronize the potentially large vram spikes for these compute processes. This avoids having to assume a worst case scenario of multiple offload streams all spiking VRAM in parallel with whatever the main stream is doing.	2025-12-03 02:28:45 -05:00
rattus	0ff0457892	mm: wrap the raw stream in context manager (#10958) The documentation of torch.foo.Stream being usable with with: suggests it starts at version 2.7. Use the old API for backwards compatibility.	2025-11-28 16:38:12 -05:00
comfyanonymous	bdb10a583f	Fix loras not working on mixed fp8. (#10899)	2025-11-26 00:07:58 -05:00
comfyanonymous	acfaa5c4a1	Don't try fp8 matrix mult in quantized ops if not supported by hardware. (#10874)	2025-11-25 02:55:49 -05:00
comfyanonymous	25022e0b09	Cleanup and fix issues with text encoder quants. (#10872)	2025-11-25 01:48:53 -05:00
comfyanonymous	cb96d4d18c	Disable workaround on newer cudnn. (#10807)	2025-11-19 23:56:23 -05:00
contentis	3b3ef9a77a	Quantized Ops fixes (#10715) * offload support, bug fixes, remove mixins * add readme	2025-11-12 18:26:52 -05:00
rattus	c350009236	ops: Put weight cast on the offload stream (#10697) This needs to be on the offload stream. This reproduced a black screen with low resolution images on a slow bus when using FP8.	2025-11-09 22:52:11 -05:00
comfyanonymous	0f4ef3afa0	This seems to slow things down slightly on Linux. (#10624)	2025-11-03 21:47:14 -05:00
comfyanonymous	0652cb8e2d	Speed up torch.compile (#10620)	2025-11-03 17:37:12 -05:00
rattus	135fa49ec2	Small speed improvements to --async-offload (#10593) * ops: don't take an offload stream if you don't need one * ops: prioritize mem transfer The async offload stream's reason for existence is to transfer from RAM to GPU. The post processing compute steps are a bonus on the side stream, but if the compute stream is running a long kernel, it can stall the side stream, as it waits to type-cast the bias before transferring the weight. So do a pure xfer of the weight straight up, then do everything bias, then go back to fix the weight type and do weight patches.	2025-11-01 18:48:53 -04:00
comfyanonymous	c58c13b2ba	Fix torch compile regression on fp8 ops. (#10580)	2025-11-01 00:25:17 -04:00
comfyanonymous	906c089957	Fix small performance regression with fp8 fast and scaled fp8. (#10537)	2025-10-29 19:29:01 -04:00
rattus	ab7ab5be23	Fix Race condition in --async-offload that can cause corruption (#10501) * mm: factor out the current stream getter Make this a reusable function. * ops: sync the offload stream with the consumption of w&b This sync is necessary as pytorch will queue cuda async frees on the same stream that created the tensor. In the case of async offload, this will be on the offload stream. Weights and biases can go out of scope in python, which then triggers the pytorch garbage collector to queue the free operation on the offload stream, possibly before the compute stream has used the weight. This causes a use after free on weight data, leading to total corruption of some workflows. So sync the offload stream with the compute stream after the weight has been used so the free has to wait for the weight to be used. The cast_bias_weight is extended in a backwards compatible way with the new behaviour opt-in on a defaulted parameter. This handles custom node packs calling cast_bias_weight and defeatures async-offload for them (as they do not handle the race). The pattern is now: cast_bias_weight(... , offloadable=True) #This might be offloaded thing(weight, bias, ...) uncast_bias_weight(...) * controlnet: adopt new cast_bias_weight synchronization scheme This is necessary for safe async weight offloading. * mm: sync the last stream in the queue, not the next Currently this peeks ahead to sync the next stream in the queue of streams with the compute stream. This doesn't allow a lot of parallelization, as the end result is you can only get one weight load ahead regardless of how many streams you have. Rotate the loop logic here to synchronize the end of the queue before returning the next stream. This allows weights to be loaded ahead of the compute stream's position.	2025-10-29 17:17:46 -04:00

119 Commits (ec62a307a2df51cab9fe2e6cd810d2afe5019c4e)