* wan: vae: encoder: Add feature cache layer that corks singles
If a downsample only gives you a single frame, save it to the feature
cache and return nothing to the top level. This improves cacheability,
and also prepares support for processing frames two by two rather than
four by four.
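A rough sketch of the corking idea (the cache object and tensor layout
below are illustrative, not the actual wan VAE code):

    import torch

    class FeatureCache:
        def __init__(self):
            self.frames = None  # a lone frame corked by the last downsample

    def cork_or_return(x: torch.Tensor, cache: FeatureCache):
        # x: output of a temporal downsample, shaped (B, C, T, H, W)
        if x.shape[2] == 1:
            cache.frames = x  # keep the single frame for the next chunk
            return None       # nothing is returned to the top level
        return x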
* wan: remove all concatenation with the feature cache
The loopers are now responsible for ensuring that non-final frames are
processed at least two-by-two, eliminating the need for this cat case.
* wan: vae: recurse and chunk for 2+2 frames on decode
Avoid having to clone off slices of 4-frame chunks and reduce the big
6-frame convolutions down to 4 frames. Saves VRAM.
* wan: encode frames 2x2.
Reduce VRAM usage greatly by encoding frames 2 at a time rather than
4.
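As an illustration of the two-at-a-time walk (the helper name and the
(B, C, T, H, W) layout are assumptions, not the repo's code):

    import torch

    def iter_frame_chunks(x: torch.Tensor, chunk: int = 2):
        # Yield small temporal chunks so each encoder call sees at most `chunk`
        # frames; continuity across chunks comes from the feature cache, not overlap.
        for t in range(0, x.shape[2], chunk):
            yield x[:, :, t:t + chunk]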
* wan: vae: remove cloning
The loopers now control the chunking such that there are never more
than 2 frames, so just cache these slices directly and avoid the clone
allocations completely.
* wan: vae: free consumer caller tensors on recursion
* wan: vae: restyle a little to match LTX
* ltx: vae: scale the chunk size with the user's VRAM
Scale the chunk size down linearly for users with low VRAM.
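A hedged sketch of the linear scaling (the base chunk size and reference
VRAM figure are illustrative, not the repo's values):

    def pick_chunk_size(free_vram: int, base_chunk: int = 8,
                        reference_vram: int = 12 * 1024**3, min_chunk: int = 1) -> int:
        # Scale the chunk size linearly with free VRAM relative to a reference card,
        # clamped so low-VRAM users still decode at least min_chunk frames at a time.
        scaled = int(base_chunk * free_vram / reference_vram)
        return max(min_chunk, min(base_chunk, scaled))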
* ltx: vae: free non-chunking recursive intermediates
* ltx: vae: cleanup some intermediates
The conv layer can be the VRAM peak and it does a torch.cat, so clean up
the pieces of the cat. Also clear out the cache as soon as each layer
detects its end, because this VAE surges in VRAM at the end: the end
padding increases the size of the final frame convolutions in a way that
is invisible to the chunker. If all the earlier layers free up their
caches, that offsets the surge.
It's a fragmentation nightmare, and the chance of the pytorch allocator
having to re-cache is very high, but you won't OOM.
If a subclass brings its own _load_from_state_dict and doesn't call
super(), the needed default init of these weights is missed, which can
lead to problems from uninitialized weights.
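For illustration, the safe pattern is to always delegate to the parent
(the subclass here is hypothetical):

    import torch.nn as nn

    class MyLinear(nn.Linear):
        def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict,
                                  missing_keys, unexpected_keys, error_msgs):
            # ... custom key handling can go here ...
            # Always call super() so the default handling (including init of weights
            # that are absent from the state dict) still runs.
            super()._load_from_state_dict(state_dict, prefix, local_metadata, strict,
                                          missing_keys, unexpected_keys, error_msgs)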
This is an experimental WIP option that might not work in your workflow but
should lower memory usage if it does.
Currently only the VAE and the load image node will output in fp16 when
this option is turned on.
* Implement seek and read for pins
Sourcing pins from an mmap is bad because it's a CPU->CPU copy that
attempts to fully buffer the same data twice. Instead, use seek and
read, which avoids the mmap buffering while usually being a faster
read in the first place (avoiding mmap faulting etc.).
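A rough sketch of the seek+read path, assuming the caller already knows
the byte offset and length of a tensor in the file (the helper name and
signature are made up):

    import torch

    def read_bytes_into_buffer(path: str, offset: int, nbytes: int) -> torch.Tensor:
        # Allocate the destination once (this is where a pinned buffer would go),
        # then fill it straight from the file, bypassing any mmap of the same data.
        buf = torch.empty(nbytes, dtype=torch.uint8)
        with open(path, "rb", buffering=0) as f:
            f.seek(offset)
            f.readinto(memoryview(buf.numpy()))
        return buf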
* pinned_memory: Use Aimdo pinner
The aimdo pinner bypasses pytorch's CPU allocator, which can leak
Windows commit charge.
* ops: bypass init() of weight for embedding layer
This similarly consumes a large commit charge, especially for TEs. It
can cause a permanently leaked commit charge, which can destabilize
systems close to the commit ceiling and generally confuses the RAM
stats.
* model_patcher: implement pinned memory counter
Implement a pinned memory counter for better accounting of how much
memory the pins occupy.
* implement touch accounting
Implement accounting of touching mmapped tensors.
* mm+mp: add residency mmap getter
* utils: use the aimdo mmap to load sft files
* model_management: Implement tighter RAM pressure semantics
Implement a pressure release on entire MMAPs, as Windows performs
faster when mmaps are unloaded and model loads go into fully
unallocated RAM.
Make the concept of freeing for pins a completely separate concept.
Now that pins are loadable directly from the original file and don't
touch the mmap, tighten the freeing budget to just the currently loaded
model minus what you have left over. This still over-frees pins, but
it's a lot better than before.
So after the pins are freed with that algorithm, bounce entire MMAPs
to free RAM based on what the model needs, deducting any known
resident-in-mmap tensors from the free quota to keep it as tight as
possible.
* comfy-aimdo 0.2.11
* mm: Implement file_slice path for QT
* ruff
* ops: put meta-tensors in place to allow custom nodes to check geo
* fix: guard torch.AcceleratorError for compatibility with torch < 2.8.0
torch.AcceleratorError was introduced in PyTorch 2.8.0. Accessing it
directly raises AttributeError on older versions. Use a try/except
fallback at module load time, consistent with the existing pattern used
for OOM_EXCEPTION.
* fix: address review feedback for AcceleratorError compat
- Fall back to RuntimeError instead of type(None) for ACCELERATOR_ERROR,
consistent with OOM_EXCEPTION fallback pattern and valid for except clauses
- Add "out of memory" message introspection for RuntimeError fallback case
- Use RuntimeError directly in discard_cuda_async_error except clause
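A minimal sketch of the resulting guard, mirroring the OOM_EXCEPTION
pattern described above:

    import torch

    try:
        ACCELERATOR_ERROR = torch.AcceleratorError  # present from PyTorch 2.8.0
    except AttributeError:
        ACCELERATOR_ERROR = RuntimeError  # valid in except clauses on older torch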
---------
Pytorch only filters for OOMs in its own allocators; however, there are
paths that can OOM on allocators made outside the pytorch allocators.
These manifest as an AcceleratorError, as pytorch does not have universal
error translation to its OOM type on exception. Handle it. A log I have
for this also shows a double async report of the error, so call the
async discarder to clean up and make these OOMs look like OOMs.
Sync the compute stream before freeing the cast buffers. Without the
sync, use-after-free issues can occur when the cast stream frees the
buffer while the compute stream is behind enough to still need a casted
weight.
* mp: respect model_defined_dtypes in default caster
This is needed for parametrizations when the dtype changes between sd
and model.
* audio_encoders: archive model dtypes
Archive model dtypes to stop the state dict load from overriding the
dtypes defined by the core for compute etc.
* ops: don't unpin nothing
This was calling into aimdo in the None case (offloaded weight). What's
worse, aimdo syncs when unpinning an offloaded weight, as that is the
corner case of a weight getting evicted by its own use, which does
require a sync. But this was happening for every offloaded weight,
causing a slowdown.
* mp: fix get_free_memory policy
The ModelPatcherDynamic get_free_memory was deducting the model from
the free memory to try and estimate the conceptual free memory without
doing any offloading. This is kind of what the old memory_required
was estimating in the ModelPatcher load logic; however, in practical
reality, between over-estimates and padding, the loader usually
underloaded models enough that sampling could send CFG +/- through
together even when partially loaded.
So don't regress from the status quo and instead go all in on the
idea that offloading is less of an issue than debatching. Tell the
sampler it can use everything.
Define a threshold below which loading a weight takes priority. This
actually makes the offload consistent with non-dynamic, because what
happens is, when non-dynamic fills its to_load list, it will fill up
any left-over space that could not fit the large weights with small
weights and load them, even though they were lower priority. This
actually improves performance because the tiny weights don't cost any
VRAM and aren't worth the control overhead of the DMA etc.
* respect model dtype in non-comfy caster
* utils: factor out parent and name functionality of set_attr
* utils: implement set_attr_buffer for torch buffers
* ModelPatcherDynamic: Implement torch Buffer loading
If there is a buffer in a dynamic model, force load it.
* model_management: Remove non-comfy dynamic _v caster
* Force pre-load non-comfy weights to GPU in ModelPatcherDynamic
Non-comfy weights may expect to be pre-cast to the target
device without in-model casting. Previously they were allocated in
the vbar with _v which required the _v fault path in cast_to.
Instead, back up the original CPU weight and move it directly to GPU
at load time.
* draft zeta (z-image pixel space)
* revert gitignore
* model loaded and able to run however vector direction still wrong tho
* flip the vector direction to original again this time
* Move wrongly positioned Z image pixel space class
* inherit Radiance LatentFormat class
* Fix parameters in classes for Zeta x0 dino
* remove arbitrary nn.init instances
* Remove unused import of lru_cache
---------
Co-authored-by: silveroxides <ishimarukaito@gmail.com>
This was previously considering the pool of dynamic models as one giant
entity for the sake of smart memory, but that isn't really useful
or what a user would reasonably expect. Make Dynamic VRAM properly purge
its models just like the old --disable-smart-memory, but condition
the dynamic-for-dynamic bypass on smart memory.
Re-enable dynamic smart memory.
Multi-step samplers (eg. dpmpp_2s_ancestral) call the model at intermediate sigma values not present in the schedule. This caused set_step to crash with "No sample_sigmas matched current timestep" when context windows were enabled.
The fix is to keep self._step from the last exact match when a substep sigma is encountered, since substeps are still logically part of their parent step and should use the same context windows.
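A hedged sketch of the tolerant matching (the surrounding class is
illustrative; sample_sigmas and _step follow the names above):

    import torch

    class StepTracker:
        def __init__(self, sample_sigmas: torch.Tensor):
            self.sample_sigmas = sample_sigmas
            self._step = 0

        def set_step(self, sigma):
            matches = (self.sample_sigmas == sigma).nonzero()
            if len(matches) > 0:
                self._step = int(matches[0])  # exact schedule sigma: advance the step
            # otherwise this is a substep sigma from a multi-step sampler; keep the
            # previous _step so the parent step's context windows are reused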
Co-authored-by: ozbayb <17261091+ozbayb@users.noreply.github.com>
* sd: add support for clip model reconstruction
* nodes: SetClipHooks: Demote the dynamic model patcher
* mp: Make dynamic_disable more robust
The backup needs to not be cloned. In addition, add a delegate object
to ModelPatcherDynamic so that non-cloning code can do
ModelPatcherDynamic demotion.
* sampler_helpers: Demote to non-dynamic model patcher when hooking
* code rabbit review comments
Allow a non-QuantizedTensor layer to set want_requant to get the
post-lora calculation stochastically cast down to the original input dtype.
This is then used by the legacy fp8 Linear implementation to set the
compute_dtype to the preferred lora dtype but then want_requant it back
down to fp8.
This fixes the issue where --fast fp8_matrix_mult is combined with
--fast dynamic_vram while doing a lora on an fp8 non-QT model.
Implements per-guide attention attenuation via log-space additive bias
in self-attention. Each guide reference tracks its own strength and
optional spatial mask in conditioning metadata (guide_attention_entries).
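A rough sketch of the idea: log(strength) is added to the pre-softmax
logits for the keys belonging to each guide (the entry format and helper
below are assumptions, not the actual guide_attention_entries schema):

    import torch

    def build_guide_bias(num_keys: int, entries, device="cpu") -> torch.Tensor:
        # entries: iterable of (key_slice, strength, mask_or_None). Adding log(strength)
        # to the logits scales that guide's attention weights multiplicatively.
        bias = torch.zeros(num_keys, device=device)
        for key_slice, strength, mask in entries:
            b = torch.log(torch.clamp(torch.tensor(float(strength)), min=1e-6))
            bias[key_slice] += b if mask is None else b * mask
        return bias  # used as: logits = q @ k.transpose(-1, -2) * scale + bias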
* utils: don't use comfy sft loader in aimdo fallback
This was keying off the raw command line switch; it should respect
main.py's probe of whether aimdo actually loaded successfully.
* ops: don't use deferred linear load in Aimdo fallback
Avoid changes of behaviour on --fast dynamic_vram when aimdo doesn't work.
* mp: attach re-construction arguments to model patcher
When making a model-patcher from a unet or ckpt, attach a callable
function that can be called to replay the model construction. This
can be used to deep clone a model patcher with respect to the actual model.
Originally written by Kosinkadink
f4b99bc623
* mp: Add disable_dynamic clone argument
Add a clone argument that lets a caller clone a ModelPatcher but disable
dynamic to demote the clone to regular MP. This is useful for legacy
features where dynamic_vram support is missing or TBD.
* torch_compile: disable dynamic_vram
This is a bigger feature. Disable for the interim to preserve
functionality.
Integrate comfy-aimdo 0.2, which takes a different approach to
installing the memory allocator hook. Instead of using the complicated
and buggy pytorch MemPool+CUDAPluggableAllocator, cuda is hooked
directly, making the process much more transparent to both comfy and
pytorch. As far as pytorch knows, aimdo doesn't exist anymore and just
operates behind the scenes.
Remove all the mempool setup stuff for dynamic_vram and bump the
comfy-aimdo version. Remove the allocator object from memory_management
and demote its use as an enablement check to a boolean flag.
Comfy-aimdo 0.2 also supports the pytorch cuda async allocator, so
remove the dynamic_vram based force disablement of cuda_malloc and
just go back to the old allocator settings based on command line
input.
This check was far too broad, and the dtype is not a reliable indicator
of wanting the requant (as QT returns the compute dtype as the dtype).
So explicitly plumb through whether fp8mm wants the requant or not.
* lora: add weight shape calculations.
This lets the loader know if a lora will change the shape of a weight
so it can take appropriate action.
* MPDynamic: force load flux img_in weight
This weight is a bit special, in that the lora changes its geometry.
This is rather unique, not handled by the existing estimates, and
doesn't work for either offloading or dynamic_vram.
Fix for dynamic_vram as a special case. Ideally we can fully precalculate
these lora geometry changes at load time, but just get these models
working first.
Get rid of the cat and unary negation and inplace add-cmul the two
halves of the rope. Precompute -sin once at the start of the model
rather than every transformer block.
This is slightly faster on both GPU and CPU bound setups.
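A hedged sketch of the cat-free rotate-half form, with -sin precomputed
by the caller (function and argument names are illustrative):

    import torch

    def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
                   neg_sin: torch.Tensor) -> torch.Tensor:
        # Write each rotated half straight into views of the output, so there is
        # no torch.cat and no per-block negation of sin.
        d = x.shape[-1] // 2
        x1, x2 = x[..., :d], x[..., d:]
        out = torch.empty_like(x)
        o1, o2 = out[..., :d], out[..., d:]
        torch.mul(x1, cos, out=o1)
        o1.addcmul_(x2, neg_sin)  # o1 = x1*cos - x2*sin
        torch.mul(x2, cos, out=o2)
        o2.addcmul_(x1, sin)      # o2 = x2*cos + x1*sin
        return out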
The current behaviour of the default ModelPatcher is to .to a model
only if it's fully loaded, which is how random non-leaf weights get
loaded in non-LowVRAM conditions.
This however means they never get loaded in dynamic_vram. In the
dynamic_vram case, force load them to the GPU.
* model_management: lazy-cache aimdo_tensor
These tensors constructed from aimdo allocations are CPU expensive to
make on the pytorch side. Add a cached version that is valid on a
signature match, to fast-path past whatever torch is doing.
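Conceptually it is a small signature-keyed memo (the key fields and
builder callback below are guesses for illustration):

    _tensor_cache = {}

    def cached_tensor(ptr, shape, dtype, build):
        # Reuse the previously wrapped tensor when the allocation signature matches,
        # skipping the expensive torch-side construction; `build` makes it on a miss.
        key = (ptr, tuple(shape), dtype)
        t = _tensor_cache.get(key)
        if t is None:
            t = build()
            _tensor_cache[key] = t
        return t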
* dynamic_vram: Minimize fast path CPU work
Move as much as possible inside the not resident if block and cache
the formed weight and bias rather than the flat intermediates. At
extreme layer-weight rates this adds up.
* Fix bypass dtype/device moving
* Force offloading mode for training
* training context var
* offloading implementation in training node
* fix wrong input type
* Support bypass load lora model, correct adapter/offloading handling
This was missing the stochastic rounding required for fp8 downcast
to be consistent with model_patcher.patch_weight_to_device.
Missed in testing as I spent too much time with quantized tensors
and overlooked the simpler ones.