When training_dtype is set to "none" and the model's native dtype is
float16, GradScaler was unconditionally enabled. However, GradScaler
does not support bfloat16 gradients (only float16/float32), causing a
NotImplementedError when lora_dtype is "bf16" (the default).

Fix this by enabling GradScaler only when the LoRA parameters are not in
bfloat16. bfloat16 has the same exponent range as float32, so it does not
need gradient scaling to avoid underflow.
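
A minimal sketch of the condition described above, not the actual patch. The
helper name `should_enable_grad_scaler` and the dtype strings it accepts
("float16", "bf16", "bfloat16") are assumptions made for illustration:

```python
def should_enable_grad_scaler(training_dtype: str,
                              model_dtype: str,
                              lora_dtype: str) -> bool:
    """Decide whether gradient scaling is both needed and supported.

    Hypothetical helper: the argument names mirror the options named in
    this message, but the string values are assumed, not taken from the code.
    """
    # Scaling only matters when training runs in the model's native float16.
    runs_in_fp16 = training_dtype == "none" and model_dtype == "float16"
    # bfloat16 LoRA parameters produce bfloat16 gradients, which GradScaler
    # cannot unscale, and which do not need scaling in the first place.
    lora_is_bf16 = lora_dtype in ("bf16", "bfloat16")
    return runs_in_fp16 and not lora_is_bf16


# The result would then be passed on, e.g.
# torch.cuda.amp.GradScaler(enabled=use_scaler)
use_scaler = should_enable_grad_scaler("none", "float16", "bf16")
```

With the default lora_dtype of "bf16" the helper returns False, so the scaler
stays disabled and the NotImplementedError path is never reached.
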
Fixes #13124
* Fix bypass dtype/device moving
* Force offloading mode for training
* training context var
* offloading implementation in training node
* fix wrong input type
* Support bypass load lora model, correct adapter/offloading handling
* Create nodes_dataset.py
* Add encoded dataset caching mechanism
* make training node work with our dataset system
* allow trainer node to get different resolution dataset
* move all dataset related implementation to nodes_dataset
* Rewrite dataset system with new io schema
* Rewrite training system with new io schema
* add ui pbar
* Add outputs' id/name
* Fix bad id/naming
* use single process instead of input list when not needed
* fix wrong output_list flag
* use torch.load/save and fix bad behaviors
* Add factorization utils for lokr
* Add lokr train impl
* Add loha train impl
* Add adapter map for algo selection
* Add optional grad ckpt and algo selection
* Update __init__.py
* correct key name for loha
* Use custom fwd/bwd func and better init for loha
* Support gradient accumulation
* Fix bugs of loha
* use more stable init
* Add OFT training
* linting