# TODO

<https://github.com/huggingface/diffusers/pull/13317>

## Internal

- Feature: Implement `unload_auxiliary_models`
- Feature: RIFE update
- Feature: RIFE in processing
- Feature: SeedVR2 in processing
- Feature: Add video models to `Reference`
- Deploy: Lite vs Expert mode
- Engine: [mmgp](https://github.com/deepbeepmeep/mmgp)
- Engine: `TensorRT` acceleration
- Feature: Auto handle scheduler `prediction_type`
- Feature: Cache models in memory (see sketch below)
- Feature: JSON image metadata (see sketch below)
- Validate: Control tab: add overrides handling
- Feature: Integrate natural language image search: [ImageDB](https://github.com/vladmandic/imagedb)
- Feature: Multi-user support
- Feature: Settings profile manager
- Feature: Video tab: add full API support
- Refactor: Unify *huggingface* and *diffusers* model folders
- Refactor: [GGUF](https://huggingface.co/docs/diffusers/main/en/quantization/gguf)
- Reimplement `llama` remover for Kanvas
- Integrate: [Depth3D](https://github.com/vladmandic/sd-extension-depth3d)
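
For the in-memory model cache, a minimal sketch of one possible shape; `load_model` and the LRU policy are illustrative assumptions, not the actual SD.Next loader API:

```python
# Minimal sketch of an in-memory model cache with LRU eviction.
# `load_model` is a hypothetical stand-in for the actual checkpoint loader.
from collections import OrderedDict

class ModelCache:
    def __init__(self, max_items: int = 2):
        self.max_items = max_items
        self.cache = OrderedDict()  # checkpoint name -> loaded pipeline

    def get(self, checkpoint: str, load_model):
        if checkpoint in self.cache:
            self.cache.move_to_end(checkpoint)  # mark as most recently used
            return self.cache[checkpoint]
        model = load_model(checkpoint)  # cache miss: load from disk
        self.cache[checkpoint] = model
        if len(self.cache) > self.max_items:
            self.cache.popitem(last=False)  # evict least recently used entry
        return model
```

Keyed on checkpoint name, this avoids a full reload when switching back to a recently used model; in practice the eviction limit would need to track available RAM/VRAM rather than a fixed item count.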
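
For JSON image metadata, one possible direction using Pillow's PNG text chunks; the chunk name `sd_metadata` and the parameter schema are assumptions for illustration:

```python
# Minimal sketch: store generation parameters as JSON in a PNG text chunk.
# The chunk name and parameter schema are illustrative assumptions.
import json
from PIL import Image
from PIL.PngImagePlugin import PngInfo

def save_with_metadata(image: Image.Image, path: str, params: dict) -> None:
    info = PngInfo()
    info.add_text('sd_metadata', json.dumps(params))  # hypothetical chunk name
    image.save(path, pnginfo=info)

def read_metadata(path: str) -> dict:
    with Image.open(path) as img:
        raw = getattr(img, 'text', {}).get('sd_metadata', '{}')  # PNG text chunks
    return json.loads(raw)
```
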
## OnHold

- Feature: LoRA: add OMI format support for SD35/FLUX.1, on-hold
- Feature: Remote Text-Encoder support, sidelined for the moment

## Modular

*Pending finalization of the modular pipelines implementation and development of a compatibility layer*

- Switch to modular pipelines
- Feature: Transformers unified cache handler
- Refactor: [Modular pipelines and guiders](https://github.com/huggingface/diffusers/issues/11915)
- [MagCache](https://github.com/huggingface/diffusers/pull/12744)
- [SmoothCache](https://github.com/huggingface/diffusers/issues/11135)
- [STG](https://github.com/huggingface/diffusers/blob/main/examples/community/README.md#spatiotemporal-skip-guidance)

## New models / Pipelines

TODO: Investigate which models are diffusers-compatible and prioritize!

### Image-Base

- [Chroma Zeta](https://huggingface.co/lodestones/Zeta-Chroma): Image and video generator for creative effects and professional filters
- [Chroma Radiance](https://huggingface.co/lodestones/Chroma1-Radiance): Pixel-space model eliminating VAE artifacts for high visual fidelity
- [Bria FIBO](https://huggingface.co/briaai/FIBO): Fully JSON-based image generation model
- [Liquid](https://github.com/FoundationVision/Liquid): Unified vision-language auto-regressive generation paradigm
- [Lumina-DiMOO](https://huggingface.co/Alpha-VLLM/Lumina-DiMOO): Foundational multi-modal generation and understanding via discrete diffusion
- [nVidia Cosmos-Predict-2.5](https://huggingface.co/nvidia/Cosmos-Predict2.5-2B): Physics-aware world foundation model for consistent scene prediction

### Image-Edit

- [Bria FIBO-Edit](https://huggingface.co/briaai/Fibo-Edit-RMBG): Fully JSON-based instruction-following image editing framework
- [Meituan LongCat-Image-Edit-Turbo](https://huggingface.co/meituan-longcat/LongCat-Image-Edit-Turbo): 6B instruction-following image editing with high visual consistency
- [VIBE Image-Edit](https://huggingface.co/iitolstykh/VIBE-Image-Edit): (Sana+Qwen-VL) Fast visual instruction-based image editing framework
- [LucyEdit](https://github.com/huggingface/diffusers/pull/12340): Instruction-guided video editing while preserving motion and identity
- [Step1X-Edit](https://github.com/stepfun-ai/Step1X-Edit): Multimodal image editing decoding MLLM tokens via DiT
- [OneReward](https://github.com/bytedance/OneReward): Reinforcement-learning-grounded generative reward model for image editing
- [ByteDance DreamO](https://huggingface.co/ByteDance/DreamO): Image customization framework for IP adaptation and virtual try-on
- [nVidia Cosmos-Transfer-2.5](https://github.com/huggingface/diffusers/pull/13066)

### Video

- [LTX-Condition](https://github.com/huggingface/diffusers/pull/13058)
- [LTX-Distilled](https://github.com/huggingface/diffusers/pull/12934)
- [OpenMOSS MOVA](https://huggingface.co/OpenMOSS-Team/MOVA-720p): Unified foundation model for synchronized high-fidelity video and audio
- [Wan family (Wan2.1 / Wan2.2 variants)](https://huggingface.co/Wan-AI/Wan2.2-Animate-14B): MoE-based foundational tools for cinematic T2V/I2V/TI2V
  - example: [Wan2.1-T2V-14B-CausVid](https://huggingface.co/lightx2v/Wan2.1-T2V-14B-CausVid)
  - distill / step-distill example: [Wan2.1-StepDistill-CfgDistill](https://huggingface.co/lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill)
- [Krea Realtime Video](https://huggingface.co/krea/krea-realtime-video): (Wan2.1) Distilled real-time video diffusion using self-forcing techniques
- [MAGI-1 (autoregressive video)](https://github.com/SandAI-org/MAGI-1): High-quality autoregressive video generation allowing infinite extension and timeline control
- [MUG-V 10B (video generation)](https://huggingface.co/MUG-V/MUG-V-inference): Large-scale DiT-based video generation system trained via flow-matching
- [Ovi (audio/video generation)](https://github.com/character-ai/Ovi): (Wan2.2) Speech-to-video with synchronized sound effects and music
- [HunyuanVideo-Avatar / HunyuanCustom](https://huggingface.co/tencent/HunyuanVideo-Avatar): (HunyuanVideo) MM-DiT based dynamic emotion-controllable dialogue generation
- [Sana Image→Video (Sana-I2V)](https://github.com/huggingface/diffusers/pull/12634#issuecomment-3540534268): (Sana) Compact Linear DiT framework for efficient high-resolution video
- [Wan-2.2 S2V (diffusers PR)](https://github.com/huggingface/diffusers/pull/12258): (Wan2.2) Audio-driven cinematic speech-to-video generation
- [LongCat-Video](https://huggingface.co/meituan-longcat/LongCat-Video): Unified framework for minutes-long coherent video generation via Block Sparse Attention
- [LTXVideo / LTXVideo LongMulti (diffusers PR)](https://github.com/huggingface/diffusers/pull/12614): Real-time DiT-based generation with production-ready camera controls
- [DiffSynth-Studio (ModelScope)](https://github.com/modelscope/DiffSynth-Studio): (Wan2.2) Comprehensive training and quantization tools for Wan video models
- [Phantom (Phantom HuMo)](https://github.com/Phantom-video/Phantom): Human-centric video generation framework focused on subject ID consistency
- [CausVid-Plus / WAN-CausVid-Plus](https://github.com/goatWu/CausVid-Plus/): (Wan2.1) Causal diffusion for high-quality temporally consistent long videos
- [Wan2GP (workflow/GUI for Wan)](https://github.com/deepbeepmeep/Wan2GP): (Wan) Web-based UI focused on running complex video models for GPU-poor setups
- [LivePortrait](https://github.com/KwaiVGI/LivePortrait): Efficient portrait animation system with high stitching and retargeting control
- [Ming (inclusionAI)](https://github.com/inclusionAI/Ming): Unified multimodal model for processing text, audio, image, and video

### Other/Unsorted

- [DiffusionForcing](https://github.com/kwsong0113/diffusion-forcing-transformer): Full-sequence diffusion with autoregressive next-token prediction
- [Self-Forcing](https://github.com/guandeh17/Self-Forcing): Framework for improving temporal consistency in long-horizon video generation
- [SEVA](https://github.com/huggingface/diffusers/pull/11440): Stable Virtual Camera for novel view synthesis and 3D-consistent video
- [ByteDance USO](https://github.com/bytedance/USO): Unified Style-Subject Optimized framework for personalized image generation
- [ByteDance Lynx](https://github.com/bytedance/lynx): State-of-the-art high-fidelity personalized video generation based on DiT
- [LanDiff](https://github.com/landiff/landiff): Coarse-to-fine text-to-video integrating Language and Diffusion Models
- [Video Inpaint Pipeline](https://github.com/huggingface/diffusers/pull/12506): Unified inpainting pipeline implementation within the Diffusers library
- [Sonic Inpaint](https://github.com/ubc-vision/sonic): Audio-driven portrait animation system focused on global audio perception
- [Make-It-Count](https://github.com/Litalby1/make-it-count): CountGen method for precise numerical control of objects via object identity features
- [ControlNeXt](https://github.com/dvlab-research/ControlNeXt/): Lightweight architecture for efficient controllable image and video generation
- [MS-Diffusion](https://github.com/MS-Diffusion/MS-Diffusion): Layout-guided multi-subject image personalization framework
- [UniRef](https://github.com/FoundationVision/UniRef): Unified model for segmentation tasks designed as a foundation-model plug-in
- [FlashFace](https://github.com/ali-vilab/FlashFace): High-fidelity human image customization and face swapping framework
- [ReNO](https://github.com/ExplainableML/ReNO): Reward-based Noise Optimization to improve text-to-image quality during inference

### Not Planned

- [LoRAdapter](https://github.com/CompVis/LoRAdapter): Not recently updated
- [SD3 UltraEdit](https://github.com/HaozheZhao/UltraEdit): Based on SD3
- [PowerPaint](https://github.com/open-mmlab/PowerPaint): Based on SD15
- [FreeCustom](https://github.com/aim-uofa/FreeCustom): Based on SD15
- [AnyDoor](https://github.com/ali-vilab/AnyDoor): Based on SD21
- [AnyText2](https://github.com/tyxsspa/AnyText2): Based on SD15
- [DragonDiffusion](https://github.com/MC-E/DragonDiffusion): Based on SD15
- [DenseDiffusion](https://github.com/naver-ai/DenseDiffusion): Based on SD15
- [IC-Light](https://github.com/lllyasviel/IC-Light): Based on SD15

## Code TODO

> npm run todo

```code
installer.py:TODO rocm: switch to pytorch source when it becomes available
modules/control/run.py:TODO modernui: monkey-patch for missing tabs.select event
modules/history.py:TODO: apply metadata, preview, load/save
modules/image/resize.py:TODO resize image: enable full VAE mode for resize-latent
modules/lora/lora_apply.py:TODO lora: add other quantization types
modules/lora/lora_apply.py:TODO lora: maybe force imediate quantization
modules/lora/lora_extract.py:TODO: lora: support pre-quantized flux
modules/lora/lora_load.py:TODO lora: add t5 key support for sd35/f1
modules/masking.py:TODO: additional masking algorithms
modules/modular_guiders.py:TODO: guiders
modules/processing_class.py:TODO processing: remove duplicate mask params
modules/sd_hijack_hypertile.py:TODO hypertile: vae breaks when using non-standard sizes
modules/sd_models.py:TODO model load: implement model in-memory caching
modules/sd_samplers_diffusers.py:TODO enso-required
modules/sd_unet.py:TODO model load: force-reloading entire model as loading transformers only leads to massive memory usage
modules/transformer_cache.py:TODO fc: autodetect distilled based on model
modules/transformer_cache.py:TODO fc: autodetect tensor format based on model
modules/ui_models_load.py:TODO loader: load receipe
modules/ui_models_load.py:TODO loader: save receipe
modules/video_models/video_save.py:TODO audio set time-base
```
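
For reference, a hypothetical Python equivalent of the scan (the actual `npm run todo` script may differ):

```python
# Hypothetical equivalent of `npm run todo`: scan Python sources for TODO
# comments and print them as `path:comment` lines like the list above.
import pathlib
import re

PATTERN = re.compile(r'#\s*(TODO.*)')  # assumes Python-style comments only

for path in sorted(pathlib.Path('.').rglob('*.py')):
    if 'node_modules' in path.parts or 'venv' in path.parts:
        continue  # skip vendored and virtual-environment files
    text = path.read_text(encoding='utf-8', errors='ignore')
    for line in text.splitlines():
        match = PATTERN.search(line)
        if match:
            print(f'{path.as_posix()}:{match.group(1).strip()}')
```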