* draft for freeinit

* update for v1.x

* Update README.md

* Update features.md

* Update how-to-use.md

* Update README.md

* Update README.md

---------

Co-authored-by: Chengsong Zhang <continuerevolution@gmail.com>
Thiswinex 2024-03-10 17:27:58 +08:00 committed by GitHub
parent 12a503b8b7
commit a390500002
7 changed files with 414 additions and 5 deletions


@@ -17,7 +17,9 @@ You might also be interested in another extension I created: [Segment Anything f
## Update
- [v2.0.0-a](https://github.com/continue-revolution/sd-webui-animatediff/tree/v2.0.0-a) in `03/02/2024`: The whole extension has been reworked to make it easier to maintain.
- Prerequisite: WebUI >= 1.8.0 & ControlNet >=1.1.441
- New feature: ControlNet inpaint / IP-Adapter prompt travel / SparseCtrl / ControlNet keyframe, see [ControlNet V2V](docs/features.md#controlnet-v2v)
- New feature:
- ControlNet inpaint / IP-Adapter prompt travel / SparseCtrl / ControlNet keyframe, see [ControlNet V2V](docs/features.md#controlnet-v2v)
- FreeInit, see [FreeInit](docs/features.md#FreeInit)
- Minor: mm filter based on sd version (click refresh button if you switch between SD1.5 and SDXL) / display extension version in infotext
- Breaking change: You must use Motion LoRA, Hotshot-XL, AnimateDiff V3 Motion Adapter from my [huggingface repo](https://huggingface.co/conrevo/AnimateDiff-A1111/tree/main).
@@ -54,6 +56,7 @@ We thank all developers and community users who contribute to this repository in
- [@limbo0000](https://github.com/limbo0000) for responding to my questions about AnimateDiff
- [@neggles](https://github.com/neggles) and [@s9roll7](https://github.com/s9roll7) for developing [AnimateDiff CLI Prompt Travel](https://github.com/s9roll7/animatediff-cli-prompt-travel)
- [@zappityzap](https://github.com/zappityzap) for developing the majority of the [output features](https://github.com/continue-revolution/sd-webui-animatediff/blob/master/scripts/animatediff_output.py)
- [@thiswinex](https://github.com/thiswinex) for developing FreeInit
- [@lllyasviel](https://github.com/lllyasviel) for adding me as a collaborator of sd-webui-controlnet and offering technical support for Forge
- [@KohakuBlueleaf](https://github.com/KohakuBlueleaf) for helping with FP8 and LCM development
- [@TDS4874](https://github.com/TDS4874) and [@opparco](https://github.com/opparco) for resolving the grey issue which significantly improves performance


@@ -30,6 +30,16 @@ The last line is tail prompt, which is optional. You can write no/single/multipl
smile
```
## FreeInit
FreeInit trades additional inference time for more coherent and temporally consistent video frames.
The default parameters provide satisfactory results for most use cases. Increasing the number of iterations can yield better outcomes, but it also prolongs the processing time. If your video contains more intense or rapid motions, consider switching the filter to Gaussian. For a detailed explanation of each parameter, please refer to the documentation in the [original repository](https://github.com/TianxingWu/FreeInit).
| without FreeInit | with FreeInit (default params) |
| --- | --- |
| ![00003-1234](https://github.com/thiswinex/sd-webui-animatediff/assets/29111172/631e1f4e-5c7e-44b8-bffb-e9f3e95ee304) | ![00002-1234](https://github.com/thiswinex/sd-webui-animatediff/assets/29111172/f4ba7132-7daf-4e26-86cc-766353e79fec) |
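Under the hood, each FreeInit iteration re-diffuses the previous sampling result, keeps its low-frequency layout, and replaces the high frequencies with fresh noise before sampling again. A rough, runnable sketch of that loop (`sample`, `diffuse_to_T`, and `low_pass` are hypothetical stand-ins for the real sampler and filter, not the extension's actual API):

```python
import torch

def freeinit_loop(x, num_iters, sample, diffuse_to_T, low_pass):
    """Sketch of the FreeInit outer loop around the sampler."""
    for i in range(num_iters):
        if i > 0:
            z_T = diffuse_to_T(x)           # re-diffuse the last result to t=T
            z_rand = torch.randn_like(z_T)  # fresh noise for high frequencies
            # keep low frequencies of z_T, take high frequencies of z_rand
            x = low_pass(z_T) + (z_rand - low_pass(z_rand))
        x = sample(x)                       # run the normal sampling loop
    return x
```

Each extra iteration repeats a full sampling pass, which is why inference time grows roughly linearly with the iteration count.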
## ControlNet V2V
You need to go to txt2img / img2img-batch and submit source video or path to frames. Each ControlNet will find control images according to this priority:


@@ -83,5 +83,9 @@ It is quite similar to the way you use ControlNet. API will return a video in ba
1. **Interp X** — Replace each input frame with X interpolated output frames. [#128](https://github.com/continue-revolution/sd-webui-animatediff/pull/128).
1. **Video source** — [Optional] Video source file for [ControlNet V2V](features.md#controlnet-v2v). You MUST enable ControlNet. It will be the source control for ALL ControlNet units that you enable without submitting a single control image to `Single Image` tab or a path to `Batch Folder` tab in ControlNet panel. You can of course submit one control image via `Single Image` tab or an input directory via `Batch Folder` tab, which will override this video source input and work as usual.
1. **Video path** — [Optional] Folder for source frames for [ControlNet V2V](features.md#controlnet-v2v), but higher priority than `Video source`. You MUST enable ControlNet. It will be the source control for ALL ControlNet units that you enable without submitting a control image or a path to ControlNet. You can of course submit one control image via `Single Image` tab or an input directory via `Batch Folder` tab, which will override this video path input and work as usual.
1. **FreeInit** — [Optional] Use FreeInit to improve the temporal consistency of your videos.
    1. The default parameters provide satisfactory results for most use cases.
    1. Use the "Gaussian" filter when your motion is intense.
    1. See the [original FreeInit repository](https://github.com/TianxingWu/FreeInit) for more parameter settings.
See [ControlNet V2V](features.md#controlnet-v2v) for an example parameter fill-in and more explanation.
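For API use, the FreeInit fields can presumably be supplied alongside the other AnimateDiff parameters. A hypothetical `txt2img` payload sketch — the dict-style `args` form and the exact endpoint are assumptions, and the field names mirror the extension's parameter names:

```python
import json

# Hypothetical payload sketch for /sdapi/v1/txt2img -- verify the exact
# args format against the extension's API documentation before use.
payload = {
    "prompt": "1girl, walking",
    "alwayson_scripts": {
        "AnimateDiff": {
            "args": [{
                "enable": True,
                "video_length": 16,
                "freeinit_enable": True,            # turn FreeInit on
                "freeinit_filter": "butterworth",   # "gaussian" for intense motion
                "freeinit_ds": 0.25,
                "freeinit_dt": 0.25,
                "freeinit_iters": 3,
            }]
        }
    },
}
print(json.dumps(payload, indent=2))
```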


@@ -19,6 +19,7 @@ from scripts.animatediff_settings import on_ui_settings
from scripts.animatediff_infotext import update_infotext, infotext_pasted
from scripts.animatediff_utils import get_animatediff_arg
from scripts.animatediff_i2ibatch import * # this is necessary for CN to find the function
from scripts.animatediff_freeinit import AnimateDiffFreeInit
script_dir = scripts.basedir()
motion_module.set_script_dir(script_dir)
@@ -64,6 +65,9 @@ class AnimateDiffScript(scripts.Script):
params.set_p(p)
params.prompt_scheduler = AnimateDiffPromptSchedule(p, params)
update_infotext(p, params)
if params.freeinit_enable:
self.freeinit_hacker = AnimateDiffFreeInit(params)
self.freeinit_hacker.hack(p, params)
self.hacked = True
elif self.hacked:
motion_module.restore(p.sd_model)


@@ -0,0 +1,322 @@
import torch
import torch.fft as fft
import math
import os
import re
import sys
from modules import sd_models, shared, sd_samplers, devices
from modules.sd_samplers_common import InterruptedException
from modules.paths import extensions_builtin_dir
from modules.processing import StableDiffusionProcessing, opt_C, opt_f, StableDiffusionProcessingTxt2Img, StableDiffusionProcessingImg2Img, decode_latent_batch
from types import MethodType
from scripts.animatediff_logger import logger_animatediff as logger
from scripts.animatediff_ui import AnimateDiffProcess
def ddim_add_noise(
original_samples: torch.FloatTensor,
noise: torch.FloatTensor,
timesteps: torch.IntTensor,
) -> torch.FloatTensor:
alphas_cumprod = shared.sd_model.alphas_cumprod
# Make sure alphas_cumprod and timestep have same device and dtype as original_samples
alphas_cumprod = alphas_cumprod.to(device=original_samples.device, dtype=original_samples.dtype)
timesteps = timesteps.to(original_samples.device)
sqrt_alpha_prod = alphas_cumprod[timesteps] ** 0.5
sqrt_alpha_prod = sqrt_alpha_prod.flatten()
while len(sqrt_alpha_prod.shape) < len(original_samples.shape):
sqrt_alpha_prod = sqrt_alpha_prod.unsqueeze(-1)
sqrt_one_minus_alpha_prod = (1 - alphas_cumprod[timesteps]) ** 0.5
sqrt_one_minus_alpha_prod = sqrt_one_minus_alpha_prod.flatten()
while len(sqrt_one_minus_alpha_prod.shape) < len(original_samples.shape):
sqrt_one_minus_alpha_prod = sqrt_one_minus_alpha_prod.unsqueeze(-1)
noisy_samples = sqrt_alpha_prod * original_samples + sqrt_one_minus_alpha_prod * noise
return noisy_samples
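# ddim_add_noise above implements the standard forward-diffusion identity
# z_t = sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps. A self-contained toy
# check of the same math (a synthetic abar schedule stands in for
# shared.sd_model.alphas_cumprod; toy_add_noise is a local stand-in, not
# the function above):
import torch

def toy_add_noise(original, noise, timesteps, alphas_cumprod):
    abar = alphas_cumprod[timesteps].flatten()
    while abar.dim() < original.dim():
        abar = abar.unsqueeze(-1)
    return abar.sqrt() * original + (1 - abar).sqrt() * noise

abar_schedule = torch.linspace(1.0, 0.0, 1000)  # toy: abar_0 = 1, abar_999 = 0
x0 = torch.randn(16, 4, 8, 8)
eps = torch.randn_like(x0)
assert torch.allclose(toy_add_noise(x0, eps, torch.tensor(0), abar_schedule), x0)    # t=0: no noise
assert torch.allclose(toy_add_noise(x0, eps, torch.tensor(999), abar_schedule), eps) # t=T: pure noise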
class AnimateDiffFreeInit:
def __init__(self, params):
self.num_iters = params.freeinit_iters
self.method = params.freeinit_filter
self.d_s = params.freeinit_ds
self.d_t = params.freeinit_dt
@torch.no_grad()
def init_filter(self, video_length, height, width, filter_params):
# initialize frequency filter for noise reinitialization
batch_size = 1
filter_shape = [
batch_size,
opt_C,
video_length,
height // opt_f,
width // opt_f
]
self.freq_filter = get_freq_filter(filter_shape, device=devices.device, params=filter_params)
def hack(self, p: StableDiffusionProcessing, params: AnimateDiffProcess):
# init filter
filter_params = {
'method': self.method,
'd_s': self.d_s,
'd_t': self.d_t,
}
self.init_filter(params.video_length, p.height, p.width, filter_params)
def sample_t2i(self, conditioning, unconditional_conditioning, seeds, subseeds, subseed_strength, prompts):
self.sampler = sd_samplers.create_sampler(self.sampler_name, self.sd_model)
# hack so that the total progress bar works (in an ugly way)
setattr(self.sampler, 'freeinit_num_iters', self.num_freeinit_iters)
setattr(self.sampler, 'freeinit_num_iter', 0)
def callback_hack(self, d):
step = d['i'] // self.freeinit_num_iters + self.freeinit_num_iter * (shared.state.sampling_steps // self.freeinit_num_iters)
if self.stop_at is not None and step > self.stop_at:
raise InterruptedException
shared.state.sampling_step = step
if d['i'] % self.freeinit_num_iters == 0:
shared.total_tqdm.update()
self.sampler.callback_state = MethodType(callback_hack, self.sampler)
# Sampling with FreeInit
x = self.rng.next()
x_dtype = x.dtype
for iter in range(self.num_freeinit_iters):
self.sampler.freeinit_num_iter = iter
if iter == 0:
initial_x = x.detach().clone()
else:
# z_0
diffuse_timesteps = torch.tensor(1000 - 1)
z_T = ddim_add_noise(x, initial_x, diffuse_timesteps) # [16, 4, 64, 64]
# z_T
# 2. create random noise z_rand for high-frequency
z_T = z_T.permute(1, 0, 2, 3)[None, ...] # [bs, 4, 16, 64, 64]
#z_rand = torch.randn(z_T.shape, device=devices.device)
z_rand = initial_x.detach().clone().permute(1, 0, 2, 3)[None, ...]
# 3. Noise Reinitialization
x = freq_mix_3d(z_T.to(dtype=torch.float32), z_rand, LPF=self.freq_filter)
x = x[0].permute(1, 0, 2, 3)
x = x.to(x_dtype)
x = self.sampler.sample(self, x, conditioning, unconditional_conditioning, image_conditioning=self.txt2img_image_conditioning(x))
devices.torch_gc()
samples = x
del x
if not self.enable_hr:
return samples
devices.torch_gc()
if self.latent_scale_mode is None:
decoded_samples = torch.stack(decode_latent_batch(self.sd_model, samples, target_device=devices.cpu, check_for_nans=True)).to(dtype=torch.float32)
else:
decoded_samples = None
with sd_models.SkipWritingToConfig():
sd_models.reload_model_weights(info=self.hr_checkpoint_info)
return self.sample_hr_pass(samples, decoded_samples, seeds, subseeds, subseed_strength, prompts)
def sample_i2i(self, conditioning, unconditional_conditioning, seeds, subseeds, subseed_strength, prompts):
x = self.rng.next()
x_dtype = x.dtype
if self.initial_noise_multiplier != 1.0:
self.extra_generation_params["Noise multiplier"] = self.initial_noise_multiplier
x *= self.initial_noise_multiplier
for iter in range(self.num_freeinit_iters):
if iter == 0:
initial_x = x.detach().clone()
else:
# z_0
diffuse_timesteps = torch.tensor(1000 - 1)
z_T = ddim_add_noise(x, initial_x, diffuse_timesteps) # [16, 4, 64, 64]
# z_T
# 2. create random noise z_rand for high-frequency
z_T = z_T.permute(1, 0, 2, 3)[None, ...] # [bs, 4, 16, 64, 64]
#z_rand = torch.randn(z_T.shape, device=devices.device)
z_rand = initial_x.detach().clone().permute(1, 0, 2, 3)[None, ...]
# 3. Noise Reinitialization
x = freq_mix_3d(z_T.to(dtype=torch.float32), z_rand, LPF=self.freq_filter)
x = x[0].permute(1, 0, 2, 3)
x = x.to(x_dtype)
x = self.sampler.sample_img2img(self, self.init_latent, x, conditioning, unconditional_conditioning, image_conditioning=self.image_conditioning)
samples = x
if self.mask is not None:
samples = samples * self.nmask + self.init_latent * self.mask
del x
devices.torch_gc()
return samples
if isinstance(p, StableDiffusionProcessingTxt2Img):
p.sample = MethodType(sample_t2i, p)
elif isinstance(p, StableDiffusionProcessingImg2Img):
p.sample = MethodType(sample_i2i, p)
else:
raise NotImplementedError
setattr(p, 'freq_filter', self.freq_filter)
setattr(p, 'num_freeinit_iters', self.num_iters)
def freq_mix_3d(x, noise, LPF):
"""
Noise reinitialization.
Args:
x: diffused latent
noise: randomly sampled noise
LPF: low pass filter
"""
# FFT
x_freq = fft.fftn(x, dim=(-3, -2, -1))
x_freq = fft.fftshift(x_freq, dim=(-3, -2, -1))
noise_freq = fft.fftn(noise, dim=(-3, -2, -1))
noise_freq = fft.fftshift(noise_freq, dim=(-3, -2, -1))
# frequency mix
HPF = 1 - LPF
x_freq_low = x_freq * LPF
noise_freq_high = noise_freq * HPF
x_freq_mixed = x_freq_low + noise_freq_high # mix in freq domain
# IFFT
x_freq_mixed = fft.ifftshift(x_freq_mixed, dim=(-3, -2, -1))
x_mixed = fft.ifftn(x_freq_mixed, dim=(-3, -2, -1)).real
return x_mixed
def get_freq_filter(shape, device, params: dict):
"""
Form the frequency filter for noise reinitialization.
Args:
shape: shape of latent (B, C, T, H, W)
params: filter parameters
"""
if params['method'] == "gaussian":
return gaussian_low_pass_filter(shape=shape, d_s=params['d_s'], d_t=params['d_t']).to(device)
elif params['method'] == "ideal":
return ideal_low_pass_filter(shape=shape, d_s=params['d_s'], d_t=params['d_t']).to(device)
elif params['method'] == "box":
return box_low_pass_filter(shape=shape, d_s=params['d_s'], d_t=params['d_t']).to(device)
elif params['method'] == "butterworth":
return butterworth_low_pass_filter(shape=shape, n=4, d_s=params['d_s'], d_t=params['d_t']).to(device)
else:
raise NotImplementedError
def gaussian_low_pass_filter(shape, d_s=0.25, d_t=0.25):
"""
Compute the gaussian low pass filter mask.
Args:
shape: shape of the filter (volume)
d_s: normalized stop frequency for spatial dimensions (0.0-1.0)
d_t: normalized stop frequency for temporal dimension (0.0-1.0)
"""
T, H, W = shape[-3], shape[-2], shape[-1]
mask = torch.zeros(shape)
if d_s==0 or d_t==0:
return mask
for t in range(T):
for h in range(H):
for w in range(W):
d_square = (((d_s/d_t)*(2*t/T-1))**2 + (2*h/H-1)**2 + (2*w/W-1)**2)
mask[..., t,h,w] = math.exp(-1/(2*d_s**2) * d_square)
return mask
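# Note: the triple Python loop above is O(T*H*W); the same mask can be built
# with broadcasting, which is much faster for large volumes. A functionally
# equivalent sketch (not used by the extension):
import torch

def gaussian_low_pass_filter_vectorized(shape, d_s=0.25, d_t=0.25):
    T, H, W = shape[-3], shape[-2], shape[-1]
    if d_s == 0 or d_t == 0:
        return torch.zeros(shape)
    t = torch.arange(T).view(T, 1, 1)
    h = torch.arange(H).view(1, H, 1)
    w = torch.arange(W).view(1, 1, W)
    # same d_square formula as the loop version, computed for all (t, h, w) at once
    d_square = ((d_s / d_t) * (2 * t / T - 1)) ** 2 \
        + (2 * h / H - 1) ** 2 + (2 * w / W - 1) ** 2
    return torch.exp(-1 / (2 * d_s ** 2) * d_square).expand(shape)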
def butterworth_low_pass_filter(shape, n=4, d_s=0.25, d_t=0.25):
"""
Compute the butterworth low pass filter mask.
Args:
shape: shape of the filter (volume)
n: order of the filter, larger n ~ ideal, smaller n ~ gaussian
d_s: normalized stop frequency for spatial dimensions (0.0-1.0)
d_t: normalized stop frequency for temporal dimension (0.0-1.0)
"""
T, H, W = shape[-3], shape[-2], shape[-1]
mask = torch.zeros(shape)
if d_s==0 or d_t==0:
return mask
for t in range(T):
for h in range(H):
for w in range(W):
d_square = (((d_s/d_t)*(2*t/T-1))**2 + (2*h/H-1)**2 + (2*w/W-1)**2)
mask[..., t,h,w] = 1 / (1 + (d_square / d_s**2)**n)
return mask
def ideal_low_pass_filter(shape, d_s=0.25, d_t=0.25):
"""
Compute the ideal low pass filter mask.
Args:
shape: shape of the filter (volume)
d_s: normalized stop frequency for spatial dimensions (0.0-1.0)
d_t: normalized stop frequency for temporal dimension (0.0-1.0)
"""
T, H, W = shape[-3], shape[-2], shape[-1]
mask = torch.zeros(shape)
if d_s==0 or d_t==0:
return mask
for t in range(T):
for h in range(H):
for w in range(W):
d_square = (((d_s/d_t)*(2*t/T-1))**2 + (2*h/H-1)**2 + (2*w/W-1)**2)
mask[..., t,h,w] = 1 if d_square <= d_s*2 else 0
return mask
def box_low_pass_filter(shape, d_s=0.25, d_t=0.25):
"""
Compute the box low pass filter mask (an approximation of the ideal filter).
Args:
shape: shape of the filter (volume)
d_s: normalized stop frequency for spatial dimensions (0.0-1.0)
d_t: normalized stop frequency for temporal dimension (0.0-1.0)
"""
T, H, W = shape[-3], shape[-2], shape[-1]
mask = torch.zeros(shape)
if d_s==0 or d_t==0:
return mask
threshold_s = round(int(H // 2) * d_s)
threshold_t = round(T // 2 * d_t)
cframe, crow, ccol = T // 2, H // 2, W //2
mask[..., cframe - threshold_t:cframe + threshold_t, crow - threshold_s:crow + threshold_s, ccol - threshold_s:ccol + threshold_s] = 1.0
return mask
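# A self-contained toy run of the reinitialization step on random tensors,
# using a box-style mask (same slicing as box_low_pass_filter) and the same
# FFT mix as freq_mix_3d; shapes and values here are illustrative only:
import torch
import torch.fft as fft

shape = [1, 4, 8, 8, 8]                    # toy [B, C, T, H, W] latent volume
z_T = torch.randn(shape)                   # stand-in for the re-diffused latent
z_rand = torch.randn(shape)                # fresh high-frequency noise
T, H, W = shape[-3], shape[-2], shape[-1]
LPF = torch.zeros(T, H, W)                 # box mask: 1 in the centered cube
LPF[T // 2 - 1:T // 2 + 1, H // 2 - 1:H // 2 + 1, W // 2 - 1:W // 2 + 1] = 1.0

# keep low frequencies of z_T, take high frequencies of z_rand
x_freq = fft.fftshift(fft.fftn(z_T, dim=(-3, -2, -1)), dim=(-3, -2, -1))
n_freq = fft.fftshift(fft.fftn(z_rand, dim=(-3, -2, -1)), dim=(-3, -2, -1))
mixed_freq = x_freq * LPF + n_freq * (1 - LPF)
mixed = fft.ifftn(fft.ifftshift(mixed_freq, dim=(-3, -2, -1)), dim=(-3, -2, -1)).real
assert mixed.shape == torch.Size(shape)    # real-valued, same shape as the inputs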


@@ -156,9 +156,10 @@ class AnimateDiffInfV2V:
mm_cn_restore(_context)
return x_out
logger.info("inner model forward hooked")
cfg_params.denoiser.inner_model.original_forward = cfg_params.denoiser.inner_model.forward
cfg_params.denoiser.inner_model.forward = MethodType(mm_sd_forward, cfg_params.denoiser.inner_model)
if getattr(cfg_params.denoiser.inner_model, 'original_forward', None) is None:
logger.info("inner model forward hooked")
cfg_params.denoiser.inner_model.original_forward = cfg_params.denoiser.inner_model.forward
cfg_params.denoiser.inner_model.forward = MethodType(mm_sd_forward, cfg_params.denoiser.inner_model)
cfg_params.text_cond = ad_params.text_cond
ad_params.step = cfg_params.denoiser.step


@@ -46,6 +46,11 @@ class AnimateDiffProcess:
video_source=None,
video_path='',
mask_path='',
freeinit_enable=False,
freeinit_filter="butterworth",
freeinit_ds=0.25,
freeinit_dt=0.25,
freeinit_iters=3,
latent_power=1,
latent_scale=32,
last_frame=None,
@@ -68,6 +73,11 @@ class AnimateDiffProcess:
self.video_source = video_source
self.video_path = video_path
self.mask_path = mask_path
self.freeinit_enable = freeinit_enable
self.freeinit_filter = freeinit_filter
self.freeinit_ds = freeinit_ds
self.freeinit_dt = freeinit_dt
self.freeinit_iters = freeinit_iters
self.latent_power = latent_power
self.latent_scale = latent_scale
self.last_frame = last_frame
@@ -82,7 +92,7 @@ class AnimateDiffProcess:
def get_list(self, is_img2img: bool):
return list(vars(self).values())[:(20 if is_img2img else 15)]
return list(vars(self).values())[:(25 if is_img2img else 20)]
def get_dict(self, is_img2img: bool):
@@ -97,6 +107,7 @@ class AnimateDiffProcess:
"overlap": self.overlap,
"interp": self.interp,
"interp_x": self.interp_x,
"freeinit_enable": self.freeinit_enable,
}
if self.request_id:
infotext['request_id'] = self.request_id
@@ -233,6 +244,14 @@
self.params = AnimateDiffProcess()
AnimateDiffUiGroup.animatediff_ui_group.append(self)
# Free-init
self.filter_type_list = [
"butterworth",
"gaussian",
"box",
"ideal"
]
def get_model_list(self):
model_dir = motion_module.get_model_dir()
@@ -350,6 +369,52 @@
value=self.params.interp_x, label="Interp X", precision=0,
elem_id=f"{elemid_prefix}interp-x"
)
with gr.Accordion("FreeInit Params", open=False):
gr.Markdown(
"""
Adjust these parameters to control the smoothness of the generated video.
"""
)
self.params.freeinit_enable = gr.Checkbox(
value=self.params.freeinit_enable,
label="Enable FreeInit",
elem_id=f"{elemid_prefix}freeinit-enable"
)
self.params.freeinit_filter = gr.Dropdown(
value=self.params.freeinit_filter,
label="Filter Type",
info="Defaults to Butterworth. To fix large inconsistencies, consider using Gaussian.",
choices=self.filter_type_list,
interactive=True,
elem_id=f"{elemid_prefix}freeinit-filter"
)
self.params.freeinit_ds = gr.Slider(
value=self.params.freeinit_ds,
minimum=0,
maximum=1,
step=0.125,
label="d_s",
info="Stop frequency for spatial dimensions (0.0-1.0)",
elem_id=f"{elemid_prefix}freeinit-ds"
)
self.params.freeinit_dt = gr.Slider(
value=self.params.freeinit_dt,
minimum=0,
maximum=1,
step=0.125,
label="d_t",
info="Stop frequency for temporal dimension (0.0-1.0)",
elem_id=f"{elemid_prefix}freeinit-dt"
)
self.params.freeinit_iters = gr.Slider(
value=self.params.freeinit_iters,
minimum=2,
maximum=5,
step=1,
label="FreeInit Iterations",
info="Larger values lead to smoother results but longer inference time.",
elem_id=f"{elemid_prefix}freeinit-iters",
)
self.params.video_source = gr.Video(
value=self.params.video_source,
label="Video source",