release code and model

2023-08-16 16:00:11 +08:00 · 2023-08-16 16:00:11 +08:00 · 3006c41947
parent b2424a50df
commit 3006c41947
24 changed files with 1971 additions and 1 deletions
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
- # IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
+# IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

 <div align="center">

@@ -7,3 +7,61 @@

 ---

+
+## Introduction
+
+We present IP-Adapter, an effective and lightweight
+adapter to achieve image prompt capability for the pretrained
+text-to-image diffusion models. An IP-Adapter
+with only 22M parameters can achieve performance comparable to, or even
+better than, a fine-tuned image prompt model. IP-Adapter
+can be generalized not only to other custom models
+fine-tuned from the same base model, but also to controllable
+generation using existing controllable tools. Moreover, the image prompt
+can also work well with the text prompt to accomplish multimodal
+image generation.
+
+![arch](assets/figs/fig1.png)
+
+## Release
+- [2023/8/16] 🔥 We release the code and models.
+
+
+## Dependencies
+- diffusers >= 0.19.3
+
+## Download Models
+
+You can download models from [here](https://huggingface.co/h94/IP-Adapter). To run the demo, you should also download the following models:
+- [runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5)
+- [stabilityai/sd-vae-ft-mse](https://huggingface.co/stabilityai/sd-vae-ft-mse)
+- [SG161222/Realistic_Vision_V4.0_noVAE](https://huggingface.co/SG161222/Realistic_Vision_V4.0_noVAE)
+
+## How to Use
+
+- [**ip_adapter_demo**](ip_adapter_demo.ipynb): image variations, image-to-image, and inpainting with image prompt.
+
+![image variations](assets/demo/image_variations.jpg)
+
+![image-to-image](assets/demo/image-to-image.jpg)
+
+![inpainting](assets/demo/inpainting.jpg)
+- [**ip_adapter_controlnet_demo**](ip_adapter_controlnet_demo.ipynb): structural generation with image prompt.
+
+![structural_cond](assets/demo/structural_cond.jpg)
+
+- [**ip_adapter_multimodal_prompts_demo**](ip_adapter_multimodal_prompts_demo.ipynb): generation with multimodal prompts.
+
+![multi_prompts](assets/demo/multi_prompts.jpg)
+
+
+## Citation
+If you find IP-Adapter useful for your research and applications, please cite using this BibTeX:
+```bibtex
+@article{ye2023ip-adapter,
+  title={IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models},
+  author={Ye, Hu and Zhang, Jun and Liu, Sibo and Han, Xiao and Yang, Wei},
+  journal={arXiv preprint arXiv:2308.06721},
+  year={2023}
+}
+```
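The README's Dependencies section asks for `diffusers >= 0.19.3`. A minimal sketch of checking that requirement before running the demos is shown below. The helper names (`parse_version`, `meets_minimum`, `check_package`) are hypothetical and not part of this commit, and the comparison looks only at the leading numeric release components, so a pre-release such as `0.19.3.dev0` is treated like `0.19.3`.

```python
import re
from importlib import metadata


def parse_version(text):
    """Return the leading numeric release components, e.g. '0.19.3' -> (0, 19, 3)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, required="0.19.3"):
    """Compare two version strings by their numeric release components only."""
    return parse_version(installed) >= parse_version(required)


def check_package(name="diffusers", required="0.19.3"):
    """Return True if `name` is installed and at least `required` (hypothetical helper)."""
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return False
    return meets_minimum(installed, required)
```

Calling `check_package()` at the top of a notebook gives an early, readable failure instead of an import error deep inside the pipeline.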
--- a/assets/demo/image-to-image.jpg
+++ b/assets/demo/image-to-image.jpg
--- a/assets/demo/image_variations.jpg
+++ b/assets/demo/image_variations.jpg
--- a/assets/demo/inpainting.jpg
+++ b/assets/demo/inpainting.jpg
--- a/assets/demo/multi_prompts.jpg
+++ b/assets/demo/multi_prompts.jpg
--- a/assets/demo/structural_cond.jpg
+++ b/assets/demo/structural_cond.jpg
--- a/assets/figs/fig1.png
+++ b/assets/figs/fig1.png
--- a/assets/images/girl.png
+++ b/assets/images/girl.png
--- a/assets/images/river.png
+++ b/assets/images/river.png
--- a/assets/images/statue.png
+++ b/assets/images/statue.png
--- a/assets/images/stone.png
+++ b/assets/images/stone.png
--- a/assets/images/vermeer.jpg
+++ b/assets/images/vermeer.jpg
--- a/assets/images/woman.png
+++ b/assets/images/woman.png
--- a/assets/inpainting/image.png
+++ b/assets/inpainting/image.png
--- a/assets/inpainting/mask.png
+++ b/assets/inpainting/mask.png
--- a/assets/structure_controls/depth.png
+++ b/assets/structure_controls/depth.png
--- a/assets/structure_controls/openpose.png
+++ b/assets/structure_controls/openpose.png
--- a/ip_adapter/__init__.py
+++ b/ip_adapter/__init__.py
@@ -0,0 +1 @@
+from .ip_adapter import IPAdapter
--- a/ip_adapter/attention_processor.py
+++ b/ip_adapter/attention_processor.py
@@ -0,0 +1,176 @@
+import torch
+import torch.nn as nn
+
+
+class AttnProcessor(nn.Module):
+    r"""
+    Default processor for performing attention-related computations.
+    """
+    def __init__(
+        self,
+        hidden_size=None,
+        cross_attention_dim=None,
+    ):
+        super().__init__()
+
+    def __call__(
+        self,
+        attn,
+        hidden_states,
+        encoder_hidden_states=None,
+        attention_mask=None,
+        temb=None,
+    ):
+        residual = hidden_states
+
+        if attn.spatial_norm is not None:
+            hidden_states = attn.spatial_norm(hidden_states, temb)
+
+        input_ndim = hidden_states.ndim
+
+        if input_ndim == 4:
+            batch_size, channel, height, width = hidden_states.shape
+            hidden_states = hidden_states.view(batch_size, channel, height * width).transpose(1, 2)
+
+        batch_size, sequence_length, _ = (
+            hidden_states.shape if encoder_hidden_states is None else encoder_hidden_states.shape
+        )
+        attention_mask = attn.prepare_attention_mask(attention_mask, sequence_length, batch_size)
+
+        if attn.group_norm is not None:
+            hidden_states = attn.group_norm(hidden_states.transpose(1, 2)).transpose(1, 2)
+
+        query = attn.to_q(hidden_states)
+
+        if encoder_hidden_states is None:
+            encoder_hidden_states = hidden_states
+        elif attn.norm_cross:
+            encoder_hidden_states = attn.norm_encoder_hidden_states(encoder_hidden_states)
+
+        key = attn.to_k(encoder_hidden_states)
+        value = attn.to_v(encoder_hidden_states)
+
+        query = attn.head_to_batch_dim(query)
+        key = attn.head_to_batch_dim(key)
+        value = attn.head_to_batch_dim(value)
+
+        attention_probs = attn.get_attention_scores(query, key, attention_mask)
+        hidden_states = torch.bmm(attention_probs, value)
+        hidden_states = attn.batch_to_head_dim(hidden_states)
+
+        # linear proj
+        hidden_states = attn.to_out[0](hidden_states)
+        # dropout
+        hidden_states = attn.to_out[1](hidden_states)
+
+        if input_ndim == 4:
+            hidden_states = hidden_states.transpose(-1, -2).reshape(batch_size, channel, height, width)
+
+        if attn.residual_connection:
+            hidden_states = hidden_states + residual
+
+        hidden_states = hidden_states / attn.rescale_output_factor
+
+        return hidden_states
+    
+    
+class IPAttnProcessor(nn.Module):
+    r"""
+    Attention processor for IP-Adapter.
+    Args:
+        hidden_size (`int`):
+            The hidden size of the attention layer.
+        cross_attention_dim (`int`):
+            The number of channels in the `encoder_hidden_states`.
+        text_context_len (`int`, defaults to 77):
+            The context length of the text features.
+        scale (`float`, defaults to 1.0):
+            The weight scale of the image prompt.
+    """
+
+    def __init__(self, hidden_size, cross_attention_dim=None, text_context_len=77, scale=1.0):
+        super().__init__()
+
+        self.hidden_size = hidden_size
+        self.cross_attention_dim = cross_attention_dim
+        self.text_context_len = text_context_len
+        self.scale = scale
+
+        self.to_k_ip = nn.Linear(cross_attention_dim or hidden_size, hidden_size, bias=False)
+        self.to_v_ip = nn.Linear(cross_attention_dim or hidden_size, hidden_size, bias=False)
+
+    def __call__(
+        self,
+        attn,
+        hidden_states,
+        encoder_hidden_states=None,
+        attention_mask=None,
+        temb=None,
+    ):
+        residual = hidden_states
+
+        if attn.spatial_norm is not None:
+            hidden_states = attn.spatial_norm(hidden_states, temb)
+
+        input_ndim = hidden_states.ndim
+
+        if input_ndim == 4:
+            batch_size, channel, height, width = hidden_states.shape
+            hidden_states = hidden_states.view(batch_size, channel, height * width).transpose(1, 2)
+
+        batch_size, sequence_length, _ = (
+            hidden_states.shape if encoder_hidden_states is None else encoder_hidden_states.shape
+        )
+        attention_mask = attn.prepare_attention_mask(attention_mask, sequence_length, batch_size)
+
+        if attn.group_norm is not None:
+            hidden_states = attn.group_norm(hidden_states.transpose(1, 2)).transpose(1, 2)
+
+        query = attn.to_q(hidden_states)
+
+        if encoder_hidden_states is None:
+            encoder_hidden_states = hidden_states
+        elif attn.norm_cross:
+            encoder_hidden_states = attn.norm_encoder_hidden_states(encoder_hidden_states)
+            
+        # split hidden states
+        encoder_hidden_states, ip_hidden_states = encoder_hidden_states[:, :self.text_context_len, :], encoder_hidden_states[:, self.text_context_len:, :]
+
+        key = attn.to_k(encoder_hidden_states)
+        value = attn.to_v(encoder_hidden_states)
+
+        query = attn.head_to_batch_dim(query)
+        key = attn.head_to_batch_dim(key)
+        value = attn.head_to_batch_dim(value)
+
+        attention_probs = attn.get_attention_scores(query, key, attention_mask)
+        hidden_states = torch.bmm(attention_probs, value)
+        hidden_states = attn.batch_to_head_dim(hidden_states)
+        
+        # for ip-adapter
+        ip_key = self.to_k_ip(ip_hidden_states)
+        ip_value = self.to_v_ip(ip_hidden_states)
+        
+        ip_key = attn.head_to_batch_dim(ip_key)
+        ip_value = attn.head_to_batch_dim(ip_value)
+        
+        ip_attention_probs = attn.get_attention_scores(query, ip_key, None)
+        ip_hidden_states = torch.bmm(ip_attention_probs, ip_value)
+        ip_hidden_states = attn.batch_to_head_dim(ip_hidden_states)
+        
+        hidden_states = hidden_states + self.scale * ip_hidden_states
+
+        # linear proj
+        hidden_states = attn.to_out[0](hidden_states)
+        # dropout
+        hidden_states = attn.to_out[1](hidden_states)
+
+        if input_ndim == 4:
+            hidden_states = hidden_states.transpose(-1, -2).reshape(batch_size, channel, height, width)
+
+        if attn.residual_connection:
+            hidden_states = hidden_states + residual
+
+        hidden_states = hidden_states / attn.rescale_output_factor
+
+        return hidden_states
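`IPAttnProcessor` above splits `encoder_hidden_states` at `text_context_len`, attends to the text tokens and the image tokens separately, and adds the image branch scaled by `scale`. The sketch below reproduces only that arithmetic in plain Python for a single query vector and a single head. It is an illustration under simplifying assumptions, not the commit's implementation: the key/value projections (`to_k`, `to_v`, `to_k_ip`, `to_v_ip`) are replaced by the identity, there is no attention mask, and at least one token is assumed on each side of the split.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]


def decoupled_attention(query, context, text_context_len, scale=1.0):
    """Attend to text and image tokens separately, then sum the two outputs."""
    # Split the concatenated context, as the processor does with encoder_hidden_states.
    text_tokens = context[:text_context_len]
    image_tokens = context[text_context_len:]
    # Identity projections stand in for the learned key/value layers (assumption).
    text_out = attend(query, text_tokens, text_tokens)
    image_out = attend(query, image_tokens, image_tokens)
    return [t + scale * i for t, i in zip(text_out, image_out)]
```

With `scale=0.0` the result equals text-only attention, which matches how `set_scale` is used to weaken or disable the image prompt.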
--- a/ip_adapter/ip_adapter.py
+++ b/ip_adapter/ip_adapter.py
@@ -0,0 +1,184 @@
+import os
+from typing import List
+
+import torch
+from diffusers import StableDiffusionPipeline
+from transformers import CLIPVisionModelWithProjection, CLIPImageProcessor
+from PIL import Image
+
+from .attention_processor import IPAttnProcessor, AttnProcessor
+
+
+class ImageProjModel(torch.nn.Module):
+    """Projection Model"""
+    def __init__(self, cross_attention_dim=1024, clip_embeddings_dim=1024, clip_extra_context_tokens=4):
+        super().__init__()
+        
+        self.cross_attention_dim = cross_attention_dim
+        self.clip_extra_context_tokens = clip_extra_context_tokens
+        self.proj = torch.nn.Linear(clip_embeddings_dim, self.clip_extra_context_tokens * cross_attention_dim)
+        self.norm = torch.nn.LayerNorm(cross_attention_dim)
+        
+    def forward(self, image_embeds):
+        embeds = image_embeds
+        clip_extra_context_tokens = self.proj(embeds).reshape(-1, self.clip_extra_context_tokens, self.cross_attention_dim)
+        clip_extra_context_tokens = self.norm(clip_extra_context_tokens)
+        return clip_extra_context_tokens
+
+
+class IPAdapter:
+    
+    def __init__(self, sd_pipe, image_encoder_path, ip_ckpt, device):
+        
+        self.device = device
+        self.image_encoder_path = image_encoder_path
+        self.ip_ckpt = ip_ckpt
+        
+        self.pipe = sd_pipe.to(self.device)
+        self.set_ip_adapter()
+        
+        # load image encoder
+        self.image_encoder = CLIPVisionModelWithProjection.from_pretrained(self.image_encoder_path).to(self.device, dtype=torch.float16)
+        self.clip_image_processor = CLIPImageProcessor()
+        # image proj model
+        self.image_proj_model = ImageProjModel(cross_attention_dim=768, clip_embeddings_dim=1024,
+                clip_extra_context_tokens=4).to(self.device, dtype=torch.float16)
+        
+        self.load_ip_adapter()
+        
+    def set_ip_adapter(self):
+        unet = self.pipe.unet
+        attn_procs = {}
+        for name in unet.attn_processors.keys():
+            cross_attention_dim = None if name.endswith("attn1.processor") else unet.config.cross_attention_dim
+            if name.startswith("mid_block"):
+                hidden_size = unet.config.block_out_channels[-1]
+            elif name.startswith("up_blocks"):
+                block_id = int(name[len("up_blocks.")])
+                hidden_size = list(reversed(unet.config.block_out_channels))[block_id]
+            elif name.startswith("down_blocks"):
+                block_id = int(name[len("down_blocks.")])
+                hidden_size = unet.config.block_out_channels[block_id]
+            if cross_attention_dim is None:
+                attn_procs[name] = AttnProcessor()
+            else:
+                attn_procs[name] = IPAttnProcessor(hidden_size=hidden_size, cross_attention_dim=cross_attention_dim,
+                scale=1.0).to(self.device, dtype=torch.float16)
+        unet.set_attn_processor(attn_procs)
+        
+    def load_ip_adapter(self):
+        state_dict = torch.load(self.ip_ckpt, map_location="cpu")
+        self.image_proj_model.load_state_dict(state_dict["image_proj"])
+        ip_layers = torch.nn.ModuleList(self.pipe.unet.attn_processors.values())
+        ip_layers.load_state_dict(state_dict["ip_adapter"])
+        
+    @torch.inference_mode()
+    def get_image_embeds(self, pil_image):
+        if isinstance(pil_image, Image.Image):
+            pil_image = [pil_image]
+        clip_image = self.clip_image_processor(images=pil_image, return_tensors="pt").pixel_values
+        clip_image_embeds = self.image_encoder(clip_image.to(self.device, dtype=torch.float16)).image_embeds
+        image_prompt_embeds = self.image_proj_model(clip_image_embeds)
+        uncond_image_prompt_embeds = self.image_proj_model(torch.zeros_like(clip_image_embeds))
+        return image_prompt_embeds, uncond_image_prompt_embeds
+    
+    def set_scale(self, scale):
+        for attn_processor in self.pipe.unet.attn_processors.values():
+            if isinstance(attn_processor, IPAttnProcessor):
+                attn_processor.scale = scale
+        
+    def generate(
+        self,
+        pil_image,
+        prompt=None,
+        negative_prompt=None,
+        scale=1.0,
+        num_samples=4,
+        seed=-1,
+        guidance_scale=7.5,
+        num_inference_steps=30,
+        **kwargs,
+    ):
+        self.set_scale(scale)
+        
+        if isinstance(pil_image, Image.Image):
+            num_prompts = 1
+        else:
+            num_prompts = len(pil_image)
+        
+        if prompt is None:
+            prompt = "best quality, high quality"
+        if negative_prompt is None:
+            negative_prompt = "monochrome, lowres, bad anatomy, worst quality, low quality"
+            
+        if not isinstance(prompt, List):
+            prompt = [prompt] * num_prompts
+        if not isinstance(negative_prompt, List):
+            negative_prompt = [negative_prompt] * num_prompts
+        
+        image_prompt_embeds, uncond_image_prompt_embeds = self.get_image_embeds(pil_image)
+        bs_embed, seq_len, _ = image_prompt_embeds.shape
+        image_prompt_embeds = image_prompt_embeds.repeat(1, num_samples, 1)
+        image_prompt_embeds = image_prompt_embeds.view(bs_embed * num_samples, seq_len, -1)
+        uncond_image_prompt_embeds = uncond_image_prompt_embeds.repeat(1, num_samples, 1)
+        uncond_image_prompt_embeds = uncond_image_prompt_embeds.view(bs_embed * num_samples, seq_len, -1)
+
+        with torch.inference_mode():
+            prompt_embeds = self.pipe._encode_prompt(prompt, self.device, num_samples, True, negative_prompt)
+            negative_prompt_embeds_, prompt_embeds_ = prompt_embeds.chunk(2)
+            prompt_embeds = torch.cat([prompt_embeds_, image_prompt_embeds], dim=1)
+            negative_prompt_embeds = torch.cat([negative_prompt_embeds_, uncond_image_prompt_embeds], dim=1)
+            
+        generator = torch.Generator(self.device).manual_seed(seed) if seed is not None else None
+        images = self.pipe(
+            prompt_embeds=prompt_embeds,
+            negative_prompt_embeds=negative_prompt_embeds,
+            guidance_scale=guidance_scale,
+            num_inference_steps=num_inference_steps,
+            generator=generator,
+            **kwargs,
+        ).images
+        
+        return images
+
+    
+def image_grid(imgs, rows, cols):
+    assert len(imgs) == rows*cols
+
+    w, h = imgs[0].size
+    grid = Image.new('RGB', size=(cols*w, rows*h))
+    grid_w, grid_h = grid.size
+    
+    for i, img in enumerate(imgs):
+        grid.paste(img, box=(i%cols*w, i//cols*h))
+    return grid
+    
+    
+if __name__ == "__main__":
+    base_model_path = "/mnt/aigc_cq/shared/txt2img_models/Realistic_Vision_V5.1_noVAE/"
+    image_encoder_path = "/mnt/aigc_cq/private/huye/t2i_trained_models/ip_adapter_sd15_clip-H/image_encoder/"
+    ip_ckpt = "/mnt/aigc_cq/private/huye/t2i_trained_models/ip_adapter_sd15_clip-H/ip-dapter_1000000.bin"
+    device = "cuda:3"
+    
+    
+    pipe = StableDiffusionPipeline.from_pretrained(
+            base_model_path,
+            torch_dtype=torch.float16,
+            feature_extractor=None,
+            safety_checker=None,
+    )
+    
+    
+    ip_model = IPAdapter(pipe, image_encoder_path, ip_ckpt, device)
+    
+    image_files = ["../assets/Taylor_Swift.png", "../assets/3.png"]
+    num_samples = 2
+    pil_images = [Image.open(image_file) for image_file in image_files]
+    
+    images = ip_model.generate(pil_image=pil_images, num_samples=num_samples)
+    grid = image_grid(images, 1, 4)
+    grid.save("output.png")
+    
+    images = ip_model.generate(pil_image=pil_images, num_samples=num_samples, prompt="best quality, high quality, wearing a hat on the beach", scale=0.5)
+    grid = image_grid(images, 1, 4)
+    grid.save("output_hat.png")
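`set_ip_adapter` above derives each attention layer's `hidden_size` from the processor name and the UNet's `block_out_channels`, and treats `attn1` processors as self-attention. The standalone sketch below isolates that name parsing so it can be checked without loading a model. The function names are hypothetical, and the channel tuple in the usage note is only an example value for a Stable Diffusion 1.5 style UNet. As in the original, the block index is read as a single character, so this assumes fewer than ten blocks per stage.

```python
def hidden_size_for(name, block_out_channels):
    """Map an attention-processor name to the channel width of its UNet block."""
    if name.startswith("mid_block"):
        return block_out_channels[-1]
    if name.startswith("up_blocks"):
        # Up blocks run from the widest stage back to the narrowest.
        block_id = int(name[len("up_blocks.")])
        return list(reversed(block_out_channels))[block_id]
    if name.startswith("down_blocks"):
        block_id = int(name[len("down_blocks.")])
        return block_out_channels[block_id]
    raise ValueError(f"unexpected attention processor name: {name!r}")


def is_cross_attention(name):
    """attn1 is self-attention; any other processor is treated as cross-attention."""
    return not name.endswith("attn1.processor")
```

For example, with `block_out_channels = (320, 640, 1280, 1280)`, the name `down_blocks.1.attentions.0.transformer_blocks.0.attn2.processor` maps to a cross-attention layer of width 640. Unlike the loop in the commit, this sketch raises on an unrecognised prefix instead of reusing the previous iteration's `hidden_size`.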
--- a/ip_adapter/utils.py
+++ b/ip_adapter/utils.py
@@ -0,0 +1,362 @@
+import inspect
+import warnings
+from typing import Any, Callable, Dict, List, Optional, Tuple, Union
+
+import numpy as np
+import PIL.Image
+import torch
+from diffusers.utils import is_compiled_module
+from diffusers.pipelines.controlnet.multicontrolnet import MultiControlNetModel
+from diffusers.models import ControlNetModel
+from diffusers.pipelines.stable_diffusion import StableDiffusionPipelineOutput
+
+
+@torch.no_grad()
+def generate(
+    self,
+    prompt: Union[str, List[str]] = None,
+    image: Union[
+        torch.FloatTensor,
+        PIL.Image.Image,
+        np.ndarray,
+        List[torch.FloatTensor],
+        List[PIL.Image.Image],
+        List[np.ndarray],
+    ] = None,
+    height: Optional[int] = None,
+    width: Optional[int] = None,
+    num_inference_steps: int = 50,
+    guidance_scale: float = 7.5,
+    negative_prompt: Optional[Union[str, List[str]]] = None,
+    num_images_per_prompt: Optional[int] = 1,
+    eta: float = 0.0,
+    generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
+    latents: Optional[torch.FloatTensor] = None,
+    prompt_embeds: Optional[torch.FloatTensor] = None,
+    negative_prompt_embeds: Optional[torch.FloatTensor] = None,
+    output_type: Optional[str] = "pil",
+    return_dict: bool = True,
+    callback: Optional[Callable[[int, int, torch.FloatTensor], None]] = None,
+    callback_steps: int = 1,
+    cross_attention_kwargs: Optional[Dict[str, Any]] = None,
+    controlnet_conditioning_scale: Union[float, List[float]] = 1.0,
+    guess_mode: bool = False,
+    control_guidance_start: Union[float, List[float]] = 0.0,
+    control_guidance_end: Union[float, List[float]] = 1.0,
+):
+    r"""
+    Function invoked when calling the pipeline for generation.
+
+    Args:
+        prompt (`str` or `List[str]`, *optional*):
+            The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`
+            instead.
+        image (`torch.FloatTensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.FloatTensor]`, `List[PIL.Image.Image]`, `List[np.ndarray]`,
+                `List[List[torch.FloatTensor]]`, `List[List[np.ndarray]]` or `List[List[PIL.Image.Image]]`):
+            The ControlNet input condition. ControlNet uses this input condition to generate guidance to Unet. If
+            the type is specified as `torch.FloatTensor`, it is passed to ControlNet as is. `PIL.Image.Image` can
+            also be accepted as an image. The dimensions of the output image default to `image`'s dimensions. If
+            height and/or width are passed, `image` is resized according to them. If multiple ControlNets are
+            specified in init, images must be passed as a list such that each element of the list can be correctly
+            batched for input to a single controlnet.
+        height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
+            The height in pixels of the generated image.
+        width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
+            The width in pixels of the generated image.
+        num_inference_steps (`int`, *optional*, defaults to 50):
+            The number of denoising steps. More denoising steps usually lead to a higher quality image at the
+            expense of slower inference.
+        guidance_scale (`float`, *optional*, defaults to 7.5):
+            Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
+            `guidance_scale` is defined as `w` of equation 2. of [Imagen
+            Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
+            1`. A higher guidance scale encourages the model to generate images closely linked to the text `prompt`,
+            usually at the expense of lower image quality.
+        negative_prompt (`str` or `List[str]`, *optional*):
+            The prompt or prompts not to guide the image generation. If not defined, one has to pass
+            `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
+            less than `1`).
+        num_images_per_prompt (`int`, *optional*, defaults to 1):
+            The number of images to generate per prompt.
+        eta (`float`, *optional*, defaults to 0.0):
+            Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to
+            [`schedulers.DDIMScheduler`], will be ignored for others.
+        generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
+            One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
+            to make generation deterministic.
+        latents (`torch.FloatTensor`, *optional*):
+            Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
+            generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
+            tensor will be generated by sampling using the supplied random `generator`.
+        prompt_embeds (`torch.FloatTensor`, *optional*):
+            Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
+            provided, text embeddings will be generated from `prompt` input argument.
+        negative_prompt_embeds (`torch.FloatTensor`, *optional*):
+            Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
+            weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
+            argument.
+        output_type (`str`, *optional*, defaults to `"pil"`):
+            The output format of the generated image. Choose between
+            [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
+        return_dict (`bool`, *optional*, defaults to `True`):
+            Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a
+            plain tuple.
+        callback (`Callable`, *optional*):
+            A function that will be called every `callback_steps` steps during inference. The function will be
+            called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
+        callback_steps (`int`, *optional*, defaults to 1):
+            The frequency at which the `callback` function will be called. If not specified, the callback will be
+            called at every step.
+        cross_attention_kwargs (`dict`, *optional*):
+            A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
+            `self.processor` in
+            [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
+        controlnet_conditioning_scale (`float` or `List[float]`, *optional*, defaults to 1.0):
+            The outputs of the controlnet are multiplied by `controlnet_conditioning_scale` before they are added
+            to the residual in the original unet. If multiple ControlNets are specified in init, you can set the
+            corresponding scale as a list.
+        guess_mode (`bool`, *optional*, defaults to `False`):
+            In this mode, the ControlNet encoder will try its best to recognize the content of the input image even if
+            you remove all prompts. The `guidance_scale` between 3.0 and 5.0 is recommended.
+        control_guidance_start (`float` or `List[float]`, *optional*, defaults to 0.0):
+            The percentage of total steps at which the controlnet starts applying.
+        control_guidance_end (`float` or `List[float]`, *optional*, defaults to 1.0):
+            The percentage of total steps at which the controlnet stops applying.
+
+    Examples:
+
+    Returns:
+        [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`:
+        [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple`.
+        When returning a tuple, the first element is a list with the generated images, and the second element is a
+        list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work"
+        (nsfw) content, according to the `safety_checker`.
+    """
+    controlnet = self.controlnet._orig_mod if is_compiled_module(self.controlnet) else self.controlnet
+
+    # align format for control guidance
+    if not isinstance(control_guidance_start, list) and isinstance(control_guidance_end, list):
+        control_guidance_start = len(control_guidance_end) * [control_guidance_start]
+    elif not isinstance(control_guidance_end, list) and isinstance(control_guidance_start, list):
+        control_guidance_end = len(control_guidance_start) * [control_guidance_end]
+    elif not isinstance(control_guidance_start, list) and not isinstance(control_guidance_end, list):
+        mult = len(controlnet.nets) if isinstance(controlnet, MultiControlNetModel) else 1
+        control_guidance_start, control_guidance_end = mult * [control_guidance_start], mult * [
+            control_guidance_end
+        ]
+
+    # 1. Check inputs. Raise error if not correct
+    self.check_inputs(
+        prompt,
+        image,
+        callback_steps,
+        negative_prompt,
+        prompt_embeds,
+        negative_prompt_embeds,
+        controlnet_conditioning_scale,
+        control_guidance_start,
+        control_guidance_end,
+    )
+
+    # 2. Define call parameters
+    if prompt is not None and isinstance(prompt, str):
+        batch_size = 1
+    elif prompt is not None and isinstance(prompt, list):
+        batch_size = len(prompt)
+    else:
+        batch_size = prompt_embeds.shape[0]
+
+    device = self._execution_device
+    # here `guidance_scale` is defined analogously to the guidance weight `w` of equation (2)
+    # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
+    # corresponds to doing no classifier free guidance.
+    do_classifier_free_guidance = guidance_scale > 1.0
+
+    if isinstance(controlnet, MultiControlNetModel) and isinstance(controlnet_conditioning_scale, float):
+        controlnet_conditioning_scale = [controlnet_conditioning_scale] * len(controlnet.nets)
+
+    global_pool_conditions = (
+        controlnet.config.global_pool_conditions
+        if isinstance(controlnet, ControlNetModel)
+        else controlnet.nets[0].config.global_pool_conditions
+    )
+    guess_mode = guess_mode or global_pool_conditions
+
+    # 3. Encode input prompt
+    text_encoder_lora_scale = (
+        cross_attention_kwargs.get("scale", None) if cross_attention_kwargs is not None else None
+    )
+    prompt_embeds = self._encode_prompt(
+        prompt,
+        device,
+        num_images_per_prompt,
+        do_classifier_free_guidance,
+        negative_prompt,
+        prompt_embeds=prompt_embeds,
+        negative_prompt_embeds=negative_prompt_embeds,
+        lora_scale=text_encoder_lora_scale,
+    )
+
+    # 4. Prepare image
+    if isinstance(controlnet, ControlNetModel):
+        image = self.prepare_image(
+            image=image,
+            width=width,
+            height=height,
+            batch_size=batch_size * num_images_per_prompt,
+            num_images_per_prompt=num_images_per_prompt,
+            device=device,
+            dtype=controlnet.dtype,
+            do_classifier_free_guidance=do_classifier_free_guidance,
+            guess_mode=guess_mode,
+        )
+        height, width = image.shape[-2:]
+    elif isinstance(controlnet, MultiControlNetModel):
+        images = []
+
+        for image_ in image:
+            image_ = self.prepare_image(
+                image=image_,
+                width=width,
+                height=height,
+                batch_size=batch_size * num_images_per_prompt,
+                num_images_per_prompt=num_images_per_prompt,
+                device=device,
+                dtype=controlnet.dtype,
+                do_classifier_free_guidance=do_classifier_free_guidance,
+                guess_mode=guess_mode,
+            )
+
+            images.append(image_)
+
+        image = images
+        height, width = image[0].shape[-2:]
+    else:
+        assert False
+
+    # 5. Prepare timesteps
+    self.scheduler.set_timesteps(num_inference_steps, device=device)
+    timesteps = self.scheduler.timesteps
+
+    # 6. Prepare latent variables
+    num_channels_latents = self.unet.config.in_channels
+    latents = self.prepare_latents(
+        batch_size * num_images_per_prompt,
+        num_channels_latents,
+        height,
+        width,
+        prompt_embeds.dtype,
+        device,
+        generator,
+        latents,
+    )
+
+    # 7. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
+    extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta)
+
+    # 7.1 Create tensor stating which controlnets to keep
+    controlnet_keep = []
+    for i in range(len(timesteps)):
+        keeps = [
+            1.0 - float(i / len(timesteps) < s or (i + 1) / len(timesteps) > e)
+            for s, e in zip(control_guidance_start, control_guidance_end)
+        ]
+        controlnet_keep.append(keeps[0] if isinstance(controlnet, ControlNetModel) else keeps)
+
+    # 8. Denoising loop
+    num_warmup_steps = len(timesteps) - num_inference_steps * self.scheduler.order
+    with self.progress_bar(total=num_inference_steps) as progress_bar:
+        for i, t in enumerate(timesteps):
+            # expand the latents if we are doing classifier free guidance
+            latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents
+            latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
+
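+            # controlnet(s) inference
+            # ControlNet only receives the first 77 tokens of prompt_embeds (the CLIP text tokens);
+            # any image-prompt tokens appended after them are passed to the UNet only.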
+            if guess_mode and do_classifier_free_guidance:
+                # Infer ControlNet only for the conditional batch.
+                control_model_input = latents
+                control_model_input = self.scheduler.scale_model_input(control_model_input, t)
+                controlnet_prompt_embeds = prompt_embeds[:, :77, :].chunk(2)[1]
+            else:
+                control_model_input = latent_model_input
+                controlnet_prompt_embeds = prompt_embeds[:, :77, :]
+
+            if isinstance(controlnet_keep[i], list):
+                cond_scale = [c * s for c, s in zip(controlnet_conditioning_scale, controlnet_keep[i])]
+            else:
+                controlnet_cond_scale = controlnet_conditioning_scale
+                if isinstance(controlnet_cond_scale, list):
+                    controlnet_cond_scale = controlnet_cond_scale[0]
+                cond_scale = controlnet_cond_scale * controlnet_keep[i]
+
+            down_block_res_samples, mid_block_res_sample = self.controlnet(
+                control_model_input,
+                t,
+                encoder_hidden_states=controlnet_prompt_embeds,
+                controlnet_cond=image,
+                conditioning_scale=cond_scale,
+                guess_mode=guess_mode,
+                return_dict=False,
+            )
+
+            if guess_mode and do_classifier_free_guidance:
+                # Inferred ControlNet only for the conditional batch.
+                # To apply the output of ControlNet to both the unconditional and conditional batches,
+                # add 0 to the unconditional batch to keep it unchanged.
+                down_block_res_samples = [torch.cat([torch.zeros_like(d), d]) for d in down_block_res_samples]
+                mid_block_res_sample = torch.cat([torch.zeros_like(mid_block_res_sample), mid_block_res_sample])
+
+            # predict the noise residual
+            noise_pred = self.unet(
+                latent_model_input,
+                t,
+                encoder_hidden_states=prompt_embeds,
+                cross_attention_kwargs=cross_attention_kwargs,
+                down_block_additional_residuals=down_block_res_samples,
+                mid_block_additional_residual=mid_block_res_sample,
+                return_dict=False,
+            )[0]
+
+            # perform guidance
+            if do_classifier_free_guidance:
+                noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
+                noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
+
+            # compute the previous noisy sample x_t -> x_t-1
+            latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs, return_dict=False)[0]
+
+            # call the callback, if provided
+            if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
+                progress_bar.update()
+                if callback is not None and i % callback_steps == 0:
+                    callback(i, t, latents)
+
+    # If we do sequential model offloading, let's offload unet and controlnet
+    # manually for max memory savings
+    if hasattr(self, "final_offload_hook") and self.final_offload_hook is not None:
+        self.unet.to("cpu")
+        self.controlnet.to("cpu")
+        torch.cuda.empty_cache()
+
+    if output_type != "latent":
+        image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False)[0]
+        image, has_nsfw_concept = self.run_safety_checker(image, device, prompt_embeds.dtype)
+    else:
+        image = latents
+        has_nsfw_concept = None
+
+    if has_nsfw_concept is None:
+        do_denormalize = [True] * image.shape[0]
+    else:
+        do_denormalize = [not has_nsfw for has_nsfw in has_nsfw_concept]
+
+    image = self.image_processor.postprocess(image, output_type=output_type, do_denormalize=do_denormalize)
+
+    # Offload last model to CPU
+    if hasattr(self, "final_offload_hook") and self.final_offload_hook is not None:
+        self.final_offload_hook.offload()
+
+    if not return_dict:
+        return (image, has_nsfw_concept)
+
+    return StableDiffusionPipelineOutput(images=image, nsfw_content_detected=has_nsfw_concept)
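
The `controlnet_keep` schedule built in step 7.1 above can be hard to read at a glance. The following is a minimal standalone sketch of the same expression, not part of the commit. The helper name `controlnet_keep_schedule` and the example values (4 steps, two ControlNets, scales `0.8` and `1.2`) are assumptions chosen for illustration. It shows that each ControlNet gets a factor of `1.0` on steps that lie inside its `[start, end]` guidance window and `0.0` otherwise, and that this factor is then multiplied into the conditioning scale.

```python
def controlnet_keep_schedule(num_steps, starts, ends):
    """Return, for every denoising step, one keep factor (1.0 or 0.0) per ControlNet.

    Mirrors the expression used in step 7.1: a ControlNet is switched off at step i
    when the step begins before its start fraction or finishes after its end fraction.
    """
    schedule = []
    for i in range(num_steps):
        keeps = [
            1.0 - float(i / num_steps < s or (i + 1) / num_steps > e)
            for s, e in zip(starts, ends)
        ]
        schedule.append(keeps)
    return schedule


# Hypothetical setup: the first ControlNet guides the first half of sampling,
# the second ControlNet guides the second half.
schedule = controlnet_keep_schedule(4, starts=[0.0, 0.5], ends=[0.5, 1.0])

# Per-step conditioning scales, as in the multi-ControlNet branch of the loop.
conditioning_scale = [0.8, 1.2]  # illustrative values
cond_scales = [
    [c * k for c, k in zip(conditioning_scale, keeps)] for keeps in schedule
]

for step, (keeps, scales) in enumerate(zip(schedule, cond_scales)):
    print(step, keeps, scales)
```

With a single `ControlNetModel`, the pipeline stores only `keeps[0]` per step, so the scale becomes a plain float rather than a list.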
--- a/ip_adapter_controlnet_demo.ipynb
+++ b/ip_adapter_controlnet_demo.ipynb
--- a/ip_adapter_demo.ipynb
+++ b/ip_adapter_demo.ipynb
--- a/ip_adapter_multimodal_prompts_demo.ipynb
+++ b/ip_adapter_multimodal_prompts_demo.ipynb