Image Editing with InstructPix2Pix and OpenVINO
===============================================

InstructPix2Pix is a conditional diffusion model that edits images
based on written instructions provided by the user. Generative image
editing models traditionally target a single editing task like style
transfer or translation between image domains. Text guidance gives us
an opportunity to solve multiple tasks with a single model. The
InstructPix2Pix method works differently from existing text-based image
editing in that it enables editing from instructions that tell the
model what action to perform, instead of using text labels, captions or
descriptions of input/output images. A key benefit of following editing
instructions is that the user can just tell the model exactly what to
do in natural written text. There is no need for the user to provide
extra information, such as example images or descriptions of visual
content that remains constant between the input and output images. More
details about this approach can be found in this
`paper <https://arxiv.org/pdf/2211.09800.pdf>`__ and
`repository <https://github.com/timothybrooks/instruct-pix2pix>`__.

This notebook demonstrates how to convert and run the InstructPix2Pix
model using OpenVINO.

The notebook contains the following steps:

1. Convert PyTorch models to ONNX format.
2. Convert ONNX models to OpenVINO IR format, using the model conversion
   API.
3. Run the InstructPix2Pix pipeline with OpenVINO.

.. _top:

**Table of contents**:

- `Prerequisites <#prerequisites>`__
- `Create Pytorch Models pipeline <#create-pytorch-models-pipeline>`__
- `Convert Models to OpenVINO IR <#convert-models-to-openvino-ir>`__

  - `Text Encoder <#text-encoder>`__
  - `VAE <#vae>`__
  - `Unet <#unet>`__

- `Prepare Inference Pipeline <#prepare-inference-pipeline>`__

Prerequisites `⇑ <#top>`__
###############################################################################################################################

Install necessary packages

.. code:: ipython3

    !pip install "transformers>=4.25.1" accelerate
    !pip install "git+https://github.com/huggingface/diffusers.git"

.. parsed-literal::

    Requirement already satisfied: transformers>=4.25.1 in /home/ea/work/notebooks_env/lib/python3.8/site-packages (4.25.1)
    Requirement already satisfied: accelerate in /home/ea/work/notebooks_env/lib/python3.8/site-packages (0.13.2)
    Requirement already satisfied: huggingface-hub<1.0,>=0.10.0 in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from transformers>=4.25.1) (0.11.1)
    Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from transformers>=4.25.1) (0.13.2)
    Requirement already satisfied: filelock in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from transformers>=4.25.1) (3.9.0)
    Requirement already satisfied: regex!=2019.12.17 in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from transformers>=4.25.1) (2022.10.31)
    Requirement already satisfied: packaging>=20.0 in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from transformers>=4.25.1) (23.0)
    Requirement already satisfied: numpy>=1.17 in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from transformers>=4.25.1) (1.23.4)
    Requirement already satisfied: pyyaml>=5.1 in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from transformers>=4.25.1) (6.0)
    Requirement already satisfied: requests in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from transformers>=4.25.1) (2.28.2)
    Requirement already satisfied: tqdm>=4.27 in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from transformers>=4.25.1) (4.64.1)
    Requirement already satisfied: torch>=1.4.0 in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from accelerate) (1.13.1+cpu)
    Requirement already satisfied: psutil in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from accelerate) (5.9.4)
    Requirement already satisfied: typing-extensions>=3.7.4.3 in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from huggingface-hub<1.0,>=0.10.0->transformers>=4.25.1) (4.4.0)
    Requirement already satisfied: urllib3<1.27,>=1.21.1 in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from requests->transformers>=4.25.1) (1.26.14)
    Requirement already satisfied: idna<4,>=2.5 in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from requests->transformers>=4.25.1) (3.4)
    Requirement already satisfied: charset-normalizer<4,>=2 in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from requests->transformers>=4.25.1) (2.1.1)
    Requirement already satisfied: certifi>=2017.4.17 in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from requests->transformers>=4.25.1) (2022.12.7)

    [notice] A new release of pip available: 22.3.1 -> 23.0
    [notice] To update, run: pip install --upgrade pip
    Collecting git+https://github.com/huggingface/diffusers.git
      Cloning https://github.com/huggingface/diffusers.git to /tmp/pip-req-build-tj6ekfd9
      Running command git clone --filter=blob:none --quiet https://github.com/huggingface/diffusers.git /tmp/pip-req-build-tj6ekfd9
      Resolved https://github.com/huggingface/diffusers.git to commit 1e5eaca754bce676ce9142cab7ccaaee78df4696
      Installing build dependencies ... done
      Getting requirements to build wheel ... done
      Preparing metadata (pyproject.toml) ... done
    Requirement already satisfied: huggingface-hub>=0.10.0 in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from diffusers==0.14.0.dev0) (0.11.1)
    Requirement already satisfied: regex!=2019.12.17 in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from diffusers==0.14.0.dev0) (2022.10.31)
    Requirement already satisfied: numpy in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from diffusers==0.14.0.dev0) (1.23.4)
    Requirement already satisfied: filelock in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from diffusers==0.14.0.dev0) (3.9.0)
    Requirement already satisfied: importlib-metadata in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from diffusers==0.14.0.dev0) (4.13.0)
    Requirement already satisfied: Pillow in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from diffusers==0.14.0.dev0) (9.4.0)
    Requirement already satisfied: requests in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from diffusers==0.14.0.dev0) (2.28.2)
    Requirement already satisfied: pyyaml>=5.1 in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from huggingface-hub>=0.10.0->diffusers==0.14.0.dev0) (6.0)
    Requirement already satisfied: tqdm in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from huggingface-hub>=0.10.0->diffusers==0.14.0.dev0) (4.64.1)
    Requirement already satisfied: packaging>=20.9 in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from huggingface-hub>=0.10.0->diffusers==0.14.0.dev0) (23.0)
    Requirement already satisfied: typing-extensions>=3.7.4.3 in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from huggingface-hub>=0.10.0->diffusers==0.14.0.dev0) (4.4.0)
    Requirement already satisfied: zipp>=0.5 in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from importlib-metadata->diffusers==0.14.0.dev0) (3.11.0)
    Requirement already satisfied: idna<4,>=2.5 in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from requests->diffusers==0.14.0.dev0) (3.4)
    Requirement already satisfied: certifi>=2017.4.17 in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from requests->diffusers==0.14.0.dev0) (2022.12.7)
    Requirement already satisfied: charset-normalizer<4,>=2 in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from requests->diffusers==0.14.0.dev0) (2.1.1)
    Requirement already satisfied: urllib3<1.27,>=1.21.1 in /home/ea/work/notebooks_env/lib/python3.8/site-packages (from requests->diffusers==0.14.0.dev0) (1.26.14)

    [notice] A new release of pip available: 22.3.1 -> 23.0
    [notice] To update, run: pip install --upgrade pip

Create Pytorch Models pipeline `⇑ <#top>`__
###############################################################################################################################

``StableDiffusionInstructPix2PixPipeline`` is an end-to-end inference
pipeline that you can use to edit images from text instructions with
just a few lines of code, provided as part of the
🤗🧨 `diffusers <https://huggingface.co/docs/diffusers/index>`__ library.

First, we load the pre-trained weights of all components of the model.

.. note::

   Initially, model loading can take some time due to downloading the
   weights. Also, the download speed depends on your internet
   connection.

.. code:: ipython3

    import torch
    from diffusers import StableDiffusionInstructPix2PixPipeline, EulerAncestralDiscreteScheduler

    model_id = "timbrooks/instruct-pix2pix"
    pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(model_id, torch_dtype=torch.float32, safety_checker=None)
    scheduler_config = pipe.scheduler.config
    text_encoder = pipe.text_encoder
    text_encoder.eval()
    unet = pipe.unet
    unet.eval()
    vae = pipe.vae
    vae.eval()

    del pipe


.. parsed-literal::

    Fetching 15 files:   0%|          | 0/15 [00:00<?, ?it/s]

Convert Models to OpenVINO IR `⇑ <#top>`__
###############################################################################################################################

OpenVINO supports PyTorch through export to the ONNX format. We will
use the ``torch.onnx.export`` function for obtaining an ONNX model. For
more information, refer to the `PyTorch
documentation <https://pytorch.org/docs/stable/onnx.html>`__. We need
to provide a model object, input data for model tracing and a path for
saving the model. Optionally, we can provide the target ONNX opset for
conversion and other parameters specified in the documentation (for
example, input and output names or dynamic shapes).

While ONNX models are directly supported by OpenVINO™ runtime, it can
be useful to convert them to OpenVINO Intermediate Representation (IR)
format to take advantage of advanced OpenVINO optimization tools and
features. We will use OpenVINO Model Optimizer to convert the models to
IR format and compress weights to the ``FP16`` format.

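Each submodel below is converted with the same two-step pattern:
``torch.onnx.export`` followed by ``mo.convert_model`` and
``serialize``. The next cell is a compact, self-contained sketch of
that pattern using a toy ``torch.nn`` module (not one of the pipeline
components), just to illustrate the API flow.

.. code:: ipython3

    import torch
    from openvino.tools import mo
    from openvino.runtime import serialize

    # a toy stand-in for a pipeline component
    toy_model = torch.nn.Linear(8, 4)
    example_input = torch.zeros((1, 8))

    # step 1: trace the PyTorch module and export it to ONNX
    torch.onnx.export(toy_model, example_input, 'toy_model.onnx',
                      input_names=['x'], output_names=['y'])

    # step 2: convert the ONNX file to OpenVINO IR with FP16 weights and save it
    ov_toy_model = mo.convert_model('toy_model.onnx', compress_to_fp16=True)
    serialize(ov_toy_model, 'toy_model.xml')
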
The InstructPix2Pix model is based on Stable Diffusion, a large-scale
text-to-image latent diffusion model. You can find more details about
how to run Stable Diffusion for text-to-image generation with OpenVINO
in a separate
`tutorial <225-stable-diffusion-text-to-image-with-output.html>`__.

The model consists of three important parts:

- Text Encoder - to create conditions from a text prompt.
- Unet - for step-by-step denoising of the latent image representation.
- Autoencoder (VAE) - to encode the initial image to latent space for
  starting the denoising process and to decode the latent space back to
  an image when denoising is complete.

Let us convert each part.

Text Encoder `⇑ <#top>`__
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

The text encoder is responsible for transforming the input prompt, for
example, “a photo of an astronaut riding a horse”, into an embedding
space that can be understood by the UNet. It is usually a simple
transformer-based encoder that maps a sequence of input tokens to a
sequence of latent text embeddings.

The input of the text encoder is the tensor ``input_ids``, which
contains indexes of tokens from the text processed by the tokenizer and
padded to the maximum length accepted by the model. Model outputs are
two tensors: ``last_hidden_state`` - the hidden state from the last
MultiHeadAttention layer in the model, and ``pooler_out`` - the pooled
output for the whole model hidden states. You will use
``opset_version=14``, since the model contains the ``triu`` operation,
supported in ONNX only starting from this opset.

.. code:: ipython3

    from pathlib import Path
    from openvino.tools import mo
    from openvino.runtime import serialize, Core

    core = Core()

    TEXT_ENCODER_ONNX_PATH = Path('text_encoder.onnx')
    TEXT_ENCODER_OV_PATH = TEXT_ENCODER_ONNX_PATH.with_suffix('.xml')


    def convert_encoder_onnx(text_encoder, onnx_path: Path):
        """
        Convert Text Encoder model to ONNX.
        Function accepts the text encoder model and prepares example inputs for ONNX conversion via torch.onnx.export
        Parameters:
            text_encoder: InstructPix2Pix text_encoder model
            onnx_path (Path): File for storing onnx model
        Returns:
            None
        """
        if not onnx_path.exists():
            # switch model to inference mode
            text_encoder.eval()
            input_ids = torch.ones((1, 77), dtype=torch.long)

            # disable gradients calculation for reducing memory consumption
            with torch.no_grad():
                # infer model, just to make sure that it works
                text_encoder(input_ids)
                # export model to ONNX format
                torch.onnx.export(
                    text_encoder,  # model instance
                    input_ids,  # inputs for model tracing
                    onnx_path,  # output file for saving result
                    # model input name for onnx representation
                    input_names=['input_ids'],
                    # model output names for onnx representation
                    output_names=['last_hidden_state', 'pooler_out'],
                    opset_version=14  # onnx opset version for export
                )
            print('Text Encoder successfully converted to ONNX')


    if not TEXT_ENCODER_OV_PATH.exists():
        convert_encoder_onnx(text_encoder, TEXT_ENCODER_ONNX_PATH)
        text_encoder = mo.convert_model(
            TEXT_ENCODER_ONNX_PATH, compress_to_fp16=True)
        serialize(text_encoder, str(TEXT_ENCODER_OV_PATH))
        print('Text Encoder successfully converted to IR')
    else:
        print(f"Text encoder will be loaded from {TEXT_ENCODER_OV_PATH}")

    del text_encoder


.. parsed-literal::

    Text encoder will be loaded from text_encoder.xml

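Optionally, you can sanity-check the converted text encoder by reading
the IR back and printing its input and output names. This is a small
optional check, not part of the original conversion flow.

.. code:: ipython3

    ov_text_encoder = core.read_model(str(TEXT_ENCODER_OV_PATH))
    print('Inputs:', [model_input.any_name for model_input in ov_text_encoder.inputs])
    print('Outputs:', [model_output.any_name for model_output in ov_text_encoder.outputs])
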
VAE `⇑ <#top>`__
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

The VAE model consists of two parts: an encoder and a decoder.

- The encoder is used to convert the image into a low-dimensional
  latent representation, which will serve as the input to the UNet
  model.
- The decoder, conversely, transforms the latent representation back
  into an image.

In comparison with the text-to-image inference pipeline, where the VAE
is used only for decoding, this pipeline also involves encoding the
original image. As the two parts are used separately at different steps
of the pipeline and do not depend on each other, we should convert them
into two independent models.

.. code:: ipython3

    VAE_ENCODER_ONNX_PATH = Path('vae_encoder.onnx')
    VAE_ENCODER_OV_PATH = VAE_ENCODER_ONNX_PATH.with_suffix('.xml')


    def convert_vae_encoder_onnx(vae, onnx_path: Path):
        """
        Convert VAE model to ONNX, then IR format.
        Function accepts the VAE model, creates a wrapper class to export only the part necessary for inference,
        and prepares example inputs for ONNX conversion via torch.onnx.export
        Parameters:
            vae: InstructPix2Pix VAE model
            onnx_path (Path): File for storing onnx model
        Returns:
            None
        """
        class VAEEncoderWrapper(torch.nn.Module):
            def __init__(self, vae):
                super().__init__()
                self.vae = vae

            def forward(self, image):
                return self.vae.encode(image).latent_dist.mode()

        if not onnx_path.exists():
            vae_encoder = VAEEncoderWrapper(vae)
            vae_encoder.eval()
            image = torch.zeros((1, 3, 512, 512))
            with torch.no_grad():
                torch.onnx.export(vae_encoder, image, onnx_path, input_names=[
                                  'image'], output_names=['image_latent'])
            print('VAE encoder successfully converted to ONNX')


    if not VAE_ENCODER_OV_PATH.exists():
        convert_vae_encoder_onnx(vae, VAE_ENCODER_ONNX_PATH)
        vae_encoder = mo.convert_model(VAE_ENCODER_ONNX_PATH, compress_to_fp16=True)
        serialize(vae_encoder, str(VAE_ENCODER_OV_PATH))
        print('VAE encoder successfully converted to IR')
        del vae_encoder
    else:
        print(f"VAE encoder will be loaded from {VAE_ENCODER_OV_PATH}")


.. parsed-literal::

    VAE encoder will be loaded from vae_encoder.xml

.. code:: ipython3

    VAE_DECODER_ONNX_PATH = Path('vae_decoder.onnx')
    VAE_DECODER_OV_PATH = VAE_DECODER_ONNX_PATH.with_suffix('.xml')


    def convert_vae_decoder_onnx(vae, onnx_path: Path):
        """
        Convert VAE model to ONNX, then IR format.
        Function accepts the VAE model, creates a wrapper class to export only the part necessary for inference,
        and prepares example inputs for ONNX conversion via torch.onnx.export
        Parameters:
            vae: InstructPix2Pix VAE model
            onnx_path (Path): File for storing onnx model
        Returns:
            None
        """
        class VAEDecoderWrapper(torch.nn.Module):
            def __init__(self, vae):
                super().__init__()
                self.vae = vae

            def forward(self, latents):
                return self.vae.decode(latents)

        if not onnx_path.exists():
            vae_decoder = VAEDecoderWrapper(vae)
            latents = torch.zeros((1, 4, 64, 64))

            vae_decoder.eval()
            with torch.no_grad():
                torch.onnx.export(vae_decoder, latents, onnx_path, input_names=[
                                  'latents'], output_names=['sample'])
            print('VAE decoder successfully converted to ONNX')


    if not VAE_DECODER_OV_PATH.exists():
        convert_vae_decoder_onnx(vae, VAE_DECODER_ONNX_PATH)
        vae_decoder = mo.convert_model(VAE_DECODER_ONNX_PATH, compress_to_fp16=True)
        serialize(vae_decoder, str(VAE_DECODER_OV_PATH))
        print('VAE decoder successfully converted to IR')
        del vae_decoder
    else:
        print(f"VAE decoder will be loaded from {VAE_DECODER_OV_PATH}")

    del vae


.. parsed-literal::

    VAE decoder successfully converted to IR

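As a quick optional check (not part of the original notebook), you can
confirm that the two converted VAE parts have the expected resolutions:
the encoder maps a ``1x3x512x512`` image to a ``1x4x64x64`` latent, and
the decoder maps such a latent back to a ``1x3x512x512`` image, since
both were exported with those example input shapes.

.. code:: ipython3

    for xml_path in [VAE_ENCODER_OV_PATH, VAE_DECODER_OV_PATH]:
        ov_model = core.read_model(str(xml_path))
        print(xml_path.name, ov_model.input(0).partial_shape, '->', ov_model.output(0).partial_shape)
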
Unet `⇑ <#top>`__
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

The Unet model has three inputs:

- ``scaled_latent_model_input`` - the latent image sample from the
  previous step. Since the generation process has not started yet,
  random noise is used as the example input for tracing.
- ``timestep`` - the current scheduler step.
- ``text_embeddings`` - a hidden state of the text encoder.

The model predicts the ``sample`` state for the next step.

.. code:: ipython3

    import numpy as np

    UNET_ONNX_PATH = Path('unet/unet.onnx')
    UNET_OV_PATH = UNET_ONNX_PATH.parents[1] / 'unet.xml'


    def convert_unet_onnx(unet, onnx_path: Path):
        """
        Convert Unet model to ONNX, then IR format.
        Function accepts the unet model and prepares example inputs for ONNX conversion via torch.onnx.export
        Parameters:
            unet: InstructPix2Pix unet model
            onnx_path (Path): File for storing onnx model
        Returns:
            None
        """
        if not onnx_path.exists():
            # prepare inputs
            latents_shape = (3, 8, 512 // 8, 512 // 8)
            latents = torch.randn(latents_shape)
            t = torch.from_numpy(np.array(1, dtype=float))
            encoder_hidden_state = torch.randn((3, 77, 768))

            # if the model size > 2Gb, it will be represented as ONNX with external data files,
            # so we store it in a separate directory to avoid having a lot of files in the current directory
            onnx_path.parent.mkdir(exist_ok=True, parents=True)
            with torch.no_grad():
                torch.onnx.export(
                    unet,
                    (latents, t, encoder_hidden_state), str(onnx_path),
                    input_names=['scaled_latent_model_input',
                                 'timestep', 'text_embeddings'],
                    output_names=['sample']
                )
            print('Unet successfully converted to ONNX')


    if not UNET_OV_PATH.exists():
        convert_unet_onnx(unet, UNET_ONNX_PATH)
        unet = mo.convert_model(UNET_ONNX_PATH, compress_to_fp16=True)
        serialize(unet, str(UNET_OV_PATH))
        print('Unet successfully converted to IR')
    else:
        print(f"Unet successfully loaded from {UNET_OV_PATH}")
    del unet


.. parsed-literal::

    Unet successfully loaded from unet.xml

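As an optional check (not part of the original notebook), you can read
the converted UNet back and inspect its input shapes. Note that
``scaled_latent_model_input`` has 8 channels: the 4 noise latent
channels concatenated with the 4 channels of the encoded input image,
which is how they are combined later in the inference pipeline.

.. code:: ipython3

    ov_unet = core.read_model(str(UNET_OV_PATH))
    for model_input in ov_unet.inputs:
        print(model_input.any_name, model_input.partial_shape)
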
Prepare Inference Pipeline `⇑ <#top>`__
###############################################################################################################################

Putting it all together, let us now take a closer look at how the model
inference works by illustrating the logical flow.

.. figure:: https://user-images.githubusercontent.com/29454499/214895365-3063ac11-0486-4d9b-9e25-8f469aba5e5d.png
   :alt: diagram

   diagram

The InstructPix2Pix model takes both an image and a text prompt as an
input. The image is transformed to latent image representations of size
:math:`64 \times 64`, using the encoder part of the variational
autoencoder, whereas the text prompt is transformed to text embeddings
of size :math:`77 \times 768` via CLIP’s text encoder.

Next, the UNet model iteratively *denoises* the random latent image
representations while being conditioned on the text embeddings. The
output of the UNet, being the noise residual, is used to compute a
denoised latent image representation via a scheduler algorithm.

The *denoising* process is repeated a given number of times (by default
100) to retrieve step-by-step better latent image representations. Once
it has been completed, the latent image representation is decoded by
the decoder part of the variational autoencoder.

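At each denoising step, the pipeline below applies classifier-free
guidance with two scales: ``guidance_scale`` for the text instruction
and ``image_guidance_scale`` for the input image. Written out, the
combined noise prediction computed in the denoising loop is

.. math::

   e = e_{\varnothing} + s_{img}\,(e_{img} - e_{\varnothing}) + s_{txt}\,(e_{txt} - e_{img}),

where :math:`e_{txt}` is the UNet prediction conditioned on both the
instruction and the input image, :math:`e_{img}` is conditioned on the
image only, :math:`e_{\varnothing}` is unconditional, and
:math:`s_{txt}`, :math:`s_{img}` correspond to ``guidance_scale`` and
``image_guidance_scale``.
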
.. code:: ipython3

    from diffusers.pipeline_utils import DiffusionPipeline
    from openvino.runtime import Model, Core
    from transformers import CLIPTokenizer
    from typing import Union, List, Optional, Tuple
    import PIL
    import cv2


    def scale_fit_to_window(dst_width:int, dst_height:int, image_width:int, image_height:int):
        """
        Preprocessing helper function for calculating the image size for resize while preserving the original aspect ratio
        and fitting the image into the specific window size

        Parameters:
          dst_width (int): destination window width
          dst_height (int): destination window height
          image_width (int): source image width
          image_height (int): source image height
        Returns:
          result_width (int): calculated width for resize
          result_height (int): calculated height for resize
        """
        im_scale = min(dst_height / image_height, dst_width / image_width)
        return int(im_scale * image_width), int(im_scale * image_height)


    def preprocess(image: PIL.Image.Image):
        """
        Image preprocessing function. Takes an image in PIL.Image format, resizes it to keep the aspect ratio and fit the 512x512 model input window,
        then converts it to np.ndarray and adds zero padding on the right or bottom side of the image (depending on the aspect ratio), after that
        converts the data to float32, changes the range of values from [0, 255] to [-1, 1] and, finally, converts the data layout from NHWC to NCHW.
        The function returns the preprocessed input tensor and the padding size, which can be used in postprocessing.

        Parameters:
          image (PIL.Image.Image): input image
        Returns:
           image (np.ndarray): preprocessed image tensor
           pad (Tuple[int]): padding size for each dimension for restoring image size in postprocessing
        """
        src_width, src_height = image.size
        dst_width, dst_height = scale_fit_to_window(
            512, 512, src_width, src_height)
        image = np.array(image.resize((dst_width, dst_height),
                         resample=PIL.Image.Resampling.LANCZOS))[None, :]
        pad_width = 512 - dst_width
        pad_height = 512 - dst_height
        pad = ((0, 0), (0, pad_height), (0, pad_width), (0, 0))
        image = np.pad(image, pad, mode="constant")
        image = image.astype(np.float32) / 255.0
        image = 2.0 * image - 1.0
        image = image.transpose(0, 3, 1, 2)
        return image, pad


    def randn_tensor(
        shape: Union[Tuple, List],
        dtype: Optional[np.dtype] = np.float32,
    ):
        """
        Helper function for generating a tensor of random values with the given shape and data type

        Parameters:
          shape (Union[Tuple, List]): shape for filling random values
          dtype (np.dtype, *optional*, np.float32): data type for result
        Returns:
          latents (np.ndarray): tensor with random values with given data type and shape (usually represents noise in latent space)
        """
        latents = np.random.randn(*shape).astype(dtype)

        return latents


    class OVInstructPix2PixPipeline(DiffusionPipeline):
        """
        OpenVINO inference pipeline for InstructPix2Pix
        """
        def __init__(
            self,
            tokenizer: CLIPTokenizer,
            scheduler: EulerAncestralDiscreteScheduler,
            core: Core,
            text_encoder: Model,
            vae_encoder: Model,
            unet: Model,
            vae_decoder: Model,
            device:str = "AUTO"
        ):
            super().__init__()
            self.tokenizer = tokenizer
            self.vae_scale_factor = 8
            self.scheduler = scheduler
            self.load_models(core, device, text_encoder,
                             vae_encoder, unet, vae_decoder)

        def load_models(self, core: Core, device: str, text_encoder: Model, vae_encoder: Model, unet: Model, vae_decoder: Model):
            """
            Function for loading models on device using OpenVINO

            Parameters:
              core (Core): OpenVINO runtime Core class instance
              device (str): inference device
              text_encoder (Model): OpenVINO Model object represents text encoder
              vae_encoder (Model): OpenVINO Model object represents vae encoder
              unet (Model): OpenVINO Model object represents unet
              vae_decoder (Model): OpenVINO Model object represents vae decoder
            Returns:
              None
            """
            self.text_encoder = core.compile_model(text_encoder, device)
            self.text_encoder_out = self.text_encoder.output(0)
            self.vae_encoder = core.compile_model(vae_encoder, device)
            self.vae_encoder_out = self.vae_encoder.output(0)
            self.unet = core.compile_model(unet, device)
            self.unet_out = self.unet.output(0)
            self.vae_decoder = core.compile_model(vae_decoder, device)
            self.vae_decoder_out = self.vae_decoder.output(0)

        def __call__(
            self,
            prompt: Union[str, List[str]],
            image: PIL.Image.Image,
            num_inference_steps: int = 10,
            guidance_scale: float = 7.5,
            image_guidance_scale: float = 1.5,
            eta: float = 0.0,
            latents: Optional[np.array] = None,
            output_type: Optional[str] = "pil",
        ):
            """
            Function invoked when calling the pipeline for generation.

            Parameters:
                prompt (`str` or `List[str]`):
                    The prompt or prompts to guide the image generation.
                image (`PIL.Image.Image`):
                    `Image`, or tensor representing an image batch which will be repainted according to `prompt`.
                num_inference_steps (`int`, *optional*, defaults to 10):
                    The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                    expense of slower inference.
                guidance_scale (`float`, *optional*, defaults to 7.5):
                    Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
                    `guidance_scale` is defined as `w` of equation 2. of [Imagen
                    Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
                    1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
                    usually at the expense of lower image quality. This pipeline requires a value of at least `1`.
                image_guidance_scale (`float`, *optional*, defaults to 1.5):
                    Image guidance scale is to push the generated image towards the initial image `image`. Image guidance
                    scale is enabled by setting `image_guidance_scale > 1`. Higher image guidance scale encourages to
                    generate images that are closely linked to the source image `image`, usually at the expense of lower
                    image quality. This pipeline requires a value of at least `1`.
                latents (`np.ndarray`, *optional*):
                    Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
                    generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
                    tensor will be generated by sampling from a standard normal distribution.
                output_type (`str`, *optional*, defaults to `"pil"`):
                    The output format of the generated image. Choose between
                    [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
            Returns:
                image (List[Union[np.ndarray, PIL.Image.Image]]): generated images

            """

            # 1. Define call parameters
            batch_size = 1 if isinstance(prompt, str) else len(prompt)
            # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2)
            # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
            # corresponds to doing no classifier free guidance.
            do_classifier_free_guidance = guidance_scale > 1.0 and image_guidance_scale >= 1.0
            # check if scheduler is in sigmas space
            scheduler_is_in_sigma_space = hasattr(self.scheduler, "sigmas")

            # 2. Encode input prompt
            text_embeddings = self._encode_prompt(prompt)

            # 3. Preprocess image
            orig_width, orig_height = image.size
            image, pad = preprocess(image)
            height, width = image.shape[-2:]

            # 4. set timesteps
            self.scheduler.set_timesteps(num_inference_steps)
            timesteps = self.scheduler.timesteps

            # 5. Prepare Image latents
            image_latents = self.prepare_image_latents(
                image,
                do_classifier_free_guidance=do_classifier_free_guidance,
            )

            # 6. Prepare latent variables
            num_channels_latents = 4
            latents = self.prepare_latents(
                batch_size,
                num_channels_latents,
                height,
                width,
                text_embeddings.dtype,
                latents,
            )

            # 7. Denoising loop
            num_warmup_steps = len(timesteps) - num_inference_steps * self.scheduler.order
            with self.progress_bar(total=num_inference_steps) as progress_bar:
                for i, t in enumerate(timesteps):
                    # Expand the latents if we are doing classifier free guidance.
                    # The latents are expanded 3 times because for pix2pix the guidance
                    # is applied for both the text and the input image.
                    latent_model_input = np.concatenate(
                        [latents] * 3) if do_classifier_free_guidance else latents

                    # concat latents, image_latents in the channel dimension
                    scaled_latent_model_input = self.scheduler.scale_model_input(
                        latent_model_input, t)
                    scaled_latent_model_input = np.concatenate(
                        [scaled_latent_model_input, image_latents], axis=1)

                    # predict the noise residual
                    noise_pred = self.unet([scaled_latent_model_input, t, text_embeddings])[
                        self.unet_out]

                    # Hack:
                    # For karras style schedulers the model does classifier free guidance using the
                    # predicted_original_sample instead of the noise_pred. So we need to compute the
                    # predicted_original_sample here if we are using a karras style scheduler.
                    if scheduler_is_in_sigma_space:
                        step_index = (self.scheduler.timesteps == t).nonzero().item()
                        sigma = self.scheduler.sigmas[step_index].numpy()
                        noise_pred = latent_model_input - sigma * noise_pred

                    # perform guidance
                    if do_classifier_free_guidance:
                        noise_pred_text, noise_pred_image, noise_pred_uncond = noise_pred[
                            0], noise_pred[1], noise_pred[2]
                        noise_pred = (
                            noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_image) + image_guidance_scale * (noise_pred_image - noise_pred_uncond)
                        )

                    # For karras style schedulers the model does classifier free guidance using the
                    # predicted_original_sample instead of the noise_pred. But the scheduler.step function
                    # expects the noise_pred and computes the predicted_original_sample internally. So we
                    # need to overwrite the noise_pred here such that the value of the computed
                    # predicted_original_sample is correct.
                    if scheduler_is_in_sigma_space:
                        noise_pred = (noise_pred - latents) / (-sigma)

                    # compute the previous noisy sample x_t -> x_t-1
                    latents = self.scheduler.step(torch.from_numpy(noise_pred), t, torch.from_numpy(latents)).prev_sample.numpy()

                    # update the progress bar
                    if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
                        progress_bar.update()

            # 8. Post-processing
            image = self.decode_latents(latents, pad)

            # 9. Convert to PIL
            if output_type == "pil":
                image = self.numpy_to_pil(image)
                image = [img.resize((orig_width, orig_height),
                                    PIL.Image.Resampling.LANCZOS) for img in image]
            else:
                image = [cv2.resize(img, (orig_width, orig_height))
                         for img in image]

            return image

        def _encode_prompt(self, prompt:Union[str, List[str]], num_images_per_prompt:int = 1, do_classifier_free_guidance:bool = True):
            """
            Encodes the prompt into text encoder hidden states.

            Parameters:
                prompt (str or list(str)): prompt to be encoded
                num_images_per_prompt (int): number of images that should be generated per prompt
                do_classifier_free_guidance (bool): whether to use classifier free guidance or not
            Returns:
                text_embeddings (np.ndarray): text encoder hidden states
            """
            batch_size = len(prompt) if isinstance(prompt, list) else 1

            # tokenize input prompts
            text_inputs = self.tokenizer(
                prompt,
                padding="max_length",
                max_length=self.tokenizer.model_max_length,
                truncation=True,
                return_tensors="np",
            )
            text_input_ids = text_inputs.input_ids

            text_embeddings = self.text_encoder(
                text_input_ids)[self.text_encoder_out]

            # duplicate text embeddings for each generation per prompt, using mps friendly method
            if num_images_per_prompt != 1:
                bs_embed, seq_len, _ = text_embeddings.shape
                text_embeddings = np.tile(
                    text_embeddings, (1, num_images_per_prompt, 1))
                text_embeddings = np.reshape(
                    text_embeddings, (bs_embed * num_images_per_prompt, seq_len, -1))

            # get unconditional embeddings for classifier free guidance
            if do_classifier_free_guidance:
                uncond_tokens: List[str]
                uncond_tokens = [""] * batch_size
                max_length = text_input_ids.shape[-1]
                uncond_input = self.tokenizer(
                    uncond_tokens,
                    padding="max_length",
                    max_length=max_length,
                    truncation=True,
                    return_tensors="np",
                )

                uncond_embeddings = self.text_encoder(uncond_input.input_ids)[
                    self.text_encoder_out]

                # duplicate unconditional embeddings for each generation per prompt, using mps friendly method
                seq_len = uncond_embeddings.shape[1]
                uncond_embeddings = np.tile(
                    uncond_embeddings, (1, num_images_per_prompt, 1))
                uncond_embeddings = np.reshape(
                    uncond_embeddings, (batch_size * num_images_per_prompt, seq_len, -1))

                # For classifier free guidance, three sets of embeddings are needed.
                # Here, you concatenate the conditional and unconditional embeddings into a single batch
                # to avoid doing several forward passes
                text_embeddings = np.concatenate(
                    [text_embeddings, uncond_embeddings, uncond_embeddings])

            return text_embeddings

        def prepare_image_latents(
            self, image, batch_size=1, num_images_per_prompt=1, do_classifier_free_guidance=True
        ):
            """
            Encodes input image to latent space using VAE Encoder

            Parameters:
                image (np.ndarray): input image tensor
                num_images_per_prompt (int, *optional*, 1): number of images generated per prompt
                do_classifier_free_guidance (bool): whether to use classifier free guidance or not
            Returns:
                image_latents: image encoded to latent space
            """

            image = image.astype(np.float32)

            batch_size = batch_size * num_images_per_prompt
            image_latents = self.vae_encoder(image)[self.vae_encoder_out]

            if batch_size > image_latents.shape[0] and batch_size % image_latents.shape[0] == 0:
                # expand image_latents for batch_size
                additional_image_per_prompt = batch_size // image_latents.shape[0]
                image_latents = np.concatenate(
                    [image_latents] * additional_image_per_prompt, axis=0)
            elif batch_size > image_latents.shape[0] and batch_size % image_latents.shape[0] != 0:
                raise ValueError(
                    f"Cannot duplicate `image` of batch size {image_latents.shape[0]} to {batch_size} text prompts."
                )
            else:
                image_latents = np.concatenate([image_latents], axis=0)

            if do_classifier_free_guidance:
                uncond_image_latents = np.zeros_like(image_latents)
                image_latents = np.concatenate([image_latents, image_latents, uncond_image_latents], axis=0)

            return image_latents

        def prepare_latents(self, batch_size:int, num_channels_latents:int, height:int, width:int, dtype:np.dtype = np.float32, latents:np.ndarray = None):
            """
            Prepare the initial noise for image generation. If initial latents are not provided, they will be generated randomly,
            then the prepared latents are scaled by the standard deviation required by the scheduler

            Parameters:
               batch_size (int): input batch size
               num_channels_latents (int): number of channels for noise generation
               height (int): image height
               width (int): image width
               dtype (np.dtype, *optional*, np.float32): dtype for latents generation
               latents (np.ndarray, *optional*, None): initial latent noise tensor, if not provided will be generated
            Returns:
               latents (np.ndarray): scaled initial noise for diffusion
            """
            shape = (batch_size, num_channels_latents, height // self.vae_scale_factor, width // self.vae_scale_factor)
            if latents is None:
                latents = randn_tensor(shape, dtype=dtype)

            # scale the initial noise by the standard deviation required by the scheduler
            latents = latents * self.scheduler.init_noise_sigma.numpy()
            return latents

        def decode_latents(self, latents:np.array, pad:Tuple[int]):
            """
            Decode predicted image from latent space using VAE Decoder and unpad image result

            Parameters:
               latents (np.ndarray): image encoded in diffusion latent space
               pad (Tuple[int]): each-side padding sizes obtained on the preprocessing step
            Returns:
               image: image decoded by the VAE decoder
            """
            latents = 1 / 0.18215 * latents
            image = self.vae_decoder(latents)[self.vae_decoder_out]
            (_, end_h), (_, end_w) = pad[1:3]
            h, w = image.shape[2:]
            unpad_h = h - end_h
            unpad_w = w - end_w
            image = image[:, :, :unpad_h, :unpad_w]
            image = np.clip(image / 2 + 0.5, 0, 1)
            image = np.transpose(image, (0, 2, 3, 1))
            return image

.. code:: ipython3

    import matplotlib.pyplot as plt


    def visualize_results(orig_img:PIL.Image.Image, processed_img:PIL.Image.Image, prompt:str):
        """
        Helper function for results visualization

        Parameters:
           orig_img (PIL.Image.Image): original image
           processed_img (PIL.Image.Image): processed image after editing
           prompt (str): text instruction used for editing
        Returns:
           fig (matplotlib.pyplot.Figure): matplotlib generated figure containing the drawing result
        """
        orig_title = "Original image"
        im_w, im_h = orig_img.size
        is_horizontal = im_h <= im_w
        figsize = (20, 30) if is_horizontal else (30, 20)
        fig, axs = plt.subplots(1 if is_horizontal else 2, 2 if is_horizontal else 1, figsize=figsize, sharex='all', sharey='all')
        fig.patch.set_facecolor('white')
        list_axes = list(axs.flat)
        for a in list_axes:
            a.set_xticklabels([])
            a.set_yticklabels([])
            a.get_xaxis().set_visible(False)
            a.get_yaxis().set_visible(False)
            a.grid(False)
        list_axes[0].imshow(np.array(orig_img))
        list_axes[1].imshow(np.array(processed_img))
        list_axes[0].set_title(orig_title, fontsize=20)
        list_axes[1].set_title(f"Prompt: {prompt}", fontsize=20)
        fig.subplots_adjust(wspace=0.0 if is_horizontal else 0.01, hspace=0.01 if is_horizontal else 0.0)
        fig.tight_layout()
        fig.savefig("result.png", bbox_inches='tight')
        return fig

The tokenizer and the scheduler are also important parts of the
pipeline. Let us define them and put all components together.
Additionally, you can select an inference device from the dropdown
list.

.. code:: ipython3

    import ipywidgets as widgets

    device = widgets.Dropdown(
        options=core.available_devices + ["AUTO"],
        value='AUTO',
        description='Device:',
        disabled=False,
    )

    device

.. code:: ipython3

    from transformers import CLIPTokenizer

    tokenizer = CLIPTokenizer.from_pretrained('openai/clip-vit-large-patch14')
    scheduler = EulerAncestralDiscreteScheduler.from_config(scheduler_config)

    ov_pipe = OVInstructPix2PixPipeline(tokenizer, scheduler, core, TEXT_ENCODER_OV_PATH, VAE_ENCODER_OV_PATH, UNET_OV_PATH, VAE_DECODER_OV_PATH, device=device.value)

Now, you are ready to define editing instructions and an image for
running the inference pipeline. You can find example results generated
by the model on this
`page <https://www.timothybrooks.com/instruct-pix2pix/>`__, in case you
need inspiration. Optionally, you can also change the random generator
seed for latent state initialization and the number of steps.

.. note::

   Consider increasing ``steps`` to get more precise results.
   A suggested value is ``100``, but it will take more time to process.

.. code:: ipython3

    style = {'description_width': 'initial'}
    text_prompt = widgets.Text(value=" Make it in galaxy", description='your text')
    num_steps = widgets.IntSlider(min=1, max=100, value=10, description='steps:')
    seed = widgets.IntSlider(min=0, max=1024, description='seed: ', value=42)
    image_widget = widgets.FileUpload(
        accept='',
        multiple=False,
        description='Upload image',
        style=style
    )
    widgets.VBox([text_prompt, seed, num_steps, image_widget])


.. parsed-literal::

    VBox(children=(Text(value=' Make it in galaxy', description='your text'), IntSlider(value=42, description='see…


.. note::

   Diffusion process can take some time, depending on what hardware you select.

.. code:: ipython3

    import io
    import requests

    default_url = "https://user-images.githubusercontent.com/29454499/223343459-4ac944f0-502e-4acf-9813-8e9f0abc8a16.jpg"
    # read uploaded image
    image = PIL.Image.open(io.BytesIO(image_widget.value[-1]['content']) if image_widget.value else requests.get(default_url, stream=True).raw)
    image = image.convert("RGB")
    print('Pipeline settings')
    print(f'Input text: {text_prompt.value}')
    print(f'Seed: {seed.value}')
    print(f'Number of steps: {num_steps.value}')
    np.random.seed(seed.value)
    processed_image = ov_pipe(text_prompt.value, image, num_steps.value)


.. parsed-literal::

    Pipeline settings
    Input text: Make it in galaxy
    Seed: 42
    Number of steps: 10


.. parsed-literal::

    0%|          | 0/10 [00:00<?, ?it/s]

Now, let us look at the results. The top image represents the original
before editing. The bottom image is the result of the editing process.
The title between them contains the text instructions used for
generation.

.. code:: ipython3

    fig = visualize_results(image, processed_image[0], text_prompt.value)


.. image:: 231-instruct-pix2pix-image-editing-with-output_files/231-instruct-pix2pix-image-editing-with-output_25_0.png

Nice. As you can see, the picture has quite a high definition 🔥.

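If you want to experiment further, here is a minimal follow-up sketch
(the file name is just an example): save the edited image to disk and
re-run the pipeline with more denoising steps for a more refined
result.

.. code:: ipython3

    # save the first edited image produced above
    processed_image[0].save("edited_image.png")

    # re-run the same prompt and input image with the suggested 100 steps
    refined_image = ov_pipe(text_prompt.value, image, num_inference_steps=100)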