add-253 (#19500)

parent 23cad1770e
commit 8aec490128

docs/notebooks/253-zeroscope-text2video-with-output.rst (new file, 896 lines)

@@ -0,0 +1,896 @@
Video generation with ZeroScope and OpenVINO
============================================

.. _top:

The ZeroScope model is a free and open-source text-to-video model that
can generate realistic and engaging videos from text descriptions. It is
based on the
`Modelscope <https://modelscope.cn/models/damo/text-to-video-synthesis/summary>`__
model, but it has been improved to produce higher-quality videos with a
16:9 aspect ratio and no Shutterstock watermark. The ZeroScope model is
available in two versions: ZeroScope_v2 576w, which is optimized for
rapid content creation at a resolution of 576x320 pixels, and
ZeroScope_v2 XL, which upscales videos to a high-definition resolution
of 1024x576.

The ZeroScope model is trained on a dataset of over 9,000 videos and
29,000 tagged frames. It uses a diffusion model to generate videos,
which means that it starts with a random noise image and gradually adds
detail to it until it matches the text description. The ZeroScope model
is still under development, but it has already been used to create some
impressive videos, for example, of people dancing, playing sports, and
even driving cars.

The ZeroScope model is a powerful tool that can be used to create
various videos, from simple animations to complex scenes. It is still
under development, but it has the potential to revolutionize the way we
create and consume video content.

Both versions of the ZeroScope model are available on Hugging Face:

- `ZeroScope_v2 576w <https://huggingface.co/cerspense/zeroscope_v2_576w>`__
- `ZeroScope_v2 XL <https://huggingface.co/cerspense/zeroscope_v2_XL>`__

We will use the first one.

**Table of contents**:

- `Install and import required packages <#install-and-import-required-packages>`__
- `Load the model <#load-the-model>`__
- `Convert the model <#convert-the-model>`__

  - `Define the conversion function <#define-the-conversion-function>`__
  - `UNet <#unet>`__
  - `VAE <#vae>`__
  - `Text encoder <#text-encoder>`__

- `Build a pipeline <#build-a-pipeline>`__
- `Inference with OpenVINO <#inference-with-openvino>`__

  - `Select inference device <#select-inference-device>`__
  - `Define a prompt <#define-a-prompt>`__
  - `Video generation <#video-generation>`__

- `Interactive demo <#interactive-demo>`__

.. important::

   This tutorial requires at least 24GB of free memory to generate a video with
   a frame size of 432x240 and 16 frames. Increasing either of these values will
   require more memory and take more time.
Install and import required packages `⇑ <#top>`__
###############################################################################################################################

To work with the text-to-video synthesis model, we will use Hugging Face’s
`Diffusers <https://github.com/huggingface/diffusers>`__ library. It
provides the pretrained model from ``cerspense``.

.. code:: ipython3

    !pip install -q "diffusers[torch]>=0.15.0" transformers "openvino==2023.1.0.dev20230811" numpy gradio

.. code:: ipython3

    import gc
    from pathlib import Path
    from typing import Optional, Union, List, Callable
    import base64
    import tempfile
    import warnings

    import diffusers
    import transformers
    import numpy as np
    import IPython
    import ipywidgets as widgets
    import torch
    import PIL
    import gradio as gr

    import openvino as ov


.. parsed-literal::

    2023-08-16 21:15:40.145184: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
    2023-08-16 21:15:40.146998: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
    2023-08-16 21:15:40.179214: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
    2023-08-16 21:15:40.180050: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
    To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
    2023-08-16 21:15:40.750499: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

Original 576x320 inference requires a lot of RAM (>100GB), so let’s run
our example on a smaller frame size, keeping the same aspect ratio. Try
reducing the values below to reduce memory consumption.

.. code:: ipython3

    WIDTH = 432  # must be divisible by 8
    HEIGHT = 240  # must be divisible by 8
    NUM_FRAMES = 16
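
Since the frame dimensions must be divisible by 8 (the VAE downscales each
spatial dimension by a factor of 8), an optional assertion can catch invalid
values early; a minimal sketch:

.. code:: ipython3

    # Guard against invalid frame sizes: both dimensions must be multiples of 8
    # for the latent shapes computed below to be integral.
    assert WIDTH % 8 == 0 and HEIGHT % 8 == 0, "WIDTH and HEIGHT must be divisible by 8"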

Load the model `⇑ <#top>`__
###############################################################################################################################

The model is loaded from Hugging Face using the ``.from_pretrained`` method
of ``diffusers.DiffusionPipeline``.

.. code:: ipython3

    pipe = diffusers.DiffusionPipeline.from_pretrained('cerspense/zeroscope_v2_576w')


.. parsed-literal::

    vae/diffusion_pytorch_model.safetensors not found


.. parsed-literal::

    Loading pipeline components...: 0%| | 0/5 [00:00<?, ?it/s]


.. code:: ipython3

    unet = pipe.unet
    unet.eval()
    vae = pipe.vae
    vae.eval()
    text_encoder = pipe.text_encoder
    text_encoder.eval()
    tokenizer = pipe.tokenizer
    scheduler = pipe.scheduler
    vae_scale_factor = pipe.vae_scale_factor
    unet_in_channels = pipe.unet.config.in_channels
    sample_width = WIDTH // vae_scale_factor
    sample_height = HEIGHT // vae_scale_factor
    del pipe
    gc.collect();
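
With the values above, the latent sample size is the frame size divided by
``vae_scale_factor``. An optional quick inspection of the derived settings
used for conversion below:

.. code:: ipython3

    # Optional: inspect the derived values used for model conversion.
    # With WIDTH=432 and HEIGHT=240 this should print 8 4 54 30.
    print(vae_scale_factor, unet_in_channels, sample_width, sample_height)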

Convert the model `⇑ <#top>`__
###############################################################################################################################

The architecture for generating videos from text comprises three
distinct sub-networks: one for extracting text features, another for
translating text features into the video latent space using a diffusion
model, and a final one for mapping the video latent space to the visual
space. The collective parameters of the entire model amount to
approximately 1.7 billion. It is capable of processing English input. The
diffusion model is built upon the Unet3D model and achieves video
generation by iteratively denoising a starting point of pure Gaussian
noise.

.. image:: 253-zeroscope-text2video-with-output_files/253-zeroscope-text2video-with-output_01_02.png


Define the conversion function `⇑ <#top>`__
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Model components are PyTorch modules that can be converted directly with
the ``ov.convert_model`` function. We also use the ``ov.save_model``
function to serialize the result of the conversion.

.. code:: ipython3

    warnings.filterwarnings("ignore", category=torch.jit.TracerWarning)

.. code:: ipython3

    def convert(model: torch.nn.Module, xml_path: str, **convert_kwargs) -> Path:
        xml_path = Path(xml_path)
        if not xml_path.exists():
            xml_path.parent.mkdir(parents=True, exist_ok=True)
            with torch.no_grad():
                converted_model = ov.convert_model(model, **convert_kwargs)
            ov.save_model(converted_model, xml_path)
            del converted_model
            gc.collect()
            # Clear torch.jit caches to release memory held by the traced model
            torch._C._jit_clear_class_registry()
            torch.jit._recursive.concrete_type_store = torch.jit._recursive.ConcreteTypeStore()
            torch.jit._state._clear_class_state()
        return xml_path

UNet `⇑ <#top>`__
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

The main component of the text-to-video generation pipeline is a
conditional 3D UNet model that takes a noisy sample, conditional state,
and a timestep and returns a sample-shaped output.

.. code:: ipython3

    unet_xml_path = convert(
        unet,
        "models/unet.xml",
        example_input={
            "sample": torch.randn(2, 4, 2, 32, 32),
            "timestep": torch.tensor(1),
            "encoder_hidden_states": torch.randn(2, 77, 1024),
        },
        input=[
            ("sample", (2, 4, NUM_FRAMES, sample_height, sample_width)),
            ("timestep", ()),
            ("encoder_hidden_states", (2, 77, 1024)),
        ],
    )
    del unet
    gc.collect();


.. parsed-literal::

    WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.base has been moved to tensorflow.python.trackable.base. The old module will be deleted in version 2.11.


.. parsed-literal::

    [ WARNING ] Please fix your imports. Module %s has been moved to %s. The old module will be deleted in version %s.


VAE `⇑ <#top>`__
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

The variational autoencoder (VAE) decodes the latents produced by the UNet
into visual representations. Our VAE model has a KL loss for encoding
images into latents and decoding latent representations back into images.
For inference, we need only the decoder part.

.. code:: ipython3

    class VaeDecoderWrapper(torch.nn.Module):
        def __init__(self, vae):
            super().__init__()
            self.vae = vae

        def forward(self, z: torch.FloatTensor):
            return self.vae.decode(z)

.. code:: ipython3

    vae_decoder_xml_path = convert(
        VaeDecoderWrapper(vae),
        "models/vae.xml",
        example_input=torch.randn(2, 4, 32, 32),
        input=((NUM_FRAMES, 4, sample_height, sample_width)),
    )
    del vae
    gc.collect();

Text encoder `⇑ <#top>`__
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

The text encoder is used to encode the input prompt into a tensor. The
default tensor length is 77.

.. code:: ipython3

    text_encoder_xml = convert(
        text_encoder,
        "models/text_encoder.xml",
        example_input=torch.ones(1, 77, dtype=torch.int64),
        input=((1, 77), (ov.Type.i64,)),
    )
    del text_encoder
    gc.collect();
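
The fixed length of 77 comes from the CLIP tokenizer, which pads or
truncates every prompt to ``tokenizer.model_max_length`` tokens. A quick
illustrative check:

.. code:: ipython3

    # Illustrative check: any prompt is padded/truncated to 77 token ids,
    # matching the (1, 77) input shape used for conversion above.
    sample_tokens = tokenizer(
        "A panda eating bamboo on a rock.",
        padding="max_length",
        max_length=tokenizer.model_max_length,
        return_tensors="pt",
    )
    print(sample_tokens.input_ids.shape)  # expected: torch.Size([1, 77])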

Build a pipeline `⇑ <#top>`__
###############################################################################################################################

.. code:: ipython3

    def tensor2vid(video: torch.Tensor, mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]) -> List[np.ndarray]:
        # This code is copied from https://github.com/modelscope/modelscope/blob/1509fdb973e5871f37148a4b5e5964cafd43e64d/modelscope/pipelines/multi_modal/text_to_video_synthesis_pipeline.py#L78
        # reshape to ncfhw
        mean = torch.tensor(mean, device=video.device).reshape(1, -1, 1, 1, 1)
        std = torch.tensor(std, device=video.device).reshape(1, -1, 1, 1, 1)
        # unnormalize back to [0,1]
        video = video.mul_(std).add_(mean)
        video.clamp_(0, 1)
        # prepare the final outputs
        i, c, f, h, w = video.shape
        images = video.permute(2, 3, 0, 4, 1).reshape(
            f, h, i * w, c
        )  # 1st (frames, h, batch_size, w, c) 2nd (frames, h, batch_size * w, c)
        images = images.unbind(dim=0)  # prepare a list of individual (consecutive frames)
        images = [(image.cpu().numpy() * 255).astype("uint8") for image in images]  # f h w c
        return images
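
As a quick illustrative check of what ``tensor2vid`` does: a tensor of
shape ``(batch, channels, frames, height, width)`` with values in
``[-1, 1]`` becomes a list of per-frame ``uint8`` arrays:

.. code:: ipython3

    # Illustrative: 2 tiny random frames in [-1, 1] -> list of (h, w, c) uint8 arrays.
    demo_frames = tensor2vid(torch.rand(1, 3, 2, 8, 8) * 2 - 1)
    print(len(demo_frames), demo_frames[0].shape, demo_frames[0].dtype)  # 2 (8, 8, 3) uint8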

.. code:: ipython3

    class OVTextToVideoSDPipeline(diffusers.DiffusionPipeline):
        def __init__(
            self,
            vae_decoder: ov.CompiledModel,
            text_encoder: ov.CompiledModel,
            tokenizer: transformers.CLIPTokenizer,
            unet: ov.CompiledModel,
            scheduler: diffusers.schedulers.DDIMScheduler,
        ):
            super().__init__()

            self.vae_decoder = vae_decoder
            self.text_encoder = text_encoder
            self.tokenizer = tokenizer
            self.unet = unet
            self.scheduler = scheduler
            self.vae_scale_factor = vae_scale_factor
            self.unet_in_channels = unet_in_channels
            self.width = WIDTH
            self.height = HEIGHT
            self.num_frames = NUM_FRAMES

        def __call__(
            self,
            prompt: Union[str, List[str]] = None,
            num_inference_steps: int = 50,
            guidance_scale: float = 9.0,
            negative_prompt: Optional[Union[str, List[str]]] = None,
            eta: float = 0.0,
            generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
            latents: Optional[torch.FloatTensor] = None,
            prompt_embeds: Optional[torch.FloatTensor] = None,
            negative_prompt_embeds: Optional[torch.FloatTensor] = None,
            output_type: Optional[str] = "np",
            return_dict: bool = True,
            callback: Optional[Callable[[int, int, torch.FloatTensor], None]] = None,
            callback_steps: int = 1,
        ):
            r"""
            Function invoked when calling the pipeline for generation.

            Args:
                prompt (`str` or `List[str]`, *optional*):
                    The prompt or prompts to guide the video generation. If not defined, one has to pass `prompt_embeds`
                    instead.
                num_inference_steps (`int`, *optional*, defaults to 50):
                    The number of denoising steps. More denoising steps usually lead to higher quality videos at the
                    expense of slower inference.
                guidance_scale (`float`, *optional*, defaults to 9.0):
                    Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
                    `guidance_scale` is defined as `w` of equation 2. of [Imagen
                    Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
                    1`. Higher guidance scale encourages generating videos that are closely linked to the text `prompt`,
                    usually at the expense of lower video quality.
                negative_prompt (`str` or `List[str]`, *optional*):
                    The prompt or prompts not to guide the video generation. If not defined, one has to pass
                    `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
                    less than `1`).
                eta (`float`, *optional*, defaults to 0.0):
                    Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to
                    [`schedulers.DDIMScheduler`], will be ignored for others.
                generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
                    One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
                    to make generation deterministic.
                latents (`torch.FloatTensor`, *optional*):
                    Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for video
                    generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
                    tensor will be generated by sampling using the supplied random `generator`. Latents should be of shape
                    `(batch_size, num_channel, num_frames, height, width)`.
                prompt_embeds (`torch.FloatTensor`, *optional*):
                    Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
                    provided, text embeddings will be generated from `prompt` input argument.
                negative_prompt_embeds (`torch.FloatTensor`, *optional*):
                    Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
                    weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
                    argument.
                output_type (`str`, *optional*, defaults to `"np"`):
                    The output format of the generated video. Choose between `torch.FloatTensor` or `np.array`.
                return_dict (`bool`, *optional*, defaults to `True`):
                    Whether or not to return a [`~pipelines.stable_diffusion.TextToVideoSDPipelineOutput`] instead of a
                    plain tuple.
                callback (`Callable`, *optional*):
                    A function that will be called every `callback_steps` steps during inference. The function will be
                    called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
                callback_steps (`int`, *optional*, defaults to 1):
                    The frequency at which the `callback` function will be called. If not specified, the callback will be
                    called at every step.

            Returns:
                `List[np.ndarray]`: generated video frames
            """

            num_images_per_prompt = 1

            # 1. Check inputs. Raise error if not correct
            self.check_inputs(
                prompt,
                callback_steps,
                negative_prompt,
                prompt_embeds,
                negative_prompt_embeds,
            )

            # 2. Define call parameters
            if prompt is not None and isinstance(prompt, str):
                batch_size = 1
            elif prompt is not None and isinstance(prompt, list):
                batch_size = len(prompt)
            else:
                batch_size = prompt_embeds.shape[0]

            # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2)
            # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
            # corresponds to doing no classifier free guidance.
            do_classifier_free_guidance = guidance_scale > 1.0

            # 3. Encode input prompt
            prompt_embeds = self._encode_prompt(
                prompt,
                num_images_per_prompt,
                do_classifier_free_guidance,
                negative_prompt,
                prompt_embeds=prompt_embeds,
                negative_prompt_embeds=negative_prompt_embeds,
            )

            # 4. Prepare timesteps
            self.scheduler.set_timesteps(num_inference_steps)
            timesteps = self.scheduler.timesteps

            # 5. Prepare latent variables
            num_channels_latents = self.unet_in_channels
            latents = self.prepare_latents(
                batch_size * num_images_per_prompt,
                num_channels_latents,
                prompt_embeds.dtype,
                generator,
                latents,
            )

            # 6. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
            extra_step_kwargs = {"generator": generator, "eta": eta}

            # 7. Denoising loop
            num_warmup_steps = len(timesteps) - num_inference_steps * self.scheduler.order
            with self.progress_bar(total=num_inference_steps) as progress_bar:
                for i, t in enumerate(timesteps):
                    # expand the latents if we are doing classifier free guidance
                    latent_model_input = (
                        torch.cat([latents] * 2) if do_classifier_free_guidance else latents
                    )
                    latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

                    # predict the noise residual
                    noise_pred = self.unet(
                        {
                            "sample": latent_model_input,
                            "timestep": t,
                            "encoder_hidden_states": prompt_embeds,
                        }
                    )[0]
                    noise_pred = torch.tensor(noise_pred)

                    # perform guidance
                    if do_classifier_free_guidance:
                        noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
                        noise_pred = noise_pred_uncond + guidance_scale * (
                            noise_pred_text - noise_pred_uncond
                        )

                    # reshape latents
                    bsz, channel, frames, width, height = latents.shape
                    latents = latents.permute(0, 2, 1, 3, 4).reshape(
                        bsz * frames, channel, width, height
                    )
                    noise_pred = noise_pred.permute(0, 2, 1, 3, 4).reshape(
                        bsz * frames, channel, width, height
                    )

                    # compute the previous noisy sample x_t -> x_t-1
                    latents = self.scheduler.step(
                        noise_pred, t, latents, **extra_step_kwargs
                    ).prev_sample

                    # reshape latents back
                    latents = (
                        latents[None, :]
                        .reshape(bsz, frames, channel, width, height)
                        .permute(0, 2, 1, 3, 4)
                    )

                    # call the callback, if provided
                    if i == len(timesteps) - 1 or (
                        (i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0
                    ):
                        progress_bar.update()
                        if callback is not None and i % callback_steps == 0:
                            callback(i, t, latents)

            video_tensor = self.decode_latents(latents)

            if output_type == "pt":
                video = video_tensor
            else:
                video = tensor2vid(video_tensor)

            if not return_dict:
                return (video,)

            return {"frames": video}

        # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline._encode_prompt
        def _encode_prompt(
            self,
            prompt,
            num_images_per_prompt,
            do_classifier_free_guidance,
            negative_prompt=None,
            prompt_embeds: Optional[torch.FloatTensor] = None,
            negative_prompt_embeds: Optional[torch.FloatTensor] = None,
        ):
            r"""
            Encodes the prompt into text encoder hidden states.

            Args:
                prompt (`str` or `List[str]`, *optional*):
                    prompt to be encoded
                num_images_per_prompt (`int`):
                    number of images that should be generated per prompt
                do_classifier_free_guidance (`bool`):
                    whether to use classifier free guidance or not
                negative_prompt (`str` or `List[str]`, *optional*):
                    The prompt or prompts not to guide the image generation. If not defined, one has to pass
                    `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
                    less than `1`).
                prompt_embeds (`torch.FloatTensor`, *optional*):
                    Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
                    provided, text embeddings will be generated from `prompt` input argument.
                negative_prompt_embeds (`torch.FloatTensor`, *optional*):
                    Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
                    weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
                    argument.
            """
            if prompt is not None and isinstance(prompt, str):
                batch_size = 1
            elif prompt is not None and isinstance(prompt, list):
                batch_size = len(prompt)
            else:
                batch_size = prompt_embeds.shape[0]

            if prompt_embeds is None:
                text_inputs = self.tokenizer(
                    prompt,
                    padding="max_length",
                    max_length=self.tokenizer.model_max_length,
                    truncation=True,
                    return_tensors="pt",
                )
                text_input_ids = text_inputs.input_ids
                untruncated_ids = self.tokenizer(
                    prompt, padding="longest", return_tensors="pt"
                ).input_ids

                if untruncated_ids.shape[-1] >= text_input_ids.shape[-1] and not torch.equal(
                    text_input_ids, untruncated_ids
                ):
                    removed_text = self.tokenizer.batch_decode(
                        untruncated_ids[:, self.tokenizer.model_max_length - 1 : -1]
                    )
                    print(
                        "The following part of your input was truncated because CLIP can only handle sequences up to"
                        f" {self.tokenizer.model_max_length} tokens: {removed_text}"
                    )

                prompt_embeds = self.text_encoder(text_input_ids)
                prompt_embeds = prompt_embeds[0]
                prompt_embeds = torch.tensor(prompt_embeds)

            bs_embed, seq_len, _ = prompt_embeds.shape
            # duplicate text embeddings for each generation per prompt, using mps friendly method
            prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
            prompt_embeds = prompt_embeds.view(bs_embed * num_images_per_prompt, seq_len, -1)

            # get unconditional embeddings for classifier free guidance
            if do_classifier_free_guidance and negative_prompt_embeds is None:
                uncond_tokens: List[str]
                if negative_prompt is None:
                    uncond_tokens = [""] * batch_size
                elif type(prompt) is not type(negative_prompt):
                    raise TypeError(
                        f"`negative_prompt` should be the same type as `prompt`, but got {type(negative_prompt)} !="
                        f" {type(prompt)}."
                    )
                elif isinstance(negative_prompt, str):
                    uncond_tokens = [negative_prompt]
                elif batch_size != len(negative_prompt):
                    raise ValueError(
                        f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but `prompt`:"
                        f" {prompt} has batch size {batch_size}. Please make sure that passed `negative_prompt` matches"
                        " the batch size of `prompt`."
                    )
                else:
                    uncond_tokens = negative_prompt

                max_length = prompt_embeds.shape[1]
                uncond_input = self.tokenizer(
                    uncond_tokens,
                    padding="max_length",
                    max_length=max_length,
                    truncation=True,
                    return_tensors="pt",
                )

                negative_prompt_embeds = self.text_encoder(uncond_input.input_ids)
                negative_prompt_embeds = negative_prompt_embeds[0]
                negative_prompt_embeds = torch.tensor(negative_prompt_embeds)

            if do_classifier_free_guidance:
                # duplicate unconditional embeddings for each generation per prompt, using mps friendly method
                seq_len = negative_prompt_embeds.shape[1]

                negative_prompt_embeds = negative_prompt_embeds.repeat(1, num_images_per_prompt, 1)
                negative_prompt_embeds = negative_prompt_embeds.view(
                    batch_size * num_images_per_prompt, seq_len, -1
                )

                # For classifier free guidance, we need to do two forward passes.
                # Here we concatenate the unconditional and text embeddings into a single batch
                # to avoid doing two forward passes
                prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])

            return prompt_embeds

        def prepare_latents(
            self,
            batch_size,
            num_channels_latents,
            dtype,
            generator,
            latents=None,
        ):
            shape = (
                batch_size,
                num_channels_latents,
                self.num_frames,
                self.height // self.vae_scale_factor,
                self.width // self.vae_scale_factor,
            )
            if isinstance(generator, list) and len(generator) != batch_size:
                raise ValueError(
                    f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
                    f" size of {batch_size}. Make sure the batch size matches the length of the generators."
                )

            if latents is None:
                latents = diffusers.utils.randn_tensor(shape, generator=generator, dtype=dtype)

            # scale the initial noise by the standard deviation required by the scheduler
            latents = latents * self.scheduler.init_noise_sigma
            return latents

        def check_inputs(
            self,
            prompt,
            callback_steps,
            negative_prompt=None,
            prompt_embeds=None,
            negative_prompt_embeds=None,
        ):
            if self.height % 8 != 0 or self.width % 8 != 0:
                raise ValueError(
                    f"`height` and `width` have to be divisible by 8 but are {self.height} and {self.width}."
                )

            if (callback_steps is None) or (
                callback_steps is not None
                and (not isinstance(callback_steps, int) or callback_steps <= 0)
            ):
                raise ValueError(
                    f"`callback_steps` has to be a positive integer but is {callback_steps} of type"
                    f" {type(callback_steps)}."
                )

            if prompt is not None and prompt_embeds is not None:
                raise ValueError(
                    f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
                    " only forward one of the two."
                )
            elif prompt is None and prompt_embeds is None:
                raise ValueError(
                    "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
                )
            elif prompt is not None and (not isinstance(prompt, str) and not isinstance(prompt, list)):
                raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")

            if negative_prompt is not None and negative_prompt_embeds is not None:
                raise ValueError(
                    f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
                    f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
                )

            if prompt_embeds is not None and negative_prompt_embeds is not None:
                if prompt_embeds.shape != negative_prompt_embeds.shape:
                    raise ValueError(
                        "`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
                        f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
                        f" {negative_prompt_embeds.shape}."
                    )

        def decode_latents(self, latents):
            scale_factor = 0.18215
            latents = 1 / scale_factor * latents

            batch_size, channels, num_frames, height, width = latents.shape
            latents = latents.permute(0, 2, 1, 3, 4).reshape(
                batch_size * num_frames, channels, height, width
            )
            image = self.vae_decoder(latents)[0]
            image = torch.tensor(image)
            video = (
                image[None, :]
                .reshape(
                    (
                        batch_size,
                        num_frames,
                        -1,
                    )
                    + image.shape[2:]
                )
                .permute(0, 2, 1, 3, 4)
            )
            # we always cast to float32 as this does not cause significant overhead and is compatible with bfloat16
            video = video.float()
            return video
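
For reference, the guidance step inside the denoising loop above implements
classifier-free guidance, combining the unconditional and text-conditioned
noise predictions as

.. math::

   \hat{\epsilon} = \epsilon_{\text{uncond}} + w \cdot \left(\epsilon_{\text{text}} - \epsilon_{\text{uncond}}\right)

where :math:`w` is ``guidance_scale``: :math:`w = 1` disables guidance,
while larger values tie the result more closely to the prompt, usually at
some cost in quality.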

Inference with OpenVINO `⇑ <#top>`__
###############################################################################################################################

.. code:: ipython3

    core = ov.Core()

Select inference device `⇑ <#top>`__
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Select the device from the dropdown list for running inference using OpenVINO.

.. code:: ipython3

    device = widgets.Dropdown(
        options=core.available_devices + ["AUTO"],
        value='AUTO',
        description='Device:',
        disabled=False,
    )

    device


.. parsed-literal::

    Dropdown(description='Device:', index=4, options=('CPU', 'GPU.0', 'GPU.1', 'GPU.2', 'AUTO'), value='AUTO')


.. code:: ipython3

    %%time
    ov_unet = core.compile_model(unet_xml_path, device_name=device.value)


.. parsed-literal::

    CPU times: user 14.1 s, sys: 5.62 s, total: 19.7 s
    Wall time: 10.6 s


.. code:: ipython3

    %%time
    ov_vae_decoder = core.compile_model(vae_decoder_xml_path, device_name=device.value)


.. parsed-literal::

    CPU times: user 456 ms, sys: 320 ms, total: 776 ms
    Wall time: 328 ms


.. code:: ipython3

    %%time
    ov_text_encoder = core.compile_model(text_encoder_xml, device_name=device.value)


.. parsed-literal::

    CPU times: user 1.78 s, sys: 1.44 s, total: 3.22 s
    Wall time: 1.13 s


Here we replace the pipeline parts with their versions converted to OpenVINO
IR and compiled to a specific device. Note that we use the original
pipeline’s tokenizer and scheduler.

.. code:: ipython3

    ov_pipe = OVTextToVideoSDPipeline(ov_vae_decoder, ov_text_encoder, tokenizer, ov_unet, scheduler)

Define a prompt `⇑ <#top>`__
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

.. code:: ipython3

    prompt = "A panda eating bamboo on a rock."

Let’s generate a video for our prompt. For the full list of arguments, see
the ``__call__`` function definition of the ``OVTextToVideoSDPipeline`` class
in the `Build a pipeline <#Build-a-pipeline>`__ section.

Video generation `⇑ <#top>`__
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

.. code:: ipython3

    frames = ov_pipe(prompt, num_inference_steps=25)['frames']


.. parsed-literal::

    0%| | 0/25 [00:00<?, ?it/s]


.. code:: ipython3

    images = [PIL.Image.fromarray(frame) for frame in frames]
    images[0].save("output.gif", save_all=True, append_images=images[1:], duration=125, loop=0)
    with open("output.gif", "rb") as gif_file:
        b64 = f'data:image/gif;base64,{base64.b64encode(gif_file.read()).decode()}'
    IPython.display.HTML(f"<img src=\"{b64}\" />")
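
The generated frames can also be saved individually; an optional snippet
using the already imported ``PIL`` and ``pathlib`` (the directory name is
arbitrary):

.. code:: ipython3

    # Optional: also dump each frame as a separate PNG file for inspection.
    frames_dir = Path("output_frames")
    frames_dir.mkdir(exist_ok=True)
    for i, image in enumerate(images):
        image.save(frames_dir / f"frame_{i:03d}.png")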

.. image:: 253-zeroscope-text2video-with-output_files/253-zeroscope-text2video-with-output_01_03.gif


Interactive demo `⇑ <#top>`__
###############################################################################################################################

.. code:: ipython3

    def generate(
        prompt, seed, num_inference_steps, _=gr.Progress(track_tqdm=True)
    ):
        generator = torch.Generator().manual_seed(seed)
        frames = ov_pipe(
            prompt,
            num_inference_steps=num_inference_steps,
            generator=generator,
        )["frames"]
        out_file = tempfile.NamedTemporaryFile(suffix=".gif", delete=False)
        images = [PIL.Image.fromarray(frame) for frame in frames]
        images[0].save(
            out_file, save_all=True, append_images=images[1:], duration=125, loop=0
        )
        return out_file.name


    demo = gr.Interface(
        generate,
        [
            gr.Textbox(label="Prompt"),
            gr.Slider(0, 1000000, value=42, label="Seed", step=1),
            gr.Slider(10, 50, value=25, label="Number of inference steps", step=1),
        ],
        gr.Image(label="Result"),
        examples=[
            ["An astronaut riding a horse.", 0, 25],
            ["A panda eating bamboo on a rock.", 0, 25],
            ["Spiderman is surfing.", 0, 25],
        ],
        allow_flagging="never",
    )

    try:
        demo.queue().launch(debug=True)
    except Exception:
        demo.queue().launch(share=True, debug=True)
    # if you are launching remotely, specify server_name and server_port
    # demo.launch(server_name='your server name', server_port='server port in int')
    # Read more in the docs: https://gradio.app/docs/
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f9b3abdf1818a885d159961285a1ef96a2c0c0c99d26eac96435b7813e28198d
size 41341

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c0786f897470a25d935d1f5e096132f086c7f96f42d441102f598828d6d39452
size 1366066
@@ -154,115 +154,117 @@ Demos that demonstrate inference on a particular model.

.. dropdown:: Explore more notebooks below.

+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
| Notebook                                                                                                                        | Description                                                                                                                                  | Preview                                   |
+=================================================================================================================================+==============================================================================================================================================+===========================================+
| `201-vision-monodepth <notebooks/201-vision-monodepth-with-output.html>`__ |br| |n201| |br| |c201|                              | Monocular depth estimation with images and video.                                                                                           | |n201-img1|                               |
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
| `202-vision-superresolution-image <notebooks/202-vision-superresolution-image-with-output.html>`__ |br| |n202i| |br| |c202i|    | Upscale raw images with a super resolution model.                                                                                           | |n202i-img1| → |n202i-img2|               |
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
| `202-vision-superresolution-video <notebooks/202-vision-superresolution-video-with-output.html>`__ |br| |n202v| |br| |c202v|    | Turn 360p into 1080p video using a super resolution model.                                                                                  | |n202v-img1| → |n202v-img2|               |
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
| `203-meter-reader <notebooks/203-meter-reader-with-output.html>`__ |br| |n203|                                                  | PaddlePaddle pre-trained models to read industrial meter's value.                                                                           | |n203-img1|                               |
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
| `204-segmenter-semantic-segmentation <notebooks/204-segmenter-semantic-segmentation-with-output.html>`__ |br| |c204|            | Semantic segmentation with OpenVINO™ using Segmenter.                                                                                       | |n204-img1|                               |
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
| `206-vision-paddlegan-anime <notebooks/206-vision-paddlegan-anime-with-output.html>`__                                          | Turn an image into anime using a GAN.                                                                                                       | |n206-img1| → |n206-img2|                 |
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
| `207-vision-paddlegan-superresolution <notebooks/207-vision-paddlegan-superresolution-with-output.html>`__                      | Upscale small images with superresolution using a PaddleGAN model.                                                                          | |n207-img1| → |n207-img2|                 |
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
| `208-optical-character-recognition <notebooks/208-optical-character-recognition-with-output.html>`__                            | Annotate text on images using text recognition resnet.                                                                                      | |n208-img1|                               |
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
| `212-pyannote-speaker-diarization <notebooks/212-pyannote-speaker-diarization-with-output.html>`__                              | Run inference on speaker diarization pipeline.                                                                                              | |n212-img1|                               |
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
| `210-slowfast-video-recognition <notebooks/210-slowfast-video-recognition-with-output.html>`__ |br| |n210|                      | Video Recognition using SlowFast and OpenVINO™                                                                                              | |n210-img1|                               |
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
| `213-question-answering <notebooks/213-question-answering-with-output.html>`__ |br| |n213|                                      | Answer your questions basing on a context.                                                                                                  | |n213-img1|                               |
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
| `214-grammar-correction <notebooks/214-grammar-correction-with-output.html>`__                                                  | Grammatical error correction with OpenVINO.                                                                                                 |                                           |
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
| `216-attention-center <notebooks/216-attention-center-with-output.html>`__                                                      | The attention center model with OpenVINO™                                                                                                   |                                           |
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
| `217-vision-deblur <notebooks/217-vision-deblur-with-output.html>`__ |br| |n217|                                                | Deblur images with DeblurGAN-v2.                                                                                                            | |n217-img1|                               |
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
| `219-knowledge-graphs-conve <notebooks/219-knowledge-graphs-conve-with-output.html>`__ |br| |n219|                              | Optimize the knowledge graph embeddings model (ConvE) with OpenVINO.                                                                        |                                           |
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
| `220-cross-lingual-books-alignment <notebooks/220-cross-lingual-books-alignment-with-output.html>`__ |br| |n220| |br| |c220|    | Cross-lingual Books Alignment With Transformers and OpenVINO™                                                                               | |n220-img1|                               |
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
| `221-machine-translation <notebooks/221-machine-translation-with-output.html>`__ |br| |n221| |br| |c221|                        | Real-time translation from English to German.                                                                                               |                                           |
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
| `222-vision-image-colorization <notebooks/222-vision-image-colorization-with-output.html>`__ |br| |n222|                        | Use pre-trained models to colorize black & white images using OpenVINO.                                                                     | |n222-img1|                               |
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
| `223-text-prediction <notebooks/223-text-prediction-with-output.html>`__ |br| |c223|                                            | Use pre-trained models to perform text prediction on an input sequence.                                                                     | |n223-img1|                               |
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
| `224-3D-segmentation-point-clouds <notebooks/224-3D-segmentation-point-clouds-with-output.html>`__                              | Process point cloud data and run 3D Part Segmentation with OpenVINO.                                                                        | |n224-img1|                               |
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
| `225-stable-diffusion-text-to-image <notebooks/225-stable-diffusion-text-to-image-with-output.html>`__                          | Text-to-image generation with Stable Diffusion method.                                                                                      | |n225-img1|                               |
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
| `226-yolov7-optimization <notebooks/226-yolov7-optimization-with-output.html>`__                                                | Optimize YOLOv7, using NNCF PTQ API.                                                                                                        | |n226-img1|                               |
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
| `227-whisper-subtitles-generation <notebooks/227-whisper-subtitles-generation-with-output.html>`__ |br| |c227|                  | Generate subtitles for video with OpenAI Whisper and OpenVINO.                                                                              | |n227-img1|                               |
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
| `228-clip-zero-shot-convert <notebooks/228-clip-zero-shot-convert-with-output.html>`__                                          | Zero-shot Image Classification with OpenAI CLIP and OpenVINO™                                                                               | |n228-img1|                               |
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
| `228-clip-zero-shot-quantize <notebooks/228-clip-zero-shot-quantize-with-output.html>`__                                        | Post-Training Quantization of OpenAI CLIP model with NNCF                                                                                   | |n228-img2|                               |
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
| `229-distilbert-sequence-classification <notebooks/229-distilbert-sequence-classification-with-output.html>`__ |br| |n229|      | Sequence classification with OpenVINO.                                                                                                      | |n229-img1|                               |
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
| `230-yolov8-optimization <notebooks/230-yolov8-optimization-with-output.html>`__ |br| |c230|                                    | Optimize YOLOv8, using NNCF PTQ API.                                                                                                        | |n230-img1|                               |
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
| `231-instruct-pix2pix-image-editing <notebooks/231-instruct-pix2pix-image-editing-with-output.html>`__                          | Image editing with InstructPix2Pix.                                                                                                         | |n231-img1|                               |
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
|
||||
| `232-clip-language-saliency-map <notebooks/232-clip-language-saliency-map-with-output.html>`__ |br| |c232| | Language-visual saliency with CLIP and OpenVINO™. | |n232-img1| |
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
|
||||
| `233-blip-visual-language-processing <notebooks/233-blip-visual-language-processing-with-output.html>`__ | Visual question answering and image captioning using BLIP and OpenVINO™. | |n233-img1| |
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
|
||||
| `234-encodec-audio-compression <notebooks/234-encodec-audio-compression-with-output.html>`__ | Audio compression with EnCodec and OpenVINO™. | |n234-img1| |
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
|
||||
| `235-controlnet-stable-diffusion <notebooks/235-controlnet-stable-diffusion-with-output.html>`__ | A text-to-image generation with ControlNet Conditioning and OpenVINO™. | |n235-img1| |
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
|
||||
| `236-stable-diffusion-v2 <notebooks/236-stable-diffusion-v2-infinite-zoom-with-output.html>`__ | Text-to-image generation and Infinite Zoom with Stable Diffusion v2 and OpenVINO™. | |n236-img1| |
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
|
||||
| `236-stable-diffusion-v2 <notebooks/236-stable-diffusion-v2-optimum-demo-comparison-with-output.html>`__ | Stable Diffusion v2.1 using Optimum-Intel OpenVINO and multiple Intel Hardware. | |n236-img4| |
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
|
||||
| `236-stable-diffusion-v2 <notebooks/236-stable-diffusion-v2-optimum-demo-with-output.html>`__ | Stable Diffusion v2.1 using Optimum-Intel OpenVINO. | |n236-img4| |
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
|
||||
| `236-stable-diffusion-v2 <notebooks/236-stable-diffusion-v2-text-to-image-demo-with-output.html>`__ | Stable Diffusion Text-to-Image Demo. | |n236-img4| |
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
|
||||
| `236-stable-diffusion-v2 <notebooks/236-stable-diffusion-v2-text-to-image-with-output.html>`__ | Text-to-image generation with Stable Diffusion v2 and OpenVINO™. | |n236-img1| |
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
|
||||
| `237-segment-anything <notebooks/237-segment-anything-with-output.html>`__ | Prompt based object segmentation mask generation, using Segment Anything and OpenVINO™. | |n237-img1| |
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
|
||||
| `238-deep-floyd-if <notebooks/238-deep-floyd-if-with-output.html>`__ | Text-to-image generation with DeepFloyd IF and OpenVINO™. | |n238-img1| |
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
|
||||
| `239-image-bind <notebooks/239-image-bind-convert-with-output.html>`__ | Binding multimodal data, using ImageBind and OpenVINO™. | |n239-img1| |
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
|
||||
| `240-dolly-2-instruction-following <notebooks/240-dolly-2-instruction-following-with-output.html>`__ | Instruction following using Databricks Dolly 2.0 and OpenVINO™. | |n240-img1| |
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
|
||||
| `241-riffusion-text-to-music <notebooks/241-riffusion-text-to-music-with-output.html>`__ | Text-to-Music generation using Riffusion and OpenVINO™. | |n241-img1| |
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
|
||||
| `242-freevc-voice-conversion <notebooks/242-freevc-voice-conversion-with-output.html>`__ | High-Quality Text-Free One-Shot Voice Conversion with FreeVC and OpenVINO™ | |
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
|
||||
| `243-tflite-selfie-segmentation <notebooks/243-tflite-selfie-segmentation-with-output.html>`__ |br| |n243| |br| |c243| | Selfie Segmentation using TFLite and OpenVINO™. | |n243-img1| |
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
|
||||
| `244-named-entity-recognition <notebooks/244-named-entity-recognition-with-output.html>`__ |br| |c244| | Named entity recognition with OpenVINO™. | |
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
|
||||
| `245-typo-detector <notebooks/245-typo-detector-with-output.html>`__ | English Typo Detection in sentences with OpenVINO™. | |n245-img1| |
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
|
||||
| `246-depth-estimation-videpth <notebooks/246-depth-estimation-videpth-with-output.html>`__ | Monocular Visual-Inertial Depth Estimation with OpenVINO™. | |n246-img1| |
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
|
||||
| `247-code-language-id <notebooks/247-code-language-id-with-output.html>`__ |br| |n247| | Identify the programming language used in an arbitrary code snippet. | |
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
|
||||
| `248-stable-diffusion-xl <notebooks/248-stable-diffusion-xl-with-output.html>`__ | Image generation with Stable Diffusion XL and OpenVINO™. | |n248-img1| |
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
|
||||
| `249-oneformer-segmentation <notebooks/249-oneformer-segmentation-with-output.html>`__ | Universal segmentation with OneFormer and OpenVINO™. | |n249-img1| |
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
|
||||
| `250-music-generation <notebooks/250-music-generation-with-output.html>`__ |br| |n250| |br| |c250| | Controllable Music Generation with MusicGen and OpenVINO™. | |n250-img1| |
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
|
||||
| `251-tiny-sd-image-generation <notebooks/251-tiny-sd-image-generation-with-output.html>`__ |br| |c251| | Image Generation with Tiny-SD and OpenVINO™. | |n251-img1| |
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
|
||||
| `252-fastcomposer-image-generation <notebooks/252-fastcomposer-image-generation-with-output.html>`__ | Image generation with FastComposer and OpenVINO™. | |
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
|
||||
| `253-zeroscope-text2video <notebooks/253-zeroscope-text2video-with-output.html>`__ | Text-to video synthesis with ZeroScope and OpenVINO™. | A panda eating bamboo on a rock. |br| |n253-img1| |
|
||||
+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
|
||||
|
||||
|
||||
Model Training
@ -501,6 +503,8 @@ Made with `contributors-img <https://contrib.rocks>`__.
   :target: https://user-images.githubusercontent.com/76463150/260439306-81c81c8d-1f9c-41d0-b881-9491766def8e.png

.. |n251-img1| image:: https://user-images.githubusercontent.com/29454499/260904650-274fc2f9-24d2-46a3-ac3d-d660ec3c9a19.png
   :target: https://user-images.githubusercontent.com/29454499/260904650-274fc2f9-24d2-46a3-ac3d-d660ec3c9a19.png

.. |n253-img1| image:: https://user-images.githubusercontent.com/76161256/261102399-500956d5-4aac-4710-a77c-4df34bcda3be.gif
   :target: https://user-images.githubusercontent.com/76161256/261102399-500956d5-4aac-4710-a77c-4df34bcda3be.gif

.. |n301-img1| image:: https://user-images.githubusercontent.com/15709723/127779607-8fa34947-1c35-4260-8d04-981c41a2a2cc.png
   :target: https://user-images.githubusercontent.com/15709723/127779607-8fa34947-1c35-4260-8d04-981c41a2a2cc.png

.. |n401-img1| image:: https://user-images.githubusercontent.com/4547501/141471665-82b28c86-cf64-4bfe-98b3-c314658f2d96.gif