
Language-Visual Saliency with CLIP and OpenVINO™
================================================

The notebook will cover the following topics:

-  Explanation of a *saliency map* and how it can be used.
-  Overview of the CLIP neural network and its usage in generating
   saliency maps.
-  How to split a neural network into parts for separate inference.
-  How to speed up inference with OpenVINO™ and asynchronous execution.

Saliency Map
------------

A saliency map is a visualization technique that highlights regions of
interest in an image. For example, it can be used to `explain image
classification
predictions <https://academic.oup.com/mnras/article/511/4/5032/6529251#389668570>`__
for a particular label. Here is an example of a saliency map that you
will get in this notebook:

|image0|

CLIP
----

What Is CLIP?
~~~~~~~~~~~~~

CLIP (Contrastive Language–Image Pre-training) is a neural network that
can work with both images and texts. It has been trained to predict
which of a set of randomly sampled text snippets is closest to a given
image, that is, which text describes the image best. Here is a
visualization of the pre-training process:

|image1| `image_source <https://openai.com/blog/clip/>`__

To solve the task, CLIP uses two parts: ``Image Encoder`` and
``Text Encoder``. Both parts are used to produce embeddings, which are
vectors of floating-point numbers, for images and texts, respectively.
Given two vectors, one can define and measure the similarity between
them. A popular method to do so is the ``cosine_similarity``, which is
defined as the dot product of the two vectors divided by the product of
their norms:

.. figure:: https://user-images.githubusercontent.com/29454499/218972165-f61a82f2-9711-4ce6-84b5-58fdd1d80d10.png
   :alt: Cosine similarity formula

   Cosine similarity formula
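
In symbols, for two embedding vectors :math:`a` and :math:`b` (the
symbol names are chosen here only for illustration), the definition
described above can be written as:

.. math::

   \mathrm{sim}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}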

The result can range from :math:`-1` to :math:`1`. A value of :math:`1`
means that the vectors point in the same direction (the text and the
image are similar), :math:`0` means that the vectors are not “connected”
at all, and :math:`-1` means that the vectors have somewhat opposite
“meaning”. To train CLIP, OpenAI uses samples of texts and images and
organizes them so that the first text corresponds to the first image in
the batch, the second text to the second image, and so on. Then, cosine
similarities are measured between all texts and all images, and the
results are put in a matrix. If the matrix has numbers close to
:math:`1` on the diagonal and close to :math:`0` elsewhere, it indicates
that the network is appropriately trained.
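
The similarity matrix described above can be illustrated with a minimal
NumPy sketch. The embeddings below are made-up three-dimensional vectors,
not real CLIP outputs, and ``cosine_similarity_matrix`` is a hypothetical
helper name used only for this illustration:

.. code:: ipython3

    import numpy as np

    # Toy stand-ins for CLIP embeddings (made-up values, not model outputs):
    # row i of each array plays the role of the i-th text / image in a batch.
    text_embeds = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
    image_embeds = np.array([[2.0, 0.1, 0.0], [0.0, 3.0, 0.2], [0.1, 0.0, 0.5]])


    def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
        """Cosine similarities between every row of ``a`` and every row of ``b``."""
        a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
        b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
        return a_unit @ b_unit.T


    similarity = cosine_similarity_matrix(text_embeds, image_embeds)
    print(np.round(similarity, 2))  # large values on the diagonal, small ones elsewhere

Because each row is normalized first, the length of an embedding does not
affect the result; only its direction does.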

How to Build a Saliency Map with CLIP?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Providing an image and a text to CLIP returns two vectors. The cosine
similarity between these vectors is calculated, resulting in a number
between :math:`-1` and :math:`1` that indicates whether the text
describes the image or not. The idea is that *some regions of the image
are closer to the text query* than others, and this difference can be
used to build the saliency map. Here is how it can be done:

1. Compute ``query`` and ``image`` similarity. This will represent the
   *neutral value* :math:`s_0` on the ``saliency map``.
2. Get a random ``crop`` of the image.
3. Compute ``crop`` and ``query`` similarity.
4. Subtract the :math:`s_0` from it. If the value is positive, the
   ``crop`` is closer to the ``query``, and it should be a red region on
   the saliency map. If negative, it should be blue.
5. Update the corresponding region on the ``saliency map``.
6. Repeat steps 2-5 multiple times (``n_iters``).
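
Before involving the model, the steps above can be sketched on synthetic
data. In this minimal sketch, ``toy_similarity`` is a hypothetical
stand-in for the CLIP similarity (it simply returns the mean brightness of
a region), and the image is an artificial array with one bright square.
All sizes and the seed are arbitrary:

.. code:: ipython3

    import numpy as np

    rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

    height, width = 60, 80
    n_iters = 200
    min_crop_size = 10

    # Artificial "image": a bright square on a dark background
    image = np.zeros((height, width))
    image[20:40, 30:50] = 1.0


    def toy_similarity(region: np.ndarray) -> float:
        """Stand-in score (NOT a CLIP similarity): mean brightness of the region."""
        return float(region.mean())


    s0 = toy_similarity(image)  # step 1: neutral value for the whole image
    saliency_map = np.zeros((height, width))

    for _ in range(n_iters):  # step 6: repeat n_iters times
        # step 2: pick a random square crop
        crop_size = int(rng.integers(min_crop_size, min(height, width)))
        x = int(rng.integers(0, width - crop_size + 1))
        y = int(rng.integers(0, height - crop_size + 1))
        crop = image[y : y + crop_size, x : x + crop_size]

        # steps 3-4: score the crop and subtract the neutral value
        delta = toy_similarity(crop) - s0

        # step 5: accumulate the difference over the crop window
        saliency_map[y : y + crop_size, x : x + crop_size] += delta

    print(saliency_map[30, 40], saliency_map[0, 0])  # center of the square vs. a corner

With this toy score, the accumulated value inside the bright square ends
up higher than in the corner of the image, which is the behavior the real
pipeline relies on.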

**Table of contents:**

-  `Initial Implementation with Transformers and
   Pytorch <#initial-implementation-with-transformers-and-pytorch>`__
-  `Separate Text and Visual
   Processing <#separate-text-and-visual-processing>`__
-  `Convert to OpenVINO™ Intermediate Representation (IR)
   Format <#convert-to-openvino-intermediate-representation-ir-format>`__
-  `Inference with OpenVINO™ <#inference-with-openvino>`__

   -  `Select inference device <#select-inference-device>`__

-  `Accelerate Inference with
   AsyncInferQueue <#accelerate-inference-with-asyncinferqueue>`__
-  `Pack the Pipeline into a
   Function <#pack-the-pipeline-into-a-function>`__
-  `Interactive demo with
   Gradio <#interactive-demo-with-gradio>`__
-  `What To Do Next <#what-to-do-next>`__

.. |image0| image:: https://user-images.githubusercontent.com/29454499/218967961-9858efd5-fff2-4eb0-bde9-60852f4b31cb.JPG
.. |image1| image:: https://openaiassets.blob.core.windows.net/$web/clip/draft/20210104b/overview-a.svg

Initial Implementation with Transformers and Pytorch
----------------------------------------------------------------------------------------------

.. code:: ipython3

    # Install requirements
    %pip install -q "openvino>=2023.1.0"
    %pip install -q --extra-index-url https://download.pytorch.org/whl/cpu transformers torch gradio

.. code:: ipython3

    from pathlib import Path
    from typing import Tuple, Union, Optional
    from urllib.request import urlretrieve

    from matplotlib import colors
    import matplotlib.pyplot as plt
    import numpy as np
    import requests
    import torch
    import tqdm
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor


.. parsed-literal::

    2023-09-12 14:10:49.435909: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
    2023-09-12 14:10:49.470573: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
    To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
    2023-09-12 14:10:50.130215: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT


To get the CLIP model, you will use the ``transformers`` library and the
official ``openai/clip-vit-base-patch16`` checkpoint from OpenAI. You
can use any CLIP model from the HuggingFace Hub by simply replacing the
model checkpoint in the cell below.

There are several preprocessing steps required to get text and image
data to the model. Images have to be resized, cropped, and normalized,
and text must be split into tokens that are then replaced by token IDs.
To do that, you will use ``CLIPProcessor``, which encapsulates all the
preprocessing steps.

.. code:: ipython3

    model_checkpoint = "openai/clip-vit-base-patch16"

    model = CLIPModel.from_pretrained(model_checkpoint).eval()
    processor = CLIPProcessor.from_pretrained(model_checkpoint)

Let us write helper functions first. You will generate crop coordinates
and size with ``get_random_crop_params``, and get the actual crop with
``get_cropped_image``. To update the saliency map with the calculated
similarity, you will use ``update_saliency_map``. The
``cosine_similarity`` function is just a code representation of the
formula above.

.. code:: ipython3

    def get_random_crop_params(
        image_height: int, image_width: int, min_crop_size: int
    ) -> Tuple[int, int, int]:
        crop_size = np.random.randint(min_crop_size, min(image_height, image_width))
        x = np.random.randint(image_width - crop_size + 1)
        y = np.random.randint(image_height - crop_size + 1)
        return x, y, crop_size


    def get_cropped_image(
        im_tensor: np.array, x: int, y: int, crop_size: int
    ) -> np.array:
        return im_tensor[
            y : y + crop_size,
            x : x + crop_size,
            ...
        ]


    def update_saliency_map(
        saliency_map: np.array, similarity: float, x: int, y: int, crop_size: int
    ) -> None:
        saliency_map[
            y : y + crop_size,
            x : x + crop_size,
        ] += similarity


    def cosine_similarity(
        one: Union[np.ndarray, torch.Tensor], other: Union[np.ndarray, torch.Tensor]
    ) -> Union[np.ndarray, torch.Tensor]:
        return one @ other.T / (np.linalg.norm(one) * np.linalg.norm(other))

Parameters to be defined:

-  ``n_iters`` - number of times the procedure will be repeated. Larger
   is better, but will require more inference time
-  ``min_crop_size`` - minimum size of the crop window. A smaller size
   will increase the resolution of the saliency map but may require more
   iterations
-  ``query`` - text that will be used to query the image
-  ``image`` - the actual image that will be queried. You will download
   the image from a link

The image at the beginning was acquired with ``n_iters=2000`` and
``min_crop_size=50``. You will start with a lower number of iterations
to get the result faster. It is recommended to experiment with the
parameters at the end, once you have an optimized model.

.. code:: ipython3

    n_iters = 300
    min_crop_size = 50

    query = "Who developed the Theory of General Relativity?"
    image_path = Path("example.jpg")
    urlretrieve("https://www.storypick.com/wp-content/uploads/2016/01/AE-2.jpg", image_path)
    image = Image.open(image_path)
    im_tensor = np.array(image)

    x_dim, y_dim = image.size

Given the ``model`` and ``processor``, the actual inference is simple:
transform the text and image into combined ``inputs`` and pass them to
the model:

.. code:: ipython3

    inputs = processor(text=[query], images=[im_tensor], return_tensors="pt")
    with torch.no_grad():
        results = model(**inputs)
    results.keys()


.. parsed-literal::

    odict_keys(['logits_per_image', 'logits_per_text', 'text_embeds', 'image_embeds', 'text_model_output', 'vision_model_output'])


The model produces several outputs, but for your application, you are
interested in ``text_embeds`` and ``image_embeds``, which are the
vectors for text and image, respectively. Now, you can calculate
``initial_similarity`` between the ``query`` and the ``image``. You also
initialize a saliency map. Numbers in the comments correspond to the
items in the “How to Build a Saliency Map with CLIP?” list above.

.. code:: ipython3

    initial_similarity = cosine_similarity(results.text_embeds, results.image_embeds).item()  # 1. Computing query and image similarity
    saliency_map = np.zeros((y_dim, x_dim))

    for _ in tqdm.notebook.tqdm(range(n_iters)):  # 6. Setting number of the procedure iterations
        x, y, crop_size = get_random_crop_params(y_dim, x_dim, min_crop_size)
        im_crop = get_cropped_image(im_tensor, x, y, crop_size)  # 2. Getting a random crop of the image

        inputs = processor(text=[query], images=[im_crop], return_tensors="pt")
        with torch.no_grad():
            results = model(**inputs)  # 3. Computing crop and query similarity

        similarity = cosine_similarity(results.text_embeds, results.image_embeds).item() - initial_similarity  # 4. Subtracting query and image similarity from crop and query similarity
        update_saliency_map(saliency_map, similarity, x, y, crop_size)  # 5. Updating the region on the saliency map


.. parsed-literal::

      0%|          | 0/300 [00:00<?, ?it/s]


To visualize the resulting saliency map, you can use ``matplotlib``:

.. code:: ipython3

    plt.figure(dpi=150)
    plt.imshow(saliency_map, norm=colors.TwoSlopeNorm(vcenter=0), cmap='jet')
    plt.colorbar(location="bottom")
    plt.title(f'Query: \"{query}\"')
    plt.axis("off")
    plt.show()


.. image:: 232-clip-language-saliency-map-with-output_files/232-clip-language-saliency-map-with-output_15_0.png


The resulting map is not as smooth as in the example picture because of
the lower number of iterations. However, the same red and blue areas are
clearly visible.

Let us overlay the saliency map on the image:

.. code:: ipython3

    def plot_saliency_map(image_tensor: np.ndarray, saliency_map: np.ndarray, query: Optional[str]) -> plt.Figure:
        fig = plt.figure(dpi=150)
        plt.imshow(image_tensor)
        plt.imshow(
            saliency_map,
            norm=colors.TwoSlopeNorm(vcenter=0),
            cmap="jet",
            alpha=0.5,  # make saliency map transparent to see original picture
        )
        if query:
            plt.title(f'Query: "{query}"')
        plt.axis("off")
        return fig


    plot_saliency_map(im_tensor, saliency_map, query);


.. image:: 232-clip-language-saliency-map-with-output_files/232-clip-language-saliency-map-with-output_17_0.png


Separate Text and Visual Processing
-----------------------------------------------------------------------------

The code above is functional, but there are some repeated computations
that can be avoided. The text embedding can be computed once because it
does not depend on the input image. This separation will also be useful
in the future. The initial preparation will remain the same since you
still need to compute the similarity between the text and the full
image. After that, the ``get_image_features`` method could be used to
obtain embeddings for the cropped images.

.. code:: ipython3

    inputs = processor(text=[query], images=[im_tensor], return_tensors="pt")
    with torch.no_grad():
        results = model(**inputs)
    text_embeds = results.text_embeds  # save text embeddings to use them later

    initial_similarity = cosine_similarity(text_embeds, results.image_embeds).item()
    saliency_map = np.zeros((y_dim, x_dim))

    for _ in tqdm.notebook.tqdm(range(n_iters)):
        x, y, crop_size = get_random_crop_params(y_dim, x_dim, min_crop_size)
        im_crop = get_cropped_image(im_tensor, x, y, crop_size)

        image_inputs = processor(images=[im_crop], return_tensors="pt")  # crop preprocessing
        with torch.no_grad():
            image_embeds = model.get_image_features(**image_inputs)  # calculate image embeddings only

        similarity = cosine_similarity(text_embeds, image_embeds).item() - initial_similarity
        update_saliency_map(saliency_map, similarity, x, y, crop_size)

    plot_saliency_map(im_tensor, saliency_map, query);


.. parsed-literal::

      0%|          | 0/300 [00:00<?, ?it/s]


.. image:: 232-clip-language-saliency-map-with-output_files/232-clip-language-saliency-map-with-output_19_1.png


The result might be slightly different because you use random crops to
build a saliency map.
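
If you want repeatable crops while comparing the two implementations, one
option is to seed NumPy's global random generator before each run. This
is a minimal sketch, not part of the original pipeline: the helper is
copied from above so that the cell is self-contained, ``sample_crops`` is
a hypothetical name, and the seed and image size are arbitrary:

.. code:: ipython3

    import numpy as np


    def get_random_crop_params(image_height, image_width, min_crop_size):
        # Same logic as the helper defined earlier in the notebook
        crop_size = np.random.randint(min_crop_size, min(image_height, image_width))
        x = np.random.randint(image_width - crop_size + 1)
        y = np.random.randint(image_height - crop_size + 1)
        return x, y, crop_size


    def sample_crops(seed: int, n: int):
        """Reset the global NumPy generator, then draw ``n`` crop parameter tuples."""
        np.random.seed(seed)
        return [get_random_crop_params(480, 640, 50) for _ in range(n)]


    first_run = sample_crops(seed=0, n=5)
    second_run = sample_crops(seed=0, n=5)
    print(first_run == second_run)  # True: the same seed yields the same crops

Seeding only fixes the crop sequence; small numerical differences between
frameworks can still change the map slightly.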

Convert to OpenVINO™ Intermediate Representation (IR) Format
------------------------------------------------------------------------------------------------------

The process of building a saliency map can be quite time-consuming. To
speed it up, you will use OpenVINO. OpenVINO is an inference framework
designed to run pre-trained neural networks efficiently. One way to use
it is to convert a model from its original framework representation to
an OpenVINO Intermediate Representation (IR) format and then load it for
inference. The model currently uses PyTorch. To get an IR, you need to
use the Model Conversion API. The ``ov.convert_model`` function accepts
a PyTorch model object and an example input and converts the model to an
OpenVINO Model instance, which is ready to be loaded on a device with
``ov.compile_model`` or saved to disk with ``ov.save_model``. To split
the model into text and image parts, you replace its ``forward`` method
with the ``get_text_features`` and ``get_image_features`` methods,
respectively. Internally, PyTorch conversion to OpenVINO involves
TorchScript tracing. To achieve better conversion results, you need to
make sure that the model can be traced successfully. Setting
``model.config.torchscript = True`` prepares HuggingFace models for
TorchScript tracing. More details can be found in the HuggingFace
Transformers
`documentation <https://huggingface.co/docs/transformers/torchscript>`__.

.. code:: ipython3

    import openvino as ov

    model_name = model_checkpoint.split("/")[-1]

    model.config.torchscript = True
    model.forward = model.get_text_features
    text_ov_model = ov.convert_model(
        model,
        example_input={"input_ids": inputs.input_ids, "attention_mask": inputs.attention_mask}
    )

    # get image size after preprocessing from the processor
    crops_info = processor.image_processor.crop_size.values() if hasattr(processor, "image_processor") else processor.feature_extractor.crop_size.values()
    model.forward = model.get_image_features
    image_ov_model = ov.convert_model(
        model,
        example_input={"pixel_values": inputs.pixel_values},
        input=[1,3, *crops_info],
    )

    ov_dir = Path("ir")
    ov_dir.mkdir(exist_ok=True)
    text_model_path = ov_dir / f"{model_name}_text.xml"
    image_model_path = ov_dir / f"{model_name}_image.xml"

    # write resulting models on disk
    ov.save_model(text_ov_model, text_model_path)
    ov.save_model(image_ov_model, image_model_path)


.. parsed-literal::

    WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.base has been moved to tensorflow.python.trackable.base. The old module will be deleted in version 2.11.


.. parsed-literal::

    [ WARNING ]  Please fix your imports. Module %s has been moved to %s. The old module will be deleted in version %s.


.. parsed-literal::

    INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino
    huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
    To disable this warning, you can either:
    	- Avoid using `tokenizers` before the fork if possible
    	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


.. parsed-literal::

    No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
    /home/ea/work/ov_venv/lib/python3.8/site-packages/transformers/models/clip/modeling_clip.py:287: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
      if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
    /home/ea/work/ov_venv/lib/python3.8/site-packages/transformers/models/clip/modeling_clip.py:295: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
      if causal_attention_mask.size() != (bsz, 1, tgt_len, src_len):
    /home/ea/work/ov_venv/lib/python3.8/site-packages/transformers/models/clip/modeling_clip.py:304: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
      if attention_mask.size() != (bsz, 1, tgt_len, src_len):
    /home/ea/work/ov_venv/lib/python3.8/site-packages/transformers/models/clip/modeling_clip.py:327: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
      if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):


Now, you have two separate models for text and images, stored on disk
and ready to be loaded and inferred with OpenVINO™.

Inference with OpenVINO™
------------------------------------------------------------------

1. Create an instance of the ``Core`` object that will handle any
   interaction with OpenVINO runtime for you.
2. Use the ``core.read_model`` method to load the model into memory.
3. Compile the model with the ``core.compile_model`` method for a
   particular device to apply device-specific optimizations.
4. Use the compiled model for inference.

.. code:: ipython3

    core = ov.Core()

    text_model = core.read_model(text_model_path)
    image_model = core.read_model(image_model_path)

Select inference device
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Select a device from the dropdown list to run inference with OpenVINO.

.. code:: ipython3

    import ipywidgets as widgets

    device = widgets.Dropdown(
        options=core.available_devices + ["AUTO"],
        value='AUTO',
        description='Device:',
        disabled=False,
    )

    device


.. parsed-literal::

    Dropdown(description='Device:', index=2, options=('CPU', 'GPU', 'AUTO'), value='AUTO')


.. code:: ipython3

    text_model = core.compile_model(model=text_model, device_name=device.value)
    image_model = core.compile_model(model=image_model, device_name=device.value)

OpenVINO supports ``numpy.ndarray`` as an input type, so you change
``return_tensors`` to ``np``. You also convert the transformers
``BatchEncoding`` object to a Python dictionary with input names as keys
and input tensors as values.

Once you have a compiled model, inference is similar to PyTorch: a
compiled model is callable, so you just pass the input data to it. The
inference results are returned in a dictionary, which the code below
indexes either by position (``[0]``) or by output object
(``image_model.output()``). The rest of the pipeline stays mostly the
same.

.. code:: ipython3

    text_inputs = dict(
        processor(text=[query], images=[im_tensor], return_tensors="np")
    )
    image_inputs = text_inputs.pop("pixel_values")

    text_embeds = text_model(text_inputs)[0]
    image_embeds = image_model(image_inputs)[0]

    initial_similarity = cosine_similarity(text_embeds, image_embeds)
    saliency_map = np.zeros((y_dim, x_dim))

    for _ in tqdm.notebook.tqdm(range(n_iters)):
        x, y, crop_size = get_random_crop_params(y_dim, x_dim, min_crop_size)
        im_crop = get_cropped_image(im_tensor, x, y, crop_size)

        image_inputs = processor(images=[im_crop], return_tensors="np").pixel_values
        image_embeds = image_model(image_inputs)[image_model.output()]

        similarity = cosine_similarity(text_embeds, image_embeds) - initial_similarity
        update_saliency_map(saliency_map, similarity, x, y, crop_size)

    plot_saliency_map(im_tensor, saliency_map, query);


.. parsed-literal::

      0%|          | 0/300 [00:00<?, ?it/s]


.. image:: 232-clip-language-saliency-map-with-output_files/232-clip-language-saliency-map-with-output_29_1.png


Accelerate Inference with ``AsyncInferQueue``
---------------------------------------------------------------------------------------

Up until now, the pipeline was synchronous, which means that data
preparation, model input population, model inference, and output
processing are sequential. That is a simple, but not the most effective
way to organize an inference pipeline in your case. To utilize the
available resources more efficiently, you will use ``AsyncInferQueue``.
It can be instantiated with a compiled model and a number of jobs, that
is, parallel execution threads. If you do not pass a number of jobs or
pass ``0``, OpenVINO will pick the optimal number based on your device
and heuristics. After acquiring the inference queue, you have two jobs
to do:

-  Preprocess the data and push it to the inference queue. The
   preprocessing steps will remain the same.
-  Tell the inference queue what to do with the model output after the
   inference is finished. This is represented by a Python function
   called a ``callback``, which takes an inference result and the data
   that you passed to the inference queue along with the prepared input
   data.

Everything else will be handled by the ``AsyncInferQueue`` instance.
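
The "push a job, get called back with the result and your own data"
pattern is not specific to OpenVINO. As a rough analogy only, here is a
minimal sketch of the same idea with Python's standard
``concurrent.futures`` module. It does not use the OpenVINO API;
``fake_inference`` is a hypothetical stand-in for the model, and the
number of workers and jobs is arbitrary:

.. code:: ipython3

    import time
    from concurrent.futures import Future, ThreadPoolExecutor
    from functools import partial
    from threading import Lock

    results = {}
    results_lock = Lock()  # callbacks may run in different worker threads


    def fake_inference(value: int) -> int:
        """Stand-in for model inference (hypothetical): sleeps, then squares the input."""
        time.sleep(0.01)
        return value * value


    def completion_callback(user_data: dict, future: Future) -> None:
        """Called when a job finishes, with the data submitted alongside the input."""
        with results_lock:
            results[user_data["job_id"]] = future.result()


    with ThreadPoolExecutor(max_workers=4) as pool:  # up to 4 jobs run in parallel
        for job_id in range(8):
            future = pool.submit(fake_inference, job_id)  # push the input to the pool
            future.add_done_callback(partial(completion_callback, {"job_id": job_id}))
    # Leaving the "with" block waits for all jobs and their callbacks to finish

    print(results)

The loop only submits work and never waits for an individual result, so
preparing the next input can overlap with jobs that are still running.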

There is another low-hanging optimization. You are expecting many
inference requests for your image model at once and want the model to
process them as fast as possible. In other words, you want to maximize
the **throughput**. To do that, you can recompile the model with the
``THROUGHPUT`` performance hint.

.. code:: ipython3

    from typing import Dict, Any


    image_model = core.read_model(image_model_path)

    image_model = core.compile_model(
        model=image_model,
        device_name=device.value,
        config={"PERFORMANCE_HINT":"THROUGHPUT"},
    )

.. code:: ipython3

    text_inputs = dict(
        processor(text=[query], images=[im_tensor], return_tensors="np")
    )
    image_inputs = text_inputs.pop("pixel_values")

    text_embeds = text_model(text_inputs)[text_model.output()]
    image_embeds = image_model(image_inputs)[image_model.output()]

    initial_similarity = cosine_similarity(text_embeds, image_embeds)
    saliency_map = np.zeros((y_dim, x_dim))

Your callback should do the same thing that you did after inference in
the sync mode:

-  Pull the image embeddings from an inference request.
-  Compute cosine similarity between text and image embeddings.
-  Update the saliency map based on the computed similarity.

If you do not change the progress bar, it will show the progress of
pushing data to the inference queue. To track the actual progress, you
should pass a progress bar object and call its ``update`` method after
the ``update_saliency_map`` call.

.. code:: ipython3

    def completion_callback(
        infer_request: ov.InferRequest,  # inference result
        user_data: Dict[str, Any],  # data that you passed along with input pixel values
    ) -> None:
        pbar = user_data.pop("pbar")

        image_embeds = infer_request.get_output_tensor().data
        similarity = (
            cosine_similarity(user_data.pop("text_embeds"), image_embeds) - user_data.pop("initial_similarity")
        )
        update_saliency_map(**user_data, similarity=similarity)

        pbar.update(1)  # update the progress bar


    infer_queue = ov.AsyncInferQueue(image_model)
    infer_queue.set_callback(completion_callback)

.. code:: ipython3

    def infer(im_tensor, x_dim, y_dim, text_embeds, image_embeds, initial_similarity, saliency_map, query, n_iters, min_crop_size, _tqdm=tqdm.notebook.tqdm, include_query=True):
        with _tqdm(total=n_iters) as pbar:
            for _ in range(n_iters):
                x, y, crop_size = get_random_crop_params(y_dim, x_dim, min_crop_size)
                im_crop = get_cropped_image(im_tensor, x, y, crop_size)

                image_inputs = processor(images=[im_crop], return_tensors="np")

                # push data to the queue
                infer_queue.start_async(
                    # pass inference data as usual
                    image_inputs.pixel_values,
                    # the data that will be passed to the callback after the inference complete
                    {
                        "text_embeds": text_embeds,
                        "saliency_map": saliency_map,
                        "initial_similarity": initial_similarity,
                        "x": x,
                        "y": y,
                        "crop_size": crop_size,
                        "pbar": pbar,
                    }
                )

            # after you pushed all data to the queue you wait until all callbacks finished
            infer_queue.wait_all()

        return plot_saliency_map(im_tensor, saliency_map, query if include_query else None)


    infer(im_tensor, x_dim, y_dim, text_embeds, image_embeds, initial_similarity, saliency_map, query, n_iters, min_crop_size, _tqdm=tqdm.notebook.tqdm, include_query=True);


.. parsed-literal::

      0%|          | 0/300 [00:00<?, ?it/s]


.. image:: 232-clip-language-saliency-map-with-output_files/232-clip-language-saliency-map-with-output_35_1.png


Pack the Pipeline into a Function
---------------------------------------------------------------------------

Let us wrap all the code in a function and add a user interface to it.

.. code:: ipython3

    import ipywidgets as widgets


    def build_saliency_map(image: Image.Image, query: str, n_iters: int = n_iters, min_crop_size=min_crop_size, _tqdm=tqdm.notebook.tqdm, include_query=True):
        x_dim, y_dim = image.size
        im_tensor = np.array(image)

        text_inputs = dict(
            processor(text=[query], images=[im_tensor], return_tensors="np")
        )
        image_inputs = text_inputs.pop("pixel_values")

        text_embeds = text_model(text_inputs)[text_model.output()]
        image_embeds = image_model(image_inputs)[image_model.output()]

        initial_similarity = cosine_similarity(text_embeds, image_embeds)
        saliency_map = np.zeros((y_dim, x_dim))

        return infer(im_tensor, x_dim, y_dim, text_embeds, image_embeds, initial_similarity, saliency_map, query, n_iters, min_crop_size, _tqdm=_tqdm, include_query=include_query)

The first version will enable passing a link to the image, as you have
done so far in the notebook.

.. code:: ipython3

    n_iters_widget = widgets.BoundedIntText(
        value=n_iters,
        min=1,
        max=10000,
        description="n_iters",
    )
    min_crop_size_widget = widgets.IntSlider(
        value=min_crop_size,
        min=1,
        max=200,
        description="min_crop_size",
    )


    @widgets.interact_manual(image_link="", query="", n_iters=n_iters_widget, min_crop_size=min_crop_size_widget)
    def build_saliency_map_from_image_link(
        image_link: str,
        query: str,
        n_iters: int,
        min_crop_size: int,
    ) -> None:
        try:
            image_bytes = requests.get(image_link, stream=True).raw
        except requests.RequestException as e:
            print(f"Cannot load image from link: {image_link}\nException: {e}")
            return

        image = Image.open(image_bytes)
        image = image.convert("RGB")  # remove transparency channel or convert grayscale 1 channel to 3 channels

        build_saliency_map(image, query, n_iters, min_crop_size)


.. parsed-literal::

    interactive(children=(Text(value='', continuous_update=False, description='image_link'), Text(value='', contin…


The second version will enable loading the image from your computer.

.. code:: ipython3

    import io


    load_file_widget = widgets.FileUpload(
        accept="image/*", multiple=False, description="Image file",
    )


    @widgets.interact_manual(file=load_file_widget, query="", n_iters=n_iters_widget, min_crop_size=min_crop_size_widget)
    def build_saliency_map_from_file(
        file: Path,
        query: str = "",
        n_iters: int = 2000,
        min_crop_size: int = 50,
    ) -> None:
        image_bytes = io.BytesIO(file[0]["content"])
        try:
            image = Image.open(image_bytes)
        except Exception as e:
            print(f"Cannot load the image: {e}")
            return

        image = image.convert("RGB")

        build_saliency_map(image, query, n_iters, min_crop_size)


.. parsed-literal::

    interactive(children=(FileUpload(value=(), accept='image/*', description='Image file'), Text(value='', continu…


Interactive demo with Gradio
----------------------------------------------------------------------

.. code:: ipython3

    import gradio as gr


    def _process(image, query, n_iters, min_crop_size, _=gr.Progress(track_tqdm=True)):
        saliency_map = build_saliency_map(image, query, n_iters, min_crop_size, _tqdm=tqdm.tqdm, include_query=False)

        return saliency_map


    demo = gr.Interface(
        _process,
        [
            gr.Image(label="Image", type="pil"),
            gr.Textbox(label="Query"),
            gr.Slider(1, 10000, n_iters, label="Number of iterations"),
            gr.Slider(1, 200, min_crop_size, label="Minimum crop size"),
        ],
        gr.Plot(label="Result"),
        examples=[[image_path, query]],
    )
    try:
        demo.queue().launch(debug=False)
    except Exception:
        demo.queue().launch(share=True, debug=False)
    # if you are launching remotely, specify server_name and server_port
    # demo.launch(server_name='your server name', server_port='server port in int')
    # Read more in the docs: https://gradio.app/docs/


.. parsed-literal::

    Running on local URL:  http://127.0.0.1:7860

    To create a public link, set `share=True` in `launch()`.




What To Do Next
---------------------------------------------------------

Now that you have a convenient interface and accelerated inference, you
can explore the CLIP capabilities further. For example:

-  Can CLIP read? Can it detect text regions in general and specific
   words on the image?
-  Which famous people and places does CLIP know?
-  Can CLIP identify places on a map? Or planets, stars, and
   constellations?
-  Explore different CLIP models from HuggingFace Hub: just change the
   ``model_checkpoint`` at the beginning of the notebook.
-  Add batch processing to the pipeline: modify
   ``get_random_crop_params``, ``get_cropped_image`` and
   ``update_saliency_map`` functions to process multiple crop images at
   once and accelerate the pipeline even more.
-  Optimize models with
   `NNCF <https://docs.openvino.ai/nightly/basic_quantization_flow.html>`__
   to get further acceleration. You can find an example of how to
   quantize a CLIP model in `this
   notebook <228-clip-zero-shot-image-classification-with-output.html>`__.
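
As a starting point for the batch processing idea in the list above, here
is a minimal NumPy-only sketch of the post-processing side. The function
names ``cosine_similarity_batch`` and ``update_saliency_map_batch`` are
hypothetical, and the embeddings are made-up values rather than model
outputs. Note that the image model in this notebook was converted with a
batch size of 1, so the conversion step would also need to change before
real batches could be inferred:

.. code:: ipython3

    import numpy as np


    def cosine_similarity_batch(text_embeds: np.ndarray, image_embeds: np.ndarray) -> np.ndarray:
        """Cosine similarity between one text embedding (1, D) and N image embeddings (N, D)."""
        text_unit = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
        image_unit = image_embeds / np.linalg.norm(image_embeds, axis=-1, keepdims=True)
        return (image_unit @ text_unit.T).ravel()  # shape (N,)


    def update_saliency_map_batch(saliency_map: np.ndarray, similarities: np.ndarray, crop_params) -> None:
        """Add each crop's similarity to its own window of the saliency map."""
        for similarity, (x, y, crop_size) in zip(similarities, crop_params):
            saliency_map[y : y + crop_size, x : x + crop_size] += similarity


    # Made-up embeddings standing in for one text and a batch of three crops
    text_embeds = np.array([[1.0, 0.0]])
    image_embeds = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
    crop_params = [(0, 0, 2), (1, 1, 2), (2, 2, 2)]  # (x, y, crop_size) per crop

    saliency_map = np.zeros((4, 4))
    similarities = cosine_similarity_batch(text_embeds, image_embeds)
    update_saliency_map_batch(saliency_map, similarities, crop_params)
    print(np.round(saliency_map, 2))

In the real pipeline, you would subtract ``initial_similarity`` from
``similarities`` before updating the map, as in the single-crop version.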