From b9fe465cf0ce8bcacf757153b24c8a6d1c468156 Mon Sep 17 00:00:00 2001
From: Gabriele Galiero Casay
Date: Fri, 7 May 2021 13:06:28 +0200
Subject: [PATCH] BatchNormInference specification refactoring (#5489)

* BatchNormInference specification refactoring
* Address review comments
* Remove the term Transform from definition
* Add title of the paper where this operation is introduced
* Add missing backticks
* Remove redundant information in attribute epsilon range of values
* Refinement of spec
  Remove more mentions to transformation to avoid confusion
* Corrected typos and added changes to improve readability
* Use third person to express operation steps
---
 .../ops/normalization/BatchNormInference_1.md | 109 ++++++++++++-----
 .../ops/normalization/BatchNormInference_5.md | 112 +++++++++++++-----
 2 files changed, 163 insertions(+), 58 deletions(-)

diff --git a/docs/ops/normalization/BatchNormInference_1.md b/docs/ops/normalization/BatchNormInference_1.md
index d7bf1f59edd..218111575bd 100644
--- a/docs/ops/normalization/BatchNormInference_1.md
+++ b/docs/ops/normalization/BatchNormInference_1.md
@@ -4,39 +4,33 @@
 
 **Category**: *Normalization*
 
-**Short description**: *BatchNormInference* layer normalizes a `input` tensor by `mean` and `variance`, and applies a scale (`gamma`) to it, as well as an offset (`beta`).
+**Short description**: *BatchNormInference* performs the Batch Normalization operation described in the [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/abs/1502.03167v2) article.
 
-**Attributes**:
+**Detailed Description**
 
-* *epsilon*
-  * **Description**: *epsilon* is the number to be added to the variance to avoid division by zero when normalizing a value. For example, *epsilon* equal to 0.001 means that 0.001 is added to the variance.
-  * **Range of values**: a positive floating-point number
-  * **Type**: `float`
-  * **Default value**: None
-  * **Required**: *yes*
+*BatchNormInference* performs the following operations on a given data batch input tensor `data`:
 
-**Inputs**
+* Normalizes each activation \f$x^{(k)}\f$ by the mean and variance.
+\f[
+    \hat{x}^{(k)}=\frac{x^{(k)} - E[x^{(k)}]}{\sqrt{Var(x^{(k)}) + \epsilon}}
+\f]
+where \f$E[x^{(k)}]\f$ and \f$Var(x^{(k)})\f$ are the mean and variance, calculated per channel axis of the `data` input, and correspond to the `mean` and `variance` inputs, respectively. Additionally, \f$\epsilon\f$ is a value added to the variance for numerical stability and corresponds to the `epsilon` attribute.
 
-* **1**: `input` - input tensor with data for normalization. At least a 2D tensor of type T, the second dimension represents the channel axis and must have a span of at least 1. **Required.**
-* **2**: `gamma` - gamma scaling for normalized value. A 1D tensor of type T with the same span as input's channel axis. **Required.**
-* **3**: `beta` - bias added to the scaled normalized value. A 1D tensor of type T with the same span as input's channel axis.. **Required.**
-* **4**: `mean` - value for mean normalization. A 1D tensor of type T with the same span as input's channel axis.. **Required.**
-* **5**: `variance` - value for variance normalization. A 1D tensor of type T with the same span as input's channel axis.. **Required.**
-
-**Outputs**
-
-* **1**: The result of normalization. A tensor of the same type and shape with 1st input tensor.
-
-**Types**
-
-* *T*: any numeric type.
+* Performs linear transformation of each normalized activation based on the `gamma` and `beta` inputs, representing the scaling factor and shift, respectively.
+\f[
+    \hat{y}^{(k)}=\gamma^{(k)}\hat{x}^{(k)} + \beta^{(k)}
+\f]
+where \f$\gamma^{(k)}\f$ and \f$\beta^{(k)}\f$ are learnable parameters, calculated per channel axis, and correspond to the `gamma` and `beta` inputs.
 
 **Mathematical Formulation**
 
-*BatchNormInference* normalizes the output in each hidden layer.
+Let `x` be a *d*-dimensional input, \f$x=(x_{1}\dotsc x_{d})\f$. Since normalization is applied to each activation \f$x^{(k)}\f$, you can focus on a particular activation and omit \f$k\f$.
+
+For a particular activation, consider a mini-batch \f$\mathcal{B}\f$ of \f$m\f$ values. *BatchNormInference* performs the Batch Normalization algorithm as follows:
+
 * **Input**: Values of \f$x\f$ over a mini-batch:
 \f[
-    \beta = \{ x_{1...m} \}
+    \mathcal{B} = \{ x_{1...m} \}
 \f]
 * **Parameters to learn**: \f$ \gamma, \beta\f$
 * **Output**:
@@ -45,22 +39,81 @@
 \f]
 * **Mini-batch mean**:
 \f[
-    \mu_{\beta} \leftarrow \frac{1}{m}\sum_{i=1}^{m}b_{i}
+    \mu_{\mathcal{B}} \leftarrow \frac{1}{m}\sum_{i=1}^{m}b_{i}
 \f]
 * **Mini-batch variance**:
 \f[
-    \sigma_{\beta }^{2}\leftarrow \frac{1}{m}\sum_{i=1}^{m} ( b_{i} - \mu_{\beta} )^{2}
+    \sigma_{\mathcal{B}}^{2}\leftarrow \frac{1}{m}\sum_{i=1}^{m} ( b_{i} - \mu_{\mathcal{B}})^{2}
 \f]
 * **Normalize**:
 \f[
-    \hat{b_{i}} \leftarrow \frac{b_{i} - \mu_{\beta}}{\sqrt{\sigma_{\beta }^{2} + \epsilon }}
+    \hat{b_{i}} \leftarrow \frac{b_{i} - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^{2} + \epsilon }}
 \f]
 * **Scale and shift**:
 \f[
    o_{i} \leftarrow \gamma\hat{b_{i}} + \beta = BN_{\gamma ,\beta } ( b_{i} )
 \f]
 
-**Example**
+**Attributes**:
+
+* *epsilon*
+  * **Description**: *epsilon* is a constant added to the variance for numerical stability.
+  * **Range of values**: a positive floating-point number
+  * **Type**: `float`
+  * **Default value**: none
+  * **Required**: *yes*
+
+**Inputs**
+
+* **1**: `data` - A tensor of type *T* with rank of at least 2. The second dimension represents the channel axis and must have a span of at least 1. **Required.**
+* **2**: `gamma` - Scaling factor for normalized value. A 1D tensor of type *T* with the same span as the `data` channel axis. **Required.**
+* **3**: `beta` - Bias added to the scaled normalized value. A 1D tensor of type *T* with the same span as the `data` channel axis. **Required.**
+* **4**: `mean` - Value for mean normalization. A 1D tensor of type *T* with the same span as the `data` channel axis. **Required.**
+* **5**: `variance` - Value for variance normalization. A 1D tensor of type *T* with the same span as the `data` channel axis. **Required.**
+
+**Outputs**
+
+* **1**: The result of the element-wise Batch Normalization operation applied to the input tensor `data`. A tensor of type *T* with the same shape as the `data` input tensor.
+
+**Types**
+
+* *T*: any supported floating-point type.
+
+**Examples**
+
+*Example: 2D input tensor `data`*
+
+```xml
+<layer ... type="BatchNormInference" ...>
+    <data epsilon="..."/>
+    <input>
+        <port id="0">  <!-- data -->
+            <dim>10</dim>
+            <dim>128</dim>
+        </port>
+        <port id="1">  <!-- gamma -->
+            <dim>128</dim>
+        </port>
+        <port id="2">  <!-- beta -->
+            <dim>128</dim>
+        </port>
+        <port id="3">  <!-- mean -->
+            <dim>128</dim>
+        </port>
+        <port id="4">  <!-- variance -->
+            <dim>128</dim>
+        </port>
+    </input>
+    <output>
+        <port id="5">
+            <dim>10</dim>
+            <dim>128</dim>
+        </port>
+    </output>
+</layer>
+```
+
+*Example: 4D input tensor `data`*
 
 ```xml
diff --git a/docs/ops/normalization/BatchNormInference_5.md b/docs/ops/normalization/BatchNormInference_5.md
index aab4daee36c..cec26e4b2ec 100644
--- a/docs/ops/normalization/BatchNormInference_5.md
+++ b/docs/ops/normalization/BatchNormInference_5.md
@@ -1,42 +1,36 @@
 ## BatchNormInference {#openvino_docs_ops_normalization_BatchNormInference_5}
 
-**Versioned name**: *BatchNormInference-5
+**Versioned name**: *BatchNormInference-5*
 
 **Category**: *Normalization*
 
-**Short description**: *BatchNormInference* layer normalizes a `input` tensor by `mean` and `variance`, and applies a scale (`gamma`) to it, as well as an offset (`beta`).
+**Short description**: *BatchNormInference* performs the Batch Normalization operation described in the [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/abs/1502.03167v2) article.
 
-**Attributes**:
+**Detailed Description**
 
-* *epsilon*
-  * **Description**: *epsilon* is the number to be added to the variance to avoid division by zero when normalizing a value. For example, *epsilon* equal to 0.001 means that 0.001 is added to the variance.
-  * **Range of values**: a positive floating-point number
-  * **Type**: `float`
-  * **Default value**: None
-  * **Required**: *yes*
+*BatchNormInference* performs the following operations on a given data batch input tensor `data`:
 
-**Inputs**
+* Normalizes each activation \f$x^{(k)}\f$ by the mean and variance.
+\f[
+    \hat{x}^{(k)}=\frac{x^{(k)} - E[x^{(k)}]}{\sqrt{Var(x^{(k)}) + \epsilon}}
+\f]
+where \f$E[x^{(k)}]\f$ and \f$Var(x^{(k)})\f$ are the mean and variance, calculated per channel axis of the `data` input, and correspond to the `mean` and `variance` inputs, respectively. Additionally, \f$\epsilon\f$ is a value added to the variance for numerical stability and corresponds to the `epsilon` attribute.
 
-* **1**: `input` - input tensor with data for normalization. At least a 2D tensor of type T, the second dimension represents the channel axis and must have a span of at least 1. **Required.**
-* **2**: `gamma` - gamma scaling for normalized value. A 1D tensor of type T with the same span as input's channel axis. **Required.**
-* **3**: `beta` - bias added to the scaled normalized value. A 1D tensor of type T with the same span as input's channel axis.. **Required.**
-* **4**: `mean` - value for mean normalization. A 1D tensor of type T with the same span as input's channel axis.. **Required.**
-* **5**: `variance` - value for variance normalization. A 1D tensor of type T with the same span as input's channel axis.. **Required.**
-
-**Outputs**
-
-* **1**: The result of normalization. A tensor of the same type and shape with 1st input tensor.
-
-**Types**
-
-* *T*: any numeric type.
+* Performs linear transformation of each normalized activation based on the `gamma` and `beta` inputs, representing the scaling factor and shift, respectively.
+\f[
+    \hat{y}^{(k)}=\gamma^{(k)}\hat{x}^{(k)} + \beta^{(k)}
+\f]
+where \f$\gamma^{(k)}\f$ and \f$\beta^{(k)}\f$ are learnable parameters, calculated per channel axis, and correspond to the `gamma` and `beta` inputs.
 
 **Mathematical Formulation**
 
-*BatchNormInference* normalizes the output in each hidden layer.
+Let `x` be a *d*-dimensional input, \f$x=(x_{1}\dotsc x_{d})\f$. Since normalization is applied to each activation \f$x^{(k)}\f$, you can focus on a particular activation and omit \f$k\f$.
+
+For a particular activation, consider a mini-batch \f$\mathcal{B}\f$ of \f$m\f$ values. *BatchNormInference* performs the Batch Normalization algorithm as follows:
+
 * **Input**: Values of \f$x\f$ over a mini-batch:
 \f[
-    \beta = \{ x_{1...m} \}
+    \mathcal{B} = \{ x_{1...m} \}
 \f]
 * **Parameters to learn**: \f$ \gamma, \beta\f$
 * **Output**:
@@ -45,22 +39,81 @@
 \f]
 * **Mini-batch mean**:
 \f[
-    \mu_{\beta} \leftarrow \frac{1}{m}\sum_{i=1}^{m}b_{i}
+    \mu_{\mathcal{B}} \leftarrow \frac{1}{m}\sum_{i=1}^{m}b_{i}
 \f]
 * **Mini-batch variance**:
 \f[
-    \sigma_{\beta }^{2}\leftarrow \frac{1}{m}\sum_{i=1}^{m} ( b_{i} - \mu_{\beta} )^{2}
+    \sigma_{\mathcal{B}}^{2}\leftarrow \frac{1}{m}\sum_{i=1}^{m} ( b_{i} - \mu_{\mathcal{B}})^{2}
 \f]
 * **Normalize**:
 \f[
-    \hat{b_{i}} \leftarrow \frac{b_{i} - \mu_{\beta}}{\sqrt{\sigma_{\beta }^{2} + \epsilon }}
+    \hat{b_{i}} \leftarrow \frac{b_{i} - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^{2} + \epsilon }}
 \f]
 * **Scale and shift**:
 \f[
    o_{i} \leftarrow \gamma\hat{b_{i}} + \beta = BN_{\gamma ,\beta } ( b_{i} )
 \f]
 
-**Example**
+**Attributes**:
+
+* *epsilon*
+  * **Description**: *epsilon* is a constant added to the variance for numerical stability.
+  * **Range of values**: a positive floating-point number
+  * **Type**: `float`
+  * **Default value**: none
+  * **Required**: *yes*
+
+**Inputs**
+
+* **1**: `data` - A tensor of type *T* with rank of at least 2. The second dimension represents the channel axis and must have a span of at least 1. **Required.**
+* **2**: `gamma` - Scaling factor for normalized value. A 1D tensor of type *T* with the same span as the `data` channel axis. **Required.**
+* **3**: `beta` - Bias added to the scaled normalized value. A 1D tensor of type *T* with the same span as the `data` channel axis. **Required.**
+* **4**: `mean` - Value for mean normalization. A 1D tensor of type *T* with the same span as the `data` channel axis. **Required.**
+* **5**: `variance` - Value for variance normalization. A 1D tensor of type *T* with the same span as the `data` channel axis. **Required.**
+
+**Outputs**
+
+* **1**: The result of the element-wise Batch Normalization operation applied to the input tensor `data`. A tensor of type *T* with the same shape as the `data` input tensor.
+
+**Types**
+
+* *T*: any supported floating-point type.
+
+**Examples**
+
+*Example: 2D input tensor `data`*
+
+```xml
+<layer ... type="BatchNormInference" ...>
+    <data epsilon="..."/>
+    <input>
+        <port id="0">  <!-- data -->
+            <dim>10</dim>
+            <dim>128</dim>
+        </port>
+        <port id="1">  <!-- gamma -->
+            <dim>128</dim>
+        </port>
+        <port id="2">  <!-- beta -->
+            <dim>128</dim>
+        </port>
+        <port id="3">  <!-- mean -->
+            <dim>128</dim>
+        </port>
+        <port id="4">  <!-- variance -->
+            <dim>128</dim>
+        </port>
+    </input>
+    <output>
+        <port id="5">
+            <dim>10</dim>
+            <dim>128</dim>
+        </port>
+    </output>
+</layer>
+```
+
+*Example: 4D input tensor `data`*
 
 ```xml
@@ -95,4 +148,3 @@
 </layer>
 ```
 
-
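
For reference, the computation that the refactored specification describes can be sketched in a few lines of NumPy. This is an illustrative sketch only, not part of the patch or of the OpenVINO API: the function name, the chosen `epsilon` value, and the placeholder statistics are assumptions, and the only fixed convention taken from the spec is that the channel axis is the second dimension of `data`.

```python
import numpy as np

def batch_norm_inference(data, gamma, beta, mean, variance, epsilon=1e-3):
    """Sketch of BatchNormInference: y = gamma * (x - mean) / sqrt(variance + epsilon) + beta."""
    # Reshape the 1D per-channel vectors so they broadcast along the channel axis (axis 1).
    shape = [1, data.shape[1]] + [1] * (data.ndim - 2)
    x_hat = (data - mean.reshape(shape)) / np.sqrt(variance.reshape(shape) + epsilon)  # normalize
    return gamma.reshape(shape) * x_hat + beta.reshape(shape)                          # scale and shift

# 2D case from the example above: `data` of shape [10, 128], channel span 128.
data = np.random.rand(10, 128).astype(np.float32)
gamma = np.ones(128, dtype=np.float32)      # scaling factor
beta = np.zeros(128, dtype=np.float32)      # shift
mean = np.zeros(128, dtype=np.float32)      # precomputed per-channel mean
variance = np.ones(128, dtype=np.float32)   # precomputed per-channel variance
out = batch_norm_inference(data, gamma, beta, mean, variance)
assert out.shape == data.shape
```

Note that at inference time `mean` and `variance` are supplied as inputs (typically the statistics accumulated during training) rather than recomputed from the current batch.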