Added opset docs (#992)

This commit is contained in:
Andrey Zaytsev
2020-06-19 14:39:57 +03:00
committed by GitHub
parent a5e5af068f
commit d67371617a
141 changed files with 11574 additions and 0 deletions


@@ -0,0 +1,121 @@
## DeformablePSROIPooling <a name="DeformablePSROIPooling"></a>
**Versioned name**: *DeformablePSROIPooling-1*
**Category**: Object detection
**Short description**: *DeformablePSROIPooling* computes position-sensitive pooling on regions of interest specified by input.
**Detailed description**: [Reference](https://arxiv.org/abs/1703.06211).
*DeformablePSROIPooling* operation takes two or three input tensors: feature maps, regions of interest (box coordinates), and an optional tensor with transformation values.
The box coordinates are specified as five element tuples: *[batch_id, x_1, y_1, x_2, y_2]* in absolute values.
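As a rough illustration of how the *spatial_scale* attribute relates absolute box coordinates to the feature map, the sketch below multiplies each coordinate by the scale factor. The helper name `roi_to_feature_map` is hypothetical, and any rounding or bin alignment that a real implementation performs is deliberately left out.

```python
def roi_to_feature_map(roi, spatial_scale):
    """Scale one [batch_id, x_1, y_1, x_2, y_2] box to feature-map coordinates.

    Hypothetical helper: it only applies the multiplicative factor and
    assumes no extra rounding or alignment.
    """
    batch_id, x1, y1, x2, y2 = roi
    return (
        int(batch_id),
        x1 * spatial_scale,
        y1 * spatial_scale,
        x2 * spatial_scale,
        y2 * spatial_scale,
    )


# A box given in absolute image coordinates, scaled by 1/16.
print(roi_to_feature_map([0, 32.0, 64.0, 160.0, 320.0], 0.0625))
# (0, 2.0, 4.0, 10.0, 20.0)
```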
**Attributes**
* *output_dim*
* **Description**: *output_dim* is a pooled output channel number.
* **Range of values**: a positive integer
* **Type**: `int`
* **Default value**: None
* **Required**: *yes*
* *group_size*
* **Description**: *group_size* is the number of groups to encode position-sensitive score maps.
* **Range of values**: a positive integer
* **Type**: `int`
* **Default value**: 1
* **Required**: *no*
* *spatial_scale*
* **Description**: *spatial_scale* is a multiplicative spatial scale factor to translate ROI coordinates from their input scale to the scale used when pooling.
* **Range of values**: a positive floating-point number
* **Type**: `float`
* **Default value**: None
* **Required**: *yes*
* *mode*
* **Description**: *mode* specifies mode for pooling.
* **Range of values**:
* *bilinear_deformable* - perform pooling with bilinear interpolation and deformable transformation
* **Type**: string
* **Default value**: *bilinear_deformable*
* **Required**: *no*
* *spatial_bins_x*
* **Description**: *spatial_bins_x* specifies the number of bins to divide the input feature maps into over width.
* **Range of values**: a positive integer
* **Type**: `int`
* **Default value**: 1
* **Required**: *no*
* *spatial_bins_y*
* **Description**: *spatial_bins_y* specifies the number of bins to divide the input feature maps into over height.
* **Range of values**: a positive integer
* **Type**: `int`
* **Default value**: 1
* **Required**: *no*
* *trans_std*
* **Description**: *trans_std* is the value that all transformation (offset) values are multiplied with.
* **Range of values**: floating point number
* **Type**: `float`
* **Default value**: 1
* **Required**: *no*
* *part_size*
* **Description**: *part_size* is the number of parts the output tensor spatial dimensions are divided into. It equals the height and width of the third input with transformation values.
* **Range of values**: positive integer number
* **Type**: `int`
* **Default value**: 1
* **Required**: *no*
**Inputs**:
* **1**: 4D input tensor with feature maps. Required.
* **2**: 2D input tensor describing box consisting of five element tuples: `[batch_id, x_1, y_1, x_2, y_2]`. Required.
* **3**: 4D input blob with transformation [values](https://arxiv.org/abs/1703.06211) (offsets). Optional.
**Outputs**:
* **1**: 4D output tensor with areas copied and interpolated from the 1st input tensor by coordinates of boxes from the 2nd input and transformed according to values from the 3rd input.
**Example**
```xml
<layer ... type="DeformablePSROIPooling" ... >
<data group_size="7" mode="bilinear_deformable" no_trans="False" output_dim="8" part_size="7" pooled_height="7" pooled_width="7" spatial_bins_x="4" spatial_bins_y="4" spatial_scale="0.0625" trans_std="0.1"/>
<input>
<port id="0">
<dim>1</dim>
<dim>392</dim>
<dim>38</dim>
<dim>63</dim>
</port>
<port id="1">
<dim>300</dim>
<dim>5</dim>
</port>
<port id="2">
<dim>300</dim>
<dim>2</dim>
<dim>7</dim>
<dim>7</dim>
</port>
</input>
<output>
<port id="3" precision="FP32">
<dim>300</dim>
<dim>8</dim>
<dim>7</dim>
<dim>7</dim>
</port>
</output>
</layer>
```


@@ -0,0 +1,153 @@
## DetectionOutput <a name="DetectionOutput"></a>
**Versioned name**: *DetectionOutput-1*
**Category**: *Object detection*
**Short description**: *DetectionOutput* performs non-maximum suppression to generate the detection output using information on location and confidence predictions.
**Detailed description**: [Reference](https://arxiv.org/pdf/1512.02325.pdf). The layer has 3 mandatory inputs: tensor with box logits, tensor with confidence predictions and tensor with box coordinates (proposals). It can have 2 additional inputs with additional confidence predictions and box coordinates described in the [article](https://arxiv.org/pdf/1711.06897.pdf). The 5-input version of the layer is supported with Myriad plugin only. The output tensor contains information about filtered detections described with 7 element tuples: *[batch_id, class_id, confidence, x_1, y_1, x_2, y_2]*. The first tuple with *batch_id* equal to *-1* means end of output.
At each feature map cell, *DetectionOutput* predicts the offsets relative to the default box shapes in the cell, as well as the per-class scores that indicate the presence of a class instance in each of those boxes. Specifically, for each box out of k at a given location, *DetectionOutput* computes class scores and the four offsets relative to the original default box shape. This results in a total of \f$(c + 4)k\f$ filters that are applied around each location in the feature map, yielding \f$(c + 4)kmn\f$ outputs for a *m \* n* feature map.
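The two points above can be sketched in a few lines of Python: the \f$(c + 4)kmn\f$ output count, and reading the 7-element output tuples until the *batch_id* equal to *-1* marker. Both function names are hypothetical, and the rows are made-up sample data, not real network output.

```python
def outputs_per_feature_map(c, k, m, n):
    """Total outputs for c classes, k boxes per cell and an m x n feature map."""
    filters_per_location = (c + 4) * k
    return filters_per_location * m * n


def read_detections(rows):
    """Collect 7-element detection tuples until the batch_id == -1 marker."""
    detections = []
    for batch_id, class_id, confidence, x1, y1, x2, y2 in rows:
        if batch_id == -1:
            break  # end of output, remaining rows are ignored
        detections.append({
            "batch_id": int(batch_id),
            "class_id": int(class_id),
            "confidence": float(confidence),
            "box": (x1, y1, x2, y2),
        })
    return detections


# Made-up sample rows: two detections, the end marker, then padding.
rows = [
    [0, 15, 0.92, 0.10, 0.20, 0.45, 0.80],
    [0, 7, 0.51, 0.50, 0.30, 0.90, 0.70],
    [-1, 0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0, 0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
print(len(read_detections(rows)))              # 2
print(outputs_per_feature_map(21, 6, 38, 38))  # 216600
```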
**Attributes**:
* *num_classes*
* **Description**: number of classes to be predicted
* **Range of values**: positive integer number
* **Type**: int
* **Default value**: None
* **Required**: *yes*
* *background_label_id*
* **Description**: background label id. If there is no background class, set it to -1.
* **Range of values**: integer values
* **Type**: int
* **Default value**: 0
* **Required**: *no*
* *top_k*
* **Description**: maximum number of results to be kept per batch after NMS step. -1 means keeping all bounding boxes.
* **Range of values**: integer values
* **Type**: int
* **Default value**: -1
* **Required**: *no*
* *variance_encoded_in_target*
* **Description**: *variance_encoded_in_target* is a flag that denotes if variance is encoded in target. If flag is false then it is necessary to adjust the predicted offset accordingly.
* **Range of values**: False or True
* **Type**: boolean
* **Default value**: False
* **Required**: *no*
* *keep_top_k*
* **Description**: maximum number of bounding boxes per batch to be kept after NMS step. -1 means keeping all bounding boxes after NMS step.
* **Range of values**: integer values
* **Type**: int[]
* **Default value**: None
* **Required**: *yes*
* *code_type*
* **Description**: type of coding method for bounding boxes
* **Range of values**: "caffe.PriorBoxParameter.CENTER_SIZE", "caffe.PriorBoxParameter.CORNER"
* **Type**: string
* **Default value**: "caffe.PriorBoxParameter.CORNER"
* **Required**: *no*
* *share_location*
* **Description**: *share_location* is a flag that denotes if bounding boxes are shared among different classes.
* **Range of values**: 0 or 1
* **Type**: int
* **Default value**: 1
* **Required**: *no*
* *nms_threshold*
* **Description**: threshold to be used in the NMS stage
* **Range of values**: floating point values
* **Type**: float
* **Default value**: None
* **Required**: *yes*
* *confidence_threshold*
* **Description**: only consider detections whose confidences are larger than a threshold. If not provided, consider all boxes.
* **Range of values**: floating point values
* **Type**: float
* **Default value**: 0
* **Required**: *no*
* *clip_after_nms*
* **Description**: *clip_after_nms* is a flag that denotes whether to clip bounding boxes after non-maximum suppression.
* **Range of values**: 0 or 1
* **Type**: int
* **Default value**: 0
* **Required**: *no*
* *clip_before_nms*
* **Description**: *clip_before_nms* is a flag that denotes whether to clip bounding boxes before non-maximum suppression.
* **Range of values**: 0 or 1
* **Type**: int
* **Default value**: 0
* **Required**: *no*
* *decrease_label_id*
* **Description**: *decrease_label_id* is a flag that denotes how to perform NMS.
* **Range of values**:
* 0 - perform NMS like in Caffe\*.
* 1 - perform NMS like in MxNet\*.
* **Type**: int
* **Default value**: 0
* **Required**: *no*
* *normalized*
* **Description**: *normalized* is a flag that denotes whether input tensors with boxes are normalized. If the tensors are not normalized, the *input_height* and *input_width* attributes are used to normalize box coordinates.
* **Range of values**: 0 or 1
* **Type**: int
* **Default value**: 0
* **Required**: *no*
* *input_height (input_width)*
* **Description**: input image height (width). If *normalized* is 1, these attributes are not used.
* **Range of values**: positive integer number
* **Type**: int
* **Default value**: 1
* **Required**: *no*
* *objectness_score*
* **Description**: threshold to sort out confidence predictions. Used only when the *DetectionOutput* layer has 5 inputs.
* **Range of values**: non-negative float number
* **Type**: float
* **Default value**: 0
* **Required**: *no*
**Inputs**
* **1**: 2D input tensor with box logits. Required.
* **2**: 2D input tensor with class predictions. Required.
* **3**: 3D input tensor with proposals. Required.
* **4**: 2D input tensor with additional class predictions information described in the [article](https://arxiv.org/pdf/1711.06897.pdf). Optional.
* **5**: 2D input tensor with additional box predictions information described in the [article](https://arxiv.org/pdf/1711.06897.pdf). Optional.
**Example**
```xml
<layer ... type="DetectionOutput" ... >
<data num_classes="21" share_location="1" background_label_id="0" nms_threshold="0.450000" top_k="400" input_height="1" input_width="1" code_type="caffe.PriorBoxParameter.CENTER_SIZE" variance_encoded_in_target="0" keep_top_k="200" confidence_threshold="0.010000"/>
<input> ... </input>
<output> ... </output>
</layer>
```


@@ -0,0 +1,100 @@
## PSROIPooling <a name="PSROIPooling"></a>
**Versioned name**: *PSROIPooling-1*
**Category**: Object detection
**Short description**: *PSROIPooling* computes position-sensitive pooling on regions of interest specified by input.
**Detailed description**: [Reference](https://arxiv.org/pdf/1703.06211.pdf).
*PSROIPooling* operation takes two input blobs: feature maps and regions of interest (box coordinates).
The latter is specified as five element tuples: *[batch_id, x_1, y_1, x_2, y_2]*.
ROI coordinates are specified in absolute values for the *average* mode and in normalized values (to the *[0,1]* interval) for the *bilinear* mode.
**Attributes**
* *output_dim*
* **Description**: *output_dim* is a pooled output channel number.
* **Range of values**: a positive integer
* **Type**: `int`
* **Default value**: None
* **Required**: *yes*
* *group_size*
* **Description**: *group_size* is the number of groups to encode position-sensitive score maps. Used for *average* mode only.
* **Range of values**: a positive integer
* **Type**: `int`
* **Default value**: 1
* **Required**: *no*
* *spatial_scale*
* **Description**: *spatial_scale* is a multiplicative spatial scale factor to translate ROI coordinates from their input scale to the scale used when pooling.
* **Range of values**: a positive floating-point number
* **Type**: `float`
* **Default value**: None
* **Required**: *yes*
* *mode*
* **Description**: *mode* specifies mode for pooling.
* **Range of values**:
* *average* - perform average pooling
* *bilinear* - perform pooling with bilinear interpolation
* **Type**: string
* **Default value**: *average*
* **Required**: *no*
* *spatial_bins_x*
* **Description**: *spatial_bins_x* specifies the number of bins to divide the input feature maps into over width. Used for "bilinear" mode only.
* **Range of values**: a positive integer
* **Type**: `int`
* **Default value**: 1
* **Required**: *no*
* *spatial_bins_y*
* **Description**: *spatial_bins_y* specifies the number of bins to divide the input feature maps into over height. Used for "bilinear" mode only.
* **Range of values**: a positive integer
* **Type**: `int`
* **Default value**: 1
* **Required**: *no*
**Inputs**:
* **1**: 4D input blob with feature maps. Required.
* **2**: 2D input blob describing box consisting of five element tuples: `[batch_id, x_1, y_1, x_2, y_2]`. Required.
**Outputs**:
* **1**: 4D output tensor with areas copied and interpolated from the 1st input tensor by coordinates of boxes from the 2nd input.
**Example**
```xml
<layer ... type="PSROIPooling" ... >
<data group_size="6" mode="bilinear" output_dim="360" spatial_bins_x="3" spatial_bins_y="3" spatial_scale="1"/>
<input>
<port id="0">
<dim>1</dim>
<dim>3240</dim>
<dim>38</dim>
<dim>38</dim>
</port>
<port id="1">
<dim>100</dim>
<dim>5</dim>
</port>
</input>
<output>
<port id="2">
<dim>100</dim>
<dim>360</dim>
<dim>6</dim>
<dim>6</dim>
</port>
</output>
</layer>
```


@@ -0,0 +1,129 @@
## PriorBoxClustered <a name="PriorBoxClustered"></a>
**Versioned name**: *PriorBoxClustered-1*
**Category**: Object detection
**Short description**: *PriorBoxClustered* operation generates prior boxes of specified sizes normalized to the input image size.
**Attributes**
* *width (height)*
* **Description**: *width (height)* specifies desired boxes widths (heights) in pixels.
* **Range of values**: floating point positive numbers
* **Type**: float[]
* **Default value**: 1.0
* **Required**: *no*
* *clip*
* **Description**: *clip* is a flag that denotes if each value in the output tensor should be clipped within [0,1].
* **Range of values**:
* False - clipping is not performed
* True - each value in the output tensor is within [0,1]
* **Type**: boolean
* **Default value**: True
* **Required**: *no*
* *step (step_w, step_h)*
* **Description**: *step (step_w, step_h)* is the distance between box centers. For example, *step* equal to 85 means that the distance between neighboring prior box centers is 85. If both *step_h* and *step_w* are 0, they are set to the value of *step*. If they are still 0 after that, they are calculated as the input image width (height) divided by the first input width (height).
* **Range of values**: floating point positive number
* **Type**: float
* **Default value**: 0.0
* **Required**: *no*
* *offset*
* **Description**: *offset* is the shift of a box relative to the top left corner. For example, *offset* equal to 85 means that the shift of neighboring prior box centers is 85.
* **Range of values**: floating point positive number
* **Type**: float
* **Default value**: None
* **Required**: *yes*
* *variance*
* **Description**: *variance* denotes a variance of adjusting bounding boxes.
* **Range of values**: floating point positive numbers
* **Type**: float[]
* **Default value**: []
* **Required**: *no*
* *img_h (img_w)*
* **Description**: *img_h (img_w)* specifies the height (width) of the input image. These values are taken from the second input `image_size` height (width) unless provided explicitly as attribute values.
* **Range of values**: floating point positive number
* **Type**: float
* **Default value**: 0
* **Required**: *no*
**Inputs**:
* **1**: `output_size` - 1D tensor with two integer elements `[height, width]`. Specifies the spatial size of generated grid with boxes. Required.
* **2**: `image_size` - 1D tensor with two integer elements `[image_height, image_width]` that specifies shape of the image for which boxes are generated. Optional.
**Outputs**:
* **1**: 2D tensor of shape `[2, 4 * height * width * priors_per_point]` with box coordinates. The `priors_per_point` is the number of boxes generated per each grid element. The number depends on layer attribute values.
**Detailed description**
*PriorBoxClustered* computes coordinates of prior boxes as follows:
1. Calculates the *center_x* and *center_y* of prior box:
\f[
W \equiv Width \quad Of \quad Image
\f]
\f[
H \equiv Height \quad Of \quad Image
\f]
\f[
center_x=(w+offset)*step
\f]
\f[
center_y=(h+offset)*step
\f]
\f[
w \subset \left( 0, W \right )
\f]
\f[
h \subset \left( 0, H \right )
\f]
2. For each \f$s \subset \left( 0, W \right )\f$ calculates the prior boxes coordinates:
\f[
xmin = \frac{center_x - \frac{width_s}{2}}{W}
\f]
\f[
ymin = \frac{center_y - \frac{height_s}{2}}{H}
\f]
\f[
xmax = \frac{center_x + \frac{width_s}{2}}{W}
\f]
\f[
ymax = \frac{center_y + \frac{height_s}{2}}{H}
\f]
If *clip* is defined, the coordinates of prior boxes are recalculated with the formula:
\f$coordinate = \min(\max(coordinate,0), 1)\f$
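A minimal Python sketch of this computation follows. It assumes that *w* and *h* iterate over the grid given by the first input, that the box sizes are taken pairwise from the *width* and *height* attributes, and that *xmax* and *ymax* are the center plus half of the box size. The function name and argument layout are hypothetical, and the *variance* half of the output is not produced.

```python
def prior_boxes_clustered(grid_h, grid_w, img_h, img_w,
                          widths, heights, step, offset, clip=False):
    """Return [xmin, ymin, xmax, ymax] boxes normalized to the image size."""
    boxes = []
    for h in range(grid_h):
        for w in range(grid_w):
            # Step 1: box center for this grid cell.
            center_x = (w + offset) * step
            center_y = (h + offset) * step
            # Step 2: one box per (width, height) pair.
            for box_w, box_h in zip(widths, heights):
                box = [
                    (center_x - box_w / 2.0) / img_w,
                    (center_y - box_h / 2.0) / img_h,
                    (center_x + box_w / 2.0) / img_w,
                    (center_y + box_h / 2.0) / img_h,
                ]
                if clip:
                    # coordinate = min(max(coordinate, 0), 1)
                    box = [min(max(c, 0.0), 1.0) for c in box]
                boxes.append(box)
    return boxes


widths = [86.0, 13.0, 57.0, 39.0, 68.0, 34.0, 142.0, 50.0, 23.0]
heights = [44.0, 10.0, 30.0, 19.0, 94.0, 32.0, 61.0, 53.0, 17.0]
boxes = prior_boxes_clustered(10, 19, 180, 320, widths, heights,
                              step=16.0, offset=0.5)
print(len(boxes) * 4)  # 6840 coordinates for a 10 x 19 grid with 9 boxes per cell
```

With a 10 x 19 grid and nine width/height pairs, the sketch yields 10 * 19 * 9 boxes, that is 6840 coordinates.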
**Example**
```xml
<layer type="PriorBoxClustered" ... >
<data clip="0" flip="1" height="44.0,10.0,30.0,19.0,94.0,32.0,61.0,53.0,17.0" offset="0.5" step="16.0" variance="0.1,0.1,0.2,0.2" width="86.0,13.0,57.0,39.0,68.0,34.0,142.0,50.0,23.0"/>
<input>
<port id="0">
<dim>2</dim> <!-- [10, 19] -->
</port>
<port id="1">
<dim>2</dim> <!-- [180, 320] -->
</port>
</input>
<output>
<port id="2">
<dim>2</dim>
<dim>6840</dim>
</port>
</output>
</layer>
```


@@ -0,0 +1,179 @@
## PriorBox<a name="PriorBox"></a>
**Versioned name**: *PriorBox-1*
**Category**: Object detection
**Short description**: *PriorBox* operation generates prior boxes of specified sizes and aspect ratios across all dimensions.
**Attributes**:
* *min_size (max_size)*
* **Description**: *min_size (max_size)* is the minimum (maximum) box size (in pixels). For example, *min_size (max_size)* equal 15 means that the minimum (maximum) box size is 15.
* **Range of values**: positive floating point numbers
* **Type**: float[]
* **Default value**: []
* **Required**: *no*
* *aspect_ratio*
* **Description**: *aspect_ratio* is a list of aspect ratios. Duplicate values are ignored. For example, *aspect_ratio* equal to "2.0,3.0" means that the aspect ratio is 2.0 for the first box and 3.0 for the second box.
* **Range of values**: set of positive floating-point numbers
* **Type**: float[]
* **Default value**: []
* **Required**: *no*
* *flip*
* **Description**: *flip* is a flag that denotes that each *aspect_ratio* is duplicated and flipped. For example, *flip* equal to 1 and *aspect_ratio* equal to "4.0,2.0" mean that *aspect_ratio* is equal to "4.0,2.0,0.25,0.5".
* **Range of values**:
* False - each *aspect_ratio* is not flipped
* True - each *aspect_ratio* is flipped
* **Type**: boolean
* **Default value**: False
* **Required**: *no*
* *clip*
* **Description**: *clip* is a flag that denotes if each value in the output tensor should be clipped to [0,1] interval.
* **Range of values**:
* False - clipping is not performed
* True - each value in the output tensor is clipped to [0,1] interval.
* **Type**: boolean
* **Default value**: False
* **Required**: *no*
* *step*
* **Description**: *step* is the distance between box centers. For example, *step* equal to 85 means that the distance between neighboring prior box centers is 85.
* **Range of values**: floating point non-negative number
* **Type**: float
* **Default value**: 0
* **Required**: *no*
* *offset*
* **Description**: *offset* is the shift of a box relative to the top left corner. For example, *offset* equal to 85 means that the shift of neighboring prior box centers is 85.
* **Range of values**: floating point non-negative number
* **Type**: float
* **Default value**: None
* **Required**: *yes*
* *variance*
* **Description**: *variance* denotes a variance of adjusting bounding boxes. The attribute could contain 0, 1 or 4 elements.
* **Range of values**: floating point positive numbers
* **Type**: float[]
* **Default value**: []
* **Required**: *no*
* *scale_all_sizes*
* **Description**: *scale_all_sizes* is a flag that denotes type of inference. For example, *scale_all_sizes* equals 0 means that the PriorBox layer is inferred in MXNet-like manner. In particular, *max_size* attribute is ignored.
* **Range of values**:
* False - *max_size* is ignored
* True - *max_size* is used
* **Type**: boolean
* **Default value**: True
* **Required**: *no*
* *fixed_ratio*
* **Description**: *fixed_ratio* is an aspect ratio of a box. For example, *fixed_ratio* equal to 2.000000 means that the aspect ratio of the first box is 2.
* **Range of values**: a list of positive floating-point numbers
* **Type**: `float[]`
* **Default value**: None
* **Required**: *no*
* *fixed_size*
* **Description**: *fixed_size* is an initial box size (in pixels). For example, *fixed_size* equal to 15 means that the initial box size is 15.
* **Range of values**: a list of positive floating-point numbers
* **Type**: `float[]`
* **Default value**: None
* **Required**: *no*
* *density*
* **Description**: *density* is the square root of the number of boxes of each type. For example, *density* equal to 2 means that the first box generates four boxes of the same size and with the same shifted centers.
* **Range of values**: a list of positive floating-point numbers
* **Type**: `float[]`
* **Default value**: None
* **Required**: *no*
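The interaction between *aspect_ratio* and *flip* described in the attribute list can be sketched as follows. The function name is hypothetical, and the ordering of the result simply mirrors the "4.0,2.0,0.25,0.5" example from the *flip* description; an actual implementation may order the values differently.

```python
def expand_aspect_ratios(aspect_ratios, flip):
    """Drop duplicate ratios and, if flip is set, append their reciprocals."""
    unique = []
    for ratio in aspect_ratios:
        if ratio not in unique:  # duplicate values are ignored
            unique.append(ratio)
    if not flip:
        return unique
    expanded = list(unique)
    for ratio in unique:
        reciprocal = 1.0 / ratio
        if reciprocal not in expanded:
            expanded.append(reciprocal)
    return expanded


print(expand_aspect_ratios([4.0, 2.0], flip=True))       # [4.0, 2.0, 0.25, 0.5]
print(expand_aspect_ratios([2.0, 2.0, 3.0], flip=False))  # [2.0, 3.0]
```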
**Inputs**:
* **1**: `output_size` - 1D tensor with two integer elements `[height, width]`. Specifies the spatial size of generated grid with boxes. Required.
* **2**: `image_size` - 1D tensor with two integer elements `[image_height, image_width]` that specifies shape of the image for which boxes are generated. Required.
**Outputs**:
* **1**: 2D tensor of shape `[2, 4 * height * width * priors_per_point]` with box coordinates. The `priors_per_point` is the number of boxes generated per each grid element. The number depends on layer attribute values.
**Detailed description**:
*PriorBox* computes coordinates of prior boxes as follows:
1. First calculates *center_x* and *center_y* of prior box:
\f[
W \equiv Width \quad Of \quad Image
\f]
\f[
H \equiv Height \quad Of \quad Image
\f]
* If step equals 0:
\f[
center_x=(w+0.5)
\f]
\f[
center_y=(h+0.5)
\f]
* else:
\f[
center_x=(w+offset)*step
\f]
\f[
center_y=(h+offset)*step
\f]
\f[
w \subset \left( 0, W \right )
\f]
\f[
h \subset \left( 0, H \right )
\f]
2. Then, for each \f$ s \subset \left( 0, min\_sizes \right ) \f$ calculates coordinates of prior boxes:
\f[
xmin = \frac{center_x - \frac{s}{2}}{W}
\f]
\f[
ymin = \frac{center_y - \frac{s}{2}}{H}
\f]
\f[
xmax = \frac{center_x + \frac{s}{2}}{W}
\f]
\f[
ymax = \frac{center_y + \frac{s}{2}}{H}
\f]
**Example**
```xml
<layer type="PriorBox" ...>
<data aspect_ratio="2.0" clip="0" density="" fixed_ratio="" fixed_size="" flip="1" max_size="38.46" min_size="16.0" offset="0.5" step="16.0" variance="0.1,0.1,0.2,0.2"/>
<input>
<port id="0">
<dim>2</dim> <!-- values: [24, 42] -->
</port>
<port id="1">
<dim>2</dim> <!-- values: [384, 672] -->
</port>
</input>
<output>
<port id="2">
<dim>2</dim>
<dim>16128</dim>
</port>
</output>
</layer>
```


@@ -0,0 +1,160 @@
## Proposal <a name="Proposal"></a>
**Versioned name**: *Proposal-1*
**Category**: *Object detection*
**Short description**: *Proposal* operation filters bounding boxes and outputs only those with the highest prediction confidence.
**Detailed description**
*Proposal* has three inputs: a tensor with probabilities of whether each bounding box corresponds to background or foreground, a tensor with logits for each of the bounding boxes, and a tensor with the input image size in the [`image_height`, `image_width`, `scale_height_and_width`] or [`image_height`, `image_width`, `scale_height`, `scale_width`] format. The produced tensor has two dimensions: `[batch_size * post_nms_topn, 5]`.
*Proposal* layer does the following with the input tensor:
1. Generates initial anchor boxes. Left top corner of all boxes is at (0, 0). Width and height of boxes are calculated from *base_size* with *scale* and *ratio* attributes.
2. For each point in the first input tensor:
* pins anchor boxes to the image according to the second input tensor that contains four deltas for each box: for *x* and *y* of center, for *width* and for *height*
* finds out score in the first input tensor
3. Filters out boxes with size less than *min_size*
4. Sorts all proposals (*box*, *score*) by score from highest to lowest
5. Takes top *pre_nms_topn* proposals
6. Calculates intersections for boxes and filters out all boxes with \f$intersection/union > nms\_thresh\f$
7. Takes top *post_nms_topn* proposals
8. Returns top proposals
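Steps 3 to 7 can be sketched in plain Python as shown below. This is an illustration under stated assumptions rather than the reference implementation: boxes are `[x_1, y_1, x_2, y_2]` lists that are already decoded, the *min_size* check is applied to both width and height, and step 6 is interpreted as the usual greedy NMS against already accepted boxes. The function names `iou` and `select_proposals` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two [x_1, y_1, x_2, y_2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_proposals(boxes, scores, min_size, pre_nms_topn,
                     post_nms_topn, nms_thresh):
    # Step 3: drop boxes smaller than min_size (assumed: width and height).
    candidates = [
        (box, score) for box, score in zip(boxes, scores)
        if box[2] - box[0] >= min_size and box[3] - box[1] >= min_size
    ]
    # Steps 4 and 5: sort by score, keep the top pre_nms_topn.
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    candidates = candidates[:pre_nms_topn]
    # Step 6: greedy NMS, drop boxes overlapping an accepted box too much.
    kept = []
    for box, score in candidates:
        if all(iou(box, other) <= nms_thresh for other, _ in kept):
            kept.append((box, score))
    # Step 7: keep the top post_nms_topn.
    return kept[:post_nms_topn]


boxes = [[0, 0, 100, 100], [5, 5, 105, 105], [200, 200, 300, 300], [0, 0, 4, 4]]
scores = [0.9, 0.8, 0.7, 0.95]
result = select_proposals(boxes, scores, min_size=16, pre_nms_topn=6000,
                          post_nms_topn=200, nms_thresh=0.6)
print([box for box, _ in result])
# [[0, 0, 100, 100], [200, 200, 300, 300]]
```

In this made-up input the tiny box is removed by the *min_size* check despite its high score, and the second box is suppressed because it overlaps the first one with an intersection/union of about 0.82.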
**Attributes**:
* *base_size*
* **Description**: *base_size* is the size of the anchor to which *scale* and *ratio* attributes are applied.
* **Range of values**: a positive integer number
* **Type**: `int`
* **Default value**: None
* **Required**: *yes*
* *pre_nms_topn*
* **Description**: *pre_nms_topn* is the number of bounding boxes before the NMS operation. For example, *pre_nms_topn* equal to 15 means that the 15 highest-scoring boxes are passed to NMS.
* **Range of values**: a positive integer number
* **Type**: `int`
* **Default value**: None
* **Required**: *yes*
* *post_nms_topn*
* **Description**: *post_nms_topn* is the number of bounding boxes after the NMS operation. For example, *post_nms_topn* equal to 15 means that at most 15 boxes are kept after NMS.
* **Range of values**: a positive integer number
* **Type**: `int`
* **Default value**: None
* **Required**: *yes*
* *nms_thresh*
* **Description**: *nms_thresh* is the intersection-over-union threshold used by the NMS step. For example, *nms_thresh* equal to 0.5 means that all boxes with intersection/union greater than 0.5 are filtered out.
* **Range of values**: a positive floating-point number
* **Type**: `float`
* **Default value**: None
* **Required**: *yes*
* *feat_stride*
* **Description**: *feat_stride* is the step size to slide over boxes (in pixels). For example, *feat_stride* equal to 16 means that all boxes are analyzed with the slide 16.
* **Range of values**: a positive integer
* **Type**: `int`
* **Default value**: None
* **Required**: *yes*
* *min_size*
* **Description**: *min_size* is the minimum size of box to be taken into consideration. For example, *min_size* equal 35 means that all boxes with box size less than 35 are filtered out.
* **Range of values**: a positive integer number
* **Type**: `int`
* **Default value**: None
* **Required**: *yes*
* *ratio*
* **Description**: *ratio* specifies the ratios for anchor generation.
* **Range of values**: a list of floating-point numbers
* **Type**: `float[]`
* **Default value**: None
* **Required**: *yes*
* *scale*
* **Description**: *scale* specifies the scales for anchor generation.
* **Range of values**: a list of floating-point numbers
* **Type**: `float[]`
* **Default value**: None
* **Required**: *yes*
* *clip_before_nms*
* **Description**: *clip_before_nms* is a flag that specifies whether to clip bounding boxes before non-maximum suppression.
* **Range of values**: True or False
* **Type**: `boolean`
* **Default value**: True
* **Required**: *no*
* *clip_after_nms*
* **Description**: *clip_after_nms* is a flag that specifies whether to clip bounding boxes after non-maximum suppression.
* **Range of values**: True or False
* **Type**: `boolean`
* **Default value**: False
* **Required**: *no*
* *normalize*
* **Description**: *normalize* is a flag that specifies whether to perform normalization of output boxes to *[0,1]* interval or not.
* **Range of values**: True or False
* **Type**: `boolean`
* **Default value**: False
* **Required**: *no*
* *box_size_scale*
* **Description**: *box_size_scale* specifies the scale factor applied to logits of box sizes before decoding.
* **Range of values**: a positive floating-point number
* **Type**: `float`
* **Default value**: 1.0
* **Required**: *no*
* *box_coordinate_scale*
* **Description**: *box_coordinate_scale* specifies the scale factor applied to logits of box coordinates before decoding.
* **Range of values**: a positive floating-point number
* **Type**: `float`
* **Default value**: 1.0
* **Required**: *no*
* *framework*
* **Description**: *framework* specifies how the box coordinates are calculated.
* **Range of values**:
* "" (empty string) - calculate box coordinates like in Caffe*
* *tensorflow* - calculate box coordinates like in the TensorFlow* Object Detection API models
* **Type**: string
* **Default value**: "" (empty string)
* **Required**: *no*
**Inputs**:
* **1**: 4D input floating point tensor with class prediction scores. Required.
* **2**: 4D input floating point tensor with box logits. Required.
* **3**: 1D input floating point tensor with 3 or 4 elements: [`image_height`, `image_width`, `scale_height_and_width`] or [`image_height`, `image_width`, `scale_height`, `scale_width`]. Required.
**Outputs**:
* **1**: Floating point tensor of shape `[batch_size * post_nms_topn, 5]`.
**Example**
```xml
<layer ... type="Proposal" ... >
<data base_size="16" feat_stride="16" min_size="16" nms_thresh="0.6" post_nms_topn="200" pre_nms_topn="6000"
ratio="2.67" scale="4.0,6.0,9.0,16.0,24.0,32.0"/>
<input> ... </input>
<output> ... </output>
</layer>
```


@@ -0,0 +1,112 @@
## ROIAlign <a name="ROIAlign"></a>
**Versioned name**: *ROIAlign-3*
**Category**: Object detection
**Short description**: *ROIAlign* is a *pooling layer* used over feature maps of non-uniform input sizes and outputs a feature map of a fixed size.
**Detailed description**: [Reference](https://arxiv.org/abs/1703.06870).
*ROIAlign* performs the following for each Region of Interest (ROI) for each input feature map:
1. Multiply box coordinates with *spatial_scale* to produce box coordinates relative to the input feature map size.
2. Divide the box into bins according to the *sampling_ratio* attribute.
3. Apply bilinear interpolation with 4 points in each bin and apply maximum or average pooling based on *mode* attribute to produce output feature map element.
**Attributes**
* *pooled_h*
* **Description**: *pooled_h* is the height of the ROI output feature map.
* **Range of values**: a positive integer
* **Type**: `int`
* **Default value**: None
* **Required**: *yes*
* *pooled_w*
* **Description**: *pooled_w* is the width of the ROI output feature map.
* **Range of values**: a positive integer
* **Type**: `int`
* **Default value**: None
* **Required**: *yes*
* *sampling_ratio*
* **Description**: *sampling_ratio* is the number of bins over height and width to use to calculate each output feature map element. If the value is equal to 0, an adaptive number of elements over height and width is used: `ceil(roi_height / pooled_h)` and `ceil(roi_width / pooled_w)` respectively.
* **Range of values**: a non-negative integer
* **Type**: `int`
* **Default value**: None
* **Required**: *yes*
* *spatial_scale*
* **Description**: *spatial_scale* is a multiplicative spatial scale factor to translate ROI coordinates from their input spatial scale to the scale used when pooling.
* **Range of values**: a positive floating-point number
* **Type**: `float`
* **Default value**: None
* **Required**: *yes*
* *mode*
* **Description**: *mode* specifies a method to perform pooling to produce output feature map elements.
* **Range of values**:
* *max* - maximum pooling
* *avg* - average pooling
* **Type**: string
* **Default value**: None
* **Required**: *yes*
**Inputs**:
* **1**: 4D input tensor of shape `[N, C, H, W]` with feature maps of type *T*. Required.
* **2**: 2D input tensor of shape `[NUM_ROIS, 4]` describing boxes as 4-element tuples `[x_1, y_1, x_2, y_2]` in relative coordinates of type *T*. The box width and height are calculated as follows: `roi_width = max(spatial_scale * (x_2 - x_1), 1.0)`, `roi_height = max(spatial_scale * (y_2 - y_1), 1.0)`, so malformed boxes are expressed as a box of size `1 x 1`. Required.
* **3**: 1D input tensor of shape `[NUM_ROIS]` with batch indices of type *IND_T*. Required.
**Outputs**:
* **1**: 4D output tensor of shape `[NUM_ROIS, C, pooled_h, pooled_w]` with feature maps of type *T*.
**Types**
* *T*: any supported floating point type.
* *IND_T*: any supported integer type.
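The ROI size formulas and the adaptive *sampling_ratio* rule from this section can be combined into a short sketch. The function name `roi_sampling_grid` is hypothetical, and the sketch stops before the bilinear interpolation and pooling themselves.

```python
import math


def roi_sampling_grid(box, spatial_scale, pooled_h, pooled_w, sampling_ratio):
    """Return (roi_width, roi_height, bins_h, bins_w) for one [x_1, y_1, x_2, y_2] box."""
    x1, y1, x2, y2 = box
    # Malformed boxes collapse to a 1 x 1 region.
    roi_width = max(spatial_scale * (x2 - x1), 1.0)
    roi_height = max(spatial_scale * (y2 - y1), 1.0)
    if sampling_ratio > 0:
        bins_h = bins_w = sampling_ratio
    else:
        # sampling_ratio == 0: adaptive number of elements.
        bins_h = math.ceil(roi_height / pooled_h)
        bins_w = math.ceil(roi_width / pooled_w)
    return roi_width, roi_height, bins_h, bins_w


print(roi_sampling_grid([0, 0, 64, 32], 0.25, 6, 6, 0))   # (16.0, 8.0, 2, 3)
print(roi_sampling_grid([10, 10, 5, 5], 0.25, 6, 6, 0))   # (1.0, 1.0, 1, 1)
print(roi_sampling_grid([0, 0, 64, 32], 0.25, 6, 6, 2))   # (16.0, 8.0, 2, 2)
```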
**Example**
```xml
<layer ... type="ROIAlign" ... >
<data pooled_h="6" pooled_w="6" spatial_scale="16.0" sampling_ratio="2" mode="avg"/>
<input>
<port id="0">
<dim>7</dim>
<dim>256</dim>
<dim>200</dim>
<dim>200</dim>
</port>
<port id="1">
<dim>1000</dim>
<dim>4</dim>
</port>
<port id="2">
<dim>1000</dim>
</port>
</input>
<output>
<port id="3" precision="FP32">
<dim>1000</dim>
<dim>256</dim>
<dim>6</dim>
<dim>6</dim>
</port>
</output>
</layer>
```
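
The box-size and adaptive-sampling formulas above can be sketched in Python. This is only an illustration of the formulas given in this specification; the function names `roi_align_geometry` and `roi_align_output_shape` are hypothetical and are not part of any API.

```python
import math


def roi_align_geometry(box, spatial_scale, pooled_h, pooled_w, sampling_ratio):
    """Return (roi_width, roi_height, samples_h, samples_w) for one box.

    Hypothetical helper: it only mirrors the formulas in the text above.
    """
    x_1, y_1, x_2, y_2 = box
    # Malformed boxes (x_2 < x_1 or y_2 < y_1) collapse to size 1 x 1.
    roi_width = max(spatial_scale * (x_2 - x_1), 1.0)
    roi_height = max(spatial_scale * (y_2 - y_1), 1.0)
    if sampling_ratio > 0:
        samples_h = samples_w = sampling_ratio
    else:
        # sampling_ratio == 0 selects the adaptive number of elements.
        samples_h = math.ceil(roi_height / pooled_h)
        samples_w = math.ceil(roi_width / pooled_w)
    return roi_width, roi_height, samples_h, samples_w


def roi_align_output_shape(num_rois, channels, pooled_h, pooled_w):
    """Output shape [NUM_ROIS, C, pooled_h, pooled_w] as listed under Outputs."""
    return [num_rois, channels, pooled_h, pooled_w]


if __name__ == "__main__":
    print(roi_align_geometry((0.0, 0.0, 10.0, 20.0), 0.5, 2, 2, 0))
    print(roi_align_output_shape(1000, 256, 6, 6))
```

With the values from the XML example (`NUM_ROIS = 1000`, `C = 256`, `pooled_h = pooled_w = 6`), the shape helper reproduces the `[1000, 256, 6, 6]` output port.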

@@ -0,0 +1,63 @@
## ROIPooling <a name="ROIPooling"></a>
**Versioned name**: *ROIPooling-1*
**Category**: Object detection
**Short description**: *ROIPooling* is a *pooling layer* used over feature maps of non-uniform input sizes and outputs a feature map of a fixed size.
**Detailed description**: [deepsense.ai reference](https://blog.deepsense.ai/region-of-interest-pooling-explained/)
**Attributes**
* *pooled_h*
* **Description**: *pooled_h* is the height of the ROI output feature map. For example, *pooled_h* equal to 6 means that the height of the output of *ROIPooling* is 6.
* **Range of values**: a non-negative integer
* **Type**: `int`
* **Default value**: None
* **Required**: *yes*
* *pooled_w*
* **Description**: *pooled_w* is the width of the ROI output feature map. For example, *pooled_w* equal to 6 means that the width of the output of *ROIPooling* is 6.
* **Range of values**: a non-negative integer
* **Type**: `int`
* **Default value**: None
* **Required**: *yes*
* *spatial_scale*
* **Description**: *spatial_scale* is the ratio of the input feature map over the input image size.
* **Range of values**: a positive floating-point number
* **Type**: `float`
* **Default value**: None
* **Required**: *yes*
* *method*
* **Description**: *method* specifies a method to perform pooling. If the method is *bilinear*, the input box coordinates are normalized to the `[0, 1]` interval.
* **Range of values**: *max* or *bilinear*
* **Type**: string
* **Default value**: *max*
* **Required**: *no*
**Inputs**:
* **1**: 4D input tensor of shape `[1, C, H, W]` with feature maps. Required.
* **2**: 2D input tensor of shape `[NUM_ROIS, 5]` describing boxes as 5-element tuples: `[batch_id, x_1, y_1, x_2, y_2]`. Required.
**Outputs**:
* **1**: 4D output tensor of shape `[NUM_ROIS, C, pooled_h, pooled_w]` with feature maps. Required.
**Example**
```xml
<layer ... type="ROIPooling" ... >
<data pooled_h="6" pooled_w="6" spatial_scale="0.062500"/>
<input> ... </input>
<output> ... </output>
</layer>
```
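
As an illustration of the *max* method, the following Python sketch pools one box over a single-channel feature map. The rounding and bin-boundary rules (round half up, inclusive box ends, floor/ceil bin edges, zero for empty bins) are assumptions borrowed from the common Caffe-style ROI pooling convention; this specification does not state them. The function name `roi_max_pool_2d` is hypothetical.

```python
import math


def roi_max_pool_2d(feature_map, roi, pooled_h, pooled_w, spatial_scale):
    """Max-pool one ROI over a 2D (single-channel) feature map.

    Assumed convention (not confirmed by the spec): Caffe-style rounding
    of scaled coordinates and floor/ceil bin boundaries.
    """
    height, width = len(feature_map), len(feature_map[0])
    _batch_id, x_1, y_1, x_2, y_2 = roi

    # Translate box coordinates to feature-map scale (round half up).
    start_w = math.floor(x_1 * spatial_scale + 0.5)
    start_h = math.floor(y_1 * spatial_scale + 0.5)
    end_w = math.floor(x_2 * spatial_scale + 0.5)
    end_h = math.floor(y_2 * spatial_scale + 0.5)

    # Box ends are treated as inclusive; degenerate boxes become 1 x 1.
    roi_h = max(end_h - start_h + 1, 1)
    roi_w = max(end_w - start_w + 1, 1)
    bin_h = roi_h / pooled_h
    bin_w = roi_w / pooled_w

    def clip(value, upper):
        return min(max(value, 0), upper)

    output = []
    for ph in range(pooled_h):
        h_lo = clip(start_h + math.floor(ph * bin_h), height)
        h_hi = clip(start_h + math.ceil((ph + 1) * bin_h), height)
        row = []
        for pw in range(pooled_w):
            w_lo = clip(start_w + math.floor(pw * bin_w), width)
            w_hi = clip(start_w + math.ceil((pw + 1) * bin_w), width)
            values = [feature_map[h][w]
                      for h in range(h_lo, h_hi)
                      for w in range(w_lo, w_hi)]
            # Empty bins (box outside the map) produce 0.
            row.append(max(values) if values else 0.0)
        output.append(row)
    return output


if __name__ == "__main__":
    fmap = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
    print(roi_max_pool_2d(fmap, (0, 0, 0, 48, 48), 2, 2, 0.0625))
```

The usage line applies `spatial_scale = 0.0625` (the value from the XML example, i.e. 1/16) to a box given in image coordinates, so a 48-pixel box maps onto a 4 x 4 feature map.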

@@ -0,0 +1,133 @@
## RegionYolo <a name="RegionYolo"></a>
**Versioned name**: *RegionYolo-1*
**Category**: *Object detection*
**Short description**: *RegionYolo* computes the coordinates of regions with probability for each class.
**Detailed description**: This operation is directly mapped to the original YOLO layer. [Reference](https://arxiv.org/pdf/1612.08242.pdf)
**Attributes**:
* *anchors*
* **Description**: *anchors* is a flattened list of `[width, height]` pairs that encode prior box sizes. This attribute is not used in output computation, but it is required for post-processing to restore real box coordinates.
* **Range of values**: a list of positive floating-point numbers of any length
* **Type**: `float[]`
* **Default value**: None
* **Required**: *no*
* *axis*
* **Description**: starting axis index in the input tensor `data` shape that will be flattened in the output; the end of flattened range is defined by `end_axis` attribute.
* **Range of values**: `-rank(data) .. rank(data)-1`
* **Type**: `int`
* **Default value**: None
* **Required**: *yes*
* *coords*
* **Description**: *coords* is the number of coordinates for each region.
* **Range of values**: an integer
* **Type**: `int`
* **Default value**: None
* **Required**: *yes*
* *classes*
* **Description**: *classes* is the number of classes for each region.
* **Range of values**: an integer
* **Type**: `int`
* **Default value**: None
* **Required**: *yes*
* *end_axis*
* **Description**: ending axis index in the input tensor `data` shape that will be flattened in the output; the beginning of the flattened range is defined by `axis` attribute.
* **Range of values**: `-rank(data)..rank(data)-1`
* **Type**: `int`
* **Default value**: None
* **Required**: *yes*
* *num*
* **Description**: *num* is the number of regions.
* **Range of values**: an integer
* **Type**: `int`
* **Default value**: None
* **Required**: *yes*
* *do_softmax*
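* **Description**: *do_softmax* is a flag that specifies the inference method and determines which attribute gives the number of regions (*num* when softmax is performed, *mask* otherwise). It also affects the output shape: if it is *False*, the output is 4D; otherwise the dimensions from *axis* to *end_axis* are flattened, which gives a 2D output for `axis=1` and `end_axis=3`.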
* **Range of values**:
* *False* - do not perform softmax
* *True* - perform softmax
* **Type**: `boolean`
* **Default value**: True
* **Required**: *no*
* *mask*
* **Description**: *mask* specifies the number of regions through its length (`len(mask)`). Use this attribute instead of *num* when *do_softmax* is *False*.
* **Range of values**: a list of integers
* **Type**: `int[]`
* **Default value**: `[]`
* **Required**: *no*
**Inputs**:
* **1**: `data` - 4D input tensor with floating point elements and shape `[N, C, H, W]`. Required.
**Outputs**:
* **1**: output tensor of rank 4 or less that encodes detected regions. Refer to the original YOLO paper to decode the output as boxes. `anchors` should be used to decode real box coordinates. If `do_softmax` is set to 0, then the output shape is `[N, (classes + coords + 1)*len(mask), H, W]`. If `do_softmax` is set to 1, then the output shape is partially flattened and defined in the following way:
`flat_dim = data.shape[axis] * data.shape[axis+1] * ... * data.shape[end_axis]`
`output.shape = [data.shape[0], ..., data.shape[axis-1], flat_dim, data.shape[end_axis + 1], ...]`
**Example**
```xml
<!-- YOLO V3 example -->
<layer type="RegionYolo" ... >
<data anchors="10,14,23,27,37,58,81,82,135,169,344,319" axis="1" classes="80" coords="4" do_softmax="0" end_axis="3" mask="0,1,2" num="6"/>
<input>
<port id="0">
<dim>1</dim>
<dim>255</dim>
<dim>26</dim>
<dim>26</dim>
</port>
</input>
<output>
<port id="0">
<dim>1</dim>
<dim>255</dim>
<dim>26</dim>
<dim>26</dim>
</port>
</output>
</layer>
<!-- YOLO V2 Example -->
<layer type="RegionYolo" ... >
<data anchors="1.08,1.19,3.42,4.41,6.63,11.38,9.42,5.11,16.62,10.52" axis="1" classes="20" coords="4" do_softmax="1" end_axis="3" num="5"/>
<input>
<port id="0">
<dim>1</dim>
<dim>125</dim>
<dim>13</dim>
<dim>13</dim>
</port>
</input>
<output>
<port id="0">
<dim>1</dim>
<dim>21125</dim>
</port>
</output>
</layer>
```
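
The output shape rules above can be checked against both XML examples with a short Python sketch. The helper name `region_yolo_output_shape` is hypothetical and only restates the formulas from the **Outputs** section; negative `axis` and `end_axis` values are normalized by adding the rank, which is an assumption consistent with the documented range.

```python
import math


def region_yolo_output_shape(data_shape, axis, end_axis, do_softmax,
                             classes, coords, mask=()):
    """Compute the RegionYolo output shape from the rules in the text.

    Hypothetical helper for illustration, not a library function.
    """
    if not do_softmax:
        # Shape stays 4D; the channel count is derived from len(mask).
        n, _, h, w = data_shape
        return [n, (classes + coords + 1) * len(mask), h, w]

    rank = len(data_shape)
    # Assumption: negative indices count from the end, as the range
    # -rank(data)..rank(data)-1 suggests.
    axis = axis + rank if axis < 0 else axis
    end_axis = end_axis + rank if end_axis < 0 else end_axis

    # Flatten dimensions axis..end_axis (inclusive) into a single one.
    flat_dim = math.prod(data_shape[axis:end_axis + 1])
    return list(data_shape[:axis]) + [flat_dim] + list(data_shape[end_axis + 1:])


if __name__ == "__main__":
    # YOLO V3 example: do_softmax="0", mask="0,1,2"
    print(region_yolo_output_shape([1, 255, 26, 26], 1, 3, False, 80, 4, mask=[0, 1, 2]))
    # YOLO V2 example: do_softmax="1"
    print(region_yolo_output_shape([1, 125, 13, 13], 1, 3, True, 20, 4))
```

For the YOLO V3 example, `(80 + 4 + 1) * 3 = 255`, so the shape is unchanged; for the YOLO V2 example, `125 * 13 * 13 = 21125`, which matches the 2D output port.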

@@ -0,0 +1,53 @@
## ReorgYolo <a name="ReorgYolo"></a>
**Versioned name**: *ReorgYolo-1*
**Category**: *Object detection*
**Short description**: *ReorgYolo* reorganizes input tensor taking into account strides.
**Detailed description**:
[Reference](https://arxiv.org/pdf/1612.08242.pdf)
**Attributes**
* *stride*
* **Description**: *stride* is the distance, in rows and columns, between the input elements that are cut out and placed together in the output tensor.
* **Range of values**: positive integer
* **Type**: `int`
* **Default value**: None
* **Required**: *yes*
**Inputs**:
* **1**: 4D input tensor of any type and shape `[N, C, H, W]`. `H` and `W` should be divisible by `stride`. Required.
**Outputs**:
* **1**: 4D output tensor of the same type as input tensor and shape `[N, C*stride*stride, H/stride, W/stride]`. Required.
**Example**
```xml
<layer id="89" name="ExtractImagePatches" type="ReorgYolo">
<data stride="2"/>
<input>
<port id="0">
<dim>1</dim>
<dim>64</dim>
<dim>26</dim>
<dim>26</dim>
</port>
</input>
<output>
<port id="1" precision="f32">
<dim>1</dim>
<dim>256</dim>
<dim>13</dim>
<dim>13</dim>
</port>
</output>
</layer>
```
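
The shape relationship between input and output can be sketched in Python. The helper name `reorg_yolo_output_shape` is hypothetical; it only restates the `[N, C*stride*stride, H/stride, W/stride]` rule and the divisibility requirement from the **Inputs** section, and does not reproduce the element reordering itself.

```python
def reorg_yolo_output_shape(input_shape, stride):
    """Return the ReorgYolo output shape for a 4D [N, C, H, W] input.

    Hypothetical helper for illustration only.
    """
    n, c, h, w = input_shape
    if stride <= 0:
        raise ValueError("stride must be a positive integer")
    if h % stride != 0 or w % stride != 0:
        raise ValueError("H and W should be divisible by stride")
    return [n, c * stride * stride, h // stride, w // stride]


if __name__ == "__main__":
    # Values from the XML example: stride="2", input [1, 64, 26, 26]
    print(reorg_yolo_output_shape([1, 64, 26, 26], 2))
```

With the values from the XML example, the result is `[1, 256, 13, 13]`, matching the output port.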