The sample works with Kaldi ARK or NumPy* uncompressed NPZ files, so it does not cover an end-to-end speech recognition scenario (speech-to-text): additional preprocessing (feature extraction) is required to obtain a feature vector from a speech signal, and postprocessing (decoding) is required to produce text from scores.
| Available Devices | `ov::Core::get_available_devices`, `ov::Core::get_property` | Get information about the devices available for inference |
| Import/Export Model | `ov::Core::import_model`, `ov::CompiledModel::export_model` | The GNA plugin supports loading and saving of the GNA-optimized model |
| Model Operations | `ov::set_batch`, `ov::Model::add_output`, `ov::CompiledModel::inputs`, `ov::CompiledModel::outputs` | Manage the model: configure the batch size and the input and output tensors |
| Asynchronous Infer | `ov::InferRequest::start_async`, `ov::InferRequest::wait` | Run inference asynchronously and wait until the result becomes available |
| InferRequest Operations | `ov::InferRequest::query_state`, `ov::VariableState::reset` | Get and reset the state control of a compiled model |
| Tensor Operations | `ov::Tensor::get_size`, `ov::Tensor::data`, `ov::InferRequest::get_tensor` | Get a tensor, its size and data |
| Profiling | `ov::InferRequest::get_profiling_info` | Get infer request profiling info |
Basic OpenVINO™ Runtime API is covered by [Hello Classification C++ sample](../hello_classification/README.md).
| Options | Values |
| :--- | :--- |
| Validated Models | Acoustic model based on Kaldi\* neural networks (see the [Model Preparation](#model-preparation) section) |
| Model Format | OpenVINO™ toolkit Intermediate Representation (\*.xml + \*.bin) |
| Supported devices | See the [Execution Modes](#execution-modes) section below and [List Supported Devices](../../../docs/OV_Runtime_UG/supported_plugins/Supported_Devices.md) |
At startup, the sample application reads command-line parameters, loads the specified model and input data into the OpenVINO™ Runtime plugin, and performs inference on all speech utterances stored in the input file(s), logging each step in the standard output stream.
If the `-r` option is given, error statistics are provided for each speech utterance as shown above.
You can see the explicit description of each sample step in the [Integration Steps](../../../docs/OV_Runtime_UG/Integrate_with_customer_application_new_API.md) section of the "Integrate the OpenVINO™ Runtime with Your Application" guide.
If the GNA device is selected (for example, using the `-d GNA` flag), the GNA OpenVINO™ Runtime plugin quantizes the model and the input feature vector sequence to integer representation before performing inference.
- *static* - The first utterance in the input file is scanned for dynamic range. The scale factor (floating point scalar multiplier) required to scale the maximum input value of the first utterance to 16384 (15 bits) is used for all subsequent inputs. The neural network is quantized to accommodate the scaled input dynamic range.
- *dynamic* - The scale factor for each input batch is computed just before inference on that batch. The input and network are (re)quantized on the fly using an efficient procedure.
- *user-defined* - The scale factor specified by the user via the `-sf` flag (used together with `-q user`) is applied instead of one computed from the data.
The `-qb` flag provides a hint to the GNA plugin regarding the preferred target weight resolution for all layers. For example, when `-qb 8` is specified, the plugin will use 8-bit weights wherever possible in the network.
> - It is not always possible to use 8-bit weights due to GNA hardware limitations. For example, convolutional layers always use 16-bit weights (GNA hardware version 1 and 2). This limitation will be removed in GNA hardware version 3 and higher.
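As an illustration, the following sketch shows how the quantization-related flags might be combined (the model and input file names are placeholders; the flags themselves are documented in the usage listing below):

```sh
# Default static quantization: the scale factor is derived from the first utterance
./speech_sample -d GNA_AUTO -m model.xml -i input.ark -q static

# User-defined scale factor with 8-bit weights wherever the hardware allows
./speech_sample -d GNA_AUTO -m model.xml -i input.ark -q user -sf 2048 -qb 8
```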
The GNA plugin supports loading and saving of the GNA-optimized model (non-IR) via the `-rg` and `-wg` flags, which makes it possible to avoid the cost of full model quantization at run time. The GNA plugin also supports export of firmware-compatible embedded model images for the Intel® Speech Enabling Developer Kit and Amazon Alexa* Premium Far-Field Voice Development Kit via the `-we` flag (save only).
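For example, a sketch of saving the GNA-optimized model on a first run and reloading it later (the file names are placeholders):

```sh
# First run: quantize the IR model and save the resulting GNA-optimized model
./speech_sample -d GNA_AUTO -m model.xml -i input.ark -wg gna_model.bin

# Subsequent runs: load the GNA-optimized model directly, skipping quantization
./speech_sample -d GNA_AUTO -rg gna_model.bin -i input.ark
```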
To build the sample, use the instructions in the [Build the Sample Applications](../../../docs/OV_Runtime_UG/Samples_Overview.md) section of the OpenVINO™ Toolkit Samples guide.
-i "<path>" Required. Paths to input files. Example of usage: <file1.ark,file2.ark> or <file.ark> or <file.npz>.
-m "<path>" Required. Path to an .xml file with a trained model (required if -rg is missing).
-o "<path>" Optional. Output file name to save scores. Example of usage: <output.ark> or <output.npz>
-d "<device>" Optional. Specify a target device to infer on. CPU, GPU, MYRIAD, GNA_AUTO, GNA_HW, GNA_HW_WITH_SW_FBACK, GNA_SW_FP32, GNA_SW_EXACT and HETERO with combination of GNA as the primary device and CPU as a secondary (e.g. HETERO:GNA,CPU) are supported. The sample will look for a suitable plugin for device specified.
-q "<mode>" Optional. Input quantization mode: static (default), dynamic, or user (use with -sf).
-qb "<integer>" Optional. Weight bits for quantization: 8 or 16 (default)
-sf "<double>" Optional. User-specified input scale factor for quantization (use with -q user). If the network contains multiple inputs, provide scale factors by separating them with commas.
-layout "<string>" Optional. Prompts how network layouts should be treated by application.For example, \"input1[NCHW],input2[NC]\" or \"[NCHW]\" in case of one input size.
-r "<path>" Optional. Read reference score file and compare scores. Example of usage: <reference.ark> or <reference.npz>
-rg "<path>" Read GNA model from file using path/filename provided (required if -m is missing).
-wg "<path>" Optional. Write GNA model to file using path/filename provided.
-we "<path>" Optional. Write GNA embedded model to file using path/filename provided.
-cw_l "<integer>" Optional. Number of frames for left context windows (default is 0). Works only with context window networks. If you use the cw_l or cw_r flag, then batch size argument is ignored.
-cw_r "<integer>" Optional. Number of frames for right context windows (default is 0). Works only with context window networks. If you use the cw_r or cw_l flag, then batch size argument is ignored.
-oname "<string>" Optional. Layer names for output blobs. The names are separated with "," Example: Output1:port,Output2:port
-iname "<string>" Optional. Layer names for input blobs. The names are separated with "," Example: Input1,Input2
-pwl_me "<double>" Optional. The maximum percent of error for PWL function.The value must be in <0,100> range. The default value is 1.0.
-exec_target "<string>" Optional. Specify GNA execution target generation. May be one of GNA_TARGET_2_0, GNA_TARGET_3_0. By default, generation corresponds to the GNA HW available in the system or the latest fully supported generation by the software. See the GNA Plugin's GNA_EXEC_TARGET config option description.
-compile_target "<string>" Optional. Specify GNA compile target generation. May be one of GNA_TARGET_2_0, GNA_TARGET_3_0. By default, generation corresponds to the GNA HW available in the system or the latest fully supported generation by the software. See the GNA Plugin's GNA_COMPILE_TARGET config option description.
You can use the following model optimizer command to convert a Kaldi nnet1 or nnet2 neural model to OpenVINO™ toolkit Intermediate Representation format:
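A typical invocation looks like the following sketch, where `wsj_dnn5b.nnet` is a placeholder for the Kaldi model file and `<OUTPUT_MODEL_DIR>` for the directory to write the IR to:

```sh
mo --framework kaldi --input_model wsj_dnn5b.nnet --output_dir <OUTPUT_MODEL_DIR>
```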
All of the validated models can be downloaded from [https://storage.openvinotoolkit.org/models_contrib/speech/2021.2](https://storage.openvinotoolkit.org/models_contrib/speech/2021.2).
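For example, the following sketch runs the sample on the GNA device with score comparison enabled (`wsj_dnn5b.xml` stands for the converted model, and the ARK file names match the description below):

```sh
./speech_sample -d GNA_AUTO -m wsj_dnn5b.xml -i dev93_10.ark -r dev93_scores_10.ark -o result.ark
```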
Here, the floating point Kaldi-generated reference neural network scores (`dev93_scores_10.ark`) corresponding to the input feature file (`dev93_10.ark`) are assumed to be available for comparison.
> - Before running the sample with a trained model, make sure the model is converted to the intermediate representation (IR) format (\*.xml + \*.bin) using the [Model Optimizer tool](../../../docs/MO_DG/Deep_Learning_Model_Optimizer_DevGuide.md).
>
> - The sample supports input and output in the NumPy* file format (.npz).
The Wall Street Journal DNN model used in this example was prepared using the Kaldi s5 recipe and the Kaldi Nnet (nnet1) framework. It is possible to recognize speech by substituting `speech_sample` for Kaldi's nnet-forward command. Since `speech_sample` does not yet use pipes, it is necessary to use temporary files for speaker-transformed feature vectors and scores when running the Kaldi speech recognition pipeline. The following operations assume that feature extraction was already performed according to the `s5` recipe and that the working directory within the Kaldi source tree is `egs/wsj/s5`.
1. Prepare a speaker-transformed feature set given the feature transform specified in `final.feature_transform` and the feature files specified in `feats.scp`:
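   A sketch of this step using Kaldi's `nnet-forward` and `copy-feats` tools (verify the exact options against your Kaldi version):

   ```sh
   nnet-forward --use-gpu=no final.feature_transform "ark,s,cs:copy-feats scp:feats.scp ark:- |" ark:feats.ark
   ```

2. Score the speaker-transformed feature set with `speech_sample` in place of Kaldi's nnet-forward command, writing the scores to a temporary file (a sketch; the flags are documented in the usage listing above and the file names are placeholders):

   ```sh
   ./speech_sample -d GNA_AUTO -m wsj_dnn5b.xml -i feats.ark -o scores.ark
   ```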
3. Run the Kaldi decoder to produce n-best text hypotheses and select the most likely text given the WFST (`HCLG.fst`), vocabulary (`words.txt`), and TID/PID mapping (`final.mdl`):
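   A sketch of the decoding step (the option values are illustrative defaults; verify them against your Kaldi version):

   ```sh
   latgen-faster-mapped --max-active=7000 --beam=13.0 --lattice-beam=6.0 --acoustic-scale=0.0833 \
       --allow-partial=true --word-symbol-table=words.txt \
       final.mdl HCLG.fst ark:scores.ark ark:- | \
       lattice-best-path --word-symbol-table=words.txt ark:- ark,t:- > out.txt
   ```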
All of the mentioned files can be downloaded from [https://storage.openvinotoolkit.org/models_contrib/speech/2021.2/wsj_dnn5b_smbr](https://storage.openvinotoolkit.org/models_contrib/speech/2021.2/wsj_dnn5b_smbr).