From c0060aefa776fb152aebc461b92bd62876e47bf2 Mon Sep 17 00:00:00 2001 From: Karol Blaszczak Date: Mon, 15 May 2023 10:48:32 +0200 Subject: [PATCH] Prepare "memory_optimization_guide.md" (#17022) (#17498) --------- Co-authored-by: Vitaliy Urusovskij --- .../dldt_deployment_optimization_guide.md | 1 + .../memory_optimization_guide.md | 51 +++++++++++++++++++ 2 files changed, 52 insertions(+) create mode 100644 docs/optimization_guide/memory_optimization_guide.md diff --git a/docs/optimization_guide/dldt_deployment_optimization_guide.md b/docs/optimization_guide/dldt_deployment_optimization_guide.md index 8dfad7f088d..567af96d8d7 100644 --- a/docs/optimization_guide/dldt_deployment_optimization_guide.md +++ b/docs/optimization_guide/dldt_deployment_optimization_guide.md @@ -13,6 +13,7 @@ openvino_docs_deployment_optimization_guide_tput_advanced openvino_docs_OV_UG_Preprocessing_Overview openvino_docs_deployment_optimization_guide_internals + openvino_docs_memory_optimization_guide Runtime optimization, or deployment optimization, focuses on tuning inference parameters and execution means (e.g., the optimum number of requests executed simultaneously). Unlike model-level optimizations, they are highly specific to the hardware and case they are used for, and often come at a cost. diff --git a/docs/optimization_guide/memory_optimization_guide.md b/docs/optimization_guide/memory_optimization_guide.md new file mode 100644 index 00000000000..20991b0f5be --- /dev/null +++ b/docs/optimization_guide/memory_optimization_guide.md @@ -0,0 +1,51 @@ +# Optimizing memory usage {#openvino_docs_memory_optimization_guide} + +@sphinxdirective + +.. warning:: + + Before applying any of the recommendations provided here, note that it may significantly impact first inference latency. + +The most RAM-consuming OpenVINO stage is model compilation. It may cause several issues: + +* Not enough memory to compile a model. 
To decrease the memory requirement, the following options may be applied: + + * Weights mapping - memory mapping (using ``mmap``) is the default way to work + with weights. Currently, this feature is supported by the IR frontend. + Mapping may be switched on or off by specifying the ``ov::enable_mmap(BOOL)`` property for the ``ov::Core``. + Because of its "memory-on-demand" nature, there is no need to store all weights + in RAM. Storing just the data that is needed at the moment lowers the amount of memory + required for compilation. Moreover, ``mmap`` provides extensive memory sharing, so + consecutive compilations of the same model fetch the information already stored in RAM + instead of reading it again from storage. + + * Decrease the number of threads for compilation - to change the number of threads, specify + the ``ov::compilation_num_threads(NUMBER)`` property for the ``ov::Core`` or pass it as an additional + argument to ``ov::Core::compile_model()``. + +* Not enough memory to recompile a model. If model compilation is successful but one of the following recompilations fails due to a lack of resources, it may be caused by: + + * Memory leak - to detect direct leaks, you can use tools like AddressSanitizer or + Valgrind. In case of indirect leaks, which cannot be caught by these tools, peak RAM (VmHWM) + may be tracked (you can use tests/stress_tests/memleaks_tests as a tracking tool). If you + experience a significant increase in memory usage, report it in + `GitHub "Issues" `__ + + * Memory allocator behavior - each allocator works according to its own strategy and + balances performance against memory usage. For example, the GNU allocator aggressively + requests more memory from the OS for consecutive model compilations than was + required for the first compilation (such behavior may be detected by tracking actual RAM + (VmRSS) after compilation - it will grow until some stable point).
To reduce memory + pressure, the following options are available: + + * Apply ``malloc_trim(0)``. The function attempts to release free memory even from thread + caches, so it may significantly decrease and stabilize VmRSS usage. + + * Use glibc ``Tunables``. A couple of promising options are: + ``glibc.malloc.trim_threshold`` and ``glibc.malloc.arena_max``. + More details on the two may be found in the + `GNU Tunables Manual `__ + + * Try another allocator. One of the allocators that handles memory carefully is ``jemalloc``. + +@endsphinxdirective
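
A minimal sketch of the `malloc_trim(0)` option from the list above, assuming Linux with glibc. It reads `VmRSS` from `/proc/self/status` before and after the call so that the effect can be observed. The helper names `read_status_kb`, `allocate_and_free`, and `trim_and_report` are hypothetical and not part of OpenVINO; `allocate_and_free` merely stands in for a memory-heavy step such as model compilation.

```cpp
// Linux/glibc only: malloc_trim() is a GNU extension declared in <malloc.h>.
#include <malloc.h>

#include <cstddef>
#include <cstring>
#include <fstream>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

// Read a field such as "VmRSS" or "VmHWM" from /proc/self/status, in kB.
// Returns -1 if the field cannot be found.
long read_status_kb(const std::string& field) {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.compare(0, field.size() + 1, field + ":") == 0) {
            std::istringstream value(line.substr(field.size() + 1));
            long kb = -1;
            value >> kb;
            return kb;
        }
    }
    return -1;
}

// Hypothetical stand-in for a memory-hungry step (for example, model
// compilation): allocate many small heap blocks, touch them, free them.
void allocate_and_free(std::size_t blocks, std::size_t block_size) {
    std::vector<std::unique_ptr<char[]>> pool;
    pool.reserve(blocks);
    for (std::size_t i = 0; i < blocks; ++i) {
        pool.emplace_back(new char[block_size]);
        std::memset(pool.back().get(), 1, block_size);  // make pages resident
    }
}  // every block is freed when `pool` goes out of scope

struct TrimReport {
    long rss_before_kb;  // VmRSS before malloc_trim(0)
    long rss_after_kb;   // VmRSS after malloc_trim(0)
    bool released;       // true if glibc reports memory was returned to the OS
};

// Ask glibc to return free heap memory to the OS and record VmRSS around it.
TrimReport trim_and_report() {
    TrimReport report{};
    report.rss_before_kb = read_status_kb("VmRSS");
    report.released = (malloc_trim(0) == 1);
    report.rss_after_kb = read_status_kb("VmRSS");
    return report;
}
```

In an application, `trim_and_report()` could be called after each model compilation. Whether the resident set actually shrinks depends on the allocator state at that moment, so the reported values should be measured rather than assumed.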