In the previous part, we reviewed acceleration methods and explained the need for ONNX as a common Intermediate Representation (IR).
In this article we will look at ONNX Runtime and how it works.
Over time, many frameworks and runtime engines have included ONNX format support in their tools.
Among runtime engines, ONNX Runtime is one of the most well-known accelerators. This accelerator can be used for both model training and inference.
ONNX Runtime is a runtime engine that executes an ONNX model (which may have been developed in a different framework and then converted to the ONNX format) on different hardware. It does this through APIs provided by ONNX Runtime itself, called Execution Providers (EPs). In other words, an Execution Provider is a hardware acceleration interface through which ONNX Runtime queries the capabilities of a given piece of hardware.
Each Execution Provider exposes its own set of capabilities to ONNX Runtime, such as which nodes (operators) it can execute, its memory allocator, and so on. Custom accelerators (such as GPUs, VPUs, etc.) are examples of Execution Providers.
In general, an ONNX model does not have to run on a single, specific Execution Provider; ONNX Runtime can run the model on a heterogeneous environment that contains multiple Execution Providers. There is also a default Execution Provider (the CPU Execution Provider) on which all nodes that could not be assigned to a dedicated Execution Provider are run.
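As a minimal sketch of this behaviour, assuming the Python API, a build with the CUDA Execution Provider available, and a hypothetical model.onnx with a single image-like input, a session can be created with an ordered list of Execution Providers; nodes the CUDA EP cannot handle fall back to the CPU EP:

```python
import numpy as np
import onnxruntime as ort

# Ordered list of Execution Providers: ONNX Runtime tries CUDA first and
# falls back to the default CPU Execution Provider for the remaining nodes.
session = ort.InferenceSession(
    "model.onnx",  # hypothetical model path
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

print(session.get_providers())  # EPs actually registered for this session

# Run inference; the input name and shape depend on the model and are assumed here.
inputs = {session.get_inputs()[0].name: np.random.rand(1, 3, 224, 224).astype(np.float32)}
outputs = session.run(None, inputs)
```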
Execution Providers fall into the following two categories:
1. Execution Providers that provide implementations of the individual operators defined in ONNX.
2. Execution Providers that may not provide implementations of individual ONNX operators, but can execute the ONNX graph in whole or in part.
Graph optimizations cover all the changes applied to the computational graph of the model. They can be divided into three levels:
The first level (basic graph optimizations) rewrites the graph while preserving its semantics and applies to all Execution Providers. These rewrites include the following:
Constant folding: statically computes the parts of the graph that depend only on constant initializers, eliminating the need to compute them at runtime.
Folding BatchNormalization into Convolution: a BatchNormalization layer is often placed after a Convolution layer (to stabilize training and improve learning efficiency). Because its parameters are constant in the inference phase, it can be folded into the parameters of the Convolution layer and removed from the graph.
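As a rough sketch of this folding in plain NumPy (the function name and parameter layout are hypothetical, not ONNX Runtime's actual implementation), the BatchNormalization scale and shift can be absorbed into the convolution weights and bias:

```python
import numpy as np

def fold_batchnorm_into_conv(W, b, gamma, beta, mean, var, eps=1e-5):
    """Sketch: fold BatchNormalization parameters into Convolution weights/bias.

    W: conv weights of shape (out_channels, in_channels, kH, kW)
    b: conv bias of shape (out_channels,)
    gamma, beta, mean, var: BatchNorm parameters, each of shape (out_channels,)
    """
    scale = gamma / np.sqrt(var + eps)        # per-output-channel scale factor
    W_fused = W * scale[:, None, None, None]  # scale each output channel's filters
    b_fused = (b - mean) * scale + beta       # fold the shift into the bias
    return W_fused, b_fused
```

Since gamma, beta, mean, and var are fixed at inference time, the fused Convolution produces the same outputs as the original Convolution plus BatchNormalization pair.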
The second level (extended graph optimizations) includes more complex node fusions that are applied only to nodes assigned to the CPU, CUDA, and ROCm Execution Providers.
The third level (layout optimizations) changes the data layout and is applied only to nodes assigned to the CPU Execution Provider.
Two important things about these optimizations are:
With the graph_optimization_level parameter, you can set the graph optimization level.
All optimizations can be applied online or offline.
In online mode, all enabled optimizations are applied when an inference session is initialized, before any inference runs. Doing this every time a session is initialized adds startup overhead.
In offline mode, ONNX Runtime applies the graph optimizations and then serializes the optimized model to disk. This lets us reduce session startup time by loading the already-optimized model from disk. Serialization of the optimized model is enabled by setting the optimized_model_filepath parameter.
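A minimal sketch of both settings with the Python API (the model and output paths are hypothetical):

```python
import onnxruntime as ort

so = ort.SessionOptions()

# Choose the optimization level: ORT_DISABLE_ALL, ORT_ENABLE_BASIC,
# ORT_ENABLE_EXTENDED, or ORT_ENABLE_ALL.
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED

# Offline mode: serialize the optimized graph so that later sessions can
# load it directly and skip re-optimization.
so.optimized_model_filepath = "model_optimized.onnx"  # hypothetical output path

session = ort.InferenceSession("model.onnx", so, providers=["CPUExecutionProvider"])
```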
These optimizations can also be viewed from another perspective: the stage at which they are applied to the model. From this point of view, optimizations are generally referred to as “graph transformations”. These transformations are performed at three levels:
1. Transformations that are required for execution and apply to every graph.
Example: cast insertion, mem copy insertion
2. General transformations that are Execution Provider-agnostic.
3. Transformations that are Execution Provider-specific.
Example: transpose insertion for FPGA
These transformations can also be global or local.
In global mode, transformations are applied to the entire graph. The interface that applies these transformations in ONNX Runtime is called GraphTransformer.
In local mode, on the other hand, transformations can be seen as rewriting rules that are applied to subgraphs or individual nodes. The RewritingRule interface in ONNX Runtime is used to implement these transformations.
Overall, ONNX Runtime processes a model in the following steps:
1. The ONNX model is loaded and converted into an in-memory graph representation.
2. A series of hardware-independent optimizations are applied.
3. Based on available Execution Providers, ONNX Runtime decomposes the graph into a set of subgraphs.
4. Each subgraph is assigned to an Execution Provider.
ONNX Runtime uses a greedy approach to assign nodes to Execution Providers: the available Execution Providers are considered in a specified priority order, and as many subgraphs as possible are assigned to the more specialized Execution Providers. The default Execution Provider comes last in this order and is responsible for executing the remaining subgraphs that were not assigned to any other Execution Provider.
Along with the flexibility to choose the model and the Execution Provider, there are hyper-parameters that can be tuned to improve performance.
ONNX Runtime supports two execution modes: sequential and parallel. The execution mode controls whether the operators of a graph are executed sequentially or in parallel. By default, the execution mode is sequential.
The parallel execution of several operators is scheduled on an inter-op thread pool; an optimized version of the Eigen thread pool is used for this inter-operator parallelism. The inter_op_num_threads parameter in ONNX Runtime sets the number of threads in this pool, and it only takes effect in parallel execution mode.
The execution of a single operator can also be parallelized, using an intra-op thread pool (or OpenMP, in builds compiled with OpenMP support). The intra_op_num_threads parameter in ONNX Runtime sets the number of threads for this intra-operator parallelism, which is the main source of parallelism in sequential execution mode.
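A brief sketch of these settings, assuming the Python API and a hypothetical model.onnx:

```python
import onnxruntime as ort

so = ort.SessionOptions()

# Run independent parts of the graph in parallel (the default is ORT_SEQUENTIAL).
so.execution_mode = ort.ExecutionMode.ORT_PARALLEL

# Threads used across operators; only relevant in parallel execution mode.
so.inter_op_num_threads = 2

# Threads used inside a single operator.
so.intra_op_num_threads = 4

session = ort.InferenceSession("model.onnx", so, providers=["CPUExecutionProvider"])
```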
By default, memory arenas do not return unused memory to the operating system. By setting certain parameters, users can have this memory freed at a chosen point in time; currently, the only supported point is the end of each run.
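A sketch of requesting this end-of-run shrinkage through RunOptions, assuming the Python API, a CPU arena on device 0, and a hypothetical model.onnx:

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

run_options = ort.RunOptions()
# Ask the CPU arena (device 0) to shrink back to its initial allocation
# at the end of this particular run.
run_options.add_run_config_entry("memory.enable_memory_arena_shrinkage", "cpu:0")

inputs = {session.get_inputs()[0].name: np.random.rand(1, 3, 224, 224).astype(np.float32)}
outputs = session.run(None, inputs, run_options)
```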
A region-based memory arena is a memory-management scheme in which each allocated object is assigned to a region. A region (also called a zone, arena, area, or memory context) is a collection of allocated objects that can be reallocated or freed together. In simpler terms, an arena is a large, contiguous piece of memory that is allocated once, after which its different parts are managed manually. The important point about an arena is that it gives complete control over how memory is allocated; the only part outside this control is the initial allocation call itself.
Often we have a model with dynamic input shapes that occasionally processes a request requiring a lot of memory. Since the arena does not release memory by default, the memory gained by growing the arena to service that one request remains allocated forever, even though most subsequent requests will probably not need it. In other words, a lot of memory stays allocated just to handle a few exceptional requests.
There are two strategies for extending (or shrinking) arena memory in response to incoming requests: kNextPowerOfTwo, the default, which extends the arena by progressively larger power-of-two amounts, and kSameAsRequested, which extends it by exactly the requested amount.
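A minimal sketch of selecting a strategy, assuming the arena_extend_strategy provider option of the CUDA Execution Provider and a hypothetical model.onnx:

```python
import onnxruntime as ort

providers = [
    (
        "CUDAExecutionProvider",
        {
            # Grow the arena only by the requested amount instead of the default
            # power-of-two growth, trading some speed for a smaller footprint.
            "arena_extend_strategy": "kSameAsRequested",
        },
    ),
    "CPUExecutionProvider",
]

session = ort.InferenceSession("model.onnx", providers=providers)
```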
ONNX Runtime provides model-profiling information (multi-threading behaviour, operator latencies, etc.) to the developer as a JSON file. This file can be generated in either of two ways.
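One of them, assuming the Python API is used, is to turn on the enable_profiling flag in SessionOptions; end_profiling then returns the path of the generated JSON trace (the model path and file prefix here are hypothetical):

```python
import numpy as np
import onnxruntime as ort

so = ort.SessionOptions()
so.enable_profiling = True              # emit a JSON profiling trace for this session
so.profile_file_prefix = "ort_profile"  # hypothetical prefix for the output file name

session = ort.InferenceSession("model.onnx", so, providers=["CPUExecutionProvider"])

inputs = {session.get_inputs()[0].name: np.random.rand(1, 3, 224, 224).astype(np.float32)}
session.run(None, inputs)

profile_path = session.end_profiling()  # path to the generated JSON profile
print(profile_path)
```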
For logging, the log_severity_level parameter is used to set the log level. To see all the logs, the value of this parameter should be set to 0 (verbose).
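A short sketch of the per-session setting, assuming SessionOptions is used and a hypothetical model.onnx:

```python
import onnxruntime as ort

so = ort.SessionOptions()
so.log_severity_level = 0  # 0 = verbose, 1 = info, 2 = warning, 3 = error, 4 = fatal

session = ort.InferenceSession("model.onnx", so, providers=["CPUExecutionProvider"])
```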