Execution of Inference

Network execution is triggered when the inferRequest->infer() or inferRequest->start_async() methods are called. (src)

At a high level, all that is required is to enqueue OCL kernels with buffers. For that purpose, you need to find the cldnn::network instance, as it contains the buffers required for execution. (link) CPUStreamExecutor holds the streams, and each stream corresponds to a cldnn::network structure. (src)

The main body of network execution is cldnn::network::execute_impl. (src) In this function, set_arguments() is called to set OpenCL arguments, and execute_primitive is called to enqueue kernels to the OCL queue. In the case of a synchronous API call (that is, inferRequest->infer()), waiting for the completion of kernels is also required. This wait is called from the cldnn::network_output::get_memory() function. (src)
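The flow described above can be sketched with a small toy model. This is only an illustration, not the plugin's code: ToyQueue, ToyPrimitive, and ToyNetwork are hypothetical stand-ins for the OCL queue, a primitive, and cldnn::network, and the wait is inlined into infer() rather than reached through cldnn::network_output::get_memory().

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an in-order OCL queue (not a real cldnn type).
struct ToyQueue {
    std::vector<std::string> pending;   // enqueued, not yet completed
    std::vector<std::string> finished;  // completed kernels

    void enqueue(const std::string& kernel) { pending.push_back(kernel); }

    // Models a blocking wait: everything enqueued so far is now complete.
    void finish() {
        finished.insert(finished.end(), pending.begin(), pending.end());
        pending.clear();
    }
};

// Hypothetical stand-in for one primitive (layer) of the network.
struct ToyPrimitive {
    std::string name;
    bool args_set = false;
    explicit ToyPrimitive(std::string n) : name(std::move(n)) {}
};

// Hypothetical stand-in for cldnn::network.
struct ToyNetwork {
    std::vector<ToyPrimitive> primitives;
    ToyQueue queue;

    // Two phases, as in the text: bind kernel arguments, then enqueue kernels.
    void execute_impl() {
        for (auto& p : primitives) p.args_set = true;  // set_arguments() step
        for (auto& p : primitives) {
            assert(p.args_set);     // arguments must be bound before enqueue
            queue.enqueue(p.name);  // execute_primitive step
        }
    }

    // Asynchronous start: kernels are enqueued, but nothing is waited on.
    void start_async() { execute_impl(); }

    // Synchronous call: enqueue, then wait for completion before returning.
    void infer() {
        execute_impl();
        queue.finish();
    }
};
```

In this sketch, start_async() leaves the kernels in the pending list until something calls finish(), while infer() returns only after the pending list has been drained.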

Optimized-out node

During graph compilation (link), some nodes may be optimized out.

For example, a concat operation may be executed implicitly, or in other words, concat may be optimized out. Implicit concat is possible when the inputs of concat can write their output tensors directly into the resulting tensor of concat.

In such a case, you do not remove the node from the graph, to preserve the integrity of the node connections. The concat layer is just marked as optimized-out and is not executed at runtime. (src)
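A minimal sketch of the idea follows, assuming a flat 1-D layout. ToyProducer, ToyConcat, and run_implicit_concat are hypothetical names used only for illustration; the real plugin tracks this on its own node and memory types.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical producer node feeding the concat (not a real cldnn type).
struct ToyProducer {
    std::vector<float> values;  // what this node's kernel computes
    std::size_t offset = 0;     // where it writes inside concat's output
};

// Hypothetical concat node: kept in the graph, only flagged.
struct ToyConcat {
    bool optimized_out = false;  // set during graph compilation
    bool executed = false;       // stays false when the node is skipped
    std::vector<float> output;   // resulting tensor of concat
};

// Each producer writes straight into concat's output at its own offset,
// so no separate concat kernel has to run.
inline void run_implicit_concat(std::vector<ToyProducer>& producers,
                                ToyConcat& concat) {
    assert(concat.optimized_out);  // this sketch covers only the implicit case

    std::size_t total = 0;
    for (const auto& p : producers) total += p.values.size();
    concat.output.assign(total, 0.0f);

    std::size_t offset = 0;
    for (auto& p : producers) {
        p.offset = offset;
        for (std::size_t i = 0; i < p.values.size(); ++i)
            concat.output[p.offset + i] = p.values[i];
        offset += p.values.size();
    }
    // The concat node is still present in the graph, but is never executed.
}
```

The concat node keeps its connections to the producers and to its consumers, which is why it is flagged rather than deleted.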

Dumping layer in/out buffer during execution

The cldnn::network::execute_impl function also contains some logic to dump layer in/out buffers for debugging purposes. As it is related to memory usage, it deserves some description, too.

To dump buffers, you need to wait for the moment when the kernel is about to be called (for the source buffer) or has just been called (for the destination buffer). At other moments, you do not have the layer's buffer, because buffers are reused from the memory pool. (link)
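The timing constraint can be illustrated with a deliberately oversimplified model in which consecutive layers happen to share one pooled buffer. ToyLayer and run_and_dump are hypothetical names, and filling the buffer with a constant stands in for a kernel writing its result.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical layer: its "kernel" fills the destination with one value.
struct ToyLayer {
    std::string name;
    float fill;
};

using ToyDump = std::pair<std::string, std::vector<float>>;

// Runs the layers in order on a single reused buffer and snapshots the
// destination right after each kernel. Once the loop ends, `pooled` holds
// only the last layer's data; earlier results survive only in the dumps.
inline std::vector<ToyDump> run_and_dump(const std::vector<ToyLayer>& layers,
                                         std::vector<float>& pooled) {
    std::vector<ToyDump> dumps;
    for (const auto& layer : layers) {
        for (auto& v : pooled) v = layer.fill;   // kernel overwrites the buffer
        dumps.emplace_back(layer.name, pooled);  // dump just after the call
    }
    return dumps;
}
```

If the snapshot were taken only after the whole network had run, the first layer's output would already have been overwritten by the second layer.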

The get_stream().finish() function is called first, as you need to be synchronized with kernel execution. (src) Then, you can access the buffer. (src) How it is accessed depends on the kind of buffer. If it is usm_host or usm_shared, it is accessed directly. If it is usm_device, it is accessed after copying the data into host memory, because the host cannot access usm_device directly. (src) If it is OCL memory, you map it into host memory. (src)

Typically, a network executes with usm_host for the network input and output, and usm_device for the buffers inside the network.

For usage of this dumping feature, see this link.

See also