Execution of Inference Workloads on Hikey970 with layer splitting #882
Hi @Shraddhaa1 The Graph API in ACL is experimental and does not support that level of granularity, i.e. specifying the backend for each individual layer. You could experiment with the functions interface, which lets you mix GPU and CPU kernels; please see the example: https://github.com/ARM-software/ComputeLibrary/blob/master/examples/neoncl_scale_median_gaussian.cpp Hope this helps. |
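To illustrate the pattern that example uses, here is a minimal sketch (not the actual example; the exact function set depends on the ACL version): a NEON function and a CL function are chained by allocating CLTensors and mapping them whenever the CPU needs to touch the buffer.

#include "arm_compute/core/TensorInfo.h"
#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/CL/CLScheduler.h"
#include "arm_compute/runtime/CL/CLTensor.h"
#include "arm_compute/runtime/CL/CLFunctions.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"

using namespace arm_compute;

int main()
{
    CLScheduler::get().default_init(); // initialise the OpenCL backend

    // Three CL buffers; NEON functions can consume them through the
    // common ITensor interface once they are mapped to the host.
    CLTensor src, mid, dst;
    src.allocator()->init(TensorInfo(TensorShape(640U, 480U), 1, DataType::U8));
    mid.allocator()->init(TensorInfo(TensorShape(640U, 480U), 1, DataType::U8));
    dst.allocator()->init(TensorInfo(TensorShape(640U, 480U), 1, DataType::U8));

    NEMedian3x3   cpu_median; // runs on the CPU (NEON)
    CLGaussian5x5 gpu_gauss;  // runs on the GPU (OpenCL)

    cpu_median.configure(&src, &mid, BorderMode::UNDEFINED);
    gpu_gauss.configure(&mid, &dst, BorderMode::UNDEFINED);

    src.allocator()->allocate();
    mid.allocator()->allocate();
    dst.allocator()->allocate();

    src.map(); // expose the CL buffers to the CPU...
    mid.map();
    cpu_median.run(); // ...run the CPU stage...
    src.unmap();      // ...then hand the buffers back to OpenCL
    mid.unmap();

    gpu_gauss.run();  // GPU stage
    CLScheduler::get().sync();
    return 0;
}

The same map/run/unmap pattern generalises to neural-network layers: any function whose output feeds a function on the other backend just needs its tensors mapped or unmapped at the boundary.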
Hello Sir,
Thank you for the reply. That really helped.
Have a good day.
Sincerely,
Shraddha Dahal
|
Hello Sir,
I had contacted you a few weeks ago regarding the Execution of Inference
Workloads on Hikey970 with layer splitting. Could you please help me know
if there is any function specific in ARM-CL which would let me calculate
inference time taken by each layer of a neural network?
Thank you.
Sincerely,
Shraddha Dahal
|
Hi Shraddha,
If you build ACL with the option benchmark_examples=1, you can then run the network and use the instruments to see how much time each kernel takes:
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:./acl_neon+cl_release/ ./benchmark_graph_mobilenet --instruments=OPENCL_TIMESTAMPS_MS --example_args='--target=CL'
This will produce output like below:
OpenCLTimestamps/Now OpenCL: AVG=1121307039 ms
OpenCLTimestamps/Now Wall clock: AVG=1623931494169279 us
OpenCLTimestamps/[end]Conv2d_0+Conv2d_0/BatchNorm/gemm_mm_floating_point_f32_bifrost GWS[8,3136,1] LWS[4,1,1] #2: AVG=1121306920 ms
OpenCLTimestamps/[end]Conv2d_0+Conv2d_0/BatchNorm/im2col3x3_nhwc GWS[2,12544,1] #1: AVG=1121306919 ms
OpenCLTimestamps/[end]Conv2d_0+Conv2d_0/BatchNorm/reshape_to_columns GWS[3,3,3] #0: AVG=1121306918 ms
OpenCLTimestamps/[end]Conv2d_10_depthwise/depthwise+Conv2d_10_depthwise/BatchNorm/depthwise_convolution_3x3_nhwc_stride1 GWS[512,7,7] #36: AVG=1121306986 ms
OpenCLTimestamps/[end]Conv2d_10_pointwise/Conv2D+Conv2d_10_pointwise/BatchNorm/gemm_mm_reshaped_lhs_nt_rhs_t GWS[128,64,1] #40: AVG=1121306995 ms
OpenCLTimestamps/[end]Conv2d_10_pointwise/Conv2D+Conv2d_10_pointwise/BatchNorm/gemm_reshape_lhs_matrix_nt GWS[128,40,1] #39: AVG=1121306989 ms
OpenCLTimestamps/[end]Conv2d_10_pointwise/Conv2D+Conv2d_10_pointwise/BatchNorm/gemm_reshape_rhs_matrix_t GWS[128,128,1] #38: AVG=1121306988 ms
OpenCLTimestamps/[end]Conv2d_10_pointwise/Conv2D+Conv2d_10_pointwise/BatchNorm/reshape_to_columns GWS[512,1,1] #37: AVG=1121306988 ms
OpenCLTimestamps/[end]Conv2d_11_depthwise/depthwise+Conv2d_11_depthwise/BatchNorm/depthwise_convolution_3x3_nhwc_stride1 GWS[512,7,7] #41: AVG=1121306995 ms
OpenCLTimestamps/[end]Conv2d_11_pointwise/Conv2D+Conv2d_11_pointwise/BatchNorm/gemm_mm_reshaped_lhs_nt_rhs_t GWS[128,64,1] #45: AVG=1121307003 ms
OpenCLTimestamps/[end]Conv2d_11_pointwise/Conv2D+Conv2d_11_pointwise/BatchNorm/gemm_reshape_lhs_matrix_nt GWS[128,40,1] #44: AVG=1121306998 ms
OpenCLTimestamps/[end]Conv2d_11_pointwise/Conv2D+Conv2d_11_pointwise/BatchNorm/gemm_reshape_rhs_matrix_t GWS[128,128,1] #43: AVG=1121306997 ms
OpenCLTimestamps/[end]Conv2d_11_pointwise/Conv2D+Conv2d_11_pointwise/BatchNorm/reshape_to_columns GWS[512,1,1] #42: AVG=1121306997 ms
OpenCLTimestamps/[end]Conv2d_12_depthwise/depthwise+Conv2d_12_depthwise/BatchNorm/depthwise_convolution_3x3_nhwc GWS[512,7,7] #46: AVG=1121307004 ms
OpenCLTimestamps/[end]Conv2d_12_pointwise/Conv2D+Conv2d_12_pointwise/BatchNorm/gemm_mm_reshaped_lhs_nt_rhs_t GWS[256,16,1] #50: AVG=1121307012 ms
OpenCLTimestamps/[end]Conv2d_12_pointwise/Conv2D+Conv2d_12_pointwise/BatchNorm/gemm_reshape_lhs_matrix_nt GWS[128,10,1] #49: AVG=1121307008 ms
OpenCLTimestamps/[end]Conv2d_12_pointwise/Conv2D+Conv2d_12_pointwise/BatchNorm/gemm_reshape_rhs_matrix_t GWS[256,128,1] #48: AVG=1121307008 ms
OpenCLTimestamps/[end]Conv2d_12_pointwise/Conv2D+Conv2d_12_pointwise/BatchNorm/reshape_to_columns GWS[512,1,1] #47: AVG=1121307007 ms
OpenCLTimestamps/[end]Conv2d_13_depthwise/depthwise+Conv2d_13_depthwise/BatchNorm/depthwise_convolution_3x3_nhwc_stride1 GWS[1024,4,4] #51: AVG=1121307012 ms
OpenCLTimestamps/[end]Conv2d_13_pointwise/Conv2D+Conv2d_13_pointwise/BatchNorm/gemm_mm_reshaped_lhs_nt_rhs_t GWS[256,16,1] #55: AVG=1121307025 ms
OpenCLTimestamps/[end]Conv2d_13_pointwise/Conv2D+Conv2d_13_pointwise/BatchNorm/gemm_reshape_lhs_matrix_nt GWS[256,10,1] #54: AVG=1121307018 ms
OpenCLTimestamps/[end]Conv2d_13_pointwise/Conv2D+Conv2d_13_pointwise/BatchNorm/gemm_reshape_rhs_matrix_t GWS[256,256,1] #53: AVG=1121307018 ms
OpenCLTimestamps/[end]Conv2d_13_pointwise/Conv2D+Conv2d_13_pointwise/BatchNorm/reshape_to_columns GWS[1024,1,1] #52: AVG=1121307016 ms
OpenCLTimestamps/[end]Conv2d_1_depthwise/depthwise+Conv2d_1_depthwise/BatchNorm/depthwise_convolution_3x3_nhwc_stride1 GWS[32,56,56] #3: AVG=1121306921 ms
OpenCLTimestamps/[end]Conv2d_1_pointwise/Conv2D+Conv2d_1_pointwise/BatchNorm/gemm_mm_floating_point_f32_bifrost GWS[16,28,112] LWS[4,1,1] #5: AVG=1121306924 ms
OpenCLTimestamps/[end]Conv2d_1_pointwise/Conv2D+Conv2d_1_pointwise/BatchNorm/reshape_to_columns GWS[32,1,1] #4: AVG=1121306922 ms
OpenCLTimestamps/[end]Conv2d_2_depthwise/depthwise+Conv2d_2_depthwise/BatchNorm/depthwise_convolution_3x3_nhwc GWS[64,56,56] #6: AVG=1121306926 ms
OpenCLTimestamps/[end]Conv2d_2_pointwise/Conv2D+Conv2d_2_pointwise/BatchNorm/gemm_mm_floating_point_f32_bifrost GWS[32,14,56] LWS[4,1,1] #8: AVG=1121306929 ms
Another alternative is to use Arm NN's ExecuteNetwork to run a TFLite model and use the -e option, which will make the tool output the time consumed by each kernel.
For more information about ExecuteNetwork see:
https://github.com/ARM-software/armnn/tree/branches/armnn_21_02/tests/ExecuteNetwork
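For example, a sketch of an invocation (the model path and the tensor names input/output are placeholders, and the exact flags should be checked against the page above):
./ExecuteNetwork -f tflite-binary -m mobilenet_v1.tflite -i input -o output -c GpuAcc -e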
Hope this helps.
|
Hello Sir,
Thank you for the response. I tried the first method you mentioned in the
previous email.
In the makefile.arm file, I added benchmark_examples:=1 as follows:

BUILD:=native
NEON:=1
OPENCL:=1
ARCH:=arm64-v8a

all: release

CFLAGS:=-std=c++14
benchmark_examples:=1

debug:
	scons -j8 -Q arch=$(ARCH) build=$(BUILD) neon=$(NEON) opencl=$(OPENCL) build_dir=debug debug=1 extra_cxx_flags=$(CFLAGS)

release:
	scons -j8 -Q arch=$(ARCH) build=$(BUILD) neon=$(NEON) opencl=$(OPENCL) build_dir=release debug=0 extra_cxx_flags=$(CFLAGS)

sched:
	g++ -o build/release/examples/graph_temp_scheduler2.o -c -Wno-deprecated-declarations -Wall -DARCH_ARM -Wextra -Wno-unused-parameter -pedantic -Wdisabled-optimization -Wformat=2 -Winit-self -Wstrict-overflow=2 -Wswitch-default -fpermissive -std=gnu++11 -Wno-vla -Woverloaded-virtual -Wctor-dtor-privacy -Wsign-promo -Weffc++ -Wno-format-nonliteral -Wno-overlength-strings -Wno-strict-overflow -Wlogical-op -Wnoexcept -Wstrict-null-sentinel -Wno-implicit-fallthrough -march=armv8-a -Wno-ignored-attributes -Werror -O3 -ftree-vectorize -std=c++14 -D_GLIBCXX_USE_NANOSLEEP -DARM_COMPUTE_CPP_SCHEDULER=1 -DARM_COMPUTE_AARCH64_V8A -DNO_DOT_IN_TOOLCHAIN -DEMBEDDED_KERNELS -Iinclude -I. -I. examples/graph_temp_scheduler2.cpp
	g++ -o build/release/examples/graph_temp_scheduler2 -Wl,--allow-shlib-undefined build/release/examples/graph_temp_scheduler2.o build/release/utils/Utils.o build/release/utils/GraphUtils.o build/release/utils/CommonGraphOptions.o -Lbuild/release -L. -lpthread -larm_compute_graph -larm_compute -larm_compute_core
After that, I ran the commands:
$ make all
$ sudo LD_LIBRARY_PATH=/home/shunya/ComputeLibrary1/build/release ./build/release/examples/graph_mobilenet --instruments=OPENCL_TIMESTAMPS_MS --example_args='--target=CL'
but I could not see the output with the time taken by each kernel. The library path in the command above contains libarm_compute.so, libarm_compute_core.so, libarm_compute_graph.so, libarm_compute_core-static.a, libarm_compute_graph-static.a and libarm_compute-static.a. Could you please help me know where I am going wrong?
Have a good day.
Sincerely,
Shraddha Dahal
|
Hi,
Please try running benchmark_graph_mobilenet instead of graph_mobilenet.
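(With benchmark_examples=1 the benchmark binaries are generated under the tests output directory rather than next to the plain examples; the exact location depends on the build_dir setting, so the invocation would look something like:
LD_LIBRARY_PATH=/home/shunya/ComputeLibrary1/build/release ./build/release/tests/benchmark_graph_mobilenet --instruments=OPENCL_TIMESTAMPS_MS --example_args='--target=CL')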
Hope this helps.
|
Hello Sir,
Thank you for the reply. The file benchmark_graph_mobilenet is not generated anywhere inside the Compute Library. Could you please help me know whether the way I built ARM-CL with benchmark_examples=1 is correct?
Have a good day.
Sincerely,
Shraddha Dahal
|
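(A note on the likely cause here: benchmark_examples is a scons option, not a Make variable, so assigning it inside the Makefile has no effect unless it is forwarded to the scons command line, e.g.:
scons -j8 -Q arch=arm64-v8a build=native neon=1 opencl=1 benchmark_examples=1 build_dir=release debug=0 extra_cxx_flags=-std=c++14
With that, the benchmark binaries should be generated by the build.)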
Hello Sir,
Thank you for the reply. I can now observe the time taken by each kernel of the neural networks. Could you please help me know if there has been any update on how specific layers of a neural network can be assigned to either the CPU or the GPU? With the repo:
https://github.com/adityagupta1089/ComputeLibrary.git
I could mix the use of the CPU and GPU, but I am trying to pin individual layers to the CPU or the GPU. Could you please help me understand how this can be achieved?
Have a good evening.
Sincerely,
Shraddha Dahal
|
Hello Sir,
Thank you for the response. It really helped.
Have a good day.
Sincerely,
Shraddha Dahal
|
Hello Sir,
I was working on obtaining the time taken by each kernel with target CL for the networks available in the development repo, as you suggested in previous emails. Could you please help me know if there are similar instruments available that would give me the time taken by each kernel when running the benchmark networks on the CPU? I am currently using the command below for target CL:
sudo LD_LIBRARY_PATH=/home/shunya/ComputeLibrary/build ./build/tests/benchmark_graph_mobilenet_v2 --instruments=OPENCL_TIMESTAMPS_MS --example_args='--target=CL'
Could you please help me know how I can obtain similar information with target NEON?
Have a good day.
Sincerely,
Shraddha Dahal
|
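(For the NEON path, a sketch of the equivalent invocation, assuming the SCHEDULER_TIMER_MS instrument is available in this ACL version; it reports per-kernel times from the CPU scheduler in the same way OPENCL_TIMESTAMPS_MS does for CL:
sudo LD_LIBRARY_PATH=/home/shunya/ComputeLibrary/build ./build/tests/benchmark_graph_mobilenet_v2 --instruments=SCHEDULER_TIMER_MS --example_args='--target=NEON')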
Hello Sir,
Thank you for the response. It worked. I am also currently working on accessing performance monitoring counters such as cache misses, IPC and memory bandwidth, to check how memory-intensive each neural network is. I was using the perf tool for CPU profiling, but the above-mentioned counters are not supported on the HiKey970 board. This can be observed from the image attached below:
[image: perf_result.png]
The counters were not in the perf list:
[image: perf_list.png]
Could you please help me know if there are tools that would let me obtain these counters when running a neural network?
Thanks again.
Have a good day.
Sincerely,
Shraddha Dahal
|
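(One possible workaround, assuming the usual ARMv8 PMU setup: even when perf list omits the named events, the architectural PMU events can often be requested by raw number, e.g. 0x11 for CPU cycles, 0x08 for retired instructions, 0x03 for L1D cache refills and 0x13 for memory accesses; the numbers should be checked against the Cortex-A73/A53 TRMs:
sudo perf stat -e r11,r08,r03,r13 ./build/tests/benchmark_graph_mobilenet_v2 --example_args='--target=NEON')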
Hello Sir,
Thank you for the reply. I will open an issue on GitHub soon to discuss it. However, I have a question regarding the time taken by each kernel when running the benchmark networks on the CPU and the GPU. Running the benchmarks on both, I observed that the time taken by each kernel is much higher for target CL than for target NEON. I was not expecting such a big difference. Could you please help me understand why the kernels take less time with target NEON?
Thank you.
Sincerely,
Shraddha Dahal
|
@Shraddhaa1 I am also working on using ARM CL with HiKey 970. Would you like to discuss this? |
Hello Chang,
Thank you for the reply; I would like to discuss this more. Have you gone through the combined CPU + GPU utilization in ComputeLibrary? I have attached the link to the GitHub repo below:
https://github.com/adityagupta1089/ComputeLibrary.git
Could you please help me understand how the number of images to be processed is divided between the CPU and the GPU?
Is it possible to modify the code so that I could assign a specific number of images to the CPU and the GPU?
Have a good weekend.
Sincerely,
Shraddha Dahal
|
Hello @Shraddhaa1 I did go through the GitHub repo you shared above, but for my work I will only focus on using the CPU. By the way, I have now moved to Arm NN, because Arm NN is built on top of ARM CL; you can check Arm NN for more info. |
Hello Chang,
Thank you for the response. I will go through the GitHub link that you have
mentioned, and email you again with some queries.
Have a good day.
Sincerely,
Shraddha Dahal
|
Hello Chang,
I am currently working on the repo that you mentioned in the previous email: https://github.com/Ehsan-aghapour/ARMCL-pipe-all.
I ran the networks separately on the GPU, CPU Big and CPU Little. However, I observed that CPU Big performed better than the GPU on all of the networks. Also, when I split the layers of ResNet50 amongst the GPU, CPU Big and CPU Little, I could see that the inference time of the GPU is lower than that of CPU Big and CPU Little. The command I used was:
sudo LD_LIBRARY_PATH=/home/shunya/ARMCL-pipe-all-pipe-all/build ./graph_resnet50_all_pipe_sync --threads=4 --threads2=2 --total_cores=6 --partition_point=8 --partition_point2=12 --order=G-L-B --n=50
Could you please help me know why CPU Big is better than the GPU when I assign all of the layers to it?
Also, for ResNet50, the total number of parts is reported as 18:
First partition point: 8
Second partition point: 12
Total parts: 18
Should it not be 50 for ResNet50?
Have a good weekend.
Sincerely,
Shraddha Dahal
|
Hello,
I am currently working on executing inference workloads on the Hikey970. I am trying to split the layers of a network between the CPU and GPU and run the workloads so as to reduce inference latency. I am following the repo below to run the models with combined CPU and GPU utilization.
https://github.com/adityagupta1089/ComputeLibrary.git
Could you help me understand how I can split the layers of a network and assign them to the CPU and GPU?
Is there any CPU- or GPU-specific API in ARM-CL?
Thanks.
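(For whole-network backend selection, the stock graph examples already accept a target argument, e.g. ./graph_mobilenet --target=NEON or --target=CL; per-layer assignment is what the replies above discuss.)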