Intel i5-5300U HD Graphics doesn't finish running #36

Open
rajgott opened this issue Jul 8, 2016 · 25 comments

@rajgott

rajgott commented Jul 8, 2016

I have an i5-5300U and want to do inference on the integrated GPU. I can detect the GPU using clDeviceQuery. I compiled and installed Greentea with Intel OpenCL 1.2.
clDeviceQuery.txt

I can test my model on the CPU in ~1 second per image. When I switch to GPU mode, the inference doesn't finish; I have waited 6+ hours. The CPU runs at close to 100% during this run.

Is this normal? Has anyone got it to work on Intel integrated graphics?

@naibaf7
Owner

naibaf7 commented Jul 8, 2016

@rajgott
Can you show me your Makefile.config and your network prototxt? In some cases you may be running out of GPU memory, or the convolution engine may not be suitable for your GPU.
On the 5300U, you should test whether enabling either INTEL_SPATIAL or LIBDNN in the Makefile.config works instead.
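
For reference, a rough sketch of the relevant Makefile.config lines (the flag names are assumed to match the CMake options USE_INTEL_SPATIAL and USE_LIBDNN; check Makefile.config.example in your checkout, and enable only one engine at a time so results stay comparable):

# Intel spatial convolution engine (assumed flag name)
USE_INTEL_SPATIAL := 1
# or, alternatively, the LibDNN convolution engine
# USE_LIBDNN := 1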

Also try to run:

./build/tools/caffe device_query

and post the result here. It is the "built-in clinfo" for OpenCL Caffe.

@rajgott
Author

rajgott commented Jul 8, 2016

Here is my Makefile.config
Makefile.config.txt

This happens with standard models from BVLC, for example GoogLeNet. INTEL_SPATIAL was enabled; now I have enabled LIBDNN as well. Even with this, the problem remains.

Output of ./build/tools/caffe device_query:
I0708 07:01:31.117813 6125 common.cpp:373] Total devices: 2
I0708 07:01:31.118115 6125 common.cpp:374] CUDA devices: 0
I0708 07:01:31.118127 6125 common.cpp:375] OpenCL devices: 2
I0708 07:01:31.118134 6125 common.cpp:399] Device id: 0
I0708 07:01:31.118144 6125 common.cpp:401] Device backend: OpenCL
I0708 07:01:31.118168 6125 common.cpp:403] Backend details: Intel(R) Corporation: OpenCL 1.2
I0708 07:01:31.118181 6125 common.cpp:405] Device vendor: Intel(R) Corporation
I0708 07:01:31.118191 6125 common.cpp:407] Name: Intel(R) HD Graphics
I0708 07:01:31.118198 6125 common.cpp:409] Total global memory: 3427585229
I0708 07:01:31.118208 6125 common.cpp:399] Device id: 1
I0708 07:01:31.118216 6125 common.cpp:401] Device backend: OpenCL
I0708 07:01:31.118224 6125 common.cpp:403] Backend details: Intel(R) Corporation: OpenCL 1.2
I0708 07:01:31.118240 6125 common.cpp:405] Device vendor: Intel(R) Corporation
I0708 07:01:31.118249 6125 common.cpp:407] Name: Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz
I0708 07:01:31.118435 6125 common.cpp:409] Total global memory: 7245664256

Thanks

@naibaf7
Owner

naibaf7 commented Jul 8, 2016

Can you try with LIBDNN enabled but INTEL_SPATIAL disabled, and also with both disabled, just to be sure? From your information I can't think of any problem other than a stalling convolution.
Otherwise, can you try to run the network on --gpu1 instead of --gpu0 and see if at least OpenCL on CPU works?
Make sure to do clean-builds (make clean, make all) before testing.
You can also try to run
./build/test/test_all.testbin 0
./build/test/test_all.testbin 1
to see if a certain layer gets stalled.
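
Putting that together, a minimal sketch of the clean-build-and-test sequence (the -j value is just an assumption; adjust it for your machine):

make clean
make all -j4
./build/test/test_all.testbin 0   # OpenCL device 0 (here: the HD Graphics iGPU)
./build/test/test_all.testbin 1   # OpenCL device 1 (here: the CPU OpenCL device)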

@rajgott
Author

rajgott commented Jul 12, 2016

I tried all combinations with LIBDNN and INTEL_SPATIAL, but inference still stalls.

Output of ./build/test/test_all.testbin 1, for example BatchNormLayerTest/2:
[----------] 3 tests from BatchNormLayerTest/2, where TypeParam = caffe::GPUDevice
[ RUN ] BatchNormLayerTest/2.TestForward
[ OK ] BatchNormLayerTest/2.TestForward (2 ms)
[ RUN ] BatchNormLayerTest/2.TestForwardInplace
[ OK ] BatchNormLayerTest/2.TestForwardInplace (1 ms)
[ RUN ] BatchNormLayerTest/2.TestGradient
[ OK ] BatchNormLayerTest/2.TestGradient (15257 ms)
[----------] 3 tests from BatchNormLayerTest/2 (15260 ms total)
This segfaults after passing several tests. The output:
test1.txt

./build/test/test_all.testbin 0 is much slower. For example BatchNormLayerTest/2:
[----------] 3 tests from BatchNormLayerTest/2, where TypeParam = caffe::GPUDevice
[ RUN ] BatchNormLayerTest/2.TestForward
[ OK ] BatchNormLayerTest/2.TestForward (970 ms)
[ RUN ] BatchNormLayerTest/2.TestForwardInplace
[ OK ] BatchNormLayerTest/2.TestForwardInplace (68 ms)
[ RUN ] BatchNormLayerTest/2.TestGradient
[ OK ] BatchNormLayerTest/2.TestGradient (369108 ms)
[----------] 3 tests from BatchNormLayerTest/2 (370146 ms total)
It may take a while for the remaining tests to finish.

Thanks

@naibaf7
Owner

naibaf7 commented Jul 12, 2016

@rajgott
Hmm, it seems INTEL_SPATIAL fails on your GPU (probably because it does not have the expected Skylake OpenCL features).

I can't really tell what's wrong; it looks like you ran the tests with INTEL_SPATIAL compiled in. You should compile in the default configuration (LIBDNN and INTEL_SPATIAL off) to find out more...

@rajgott
Author

rajgott commented Jul 12, 2016

I have INTEL_SPATIAL and LIBDNN commented out, as in the default Makefile.config.

CMakeCache.txt shows this:
//No help, variable specified on the command line.
USE_INTEL_SPATIAL:UNINITIALIZED=OFF
//Build Caffe with OpenCL libdnn
USE_LIBDNN:BOOL=OFF

How else to confirm these two are disabled/enabled?
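
(For reference, the only check I have so far is grepping the build configuration, a sketch; the Makefile.config flag names are assumed to mirror the CMake options shown above:)

grep -E 'USE_(INTEL_SPATIAL|LIBDNN)' Makefile.config build/CMakeCache.txt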

@mattg-sp

I have the same CPU, and I experience the exact same failure on device 1. However, that's the CPU cores, which we don't plan to use. If I run that test case on device 0 (the real HD Graphics GPU), it passes. I am using the default setting of USE_INTEL_SPATIAL.

The problem exposed by the tests is that the GPU backend (device 0) runs many of them somewhere between 60x and 270x slower than device 1. During this time, the test uses about 6% of a CPU core, and the rest of its time seems to be io_wait.

@naibaf7
Owner

naibaf7 commented Jul 21, 2016

@mattg-sp
OK. I got an Intel test platform now and will investigate this further.
As a benchmark, can you please run this:
./build/tools/caffe time -model models/bvlc_alexnet/benchmark64.prototxt -gpu=0 -iterations=5

@mattg-sp

First, thanks for all your help!

Second, that command has been running for 75 minutes, using 99% of a CPU core (virtually all user time). I've attached a profile and the callstacks of all the threads.

caffe_profile-time_gpu0_bvlc_alexnet-benchmark64.txt
caffe_callstacks-time_gpu0_bvlc_alexnet-benchmark64.txt

@naibaf7
Owner

naibaf7 commented Jul 22, 2016

@mattg-sp
Uh, that's not supposed to happen, especially not if you run it on the iGPU. Are you both using MacBooks, by any chance?
Can I have these details please:

  • Operating system
  • clinfo
  • ./build/tools/caffe device_query
  • Laptop type and brand

It does not stall like that on either the i7-3632QM or i7-6560U integrated GPUs I use for testing (both on beignet-1.1 and Fedora 23/24).

@gongzg ideas?

@mattg-sp

sumac:~/caffe # uname -a

Linux sumac 4.1.20-11-default #1 SMP PREEMPT Fri Mar 18 14:42:07 UTC 2016 (0a392b2) x86_64 x86_64 x86_64 GNU/Linux

sumac:~/caffe # cat /etc/os-release

NAME="SLES"
VERSION="12-SP1"
VERSION_ID="12.1"
PRETTY_NAME="SUSE Linux Enterprise Server 12 SP1"
ID="sles"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:12:sp1"

I'm having issues attaching files to this post, so I'll attach the rest of the requested details in a separate post.

@gongzg

gongzg commented Jul 22, 2016

@naibaf7 @mattg-sp @rajgott, beignet seems to have a CPU-side performance issue with the gradient test case: it runs slower and slower during the iterations and appears to stall. The Beignet team is investigating it. However, you will not have that issue if you run convnet-benchmarks or the case Fabian mentioned above with INTEL_SPATIAL enabled. For BDW, the recommended kernel version is 4.4 or newer; for SKL, it is 4.6 or newer.

@mattg-sp

caffe-device_query.txt
clinfo.txt

The hardware is actually an Intel NUC (model: NUC5i5MYHE) with 16 GB of RAM.

Actually, here's the top of /proc/meminfo:

MemTotal:       16300032 kB
MemFree:         6359192 kB
MemAvailable:   13040116 kB

@mattg-sp

@gongzg thanks, but would that explain the behavior of caffe time -model models/bvlc_alexnet/benchmark64.prototxt?

Also, how do I know whether we're using beignet? The runtime I'm using is intel-linux-media-ocl_generic_16.4.4-47109_64bit.tar.gz, which I downloaded from Intel's website. Is that built on beignet?

@mattg-sp

mattg-sp commented Jul 22, 2016

clinfo_dos.txt
Here's the same clinfo, with DOS line endings.

@naibaf7
Owner

naibaf7 commented Jul 22, 2016

@mattg-sp
Hmm, haven't tried with that package yet.
Usually it's easiest to use beignet (https://www.freedesktop.org/wiki/Software/Beignet/) as provided by the operating system (Ubuntu and Fedora have the "beignet" package in their repositories) on a recent kernel (4.4 or 4.6, as @gongzg pointed out).

Alternatively, try to compile the most recent beignet from source: https://cgit.freedesktop.org/beignet/
Instructions are here: https://gist.github.com/spiralray/cae0bc235509e495fec1

The installation is successful if you can find the "beignet-1.x" string in clinfo.
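
A rough sketch of that source build, following the gist linked above (the clone URL, -j value, and dependency list are assumptions; the Beignet page lists the exact prerequisites such as LLVM/clang, libdrm, and ocl-icd):

git clone git://anongit.freedesktop.org/beignet
cd beignet && mkdir build && cd build
cmake ..
make -j4
sudo make install
clinfo | grep -i beignet   # success if a "beignet-1.x" string shows up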

@gongzg

gongzg commented Jul 22, 2016

@mattg-sp let's focus on one configuration at a time; all of my comments are for USE_INTEL_SPATIAL=ON. I just saw your clinfo and confirmed that you are using the closed-source OpenCL compiler, but the version is a little out of date. Please switch to the latest published version at https://software.intel.com/en-us/articles/opencl-drivers#latest_linux_driver. The clinfo output should be:
Number of devices 1
Device Name Intel(R) HD Graphics
Device Vendor Intel(R) Corporation
Device Vendor ID 0x8086
Device Version OpenCL 1.2
Driver Version 1.0.47971
Device OpenCL C Version OpenCL C 1.2 ( using USC )
Device Type GPU

@naibaf7
Owner

naibaf7 commented Jul 22, 2016

@gongzg
What's the difference between beignet and the closed-source compiler? Can you elaborate on why it even exists?

@mattg-sp

I believe the closed source SDK came first. It's understandable why people want open source, though.

The reason we're using closed source is that we're also using Intel's Media SDK. I'll investigate whether beignet can be used in conjunction with that.

@gongzg

gongzg commented Jul 23, 2016

@naibaf7 that's a little bit complicated. One of the reasons is that the open-source version is Linux-only, while the closed-source version is derived from the Windows OCL driver. @mattg-sp Thanks for your explanation; yes, the closed-source SDK for Windows came first, then came the open-source version for Linux, and then the OpenCL SDK began to support Linux. I'll stop this discussion here; let's focus on the issue itself :).

If you want to use beignet, the recommended beignet version is git master, and the recommended LLVM version is 3.6. LLVM evolves very quickly, and newer versions sometimes bring compatibility issues with beignet. See https://www.freedesktop.org/wiki/Software/Beignet/, which recommends LLVM 3.5 or 3.6.

If you have a BDW or HSW machine and want to use the OpenCL SDK, I suggest version "1.0.47971", which is what I am using on a BDW machine right now; it should have no issue running those benchmarks.

If you have an SKL machine, only beignet is supported so far.

@mattg-sp

mattg-sp commented Jul 23, 2016

By upgrading my i915 device driver, I was able to resolve the issue of slow tests. Now, all unit tests pass on the GPU except for these:

Im2colLayerTest/2.TestDilatedGradient
Im2colLayerTest/2.TestDilatedGradientForceND
ConvolutionLayerTest/2.TestDilatedGradient

And those pass on the CPU device.

The benchmark64 still hangs on the GPU device, however. I will now investigate using the OpenCL 2.0 runtime and Beignet.

@naibaf7
Owner

naibaf7 commented Jul 23, 2016

@mattg-sp
You can also use more lightweight versions of the benchmark: start at benchmark1 and, if that passes, go up in batch size until you find the fastest-performing batch size (which is the smallest batch size that fully exhausts the GPU cores; check scaling by dividing the time by the batch size).
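
As an illustration of that scaling check (the numbers here are purely hypothetical):

benchmark16: 8.0 s total  -> 8.0 / 16 = 0.500 s per image
benchmark32: 12.0 s total -> 12.0 / 32 = 0.375 s per image   (still improving, keep going)
benchmark64: 24.0 s total -> 24.0 / 64 = 0.375 s per image   (no further gain, so batch 32 already saturates the GPU)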

@mattg-sp

mattg-sp commented Jul 23, 2016

Thanks. I got everything up to benchmark32 working on GPU 0 (Total Time: 12673.4 ms). Incidentally, it's about twice as fast as GPU 1 (Total Time: 25053.4 ms, on a CPU with 2 cores / 4 threads).

Wait... now even benchmark64 works. But I can still scroll back to last night and see the run that didn't work. Nothing has changed since then: no reboots, and I didn't run or install anything until I started with benchmark1 this morning. I'm definitely not mistaken; I've checked over the parameters, and I can clearly see that I canceled the failed run after 7m2.418s.

Update: even benchmark128 passed. Three out of three times, so far.

Maybe I'll reboot and see if I can get it to hang, again.

@mattg-sp

Oh, I was also going to ask whether any benchmark data from different platforms is collected anywhere.

And are the unit test failures I mentioned a few posts ago anything to be concerned about? Are they likely to compromise the integrity of my results?

@naibaf7
Owner

naibaf7 commented Jul 23, 2016

@mattg-sp
Yes, these failures basically mean you can't train correctly (the gradients are wrong) with the Caffe convolution engine. You could check whether LibDNN verification passes; Intel spatial convolution, however, uses the default engine for the backward pass, and since that verification fails, it is not usable.
What you could do is check how far off the values are. If they are only slightly off and only just fail the kappa test, you might be fine, though.
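
One way to look at the reported differences is to re-run just the failing tests with a gtest filter (a sketch; the device-id argument follows the test_all.testbin usage shown earlier in this thread, and --gtest_filter is standard googletest):

./build/test/test_all.testbin 0 --gtest_filter='Im2colLayerTest/2.TestDilatedGradient*:ConvolutionLayerTest/2.TestDilatedGradient'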

Which device is the GPU and which is the CPU on your system (0 or 1)? If 0 is the GPU, then that's not too bad :)
