CUDA DeviceFree: out of memory error when building Aresdb first time #342

Open · alxmrr opened this issue Nov 13, 2019 · 8 comments

alxmrr commented Nov 13, 2019

Describe the issue
When running 'make run_server' to build version 0.0.2, the build fails with a DeviceFree: out of memory error after a few minutes. I am using a new server with no other processes running.

Reproduce the issue
NVIDIA driver version: 390.48
CUDA version: release 9.1, V9.1.85
golang version: 1.13
gcc version: 5.4.0
cmake version: 3.15.4

Follow the instructions to compile AresDB version 0.0.2 via 'make run_server'.

Error message
[ 15%] Built target mem
[100%] Built target algorithm
[100%] Built target lib
[100%] Built target aresd
Using config file:  config/ares.yaml
{"level":"info","msg":"Bootstrapping service","config":{"Port":9374,"DebugPort":43202,"RootPath":"ares-root","TotalMemorySize":161061273600,"SchedulerOff":false,"Version":"","Env":"","Query":{"DeviceMemoryUtilization":0.95,"DeviceChoosingTimeout":10,"TimezoneTable":{"TableName":"api_cities"},"EnableHashReduction":false},"DiskStore":{"WriteSync":true},"HTTP":{"MaxConnections":300,"ReadTimeOutInSeconds":20,"WriteTimeOutInSeconds":300},"RedoLogConfig":{"DiskConfig":{"Disabled":false},"KafkaConfig":{"Enabled":false,"Brokers":null,"TopicSuffix":""},"DiskOnlyForUnsharded":false},"Cluster":{"Enable":false,"Distributed":false,"Namespace":"","InstanceID":"","Controller":{"Address":"localhost:6708","Headers":null,"TimeoutSec":0},"Etcd":{"Zone":"local","Env":"dev","Service":"ares-datanode","CacheDir":"","ETCDClusters":[{"Zone":"local","Endpoints":["127.0.0.1:2379"],"KeepAlive":null,"TLS":null}],"SDConfig":{"InitTimeout":null},"WatchWithRevision":0},"HeartbeatConfig":{"Timeout":10,"Interval":1}}}}
panic: ERROR when calling CUDA functions: DeviceFree: out of memory
 
goroutine 1 [running]:
github.com/uber/aresdb/utils.StackError(0x0, 0x0, 0xc00004e040, 0x3d, 0x0, 0x0, 0x0, 0x0)
        /nvme1n1/go1/src/github.com/uber/aresdb/utils/error.go:61 +0x3f9
github.com/uber/aresdb/cgoutils.DoCGoCall(0xc0005b2e18, 0xc0004a44d0)
        /nvme1n1/go1/src/github.com/uber/aresdb/cgoutils/utils.go:31 +0xa7
github.com/uber/aresdb/cgoutils.doCGoCall(0xc0005b2e48, 0x1)
        /nvme1n1/go1/src/github.com/uber/aresdb/cgoutils/memory.go:188 +0x49
github.com/uber/aresdb/cgoutils.DeviceFree(0x0, 0x0)
        /nvme1n1/go1/src/github.com/uber/aresdb/cgoutils/memory.go:111 +0x5c
github.com/uber/aresdb/cmd/aresd/cmd.start(0x249e, 0xa8c2, 0xc0005660c0, 0x9, 0x2580000000, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
       /nvme1n1/go1/src/github.com/uber/aresdb/cmd/aresd/cmd/cmd.go:103 +0x1c2
github.com/uber/aresdb/cmd/aresd/cmd.Execute.func1(0xc00038e000, 0x1e39648, 0x0, 0x0)
        /nvme1n1/go1/src/github.com/uber/aresdb/cmd/aresd/cmd/cmd.go:85 +0x13d
github.com/spf13/cobra.(*Command).execute(0xc00038e000, 0xc00003c1d0, 0x0, 0x0, 0xc00038e000, 0xc00003c1d0)
        /nvme1n1/go1/pkg/mod/github.com/spf13/[email protected]/command.go:830 +0x2aa
github.com/spf13/cobra.(*Command).ExecuteC(0xc00038e000, 0xc0004a2050, 0x5, 0x134fe40)
        /nvme1n1/go1/pkg/mod/github.com/spf13/[email protected]/command.go:914 +0x2fb
github.com/spf13/cobra.(*Command).Execute(...)
        /nvme1n1/go1/pkg/mod/github.com/spf13/[email protected]/command.go:864
github.com/uber/aresdb/cmd/aresd/cmd.Execute(0x0, 0x0, 0x0)
        /nvme1n1/go1/src/github.com/uber/aresdb/cmd/aresd/cmd/cmd.go:95 +0x229
main.main()
        /nvme1n1/go1/src/github.com/uber/aresdb/cmd/aresd/main.go:20 +0x32
 
goroutine 1 [running]:
github.com/uber/aresdb/cgoutils.DoCGoCall(0xc0005b2e18, 0xc0004a44d0)
        /nvme1n1/go1/src/github.com/uber/aresdb/cgoutils/utils.go:31 +0xc1
github.com/uber/aresdb/cgoutils.doCGoCall(0xc0005b2e48, 0x1)
        /nvme1n1/go1/src/github.com/uber/aresdb/cgoutils/memory.go:188 +0x49
github.com/uber/aresdb/cgoutils.DeviceFree(0x0, 0x0)
        /nvme1n1/go1/src/github.com/uber/aresdb/cgoutils/memory.go:111 +0x5c
github.com/uber/aresdb/cmd/aresd/cmd.start(0x249e, 0xa8c2, 0xc0005660c0, 0x9, 0x2580000000, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /nvme1n1/go1/src/github.com/uber/aresdb/cmd/aresd/cmd/cmd.go:103 +0x1c2
github.com/uber/aresdb/cmd/aresd/cmd.Execute.func1(0xc00038e000, 0x1e39648, 0x0, 0x0)
        /nvme1n1/go1/src/github.com/uber/aresdb/cmd/aresd/cmd/cmd.go:85 +0x13d
github.com/spf13/cobra.(*Command).execute(0xc00038e000, 0xc00003c1d0, 0x0, 0x0, 0xc00038e000, 0xc00003c1d0)
        /nvme1n1/go1/pkg/mod/github.com/spf13/[email protected]/command.go:830 +0x2aa
github.com/spf13/cobra.(*Command).ExecuteC(0xc00038e000, 0xc0004a2050, 0x5, 0x134fe40)
        /nvme1n1/go1/pkg/mod/github.com/spf13/[email protected]/command.go:914 +0x2fb
github.com/spf13/cobra.(*Command).Execute(...)
        /nvme1n1/go1/pkg/mod/github.com/spf13/[email protected]/command.go:864
github.com/uber/aresdb/cmd/aresd/cmd.Execute(0x0, 0x0, 0x0)
        /nvme1n1/go1/src/github.com/uber/aresdb/cmd/aresd/cmd/cmd.go:95 +0x229
main.main()
        /nvme1n1/go1/src/github.com/uber/aresdb/cmd/aresd/main.go:20 +0x32
CMakeFiles/run_server.dir/build.make:57: recipe for target 'CMakeFiles/run_server' failed
make[3]: *** [CMakeFiles/run_server] Error 2
CMakeFiles/Makefile2:467: recipe for target 'CMakeFiles/run_server.dir/all' failed
make[2]: *** [CMakeFiles/run_server.dir/all] Error 2
CMakeFiles/Makefile2:474: recipe for target 'CMakeFiles/run_server.dir/rule' failed
make[1]: *** [CMakeFiles/run_server.dir/rule] Error 2
Makefile:298: recipe for target 'run_server' failed
make: *** [run_server] Error 2

shz117 (Contributor) commented Nov 13, 2019

what's the output of nvidia-smi in your environment?

alxmrr (Author) commented Nov 14, 2019

Output of nvidia-smi:
 
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.48                 Driver Version: 390.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:15:00.0 Off |                    0 |
| N/A   35C    P0    46W / 300W |      6MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:16:00.0 Off |                    0 |
| N/A   35C    P0    41W / 300W |      6MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:3A:00.0 Off |                    0 |
| N/A   33C    P0    44W / 300W |      6MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   36C    P0    42W / 300W |      6MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   34C    P0    41W / 300W |      6MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   36C    P0    43W / 300W |      6MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:B2:00.0 Off |                    0 |
| N/A   34C    P0    43W / 300W |      6MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:B3:00.0 Off |                    0 |
| N/A   35C    P0    42W / 300W |      6MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
 
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3413      G   /usr/lib/xorg/Xorg                             5MiB |
|    1      3413      G   /usr/lib/xorg/Xorg                             5MiB |
|    2      3413      G   /usr/lib/xorg/Xorg                             5MiB |
|    3      3413      G   /usr/lib/xorg/Xorg                             5MiB |
|    4      3413      G   /usr/lib/xorg/Xorg                             5MiB |
|    5      3413      G   /usr/lib/xorg/Xorg                             5MiB |
|    6      3413      G   /usr/lib/xorg/Xorg                             5MiB |
|    7      3413      G   /usr/lib/xorg/Xorg                             5MiB |
+-----------------------------------------------------------------------------+

shz117 (Contributor) commented Nov 15, 2019

That's weird... the error happens during initialization, so AresDB has not copied anything to device memory yet.

It looks like you are running on bare metal (without Docker), but I was not able to reproduce this. (I ran the same make rule with the same CUDA and driver versions, and the server started properly.)

A few things I would try:

  1. See whether you can run any sample CUDA app, for example something like the sketch below.
  2. Maybe try again after killing the Xorg processes (although it shouldn't matter).
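
For item 1, a minimal standalone check could look like this (a hypothetical test program, not part of AresDB; it exercises the plain cudaMalloc/cudaFree path that the failing DeviceFree call presumably goes through):

// check_alloc_free.cu (hypothetical test program)
// Allocates and frees a small buffer on every visible GPU and reports
// any CUDA runtime error, to rule out a broken driver/runtime setup.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int d = 0; d < count; ++d) {
        cudaSetDevice(d);
        void *buf = nullptr;
        err = cudaMalloc(&buf, 1 << 20);  // 1 MiB
        if (err == cudaSuccess) {
            err = cudaFree(buf);
        }
        printf("device %d: %s\n", d,
               err == cudaSuccess ? "ok" : cudaGetErrorString(err));
    }
    return 0;
}

Compile with nvcc check_alloc_free.cu -o check_alloc_free and run it; if any device reports an error here, the problem is below AresDB.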

alxmrr (Author) commented Nov 20, 2019

I tried a number of CUDA samples, including bandwidthTest, deviceQuery, and histogram, without error.

Linux version: Ubuntu 16.04.6 LTS

Could the problem be related to the Linux version, or is there a recommended NVIDIA driver and CUDA version combination for Ubuntu 16.04.6 LTS?

shz117 (Contributor) commented Nov 20, 2019

Not likely. I tested on the same Linux version.

alxmrr (Author) commented Nov 21, 2019

Still having trouble. Has anyone encountered this error from 'make test-cuda'?
 
[----------] 4 tests from UnaryTransformTest
[ RUN      ] UnaryTransformTest.CheckInt
Exception happend when doing UnaryTransform:parallel_for failed: invalid device function
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: invalid device function
Aborted (core dumped)
CMakeFiles/test-cuda.dir/build.make:60: recipe for target 'CMakeFiles/test-cuda' failed
make[3]: *** [CMakeFiles/test-cuda] Error 134
make[3]: Leaving directory '/nvme0n1/go2/src/github.com/uber/aresdb'
CMakeFiles/Makefile2:388: recipe for target 'CMakeFiles/test-cuda.dir/all' failed
make[2]: *** [CMakeFiles/test-cuda.dir/all] Error 2
make[2]: Leaving directory '/nvme0n1/go2/src/github.com/uber/aresdb'
CMakeFiles/Makefile2:395: recipe for target 'CMakeFiles/test-cuda.dir/rule' failed
make[1]: *** [CMakeFiles/test-cuda.dir/rule] Error 2
make[1]: Leaving directory '/nvme0n1/go2/src/github.com/uber/aresdb'
Makefile:262: recipe for target 'test-cuda' failed
make: *** [test-cuda] Error 2
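
For reference, "invalid device function" from Thrust usually means the binary contains no kernel code for the GPU's architecture, i.e. the build's -arch/-gencode flags do not match the device. A tiny runtime-API program (hypothetical helper, not part of AresDB) can show what the devices report:

// print_cc.cu (hypothetical helper)
// Prints each device's compute capability so it can be compared
// against the -arch/-gencode flags used when building the tests.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return 1;
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("device %d (%s): compute capability %d.%d\n",
               d, prop.name, prop.major, prop.minor);
    }
    return 0;
}

A V100 reports 7.0, so the build would need sm_70 code (or PTX that the driver can JIT).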

alxmrr (Author) commented Nov 22, 2019

I just realized I didn't mention this before: I am running in GPU mode, i.e. building with 'cmake -DQUERY_MODE=DEVICE'. I am also considering the Docker implementation, but I saw that its cmake command does not specify QUERY_MODE. Does the Docker version run in GPU or CPU mode?

shz117 (Contributor) commented Dec 3, 2019

> Does the Docker version run in GPU or CPU mode?

QUERY_MODE=DEVICE is the one you want to set on a GPU machine.
If QUERY_MODE is missing, the Makefile will set it based on whether the build machine has a GPU card (here).
The Docker version should be in GPU mode because it is built on top of nvidia-docker.
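
For reference, the explicit GPU-mode build discussed in this thread looks like the following (both commands appear earlier in the thread; the trailing source-directory argument to cmake is an assumption and depends on your checkout layout):

cmake -DQUERY_MODE=DEVICE .
make run_server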
