
Non-OK-status: GpuLaunchKernel in softmax_op_gpu.cu.cc #523

Open
guillaumekln opened this issue Oct 17, 2019 · 5 comments
@guillaumekln (Contributor) commented Oct 17, 2019

@luozhouyang reported this error during evaluation:

INFO:tensorflow:Step = 4300 ; source words/s = 78948, target words/s = 2012 ; Learning rate = 0.000100 ; Loss = 2.087587
INFO:tensorflow:Step = 4400 ; source words/s = 79358, target words/s = 2059 ; Learning rate = 0.000100 ; Loss = 2.108997
INFO:tensorflow:Step = 4500 ; source words/s = 79888, target words/s = 1977 ; Learning rate = 0.000100 ; Loss = 2.675094
INFO:tensorflow:Step = 4600 ; source words/s = 77566, target words/s = 2015 ; Learning rate = 0.000100 ; Loss = 2.173948
INFO:tensorflow:Step = 4700 ; source words/s = 80029, target words/s = 1967 ; Learning rate = 0.000100 ; Loss = 2.588823
INFO:tensorflow:Step = 4800 ; source words/s = 80122, target words/s = 1950 ; Learning rate = 0.000100 ; Loss = 2.336910
INFO:tensorflow:Step = 4900 ; source words/s = 78610, target words/s = 1998 ; Learning rate = 0.000100 ; Loss = 2.527997
INFO:tensorflow:Step = 5000 ; source words/s = 79802, target words/s = 1957 ; Learning rate = 0.000100 ; Loss = 2.110916
INFO:tensorflow:Running evaluation for step 5000
2019-10-17 06:07:54.980482: F tensorflow/core/kernels/softmax_op_gpu.cu.cc:192] Non-OK-status: GpuLaunchKernel( GenerateNormalizedProb<T, acc_type>, numBlocks, numThreadsPerBlock, 0, cu_stream, reinterpret_cast<const T*>(logits_in_.flat<T>().data()), reinterpret_cast<const acc_type*>(sum_probs.flat<acc_type>().data()), reinterpret_cast<const T*>(max_logits.flat<T>().data()), const_cast<T*>(softmax_out->flat<T>().data()), rows, cols, log_) status: Internal: invalid configuration argument

And here is a similar TensorFlow issue: "Non-OK-status for CudaLaunchKernel when torch is also imported" (tensorflow/tensorflow#27487)

Originally posted by @luozhouyang in #519 (comment)
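
For context (an aside, not something stated in this thread): "invalid configuration argument" is the CUDA runtime's message for cudaErrorInvalidConfiguration, which a kernel launch reports when its grid/block configuration is invalid, for example when the block count is 0. The standalone CUDA sketch below, with made-up file name, kernel name, and sizes, only shows that failure mode in isolation; it is not TensorFlow or OpenNMT-tf code.

// invalid_config_sketch.cu -- illustration only, hypothetical values.
// A kernel launched with a grid dimension of 0 should fail with
// cudaErrorInvalidConfiguration, whose message matches the
// "invalid configuration argument" text in the softmax_op_gpu.cu.cc check.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void Noop() {}

int main() {
  int rows = 0;                  // e.g. a softmax input that ends up with zero rows
  int numThreadsPerBlock = 128;
  int numBlocks = rows;          // any rows-derived block count collapses to 0

  Noop<<<numBlocks, numThreadsPerBlock>>>();
  cudaError_t status = cudaGetLastError();
  printf("launch status: %s\n", cudaGetErrorString(status));
  // On a CUDA-capable machine this should print:
  //   launch status: invalid configuration argument
  return 0;
}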

@guillaumekln (Contributor, Author)

@luozhouyang I have never seen this issue. Is there anything special about your installation?

@luozhouyang

I installed OpenNMT-tf using pip, and I am training this model in a Docker container based on the tensorflow/tensorflow:2.0.0-gpu-py3 image.

@guillaumekln (Contributor, Author)

Do you still face this issue?

@guillaumekln (Contributor, Author)

Closing this one. I don't think it is related to anything we do in OpenNMT-tf.

@FPBHW commented Oct 12, 2021

Could this maybe have something to do with batching? From the TensorFlow issues:

In case anyone else is going crazy because of the "GpuLaunchKernel(...) status: Internal: invalid configuration argument" error, please note that this may also occur if the batch size you use is such that there will be an odd batch with a single record. In my case, the error occurred with the following numbers, when distributed across 4 GPUs:
To fix the issue, change your batch size such that there won't be an odd batch with a single record.

Currently I see this error when running 'score' repeatedly. I am trying a different TF version.
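
To make the quoted explanation concrete, here is a small host-side sketch with made-up numbers (the dataset size, batch size, and split logic are all assumptions, not taken from this thread). It only illustrates how a leftover batch holding a single record, divided across 4 replicas, leaves most replicas with a zero-row input, which is exactly the kind of shape that drives a rows-derived block count to 0 as in the sketch above.

// batch_split_sketch.cu -- hypothetical numbers, host-only code, illustration only.
#include <cstdio>

int main() {
  const int num_examples = 4001;     // assumed dataset size
  const int global_batch_size = 40;  // assumed batch size
  const int num_replicas = 4;        // 4 GPUs, as in the quoted comment

  // The last global batch holds whatever records are left over.
  int leftover = num_examples % global_batch_size;  // = 1 record
  for (int r = 0; r < num_replicas; ++r) {
    // Even split of that final batch across replicas.
    int per_replica = leftover / num_replicas + (r < leftover % num_replicas ? 1 : 0);
    printf("replica %d gets %d row(s)\n", r, per_replica);  // prints 1, 0, 0, 0
  }
  // Three replicas receive an empty (zero-row) input; a kernel whose grid size is
  // derived from the row count would then be launched with 0 blocks.
  return 0;
}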
