
Non-OK-status: GpuLaunchKernel in softmax_op_gpu.cu.cc #523

Open
guillaumekln opened this issue Oct 17, 2019 · 5 comments
@guillaumekln (Contributor) commented Oct 17, 2019

@luozhouyang reported this error during evaluation:

INFO:tensorflow:Step = 4300 ; source words/s = 78948, target words/s = 2012 ; Learning rate = 0.000100 ; Loss = 2.087587
INFO:tensorflow:Step = 4400 ; source words/s = 79358, target words/s = 2059 ; Learning rate = 0.000100 ; Loss = 2.108997
INFO:tensorflow:Step = 4500 ; source words/s = 79888, target words/s = 1977 ; Learning rate = 0.000100 ; Loss = 2.675094
INFO:tensorflow:Step = 4600 ; source words/s = 77566, target words/s = 2015 ; Learning rate = 0.000100 ; Loss = 2.173948
INFO:tensorflow:Step = 4700 ; source words/s = 80029, target words/s = 1967 ; Learning rate = 0.000100 ; Loss = 2.588823
INFO:tensorflow:Step = 4800 ; source words/s = 80122, target words/s = 1950 ; Learning rate = 0.000100 ; Loss = 2.336910
INFO:tensorflow:Step = 4900 ; source words/s = 78610, target words/s = 1998 ; Learning rate = 0.000100 ; Loss = 2.527997
INFO:tensorflow:Step = 5000 ; source words/s = 79802, target words/s = 1957 ; Learning rate = 0.000100 ; Loss = 2.110916
INFO:tensorflow:Running evaluation for step 5000
2019-10-17 06:07:54.980482: F tensorflow/core/kernels/softmax_op_gpu.cu.cc:192] Non-OK-status: GpuLaunchKernel( GenerateNormalizedProb<T, acc_type>, numBlocks, numThreadsPerBlock, 0, cu_stream, reinterpret_cast<const T*>(logits_in_.flat<T>().data()), reinterpret_cast<const acc_type*>(sum_probs.flat<acc_type>().data()), reinterpret_cast<const T*>(max_logits.flat<T>().data()), const_cast<T*>(softmax_out->flat<T>().data()), rows, cols, log_) status: Internal: invalid configuration argument

And here is a similar TensorFlow issue: "Non-OK-status for CudaLaunchKernel when torch is also imported" (tensorflow/tensorflow#27487)

Originally posted by @luozhouyang in #519 (comment)
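
For context (an aside, not something stated in this thread): "invalid configuration argument" is the CUDA runtime's message for cudaErrorInvalidConfiguration, which a kernel launch reports when its grid/block configuration is invalid, for example when the block count is 0. The standalone CUDA sketch below, with made-up file name, kernel name, and sizes, only shows that failure mode in isolation; it is not TensorFlow or OpenNMT-tf code.

// invalid_config_sketch.cu -- illustration only, hypothetical values.
// A kernel launched with a grid dimension of 0 should fail with
// cudaErrorInvalidConfiguration, whose message matches the
// "invalid configuration argument" text in the softmax_op_gpu.cu.cc check.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void Noop() {}

int main() {
  int rows = 0;                  // e.g. a softmax input that ends up with zero rows
  int numThreadsPerBlock = 128;
  int numBlocks = rows;          // any rows-derived block count collapses to 0

  Noop<<<numBlocks, numThreadsPerBlock>>>();
  cudaError_t status = cudaGetLastError();
  printf("launch status: %s\n", cudaGetErrorString(status));
  // On a CUDA-capable machine this should print:
  //   launch status: invalid configuration argument
  return 0;
}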

@guillaumekln (Contributor, Author)

@luozhouyang I have never seen this issue. Is there anything special about your installation?

@luozhouyang

I installed OpenNMT-tf using pip, and I am training this model in a Docker container based on the tensorflow/tensorflow:2.0.0-gpu-py3 image.

@guillaumekln (Contributor, Author)

Do you still face this issue?

@guillaumekln (Contributor, Author)

Closing this one. I don't think it is related to anything we do in OpenNMT-tf.

@FPBHW commented Oct 12, 2021

Could this maybe have something to do with batching? From the TensorFlow issues:

In case anyone else is going crazy because of the "GpuLaunchKernel(...) status: Internal: invalid configuration argument" error, please note that this may also occur if the batch size you use is such that there will be an odd batch with a single record. In my case, the error occurred with the following numbers, when distributed across 4 GPUs:
To fix the issue, change your batch size such that there won't be an odd batch with a single record.

Currently I see this error when running 'score' repeatedly. I am trying a different TF version.
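
To make the quoted explanation concrete, here is a small host-side sketch with made-up numbers (the dataset size, batch size, and split logic are all assumptions, not taken from this thread). It only illustrates how a leftover batch holding a single record, divided across 4 replicas, leaves most replicas with a zero-row input, which is exactly the kind of shape that drives a rows-derived block count to 0 as in the sketch above.

// batch_split_sketch.cu -- hypothetical numbers, host-only code, illustration only.
#include <cstdio>

int main() {
  const int num_examples = 4001;     // assumed dataset size
  const int global_batch_size = 40;  // assumed batch size
  const int num_replicas = 4;        // 4 GPUs, as in the quoted comment

  // The last global batch holds whatever records are left over.
  int leftover = num_examples % global_batch_size;  // = 1 record
  for (int r = 0; r < num_replicas; ++r) {
    // Even split of that final batch across replicas.
    int per_replica = leftover / num_replicas + (r < leftover % num_replicas ? 1 : 0);
    printf("replica %d gets %d row(s)\n", r, per_replica);  // prints 1, 0, 0, 0
  }
  // Three replicas receive an empty (zero-row) input; a kernel whose grid size is
  // derived from the row count would then be launched with 0 blocks.
  return 0;
}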
