CUDA ERROR #44

kkkkkk123-ops · 2024-11-25T13:06:05Z

Does anyone have the same issue?
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
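
For anyone trying the `CUDA_LAUNCH_BLOCKING=1` suggestion from the message above, here is a minimal sketch of setting it from inside a Python script. The helper name `enable_blocking_launches` is hypothetical, and the sketch assumes the variable has to be in the environment before CUDA is initialized, so it is safest to set it before `torch` is imported.

```python
import os
import sys


def enable_blocking_launches() -> bool:
    """Set CUDA_LAUNCH_BLOCKING=1 for the current process.

    Returns True if torch had not been imported yet (the safe case) and
    False if torch was already imported, in which case the setting may be
    read too late to take effect. (Hypothetical helper, not a PyTorch API.)
    """
    torch_already_loaded = "torch" in sys.modules
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return not torch_already_loaded


if __name__ == "__main__":
    set_in_time = enable_blocking_launches()
    print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
    if not set_in_time:
        print("warning: torch was imported before the variable was set")
```

Setting the variable on the command line, e.g. `CUDA_LAUNCH_BLOCKING=1 python train.py`, achieves the same thing without touching the script. With blocking launches the stack trace should point at the call that actually failed, at the cost of slower training.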

CodeStarting-design · 2024-12-17T08:49:34Z

Hello,

I have encountered the same issue and set CUDA_LAUNCH_BLOCKING=1 for better debugging. Below is the specific error message I received:

  File "train.py", line 221, in <module>
    train(net, loader_train, loader_test, optimizer, criterion)
  File "train.py", line 66, in train
    out = net(x)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/DEA-RWKV/code/model/backbone_train.py", line 113, in forward
    x8 = self.level3_VRWKV8(x8, patch_resolution)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/DEA-RWKV/code/model/vrwkv6.py", line 356, in forward
    x = _inner_forward(x)
  File "/root/DEA-RWKV/code/model/vrwkv6.py", line 349, in _inner_forward
    x = x + self.drop_path(self.att(self.ln1(x), patch_resolution))
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/DEA-RWKV/code/model/vrwkv6.py", line 239, in forward
    x = _inner_forward(x)
  File "/root/DEA-RWKV/code/model/vrwkv6.py", line 232, in _inner_forward
    x = RUN_CUDA_RWKV6(B, T, C, self.n_head, r, k, v, w, u=self.time_faaaa)
  File "/root/DEA-RWKV/code/model/vrwkv6.py", line 66, in RUN_CUDA_RWKV6
    return WKV_6.apply(B, T, C, H, r, k, v, w, u)
 (Triggered internally at  ../torch/csrc/autograd/python_anomaly_mode.cpp:104.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "train.py", line 221, in <module>
    train(net, loader_train, loader_test, optimizer, criterion)
  File "train.py", line 72, in train
    loss.backward()
  File "/root/miniconda3/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/root/miniconda3/lib/python3.8/site-packages/torch/autograd/function.py", line 253, in apply
    return user_fn(self, *args)
  File "/root/DEA-RWKV/code/model/vrwkv6.py", line 62, in backward
    gu = torch.sum(gu, 0).view(H, C//H)
RuntimeError: CUDA error: an illegal memory access was encountered

Have you found a solution for this issue? Any insight would be greatly appreciated!

Thank you.
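
One thing that might be worth ruling out, given that the trace ends in `gu = torch.sum(gu, 0).view(H, C//H)` inside the custom `WKV_6` op, is a shape mismatch in the tensors passed to the kernel. The sketch below is only an assumption-laden sanity check, not a confirmed fix: the helper `check_wkv6_shapes` is hypothetical, the `(B, T, C)` layout for `r`, `k`, `v`, `w` is assumed rather than taken from the repository, and the `(H, C // H)` shape for `u` is inferred from the `.view(H, C//H)` call in the traceback.

```python
from typing import Dict, List, Tuple


def check_wkv6_shapes(
    B: int, T: int, C: int, H: int, shapes: Dict[str, Tuple[int, ...]]
) -> List[str]:
    """Return a list of human-readable shape problems (empty if none found).

    Hypothetical helper: the expected shapes are assumptions, see above.
    """
    problems: List[str] = []

    # The backward pass reshapes to (H, C // H), so C must split evenly.
    divisible = H > 0 and C % H == 0
    if not divisible:
        problems.append(f"C={C} is not divisible by H={H}")

    # Assumed layout for the per-token inputs.
    for name in ("r", "k", "v", "w"):
        if shapes.get(name) != (B, T, C):
            problems.append(
                f"{name} has shape {shapes.get(name)}, expected {(B, T, C)}"
            )

    # Shape of u inferred from `.view(H, C//H)` in the traceback.
    if divisible and shapes.get("u") != (H, C // H):
        problems.append(
            f"u has shape {shapes.get('u')}, expected {(H, C // H)}"
        )

    return problems
```

It could be called just before `RUN_CUDA_RWKV6`, passing something like `{"r": tuple(r.shape), "k": tuple(k.shape), ...}`. An empty result does not rule out other causes, such as non-contiguous tensors or a sequence length the kernel was not compiled for.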
