CUDA out of memory #14

Open
xczhusuda opened this issue Oct 29, 2024 · 0 comments
xczhusuda commented Oct 29, 2024

Hello, we are running into GPU out-of-memory errors when training with the SimAM_ResNet34_ASP family of speaker models.
Our settings are as follows:
```
batch_size: 2
spk_model: SimAM_ResNet34_ASP
spk_model_init: ./wespeaker_models/voxblink2_samresnet34_ft/avg_model.pt
tse_model: BSRNN
```
After about 1600 training steps, the following error is raised:
File "..../speech_separation/tse/wesep/wesep/models/bsrnn.py", line 41, in forward
rnn_output, _ = self.rnn(self.norm(input).transpose(1, 2).contiguous())
File "/opt/conda/envs/py310torch201/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/py310torch201/lib/python3.10/site-packages/torch/nn/modules/rnn.py", line 812, in forward
result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 608.00 MiB (GPU 1; 23.69 GiB total capacity; 21.86 GiB already allocated; 334.94 MiB free; 22.95 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 173992 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 173994 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 173995 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 173996 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 173997 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 173998 closing signal SIGTERM
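The error message suggests setting max_split_size_mb through PYTORCH_CUDA_ALLOC_CONF to reduce fragmentation. For reference, this is a minimal sketch of how that option can be set (the 128 MiB value is an arbitrary example, not a tuned or verified recommendation):

```python
import os

# The caching allocator reads PYTORCH_CUDA_ALLOC_CONF when it is first
# initialised, so this must run before the first CUDA allocation.
# 128 MiB is an arbitrary example value, not a recommendation.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported after the env var so the allocator picks it up

print(torch.cuda.is_available())
```

Equivalently, the variable can be exported in the shell before launching the training script.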

We still hit the same error even after setting batch_size to 1 and calling torch.cuda.empty_cache().
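For clarity, this is roughly what that attempt looked like (a simplified, self-contained sketch, not the actual wesep training loop; the LSTM and random data below are dummies standing in for BSRNN and our real batches):

```python
import torch
import torch.nn as nn

# Dummy stand-ins for the real model and data, only to illustrate where
# torch.cuda.empty_cache() was called with batch_size = 1.
model = nn.LSTM(input_size=257, hidden_size=256, batch_first=True).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(3):
    x = torch.randn(1, 400, 257, device="cuda")  # batch_size = 1
    optimizer.zero_grad()
    out, _ = model(x)
    loss = out.pow(2).mean()
    loss.backward()
    optimizer.step()
    # Releases cached allocator blocks, but not memory still held by live
    # tensors (activations, optimizer state), so the OOM can still occur.
    torch.cuda.empty_cache()
```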

Why does this happen?
