Thank you for the report. I don't have first-hand experience with complex-valued training combined with distributed, but from looking at pytorch/pytorch#55375 it does look like there is an upstream issue with DDP not behaving correctly with complex-valued tensors.
I debugged the code. In fact, DDP does not seem to turn anything into view_as_real tensors as in pytorch/pytorch#55375; all tensors and model parameters keep dtype torch.complex64. But I don't know what happens in the distributed setup that causes the divergence.
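For reference, the dtype check can be done from inside the LightningModule roughly like this (a minimal sketch; the class name is a placeholder, not the actual model from my project):

```python
import torch
import lightning as L


class ComplexNet(L.LightningModule):  # placeholder name, not the real model
    def on_after_backward(self):
        # Runs after every backward pass: verify that DDP has not silently
        # replaced complex parameters or gradients with real view_as_real views
        # (those would show up as float32 with a trailing dimension of size 2).
        for name, p in self.named_parameters():
            assert p.dtype == torch.complex64, f"{name} has dtype {p.dtype}"
            if p.grad is not None:
                assert p.grad.dtype == torch.complex64, (
                    f"grad of {name} has dtype {p.grad.dtype}"
                )
```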
Bug description
Hello,
I am using Lightning to train a complex-valued neural network with complex-valued tensors. When I train on a single GPU there is no issue, but when I train on multiple GPUs with DDP, training diverges. If I train on only one GPU while still declaring strategy='ddp' in the Trainer, training also diverges.
I've tried to reproduce the issue with the code sample below. The MNIST dataset and the model defined in this sample are simpler than in my actual work, so the model doesn't fully diverge, but it really struggles to converge. To check whether the issue occurs, just comment out the strategy='ddp' line in the Trainer.
This seems to be related to #55375 and #60931
What version are you seeing the problem on?
v2.4
How to reproduce the bug
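A minimal sketch of the setup described above: a small complex-valued network trained on MNIST with strategy='ddp' in the Trainer. Layer sizes, class names, and hyperparameters are illustrative rather than my exact script.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import lightning as L
from torch.utils.data import DataLoader
from torchvision import datasets, transforms


class ComplexMNISTModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        # Complex-valued weights: nn.Linear accepts dtype=torch.complex64.
        self.fc1 = nn.Linear(28 * 28, 128, dtype=torch.complex64)
        self.fc2 = nn.Linear(128, 10, dtype=torch.complex64)

    def forward(self, x):
        x = x.flatten(1).to(torch.complex64)
        h = self.fc1(x)
        # Split ReLU applied to the real and imaginary parts separately.
        h = torch.complex(F.relu(h.real), F.relu(h.imag))
        out = self.fc2(h)
        # Take the modulus to get real-valued logits for cross-entropy.
        return out.abs()

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    dataset = datasets.MNIST("data", train=True, download=True,
                             transform=transforms.ToTensor())
    loader = DataLoader(dataset, batch_size=64, num_workers=2)
    trainer = L.Trainer(
        max_epochs=3,
        accelerator="gpu",
        devices=1,
        strategy="ddp",  # comment out this line and training behaves normally
    )
    trainer.fit(ComplexMNISTModel(), loader)
```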
Error messages and logs
No response
Environment
More info
@jeremyfix @QuentinGABOT might also be interested in this issue