Model Parallel + DDP #6133
Unanswered
ferrine asked this question in code help: CV
Replies: 1 comment
Hi, in my research code I have a huge dense classification layer with millions of classes, and it does not fit in memory. Best practice suggests using a model-parallel dense layer, where the matrix multiplication is split across GPUs; this has a significant performance impact, since the huge dense layer is the bottleneck.
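For context, here is a minimal sketch of what I mean by splitting the matrix multiplication across GPUs. This is not my actual code; the names (`ShardedClassifier`, `hidden_dim`, `num_classes`) and initialization details are just illustrative, and it assumes `torch.distributed` is already initialized.

```python
import torch
import torch.nn as nn
import torch.distributed as dist


class ShardedClassifier(nn.Module):
    """Dense classifier whose weight is sharded across GPUs along the class
    dimension, so each rank only stores and multiplies its own slice of the
    huge [num_classes, hidden_dim] matrix."""

    def __init__(self, hidden_dim: int, num_classes: int):
        super().__init__()
        world = dist.get_world_size()
        assert num_classes % world == 0, "pad num_classes for simplicity"
        self.local_classes = num_classes // world
        self.class_start = dist.get_rank() * self.local_classes
        # Each rank owns only its contiguous slice of the classes.
        self.weight = nn.Parameter(
            torch.empty(self.local_classes, hidden_dim).normal_(std=0.02)
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # [batch, hidden] @ [hidden, local_classes] -> partial logits covering
        # only this rank's slice of the classes.
        return features @ self.weight.t()
```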
So far I have implemented a model-parallel loss that distributes the computation across workers using the nccl/gloo backends and also propagates gradients correctly. In this layer each GPU holds different parameters. Unfortunately, this does not work well with the PyTorch DDP distributed-backend plugin, since it aggregates gradients in the backward pass under the assumption that all ranks hold identical parameters.
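A condensed sketch of the loss part, again with illustrative names (`sharded_cross_entropy`, `_AllReduceSum`) rather than my exact code: it assumes every rank evaluates the same feature batch (e.g. gathered beforehand) and combines the softmax statistics across the class shards with a sum all-reduce that passes gradients straight through.

```python
import torch
import torch.distributed as dist


class _AllReduceSum(torch.autograd.Function):
    """Differentiable sum across ranks. Backward just forwards the incoming
    gradient, which is valid because every rank computes the identical loss
    from the reduced tensors."""

    @staticmethod
    def forward(ctx, x):
        x = x.clone()
        dist.all_reduce(x, op=dist.ReduceOp.SUM)
        return x

    @staticmethod
    def backward(ctx, grad):
        return grad


def sharded_cross_entropy(partial_logits, targets, local_classes):
    """partial_logits: [B, local_classes] from this rank's weight shard.
    targets: [B] global class indices, identical on every rank."""
    class_start = dist.get_rank() * local_classes

    # Global max over all shards, detached: only used for numerical stability.
    max_logit = partial_logits.max(dim=1).values.detach()
    dist.all_reduce(max_logit, op=dist.ReduceOp.MAX)

    # Global softmax denominator, summed over every rank's class slice.
    exp_logits = (partial_logits - max_logit[:, None]).exp()
    denom = _AllReduceSum.apply(exp_logits.sum(dim=1))

    # Only the rank that owns the target class contributes its logit.
    local_idx = (targets - class_start).clamp(0, local_classes - 1)
    owned = (targets >= class_start) & (targets < class_start + local_classes)
    target_logit = partial_logits.gather(1, local_idx[:, None]).squeeze(1)
    target_logit = _AllReduceSum.apply(target_logit * owned.to(partial_logits.dtype))

    # -log softmax(target), with the detached max folded back in.
    return (denom.log() - (target_logit - max_logit)).mean()
```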
What would you suggest as a workaround for this problem, if there is one?