Hi, I was trying to run the avit_L configuration on 8 40 GB GPUs in parallel and hit a CUDA out-of-memory error. Have you ever tried training with more GPUs that each have less memory, and did it work? I believe the paper used 8 80 GB GPUs to train the model.
It's definitely possible, but model parallelism is still a lot of work in torch at this point and generally needs to be tuned to the specific cluster topology to get good performance, so I wouldn't recommend it at the current model sizes. Pipeline parallelism (https://github.com/pytorch/PiPPy) is probably the easiest to get up and running if you really want to go down that path.
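For illustration only, here's a rough sketch of the pipeline-parallel idea using torch's built-in GPipe-style `Pipe` wrapper (`torch.distributed.pipeline.sync`, present in torch 1.8 through the 2.x line before its removal) rather than PiPPy's own trace-based API. The two stages and their sizes are placeholders, not this repo's model:

```python
# Pipeline-parallelism sketch with torch's built-in Pipe (not PiPPy's API).
# Placeholder two-stage model split across GPUs 0 and 1.
import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

# Pipe requires the RPC framework to be initialized, even on a single process.
rpc.init_rpc("worker", rank=0, world_size=1)

stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).cuda(0)
stage1 = nn.Sequential(nn.Linear(4096, 1024)).cuda(1)
model = Pipe(nn.Sequential(stage0, stage1), chunks=8)  # 8 micro-batches per step

x = torch.randn(64, 1024, device=0)   # input lives on the first stage's device
out = model(x).local_value()          # forward returns an RRef; unwrap it
out.sum().backward()                  # backward flows across both devices
```

The `chunks` argument controls how many micro-batches each batch is split into, which is what keeps both GPUs busy instead of idling while the other stage runs.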
On the data parallel side, AMP and gradient checkpointing are both implemented in the repository and can be enabled by config. AMP can become unstable on a few datasets, but it gives both speedups and memory savings. Gradient checkpointing is the biggest memory saver, but it effectively trades extra compute for memory, so your jobs will be much slower. FSDP isn't implemented here, but it's also pretty straightforward (only a few lines in recent versions of torch); performance there will depend on your node connectivity.
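For the FSDP route, a minimal sketch of those "few lines" might look like the following (PyTorch >= 1.12). This is not this repo's code: the toy model, data, and hyperparameters are placeholders, and you'd launch it with something like `torchrun --nproc_per_node=8 fsdp_sketch.py`:

```python
# Minimal FSDP sketch: shard parameters, gradients, and optimizer state
# across all ranks so the full model never has to fit on one GPU.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")              # torchrun sets RANK/WORLD_SIZE
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Sequential(                 # stand-in for the actual model
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
)
model = FSDP(model, device_id=local_rank)    # wrap; sharding happens here
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)  # build optimizer AFTER wrapping

for _ in range(10):                          # placeholder training loop
    x = torch.randn(32, 1024, device=local_rank)
    loss = model(x).pow(2).mean()            # dummy loss
    opt.zero_grad()
    loss.backward()
    opt.step()

dist.destroy_process_group()
```

Note the optimizer is constructed after the FSDP wrap so it references the sharded parameters; for a real model you'd usually also pass an auto-wrap policy so individual transformer blocks are sharded separately.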