Hi, I was trying to run the avit_L configuration on 8 40 GB GPUs in parallel and hit a CUDA out-of-memory error. Have you ever tried training with more GPUs that each have less memory, and did it work? I believe the paper used 8 80 GB GPUs to train the model.
It's definitely possible, but model parallelism is still a lot of work in torch at this point and generally needs to be tuned to the specific cluster topology to get good performance, so I wouldn't recommend it at the current model sizes. Pipeline parallelism (https://github.com/pytorch/PiPPy) is probably the easiest to get up and running if you really want to go down that path.
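For illustration only, here's a rough sketch of the pipeline-parallel idea using torch's built-in GPipe-style `Pipe` wrapper (`torch.distributed.pipeline.sync`, present in torch 1.8 through the 2.x line before its removal) rather than PiPPy's own trace-based API. The two stages and their sizes are placeholders, not this repo's model:

```python
# Pipeline-parallelism sketch with torch's built-in Pipe (not PiPPy's API).
# Placeholder two-stage model split across GPUs 0 and 1.
import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

# Pipe requires the RPC framework to be initialized, even on a single process.
rpc.init_rpc("worker", rank=0, world_size=1)

stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).cuda(0)
stage1 = nn.Sequential(nn.Linear(4096, 1024)).cuda(1)
model = Pipe(nn.Sequential(stage0, stage1), chunks=8)  # 8 micro-batches per step

x = torch.randn(64, 1024, device=0)   # input lives on the first stage's device
out = model(x).local_value()          # forward returns an RRef; unwrap it
out.sum().backward()                  # backward flows across both devices
```

The `chunks` argument controls how many micro-batches each batch is split into, which is what keeps both GPUs busy instead of idling while the other stage runs.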
On the data parallel side, AMP and gradient checkpointing are both implemented in the repository and can be enabled by config. AMP can become unstable on a few datasets, but it gives both speedups and memory savings. Gradient checkpointing is the biggest memory saver, but it effectively trades extra compute for memory, so your jobs will be much slower. FSDP isn't implemented here, but it's also pretty straightforward (only a few lines in recent versions of torch); performance there will depend on your node connectivity.
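For the FSDP route, a minimal sketch of those "few lines" might look like the following (PyTorch >= 1.12). This is not this repo's code: the toy model, data, and hyperparameters are placeholders, and you'd launch it with something like `torchrun --nproc_per_node=8 fsdp_sketch.py`:

```python
# Minimal FSDP sketch: shard parameters, gradients, and optimizer state
# across all ranks so the full model never has to fit on one GPU.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")              # torchrun sets RANK/WORLD_SIZE
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Sequential(                 # stand-in for the actual model
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
)
model = FSDP(model, device_id=local_rank)    # wrap; sharding happens here
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)  # build optimizer AFTER wrapping

for _ in range(10):                          # placeholder training loop
    x = torch.randn(32, 1024, device=local_rank)
    loss = model(x).pow(2).mean()            # dummy loss
    opt.zero_grad()
    loss.backward()
    opt.step()

dist.destroy_process_group()
```

Note the optimizer is constructed after the FSDP wrap so it references the sharded parameters; for a real model you'd usually also pass an auto-wrap policy so individual transformer blocks are sharded separately.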