Running avit_L with GPU parallel #2

Open
jsong2333333 opened this issue May 14, 2024 · 2 comments

Comments

@jsong2333333

Hi, I tried to run the avit_L configuration on 8 40GB GPUs in parallel and hit a CUDA out-of-memory error. I believe the paper used 8 80GB GPUs to train the model, so I'm wondering whether you've ever tried parallelizing across more GPUs with less memory each, and whether it worked.

@mikemccabe210
Contributor

It's definitely possible, but model parallelism in torch is still a lot of work at this point and generally needs to be tuned to the specific cluster topology to get good performance, so I wouldn't recommend it for the current model sizes. Pipeline parallelism (https://github.com/pytorch/PiPPy) is probably the easiest to get up and running if you really want to go down that path.

On the data parallel side, AMP and gradient checkpointing are both implemented in the repository and can be enabled by config. AMP can become unstable on a few datasets, but it gives speedups and memory savings. Gradient checkpointing is the biggest memory saver, but it trades extra compute for that memory (activations are recomputed during the backward pass), so your jobs will be much slower. FSDP isn't implemented here, but it's also pretty straightforward to add (only a few lines in recent versions of torch). Performance there will depend on your node connectivity.
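
For anyone who wants to see what those two options look like in plain torch, here is a minimal sketch. It does not use the repository's config keys or model classes: `Block`, `TinyModel`, and `train_step` are hypothetical stand-ins, and the sizes are arbitrary. It assumes a torch version that supports `torch.autocast` and the `use_reentrant` argument of `torch.utils.checkpoint.checkpoint`. AMP is only switched on when a CUDA device is available; on CPU the step runs in full precision.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):
    """Stand-in residual MLP block (hypothetical, not the repository's model)."""

    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        return x + self.net(x)


class TinyModel(nn.Module):
    """Stack of blocks with optional per-block gradient checkpointing."""

    def __init__(self, dim=64, depth=4, use_checkpoint=True):
        super().__init__()
        self.blocks = nn.ModuleList([Block(dim) for _ in range(depth)])
        self.use_checkpoint = use_checkpoint

    def forward(self, x):
        for block in self.blocks:
            if self.use_checkpoint and self.training:
                # Intermediate activations inside the block are not kept;
                # they are recomputed during the backward pass.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x


def train_step(model, optimizer, scaler, x, y, use_amp):
    """One optimizer step with optional mixed precision."""
    model.train()
    optimizer.zero_grad(set_to_none=True)
    device_type = x.device.type
    amp_dtype = torch.float16 if device_type == "cuda" else torch.bfloat16
    with torch.autocast(device_type=device_type, dtype=amp_dtype, enabled=use_amp):
        pred = model(x)
    # Compute the loss in float32 regardless of the autocast dtype.
    loss = F.mse_loss(pred.float(), y)
    # With the scaler disabled these calls reduce to backward() and step().
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    use_amp = device.type == "cuda"  # assumption: only use AMP on GPU
    model = TinyModel().to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
    x = torch.randn(8, 64, device=device)
    y = torch.randn(8, 64, device=device)
    print(train_step(model, optimizer, scaler, x, y, use_amp))
```

Checkpointing whole blocks, as above, is a common granularity: memory for the activations inside each block is freed after the forward pass, at the cost of roughly one extra forward pass per block during backward. The gradients should match the non-checkpointed run up to floating-point noise.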

@jsong2333333
Author

Thank you for the suggestions!
