
OOM error when training VL-7B on multiple GPUs; VRAM should be sufficient — does this training method put a full copy of the model on each GPU? #8

Open
weilanzhikong opened this issue Oct 9, 2024 · 1 comment

Comments

@weilanzhikong

[rank1]: Traceback (most recent call last):
[rank1]: File "/storage/garlin/deep_learning/finetune-Qwen2-VL/finetune_distributed.py", line 200, in
[rank1]: train()
[rank1]: File "/storage/garlin/deep_learning/finetune-Qwen2-VL/finetune_distributed.py", line 182, in train
[rank1]: optimizer.step()
[rank1]: File "/storage/garlin/.env/qwen_vl/lib/python3.10/site-packages/accelerate/optimizer.py", line 172, in step
[rank1]: self.optimizer.step(closure)
[rank1]: File "/storage/garlin/.env/qwen_vl/lib/python3.10/site-packages/torch/optim/optimizer.py", line 484, in wrapper
[rank1]: out = func(*args, **kwargs)
[rank1]: File "/storage/garlin/.env/qwen_vl/lib/python3.10/site-packages/torch/optim/optimizer.py", line 89, in _use_grad
[rank1]: ret = func(self, *args, **kwargs)
[rank1]: File "/storage/garlin/.env/qwen_vl/lib/python3.10/site-packages/torch/optim/adamw.py", line 216, in step
[rank1]: has_complex = self._init_group(
[rank1]: File "/storage/garlin/.env/qwen_vl/lib/python3.10/site-packages/torch/optim/adamw.py", line 155, in _init_group
[rank1]: state["exp_avg"] = torch.zeros_like(
[rank1]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.02 GiB. GPU 1 has a total capacity of 79.33 GiB of which 923.69 MiB is free. Including non-PyTorch memory, this process has 78.41 GiB memory in use. Of the allocated memory 76.26 GiB is allocated by PyTorch, and 951.92 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
W1008 10:43:49.390000 140063144896320 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 587770 closing signal SIGTERM
W1008 10:43:49.390000 140063144896320 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 587771 closing signal SIGTERM

@zhangfaen
Owner

It uses the Hugging Face https://github.com/huggingface/accelerate library for distributed training. You can read its documentation to see how to revise the code further to get FSDP training. This repo is mainly for educational purposes, so it just uses the simplest distributed training setup that accelerate provides, which replicates the full model (plus gradients and optimizer states) on every GPU.
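
For reference, here is a minimal sketch (not part of this repo; names and versions are assumptions) of how one might enable FSDP through accelerate's `FullyShardedDataParallelPlugin`, so that parameters, gradients, and optimizer states are sharded across GPUs instead of replicated on each one:

```python
# Minimal sketch, assuming a recent accelerate + PyTorch; adapt to
# finetune_distributed.py as needed -- this is illustrative, not the repo's code.
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp.fully_sharded_data_parallel import (
    FullOptimStateDictConfig,
    FullStateDictConfig,
)

# Shard model state across ranks (FULL_SHARD is the default strategy) and only
# gather full state dicts on rank 0 when saving checkpoints.
fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=True),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=True),
)

accelerator = Accelerator(fsdp_plugin=fsdp_plugin)

# Note: with FSDP it is generally recommended to prepare the model (or the model
# and optimizer together) via accelerator.prepare() before the training loop:
# model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
```

Alternatively, the same behavior can be configured interactively with `accelerate config` and then run with `accelerate launch`.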
