[rank1]: Traceback (most recent call last):
[rank1]: File "/storage/garlin/deep_learning/finetune-Qwen2-VL/finetune_distributed.py", line 200, in
[rank1]: train()
[rank1]: File "/storage/garlin/deep_learning/finetune-Qwen2-VL/finetune_distributed.py", line 182, in train
[rank1]: optimizer.step()
[rank1]: File "/storage/garlin/.env/qwen_vl/lib/python3.10/site-packages/accelerate/optimizer.py", line 172, in step
[rank1]: self.optimizer.step(closure)
[rank1]: File "/storage/garlin/.env/qwen_vl/lib/python3.10/site-packages/torch/optim/optimizer.py", line 484, in wrapper
[rank1]: out = func(*args, **kwargs)
[rank1]: File "/storage/garlin/.env/qwen_vl/lib/python3.10/site-packages/torch/optim/optimizer.py", line 89, in _use_grad
[rank1]: ret = func(self, *args, **kwargs)
[rank1]: File "/storage/garlin/.env/qwen_vl/lib/python3.10/site-packages/torch/optim/adamw.py", line 216, in step
[rank1]: has_complex = self._init_group(
[rank1]: File "/storage/garlin/.env/qwen_vl/lib/python3.10/site-packages/torch/optim/adamw.py", line 155, in _init_group
[rank1]: state["exp_avg"] = torch.zeros_like(
[rank1]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.02 GiB. GPU 1 has a total capacity of 79.33 GiB of which 923.69 MiB is free. Including non-PyTorch memory, this process has 78.41 GiB memory in use. Of the allocated memory 76.26 GiB is allocated by PyTorch, and 951.92 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
W1008 10:43:49.390000 140063144896320 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 587770 closing signal SIGTERM
W1008 10:43:49.390000 140063144896320 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 587771 closing signal SIGTERM
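The error message itself points at one mitigation: setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to reduce allocator fragmentation. A minimal sketch of where that could go, assuming it is set near the top of finetune_distributed.py before torch is imported (this is only a stopgap; it cannot recover memory that the model and optimizer state genuinely need):

```python
# Hypothetical stopgap, not part of the repo: apply the allocator hint from the
# OOM message before torch initializes CUDA. This mitigates fragmentation of
# reserved-but-unallocated memory; it does not shrink model or optimizer state.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # import torch only after the environment variable is in place
```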
It uses the Hugging Face accelerate library (https://github.com/huggingface/accelerate) for distributed training. You may read it to see how to revise the code further to enable FSDP training. This repo is mainly for educational purposes, so it just uses the simplest distributed training setup provided by accelerate.
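A minimal sketch of how FSDP could be enabled through accelerate, assuming the FullyShardedDataParallelPlugin path (the model and optimizer below are placeholders, not the repo's actual Qwen2-VL setup); with FULL_SHARD, the AdamW exp_avg/exp_avg_sq state that runs out of memory above is sharded across GPUs instead of kept whole on each rank:

```python
# Hypothetical sketch, not the repo's code: build an Accelerator with an FSDP
# plugin so parameters, gradients, and optimizer state are sharded across ranks.
import torch
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp import ShardingStrategy


def build_accelerator() -> Accelerator:
    fsdp_plugin = FullyShardedDataParallelPlugin(
        # FULL_SHARD shards parameters, gradients, and optimizer state
        # (ZeRO-3 style), which relieves the per-GPU AdamW memory pressure.
        sharding_strategy=ShardingStrategy.FULL_SHARD,
    )
    return Accelerator(fsdp_plugin=fsdp_plugin)


if __name__ == "__main__":
    accelerator = build_accelerator()
    # Placeholder model/optimizer; finetune_distributed.py would use the
    # Qwen2-VL model and its AdamW optimizer here instead.
    model = torch.nn.Linear(1024, 1024)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    model, optimizer = accelerator.prepare(model, optimizer)
```

When launched across multiple GPUs (for example with the existing torchrun command or accelerate launch), prepare() then wraps the model in FSDP before training starts.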