
OOM error when training VL-7B on multiple GPUs; VRAM should be sufficient — does this training method put a full copy of the model on each GPU? #8

Open
weilanzhikong opened this issue Oct 9, 2024 · 1 comment

Comments

@weilanzhikong

[rank1]: Traceback (most recent call last):
[rank1]: File "/storage/garlin/deep_learning/finetune-Qwen2-VL/finetune_distributed.py", line 200, in
[rank1]: train()
[rank1]: File "/storage/garlin/deep_learning/finetune-Qwen2-VL/finetune_distributed.py", line 182, in train
[rank1]: optimizer.step()
[rank1]: File "/storage/garlin/.env/qwen_vl/lib/python3.10/site-packages/accelerate/optimizer.py", line 172, in step
[rank1]: self.optimizer.step(closure)
[rank1]: File "/storage/garlin/.env/qwen_vl/lib/python3.10/site-packages/torch/optim/optimizer.py", line 484, in wrapper
[rank1]: out = func(*args, **kwargs)
[rank1]: File "/storage/garlin/.env/qwen_vl/lib/python3.10/site-packages/torch/optim/optimizer.py", line 89, in _use_grad
[rank1]: ret = func(self, *args, **kwargs)
[rank1]: File "/storage/garlin/.env/qwen_vl/lib/python3.10/site-packages/torch/optim/adamw.py", line 216, in step
[rank1]: has_complex = self._init_group(
[rank1]: File "/storage/garlin/.env/qwen_vl/lib/python3.10/site-packages/torch/optim/adamw.py", line 155, in _init_group
[rank1]: state["exp_avg"] = torch.zeros_like(
[rank1]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.02 GiB. GPU 1 has a total capacity of 79.33 GiB of which 923.69 MiB is free. Including non-PyTorch memory, this process has 78.41 GiB memory in use. Of the allocated memory 76.26 GiB is allocated by PyTorch, and 951.92 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
W1008 10:43:49.390000 140063144896320 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 587770 closing signal SIGTERM
W1008 10:43:49.390000 140063144896320 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 587771 closing signal SIGTERM

@zhangfaen
Owner

It uses the Hugging Face https://github.com/huggingface/accelerate library for distributed training. You can read its documentation to see how to revise the code further to get FSDP training. This repo is mainly for educational purposes, so it just uses the simplest distributed training setup that accelerate provides, which replicates the full model (plus gradients and optimizer states) on every GPU.
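
For reference, here is a minimal sketch (not part of this repo; names and versions are assumptions) of how one might enable FSDP through accelerate's `FullyShardedDataParallelPlugin`, so that parameters, gradients, and optimizer states are sharded across GPUs instead of replicated on each one:

```python
# Minimal sketch, assuming a recent accelerate + PyTorch; adapt to
# finetune_distributed.py as needed -- this is illustrative, not the repo's code.
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp.fully_sharded_data_parallel import (
    FullOptimStateDictConfig,
    FullStateDictConfig,
)

# Shard model state across ranks (FULL_SHARD is the default strategy) and only
# gather full state dicts on rank 0 when saving checkpoints.
fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=True),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=True),
)

accelerator = Accelerator(fsdp_plugin=fsdp_plugin)

# Note: with FSDP it is generally recommended to prepare the model (or the model
# and optimizer together) via accelerator.prepare() before the training loop:
# model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
```

Alternatively, the same behavior can be configured interactively with `accelerate config` and then run with `accelerate launch`.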
