Out of memory using the default training configuration #28

Open
JacobYuan7 opened this issue Sep 24, 2024 · 6 comments
Comments

@JacobYuan7

Hi, many thanks for your great work.

I am trying to train with the default script, and I find that training runs out of memory even with batch_size=1. I am wondering what might be causing the problem. I'd appreciate any suggestions.


[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/workspace/xxxx/Lumina-mGPT-main/lumina_mgpt/finetune_solver.py", line 114, in <module>
[rank0]:     solver.run()
[rank0]:   File "/mnt/workspace/xxxx/Lumina-mGPT-main/xllmx/solvers/finetune/finetune.py", line 518, in run
[rank0]:     train_stats = self.train_one_epoch(
[rank0]:   File "/mnt/workspace/xxxx/Lumina-mGPT-main/xllmx/solvers/finetune/finetune.py", line 620, in train_one_epoch
[rank0]:     self.optimizer.step()
[rank0]:   File "/mnt/workspace/xxxx/conda-envs/lumina-mgpt-5/lib/python3.10/site-packages/torch/optim/optimizer.py", line 391, in wrapper
[rank0]:     out = func(*args, **kwargs)
[rank0]:   File "/mnt/workspace/xxxx/conda-envs/lumina-mgpt-5/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
[rank0]:     ret = func(self, *args, **kwargs)
[rank0]:   File "/mnt/workspace/xxxx/conda-envs/lumina-mgpt-5/lib/python3.10/site-packages/torch/optim/adamw.py", line 177, in step
[rank0]:     has_complex = self._init_group(
[rank0]:   File "/mnt/workspace/xxxx/conda-envs/lumina-mgpt-5/lib/python3.10/site-packages/torch/optim/adamw.py", line 128, in _init_group
[rank0]:     state["exp_avg_sq"] = torch.zeros_like(
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 
exp name: 7B-8

@xiexing0916

I ran into the same problem as well. It seems that 48 GB of GPU memory is not enough for the default full training. Do you have a solution for this?

@ChrisLiu6
Contributor

Hi, thank you for your interest in our work! Could you tell me the type and number of GPUs you are using? Since we use FSDP during training, adding more GPUs will still lower the per-GPU memory requirement, even when the batch size is set to 1.
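
For intuition, here is a rough back-of-envelope sketch of why sharding across more GPUs helps (the numbers are illustrative assumptions, not measurements from this repo): with full-precision AdamW, the parameters, gradients, and the two optimizer-state tensors alone already exceed a single 48 GB card for a 7B model, and FSDP's full sharding splits all of them across ranks.

    # Rough per-GPU memory estimate for fully fine-tuning a 7B-parameter model with AdamW.
    # Illustrative assumptions only: fp32 weights, gradients, and optimizer states
    # (exp_avg, exp_avg_sq); FSDP FULL_SHARD splits all three evenly across ranks;
    # activations and temporary buffers are ignored.
    def per_gpu_gib(n_params: float, n_gpus: int) -> float:
        bytes_per_param = 4 + 4 + 4 + 4  # weights + grads + exp_avg + exp_avg_sq
        return n_params * bytes_per_param / n_gpus / 1024**3

    for n_gpus in (1, 2, 4, 8):
        print(f"{n_gpus} GPU(s): ~{per_gpu_gib(7e9, n_gpus):.0f} GiB for parameters and optimizer states alone")
    # 1 GPU(s): ~104 GiB -> far beyond a 48 GB card, which matches the OOM at optimizer.step() above
    # 8 GPU(s): ~13 GiB  -> leaves headroom for activations and buffers

The exact figures depend on the mixed-precision and sharding settings in the config, but the scaling with the number of GPUs is the point being made here.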

@JacobYuan7
Author

> I ran into the same problem as well. It seems that 48 GB of GPU memory is not enough for the default full training. Do you have a solution for this?

Just run it on more graphics cards.

@JacobYuan7
Author

@ChrisLiu6 Many thanks for your feedback. I can run it by simply using more GPUs.

I have a follow-up question on the CFG technique. In the paper, you mentioned that "To make CFG work, during training, the context before is randomly dropped by a probability of 10%." May I ask where this is implemented in the codebase? Many thanks!

@ChrisLiu6
Contributor

> @ChrisLiu6 Many thanks for your feedback. I can run it by simply using more GPUs.
>
> I have a follow-up question on the CFG technique. In the paper, you mentioned that "To make CFG work, during training, the context before is randomly dropped by a probability of 10%." May I ask where this is implemented in the codebase? Many thanks!

The implementation of the context drop is absent from the released code because our implementation is highly data-format-dependent and needs modification when the data format is different.

As a reference implementation, you may add the following code after

labels = data_item["label"]

        if tokens[-2] == labels[-2] == 8196 and tokens.count(8196) == 1:  # image generation data (token 8196 occurs exactly once, right before the end)
            if random.random() < 0.1:  # drop the context with a probability of 10%; requires `import random`
                tokens = labels = [_ for _ in labels[:-1] if _ != -100]  # rebuild tokens and labels from the unmasked labels, dropping the final token and any -100 entries
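
For readers unfamiliar with why this drop is needed: classifier-free guidance at sampling time combines a conditional and an unconditional forward pass, so the model must occasionally have seen context-free examples during training. Below is a minimal, generic sketch of the logit combination; it is not the repository's actual sampler, and the guidance scale is an arbitrary placeholder.

    import torch

    def cfg_logits(cond_logits: torch.Tensor, uncond_logits: torch.Tensor, cfg_scale: float = 3.0) -> torch.Tensor:
        # Standard classifier-free guidance: push the prediction away from the
        # unconditional distribution and towards the conditional one. The
        # unconditional branch only behaves sensibly because roughly 10% of the
        # image-generation training samples had their context dropped, as above.
        return uncond_logits + cfg_scale * (cond_logits - uncond_logits)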

@JacobYuan7
Author

JacobYuan7 commented Oct 11, 2024

> The implementation of the context drop is absent from the released code because our implementation is highly data-format-dependent and needs modification when the data format is different.
>
> As a reference implementation, you may add the following code after
>
> labels = data_item["label"]
>
>         if tokens[-2] == labels[-2] == 8196 and tokens.count(8196) == 1:  # image generation data
>             if random.random() < 0.1:
>                 tokens = labels = [_ for _ in labels[:-1] if _ != -100]

@ChrisLiu6 Many thanks for your prompt feedback.

As I understand it, this random drop is quite important for CFG to work in image generation, whereas for Omnipotent SFT we should not apply it. Is my understanding of "our implementation is highly data-format-dependent" correct?

Btw, labels[:-1] drops the last token. Is this intentional or a mistake?
