Out of memory using the default training configuration #28

Open
JacobYuan7 opened this issue Sep 24, 2024 · 6 comments
Comments

@JacobYuan7

Hi, many thanks for your great work.

I am trying to train with the default script, and I find that training runs out of memory even with batch_size=1. I am wondering what might be causing the problem. I'd appreciate any suggestions.


[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/workspace/xxxx/Lumina-mGPT-main/lumina_mgpt/finetune_solver.py", line 114, in <module>
[rank0]:     solver.run()
[rank0]:   File "/mnt/workspace/xxxx/Lumina-mGPT-main/xllmx/solvers/finetune/finetune.py", line 518, in run
[rank0]:     train_stats = self.train_one_epoch(
[rank0]:   File "/mnt/workspace/xxxx/Lumina-mGPT-main/xllmx/solvers/finetune/finetune.py", line 620, in train_one_epoch
[rank0]:     self.optimizer.step()
[rank0]:   File "/mnt/workspace/xxxx/conda-envs/lumina-mgpt-5/lib/python3.10/site-packages/torch/optim/optimizer.py", line 391, in wrapper
[rank0]:     out = func(*args, **kwargs)
[rank0]:   File "/mnt/workspace/xxxx/conda-envs/lumina-mgpt-5/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
[rank0]:     ret = func(self, *args, **kwargs)
[rank0]:   File "/mnt/workspace/xxxx/conda-envs/lumina-mgpt-5/lib/python3.10/site-packages/torch/optim/adamw.py", line 177, in step
[rank0]:     has_complex = self._init_group(
[rank0]:   File "/mnt/workspace/xxxx/conda-envs/lumina-mgpt-5/lib/python3.10/site-packages/torch/optim/adamw.py", line 128, in _init_group
[rank0]:     state["exp_avg_sq"] = torch.zeros_like(
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 
exp name: 7B-8

@xiexing0916

I ran into the same problem as well. It seems that 48 GB of GPU memory is not enough for the default full training. Do you have a solution for this?

@ChrisLiu6
Contributor

Hi, thank you for your interest in our work! Could you tell me the type and number of GPUs you are using? Since we use FSDP during training, adding more GPUs will still lower the per-GPU memory requirement, even when the batch size is set to 1.
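
For intuition, here is a rough back-of-envelope sketch of why sharding across more GPUs helps (the numbers are illustrative assumptions, not measurements from this repo): with full-precision AdamW, the parameters, gradients, and the two optimizer-state tensors alone already exceed a single 48 GB card for a 7B model, and FSDP's full sharding splits all of them across ranks.

    # Rough per-GPU memory estimate for fully fine-tuning a 7B-parameter model with AdamW.
    # Illustrative assumptions only: fp32 weights, gradients, and optimizer states
    # (exp_avg, exp_avg_sq); FSDP FULL_SHARD splits all three evenly across ranks;
    # activations and temporary buffers are ignored.
    def per_gpu_gib(n_params: float, n_gpus: int) -> float:
        bytes_per_param = 4 + 4 + 4 + 4  # weights + grads + exp_avg + exp_avg_sq
        return n_params * bytes_per_param / n_gpus / 1024**3

    for n_gpus in (1, 2, 4, 8):
        print(f"{n_gpus} GPU(s): ~{per_gpu_gib(7e9, n_gpus):.0f} GiB for parameters and optimizer states alone")
    # 1 GPU(s): ~104 GiB -> far beyond a 48 GB card, which matches the OOM at optimizer.step() above
    # 8 GPU(s): ~13 GiB  -> leaves headroom for activations and buffers

The exact figures depend on the mixed-precision and sharding settings in the config, but the scaling with the number of GPUs is the point being made here.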

@JacobYuan7
Author

> I ran into the same problem as well. It seems that 48 GB of GPU memory is not enough for the default full training. Do you have a solution for this?

Just run it on more graphics cards.

@JacobYuan7
Author

@ChrisLiu6 Many thanks for your feedback. I can run it by simply using more GPUs.

I have a follow-up question on the CFG technique. In the paper, you mentioned that "To make CFG work, during training, the context before is randomly dropped by a probability of 10%." May I ask where this is implemented in the codebase? Many thanks!

@ChrisLiu6
Contributor

> @ChrisLiu6 Many thanks for your feedback. I can run it by simply using more GPUs.
>
> I have a follow-up question on the CFG technique. In the paper, you mentioned that "To make CFG work, during training, the context before is randomly dropped by a probability of 10%." May I ask where this is implemented in the codebase? Many thanks!

The implementation of the context drop is absent from the released code because our implementation is highly data-format-dependent and needs modification when the data format is different.

As a reference implementation, you may add the following code after

labels = data_item["label"]

        if tokens[-2] == labels[-2] == 8196 and tokens.count(8196) == 1:  # image generation data (token 8196 occurs exactly once, right before the end)
            if random.random() < 0.1:  # drop the context with a probability of 10%; requires `import random`
                tokens = labels = [_ for _ in labels[:-1] if _ != -100]  # rebuild tokens and labels from the unmasked labels, dropping the final token and any -100 entries
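
For readers unfamiliar with why this drop is needed: classifier-free guidance at sampling time combines a conditional and an unconditional forward pass, so the model must occasionally have seen context-free examples during training. Below is a minimal, generic sketch of the logit combination; it is not the repository's actual sampler, and the guidance scale is an arbitrary placeholder.

    import torch

    def cfg_logits(cond_logits: torch.Tensor, uncond_logits: torch.Tensor, cfg_scale: float = 3.0) -> torch.Tensor:
        # Standard classifier-free guidance: push the prediction away from the
        # unconditional distribution and towards the conditional one. The
        # unconditional branch only behaves sensibly because roughly 10% of the
        # image-generation training samples had their context dropped, as above.
        return uncond_logits + cfg_scale * (cond_logits - uncond_logits)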

@JacobYuan7
Author

JacobYuan7 commented Oct 11, 2024

> The implementation of the context drop is absent from the released code because our implementation is highly data-format-dependent and needs modification when the data format is different.
>
> As a reference implementation, you may add the following code after
>
> labels = data_item["label"]
>
>         if tokens[-2] == labels[-2] == 8196 and tokens.count(8196) == 1:  # image generation data
>             if random.random() < 0.1:
>                 tokens = labels = [_ for _ in labels[:-1] if _ != -100]

@ChrisLiu6 Many thanks for your prompt feedback.

As I understand it, this random drop is quite important for CFG to work in image generation, whereas for Omnipotent SFT we should not apply it. Is my understanding of "our implementation is highly data-format-dependent" correct?

Btw, labels[:-1] drops the last token. Is this intentional or a mistake?
