Out of memory using the default training configuration #28
Comments
I ran into the same problem as well. It seems that 48 GB of RAM is not enough for the default full training. Do you have a solution for this?
Hi, thank you for your interest in our work! Could you tell me the type and number of your GPUs? Since we use FSDP during training, adding more GPUs will still lower the per-GPU memory requirement even when the batch size is set to 1.
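To make the FSDP point concrete, here is a rough back-of-the-envelope sketch of why more GPUs help even at batch size 1. It assumes full sharding of parameters, gradients, and optimizer states, and the commonly cited figure of about 16 bytes per parameter for mixed-precision Adam; neither is confirmed for this repository's configuration. The 7e9 parameter count is a hypothetical placeholder, and activation memory is ignored.

```python
def per_gpu_model_state_gib(num_params, num_gpus, bytes_per_param=16):
    """Rough per-GPU memory (GiB) for model states under full sharding.

    bytes_per_param=16 is an assumption: fp16 weights (2) + fp16 gradients (2)
    + fp32 master weights, Adam momentum and variance (12).
    Activations, buffers, and fragmentation are NOT included.
    """
    total_bytes = num_params * bytes_per_param
    return total_bytes / num_gpus / 2**30


if __name__ == "__main__":
    hypothetical_params = 7e9  # placeholder model size, not taken from the thread
    for n in (1, 2, 4, 8):
        gib = per_gpu_model_state_gib(hypothetical_params, n)
        print(f"{n} GPU(s): ~{gib:.1f} GiB of model states per GPU")
```

Under these assumptions the model states alone exceed a 48 GB card on a single GPU regardless of batch size, while the per-GPU share shrinks roughly in proportion to the number of GPUs.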
Just run on more graphics cards.
@ChrisLiu6 Many thanks for your feedback. I can run it by simply using more GPUs. I have a follow-up question on the CFG technique. In the paper, you mentioned that "To make CFG work, during training, the context before is randomly dropped by a probability of 10%." May I know where this is implemented in the codebase? Many thanks!
The implementation of context drop is absent from the released code because our implementation is highly data-format-dependent and needs modification when the data format is different. As a reference implementation, you may add the following code after
# requires `import random` at the top of the file
if tokens[-2] == labels[-2] == 8196 and tokens.count(8196) == 1:  # image generation data
    if random.random() < 0.1:
        # keep only the positions that are not masked out of the loss (-100)
        tokens = labels = [_ for _ in labels[:-1] if _ != -100]
@ChrisLiu6 As I understand it, for image generation, this is quite important for the CFG to work. For Omnipotent SFT, we should not do this random drop. Is my understanding of "our implementation is highly data-format-dependent" correct? Btw, labels[:-1] drops the last token. Is this intentional or a mistake?
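For anyone who wants to try the reference snippet above outside the training loop, here is a minimal standalone sketch of it. The function name `maybe_drop_context`, the constant names, and the `rng` parameter are made up for illustration and are not part of the released code. It mirrors the snippet as posted, including `labels[:-1]`; whether dropping the last token is intended is not answered in this thread. It also assumes that `-100` marks context positions excluded from the loss, and it does not assume any particular meaning for token 8196.

```python
import random

IGNORE_INDEX = -100  # label value filtered out by the snippet (assumed: masked context)
MARKER_TOKEN = 8196  # token ID checked by the snippet; its meaning is not stated here


def maybe_drop_context(tokens, labels, drop_prob=0.1, rng=random):
    """Hypothetical wrapper around the reference snippet.

    For samples matching the snippet's "image generation data" check, replace
    both tokens and labels with the unmasked labels with probability drop_prob.
    Other samples are returned unchanged.
    """
    is_image_generation = (
        len(tokens) >= 2
        and len(labels) >= 2
        and tokens[-2] == labels[-2] == MARKER_TOKEN
        and tokens.count(MARKER_TOKEN) == 1
    )
    if is_image_generation and rng.random() < drop_prob:
        # Same expression as the snippet: drops the final token, then the masked ones.
        kept = [t for t in labels[:-1] if t != IGNORE_INDEX]
        return kept, list(kept)
    return tokens, labels


if __name__ == "__main__":
    toks = [1, 2, 3, 10, 11, 8196, 99]
    labs = [-100, -100, -100, 10, 11, 8196, 99]
    print(maybe_drop_context(toks, labs, drop_prob=1.0))
```

Setting `drop_prob=1.0` or `0.0` makes the behaviour deterministic, which is handy for checking the data format before enabling the 10% drop.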
Hi, many thanks for your great work.
I am trying to use the default script for training. I find that even if I use batch_size=1, training runs out of memory. I am wondering what might cause the problem. I'd appreciate any suggestions.