Potential bug on z-loss calculation #13

Open · leloykun opened this issue Aug 16, 2024 · 6 comments

@leloykun

The mask should be calculated using the shifted labels (labels shifted one token to the left), as in ChameleonModelForConditionalGeneration.forward.
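For reference, a minimal sketch of what the fix could look like (the function name is illustrative, not this repo's actual code), assuming HF-style labels where `logits[..., :-1, :]` predict `labels[..., 1:]` and ignored positions are marked with -100:

```python
import torch

def z_loss_mask(labels: torch.Tensor, ignore_index: int = -100) -> torch.Tensor:
    # Shift labels one token to the left so the mask aligns with the
    # next-token logits, exactly as the cross-entropy shift does.
    shift_labels = labels[..., 1:].contiguous()
    return shift_labels != ignore_index
```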

@ChrisLiu6 (Contributor)

Cool, you're right; we'll fix it.

@leloykun (Author)

Thanks! <3

I'm also currently working on adding native support in Transformers for image generation (image-only & interleaved image-text) with Chameleon & Anole here: huggingface/transformers#32013. I'll add native support for this project too when I'm done with these two ^^.

I'm curious: how much of an effect did adding the CFG & the z-loss have?

@ChrisLiu6 (Contributor)

> Thanks! <3
>
> I'm also currently working on adding native support in Transformers for image generation (image-only & interleaved image-text) with Chameleon & Anole here: huggingface/transformers#32013. I'll add native support for this project too when I'm done with these two ^^.
>
> I'm curious: how much of an effect did adding the CFG & the z-loss have?

Thank you so much for your support! Please feel free to reach out to us if you ever need any assistance.

Regarding CFG and z-loss:

The z-loss is extremely important for full finetuning. In our experience with the 7B model, without z-loss the training process collapses EVERY TIME after just a few hundred iterations. Note that we've tried printing the z-loss value without involving its gradient in training, and we find that when z-loss is included in training, its value typically stabilizes between 100 and 200. However, when it's not included, the value quickly escalates into the thousands, indicating a fundamental surge in the norm of the logits.
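For context, a minimal sketch of the z-loss in the PaLM / ST-MoE style; the coefficient and exact masking here are assumptions, not necessarily what this repo uses, and `labels` are assumed already shifted to align with `logits` (see the mask fix above):

```python
import torch

def z_loss(logits: torch.Tensor, labels: torch.Tensor,
           z_loss_coeff: float = 1e-5, ignore_index: int = -100) -> torch.Tensor:
    # log(Z): the log of the softmax partition function, per token.
    log_z = torch.logsumexp(logits.float(), dim=-1)
    mask = (labels != ignore_index).float()
    # Penalizing log(Z)^2 keeps the logit norm from drifting upward,
    # which is exactly the surge described above.
    return z_loss_coeff * (log_z.pow(2) * mask).sum() / mask.sum().clamp(min=1)
```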

As for CFG, though it is not strictly indispensable, its impact is still significant. While Lumina-mGPT can still produce high-quality images without CFG, using CFG significantly increases the probability of generating good samples. Additionally, it helps strike a better balance between content richness and structural coherence.
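For context, CFG on next-token logits typically amounts to running the model once with the condition and once without, then extrapolating between the two; a minimal sketch, where `guidance_scale` is an assumed knob (scale = 1 recovers the plain conditional logits):

```python
import torch

def cfg_logits(cond_logits: torch.Tensor, uncond_logits: torch.Tensor,
               guidance_scale: float = 3.0) -> torch.Tensor:
    # Push the distribution away from the unconditional one and toward
    # the conditional one; scale > 1 sharpens adherence to the condition.
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)
```

Higher scales push samples harder toward the condition at some cost to diversity, which matches the richness-vs-coherence trade-off mentioned above.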

@leloykun (Author)

> without z-loss the training process collapses EVERY TIME after just a few hundred iterations

Interesting...

In my finetuning runs, I've observed massive loss and gradient spikes at the beginning, but they eventually stabilized after just a few iterations:
[screenshots: loss and gradient curves from the finetuning runs]

But that might just be because I'm starting from a really small learning rate (1e-5) & using a large batch size (32 across 8 GPUs)? How about yours?

I'll run a couple of finetuning runs w/ my setup + the z-loss and report back 🫡

@ChrisLiu6 (Contributor)

[screenshot: training-loss curve over one epoch]

This is an experiment with lr=2e-5 and batch size 512 across 16 GPUs (with FSDP and checkpointing); one epoch takes 3692 iterations. We can see that the loss first drops and then rises. In some of the other experiments, the loss would reach inf. We also find that stability seems to be related to the data distribution and task difficulty. For example, when we finetune Chameleon on fixed 512x512 images (i.e., the resolution is consistent with Chameleon pretraining), the procedure tends to be more stable than training with variable-aspect-ratio images.

@leloykun (Author)

Oh wow, that looks cursed

As for my case, I used DeepSpeed for DDP & finetuned on 64x64 images (constant size) from an OOD dataset (with a lot of white background). I tried finetuning on the 512x512 version of the dataset, but my models ended up mode-collapsing 😅

Perhaps the z-loss is indeed the missing piece
