Potential bug on z-loss calculation #13
Comments
Cool, you are right, we'll fix it.
Thanks! <3 I'm also currently working on adding native support in Transformers for image generation (image-only & interleaved image-text) with Chameleon & Anole here: huggingface/transformers#32013. I'll add native support for this project too once I'm done with those two ^^. I'm curious: how much of an effect did adding the CFG & the z-loss have?
Thank you so much for your support! Please feel free to reach out to us if you ever need any assistance.

Regarding CFG and z-loss: the z-loss is extremely important for full finetuning. In our experience with the 7B model, without z-loss the training process collapses EVERY TIME after just a few hundred iterations. We have also tried printing the z-loss value without involving its gradient in training, and we find that when z-loss is included in training, its value typically stabilizes between 100 and 200. When it is not included, the value quickly escalates into the thousands, indicating a fundamental surge in the norm of the logits.

As for CFG, though it is not as indispensable, its impact is still significant. While Lumina-mGPT can still produce high-quality images without CFG, using CFG significantly increases the probability of generating good examples. Additionally, it helps achieve a better balance between content richness and structural coherence.
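For context, here is a minimal sketch of the kind of z-loss regularizer discussed above, assuming a standard causal-LM cross-entropy loss in PyTorch; the function name, the `z_loss_weight` value, and the variable names are illustrative rather than taken from the repository:

```python
# Minimal sketch: cross-entropy plus a PaLM-style z-loss term.
# The z-loss penalizes the squared log-partition logsumexp(logits), which
# keeps the logit norm from drifting upward during full finetuning.
import torch
import torch.nn.functional as F

def lm_loss_with_z_loss(logits, labels, z_loss_weight=1e-5, ignore_index=-100):
    # logits: (batch, seq_len, vocab); labels: (batch, seq_len)
    # Standard causal shift: position i is scored against labels[i + 1].
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()

    flat_logits = shift_logits.view(-1, shift_logits.size(-1))
    flat_labels = shift_labels.view(-1)

    ce = F.cross_entropy(flat_logits, flat_labels, ignore_index=ignore_index)

    # z-loss over the same (non-ignored) positions as the cross-entropy.
    valid = flat_labels != ignore_index
    log_z = torch.logsumexp(flat_logits[valid], dim=-1)
    z_loss = z_loss_weight * (log_z ** 2).mean()

    # Return the detached z-loss term as well so it can be logged and
    # monitored during training, as described above.
    return ce + z_loss, z_loss.detach()
```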
Oh wow, that looks cursed. As for my case, I used DeepSpeed to do DDP and finetuned on 64x64 images (constant size) from an OOD dataset (with a lot of white background). I tried finetuning on the 512x512 version of the dataset, but my models ended up mode-collapsing 😅 Perhaps the z-loss is indeed the missing piece.
In Lumina-mGPT/lumina_mgpt/model/modeling_xllmx_chameleon.py (line 50 at commit c8e180a), the mask should be calculated using the shifted labels (labels shifted one token to the left), as in ChameleonModelForConditionalGeneration.forward.
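A hedged sketch of the point being made, with illustrative names rather than the repository's: after the usual causal shift, the i-th logit row is scored against labels[i + 1], so any per-position mask applied to quantities computed from the logits (such as the z-loss) should be derived from the shifted labels, not the original ones.

```python
import torch

def z_loss(logits, labels, z_loss_weight=1e-5, ignore_index=-100):
    # logits: (batch, seq_len, vocab); labels: (batch, seq_len)
    shift_logits = logits[:, :-1, :]   # position i predicts token i + 1
    shift_labels = labels[:, 1:]       # labels aligned with shift_logits

    # Mask from the *shifted* labels, matching the cross-entropy mask.
    valid = shift_labels != ignore_index
    # An unshifted variant, e.g. valid = (labels != ignore_index)[:, :-1],
    # would keep or drop the wrong positions around ignored-label boundaries.

    log_z = torch.logsumexp(shift_logits[valid], dim=-1)
    return z_loss_weight * (log_z ** 2).mean()
```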