Potential bug on z-loss calculation #13
Comments
Cool, you are right, we'll fix it.
Thanks! <3 I'm also currently working on adding native support in Transformers for image generation (image-only & interleaved image-text) with Chameleon & Anole here: huggingface/transformers#32013. I'll add native support for this project too once I'm done with those two ^^. I'm curious: how much of an effect did adding the CFG & the z-loss have?
Thank you so much for your support! Please feel free to reach out to us if you ever need any assistance.

Regarding CFG and z-loss: the z-loss is extremely important for full finetuning. In our experience with the 7B model, without z-loss the training process collapses EVERY TIME after just a few hundred iterations. We have also tried printing the z-loss value without involving its gradient in training, and we find that when z-loss is included in training, its value typically stabilizes between 100 and 200. When it is not included, the value quickly escalates into the thousands, indicating a fundamental surge in the norm of the logits.

As for CFG, though it is not as indispensable, its impact is still significant. While Lumina-mGPT can still produce high-quality images without CFG, using CFG significantly increases the probability of generating good examples. Additionally, it helps achieve a better balance between content richness and structural coherence.
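For context, here is a minimal sketch of the kind of z-loss regularizer discussed above, assuming a standard causal-LM cross-entropy loss in PyTorch; the function name, the `z_loss_weight` value, and the variable names are illustrative rather than taken from the repository:

```python
# Minimal sketch: cross-entropy plus a PaLM-style z-loss term.
# The z-loss penalizes the squared log-partition logsumexp(logits), which
# keeps the logit norm from drifting upward during full finetuning.
import torch
import torch.nn.functional as F

def lm_loss_with_z_loss(logits, labels, z_loss_weight=1e-5, ignore_index=-100):
    # logits: (batch, seq_len, vocab); labels: (batch, seq_len)
    # Standard causal shift: position i is scored against labels[i + 1].
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()

    flat_logits = shift_logits.view(-1, shift_logits.size(-1))
    flat_labels = shift_labels.view(-1)

    ce = F.cross_entropy(flat_logits, flat_labels, ignore_index=ignore_index)

    # z-loss over the same (non-ignored) positions as the cross-entropy.
    valid = flat_labels != ignore_index
    log_z = torch.logsumexp(flat_logits[valid], dim=-1)
    z_loss = z_loss_weight * (log_z ** 2).mean()

    # Return the detached z-loss term as well so it can be logged and
    # monitored during training, as described above.
    return ce + z_loss, z_loss.detach()
```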
Oh wow, that looks cursed. As for my case, I used DeepSpeed to do DDP and finetuned on 64x64 images (constant size) from an OOD dataset (with a lot of white background). I tried finetuning on the 512x512 version of the dataset, but my models ended up mode-collapsing 😅 Perhaps the z-loss is indeed the missing piece.
In Lumina-mGPT/lumina_mgpt/model/modeling_xllmx_chameleon.py (line 50 at commit c8e180a), the mask should be calculated using the shifted labels (labels shifted one token to the left), as in ChameleonModelForConditionalGeneration.forward.
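A hedged sketch of the point being made, with illustrative names rather than the repository's: after the usual causal shift, the i-th logit row is scored against labels[i + 1], so any per-position mask applied to quantities computed from the logits (such as the z-loss) should be derived from the shifted labels, not the original ones.

```python
import torch

def z_loss(logits, labels, z_loss_weight=1e-5, ignore_index=-100):
    # logits: (batch, seq_len, vocab); labels: (batch, seq_len)
    shift_logits = logits[:, :-1, :]   # position i predicts token i + 1
    shift_labels = labels[:, 1:]       # labels aligned with shift_logits

    # Mask from the *shifted* labels, matching the cross-entropy mask.
    valid = shift_labels != ignore_index
    # An unshifted variant, e.g. valid = (labels != ignore_index)[:, :-1],
    # would keep or drop the wrong positions around ignored-label boundaries.

    log_z = torch.logsumexp(shift_logits[valid], dim=-1)
    return z_loss_weight * (log_z ** 2).mean()
```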