Does the performance of DiT improve as the VAE dimension increases? #3
Comments
Indeed.
This work is a meaningful exploration. I have a suggestion, feel free to critique. When the compression ratio remains unchanged, increasing the latent dimension from 16 to 64 might lead the VAE to exploit shortcuts instead of performing its intended task, because the reconstruction task becomes too easy in such cases. Perhaps methods should be devised to verify whether the VAE is truly learning meaningful representations.
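One standard way to run such a check (this is not something described in the thread or the repo; the probe setup, `vae.encode` API, and names below are illustrative assumptions) is linear probing: freeze the VAE, encode a labeled dataset, and train only a linear classifier on pooled latents. If a higher-dimensional latent mostly memorizes pixels, probe accuracy tends to stay flat even as reconstruction improves. A minimal sketch:

```python
import torch
import torch.nn as nn

# Hypothetical setup: `vae` is a frozen, pretrained f16 VAE whose encode()
# returns a (B, C, H/16, W/16) latent; `train_loader` yields (images, labels).
def linear_probe(vae, train_loader, num_classes, latent_dim, device="cuda"):
    vae.eval().to(device)
    head = nn.Linear(latent_dim, num_classes).to(device)
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        with torch.no_grad():
            z = vae.encode(images)        # frozen VAE, assumed API
        feats = z.mean(dim=(2, 3))        # global-average-pool to (B, C)
        loss = loss_fn(head(feats), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return head
```

Comparing probe accuracy of the d16 and d64 variants would give one signal of whether the extra channels carry semantic information or only reconstruction shortcuts.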
Great point, we'll consider it. Thanks~😄
It is expected that f16d32 and f16d64 have different scale factors, which could affect the diffusion/FM training dynamics.
@rakkit No, please refer to img_latent_dataset and config_details for details.
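For context on what the scale factor in this exchange refers to: in SD-style latent diffusion pipelines it is commonly estimated as the reciprocal of the latent standard deviation measured on a sample of training data, and applied to the latents before diffusion or flow-matching training. A rough sketch under that assumption (the `vae.encode` API and names are not this repo's code):

```python
import torch

@torch.no_grad()
def estimate_scale_factor(vae, loader, device="cuda", max_batches=100):
    """Estimate 1 / std of the latents over a data sample (assumed encode API)."""
    vae.eval().to(device)
    latents = []
    for i, (images, _) in enumerate(loader):
        if i >= max_batches:
            break
        z = vae.encode(images.to(device))   # (B, C, H/16, W/16)
        latents.append(z.flatten())
    return 1.0 / torch.cat(latents).std()

# During diffusion / flow-matching training the latents are rescaled first:
#   z = vae.encode(x) * scale_factor
# so that models trained on f16d32 vs. f16d64 latents see comparably scaled inputs.
```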
Hi @JingfengYao, thanks for pointing this out. I am surprised that the vanilla VAE works so well (rFID = 0.49) with 50 epochs and no extra tricks. Though it actually makes sense: by setting the KL weight to a very small number, we get a degenerated VAE that behaves like a plain autoencoder, which is expected to maximize reconstruction performance. This can also explain why the latent landscape can be refactored via the VF loss to improve generation performance. (Can you share the KL-weight settings and the statistics of your latents' mean and variance?)

Also, it is not surprising that @LinB203 points out that DiT-B + D64 performs worse than DiT-B + D32; SD3 reaches a similar conclusion (Figure 10, page 21). Interestingly, SD3's conclusion suggests that models with more than 22 layers should benefit from a higher-dimensional latent, which is easy to satisfy (DiT-L has 24 layers). We have already seen successful works in generation and understanding tasks that benefit from high-dimensional latents, so your method should work as well. The real problem could be that the KL-VAE is not a very good baseline for providing a good latent space.
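To make the two quantities being asked about concrete, the snippet below is a sketch (with assumed names and an assumed `encode` returning posterior mean and log-variance; the weight value is illustrative, not the authors' setting) of how per-channel latent statistics are typically gathered and how a very small KL weight makes the KL term negligible relative to reconstruction, which is what drives the near-autoencoder behavior described above:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def latent_stats(vae, loader, device="cuda", max_batches=100):
    """Per-channel mean and variance of posterior means (assumed encode API)."""
    mus = []
    for i, (images, _) in enumerate(loader):
        if i >= max_batches:
            break
        mu, logvar = vae.encode(images.to(device))  # assumed to return (mu, logvar)
        mus.append(mu)
    z = torch.cat(mus)                              # (N, C, H, W)
    return z.mean(dim=(0, 2, 3)), z.var(dim=(0, 2, 3))

def vae_loss(x, x_rec, mu, logvar, kl_weight=1e-6):
    rec = F.mse_loss(x_rec, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # With a tiny kl_weight (1e-6 here is only an example value), the KL term
    # barely constrains the posterior, so the model behaves close to a plain
    # autoencoder optimized almost purely for reconstruction.
    return rec + kl_weight * kl
```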
@JingfengYao Hi, thanks. |
Hey @JingfengYao, the VAE checkpoints have not all been released. Could you share the scale_factor (variance) of the F16D16, F16D32, and F16D64 VAEs trained without VF loss?
Hi, thanks for your attention. We have released more VA-VAE experimental variants here. Hope you like them. 😊
Please feel free to correct me if I have any misunderstandings.