Does the performance of DiT improve as the VAE dimension increases? #3

Open

LinB203 opened this issue Jan 4, 2025 · 12 comments

@LinB203 commented Jan 4, 2025

Please feel free to correct me if I have any misunderstandings.

  1. Although Figure 4 (a) and (b) demonstrate that +VF DINOv2 performs better, dim64 is still not superior to dim32, am I right?
  2. For the main results in Table 3, they should all use the dim32 VAE, correct? I am curious whether the performance of DiT improves as the VAE dimension increases. If not, then perhaps +VF DINOv2 simply shifts the trade-off from dim16 to dim32, which means it hasn't fundamentally addressed the issue.

@yinghuozijin

Indeed.

@JingfengYao (Member)

Thanks for your questions.

  1. Although low-dimensional tokenizers have a clear computational advantage in FID convergence, they inherently produce artifacts in detail reconstruction, such as small faces and tiny text (Figure 1 in the paper). Influential text-to-image models like FLUX and SD3 still opt for higher-dimensional tokenizers and larger generative models (>8B) to achieve finer generation quality.
  2. In Table 2, VF loss significantly improves the convergence speed of high-dimensional tokenizers, for both d32 and d64. Although VF loss does not yield performance gains with increasing dimension in small-scale models (<0.6B), its performance boost for a given high-dimensional tokenizer is still expected to reduce the parameter requirements of the generative model. This effectively mitigates the optimization dilemma and advances the reconstruction-generation frontier.
  3. Thank you again for your attention and feedback on our work. We will continue to refine our approach and ensure that our expressions in the paper are more precise.

@LinB203 (Author) commented Jan 6, 2025

This work is a meaningful exploration. I have a suggestion; feel free to critique it. When the compression ratio remains unchanged, increasing the latent dimension from 16 to 64 might lead the VAE to exploit shortcuts instead of performing its intended task, because the reconstruction task becomes too easy in that case. Perhaps methods should be devised to verify whether the VAE is truly learning meaningful representations; one possible check is sketched below.
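A rough sketch of one such check (not from the paper; `vae`, the labeled data loaders, and the probing setup are placeholders of my own): train a linear probe on frozen, pooled latents and compare probe accuracy across d16/d32/d64. If the higher-dimensional latents reconstruct better but probe no better, the extra channels are likely taking the "shortcut" of memorizing pixels rather than learning semantics.

```python
import torch
import torch.nn as nn

def linear_probe_accuracy(vae, train_loader, val_loader, num_classes,
                          latent_channels, steps=2000, device="cuda"):
    """Train a linear classifier on pooled, frozen VAE latents and report validation accuracy."""
    probe = nn.Linear(latent_channels, num_classes).to(device)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    def encode(images):
        with torch.no_grad():
            # Assumes a diffusers-style AutoencoderKL; swap in your own encode() call.
            z = vae.encode(images.to(device)).latent_dist.mode()
        return z.mean(dim=(2, 3))  # global average pool -> (N, C)

    it = iter(train_loader)
    for _ in range(steps):
        try:
            images, labels = next(it)
        except StopIteration:
            it = iter(train_loader)
            images, labels = next(it)
        loss = loss_fn(probe(encode(images)), labels.to(device))
        opt.zero_grad(); loss.backward(); opt.step()

    correct = total = 0
    for images, labels in val_loader:
        pred = probe(encode(images)).argmax(dim=1)
        correct += (pred == labels.to(device)).sum().item()
        total += labels.numel()
    return correct / total
```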

@JingfengYao (Member)

Great point, we'll consider it. Thanks~😄

@txytju commented Jan 10, 2025

[screenshot omitted] I tested a VAE that I trained with 64 channels and found that the correlation coefficients between channels are very high. So I suspect that when using more channels, many of them are wasted.
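For anyone who wants to reproduce this check, a minimal sketch (the `latents` tensor is a placeholder for whatever your encoder produces, shape `(N, C, H, W)`):

```python
import torch

@torch.no_grad()
def channel_correlation(latents: torch.Tensor) -> torch.Tensor:
    """latents: (N, C, H, W) sampled latents; returns a (C, C) Pearson correlation matrix."""
    n, c, h, w = latents.shape
    # Treat every spatial position of every sample as one observation of a C-dim vector.
    x = latents.permute(1, 0, 2, 3).reshape(c, -1)   # (C, N*H*W)
    x = x - x.mean(dim=1, keepdim=True)
    cov = x @ x.T / (x.shape[1] - 1)                 # (C, C) covariance
    std = cov.diag().clamp_min(1e-12).sqrt()
    return cov / (std[:, None] * std[None, :])

# High off-diagonal values suggest redundant channels, e.g.:
# corr = channel_correlation(latents)
# print((corr - torch.eye(corr.shape[0])).abs().mean())
```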

@rakkit commented Jan 10, 2025

It is expected that f16d32 and f16d64 have different scale factors, which could affect the diffusion/flow-matching training dynamics.
Does the code here mean you use 1.0 for your DiT + VA-VAE training?
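For context, a sketch of the convention I would expect (the usual LDM recipe, not necessarily this repo's exact code; `vae` and `dataloader` are placeholders and the diffusers-style `encode()` is an assumption): the scale factor is estimated as 1 / std of the encoded latents, so that the scaled latents are roughly unit-variance for diffusion or flow-matching training, and f16d32 vs. f16d64 tokenizers would generally produce different values.

```python
import torch

@torch.no_grad()
def estimate_scale_factor(vae, dataloader, device="cuda", max_batches=100):
    """Estimate the LDM-style scale factor as 1 / std over a sample of latents."""
    chunks = []
    for i, (images, _) in enumerate(dataloader):
        if i >= max_batches:
            break
        z = vae.encode(images.to(device)).latent_dist.sample()  # diffusers-style API assumed
        chunks.append(z.flatten())
    return (1.0 / torch.cat(chunks).std()).item()

# Training would then use z * scale_factor, and sampling would use z / scale_factor before decoding.
```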

@JingfengYao (Member)

@rakkit No, please refer to img_latent_dataset and config_details for details.

@rakkit commented Jan 10, 2025

Hi @JingfengYao, thanks for pointing this out.

I am surprised that the vanilla VAE works pretty well (rFID = 0.49) with 50 epochs and no extra tricks. Then again, it makes sense: by setting the KL weight to a very small number, we end up with a degenerate VAE that behaves like a plain autoencoder, which should maximize reconstruction performance. This can also explain why the latent landscape can be refactored via VF loss to improve generation performance. (Can you share the KL-weight settings and the statistics of your latents' mean and variance?)

Also, it is not surprising that @LinB203 found DiT-B + d64 performs worse than DiT-B + d32. SD3 reached a similar conclusion (Figure 10, page 21). But it is interesting that, according to SD3's conclusion, models with more than 22 layers should benefit from higher-dimensional latents, a condition that is easy to satisfy (DiT-L has 24 layers).

The fact is that we have already seen successful works in generation and understanding tasks that benefit from high dimensions, and your method should work as well. The real problem could be that the KL-VAE isn't a very good baseline for producing a good latent space.
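To make the parenthetical question above concrete, this is the kind of statistic meant (a sketch with the same placeholder names as before, diffusers-style `encode()` assumed): per-channel mean and variance of the encoded latents, which show how far a near-zero KL weight lets the aggregate posterior drift from a unit Gaussian.

```python
import torch

@torch.no_grad()
def per_channel_latent_stats(vae, dataloader, device="cuda", max_batches=50):
    """Return per-channel mean and variance of sampled latents, each of shape (C,)."""
    samples = []
    for i, (images, _) in enumerate(dataloader):
        if i >= max_batches:
            break
        z = vae.encode(images.to(device)).latent_dist.sample()  # diffusers-style API assumed
        samples.append(z.cpu())
    z = torch.cat(samples)                       # (N, C, H, W)
    return z.mean(dim=(0, 2, 3)), z.var(dim=(0, 2, 3))
```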

@JingfengYao (Member) commented Jan 10, 2025

@rakkit

  1. The configuration for KL is provided here; the latent statistics are provided here in the README. It is a classic configuration for LDM and is not the main focus of this paper, but its performance on ImageNet has been discussed in a recent issue in the MAR repository.
  2. The SD3 issue you mentioned has been discussed in our paper's abstract and introduction (page 2).
  3. There is a misunderstanding: the number of layers is not the determining factor. The architecture of SD3 differs from that of DiT, and the 22-layer SD3 has more parameters than DiT-L. According to SD3, it is the model's capacity that matters.
  4. There is a causal misunderstanding here. An LDM-style VAE is the most widely adopted latent diffusion VAE (e.g., in SD3, FLUX, SD-XL, and DiT), which is why we used it for our experiments.

@rakkit commented Jan 10, 2025

@JingfengYao Hi, thanks.
Yes, I do agree with you. My point is that the baseline is not strong enough to fully show the power of your work, or of other distillation-based methods. That is also a possible explanation for why a higher dimension did not give better results here: the weakness of the KL-VAE. Technically speaking, with a stronger tokenizer (a better latent landscape), your method could work even better.

@rakkit commented Jan 15, 2025

Hey @JingfengYao, not all of the VAE checkpoints have been released. Would it be possible to share the scale_factor (variance) of the f16d16, f16d32, and f16d64 VAEs trained without VF loss?

@JingfengYao (Member)

Hi, thanks for your attention.

We have released more VA-VAE experimental variants here. Hope you like them. 😊
