Does the performance of DiT improve as the VAE dimension increases? #3

Open

LinB203 opened this issue Jan 4, 2025 · 12 comments

@LinB203 commented Jan 4, 2025

Please feel free to correct me if I have any misunderstandings.

  1. Although Figure 4 (a) and (b) demonstrate that +VF DINOv2 performs better, dim64 is still not superior to dim32, am I right?
  2. For the main results in Table 3, they should all use the dim32 VAE, correct? I am curious whether the performance of DiT improves as the VAE dimension increases. If not, then perhaps +VF DINOv2 simply shifts the trade-off from dim16 to dim32, which means it hasn't fundamentally addressed the issue.

@yinghuozijin

Indeed.

@JingfengYao (Member)

Thanks for your questions.

  1. Although low-dimensional tokenizers have a clear computational advantage in FID convergence, they inherently produce artifacts in detail reconstruction, such as small faces and tiny text (Figure 1 in the paper). Influential text-to-image models like FLUX and SD3 still opt for higher-dimensional tokenizers and larger generative models (>8B) to achieve finer generation quality.
  2. In Table 2, VF loss significantly improves the convergence speed of high-dimensional tokenizers, for both d32 and d64. Although VF loss does not yield performance gains with increasing dimension in small-scale models (<0.6B), its performance boost for a given high-dimensional tokenizer is still expected to reduce the parameter requirements of the generative model. This effectively mitigates the optimization dilemma and advances the reconstruction-generation frontier.
  3. Thank you again for your attention and feedback on our work. We will continue to refine our approach and ensure that our expressions in the paper are more precise.

@LinB203 (Author) commented Jan 6, 2025

This work is a meaningful exploration. I have a suggestion; feel free to critique it. When the compression ratio remains unchanged, increasing the latent dimension from 16 to 64 might lead the VAE to exploit shortcuts instead of performing its intended task, because the reconstruction task becomes too easy in that case. Perhaps methods should be devised to verify whether the VAE is truly learning meaningful representations; one possible check is sketched below.
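A rough sketch of one such check (not from the paper; `vae`, the labeled data loaders, and the probing setup are placeholders of my own): train a linear probe on frozen, pooled latents and compare probe accuracy across d16/d32/d64. If the higher-dimensional latents reconstruct better but probe no better, the extra channels are likely taking the "shortcut" of memorizing pixels rather than learning semantics.

```python
import torch
import torch.nn as nn

def linear_probe_accuracy(vae, train_loader, val_loader, num_classes,
                          latent_channels, steps=2000, device="cuda"):
    """Train a linear classifier on pooled, frozen VAE latents and report validation accuracy."""
    probe = nn.Linear(latent_channels, num_classes).to(device)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    def encode(images):
        with torch.no_grad():
            # Assumes a diffusers-style AutoencoderKL; swap in your own encode() call.
            z = vae.encode(images.to(device)).latent_dist.mode()
        return z.mean(dim=(2, 3))  # global average pool -> (N, C)

    it = iter(train_loader)
    for _ in range(steps):
        try:
            images, labels = next(it)
        except StopIteration:
            it = iter(train_loader)
            images, labels = next(it)
        loss = loss_fn(probe(encode(images)), labels.to(device))
        opt.zero_grad(); loss.backward(); opt.step()

    correct = total = 0
    for images, labels in val_loader:
        pred = probe(encode(images)).argmax(dim=1)
        correct += (pred == labels.to(device)).sum().item()
        total += labels.numel()
    return correct / total
```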

@JingfengYao (Member)

Great point, we'll consider it. Thanks~😄

@txytju commented Jan 10, 2025

[screenshot omitted] I tested a VAE that I trained with 64 channels and found that the correlation coefficients between channels are very high. So I suspect that when using more channels, many of them are wasted.
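For anyone who wants to reproduce this check, a minimal sketch (the `latents` tensor is a placeholder for whatever your encoder produces, shape `(N, C, H, W)`):

```python
import torch

@torch.no_grad()
def channel_correlation(latents: torch.Tensor) -> torch.Tensor:
    """latents: (N, C, H, W) sampled latents; returns a (C, C) Pearson correlation matrix."""
    n, c, h, w = latents.shape
    # Treat every spatial position of every sample as one observation of a C-dim vector.
    x = latents.permute(1, 0, 2, 3).reshape(c, -1)   # (C, N*H*W)
    x = x - x.mean(dim=1, keepdim=True)
    cov = x @ x.T / (x.shape[1] - 1)                 # (C, C) covariance
    std = cov.diag().clamp_min(1e-12).sqrt()
    return cov / (std[:, None] * std[None, :])

# High off-diagonal values suggest redundant channels, e.g.:
# corr = channel_correlation(latents)
# print((corr - torch.eye(corr.shape[0])).abs().mean())
```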

@rakkit commented Jan 10, 2025

It is expected that f16d32 and f16d64 have different scale factors, which could affect the diffusion/flow-matching training dynamics.
Does the code here mean you use 1.0 for your DiT + VA-VAE training?
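For context, a sketch of the convention I would expect (the usual LDM recipe, not necessarily this repo's exact code; `vae` and `dataloader` are placeholders and the diffusers-style `encode()` is an assumption): the scale factor is estimated as 1 / std of the encoded latents, so that the scaled latents are roughly unit-variance for diffusion or flow-matching training, and f16d32 vs. f16d64 tokenizers would generally produce different values.

```python
import torch

@torch.no_grad()
def estimate_scale_factor(vae, dataloader, device="cuda", max_batches=100):
    """Estimate the LDM-style scale factor as 1 / std over a sample of latents."""
    chunks = []
    for i, (images, _) in enumerate(dataloader):
        if i >= max_batches:
            break
        z = vae.encode(images.to(device)).latent_dist.sample()  # diffusers-style API assumed
        chunks.append(z.flatten())
    return (1.0 / torch.cat(chunks).std()).item()

# Training would then use z * scale_factor, and sampling would use z / scale_factor before decoding.
```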

@JingfengYao (Member)

@rakkit No, please refer to img_latent_dataset and config_details for details.

@rakkit commented Jan 10, 2025

Hi @JingfengYao, thanks for pointing this out.

I am surprised that the vanilla VAE works pretty well (rFID = 0.49) with 50 epochs and no extra tricks. Then again, it makes sense: by setting the KL weight to a very small number, we end up with a degenerate VAE that behaves like a plain autoencoder, which should maximize reconstruction performance. This can also explain why the latent landscape can be refactored via VF loss to improve generation performance. (Can you share the KL-weight settings and the statistics of your latents' mean and variance?)

Also, it is not surprising that @LinB203 found DiT-B + d64 performs worse than DiT-B + d32. SD3 reached a similar conclusion (Figure 10, page 21). But it is interesting that, according to SD3's conclusion, models with more than 22 layers should benefit from higher-dimensional latents, a condition that is easy to satisfy (DiT-L has 24 layers).

The fact is that we have already seen successful works in generation and understanding tasks that benefit from high dimensions, and your method should work as well. The real problem could be that the KL-VAE isn't a very good baseline for producing a good latent space.
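To make the parenthetical question above concrete, this is the kind of statistic meant (a sketch with the same placeholder names as before, diffusers-style `encode()` assumed): per-channel mean and variance of the encoded latents, which show how far a near-zero KL weight lets the aggregate posterior drift from a unit Gaussian.

```python
import torch

@torch.no_grad()
def per_channel_latent_stats(vae, dataloader, device="cuda", max_batches=50):
    """Return per-channel mean and variance of sampled latents, each of shape (C,)."""
    samples = []
    for i, (images, _) in enumerate(dataloader):
        if i >= max_batches:
            break
        z = vae.encode(images.to(device)).latent_dist.sample()  # diffusers-style API assumed
        samples.append(z.cpu())
    z = torch.cat(samples)                       # (N, C, H, W)
    return z.mean(dim=(0, 2, 3)), z.var(dim=(0, 2, 3))
```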

@JingfengYao (Member) commented Jan 10, 2025

@rakkit

  1. The configuration for KL is provided here; the latent statistics are provided here in the README. It is a classic configuration for LDM and is not the main focus of this paper, but its performance on ImageNet has been discussed in a recent issue in the MAR repository.
  2. The SD3 issue you mentioned has been discussed in our paper's abstract and introduction (page 2).
  3. There is a misunderstanding: the number of layers is not the determining factor. The architecture of SD3 differs from that of DiT, and the 22-layer SD3 has more parameters than DiT-L. According to SD3, it is the model's capacity that matters.
  4. There is a causal misunderstanding here. An LDM-style VAE is the most widely adopted latent diffusion VAE (e.g., in SD3, FLUX, SD-XL, and DiT), which is why we used it for our experiments.

@rakkit commented Jan 10, 2025

@JingfengYao Hi, thanks.
Yes, I do agree with you. My point is that the baseline is not strong enough to fully show the power of your work, or of other distillation-based methods. That is also a possible explanation for why a higher dimension did not give better results here: the weakness of the KL-VAE. Technically speaking, with a stronger tokenizer (a better latent landscape), your method could work even better.

@rakkit commented Jan 15, 2025

Hey @JingfengYao, not all of the VAE checkpoints have been released. Would it be possible to share the scale_factor (variance) of the f16d16, f16d32, and f16d64 VAEs trained without VF loss?

@JingfengYao (Member)

Hi, thanks for your attention.

We have released more VA-VAE experimental variants here. Hope you like them. 😊
