todo #5

lucidrains · 2024-12-07T16:43:52Z

classifier-free guidance + disney research
allow peeking at the last frame before deciding on the action for the next time step
order the actions, then use a small hierarchical action transformer to predict the next set of actions
abstract the vq pre / post transformer logic into a wrapper, and prepare for swapping in different wrappers (first a residual VQ version, then some guesses at working in continuous latent space). the main ambiguity is whether they operate on discrete or continuous embeddings from imagen
design an axial space / time version of the transformer
allow for decoding of next set of actions
add a concise example for pong at root
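
for the classifier-free guidance item, a minimal sketch of the usual logit blending, assuming the model is run twice per step (once with the action conditioning, once with it dropped). the function name `classifier_free_guidance` and the `scale` parameter are hypothetical and not taken from this repo; plain python lists stand in for tensors so the snippet runs without dependencies.

```python
def classifier_free_guidance(cond_logits, uncond_logits, scale=3.0):
    """Blend conditional and unconditional logits.

    Hypothetical helper, not this repo's API. Uses the common formulation
        guided = uncond + scale * (cond - uncond)
    so scale = 1.0 returns the conditional logits unchanged and
    scale > 1.0 pushes further in the direction of the conditioning.
    """
    if len(cond_logits) != len(uncond_logits):
        raise ValueError("logit vectors must have the same length")

    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


# toy logits over two tokens
guided = classifier_free_guidance([2.0, 0.0], [1.0, 0.0], scale=3.0)
print(guided)
```

in a real model the same expression would apply elementwise to the logit tensors before sampling; whether guidance belongs on logits or on continuous latents depends on the discrete vs continuous question raised above.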
