Authors: Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan and Qifeng Chen.
The authors propose FateZero, a zero-shot text-based editing method for real-world videos that requires neither per-prompt training nor a user-specific mask.
Previous and concurrent diffusion-based editing methods mainly work on images. To edit a real image, these methods use deterministic DDIM inversion to map the image to noise, and the inverted noise is then gradually denoised into the edited image under the condition of the target prompt. Based on this pipeline, several methods have been proposed that guide the edit through cross-attention.
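For concreteness, one deterministic DDIM inversion step (written here in standard DDIM notation, not taken from the paper) maps the latent $z_t$ toward noise as

$$z_{t+1} = \sqrt{\bar{\alpha}_{t+1}}\,\frac{z_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(z_t, t, c)}{\sqrt{\bar{\alpha}_t}} + \sqrt{1-\bar{\alpha}_{t+1}}\,\epsilon_\theta(z_t, t, c),$$

where $\bar{\alpha}_t$ is the cumulative noise schedule and $\epsilon_\theta$ is the noise prediction conditioned on the source prompt $c$.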
The authors propose FateZero, a zero-shot video editing method that needs no training for each individual target prompt and no user-specific mask. To keep the temporal consistency of the edited video, they rely on two novel designs. First, instead of relying solely on inversion and generation, they store all the self- and cross-attention maps at every step of the inversion process, and then replace the corresponding maps during the denoising steps of the DDIM pipeline. Second, to reduce semantic leakage from the source video, they blend the self-attention maps with a mask derived from the source prompt's cross-attention maps, as detailed in the Attention Map Blending paragraph below.
The authors' contributions can be summarized as follows:
- They present the first framework for temporally consistent zero-shot text-based video editing using a pre-trained text-to-image model.
- They propose to fuse the attention maps from the inversion process with those of the generation process to preserve motion and structure consistency during editing.
- Their Attention Blending Block utilizes the source prompt's cross-attention map during attention fusion to prevent source semantic leakage and improve the shape-editing capability.
- They show extensive applications of their method in video style editing, video local editing, video object replacement, etc.
The authors use a pre-trained text-to-image latent diffusion model as the base model, which contains an autoencoder that maps frames to a latent space and a UNet that predicts noise in that latent space, conditioned on the text prompt.
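As a rough illustration (not the authors' released code), such a backbone can be loaded with the diffusers library; the checkpoint id below is an assumption:

```python
# Minimal sketch: load a latent-diffusion text-to-image backbone with a
# deterministic DDIM scheduler, as assumed by inversion-based editing.
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

model_id = "runwayml/stable-diffusion-v1-5"  # assumed Stable Diffusion checkpoint
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

unet = pipe.unet                   # UNet: predicts noise in the latent space
vae = pipe.vae                     # autoencoder: maps frames to/from latents
text_encoder = pipe.text_encoder   # CLIP text encoder for the prompt condition
```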
Inversion Attention Fusion: Directly editing with the inverted noise results in frame inconsistency, which can be attributed to two factors. First, the invertibility of DDIM only holds in the limit of infinitesimally small steps, whereas the 50 DDIM denoising steps used in practice accumulate errors at each step. Second, the large classifier-free guidance scale required for meaningful editing further amplifies this accumulated error. To mitigate this, the authors store the intermediate attention maps while inverting the source video:
$$z_T,\;[c^{src}_t]_{t=1}^{T},\;[s^{src}_t]_{t=1}^{T} = \mathrm{DDIM\text{-}INV}(z_0,\, p^{src}),$$

where $z_0$ is the latent of the source video, $p^{src}$ is the source prompt, $z_T$ is the inverted noise, and $c^{src}_t$ and $s^{src}_t$ are the cross-attention and self-attention maps stored at inversion step $t$. During editing with the target prompt, the attention maps computed in the denoising UNet are fused with these stored source maps rather than generated from scratch, which preserves the structure and motion of the source video.
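A minimal sketch of this idea (hypothetical helper names, not the paper's implementation): the UNet's attention modules are hooked to cache the source attention maps during DDIM inversion, and the newly computed maps are overwritten with the cached ones at the matching timestep during editing.

```python
# Sketch of inversion attention fusion with hypothetical helper names.
# During inversion the attention probabilities are cached per timestep;
# during editing they are re-injected in place of the newly computed ones.
import torch

class AttentionStore:
    """Caches attention maps keyed by (timestep, layer name)."""
    def __init__(self):
        self.maps = {}
        self.mode = "store"    # "store" during inversion, "fuse" during editing
        self.timestep = None

    def __call__(self, name, attn_probs):
        key = (int(self.timestep), name)
        if self.mode == "store":
            self.maps[key] = attn_probs.detach()
            return attn_probs
        # "fuse": replace edit-time attention with the cached source attention
        return self.maps.get(key, attn_probs)

store = AttentionStore()

# Toy usage: pretend these are self-attention probabilities from one UNet layer.
store.mode, store.timestep = "store", 981
src_attn = torch.softmax(torch.randn(8, 256, 256), dim=-1)
store("up_block.1.attn1", src_attn)           # cached during inversion

store.mode, store.timestep = "fuse", 981
edit_attn = torch.softmax(torch.randn(8, 256, 256), dim=-1)
fused = store("up_block.1.attn1", edit_attn)  # returns the cached source map
assert torch.equal(fused, src_attn)
```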
Attention Map Blending: Inversion-time attention fusion alone can be insufficient for local attribute editing, as shown in the corresponding figure. In its third column, directly replacing the self-attention maps with the source maps during the whole denoising process causes artifacts, because the target prompt cannot match the original attention in the edited regions. To address this, the authors obtain a binary mask $m_t$ by thresholding the source prompt's cross-attention map of the edited word with a constant $\tau$, and blend the self-attention maps as $s^{fused}_t = m_t \odot s^{edit}_t + (1 - m_t) \odot s^{src}_t$, so that the edited region uses the attention generated under the target prompt while the rest of the frame keeps the inverted source attention.
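A sketch of this blending step under the assumptions above (tensor shapes and the threshold value are illustrative, not taken from the paper):

```python
# Sketch of attention map blending (illustrative shapes and threshold).
# mask : binarized source cross-attention of the edited word
# fused self-attention = mask * s_edit + (1 - mask) * s_src
import torch

def blend_self_attention(cross_attn_word, s_src, s_edit, tau=0.3):
    """
    cross_attn_word: (heads, hw)     source cross-attention for the edited token
    s_src, s_edit  : (heads, hw, hw) self-attention maps from inversion / editing
    tau            : threshold (illustrative value)
    """
    # Average over heads and threshold to get a spatial editing mask.
    mask = (cross_attn_word.mean(0) > tau).float()   # (hw,)
    mask = mask[None, :, None]                       # broadcast over heads / keys
    return mask * s_edit + (1.0 - mask) * s_src

heads, hw = 8, 256
s_src = torch.softmax(torch.randn(heads, hw, hw), dim=-1)
s_edit = torch.softmax(torch.randn(heads, hw, hw), dim=-1)
cross = torch.rand(heads, hw)
s_fused = blend_self_attention(cross, s_src, s_edit)
```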
Spatial-Temporal Self-Attention: Denoising each frame individually produces a temporally inconsistent video. Inspired by causal self-attention and a recent one-shot video generation method, the authors reshape the original self-attention into spatial-temporal self-attention without changing the pre-trained weights. Specifically, they implement it as

$$Q = W^Q z_i,\qquad K = W^K\,[z_1; z_{i-1}],\qquad V = W^V\,[z_1; z_{i-1}],$$

where $z_i$ denotes the latent feature of the $i$-th frame, $[\,\cdot\,;\,\cdot\,]$ denotes concatenation, and $W^Q$, $W^K$, $W^V$ are the projection matrices of the pre-trained self-attention layers, so each frame attends to both the first frame and the previous frame.
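A sketch of this reshaped attention, assuming keys and values come from the concatenation of the first and previous frames as described (W_q / W_k / W_v stand in for the pre-trained projection matrices; shapes are illustrative):

```python
# Sketch of spatial-temporal (sparse-causal) self-attention:
# queries from frame i, keys/values from [first frame; previous frame].
import torch
import torch.nn.functional as F

def spatial_temporal_attention(z, W_q, W_k, W_v):
    """z: (frames, hw, dim) latent features of all frames."""
    outputs = []
    for i in range(z.shape[0]):
        prev = z[max(i - 1, 0)]              # previous frame (frame 1 uses itself)
        kv = torch.cat([z[0], prev], dim=0)  # [first frame; previous frame]
        q = z[i] @ W_q                       # (hw, dim)
        k = kv @ W_k                         # (2*hw, dim)
        v = kv @ W_v
        attn = F.softmax(q @ k.t() / k.shape[-1] ** 0.5, dim=-1)
        outputs.append(attn @ v)
    return torch.stack(outputs)

frames, hw, dim = 4, 256, 64
z = torch.randn(frames, hw, dim)
W_q, W_k, W_v = (torch.randn(dim, dim) for _ in range(3))
out = spatial_temporal_attention(z, W_q, W_k, W_v)  # (frames, hw, dim)
```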
Different from appearance editing, reshaping a specific object in the video is much more challenging and requires a pre-trained video diffusion model. Since no generic video diffusion model is publicly available, the authors perform this editing on a one-shot video diffusion model instead. In this setting, they compare their editing method with plain DDIM inversion, and their method achieves better performance in terms of editing ability, motion consistency, and temporal consistency.