This repository contains our implementation of the following paper:
Yuki Endo: "Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation," The Visual Computer (2023). [Project] [PDF (preprint)]
- Python 3
- PyTorch
- Other dependencies (see env.yml; an example setup command follows this list)
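If env.yml is a conda environment file (the file name suggests so, but this is an assumption), the environment can be created with:

conda env create -f env.yml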
Download the Stable Diffusion model weights (512-base-ema.ckpt) from https://huggingface.co/stabilityai/stable-diffusion-2-base and put the file in the checkpoint directory.
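Alternatively, here is a minimal Python sketch for fetching the file programmatically. It assumes a recent version of huggingface_hub that supports the local_dir argument; the repo_id and filename match the link above, while the target directory is just the checkpoint folder expected by the command below:

```python
# Sketch: download 512-base-ema.ckpt into ./checkpoint via huggingface_hub.
# Assumes `pip install huggingface_hub` (version with local_dir support).
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="stabilityai/stable-diffusion-2-base",
    filename="512-base-ema.ckpt",
    local_dir="./checkpoint",
)
print(ckpt_path)  # e.g., ./checkpoint/512-base-ema.ckpt
```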
You can generate an image from an input mask and prompt by running the following command:
python scripts/txt2img_mag.py --ckpt ./checkpoint/512-base-ema.ckpt --prompt "A furry bear riding on a bike in the city" --mask ./inputs/mask1.png --word_ids_for_mask "[[1,2,3,4],[6,7]]" --outdir ./outputs
Here, --word_ids_for_mask specifies the word indices that correspond to each region in the mask image. For example, if you set word_ids_for_mask to "[[1,2,3,4],[6,7]]", the first region corresponds to "A" (1), "furry" (2), "bear" (3), and "riding" (4), and the second region corresponds to "a" (6) and "bike" (7). (Index 0 is reserved for the beginning-of-sentence token.) Regions are ordered according to their BGR color values in the mask, in reverse (descending) order.
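As a sanity check, the word-to-index mapping can be previewed with a short Python sketch. This assumes indices follow whitespace-separated word positions (which matches the example above); words that the tokenizer splits into multiple subword tokens may shift later indices:

```python
# Preview which index corresponds to which word in the prompt.
# Index 0 is reserved for the beginning-of-sentence token, so words start at 1.
prompt = "A furry bear riding on a bike in the city"
for i, word in enumerate(prompt.split(), start=1):
    print(i, word)
# 1 A / 2 furry / 3 bear / 4 riding / 5 on / 6 a / 7 bike / 8 in / 9 the / 10 city
```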
You can also specify two additional parameters, --alpha and --lmda, which control the masked-attention guidance scale and the loss-balancing weight, respectively.
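For example (the values below are placeholders for illustration only, not tuned recommendations; see the paper for guidance on choosing them):

python scripts/txt2img_mag.py --ckpt ./checkpoint/512-base-ema.ckpt --prompt "A furry bear riding on a bike in the city" --mask ./inputs/mask1.png --word_ids_for_mask "[[1,2,3,4],[6,7]]" --alpha 0.1 --lmda 0.1 --outdir ./outputs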
Please cite our paper if you find the code useful:
@article{endoTVC2023,
  title   = {Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation},
  author  = {Yuki Endo},
  journal = {The Visual Computer},
  volume  = {40},
  pages   = {6033--6045},
  doi     = {10.1007/s00371-023-03151-y},
  year    = {2023}
}
This code borrows heavily from the Stable Diffusion repository.