Authors: Zhangxuan Gu, Haoxing Chen, Zhuoer Xu, Jun Lan, Changhua Meng, Weiqiang Wang.
Authors propose DiffusionInst, a novel framework that represents instances as instance-aware filters and formulates instance segmentation as a noise-to-filter denoising process.
- Authors propose DiffusionInst, the first diffusion-model framework for instance segmentation, regarding the task as a generative noise-to-filter diffusion process.
- Instead of predicting local masks, they utilize instance-aware filters and a common mask branch feature to represent and reconstruct global masks.
- Comprehensive experiments are conducted on the COCO and LVIS benchmarks. DiffusionInst achieves competitive results compared with existing approaches, showing the promising future of diffusion models in discriminative tasks.
Authors regard a data sample in DiffusionInst as an instance-aware filter, i.e., a compact vector that, combined with a shared mask feature, reconstructs an instance mask.
1- A CNN (e.g., ResNet-50) or Swin Transformer backbone extracts compact multi-scale visual feature representations with an FPN
2- A mask branch fuses the different-scale information from the FPN and outputs a shared mask feature
These two components act as an encoder; the input image passes through them only once for feature extraction.
3- The decoder takes a set of noisy bounding boxes with their associated filters as input and refines both boxes and filters as a denoising process. This component can be called iteratively.
4- Finally, the instance masks are reconstructed from the refined filters with the help of the mask feature
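The filter-based mask reconstruction in step 4 can be sketched as a CondInst-style dynamic head: the per-instance filter vector is unpacked into the weights and biases of a few 1x1 conv layers applied to the shared mask feature. The layer sizes and function names below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def reconstruct_mask(mask_feat, filters, channels=(8, 8, 1)):
    """Apply an instance-aware filter vector to the shared mask feature.

    mask_feat: (C, H, W) shared mask-branch output.
    filters:   flat vector holding the weights and biases of a stack of
               1x1 conv layers (CondInst-style dynamic head; sizes assumed).
    """
    x = mask_feat
    idx = 0
    for i, out_ch in enumerate(channels):
        in_ch = x.shape[0]
        w = filters[idx:idx + out_ch * in_ch].reshape(out_ch, in_ch)
        idx += out_ch * in_ch
        b = filters[idx:idx + out_ch]
        idx += out_ch
        # A 1x1 convolution is a per-pixel linear map over channels.
        x = np.einsum('oc,chw->ohw', w, x) + b[:, None, None]
        if i < len(channels) - 1:
            x = np.maximum(x, 0.0)  # ReLU between layers
    return 1.0 / (1.0 + np.exp(-x[0]))  # sigmoid -> (H, W) soft mask
```

Because every instance shares the same mask feature, each predicted mask is global (full-image) rather than a local crop, which is the point of step 4.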
Training: During training, we construct the diffusion process from ground-truth filters to noisy filters, relying on the corresponding bounding boxes, and then train the model to reverse this process. Assuming an input image has …
With the dice loss used in CondInst <Conditional Convolutions for Instance Segmentation>, the overall training objective combines the detection losses with a mask reconstruction term, of the form L = L_det + λ · L_dice, where L_det covers box classification and regression and λ balances the dice-based mask loss.
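The dice loss itself compares a predicted soft mask against the binary ground-truth mask; a minimal NumPy sketch of the CondInst-style formulation:

```python
import numpy as np

def dice_loss(pred, target, eps=1e-5):
    """Dice loss between a predicted soft mask and a binary GT mask.

    Returns a value in [0, 1]; 0 means perfect overlap.
    """
    inter = 2.0 * np.sum(pred * target)
    denom = np.sum(pred ** 2) + np.sum(target ** 2) + eps
    return 1.0 - inter / denom
```

Unlike a per-pixel cross-entropy, the dice loss is computed over the whole mask, which suits the global-mask representation used here.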
Inference: The inference pipeline of DiffusionInst is a denoising sampling process from noise to instance filters. Starting from boxes and filters sampled from random noise, the model refines its predictions iteratively.
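One deterministic update of such a sampling loop can be written in DDIM form, under the assumption that the decoder predicts the clean filters x0 at each step (names are illustrative):

```python
import numpy as np

def ddim_step(x_t, x0_pred, abar_t, abar_prev):
    """One deterministic DDIM update from noise level abar_t to abar_prev.

    x0_pred is the decoder's current estimate of the clean filters.
    """
    # Recover the implied noise from the current sample and the x0 estimate.
    eps = (x_t - np.sqrt(abar_t) * x0_pred) / np.sqrt(1.0 - abar_t)
    # Re-noise the x0 estimate to the (smaller) target noise level.
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1.0 - abar_prev) * eps
```

Iterating this step from pure noise down to abar = 1 yields the final instance filters, which are then turned into masks via the shared mask feature.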