
# DiffusionInst: Diffusion Model for Instance Segmentation

Authors: Zhangxuan Gu, Haoxing Chen, Zhuoer Xu, Jun Lan, Changhua Meng, Weiqiang Wang.

Authors propose DiffusionInst, a novel framework that represents instances as instance-aware filters and formulates instance segmentation as a noise-to-filter denoising process.

## Introduction


- Authors propose DiffusionInst, the first work to apply diffusion models to instance segmentation, by regarding it as a generative noise-to-filter diffusion process.
- Instead of predicting local masks, they utilize instance-aware filters and a common mask branch feature to represent and reconstruct global masks.
- Comprehensive experiments are conducted on the COCO and LVIS benchmarks. DiffusionInst achieves competitive results compared with existing approaches, showing the promising future of diffusion models in discriminative tasks.

## DiffusionInst

*(Figure: overall framework of DiffusionInst)*

Authors regard a data sample in DiffusionInst as a filter $x_0 = \theta$ for instance segmentation. The overall framework of DiffusionInst is illustrated in the figure above, which contains the following components:

1. A CNN (e.g., ResNet-50) or Swin Transformer backbone, which extracts compact visual feature representations together with an FPN.

2. A mask branch, which fuses multi-scale information from the FPN and outputs a mask feature $F_{mask} \in \mathbb{R}^{c \times H/4 \times W/4}$.

These two components act like an encoder; the input image passes through them only once for feature extraction.

3. As for the decoder, it takes a set of noisy bounding boxes associated with filters as input and refines both boxes and filters as a denoising process. This component can be called iteratively.

4. Finally, the instance mask is reconstructed with the help of the mask feature $F_{mask}$ and the denoised filters (see the sketch after this list). As in DiffusionDet, the authors keep its optimization target but omit the details here for clarity.
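To make the filter representation concrete, here is a minimal PyTorch sketch of how the mask head $\phi$ could apply one instance's filter vector to the shared mask feature, in the style of CondInst's dynamic convolutions. The layer sizes, and the omission of CondInst's relative-coordinate channels, are illustrative simplifications, not the paper's exact configuration.

```python
import math
import torch
import torch.nn.functional as F

def reconstruct_mask(mask_feat, filters):
    """Apply one instance's filter vector to the shared mask feature
    F_mask of shape (c, H/4, W/4) via dynamic 1x1 convolutions.
    The three-layer head (c -> 8 -> 8 -> 1) is an assumed example."""
    c = mask_feat.shape[0]
    x = mask_feat.unsqueeze(0)                      # (1, c, H/4, W/4)
    # Unpack the flat filter vector into per-layer weights and biases.
    shapes = [(8, c), (8,), (8, 8), (8,), (1, 8), (1,)]
    params, i = [], 0
    for s in shapes:
        n = math.prod(s)
        params.append(filters[i:i + n].reshape(s))
        i += n
    w1, b1, w2, b2, w3, b3 = params
    x = F.relu(F.conv2d(x, w1.view(8, c, 1, 1), b1))
    x = F.relu(F.conv2d(x, w2.view(8, 8, 1, 1), b2))
    x = F.conv2d(x, w3.view(1, 8, 1, 1), b3)
    return x.sigmoid()                              # soft global mask

# Example with random tensors (c = 64):
theta = torch.randn(8 * 64 + 8 + 8 * 8 + 8 + 1 * 8 + 1)
mask = reconstruct_mask(torch.randn(64, 56, 56), theta)
```

Because the filters convolve the full-resolution mask feature, each instance produces a global mask rather than a local one cropped to its box.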

Training: During training, the diffusion process is constructed from ground-truth filters to noisy filters, relying on the corresponding bounding boxes, and the model is trained to reverse this process. Assume an input image has $N$ instances ($\theta^{gt}_0$) that need to be detected; a timestep $t$ is randomly chosen to perturb the ground-truth boxes. The predicted instance masks are then obtained as follows (the denoising process of the decoder follows DiffusionDet <Diffusion Model for Object Detection> and is denoted $f(b, t)$):

$b_t = \sqrt{\bar{\alpha}_t}\, b^{gt}_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, with $\epsilon \sim \mathcal{N}(0, \mathbf{I})$,

$\theta_t = \mathrm{Decoder}(b_t)$,

$m = \phi(F_{mask};\, f(\theta_t, t))$.
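The box perturbation above is the standard DDPM forward process. A minimal sketch follows, assuming a linear $\beta$ schedule and placeholder ground-truth boxes (DiffusionDet's additional signal scaling is omitted):

```python
import torch

# Standard DDPM bookkeeping; the linear beta schedule is an assumption.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def q_sample(b0, t):
    """b_t = sqrt(abar_t) * b0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I)."""
    eps = torch.randn_like(b0)
    a = alpha_bar[t]
    return a.sqrt() * b0 + (1.0 - a).sqrt() * eps

# Perturb N ground-truth boxes (cx, cy, w, h) at a random timestep.
gt_boxes = torch.rand(16, 4)            # placeholder ground truth
t = torch.randint(0, T, (1,))
b_t = q_sample(gt_boxes, t)
# theta_t = Decoder(b_t) and m = phi(F_mask; f(theta_t, t)) would
# follow, using the real decoder, denoiser f, and mask branch.
```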

Using the dice loss from CondInst <Conditional Convolutions for Instance Segmentation>, the training objective function is:

$L_{overall} = L_{det} + \lambda L_{dice}(m, m^{gt}),$

where $L_{det}$ is the training loss of DiffusionDet and $\lambda = 5$ in this work balances the two losses.
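For reference, a minimal sketch of the dice term in the usual CondInst formulation over soft masks; the mask shapes here are placeholders:

```python
import torch

def dice_loss(pred, target, eps=1e-5):
    """Dice loss over soft masks: 1 - 2*sum(p*g) / (sum(p^2) + sum(g^2))."""
    p = pred.flatten(1)
    g = target.flatten(1)
    num = 2.0 * (p * g).sum(dim=1)
    den = (p * p).sum(dim=1) + (g * g).sum(dim=1) + eps
    return (1.0 - num / den).mean()

# L_overall = L_det + 5 * L_dice(m, m_gt), with lambda = 5 per the text.
m = torch.rand(16, 1, 56, 56)
m_gt = torch.randint(0, 2, (16, 1, 56, 56)).float()
l_dice = dice_loss(m, m_gt)
```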

Inference: The inference pipeline of DiffusionInst is a denoising sampling process from noise to instance filters. Starting from boxes $b_T$ sampled from a Gaussian distribution, the model progressively refines its predictions as follows:

$b_0 = f(\cdots f(b_{T-s},\, T - s) \cdots)$, $\quad s \in \{0, \cdots, T\}$,

$\theta_0 = \mathrm{Decoder}(b_0)$,

$m = \phi(F_{mask};\, \theta_0)$.
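A minimal sketch of this sampling loop, with `f`, `decoder`, and `phi` standing in for the trained denoiser, filter decoder, and mask-reconstruction head; the number of boxes and the three-step schedule are illustrative assumptions:

```python
import torch

@torch.no_grad()
def sample_instances(f, decoder, phi, mask_feat,
                     num_boxes=500, steps=(999, 499, 0)):
    """Noise-to-filter sampling: start from Gaussian boxes b_T and
    repeatedly apply the denoiser f(b, t) (DiffusionDet's DDIM-style
    step update is folded into f here), then decode filters once and
    reconstruct masks from the shared mask feature."""
    b = torch.randn(num_boxes, 4)       # b_T ~ N(0, I)
    for t in steps:                     # a few evenly spaced timesteps
        b = f(b, t)                     # refine noisy boxes
    theta0 = decoder(b)                 # filters from denoised boxes
    return phi(mask_feat, theta0)       # m = phi(F_mask; theta_0)
```

Note that only the decoder runs repeatedly; the backbone and mask branch are evaluated once per image, so adding sampling steps is relatively cheap.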

## Comparison

*(Figure: comparison with existing approaches on COCO and LVIS)*