Authors: Xudong Wang, Rohit Girdhar, Stella X. Yu, and Ishan Misra.
The authors propose Cut-and-LeaRn (CutLER), a simple approach for training unsupervised object detection and segmentation models.
Their method consists of three simple, architecture- and data-agnostic mechanisms. CutLER is trained exclusively on unlabeled ImageNet data, without the need for additional training data.
The authors first propose MaskCut, which automatically produces multiple initial coarse masks per image using pre-trained self-supervised features.
Second, they propose a simple loss-dropping strategy that trains detectors on the coarse masks while remaining robust to objects missed by MaskCut.
Features of CutLER:
1) Simplicity: CutLER is simple to train and agnostic to the choice of detection and backbone architectures.
2) Zero-shot detector: CutLER trained solely on ImageNet shows strong zero-shot performance on 11 different benchmarks, where it outperforms prior work trained with additional in-domain data.
3) Robustness: CutLER exhibits strong robustness against domain shifts when tested on images from different domains such as video frames, sketches, paintings, clip arts, etc.
4) Pretraining for supervised detection: CutLER can serve as a pre-trained model for training fully supervised object detection and instance segmentation models and improves performance on COCO, including on few-shot object detection benchmarks.
As illustrated in the previous figure, the authors propose MaskCut (see next figure), which generates multiple binary masks per image using self-supervised features from DINO. Second, they propose a dynamic loss-dropping strategy, called DropLoss, that learns a detector from MaskCut's initial masks while encouraging the model to explore objects missed by MaskCut. Third, they further improve the performance of their method through multiple rounds of self-training.
Preliminaries
Normalized Cuts (NCut) treats the image segmentation problem as a graph-partitioning task: each image patch is represented as a node in a fully connected undirected graph, and each pair of nodes is connected by an edge whose weight measures the similarity between the two patches.
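As a rough illustration of this step, the relaxed NCut bipartition can be read off the second-smallest generalized eigenvector of the graph Laplacian. A minimal NumPy sketch, assuming per-patch features (e.g. DINO keys) are already extracted; the function name and the threshold value are illustrative:

```python
import numpy as np

def ncut_second_eigenvector(features, tau=0.15, eps=1e-5):
    """Relaxed Normalized Cut on a fully connected patch graph.

    features: (N, D) array of per-patch descriptors (e.g. DINO keys).
    Returns the second-smallest generalized eigenvector of
    (D - W) y = lambda * D y; its sign pattern bipartitions the patches.
    """
    # Cosine similarity between every pair of patch features.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    W = f @ f.T
    # Binarize edge weights (as later described for MaskCut).
    W = np.where(W < tau, eps, 1.0)
    # Degree vector and the symmetric normalized Laplacian.
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = D_inv_sqrt @ (np.diag(d) - W) @ D_inv_sqrt
    _, eigvecs = np.linalg.eigh(L_sym)  # eigenvalues in ascending order
    # Map back to the generalized eigenvector; take the second smallest.
    return D_inv_sqrt @ eigvecs[:, 1]
```

On two well-separated feature clusters, the two sign groups of the returned vector recover the clusters.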
DINO and TokenCut. DINO finds that self-supervised ViTs automatically learn a certain degree of perceptual grouping of image patches.
TokenCut leverages DINO features for NCut and obtains foreground/background segments in an image.
Vanilla NCut is limited to discovering a single object per image. The authors therefore propose MaskCut, which extends NCut to discover multiple objects per image by iteratively applying NCut to a masked similarity matrix. After obtaining the bipartition, they use two criteria to determine which group corresponds to the foreground:
- Intuitively, the foreground patches should be more prominent than the background patches; therefore, the foreground mask $M^t$ should contain the patch corresponding to the maximum absolute value in the second-smallest eigenvector.
- The authors incorporate a simple but empirically effective object-centric prior: the foreground set should contain fewer than two of the four image corners.

They reverse the partitioning of the foreground and background, i.e., $M^t_{ij} = 1 - M^t_{ij}$, if criterion 2 is not satisfied, or if criterion 1 is not satisfied while the current foreground set contains two corners. In practice, the authors also set all $W_{ij} < \tau^{ncut}$ to $10^{-5}$ and all $W_{ij} \geq \tau^{ncut}$ to $1$.
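A simplified sketch of this foreground-selection rule, assuming the eigenvector has been reshaped onto an h-by-w patch grid and using a single "swap if either criterion fails" reading (the paper's exact rule and hyper-parameters may differ):

```python
import numpy as np

def select_foreground(eigvec, grid_shape):
    """Choose the foreground side of an NCut bipartition (illustrative).

    Criterion 1: the foreground must contain the patch with the maximum
    absolute value in the second-smallest eigenvector.
    Criterion 2: the foreground must contain fewer than two image corners.
    If either criterion fails, foreground and background are swapped.
    """
    h, w = grid_shape
    # Mean-threshold bipartition of the eigenvector.
    mask = (eigvec > eigvec.mean()).reshape(h, w).astype(np.uint8)
    peak = np.abs(eigvec).argmax()
    crit1 = mask.flat[peak] == 1
    corners = int(mask[0, 0]) + int(mask[0, -1]) + int(mask[-1, 0]) + int(mask[-1, -1])
    crit2 = corners < 2
    if not (crit1 and crit2):
        mask = 1 - mask  # reverse partition: M_ij <- 1 - M_ij
    return mask
```

On a synthetic eigenvector with a bright central blob, both the vector and its negation yield the same central foreground mask.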
To get a mask for the $t$-th object, MaskCut masks out the patches belonging to the foreground masks of the previous stages in the similarity matrix, then applies NCut again to this masked matrix.
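The iterative discovery loop can be sketched as follows. The exact masking scheme in the paper may differ; disconnecting the rows and columns of previously discovered foreground patches is an illustrative choice, and the corner-based foreground selection is omitted for brevity:

```python
import numpy as np

def maskcut(W, num_objects=3, eps=1e-5):
    """Discover multiple objects by repeated NCut on a masked graph.

    W: (N, N) binarized patch-similarity matrix.
    After each round, patches assigned to the discovered foreground are
    disconnected from the graph before the next round (illustrative).
    """
    masks, W = [], W.astype(float).copy()
    for _ in range(num_objects):
        d = W.sum(axis=1)
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        L_sym = D_inv_sqrt @ (np.diag(d) - W) @ D_inv_sqrt
        _, eigvecs = np.linalg.eigh(L_sym)
        y = D_inv_sqrt @ eigvecs[:, 1]   # second-smallest eigenvector
        fg = y > y.mean()                # simple mean-threshold bipartition
        masks.append(fg.astype(np.uint8))
        # Mask out the discovered patches for the next iteration.
        W[fg, :] = eps
        W[:, fg] = eps
        np.fill_diagonal(W, 1.0)
    return masks
```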
A standard detection loss penalizes predicted regions $r_i$ that do not overlap with the "ground truth"; in this unsupervised setting, that discourages the detector from exploring regions missed by MaskCut. The authors therefore drop the loss for predicted regions whose overlap with the coarse masks is small:

$$\mathcal{L}_{drop}(r_i) = \mathbb{1}\left(\text{IoU}^{max}_i > \tau^{IoU}\right) \mathcal{L}_{vanilla}(r_i)$$

where $\text{IoU}^{max}_i$ denotes the maximum IoU of $r_i$ with all "ground-truth" masks and $\mathcal{L}_{vanilla}$ refers to the vanilla loss function of the detectors.
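The loss-dropping rule reduces to masking per-region losses by an IoU indicator. A minimal sketch with NumPy stand-ins for the detector's per-region losses (names and the default threshold are illustrative):

```python
import numpy as np

def drop_loss(vanilla_losses, max_ious, tau_iou=0.01):
    """DropLoss: keep the loss only for regions that overlap the pseudo
    ground truth, so low-overlap regions (potential objects missed by
    MaskCut) are not penalized.

    vanilla_losses: per-region detection losses L_vanilla(r_i).
    max_ious: IoU^max_i of each region with all pseudo ground truths.
    """
    keep = (max_ious > tau_iou).astype(float)  # indicator 1(IoU > tau)
    return float((keep * vanilla_losses).sum())
```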
Empirically, the authors found that despite learning from the coarse masks obtained by MaskCut, detection models "clean" the ground truth and produce masks (and boxes) that are better than the initial coarse masks used for training. The detectors refine mask quality, and the DropLoss strategy encourages them to discover new object masks. The authors thus leverage this property and use multiple rounds of self-training to improve the detector's performance.
The authors use the predicted masks and proposals with a confidence score over a fixed threshold as the pseudo ground truth for the next round of self-training.
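A sketch of the pseudo-label filtering between self-training rounds, using a hypothetical record format (the `mask`, `box`, and `score` keys are assumptions, and the threshold is left as a parameter):

```python
def next_round_ground_truth(predictions, threshold):
    """Keep only confident predictions as the next round's pseudo labels.

    predictions: list of dicts with 'mask', 'box' and 'score' entries
    (a hypothetical record format).
    threshold: minimum confidence score for a prediction to be kept.
    """
    return [p for p in predictions if p["score"] > threshold]
```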