This file documents a collection of models reported in our paper. The training time was measured on with 8 NVIDIA V100 GPUs & NVLink.
The "Name" column contains a link to the config file.
To train a model, run:
python train_net.py --num-gpus 8 --config-file /path/to/config/name.yaml
To evaluate a model with a trained/pretrained model, run:
python train_net.py --num-gpus 8 --config-file /path/to/config/name.yaml --eval-only MODEL.WEIGHTS /path/to/weight.pth
An example of cross-dataset evaluation:
python train_net.py --num-gpus 8 --config-file configs/partimagenet/r50_partimagenet.yaml --eval-only MODEL.WEIGHTS models/r50_pascalpart.pth
Before training, make sure Preparing Datasets and Preparing Models are well-prepared.
Config | All(40) AP | quad-: head | quad-: body | quad-: foot | quad-: tail | Training time | Download |
---|---|---|---|---|---|---|---|
pascal_part | 4.5 | 17.4 | 0.1 | 0.0 | 2.9 | 1h | model |
+ IN-S11 label | 5.4 | 23.6 | 3.4 | 0.8 | 1.2 | 1.5h | model |
+ IN-S11 parsed | 7.8 | 35.0 | 15.2 | 3.5 | 8.9 | 3h | model |
Config | All(40) AP | quad-: head | quad-: body | quad-: foot | quad-: tail | Training time | Download |
---|---|---|---|---|---|---|---|
pascal_part | 4.5 | 17.4 | 0.1 | 0.0 | 2.9 | 1h | model |
+ LVIS_PACO | 7.8 | 22.9 | 7.1 | 0.3 | 4.0 | 15h + 2.5h | model |
+ IN-S11 label | 8.8 | 26.3 | 3.7 | 0.4 | 1.0 | 3h | model |
+ IN-S11 parsed | 11.8 | 47.5 | 13.4 | 4.5 | 14.8 | 3h | model |
- The evaluation metric is mAPmask@[0.5:0.95] on the validation set of PartImageNet.
- pascal_part + LVIS_PACO is training first(15h) on LVIS and PACO r50_lvis_paco.pth, then(2.5h) on LVIS, PACO and Pascal Part.
- Before training on IN-S11 parsed, generate IN-S11 parsed(20min) by:
python train_net.py --num-gpus 8 --config-file configs/ann_parser/build_pascalpart.yaml --eval-only
python train_net.py --num-gpus 8 --config-file configs/ann_parser/find_ins11_mixer.yaml --eval-only
or download partimagenet_parsed.json and put it to $VLPart_ROOT/datasets/partimagenet/
.
Config | All(93) AP/AP50 | Base(77) AP/AP50 | Novel(16) AP/AP50 | dog: head | dog: torso | dog: leg | dog: paw | dog: tail | Training time | Download |
---|---|---|---|---|---|---|---|---|---|---|
pascal_part_base | 15.0/33.4 | 17.8/39.6 | 1.5/3.7 | 6.1 | 7.9 | 2.9 | 13.8 | 3.2 | 1h | model |
+ VOC object | 16.8/36.8 | 19.9/43.3 | 2.1/5.9 | 29.9 | 22.6 | 3.2 | 12.4 | 2.1 | 1.5h | model |
+ IN-S20 label | 17.4/37.5 | 20.8/44.7 | 1.1/3.1 | 12.8 | 17.8 | 2.0 | 5.9 | 0.9 | 3h | model |
+ IN-S20 parsed | 18.4/39.4 | 21.3/45.3 | 4.2/11.0 | 28.7 | 34.8 | 17.2 | 5.7 | 14.3 | 4.5h | model |
- The evaluation metric is [email protected] on the validation set of Pascal Part.
- Before training on IN-S20 parsed, generate IN-S20 parsed(50min) by:
python train_net.py --num-gpus 8 --config-file configs/ann_parser/build_pascalpartbase.yaml --eval-only
python train_net.py --num-gpus 8 --config-file configs/ann_parser/find_ins20_mixer.yaml --eval-only
or download imagenet_voc_image_parsed.json and put it to $VLPart_ROOT/datasets/imagenet/
.
R50 Mask R-CNN:
Name | VOC AP/AP50 | COCO AP/AP50 | LVIS AP/APr | PartImageNet AP/AP50 | Pascal Part AP/AP50 | PACO AP/AP50 |
---|---|---|---|---|---|---|
Dataset-specific | 35.9/69.7 | 38.0/60.8 | 28.1/20.8 | 29.7/54.1 | 19.4/42.3 | 10.6/21.7 |
Config | r50_voc | r50_coco | r50_lvis | r50_partimagenet | r50_pascalpart | r50_paco |
Training Time | 2h | 6.5h | 7h | 2h | 1h | 7h |
Download | r50_voc.pth | r50_coco.pth | r50_lvis.pth | r50_partimagenet.pth | r50_pascalpart.pth | r50_paco.pth |
Config | VOC AP/AP50 | COCO AP/AP50 | LVIS AP/APr | PartImageNet AP/AP50 | Pascal Part AP/AP50 | PACO AP/AP50 | Training time | Download |
---|---|---|---|---|---|---|---|---|
joint | 44.5/70.3 | 29.0/48.1 | 27.3/19.0 | 5.4/11.3 | 4.9/11.3 | 9.6/19.5 | 15h | model |
joint* | 42.8/70.8 | 28.6/48.0 | 26.8/20.4 | 7.8/15.3 | 21.6/46.3 | 9.3/18.9 | 15h + 2.5h | model |
joint** | 40.6/69.3 | 28.4/47.8 | 26.4/16.0 | 29.1/52.0 | 22.6/47.8 | 9.3/18.9 | 15h + 3h | model |
+ IN label | 38.0/67.8 | 28.2/47.8 | 26.0/15.9 | 30.8/54.4 | 23.6/49.2 | 9.0/18.7 | 15h + 3h + 4h | model |
+ IN parsed | 38.3/67.8 | 28.5/47.8 | 26.2/17.8 | 31.6/55.7 | 24.0/49.8 | 9.6/20.2 | 15h + 3h + 6h | model |
- joint is training on LVIS and PACO.
- joint* is training first(15h) on LVIS and PACO, then(2.5h) on LVIS, PACO, Pascal Part.
- joint** is training first(15h) on LVIS and PACO, then(3h) on LVIS, PACO, Pascal Part, PartImageNet.
- Before training on IN parsed, generate IN parsed(100min) by:
bash tools/golden_image_parse.sh
or download golden_image_parsed.zip, put it to $VLPart_ROOT/datasets/imagenet/
and unzip it.
SwinBase Cascade Mask R-CNN:
Name | VOC AP/AP50 | COCO AP/AP50 | LVIS AP/APr | PartImageNet AP/AP50 | Pascal Part AP/AP50 | PACO AP/AP50 |
---|---|---|---|---|---|---|
Dataset-specific | 59.0/82.0 | 52.5/72.0 | 43.1/38.7 | 41.7/68.7 | 27.4/56.1 | 15.2/29.4 |
Config | swinbase_voc | swinbase_coco | swinbase_lvis | swinbase_partimagenet | swinbase_pascal_part | swinbase_paco |
Training Time | 4h | 1day15h | 1day15h | 4.5h | 1.5h | 1day2h |
Download | swinbase_voc.pth | swinbase_coco.pth | swinbase_lvis.pth | swinbase_partimagenet.pth | swinbase_pascalpart.pth | swinbase_paco.pth |
Config | VOC AP/AP50 | COCO AP/AP50 | LVIS AP/APr | PartImageNet AP/AP50 | Pascal Part AP/AP50 | PACO AP/AP50 | Training time | Download |
---|---|---|---|---|---|---|---|---|
joint | 55.2/72.2 | 41.0/58.4 | 41.3/32.8 | 6.9/13.7 | 5.6/12.5 | 15.9/31.9 | 2day5h | model |
joint* | 52.6/72.4 | 40.4/57.9 | 39.9/29.8 | 11.8/21.8 | 30.5/59.3 | 15.4/30.2 | 2day5h + 4.5h | model |
joint** | 50.3/71.6 | 40.3/57.8 | 39.6/30.3 | 40.0/64.8 | 31.2/60.5 | 15.4/30.3 | 2day5h + 6h | model |
+ IN label | 48.1/69.7 | 40.3/57.7 | 39.3/28.9 | 41.2/66.8 | 31.7/61.1 | 15.9/30.8 | 2day5h + 6h + 8h | model |
+ IN parsed | 47.8/69.7 | 40.5/58.1 | 39.6/30.5 | 42.0/68.2 | 31.9/61.6 | 15.6/30.6 | 2day5h + 6h + 20h | model |
- joint is training on LVIS and PACO.
- joint* is training first(2day5h) on LVIS and PACO, then(4.5h) on LVIS, PACO, Pascal Part.
- joint** is training first(2day5h) on LVIS and PACO, then(6h) on LVIS, PACO, Pascal Part, PartImageNet.
- Before training on IN parsed, generate IN parsed(70min) by:
bash tools/golden_image_parse_swinbase.sh
or download golden_image_parsed_swinbase.zip, put it to $VLPart_ROOT/datasets/imagenet/
and unzip it.