Commit 7ea7838 (0 parents): 111 changed files, 19,210 additions, 0 deletions.
`.gitignore`:

```
/.vscode
/datasets
/logs
/output

**/__pycache__/
**/*.pyc
**/.DS_Store

pretrained_checkpoint
imgs/

eval_infer_everything.sh
pre-log.txt
recall_eval.py
save_seg_every.sh

INSTALL.md
requirements.txt
demo
matches.json

detectron2
del_param.py
modified_model.pth
gradio_cached_examples/12/log.csv
frozenseg/modeling/pixel_decoder/ops

gradio*
```
`INSTALL.md`:
## Installation
The codebase is built on top of [Detectron2](https://detectron2.readthedocs.io/tutorials/install.html).

### Dependencies and Installation
```bash
conda create --name frozenseg python=3.10 -y
conda activate frozenseg
conda install pytorch==2.3.1 torchvision==0.18.1 -c pytorch -c nvidia

# under your working directory
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
pip install git+https://github.com/cocodataset/panopticapi.git
pip install git+https://github.com/mcordts/cityscapesScripts.git

git clone https://github.com/chenxi52/FrozenSeg.git
cd FrozenSeg
pip install -r requirements.txt

# compile the CUDA kernel for MSDeformAttn
cd frozenseg/modeling/pixel_decoder/ops
sh make.sh
cd ../../../..
```
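
As a quick post-install sanity check, the following minimal sketch verifies the core dependencies and the compiled kernel. The `MSDeformAttn` import path is an assumption based on the Mask2Former-style layout of `frozenseg/modeling/pixel_decoder/ops`:

```python
# Rough post-install check (illustrative; the MSDeformAttn import path is assumed).
import torch
import detectron2

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("detectron2:", detectron2.__version__)

try:
    # Assumes the ops package mirrors Mask2Former's module layout.
    from frozenseg.modeling.pixel_decoder.ops.modules import MSDeformAttn
    print("MSDeformAttn CUDA kernel: OK")
except ImportError as err:
    print("MSDeformAttn kernel not built yet:", err)
```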
`README.md`:
# FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation

>[**FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation**]()
XXXX

## Abstract

>Open-vocabulary segmentation is challenging: it requires segmenting and recognizing objects from an open set of categories in unconstrained environments. Building on the success of powerful vision-language (ViL) foundation models such as CLIP, recent efforts have sought to harness their zero-shot capabilities to recognize unseen categories. Despite strong performance, these methods still face the fundamental challenge of generating precise mask proposals for unseen categories and scenarios, which ultimately degrades segmentation quality. To address this, we introduce FrozenSeg, a novel approach that integrates spatial knowledge from a localization foundation model (e.g., SAM) with semantic knowledge extracted from a ViL model (e.g., CLIP) in a synergistic framework. Taking the ViL model's visual encoder as the feature backbone, we inject space-aware features into the learnable queries and the CLIP features within the transformer decoder. In addition, we devise a mask proposal ensemble strategy that further improves recall and mask quality. To fully exploit pre-trained knowledge while minimizing training overhead, we freeze both foundation models and focus optimization solely on a lightweight transformer decoder for mask proposal generation, the performance bottleneck. Extensive experiments show that FrozenSeg advances state-of-the-art results across various segmentation benchmarks, trained exclusively on COCO panoptic data and tested in a zero-shot manner.
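
For intuition only, the query-injection idea described above can be sketched as a cross-attention step in which the learnable mask queries attend to frozen SAM features before the usual decoding against CLIP features. This is a hypothetical illustration, not the FrozenSeg implementation; every name in it (`SpaceAwareQueryInjection`, `sam_feats`, the dimensions) is assumed.

```python
import torch
import torch.nn as nn

class SpaceAwareQueryInjection(nn.Module):
    """Illustrative sketch: enrich learnable queries with frozen SAM spatial features."""
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries: torch.Tensor, sam_feats: torch.Tensor) -> torch.Tensor:
        # queries: (B, N, C) learnable mask queries
        # sam_feats: (B, H*W, C) flattened features from a frozen SAM image encoder
        injected, _ = self.cross_attn(queries, sam_feats, sam_feats)
        return self.norm(queries + injected)  # residual keeps the original query content
```

In the paper's framing, both foundation encoders stay frozen; only decoder-side modules like this one would receive gradients.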

## Updates
- **XXX** Code is available.

## Dependencies and Installation
See [installation instructions](INSTALL.md).

## Getting Started
See [Preparing Datasets for FC-CLIP](datasets/README.md).

See [Getting Started with FC-CLIP](GETTING_STARTED.md).

## Models
<table>
<thead>
  <tr>
    <th align="center"></th>
    <th align="center" style="text-align:center" colspan="3"><a href="logs/testing/ade20k.log">ADE20K (A-150)</a></th>
    <th align="center" style="text-align:center" colspan="3"><a href="logs/testing/cityscapes.log">Cityscapes</a></th>
    <th align="center" style="text-align:center" colspan="2"><a href="logs/testing/mapillary_vistas.log">Mapillary Vistas</a></th>
    <th align="center" style="text-align:center"><a href="logs/testing/a-847.log">ADE20K-Full <br> (A-847)</a></th>
    <th align="center" style="text-align:center"><a href="logs/testing/pc-59.log">Pascal Context 59 <br> (PC-59)</a></th>
    <th align="center" style="text-align:center"><a href="logs/testing/pc-459.log">Pascal Context 459 <br> (PC-459)</a></th>
    <th align="center" style="text-align:center"><a href="logs/testing/pc-21.log">Pascal VOC 21 <br> (PAS-21)</a></th>
    <th align="center" style="text-align:center"><a href="logs/testing/pc-20.log">Pascal VOC 20 <br> (PAS-20)</a></th>
    <th align="center" style="text-align:center" colspan="3"><a href="logs/testing/coco.log">COCO <br> (training dataset)</a></th>
    <th align="center" style="text-align:center">download</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td align="center"></td>
    <td align="center">PQ</td>
    <td align="center">mAP</td>
    <td align="center">mIoU</td>
    <td align="center">PQ</td>
    <td align="center">mAP</td>
    <td align="center">mIoU</td>
    <td align="center">PQ</td>
    <td align="center">mIoU</td>
    <td align="center">mIoU</td>
    <td align="center">mIoU</td>
    <td align="center">mIoU</td>
    <td align="center">mIoU</td>
    <td align="center">mIoU</td>
    <td align="center">PQ</td>
    <td align="center">mAP</td>
    <td align="center">mIoU</td>
    <td align="center"></td>
  </tr>
  <tr>
    <td align="center"><a href="configs/coco/panoptic-segmentation/fcclip/fcclip_convnext_large_eval_ade20k_r50.yaml">FC-CLIP (ResNet50)</a></td>
    <td align="center">17.9</td>
    <td align="center">9.5</td>
    <td align="center">23.3</td>
    <td align="center">40.3</td>
    <td align="center">21.6</td>
    <td align="center">53.2</td>
    <td align="center">15.9</td>
    <td align="center">24.4</td>
    <td align="center">7.1</td>
    <td align="center">50.5</td>
    <td align="center">12.9</td>
    <td align="center">75.9</td>
    <td align="center">89.5</td>
    <td align="center">50.7</td>
    <td align="center">40.7</td>
    <td align="center">58.8</td>
    <td align="center"><a href="https://drive.google.com/file/d/1tcB-8FNON-LwckXQbUyKcBA2G7TU65Zh/view?usp=sharing">checkpoint</a></td>
  </tr>
  <tr>
    <td align="center"><a href="configs/coco/panoptic-segmentation/fcclip/fclip_convnext_large_eval_ade20k_r101.yaml">FC-CLIP (ResNet101)</a></td>
    <td align="center">19.1</td>
    <td align="center">10.2</td>
    <td align="center">24.0</td>
    <td align="center">40.9</td>
    <td align="center">24.1</td>
    <td align="center">53.9</td>
    <td align="center">16.7</td>
    <td align="center">23.2</td>
    <td align="center">7.7</td>
    <td align="center">48.9</td>
    <td align="center">12.3</td>
    <td align="center">77.6</td>
    <td align="center">91.3</td>
    <td align="center">51.4</td>
    <td align="center">41.6</td>
    <td align="center">58.9</td>
    <td align="center"><a href="https://drive.google.com/file/d/1P0mdgftReWzVbPQ0O0CSBfW3krHhTOj0/view?usp=sharing">checkpoint</a></td>
  </tr>
</tbody>
</table>

## Model evaluation
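
This section is left empty in this commit. As a stopgap, single-image panoptic inference with a released checkpoint can follow the same configuration pattern as the demo script below; the config path and checkpoint filename are copied from that script and should be treated as assumptions for other setups.

```python
# Sketch: single-image panoptic inference, mirroring the demo script's setup.
import cv2
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor  # the demo ships its own variant in demo/predictor.py

from frozenseg import add_maskformer2_config, add_frozenseg_config

cfg = get_cfg()
add_maskformer2_config(cfg)
add_frozenseg_config(cfg)
cfg.merge_from_file("configs/coco/frozenseg/convnext_large_eval_ade20k.yaml")
cfg.MODEL.WEIGHTS = "./modified_model.pth"  # local checkpoint, as in the demo script
cfg.MODEL.MASK_FORMER.TEST.PANOPTIC_ON = True

predictor = DefaultPredictor(cfg)
outputs = predictor(cv2.imread("demo/examples/ADE_val_00000001.jpg"))
panoptic_seg, segments_info = outputs["panoptic_seg"]
print(panoptic_seg.shape, len(segments_info))
```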

## <a name="Citing FrozenSeg"></a>Citing FrozenSeg

If you use FrozenSeg in your research, please use the following BibTeX entry.
```BibTeX
@inproceedings{yu2023fcclip,
  title={Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP},
  author={Qihang Yu and Ju He and Xueqing Deng and Xiaohui Shen and Liang-Chieh Chen},
  booktitle={NeurIPS},
  year={2023}
}
```
Gradio demo script:

```python
import os
import sys

# os.system("pip install gdown")
# os.system("pip install imutils")
# os.system("pip install gradio_client==0.2.7")
# os.system("python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'")
# os.system("pip install git+https://github.com/cocodataset/panopticapi.git")
# os.system("python fcclip/modeling/pixel_decoder/ops/setup.py build install")

import gradio as gr
from detectron2.utils.logger import setup_logger
from contextlib import ExitStack
import numpy as np
import cv2
import torch
import itertools
from detectron2.config import get_cfg
from detectron2.utils.visualizer import ColorMode, random_color
from detectron2.data import MetadataCatalog

from frozenseg import add_maskformer2_config, add_frozenseg_config
from demo.predictor import DefaultPredictor, OpenVocabVisualizer
from PIL import Image
import json

setup_logger()
logger = setup_logger(name="fcclip")

# Build the config and a single shared predictor at startup.
cfg = get_cfg()
cfg.MODEL.DEVICE = 'cuda'
add_maskformer2_config(cfg)
add_frozenseg_config(cfg)
cfg.merge_from_file("configs/coco/frozenseg/convnext_large_eval_ade20k.yaml")
# os.system("gdown 1-91PIns86vyNaL3CzMmDD39zKGnPMtvj")
cfg.MODEL.WEIGHTS = './modified_model.pth'
cfg.MODEL.MASK_FORMER.TEST.SEMANTIC_ON = False
cfg.MODEL.MASK_FORMER.TEST.INSTANCE_ON = False
cfg.MODEL.MASK_FORMER.TEST.PANOPTIC_ON = True
predictor = DefaultPredictor(cfg)

title = "FrozenSeg"
article = "<p style='text-align: center'><a href='' target='_blank'>FrozenSeg</a> | <a href='' target='_blank'>Github Repo</a></p>"

examples = [
    [
        "demo/examples/ADE_val_00000001.jpg",
        "",
        ["ADE (150 categories)"],
    ],
    [
        "demo/examples/frankfurt_000000_005898_leftImg8bit.png",
        "",
        ["Cityscapes (19 categories)"],
    ],
]

# Dataset metadata used to assemble the demo vocabulary.
coco_metadata = MetadataCatalog.get("openvocab_coco_2017_val_panoptic_with_sem_seg")
ade20k_metadata = MetadataCatalog.get("openvocab_ade20k_panoptic_val")
cityscapes_metadata = MetadataCatalog.get("openvocab_cityscapes_fine_panoptic_val")
with open("./frozenseg/data/datasets/lvis_1203_with_prompt_eng.txt", 'r') as f:
    lvis_classes = f.read().splitlines()
lvis_classes = [x[x.find(':')+1:] for x in lvis_classes]
lvis_colors = list(
    itertools.islice(itertools.cycle(coco_metadata.stuff_colors), len(lvis_classes))
)
# rearrange to thing_classes, stuff_classes
coco_thing_classes = coco_metadata.thing_classes
coco_stuff_classes = [x for x in coco_metadata.stuff_classes if x not in coco_thing_classes]
coco_thing_colors = coco_metadata.thing_colors
coco_stuff_colors = [x for x in coco_metadata.stuff_colors if x not in coco_thing_colors]
ade20k_thing_classes = ade20k_metadata.thing_classes
ade20k_stuff_classes = [x for x in ade20k_metadata.stuff_classes if x not in ade20k_thing_classes]
ade20k_thing_colors = ade20k_metadata.thing_colors
ade20k_stuff_colors = [x for x in ade20k_metadata.stuff_colors if x not in ade20k_thing_colors]
cityscapes_stuff_classes = cityscapes_metadata.stuff_classes
cityscapes_stuff_color = cityscapes_metadata.stuff_colors
cityscapes_thing_classes = cityscapes_metadata.thing_classes
cityscapes_thing_color = cityscapes_metadata.thing_colors

def build_demo_classes_and_metadata(vocab, label_list):
    """Merge user-supplied extra vocabulary with the selected dataset vocabularies."""
    extra_classes = []
    if vocab:
        for words in vocab.split(";"):
            extra_classes.append(words)
    extra_colors = [random_color(rgb=True, maximum=1) for _ in range(len(extra_classes))]
    print("extra_classes:", extra_classes)

    demo_thing_classes = extra_classes
    demo_stuff_classes = []
    demo_thing_colors = extra_colors
    demo_stuff_colors = []

    if any("COCO" in label for label in label_list):
        demo_thing_classes += coco_thing_classes
        demo_stuff_classes += coco_stuff_classes
        demo_thing_colors += coco_thing_colors
        demo_stuff_colors += coco_stuff_colors
    if any("ADE" in label for label in label_list):
        demo_thing_classes += ade20k_thing_classes
        demo_stuff_classes += ade20k_stuff_classes
        demo_thing_colors += ade20k_thing_colors
        demo_stuff_colors += ade20k_stuff_colors
    if any("LVIS" in label for label in label_list):
        demo_thing_classes += lvis_classes
        demo_thing_colors += lvis_colors
    if any("Cityscapes" in label for label in label_list):
        demo_thing_classes += cityscapes_thing_classes
        demo_thing_colors += cityscapes_thing_color
        demo_stuff_classes += cityscapes_stuff_classes
        demo_stuff_colors += cityscapes_stuff_color

    # Register fresh metadata for this vocabulary selection.
    MetadataCatalog.pop("fcclip_demo_metadata", None)
    demo_metadata = MetadataCatalog.get("fcclip_demo_metadata")
    demo_metadata.thing_classes = demo_thing_classes
    demo_metadata.stuff_classes = demo_thing_classes + demo_stuff_classes
    demo_metadata.thing_colors = demo_thing_colors
    demo_metadata.stuff_colors = demo_thing_colors + demo_stuff_colors
    demo_metadata.stuff_dataset_id_to_contiguous_id = {
        idx: idx for idx in range(len(demo_metadata.stuff_classes))
    }
    demo_metadata.thing_dataset_id_to_contiguous_id = {
        idx: idx for idx in range(len(demo_metadata.thing_classes))
    }
    demo_classes = demo_thing_classes + demo_stuff_classes
    return demo_classes, demo_metadata

def inference(image_path, vocab, label_list):
    logger.info("building class names")
    vocab = vocab.replace(", ", ",").replace("; ", ";")
    demo_classes, demo_metadata = build_demo_classes_and_metadata(vocab, label_list)
    predictor.set_metadata(demo_metadata)
    im = cv2.imread(image_path)
    outputs = predictor(im)
    v = OpenVocabVisualizer(im[:, :, ::-1], demo_metadata, instance_mode=ColorMode.IMAGE)
    panoptic_result = v.draw_panoptic_seg(outputs["panoptic_seg"][0].to("cpu"), outputs["panoptic_seg"][1]).get_image()
    return Image.fromarray(np.uint8(panoptic_result)).convert('RGB')

with gr.Blocks(
    title=title,
    css="""
    #submit {background: #3498db; color: white; border: none; padding: 10px 20px; border-radius: 5px; width: 20%; margin: 0 auto; display: block;}
    """,
) as demo:
    gr.Markdown("<h1 style='text-align: center; margin-bottom: 1rem'>" + title + "</h1>")
    input_components = []
    output_components = []

    with gr.Row():
        output_image_gr = gr.Image(label="Panoptic Segmentation Output", type="pil")
        output_components.append(output_image_gr)

    with gr.Row().style(equal_height=True):
        with gr.Column(scale=3, variant="panel") as input_component_column:
            input_image_gr = gr.Image(type="filepath", label="Input Image")
            extra_vocab_gr = gr.Textbox(label="Extra Vocabulary (separated by ;)", placeholder="house;sky")
            category_list_gr = gr.CheckboxGroup(
                choices=["COCO (133 categories)", "ADE (150 categories)", "LVIS (1203 categories)", "Cityscapes (19 categories)"],
                label="Category to use",
            )
            input_components.extend([input_image_gr, extra_vocab_gr, category_list_gr])

        with gr.Column(scale=2):
            examples_handler = gr.Examples(
                examples=examples,
                inputs=[c for c in input_components if not isinstance(c, gr.State)],
                outputs=[c for c in output_components if not isinstance(c, gr.State)],
                fn=inference,
                cache_examples=torch.cuda.is_available(),
                examples_per_page=5,
            )
            with gr.Row():
                clear_btn = gr.Button("Clear")
                submit_btn = gr.Button("Submit", variant="primary")

    gr.Markdown(article)

    submit_btn.click(
        inference,
        input_components,
        output_components,
        api_name="predict",
        scroll_to_output=True,
    )

    # Reset all inputs/outputs client-side; the cleared values are baked into the JS at build time.
    clear_btn.click(
        None,
        [],
        (input_components + output_components + [input_component_column]),
        _js=f"""() => {json.dumps(
            [component.cleared_value if hasattr(component, "cleared_value") else None
             for component in input_components + output_components]
            + ([gr.Column.update(visible=True)])
            + ([gr.Column.update(visible=False)])
        )}
        """,
    )

demo.launch(server_port=7881)
```