Improved Features, tested on internal data #3

nklpolley opened this issue Jan 15, 2025 · 1 comment

@nklpolley

Hey @StepTurtle, great repo! I tested the unified model on our internal data. I tried out some changes that greatly increased the performance on our data and want to share them with you. If desired, I can create a second branch for merging, but this might take some time.

OpenClip

  • The detection crop is currently passed to OpenClip in BGR, but the model expects RGB:
      elif self.open_clip_run:

          #print("running clip", self.classes[class_id])
          # Run OpenClip
          # and accept as a valid object if the score is greater than 0.9
          detection_image = image[
              int(xyxy[1]) : int(xyxy[3]), int(xyxy[0]) : int(xyxy[2]), :
          ]

          # convert detection image from BGR to RGB
          detection_image = detection_image[:, :, ::-1]
          
          pil_image = Image.fromarray(detection_image)
  • The self.preprocess of the currently used CLIP model uses Resize(224) and CenterCrop(224,224).
    This is not suitable for highly rectangular images like the Dino detections of license plates: because the center crop removes the majority (the left and right parts) of the image, CLIP only receives the middle part of the license plate.
    It is better to keep the original rectangular license plate image and square it by padding with black pixels at the top and bottom.

    I currently overwrite self.preprocess like this. It is not ideal, as this hardcoded preprocess no longer supports an easy swap of CLIP models that might require different preprocessing:

import open_clip
import open_clip.transform as oct
import torchvision
from PIL import Image
...
        self.model, _, self.original_preprocess = open_clip.create_model_and_transforms(
            "ViT-B-32",
            pretrained="laion2b_s34b_b79k",
            precision="fp32",
            device=device,
        )

        self.preprocess = torchvision.transforms.Compose([
            oct.ResizeKeepRatio(224, interpolation=Image.BICUBIC, longest=1),
            oct.CenterCropOrPad(224, fill=0),
            torchvision.transforms.ToTensor(),
            torchvision.transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))
            ])
  • As a side note, I think the softmax here is not ideal, as it can let images that are not any of the provided classes receive high probability values. At the moment I don't have a better idea, though. I did some rough tests with just using the similarity metric without the softmax,
    text_probs = (100.0 * image_features @ text_features.T)
    and then using prob[class_id] as the class score, but it didn't improve the results (see the sketch below).
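    For reference, a minimal sketch of the two scoring variants (assuming image_features and text_features are the L2-normalized CLIP embeddings and class_id is the index of the prompt being checked, as in the snippet above):

        similarities = 100.0 * image_features @ text_features.T   # (1, num_classes), absolute scale
        softmax_probs = similarities.softmax(dim=-1)               # (1, num_classes), always sums to 1

        # The softmax only says which prompt fits best relative to the others, so a
        # crop that matches none of the prompts can still get a high value for one
        # class. The raw similarity keeps an absolute scale but needs its own threshold.
        raw_score = similarities[0, class_id].item()
        softmax_score = softmax_probs[0, class_id].item()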

GroundingDino

The label caption approach does not work for this use case. It merges classes nonsensically and reduces the number of detected bboxes per class, which often leads to faces not being detected. A recent commit of the GroundingDINO repository added the remove_combined flag that removes these nonsensical class combinations. I reimplemented this but still had problems: on some test images I noticed that if only a person's head is visible, GroundingDINO will create the "person" bbox but not the "face" bbox. If I remove the label "person" from the caption, Dino will correctly draw the "face" bbox.
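For reference, a hedged sketch of the remove_combined variant (the flag exists in recent GroundingDINO commits; the exact predict signature and the caption construction here may differ from the repo's Dino wrapper and are partly assumptions):

    from groundingdino.util.inference import predict

    # Sketch only: one joint caption, but remove_combined=True so tokens from
    # different classes are not merged into a single phrase.
    boxes, confidences, labels = predict(
        model=self.model,
        image=self.preprocess_image(image).to(self.device),
        caption=" . ".join(classes),
        box_threshold=box_threshold,
        text_threshold=text_threshold,
        device=self.device,
        remove_combined=True,
    )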

I could greatly increase its quality with the following steps.
First, I slightly adapted validation.json (added "a", which also seems to help CLIP):

{
    "prompts": [
        {
            "prompt": "a license plate",
            "should_inside": ["a car", "a bus", "a truck", "a minibus", "a motorcycle", "a trailer", "an utility vehicle", "a tractor", "a golf cart", "a semi-truck", "a moped", "a scooter"],
            "should_not_inside": []
        },
        {
            "prompt": "license plate",
            "should_inside": ["a car", "a bus", "a truck", "a minibus", "a motorcycle", "a trailer", "an utility vehicle", "a tractor", "a golf cart", "a semi-truck", "a moped", "a scooter"],
            "should_not_inside": []
        },
        {
            "prompt": "a human face",
            "should_inside": ["a person", "a human body"],
            "should_not_inside": []
        }
    ]
}

Second, I predict each class independently and therefore change the __call__ function:

    def __call__(self, image, classes, box_threshold, text_threshold) -> sv.Detections:
        all_boxes = []
        all_confidences = []
        all_labels = []

        for cls in classes:
            boxes, confidences, labels = predict(
                model=self.model,
                image=self.preprocess_image(image).to(self.device),
                caption=cls,
                box_threshold=box_threshold,
                text_threshold=text_threshold,
                device=self.device,
            )
            # annotated_frame = annotate(image_source=image, boxes=boxes, logits=confidences, phrases=labels)
            # cv2.imwrite(f"/path/to/folder/{cls}.jpg", annotated_frame)
            all_boxes.append(boxes)
            all_confidences.append(confidences)
            all_labels.append(labels)

        # Merge all detections from the per-class runs
        all_boxes = torch.cat(all_boxes, dim=0)
        all_confidences = torch.cat(all_confidences, dim=0)
        all_labels = sum(all_labels, [])

        source_h, source_w, _ = image.shape
        detections = Model.post_process_result(
            source_h=source_h, source_w=source_w, boxes=all_boxes, logits=all_confidences
        )
        class_id = Model.phrases2classes(phrases=all_labels, classes=classes)
        detections.class_id = class_id

        return detections

This drastically decreases the speed of GroundingDino (I measure it at ~3 seconds per image on an RTX 4090), so it is not suitable for real-time deployment.

Third: currently, non-maximum suppression is applied class-agnostically to all of Dino's bboxes. If person and face bounding boxes are at similar places, the bbox with the lower confidence is deleted; if the person bbox has the higher confidence, the face bbox is removed.
I replace the linked NMS so that it only suppresses boxes within a class and is no longer class-agnostic:

        "Batched NMS, to not suppress overlapping bboxes of different classes"
        boxes = torch.from_numpy(detections.xyxy)
        scores = torch.from_numpy(detections.confidence)
        class_ids = torch.from_numpy(np.array(detections.class_id, dtype=np.int64))
        nms_indices = torchvision.ops.batched_nms(boxes, scores, class_ids, self.nms_threshold)
        detections.xyxy = detections.xyxy[nms_indices.numpy()]
        detections.confidence = detections.confidence[nms_indices.numpy()]
        detections.class_id = detections.class_id[nms_indices.numpy()]

This batched NMS should be used even if the previous two steps are not adopted, as face detections are currently often removed.

YOLO

  • The YOLO model currently misses a lot of faces and license plates. I replaced the provided YOLOv11 weights with the 1080p_medium_v8.pt from here and use a confidence value of 0.15 (see the sketch below).
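    A minimal sketch of that swap, assuming the Ultralytics API used by the repo's YOLO wrapper (the weight filename is the one linked above; image is the BGR frame used elsewhere in the pipeline):

        from ultralytics import YOLO

        # Sketch only: replacement weights plus a lower confidence threshold for higher recall.
        model = YOLO("1080p_medium_v8.pt")
        results = model.predict(image, conf=0.15)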

Unified Language Model

  • With the changes described above, full cars are sometimes detected as license plates by GroundingDino and CLIP. I add the following constraints here:
        for index, (xyxy, _, score, class_id, _, _) in enumerate(detections):
            if self.classes[class_id] in self.detection_classes:
                # Remove license plates if their aspect ratio is < 2 (in Europe it is standardized to 5).
                # If a license plate is obstructed so much that its aspect ratio drops this far, it is
                # not readable anyway.
                # In our dataset, no license plate is larger than 200 pixels in height or 500 pixels in
                # width; this might differ for other cameras and focal lengths, but it is a good initial
                # value for 1080p cameras.
                if (class_id == 0) or (class_id == 13):  # the IDs of license plates; these should be determined dynamically instead of being magic numbers like here
                    width = xyxy[2] - xyxy[0]
                    height = xyxy[3] - xyxy[1]
                    if width / height < 2 or height > 200 or width > 500:
                        continue
  • In an anonymization task it is usually more important to increase recall: blurring a little too much is usually not that bad, but blurring too little can be catastrophic.
    Therefore I reduce the default OpenCLIP score threshold from 0.7 to 0.4:
openclip:
  run: True
  model_name: 'ViT-B-32'
  pretrained_model: 'laion2b_s34b_b79k'
  score_threshold: 0.4

I also adapt the "is_inside", "is_max", "is_high3" logic:

  if is_inside and is_max:
      valid_ids.append(index)

is_high3 is redundant since I decreased the threshold from 0.7 to 0.4; if the score is lower than 0.4, I want to make sure that the detection is at least the most likely of all available CLIP classes and inside another bbox.

Specific GPU

  • We have servers and workstations with multiple GPUs; even when changing the device to "cuda:1", YOLO still runs on "cuda:0". This is easily fixable by passing the device to YOLO and to its predict call (see the snippet below): results = self.model.predict(image, conf=confidence, device=self.device)
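    As a snippet with the device forwarded explicitly (a sketch; confidence and self.device are the values already used in the surrounding YOLO wrapper):

        # Forward the configured device so YOLO does not silently fall back to cuda:0
        # on multi-GPU machines.
        results = self.model.predict(image, conf=confidence, device=self.device)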

Additional Models

I added EgoBlur and RetinaFace after YOLO for even better anonymization. This might be too much for this repo, as e.g. RetinaFace requires TensorFlow.
EgoBlur, however, is easy to implement by creating the file autoware_rosbag2_anonymizer/model/egoblur.py:

import torch
import torchvision
import numpy as np
import supervision as sv


def get_detections(
    face_detector: torch.jit._script.RecursiveScriptModule,
    lp_detector: torch.jit._script.RecursiveScriptModule,
    image_tensor: torch.Tensor,
    model_score_threshold=0.1,
    nms_iou_threshold=0.3,
):
    """
    parameter face_detector: TorchScript module for face detection
    parameter lp_detector: TorchScript module for license plate detection
    parameter image_tensor: image tensor on which we want to make detections
    parameter model_score_threshold: model score threshold to filter out low confidence detections
    parameter nms_iou_threshold: NMS IoU threshold to filter out overlapping boxes

    Returns the detections as sv.Detections
    """
    with torch.no_grad():
        face_detections = face_detector(image_tensor)
        lp_detections = lp_detector(image_tensor)

    face_boxes, _, face_scores, _ = face_detections  # returns boxes, labels, scores, dims
    face_class_ids = torch.zeros_like(face_scores, dtype=torch.int64) + 14  # REQUIRES DYNAMIC ADAPTION TO FACE CLASS ID (here 14)

    lp_boxes, _, lp_scores, _ = lp_detections  # returns boxes, labels, scores, dims
    lp_class_ids = torch.zeros_like(lp_scores, dtype=torch.int64) + 0  # REQUIRES DYNAMIC ADAPTION TO LICENSE PLATE CLASS ID (here 0)

    boxes = torch.cat([face_boxes, lp_boxes], dim=0)
    scores = torch.cat([face_scores, lp_scores], dim=0)
    class_ids = torch.cat([face_class_ids, lp_class_ids], dim=0)

    # index class_ids with the NMS keep indices as well, so boxes, scores and
    # class ids stay aligned
    nms_keep_idx = torchvision.ops.nms(boxes, scores, nms_iou_threshold)
    boxes = boxes[nms_keep_idx].cpu().numpy()
    scores = scores[nms_keep_idx].cpu().numpy()
    class_ids = class_ids[nms_keep_idx].cpu().numpy()

    score_keep_idx = np.where(scores > model_score_threshold)[0]
    boxes = boxes[score_keep_idx]
    scores = scores[score_keep_idx]
    class_ids = class_ids[score_keep_idx]
    if boxes.size == 0:
        return sv.Detections.empty()

    detections = sv.Detections(
        xyxy=boxes,
        confidence=scores,
        class_id=class_ids,
    )
    
    return detections


class Egoblur():

    def __init__(self,device):
        self.device = device
        # download weights from here https://www.projectaria.com/tools/egoblur/
        self.face_detector = torch.jit.load("ego_blur_face.jit", map_location="cpu").to(device) # face detection
        self.face_detector.eval()
        
        self.lp_detector = torch.jit.load("ego_blur_lp.jit", map_location="cpu").to(device) # license plate detection
        self.lp_detector.eval()
    
    def get_image_tensor(self,bgr_image: np.ndarray) -> torch.Tensor:
        """
        parameter bgr_image: image on which we want to make detections

        Return the image tensor
        """
        bgr_image_transposed = np.transpose(bgr_image, (2, 0, 1))
        image_tensor = torch.from_numpy(bgr_image_transposed).to(self.device)

        return image_tensor
    
    def __call__(self, image, threshold=0.25):
        image_tensor = self.get_image_tensor(image)
        detections = get_detections(self.face_detector, self.lp_detector, image_tensor, threshold)
        return detections

and adding this into the unified language model here:

from autoware_rosbag2_anonymizer.model.egoblur import Egoblur
...
class UnifiedLanguageModel:
    def __init__(self, config, json, device) -> None:
        ...
        # Egoblur
        self.egoblur = Egoblur(self.device)
...

    def __call__(self, image: cv2.Mat) -> sv.Detections:
...
            yolo_detections.class_id = np.array(
                [
                    self.detection_classes.index(self.yolo_classes[yolo_id])
                    for yolo_id in yolo_detections.class_id
                ]
            )

            detections = sv.Detections.merge(
                [
                    detections,
                    yolo_detections,
                ]
            )
            detections.class_id = np.array(
                [int(class_id) for _, _, _, class_id, _, _ in detections]
            )
        
        # RUN EGOBLUR
        egoblur_detections = self.egoblur(image)
        detections = sv.Detections.merge(
            [
                detections,
                egoblur_detections,
            ]
        )

Speed

Here is how long the adapted individual models take per image on an RTX 4090 (seconds):

Process             Time Taken
------------------------------
Dino                2.9740
OpenCLIP            0.1445
YOLO                0.0436
RetinaFace          1.3741
EgoBlur             0.1370
NMS                 0.0001
SAM                 0.0555
Blur                0.0169

As you can see, the adapted Dino now takes a very long time, so some changes are still needed there. RetinaFace is currently running on the CPU, so its speed can be improved.

With these changes I was unable to find any faces or license plates that were missed (left unblurred) in our internal autonomous driving dataset.

Thanks again for the repo!

@StepTurtle
Collaborator

Hey @nklpolley, many thanks for your excellent work and sorry for the late response.

First of all, I want to say that most of the points you mentioned seem important and good, so we would be very happy if you contribute to the project.

OpenCLIP

The first two items you wrote seem reasonable, and it would be good to make these corrections. But I have no idea about the third one; I honestly don't understand the situation very well.

DINO

When I started the project, I remember I struggled with these nonsensical merges. I think we should implement the remove_combined flag first, and after that I can also take a look; maybe there was something we missed. Also, as I remember, this issue creates a lot of bboxes with a None class.

  1. If it is improved, good point, we should do this.

  2. I am not sure about this. Maybe we can make it optional (false as default), but ~4 sec looks too high.

  3. Good point. I think we can apply the NMS after validation; after validation we only have detection_classes, so "person" etc. won't be included. We could even apply NMS after YOLO, I guess.

YOLO

I could not find the YOLO label list for the model you shared. Does it detect faces and license plates?

The current YOLO model was trained by me on a small dataset which only includes images from Turkey, and I collected all the data with the same setup. It is expected that it will not work well with data collected in Europe.

Here I am thinking about the idea of initializing and using multiple YOLO models at the same time, considering that YOLO already runs fast. Maybe we put the YOLO models in an array and run them on the same image in order:

yolo/model: ['./yolo11x_anonymizer.pt', './1080p_medium_v8.pt', './another_self_trained.pt']
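A rough sketch of the idea (the model paths, config key, and confidence value are placeholders; the conversion and merge use the supervision API already used in the repo):

    from ultralytics import YOLO
    import supervision as sv

    # Sketch only: run every configured YOLO model on the same image and merge
    # the resulting detections into one sv.Detections object.
    yolo_models = [YOLO(path) for path in config["yolo"]["model"]]
    yolo_detections = sv.Detections.merge(
        [
            sv.Detections.from_ultralytics(model.predict(image, conf=confidence)[0])
            for model in yolo_models
        ]
    )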

With the changes described above, full cars are sometimes detected as license plates by GroundingDino and CLIP. I add the following constraints here:

    for index, (xyxy, _, score, class_id, _, _) in enumerate(detections):
        if self.classes[class_id] in self.detection_classes:
            # Remove license plates if their aspect ratio is < 2 (in Europe it is standardized to 5).
            # If a license plate is obstructed so much that its aspect ratio drops this far, it is
            # not readable anyway.
            # In our dataset, no license plate is larger than 200 pixels in height or 500 pixels in
            # width; this might differ for other cameras and focal lengths, but it is a good initial
            # value for 1080p cameras.
            if (class_id == 0) or (class_id == 13):  # the IDs of license plates; these should be determined dynamically instead of being magic numbers like here
                width = xyxy[2] - xyxy[0]
                height = xyxy[3] - xyxy[1]
                if width / height < 2 or height > 200 or width > 500:
                    continue

It looks good to keep the aspect ratio check as optional, but I am not sure if we should use hard-coded pixel values. They may give undesirable results with different camera configurations and camera angles.

In an anonymization task it is usually more important to increase recall: blurring a little too much is usually not that bad, but blurring too little can be catastrophic.
Therefore I reduce the default OpenCLIP score threshold from 0.7 to 0.4:

openclip:
  run: True
  model_name: 'ViT-B-32'
  pretrained_model: 'laion2b_s34b_b79k'
  score_threshold: 0.4

I also adapt the "is_inside", "is_max", "is_high3" logic:

if is_inside and is_max:
    valid_ids.append(index)

What about adding a lower threshold limit (so we would have two score thresholds, one upper and one lower):

...
if current_score > upper_threshold:
    valid_ids.append(index)

...

if is_inside and (is_max or current_score > lower_threshold):
    valid_ids.append(index)
else:
    invalid_ids.append(index)

...

IOU

You are right, we should update this.

GPU

Did you also test the other models with another device id? If you can confirm that the others work, we can fix only YOLO.

New Models

In this section I can suggest the following:

We can create a new subheading under the Unified model and name it detection models:

  • GroundingDINO
  • YOLO
  • EgoBlur

And we can make all of them optional. This way, people who are concerned about time or just want to test one or more models can easily turn them on and off.
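A rough sketch of what this could look like inside the unified model (the config keys and attribute names here are assumptions, not the current schema):

    # Sketch only: build the list of active detectors from the config so each
    # detection model can be switched on or off independently.
    self.detectors = []
    if config["grounding_dino"]["run"]:
        self.detectors.append(self.grounding_dino)
    if config["yolo"]["run"]:
        self.detectors.append(self.yolo)
    if config["egoblur"]["run"]:
        self.detectors.append(self.egoblur)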

Here, first I will ask you to add the EgoBlur model, then we can work on a structure like the one I mentioned.

It would be better if we discuss the RetinaFace model again after these are completed.

Summary

Since you have a lot of good points and a lot of separate things, it will be better to create separate branches and separate PRs.

Maybe the following could be the initial branches:

dev/openclip
dev/dino
dev/yolo
dev/unified_model
feat/egoblur
fix/gpu

These are just examples to clarify the situation, so feel free to use any name or branch structure. I think you can create a PR to this repository after you make the changes on your own fork. If there is any problem, please let me know.

At this point, I would be happy if you could inform me about the changes you want to make in the first stage. I have made suggestions at a few points (multiple yolo, new subheading on detection models, ...) and if you find these suggestions reasonable, I can support you in the development process. Also, if there is anything you want me to do, please share.

Thank you again for your work. 💯
