Improved Features, tested on internal data #3
Hey @nklpolley, many thanks for your excellent work, and sorry for the late response. First of all, I want to say that most of the points you mentioned seem important and good, so we would be very happy if you contributed to the project.

OpenCLIP
The first two items you wrote seem reasonable. It would be good to make these corrections. But I have no idea about the third one; I honestly don't understand the situation very well.

DINO
When I started the project, I remember I struggled with these nonsensical things. I think we should implement these changes.
YOLO
I could not find the YOLO label list for the model you shared. Does it detect faces and license plates? The current YOLO model was trained by me on a small dataset which includes images only from Turkey, and I also collected all the data with the same setup. It is expected that it will not work well with data collected from Europe. Here I am thinking about the idea of initializing and using multiple YOLO models at the same time, considering that YOLO already works fast. Maybe we put the YOLO models in an array and run them on the same image in order:

yolo/model: ['./yolo11x_anonymizer.pt', './1080p_medium_v8.pt', './another_self_trained.pt']
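A minimal sketch of that multi-model idea, assuming the ultralytics API (the weight paths are just the placeholders from above):

# Sketch only: run several YOLO weights on the same image and pool their detections.
from ultralytics import YOLO

model_paths = ['./yolo11x_anonymizer.pt', './1080p_medium_v8.pt']
models = [YOLO(path) for path in model_paths]

def detect_all(image, confidence=0.5, device=0):
    detections = []
    for model in models:
        results = model.predict(image, conf=confidence, device=device)
        detections.extend(results[0].boxes)  # pool boxes from every model
    return detections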
It looks good to keep the aspect-ratio thing optional, but I am not sure if we should use a hard-coded pixel value. It may give undesirable results with different camera configurations and different camera angles.
What about adding a lower threshold limit (we would then have two score thresholds, one upper and one lower):

...
if current_score > upper_threshold:
    valid_ids.append(index)
...
if is_inside and (is_max or current_score > lower_threshold):
    valid_ids.append(index)
else:
    invalid_ids.append(index)
...
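Put together as a runnable sketch (illustrative names, not the repo's exact code):

def split_ids(scores, is_inside, is_max, upper_threshold=0.7, lower_threshold=0.4):
    # Two-threshold validation: accept confident detections directly; keep
    # borderline ones only if they sit inside another bbox and are either the
    # most likely CLIP class or above the lower threshold.
    valid_ids, invalid_ids = [], []
    for index, current_score in enumerate(scores):
        if current_score > upper_threshold:
            valid_ids.append(index)
        elif is_inside[index] and (is_max[index] or current_score > lower_threshold):
            valid_ids.append(index)
        else:
            invalid_ids.append(index)
    return valid_ids, invalid_ids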
IOU
You are right, we should update this.

GPU
Did you also test the other models with another device id? If you can confirm that the others work, we can fix it only for YOLO.

New Models
In this section I can suggest the following: we can create a new subheading under the unified model and name it "detection models":
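Something like this, as a purely illustrative sketch (the key names are not the repo's actual schema):

unified_model:
  detection_models:
    grounding_dino:
      enabled: true
    open_clip:
      enabled: true
    yolo:
      enabled: true
    ego_blur:
      enabled: false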
And we can make all of them optional. This way, people who are concerned about time or who just want to test one or more models can easily turn them on and off. Here, first I will ask you to add the EgoBlur model, then we can work on a structure like the one I mentioned. It would be better if we discuss the RetinaFace model again after these are completed.

Summary
Since you have a lot of good points and a lot of separate things, it will be better to create separate branches and separate PRs. Maybe the following should be the initial branches:
These are just examples to clarify the situation, so feel free to use any name or branch structure. I think you can create a PR to this repository after you make changes on your own fork. If there is any problem, please let me know. At this point, I would be happy if you could inform me about the changes you want to make in the first stage. I have made suggestions at a few points (multiple YOLO models, a new subheading for detection models, ...), and if you find these suggestions reasonable, I can support you in the development process. Also, if there is anything you want me to do, please share. Thank you again for your work. 💯
Hey @StepTurtle, great repo! I tested the unified model on our internal data. I tried out some changes that greatly increased the performance on our data and want to share them with you. If desired, I can create a second branch for merging, but this might take some time.
OpenClip
The self.preprocess of the currently used CLIP model uses Resize(224) and CenterCrop(224,224).
This is not suitable for highly rectangular images like the DINO detections of license plates. CLIP will only receive the middle part of the license plate, because the center crop removes the majority (the left and right parts) of the image.
It is better to use the original rectangular license plate image and square it by padding with black pixels at the top and bottom.
I currently overwrite the self.preprocess accordingly; a rough sketch of the idea follows. It is not ideal, as this hardcoded preprocess no longer supports an easy swap of CLIP models that might require different preprocessing.
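A minimal sketch of such a square-padding preprocess (illustrative only, assuming torchvision transforms and the standard CLIP normalization constants):

# Sketch: pad a rectangular crop to a square with black pixels, then resize,
# instead of CLIP's default Resize + CenterCrop.
import torchvision.transforms as T
import torchvision.transforms.functional as F

class PadToSquare:
    # img: PIL.Image; pad the shorter side so the image becomes square.
    def __call__(self, img):
        w, h = img.size
        side = max(w, h)
        left = (side - w) // 2
        top = (side - h) // 2
        return F.pad(img, [left, top, side - w - left, side - h - top], fill=0)

preprocess = T.Compose([
    PadToSquare(),
    T.Resize(224),
    T.ToTensor(),
    T.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                std=(0.26862954, 0.26130258, 0.27577711)),
])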
text_probs = (100.0 * image_features @ text_features.T)
and then using the class id:
prob[class_id]
but it didn't improve.

GroundingDino
The label captions do not work for this use case: classes are merged nonsensically, which reduces the number of detected bboxes per class. This often leads to faces not being detected. A recent commit of the Grounding-Dino repository added the remove_combined flag, which removes the combination of nonsensical classes. I reimplemented this but still had problems. On some test images I noticed that if only a person's head is visible, GroundingDino will create the bbox "Person" but not the bbox "Face". If I remove the label "person" from the caption, Dino will correctly draw the "Face" bbox.
I could greatly increase its quality with the following steps:
First, I slightly adapted validation.json (added "a"; this also seems to help CLIP).
Second, I predict each class independently and therefore changed the __call__ function.
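A rough sketch of the idea (illustrative only; predict_single_caption is a hypothetical stand-in for whatever single-caption inference the wrapper exposes):

class GroundingDinoWrapper:
    # Sketch: query the model once per label instead of with one merged
    # caption, so classes cannot be fused into nonsensical phrases.
    def __call__(self, image, labels):
        all_boxes, all_scores, all_phrases = [], [], []
        for label in labels:
            boxes, scores, phrases = self.predict_single_caption(image, caption=label)
            all_boxes.extend(boxes)
            all_scores.extend(scores)
            all_phrases.extend([label] * len(boxes))  # phrase is the queried label
        return all_boxes, all_scores, all_phrases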
This drastically decreases the speed of GroundingDino (I measured it at ~3 seconds per image on an RTX 4090), so it is not suitable for real-time deployment.
Third: currently, non-maximum suppression (NMS) is used on all bboxes of Dino. If person and face bounding boxes are in similar places, it deletes the bbox with the lower confidence; if the person bbox has the higher confidence, the face bbox is removed.
I replaced the linked NMS so that it only suppresses boxes within a class rather than class-agnostically.
This batched NMS should also be done even if the previous two steps are not, as currently face detections are often removed.
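A minimal sketch using torchvision's batched_nms (assuming boxes as an (N, 4) float tensor in (x1, y1, x2, y2) and integer class ids):

# Class-aware NMS: batched_nms only suppresses overlapping boxes that share
# the same class id, so a "face" box is never removed by a higher-scoring
# "person" box.
from torchvision.ops import batched_nms

def classwise_nms(boxes, scores, class_ids, iou_threshold=0.5):
    keep = batched_nms(boxes, scores, class_ids, iou_threshold)
    return keep  # indices of the detections to keep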
YOLO
Unified Language Model
Therefore, I reduced the default OpenCLIP score threshold from 0.7 to 0.4,
and I also adapted the "is_inside" / "is_max" / "is_high3" logic.
high_3 is redundant since I decreased the threshold from 0.7 to 0.4:
if the score is lower than 0.4, I want to make sure that it is at least the most likely of all available CLIP classes and inside another bbox.
But this is not the formula for IoU, and it should probably be renamed (calculate_bbox_containment or similar) to avoid confusion with the standard IoU calculation (where you divide by the union of A1 and A2, not by min(A1, A2)).
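To make the difference concrete, a small sketch (illustrative names; boxes as (x1, y1, x2, y2)):

# Standard IoU divides the intersection by the union; the current formula
# divides by the smaller area, which is a containment ratio (~1.0 whenever
# the smaller box lies inside the larger one).
def intersection(a, b):
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return w * h

def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])

def iou(a, b):
    inter = intersection(a, b)
    return inter / (area(a) + area(b) - inter)

def calculate_bbox_containment(a, b):
    return intersection(a, b) / min(area(a), area(b))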
Specific GPU
results = self.model.predict(image, conf=confidence, device=self.device)  # pass the configured device instead of the default
Additional Models
I added EgoBlur and RetinaFace after YOLO for even better anonymization. But this might be too much for this repo, as e.g. RetinaFace requires TensorFlow.
But EgoBlur is implemented easily: create the file
autoware_rosbag2_anonymizer/models/egoblur.py
and add it into the unified_language_model here.
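A rough sketch of what such a wrapper could look like, assuming the released TorchScript checkpoint (per the official EgoBlur demo, it loads with torch.jit.load and returns (boxes, labels, scores, dims) for a (C, H, W) BGR float tensor):

# autoware_rosbag2_anonymizer/models/egoblur.py (illustrative sketch)
import torch

class EgoBlur:
    def __init__(self, model_path, device='cuda:0', score_threshold=0.5):
        self.device = device
        self.score_threshold = score_threshold
        self.detector = torch.jit.load(model_path, map_location='cpu').to(device).eval()

    def __call__(self, image_bgr):
        # image_bgr: numpy array (H, W, 3), BGR, uint8
        tensor = torch.from_numpy(image_bgr).permute(2, 0, 1).float().to(self.device)
        with torch.no_grad():
            boxes, _, scores, _ = self.detector(tensor)
        keep = scores > self.score_threshold
        return boxes[keep].cpu().numpy()  # (N, 4) boxes in (x1, y1, x2, y2)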
Speed
Here is how long the adapted individual models take for one image on a 4090:
As you can see, the adapted Dino now takes a very long time, so there need to be some changes. RetinaFace is currently running on the CPU, so its speed can be improved.
With these changes, I was unable to find any remaining (un-anonymized) faces or license plates in our internal autonomous driving dataset.
Thanks again for the repo!