-
Notifications
You must be signed in to change notification settings - Fork 324
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[py-tx] Implementation of IVF faiss indices in PDQ #1756
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @b8zhong! Thanks again for contributing, I am very excited about this work, and so glad it's got you on it.
I suspect this change is much simpler than you might think! You have added a new wrapping class to the heirarchy, but I don't think we need it.
Is it possible to make the following changes:
- Migrate the index selection logic from
PDQSignalTypeIndex2.build
toPDQIndex2.build
. - Take a look at the existing tests for pdqindex2 and see if you can re-use them by just changing the index you pass in
- See if there's anything left in
PDQSignalTypeIndex2
.
@@ -5,9 +5,10 @@ | |||
""" | |||
|
|||
import typing as t | |||
from typing import Self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking: I love Self! However, it was only added in 3.11 and we still are planning to support older version of python (back to 3.8), so no self allowed :/
@@ -133,3 +134,78 @@ def __getstate__(self): | |||
|
|||
def __setstate__(self, data): | |||
self.faiss_index = faiss.deserialize_index(data) | |||
|
|||
|
|||
class PDQSignalTypeIndex2(SignalTypeIndex[IndexT]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking: Instead of having this as an entirely new class, instead dynamically swap out which index class you are using in PDQIndex2 based on build()
.
I believe it should be possible to use one wrapping class, but wrapping a few different indices. Look at L48 in this file.
def get_index_cls(cls) -> t.Type[SignalTypeIndex]: | ||
return PDQSignalTypeIndex2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DANGER DANGER DANGER!
We have to very carefully make this change as there are live deployments of this code with stored state that need to load correctly still! (Including every deployment of HMA).
I strongly recommend holding off on this change while we prove the new implementation works perfectly, and then do some dedicated testing to show how we will handle old version capability. At worst we may need to lock this behind a major version change (pytx 2.x.x), but I think we have some options before then.
@@ -0,0 +1,184 @@ | |||
import typing as t |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about these tests?
|
||
index = faiss.IndexFlatL2(BITS_IN_PDQ) | ||
|
||
ret._index = PDQIndex2(index=index, entries=entries_list) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait except you are wrapping PDQIndex2 with this class? I am confused!
Summary
PDQ now using IVF Faiss for better scalability (as we move away from the old imp)
Test Plan
Run the tests....
python3 -m pytest threatexchange/signal_type/tests/test_pdq_signal_type_index2.py -v -W ignore::DeprecationWarning
Passes.. ?