Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[py-tx] Implementation of IVF faiss indices in PDQ #1756

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

b8zhong
Copy link
Contributor

@b8zhong b8zhong commented Feb 6, 2025

Summary

PDQ now using IVF Faiss for better scalability (as we move away from the old imp)

  • Adds PDQSignalTypeIndex2 that automatically switches between flat and IVF indices based on dataset size
  • Uses IVF-Faiss for datasets >= 1000 entries (as you said), flat index for less than
  • Maintains backward compatibility with existing PDQIndex
  • Updates signal.py to use the new index implementation as default (if we're not swapping yet, I'll change it back)

Test Plan

Run the tests....

python3 -m pytest threatexchange/signal_type/tests/test_pdq_signal_type_index2.py -v -W ignore::DeprecationWarning

Passes.. ?

@b8zhong b8zhong requested a review from Dcallies as a code owner February 6, 2025 20:31
@b8zhong b8zhong changed the title Implementation of IVF faiss indices in PD [py-tx] Implementation of IVF faiss indices in PDQ Feb 6, 2025
@b8zhong b8zhong marked this pull request as draft February 10, 2025 19:05
@b8zhong b8zhong marked this pull request as ready for review February 10, 2025 19:05
Copy link
Contributor

@Dcallies Dcallies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @b8zhong! Thanks again for contributing, I am very excited about this work, and so glad it's got you on it.

I suspect this change is much simpler than you might think! You have added a new wrapping class to the heirarchy, but I don't think we need it.

Is it possible to make the following changes:

  1. Migrate the index selection logic from PDQSignalTypeIndex2.build to PDQIndex2.build.
  2. Take a look at the existing tests for pdqindex2 and see if you can re-use them by just changing the index you pass in
  3. See if there's anything left in PDQSignalTypeIndex2.

@@ -5,9 +5,10 @@
"""

import typing as t
from typing import Self
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: I love Self! However, it was only added in 3.11 and we still are planning to support older version of python (back to 3.8), so no self allowed :/

https://docs.python.org/3/library/typing.html#typing.Self

@@ -133,3 +134,78 @@ def __getstate__(self):

def __setstate__(self, data):
self.faiss_index = faiss.deserialize_index(data)


class PDQSignalTypeIndex2(SignalTypeIndex[IndexT]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: Instead of having this as an entirely new class, instead dynamically swap out which index class you are using in PDQIndex2 based on build().

I believe it should be possible to use one wrapping class, but wrapping a few different indices. Look at L48 in this file.

Comment on lines +58 to +59
def get_index_cls(cls) -> t.Type[SignalTypeIndex]:
return PDQSignalTypeIndex2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DANGER DANGER DANGER!

We have to very carefully make this change as there are live deployments of this code with stored state that need to load correctly still! (Including every deployment of HMA).

I strongly recommend holding off on this change while we prove the new implementation works perfectly, and then do some dedicated testing to show how we will handle old version capability. At worst we may need to lock this behind a major version change (pytx 2.x.x), but I think we have some options before then.

@@ -0,0 +1,184 @@
import typing as t
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


index = faiss.IndexFlatL2(BITS_IN_PDQ)

ret._index = PDQIndex2(index=index, entries=entries_list)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait except you are wrapping PDQIndex2 with this class? I am confused!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants