
Clustering speedup #68

Open
kwalcock opened this issue Aug 3, 2023 · 6 comments

kwalcock (Collaborator) commented Aug 3, 2023

The Python code probably shouldn't be translated directly into a different language as-is. It should first be optimized for efficiency and then ported only if that is still necessary.

  • For example, if the documents are not changing, the meta_centroid is constant and doesn't need to be recalculated.
  • The composite value for a cluster does not change between the calculation for betweenness and withinness and doesn't need to be calculated twice.
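The two points above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the names `ClusterScorer`, `vectors`, and `score` are assumptions, and the distance measures stand in for whatever betweenness/withinness definitions the project uses.

```python
import numpy as np

class ClusterScorer:
    def __init__(self, vectors):
        self.vectors = vectors      # document vectors, shape (n_docs, dim); assumed fixed
        self._meta_centroid = None  # cached, since the documents don't change

    @property
    def meta_centroid(self):
        # Compute once and reuse, instead of recalculating on every call.
        if self._meta_centroid is None:
            self._meta_centroid = self.vectors.mean(axis=0)
        return self._meta_centroid

    def score(self, clusters):
        # `clusters` is a list of index arrays. Compute each cluster's
        # composite (centroid) once and reuse it for both betweenness
        # and withinness, rather than computing it twice.
        composites = [self.vectors[idx].mean(axis=0) for idx in clusters]
        betweenness = sum(np.linalg.norm(c - self.meta_centroid) for c in composites)
        withinness = sum(
            np.linalg.norm(self.vectors[idx] - c, axis=1).sum()
            for idx, c in zip(clusters, composites)
        )
        return betweenness, withinness
```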

More to follow...

kwalcock (Collaborator Author) commented Aug 3, 2023

In code like

```python
for i in range(2, self.k_max + 1):  # Must be inclusive
    self.k = max(i, len(seeded_document_clusters))  # If the user has seeded more clusters than the k you're considering, then you can't reduce that number
```

it looks like the minimum value that can be used is len(seeded_document_clusters). For all values of i lower than that, the minimum value will be reused and the results will be recalculated for the same k. That could waste some time.

For `for i in range(2, 6)` with 4 seeded_document_clusters, k would be

| i | k |
|---|---|
| 2 | 4 |
| 3 | 4 |
| 4 | 4 |
| 5 | 5 |

That 2 can probably be `max(2, len(seeded_document_clusters))`.
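The suggested fix can be sketched as a small helper. This is a hedged illustration, assuming a hypothetical `candidate_ks` function rather than the project's real loop:

```python
def candidate_ks(k_max, seeded_document_clusters):
    # Start the range at max(2, len(seeded_document_clusters)) so that
    # each candidate k is tried exactly once, instead of repeatedly
    # clamping low values of i up to the seeded cluster count.
    k_min = max(2, len(seeded_document_clusters))
    return list(range(k_min, k_max + 1))  # inclusive of k_max
```

With `k_max = 5` and 4 seeded clusters this yields `[4, 5]`, matching the table above with the repeats removed.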

kwalcock (Collaborator Author) commented Aug 7, 2023

@poleseArthur, checking whether notifications are working...

kwalcock (Collaborator Author) commented Aug 8, 2023

However, when a repeated k value is tried, the randomized seeding by pairs will end up being different, resulting in possibly multiple answers for the same repeated k. This might give that k an advantage over other ones. Not trying the repeated values would lead to different results.
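The bias can be made concrete by counting how many randomized attempts each k receives. This is a sketch of the point, not the repo's code; `attempts_per_k` and `fair_attempts_per_k` are hypothetical helpers, and the "fair" variant (a fixed number of restarts per distinct k) is one possible alternative, not the current behavior:

```python
from collections import Counter

def attempts_per_k(k_max, n_seeded):
    # With the original loop, low values of i are clamped up to n_seeded,
    # so the clamped k gets multiple randomized attempts.
    ks = [max(i, n_seeded) for i in range(2, k_max + 1)]
    return Counter(ks)

def fair_attempts_per_k(k_max, n_seeded, restarts=1):
    # One fair alternative: give every distinct candidate k the same
    # number of random restarts.
    k_min = max(2, n_seeded)
    return {k: restarts for k in range(k_min, k_max + 1)}
```

For `k_max = 5` and 4 seeded clusters, the original loop gives k = 4 three randomized attempts and k = 5 only one, which is the advantage described above.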

kwalcock (Collaborator Author) commented Aug 9, 2023

The soft-kmeans algorithm is working with linear time complexity. The graph below is for k = 5 clusters. The exponent is ~1.0. It doesn't seem like it's the algorithm that is holding us back. If the implementation can be sped up, we should be good, @Allegra-Cohen, @poleseArthur. There are problems with convergence, though. Code is in kwalcock/test branch.

[two images attached: timing plots for the soft k-means runs]
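An exponent like the "~1.0" above can be estimated by timing the clustering at several input sizes and fitting a line in log-log space. This is a generic sketch of that measurement, assuming a hypothetical `cluster_fn(n)` callable standing in for the actual soft k-means implementation:

```python
import time
import numpy as np

def scaling_exponent(cluster_fn, sizes):
    # Time the workload at each input size n.
    times = []
    for n in sizes:
        start = time.perf_counter()
        cluster_fn(n)
        times.append(time.perf_counter() - start)
    # The slope of log(time) vs log(n) is the empirical complexity
    # exponent: ~1.0 indicates linear scaling.
    slope, _ = np.polyfit(np.log(sizes), np.log(times), 1)
    return slope
```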

kwalcock (Collaborator Author) commented

I did some optimizing of the Python code just by reducing the amount of unnecessary recalculation, but it only sped up the result by about 2 seconds out of 30, so not significantly. I'm doing a quick and dirty translation to Scala (on the JVM) to get an idea of what the possibilities are. After that, there's the possibility of converting the clustering to compiled code via Scala Native, or of replacing all of the Python backend with a different one so that we don't need three different languages. The goal is still to minimize the changes. FYI @poleseArthur.

poleseArthur (Collaborator) commented

Great! I just think that replacing all of the Python backend would be a huge change to the project, and we would probably spend a lot of time on it.
