Clustering speedup #68
Comments
In code like
    for i in range(2, self.k_max + 1):  # Must be inclusive
        self.k = max(i, len(seeded_document_clusters))  # If the user has seeded more clusters than the k you're considering, then you can't reduce that number
it looks like the minimum value that can be used is for i in range(2, 6) with 4 seeded_document_clusters, k would be
That 2 can probably be |
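To make the repetition concrete, here is a minimal sketch of the loop bounds discussed above. The helper names `candidate_ks` and `distinct_candidate_ks` are hypothetical and do not appear in the project; they only mirror the `max(i, len(seeded_document_clusters))` expression from the snippet, with `seeded_count` standing in for `len(seeded_document_clusters)`.

```python
def candidate_ks(k_max, seeded_count):
    """Return the k used on each iteration, mirroring max(i, seeded_count)."""
    return [max(i, seeded_count) for i in range(2, k_max + 1)]


def distinct_candidate_ks(k_max, seeded_count):
    """Hypothetical alternative: start at the effective minimum so no k repeats."""
    if k_max < 2:
        return []
    start = max(2, seeded_count)
    # If more clusters are seeded than k_max allows, only that one k is tried.
    return list(range(start, max(start, k_max) + 1))


# range(2, 6) with 4 seeded clusters, as in the comment above
print(candidate_ks(5, 4))
print(distinct_candidate_ks(5, 4))
```

Starting the loop at the effective minimum tries each k exactly once. Whether that is desirable depends on how the randomized seeding interacts with repeated values of k, so this is a sketch of the option rather than a recommendation.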
@poleseArthur, checking whether notifications are working... |
However, when a repeated k value is tried, the randomized seeding by pairs will end up being different, resulting in possibly multiple answers for the same repeated k. This might give that k an advantage over other ones. Not trying the repeated values would lead to different results. |
The soft-kmeans algorithm is running with linear time complexity. The graph below is for k = 5 clusters. The exponent is ~1.0. It doesn't seem like it's the algorithm that is holding us back. If the implementation can be sped up, we should be good, @Allegra-Cohen, @poleseArthur. There are problems with convergence, though. The code is in the kwalcock/test branch. |
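The thread does not say how the exponent was measured. One common way, sketched here under the assumption that runtime was recorded against input size (for example, number of documents), is to fit a straight line to log(time) versus log(size); the slope is the exponent. The function name `fit_exponent` and the timings below are made up for illustration and are not the project's measurements.

```python
import math


def fit_exponent(sizes, times):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic timings that grow exactly linearly with input size
sizes = [100, 200, 400, 800]
times = [0.003 * n for n in sizes]
print(round(fit_exponent(sizes, times), 3))
```

A slope near 1 indicates linear scaling and a slope near 2 indicates quadratic scaling. Constant factors, which are what an implementation speedup would change, do not affect the slope.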
I did some optimizing of the Python code just by reducing the amount of unnecessary recalculation, but it only saved about 2 seconds out of 30, so not a significant gain. I'm doing a quick and dirty translation to Scala (on the JVM) to get an idea of what the possibilities are. After that, there's the possibility of converting the clustering to compiled code via Scala Native, or of replacing all of the Python backend with a different backend so that we don't need three different languages. The goal is still to minimize the changes. FYI @poleseArthur. |
Great! I just think that replacing all of the Python backend would be a huge change to the project, and we would probably spend a lot of time on it. |
The Python code probably shouldn't be translated directly into a different language as is. It should first be optimized for efficiency and then ported if still necessary.
More to follow...