
Clustering speedup #68

Open
kwalcock opened this issue Aug 3, 2023 · 6 comments

kwalcock (Collaborator) commented Aug 3, 2023

The Python code probably shouldn't be translated directly into a different language as-is. It should first be optimized for efficiency and then ported only if that is still necessary.

  • For example, if the documents are not changing, the meta_centroid is constant and doesn't need to be recalculated.
  • The composite value for a cluster does not change between the calculation for betweenness and withinness and doesn't need to be calculated twice.
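The two points above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the names `ClusterScorer`, `vectors`, and `score` are assumptions, and the distance measures stand in for whatever betweenness/withinness definitions the project uses.

```python
import numpy as np

class ClusterScorer:
    def __init__(self, vectors):
        self.vectors = vectors      # document vectors, shape (n_docs, dim); assumed fixed
        self._meta_centroid = None  # cached, since the documents don't change

    @property
    def meta_centroid(self):
        # Compute once and reuse, instead of recalculating on every call.
        if self._meta_centroid is None:
            self._meta_centroid = self.vectors.mean(axis=0)
        return self._meta_centroid

    def score(self, clusters):
        # `clusters` is a list of index arrays. Compute each cluster's
        # composite (centroid) once and reuse it for both betweenness
        # and withinness, rather than computing it twice.
        composites = [self.vectors[idx].mean(axis=0) for idx in clusters]
        betweenness = sum(np.linalg.norm(c - self.meta_centroid) for c in composites)
        withinness = sum(
            np.linalg.norm(self.vectors[idx] - c, axis=1).sum()
            for idx, c in zip(clusters, composites)
        )
        return betweenness, withinness
```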

More to follow...

kwalcock (Collaborator Author) commented Aug 3, 2023

In code like

```python
for i in range(2, self.k_max + 1):  # Must be inclusive
    self.k = max(i, len(seeded_document_clusters))  # If the user has seeded more clusters than the k you're considering, then you can't reduce that number
```

it looks like the minimum value that can be used is len(seeded_document_clusters). For all values of i lower than that, the minimum value will be reused and the results will be recalculated for the same k. That could waste some time.

For `for i in range(2, 6)` with 4 seeded_document_clusters, k would be

| i | k |
|---|---|
| 2 | 4 |
| 3 | 4 |
| 4 | 4 |
| 5 | 5 |

That 2 can probably be `max(2, len(seeded_document_clusters))`.
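The suggested fix can be sketched as a small helper. This is a hedged illustration, assuming a hypothetical `candidate_ks` function rather than the project's real loop:

```python
def candidate_ks(k_max, seeded_document_clusters):
    # Start the range at max(2, len(seeded_document_clusters)) so that
    # each candidate k is tried exactly once, instead of repeatedly
    # clamping low values of i up to the seeded cluster count.
    k_min = max(2, len(seeded_document_clusters))
    return list(range(k_min, k_max + 1))  # inclusive of k_max
```

With `k_max = 5` and 4 seeded clusters this yields `[4, 5]`, matching the table above with the repeats removed.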

kwalcock (Collaborator Author) commented Aug 7, 2023

@poleseArthur, checking whether notifications are working...

kwalcock (Collaborator Author) commented Aug 8, 2023

However, when a repeated k value is tried, the randomized seeding by pairs will end up being different, resulting in possibly multiple answers for the same repeated k. This might give that k an advantage over other ones. Not trying the repeated values would lead to different results.
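The bias can be made concrete by counting how many randomized attempts each k receives. This is a sketch of the point, not the repo's code; `attempts_per_k` and `fair_attempts_per_k` are hypothetical helpers, and the "fair" variant (a fixed number of restarts per distinct k) is one possible alternative, not the current behavior:

```python
from collections import Counter

def attempts_per_k(k_max, n_seeded):
    # With the original loop, low values of i are clamped up to n_seeded,
    # so the clamped k gets multiple randomized attempts.
    ks = [max(i, n_seeded) for i in range(2, k_max + 1)]
    return Counter(ks)

def fair_attempts_per_k(k_max, n_seeded, restarts=1):
    # One fair alternative: give every distinct candidate k the same
    # number of random restarts.
    k_min = max(2, n_seeded)
    return {k: restarts for k in range(k_min, k_max + 1)}
```

For `k_max = 5` and 4 seeded clusters, the original loop gives k = 4 three randomized attempts and k = 5 only one, which is the advantage described above.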

kwalcock (Collaborator Author) commented Aug 9, 2023

The soft-kmeans algorithm is working with linear time complexity. The graph below is for k = 5 clusters. The exponent is ~1.0. It doesn't seem like it's the algorithm that is holding us back. If the implementation can be sped up, we should be good, @Allegra-Cohen, @poleseArthur. There are problems with convergence, though. Code is in kwalcock/test branch.

[two images attached: timing plots for the soft k-means runs]
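An exponent like the "~1.0" above can be estimated by timing the clustering at several input sizes and fitting a line in log-log space. This is a generic sketch of that measurement, assuming a hypothetical `cluster_fn(n)` callable standing in for the actual soft k-means implementation:

```python
import time
import numpy as np

def scaling_exponent(cluster_fn, sizes):
    # Time the workload at each input size n.
    times = []
    for n in sizes:
        start = time.perf_counter()
        cluster_fn(n)
        times.append(time.perf_counter() - start)
    # The slope of log(time) vs log(n) is the empirical complexity
    # exponent: ~1.0 indicates linear scaling.
    slope, _ = np.polyfit(np.log(sizes), np.log(times), 1)
    return slope
```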

kwalcock (Collaborator Author) commented

I did some optimizing of the Python code just by reducing the amount of unnecessary recalculation, but it only sped up the result by about 2 seconds out of 30, so not significantly. I'm doing a quick and dirty translation to Scala (on the JVM) to get an idea of what the possibilities are. After that, there's the possibility of converting the clustering to compiled code via Scala Native, or of replacing all of the Python backend with a different one so that we don't need three different languages. The goal is still to minimize the changes. FYI @poleseArthur.

poleseArthur (Collaborator) commented

Great! I just think that replacing all of the Python backend would be a huge change to the project, and we would probably spend a lot of time on it.
