
Constant labels #1632

Closed
Dr-viddi opened this issue Nov 16, 2023 · 7 comments

@Dr-viddi

Hey Maarten,
thank you very much for this great package. I have a database of about a million documents for which I want to create clusters and labels, and new documents will probably be added every day. To limit training costs, I want to fit the model on all data (old and new), let's say once per week. After each training, clusters and labels will (of course) change. To ensure a good user experience while achieving good performance, I have the following requirements for each training:

  • The labels of the "old documents", i.e., the documents that already have a label from a previous bertopic.fit_transform run, must be constant.
  • The newly added documents should be clustered and labelled normally. That is, they should be added to existing clusters or, if they are too different, new clusters should be created.

I've played a bit with the (semi-)supervised functions that are already implemented, but I can't manage to fulfil both requirements satisfactorily.

Do you have any recommendation or strategy on how to do it?

@MaartenGr
Owner

Your exact use case is perfect for the newly introduced .merge_models. The method allows for different topic models to be merged together. When you combine two models with this method, the first model will remain as it is and the second model will be added as long as it contains new clusters. Existing clusters will not be added since those were already found in the first model.

You can do this continuously and keep on merging models this way every time you train a new model. If I am not mistaken, it completely satisfies the requirements as you described them. You can find more about that here.
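In case it helps, the decision logic can be sketched roughly like this (a plain-Python illustration, not BERTopic's actual implementation; the `cosine_sim` helper, the toy embeddings, and the 0.7 threshold are all assumptions for the example):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def merge_topics(baseline, new, min_similarity=0.7):
    """Keep baseline topics as-is; append a new topic only if no
    baseline topic is sufficiently similar to it."""
    merged = dict(baseline)  # the baseline model stays untouched
    next_id = max(merged) + 1 if merged else 0
    for emb in new.values():
        if all(cosine_sim(emb, b) < min_similarity for b in merged.values()):
            merged[next_id] = emb
            next_id += 1
    return merged

# Two baseline topics and two candidates: one near-duplicate, one novel
baseline = {0: [1.0, 0.0], 1: [0.0, 1.0]}
new = {0: [0.9, 0.1], 1: [-1.0, 0.5]}
print(merge_topics(baseline, new))  # only the novel topic is appended
```

Chaining this every week (always passing the previous merged result in first) is exactly the continuous-merging pattern described above.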

This method is currently only found in the main branch but will be added to an official release in the coming weeks.

@Dr-viddi
Author

Awesome as always. That solved my problem. Thank you, sir!

@Dr-viddi
Author

Dr-viddi commented Nov 17, 2023

I have one follow-up question. You write in the docstrings that the check for whether topics are already in the baseline model is based on "cosine similarity between topic embeddings". Do I understand correctly that

  • If a topic from the new model can be "found" in the baseline model, all documents from this topic (of the new model) will be assigned to the similar topic of the baseline model?
  • If a topic from the new model does not find a similar topic in the baseline model, this topic (of the new model) with all the associated documents is simply added to the baseline model?

So this check is done at the topic embedding level and not the document embedding level? Or put differently: it is not possible that documents of one topic of the new model are assigned to different topics of the baseline model?

@MaartenGr
Owner

> If a topic from the new model can be "found" in the baseline model, all documents from this topic (of the new model) will be assigned to the similar topic of the baseline model?

Yes. The baseline model is the model that stays mostly the same throughout merging. You can however change which model is the baseline and which model is the new model by simply changing the order.

> If a topic from the new model does not find a similar topic in the baseline model, this topic (of the new model) with all the associated documents is simply added to the baseline model?

Yes. It will add the topic to the baseline model as a new topic. It will change the topic id but keep the same name so you can easily find it.

> So this check is done at the topic embedding level and not the document embedding level? Or put differently: it is not possible that documents of one topic of the new model are assigned to different topics of the baseline model?

Yes, this check is done on topic embeddings, since document embeddings are not saved within a topic model. And indeed, it is not possible for the documents of one topic to end up in different topics.
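For what it's worth, that topic-level matching can be illustrated with a small numpy sketch (the embeddings and the 0.7 threshold here are made up; the point is only that each new topic is matched as a whole, so its documents always move together):

```python
import numpy as np

# Hypothetical topic embeddings (one row per topic)
baseline = np.array([[1.0, 0.0], [0.0, 1.0]])
new = np.array([[0.98, 0.2], [-0.8, 0.2]])
new = new / np.linalg.norm(new, axis=1, keepdims=True)  # L2-normalize

# Cosine similarity of every new topic to every baseline topic
sims = new @ baseline.T

min_similarity = 0.7
for i, row in enumerate(sims):
    j = int(np.argmax(row))
    if row[j] >= min_similarity:
        # All documents of new topic i are mapped to baseline topic j
        print(f"new topic {i} -> baseline topic {j}")
    else:
        # Appended to the baseline model as a brand-new topic
        print(f"new topic {i} -> appended as new topic")
```

Note that the assignment is per topic, not per document: there is no path by which documents within one new topic could be split across several baseline topics.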

@Dr-viddi
Author

Dr-viddi commented Nov 20, 2023

Thank you so much. If I use your code from the PR and change it a bit to do an iterative update:

```python
from umap import UMAP
from bertopic import BERTopic
from datasets import load_dataset

dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]

# Extract abstracts to train on and corresponding titles
abstracts_1 = dataset["abstract"][:5_000]
abstracts_2 = dataset["abstract"][5_000:10_000]
abstracts_3 = dataset["abstract"][10_000:15_000]

# Create topic models
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model_1 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_1)
topic_model_2 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_2)
topic_model_3 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_3)

# Combine all models into one
merged_model_1 = BERTopic.merge_models([topic_model_1, topic_model_2])
merged_model_2 = BERTopic.merge_models([merged_model_1, topic_model_3])
```

I then get a `TypeError: int() argument must be a string, a bytes-like object or a number, not 'TopicMapper'`. Is this a known issue?

@MaartenGr
Owner

@Dr-viddi Ah, that might be because of an issue that was fixed in the upcoming PR that implements zero-shot topic modeling. Using that PR instead should do the trick!

I am working on the final touches of that PR (some more documentation is needed) before I can merge it. I hope to do that sometime this week or next, together with an official release.

@Dr-viddi
Author

Everything works as expected. Thank you so much @MaartenGr for your support
