
Constant labels #1632

Closed
Dr-viddi opened this issue Nov 16, 2023 · 7 comments

@Dr-viddi

Hey Maarten,
thank you very much for this great package. I have a database of about a million documents for which I want to create clusters and labels, and new documents will probably be added every day. To limit training costs, I want to fit the model on all data (old and new), let's say once per week. After each training, clusters and labels will (of course) change. To ensure a good user experience while achieving good performance, I have the following requirements for each training:

  • The labels of the "old documents", i.e., the documents that already have a label from a previous bertopic.fit_transform run, must be constant.
  • The newly added documents should be clustered and labelled normally. That is, they should be added to existing clusters or, if they are too different, new clusters should be created.

I've played a bit with the (semi-)supervised functions that are already implemented, but I can't manage to fulfil both requirements satisfactorily.

Do you have any recommendation or strategy on how to do it?

@MaartenGr
Owner

Your exact use case is perfect for the newly introduced .merge_models. The method allows for different topic models to be merged together. When you combine two models with this method, the first model will remain as it is and the second model will be added as long as it contains new clusters. Existing clusters will not be added since those were already found in the first model.

You can do this continuously and keep on merging models this way every time you train a new model. If I am not mistaken, it completely satisfies the requirements as you described them. You can find more about that here.
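In case it helps, the decision logic can be sketched roughly like this (a plain-Python illustration, not BERTopic's actual implementation; the `cosine_sim` helper, the toy embeddings, and the 0.7 threshold are all assumptions for the example):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def merge_topics(baseline, new, min_similarity=0.7):
    """Keep baseline topics as-is; append a new topic only if no
    baseline topic is sufficiently similar to it."""
    merged = dict(baseline)  # the baseline model stays untouched
    next_id = max(merged) + 1 if merged else 0
    for emb in new.values():
        if all(cosine_sim(emb, b) < min_similarity for b in merged.values()):
            merged[next_id] = emb
            next_id += 1
    return merged

# Two baseline topics and two candidates: one near-duplicate, one novel
baseline = {0: [1.0, 0.0], 1: [0.0, 1.0]}
new = {0: [0.9, 0.1], 1: [-1.0, 0.5]}
print(merge_topics(baseline, new))  # only the novel topic is appended
```

Chaining this every week (always passing the previous merged result in first) is exactly the continuous-merging pattern described above.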

This method is currently only found in the main branch but will be added to an official release in the coming weeks.

@Dr-viddi
Author

Awesome as always. That solved my problem. Thank you, sir!

@Dr-viddi
Author

Dr-viddi commented Nov 17, 2023

I have one follow-up question. You write in the docstrings that the check for whether topics are already in the baseline model is based on "cosine similarity between topic embeddings". Do I understand correctly that

  • If a topic from the new model can be "found" in the baseline model, all documents from this topic (of the new model) will be assigned to the similar topic of the baseline model?
  • If a topic from the new model does not find a similar topic in the baseline model, this topic (of the new model) with all the associated documents is simply added to the baseline model?

So this check is done at the topic embedding level and not the document embedding level? Or put differently: it is not possible that documents of one topic of the new model are assigned to different topics of the baseline model?

@MaartenGr
Owner

> If a topic from the new model can be "found" in the baseline model, all documents from this topic (of the new model) will be assigned to the similar topic of the baseline model?

Yes. The baseline model is the model that stays mostly the same throughout merging. You can however change which model is the baseline and which model is the new model by simply changing the order.

> If a topic from the new model does not find a similar topic in the baseline model, this topic (of the new model) with all the associated documents is simply added to the baseline model?

Yes. It will add the topic to the baseline model as a new topic. It will change the topic id but keep the same name so you can easily find it.

> So this check is done at the topic embedding level and not the document embedding level? Or put differently: it is not possible that documents of one topic of the new model are assigned to different topics of the baseline model?

Yes, this check is done on topic embeddings, since document embeddings are not saved within a topic model. And indeed, it is not possible for the documents of one topic to end up in different topics.
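For what it's worth, that topic-level matching can be illustrated with a small numpy sketch (the embeddings and the 0.7 threshold here are made up; the point is only that each new topic is matched as a whole, so its documents always move together):

```python
import numpy as np

# Hypothetical topic embeddings (one row per topic)
baseline = np.array([[1.0, 0.0], [0.0, 1.0]])
new = np.array([[0.98, 0.2], [-0.8, 0.2]])
new = new / np.linalg.norm(new, axis=1, keepdims=True)  # L2-normalize

# Cosine similarity of every new topic to every baseline topic
sims = new @ baseline.T

min_similarity = 0.7
for i, row in enumerate(sims):
    j = int(np.argmax(row))
    if row[j] >= min_similarity:
        # All documents of new topic i are mapped to baseline topic j
        print(f"new topic {i} -> baseline topic {j}")
    else:
        # Appended to the baseline model as a brand-new topic
        print(f"new topic {i} -> appended as new topic")
```

Note that the assignment is per topic, not per document: there is no path by which documents within one new topic could be split across several baseline topics.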

@Dr-viddi
Author

Dr-viddi commented Nov 20, 2023

Thank you so much. If I use your code from the PR and change it a bit to do an iterative update:

```python
from umap import UMAP
from bertopic import BERTopic
from datasets import load_dataset

dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]

# Extract abstracts to train on and corresponding titles
abstracts_1 = dataset["abstract"][:5_000]
abstracts_2 = dataset["abstract"][5_000:10_000]
abstracts_3 = dataset["abstract"][10_000:15_000]

# Create topic models
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model_1 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_1)
topic_model_2 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_2)
topic_model_3 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_3)

# Combine all models into one
merged_model_1 = BERTopic.merge_models([topic_model_1, topic_model_2])
merged_model_2 = BERTopic.merge_models([merged_model_1, topic_model_3])
```

I then get a `TypeError: int() argument must be a string, a bytes-like object or a number, not 'TopicMapper'`. Is this a known issue?

@MaartenGr
Owner

@Dr-viddi Ah, that might be because of an issue that was fixed in the upcoming PR that implements zero-shot topic modeling. Using that PR instead should do the trick!

I am working on the final touches of that PR (some more documentation is needed) before I can merge it. I hope to do that sometime this week or next, together with an official release.

@Dr-viddi
Author

Everything works as expected. Thank you so much @MaartenGr for your support
