Constant labels #1632
Comments
Your exact use case is perfect for the newly introduced `BERTopic.merge_models` functionality. You can do this continuously and keep merging models this way every time you train a new model. If I am not mistaken, it completely satisfies the requirements as you described them. You can find more about that here. This method is currently only found in the main branch but will be added in an official release in the coming weeks.
Awesome as always. That solved my problem. Thank you, sir!
I have one follow-up question. You write in the docstrings that the check whether topics are in the baseline model is based on "cosine similarity between topic embeddings". Do I understand correctly that this check is done on the topic embedding level and not on the document embedding level? Or, put differently: is it not possible that documents of one topic of the new model are assigned to different topics of the baseline model?
Yes. The baseline model is the model that stays mostly the same throughout merging. You can, however, change which model is the baseline and which is the new model simply by changing their order.
Yes. It will add the topic to the baseline model as a new topic. It will change the topic id but keep the same name so you can easily find it.
Yes, this check is done on topic embeddings, since document embeddings are not saved within a topic model. Indeed, it is not possible for documents within one topic of the new model to end up assigned to different topics of the baseline model.
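To make the topic-level check concrete, here is a small, self-contained sketch of matching new topics against baseline topics by cosine similarity of their embeddings. This is not BERTopic's internal code; the function name, the threshold value, and the toy embeddings are all illustrative:

```python
import numpy as np


def match_topics(baseline_embs, new_embs, min_similarity=0.7):
    """Map each new topic to its most similar baseline topic,
    or to -1 if no baseline topic is similar enough."""
    # Normalize rows so the dot product equals cosine similarity.
    base = baseline_embs / np.linalg.norm(baseline_embs, axis=1, keepdims=True)
    new = new_embs / np.linalg.norm(new_embs, axis=1, keepdims=True)
    sims = new @ base.T  # shape (n_new, n_base)
    best = sims.argmax(axis=1)
    best_sim = sims.max(axis=1)
    # Topics below the threshold are treated as genuinely new topics.
    return [int(b) if s >= min_similarity else -1
            for b, s in zip(best, best_sim)]


baseline = np.array([[1.0, 0.0], [0.0, 1.0]])
new = np.array([[0.9, 0.1], [-1.0, 0.2]])
print(match_topics(baseline, new))  # → [0, -1]: first matches topic 0, second is new
```

Because the decision is made once per topic, all documents of that topic move together, which is why a topic's documents cannot be split across several baseline topics.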
Thank you so much. If I use your code in the PR and change it a bit so that I can have an iterative update:

```python
from datasets import load_dataset
from umap import UMAP
from bertopic import BERTopic

dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]

# Extract abstracts to train on and corresponding titles
abstracts_1 = dataset["abstract"][:5_000]

# Create topic models
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
# (fitting of topic_model_1 and topic_model_2 omitted here)

# Combine all models into one
merged_model_1 = BERTopic.merge_models([topic_model_1, topic_model_2])
```

I then got a "TypeError: int() argument must be a string, a bytes-like object or a number, not 'TopicMapper'". Is this a known issue?
@Dr-viddi Ah, that might be because of an issue that was fixed in the upcoming PR that implements zero-shot topic modeling. Using that PR instead should do the trick! I am working on the final touches of that PR; some more documentation is needed before I can merge it. I hope to do that this or next week with an official release.
Everything works as expected. Thank you so much @MaartenGr for your support |
Hey Maarten,
thank you very much for this great package. I have a database with about a million documents for which I want to create clusters and labels. Every day some new documents will probably be added. To avoid training costs, I want to fit the model on all data (old and new), let's say once per week. After each training, clusters and labels (of course) change. To ensure a good user experience while achieving good performance, I have the following requirements for each training:
I've played a bit with the (semi-)supervised functions that are already implemented, but I don't manage to fulfil both requirements satisfactorily.
Do you have any recommendation or strategy on how to do it?