Zeroshot Topic Modeling #1572
Conversation
Hi Maarten,

I am following up on the use of […]. My query revolves around the potential for regex pattern matching to group similar seed words, avoiding stemming or lemmatization of the original text, which might lead to loss of meaning or contextual ambiguity. Specifically, I am considering the following approaches:

[…]
Your expertise and insights on these proposed methods would be greatly appreciated, especially considering the balance between semantic integrity and computational efficiency. Thank you for your guidance and contributions to this field. Cheers,
@linxule Thanks for sharing your use cases and suggestions!
The way […]. In other words, the […]. Another approach would be to run one of the examples you mentioned yourself on the input data with the […]. Note that this method allows you to group the vocabulary before passing it to BERTopic. If we were to implement it within BERTopic, […]. Having said that, let me briefly go through the examples you mentioned.
As mentioned above, I am not sure it should be implemented within BERTopic itself.
Note that whatever method you choose, you are likely still parsing a large vocabulary, which could still be feasible if you select a subset of that vocabulary, for instance by choosing only words that appear at least n times. Stemming is a great option, but since the stemmed words are semantically similar, using something like MaximalMarginalRelevance seems more appropriate to reduce words with similar meaning.
This does not have a yes or no answer and depends highly on your use case, but external resources could indeed help in finding synonyms and antonyms. However, as mentioned above, removing similar words seems like a job for MaximalMarginalRelevance. All in all, I would advise doing this grouping first and then supplying the grouped vocabulary to BERTopic.
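The two suggestions above can be sketched roughly as follows, using `CountVectorizer`'s `vocabulary` parameter and BERTopic's `MaximalMarginalRelevance` representation model; the grouped vocabulary and the `diversity` value are purely illustrative:

```python
from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance
from sklearn.feature_extraction.text import CountVectorizer

# Curate/group the vocabulary yourself before passing it to BERTopic
# (the word list below is purely illustrative)
grouped_vocabulary = ["network", "networks", "learning", "agent", "robot"]
vectorizer_model = CountVectorizer(vocabulary=grouped_vocabulary)

# Reduce semantically similar words (e.g. stemmed variants) in the topic representations
representation_model = MaximalMarginalRelevance(diversity=0.3)

topic_model = BERTopic(
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
)
```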
Zeroshot Topic Modeling
Zeroshot Topic Modeling is a technique that allows you to find predefined topics in large amounts of documents. When faced with many documents, you often have an idea of which topics will definitely be in there, whether as a result of simply knowing your data or because a domain expert is involved in defining those topics.
This method allows you to not only find those specific topics but also create new topics for documents that would not fit with your predefined topics.
This allows for extensive flexibility, as there are three scenarios to explore.
First, both zeroshot topics and clustered topics were detected. This means that some documents would fit with the predefined topics, whereas others would not. For the latter, new topics were found.
Second, only zeroshot topics were detected. Here, we would not need to find additional topics since all original documents were assigned to one of the predefined topics.
Third, no zeroshot topics were detected. This means that none of the documents would fit with the predefined topics and a regular BERTopic model would be run.
This method works as follows. First, we create a number of labels for our predefined topics and embed them using any embedding model. Then, we compare the embeddings of the documents with those of the predefined labels using cosine similarity. If the similarity passes a user-defined threshold, the zeroshot topic is assigned to the document. If it does not, that document, along with others, will be put through a regular BERTopic model.
This creates two models. One for the zeroshot topics and one for the non-zeroshot topics. We combine these two BERTopic models to create a single model that contains both zeroshot and non-zeroshot topics.
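A minimal sketch of how this looks in code, using the `zeroshot_topic_list` and `zeroshot_min_similarity` parameters of `BERTopic`; the topic labels, embedding model, and threshold value below are illustrative choices rather than recommendations:

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Any collection of documents works; 20 Newsgroups is used purely for illustration
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# Predefined (zeroshot) topics we expect to find in the documents
zeroshot_topic_list = ["health and medicine", "space and astronomy", "computer hardware"]

topic_model = BERTopic(
    embedding_model="thenlper/gte-small",    # any embedding model can be used here
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=0.85,            # cosine-similarity threshold for assigning documents
)
topics, _ = topic_model.fit_transform(docs)

# The resulting overview contains both zeroshot and newly clustered topics
print(topic_model.get_topic_info())
```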
Example
To demonstrate Guided BERTopic, we use the 20 Newsgroups dataset as our example. We have frequently used this dataset in BERTopic examples, and we sometimes see a topic generated about health with words such as `drug` and `cancer` being important. However, due to the stochastic nature of UMAP, this topic is not always found.

In order to guide BERTopic to that topic, we create a seed topic list that we pass through our model. However, there may be several other topics that we know should be in the documents. Let's also initialize those:
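A minimal sketch of such a seed topic list, assuming the `seed_topic_list` parameter of `BERTopic` and illustrative seed words:

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# One list of seed words per topic we expect to find (illustrative choices)
seed_topic_list = [
    ["drug", "cancer", "drugs", "doctor"],   # health
    ["windows", "drive", "dos", "file"],     # computers
    ["space", "launch", "orbit", "lunar"],   # space
]

topic_model = BERTopic(seed_topic_list=seed_topic_list)
topics, probs = topic_model.fit_transform(docs)
```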
When we run `topic_model.get_topic_info()`, you will see an overview that contains both the predefined zeroshot topics and any newly clustered topics.

The `zeroshot_min_similarity` parameter controls how many of the documents are assigned to the predefined zeroshot topics. Lower this value and more documents will be assigned to zeroshot topics while fewer documents will be clustered. Increase this value and fewer documents will be assigned to zeroshot topics while more documents will be clustered.
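For example, lowering the threshold assigns more documents to the predefined topics; the topic labels and value below are purely illustrative:

```python
from bertopic import BERTopic

topic_model = BERTopic(
    zeroshot_topic_list=["health and medicine", "space and astronomy", "computer hardware"],
    zeroshot_min_similarity=0.7,   # lower threshold -> more documents assigned to zeroshot topics
)
```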
Seed (domain-specific) Words

When performing Topic Modeling, you are often faced with data that you are familiar with to a certain extent, or that speaks a very specific language. In those cases, topic modeling techniques might have difficulties capturing and representing the semantic nature of domain-specific abbreviations, slang, short forms, acronyms, etc. For example, the "TNM" classification is a method for identifying the stage of most cancers. The word "TNM" is an abbreviation and might not be correctly captured in generic embedding models.
To make sure that certain domain-specific words are weighted higher and are more often used in topic representations, you can set any number of `seed_words` in the `bertopic.vectorizers.ClassTfidfTransformer`. The `ClassTfidfTransformer` is the base representation of BERTopic and essentially represents each topic as a bag of words. As such, we can choose to increase the importance of certain words, such as "TNM".

To do so, let's take a look at an example. We have a dataset of article abstracts and want to perform some topic modeling. Since we might be familiar with the data, there are certain words that we know should be generally important. Let's assume that we have in-depth knowledge about reinforcement learning and know that words like "agent" and "robot" should be important in such a topic were it to be found. Using the `ClassTfidfTransformer`, we can define those `seed_words` and also choose by how much their values are multiplied.

The full example is then as follows:
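A sketch of what such an example might look like, assuming the `seed_words` and `seed_multiplier` parameters of `ClassTfidfTransformer`; the dataset and the exact seed words are illustrative:

```python
from datasets import load_dataset
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer

# Example data: machine-learning paper abstracts (any list of abstracts works here)
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
abstracts = dataset["abstract"][:5_000]

# Boost the c-TF-IDF weight of the chosen domain-specific words by a multiplier
ctfidf_model = ClassTfidfTransformer(
    seed_words=["agent", "robot", "behavior", "policies", "environment"],
    seed_multiplier=2,
)

# Plug the seeded c-TF-IDF model into BERTopic
topic_model = BERTopic(ctfidf_model=ctfidf_model, min_topic_size=15)
topics, probs = topic_model.fit_transform(abstracts)
```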
Then, when we run `topic_model.get_topic(0)`, the output includes some of the seed words that we assigned. However, if a word is not found to be important in a topic, then we can still multiply its importance but it will remain relatively low. This is a great feature, as it allows you to increase the importance of seed words with less risk of making words important in topics that really should not contain them.
A benefit of this method is that it often influences all other representation methods, like KeyBERTInspired and OpenAI. The reason for this is that each representation model uses the words generated by the `ClassTfidfTransformer` as candidate words to be further optimized. In many cases, words like "TNM" might not end up among the candidate words. By increasing their importance, they are more likely to end up as candidate words in representation models.

Another benefit of using this method is that it artificially increases the interpretability of topics. Sure, some words might be more important than others, but they might not mean anything to a domain expert. For them, certain words, like "TNM", are highly descriptive, and that is something difficult to capture using any method (embedding model, large language model, etc.).
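To illustrate, a minimal sketch of combining a seeded `ClassTfidfTransformer` with a representation model such as `KeyBERTInspired`; the seed word and multiplier are illustrative:

```python
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired

# Boost "TNM" so it is more likely to appear among the candidate words
# that are handed to the representation model
ctfidf_model = ClassTfidfTransformer(seed_words=["TNM"], seed_multiplier=2)

# KeyBERTInspired fine-tunes the candidate words produced by the c-TF-IDF step
representation_model = KeyBERTInspired()

topic_model = BERTopic(
    ctfidf_model=ctfidf_model,
    representation_model=representation_model,
)
```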
Moreover, these `seed_words` can be defined together with the domain expert, as they can decide what type of words are generally important and might need a nudge from you, the algorithmic developer.