Zeroshot Topic Modeling #1572

Merged: 20 commits from zeroshot into master on Nov 27, 2023
Conversation

MaartenGr (Owner) commented Oct 11, 2023

Zeroshot Topic Modeling

Zeroshot Topic Modeling is a technique that allows you to find predefined topics in large amounts of documents. When faced with many documents, you often already have an idea of which topics will definitely be in there, whether that comes from simply knowing your data or from a domain expert who helps define those topics.

This method allows you to not only find those specific topics but also create new topics for documents that do not fit with your predefined topics.
This allows for extensive flexibility, as there are three scenarios to explore.

First, both zeroshot topics and clustered topics are detected. This means that some documents fit the predefined topics while others do not. For the latter, new topics are found.

Second, only zeroshot topics are detected. Here, we do not need to find additional topics since all original documents are assigned to one of the predefined topics.

Third, no zeroshot topics are detected. This means that none of the documents fit the predefined topics and a regular BERTopic model is run.

[Figure: overview of zeroshot topic modeling]

This method works as follows. First, we create a number of labels for our predefined topics and embed them using any embedding model. Then, we compare the embeddings of the documents with those of the predefined labels using cosine similarity. If the similarity passes a user-defined threshold, the zeroshot topic is assigned to the document. If it does not, that document, along with the other unassigned documents, is put through a regular BERTopic model.

This creates two models: one for the zeroshot topics and one for the non-zeroshot topics. We combine these two BERTopic models into a single model that contains both zeroshot and non-zeroshot topics.
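To give a rough idea of the assignment step (a simplified sketch, not the internal implementation), using sentence-transformers and scikit-learn:

# Minimal sketch of the zeroshot assignment step: embed the predefined labels
# and the documents, then assign a document to its best-matching label only if
# the cosine similarity passes the threshold.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

embedding_model = SentenceTransformer("thenlper/gte-small")
zeroshot_topic_list = ["Clustering", "Topic Modeling", "Large Language Models"]
docs = ["We cluster documents using HDBSCAN.", "A recipe for sourdough bread."]

label_embeddings = embedding_model.encode(zeroshot_topic_list)
doc_embeddings = embedding_model.encode(docs)
similarities = cosine_similarity(doc_embeddings, label_embeddings)

threshold = 0.85
for doc, sims in zip(docs, similarities):
    if sims.max() >= threshold:
        print(f"{doc!r} -> {zeroshot_topic_list[sims.argmax()]}")
    else:
        print(f"{doc!r} -> left for regular BERTopic clustering")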

Example

To demonstrate zeroshot topic modeling, we use a subset of ArXiv abstracts as our example. We define a number of topics that we know are in the documents, such as clustering, topic modeling, and large language models. Documents that match one of these predefined topics are assigned to it directly, while the remaining documents are clustered into new topics:

from bertopic import BERTopic
from datasets import load_dataset

# We select a subsample of 5000 abstracts from ArXiv
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
docs = dataset["abstract"][:5_000]

# We define a number of topics that we know are in the documents
zeroshot_topic_list = ["Clustering", "Topic Modeling", "Large Language Models"]

# We fit our model using the zeroshot topics
# and we define a minimum similarity. For each document,
# if the similarity does not exceed that value, it will be used
# for clustering instead.
topic_model = BERTopic(
    embedding_model="thenlper/gte-small", 
    min_topic_size=15,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=.85
)
topics, probs = topic_model.fit_transform(docs)

When we run topic_model.get_topic_info(), we see something like this:

[Figure: example get_topic_info() output]

The zeroshot_min_similarity parameter controls how many of the documents are assigned to the predefined zeroshot topics. Lower this value and more documents will be assigned to zeroshot topics while fewer documents will be clustered. Increase this value and fewer documents will be assigned to zeroshot topics while more documents will be clustered.
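For example, if too few documents end up in the zeroshot topics, you could refit with a lower threshold (the .7 below is purely illustrative):

# Refit with a lower similarity threshold so that more documents pass the cut-off
# and are assigned to the predefined zeroshot topics.
topic_model = BERTopic(
    embedding_model="thenlper/gte-small",
    min_topic_size=15,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=.7
)
topics, probs = topic_model.fit_transform(docs)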

Seed (domain-specific) Words

When performing topic modeling, you are often faced with data that you are familiar with to a certain extent or that speaks a very specific language. In those cases, topic modeling techniques might have difficulties capturing and representing the semantic nature of domain-specific abbreviations, slang, shorthand, acronyms, etc. For example, the "TNM" classification is a method for identifying the stage of most cancers. The word "TNM" is an abbreviation and might not be correctly captured in generic embedding models.

To make sure that certain domain-specific words are weighted higher and are more often used in topic representations, you can set any number of seed_words in the bertopic.vectorizers.ClassTfidfTransformer. The ClassTfidfTransformer is the base representation of BERTopic and essentially represents each topic as a bag of words. As such, we can choose to increase the importance of certain words, such as "TNM".

To do so, let's take a look at an example. We have a dataset of article abstracts and want to perform some topic modeling. Since we might be familiar with the data, there are certain words that we know should be generally important. Let's assume that we have in-depth knowledge about reinforcement learning and know that words like "agent" and "robot" should be important in such a topic were it to be found. Using the ClassTfidfTransformer, we can define those seed_words and also choose by how much their values are multiplied.

The full example is then as follows:

from umap import UMAP
from datasets import load_dataset
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer

# Let's take a subset of ArXiv abstracts as the training data
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
abstracts = dataset["abstract"][:5_000]

# For illustration purposes, we make sure the output is fixed when running this code multiple times
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

# We can choose any number of seed words for which we want their representation
# to be strengthened. We increase the importance of these words as we want them to be more
# likely to end up in the topic representations.
ctfidf_model = ClassTfidfTransformer(
    seed_words=["agent", "robot", "behavior", "policies", "environment"], 
    seed_multiplier=2
)

# We run the topic model with the seeded words
topic_model = BERTopic(
    umap_model=umap_model,
    min_topic_size=15,
    ctfidf_model=ctfidf_model,
).fit(abstracts)

Then, when we run topic_model.get_topic(0), we get the following output:

[('policy', 0.023413102511982354),
 ('reinforcement', 0.021796126795834238),
 ('agent', 0.021131601305431902),
 ('policies', 0.01888385271486409),
 ('environment', 0.017819874593917057),
 ('learning', 0.015321710504308708),
 ('robot', 0.013881115279230468),
 ('control', 0.013297705894983875),
 ('the', 0.013247933839985382),
 ('to', 0.013058208312484141)]

As we can see, the output includes some of the seed words that we assigned. However, if a word is not found to be important in a topic, then multiplying its importance will still leave it relatively low. This is a great feature as it allows you to boost the importance of seed words with less risk of making words important in topics where they really should not be.

A benefit of this method is that this often influences all other representation methods, like KeyBERTInspired and OpenAI. The reason for this is that each representation model uses the words generated by the ClassTfidfTransformer as candidate words to be further optimized. In many cases, words like "TNM" might not end up in the candidate words. By increasing their importance, they are more likely to end up as candidate words in representation models.
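As a rough sketch of how that chaining might look (the seed words here are illustrative, and KeyBERTInspired is just one of the representation models you could plug in):

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.vectorizers import ClassTfidfTransformer

# Seed words such as "tnm" are boosted in the c-TF-IDF step, which makes them
# more likely to appear among the candidate words. They are lowercased here to
# match the default, lowercased CountVectorizer vocabulary.
ctfidf_model = ClassTfidfTransformer(seed_words=["tnm", "staging"], seed_multiplier=2)

# KeyBERTInspired then re-ranks those candidate words when building the
# final topic representation.
representation_model = KeyBERTInspired()

topic_model = BERTopic(
    ctfidf_model=ctfidf_model,
    representation_model=representation_model,
).fit(abstracts)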

Another benefit of using this method is that it artificially increases the interpretability of topics. Sure, some words might be more important than others, but they might not mean anything to a domain expert. For them, certain words, like "TNM", are highly descriptive, and that is something difficult to capture using any method (embedding model, large language model, etc.).

Moreover, these seed_words can be defined together with the domain expert, as they can decide what type of words are generally important and might need a nudge from you, the algorithmic developer.

@MaartenGr mentioned this pull request Oct 23, 2023

linxule commented Nov 13, 2023

Hi Maarten,

I am following up on the use of seed_words in bertopic.vectorizer.ClassTfidfTransformer. I have a particular interest in refining the handling of seed words and their variants, such as "budget" and "budgeting".

My query revolves around the potential for regex pattern matching to group similar seed words, avoiding stemming or lemmatization of the original text, which might lead to loss of meaning or contextual ambiguity. Specifically, I am considering the following approaches:

  1. Regex Pattern Matching: Could regex pattern matching be implemented within ClassTfidfTransformer to efficiently group seed words such as "budget" and "budgeting" without reducing them to their root forms through preprocessing?

  2. Alternative Stemming Method: As an alternative, what are your thoughts on stemming an extensive list of English words to identify groups that share the same root? This could automate the grouping of similar seed words but may raise concerns regarding practicality and computational efficiency.

  3. Leveraging Synonyms and Antonyms: Additionally, would integrating external resources like WordNet from NLTK to find synonyms and antonyms be advisable for a broader semantic understanding in BERTopic?

Your expertise and insights on these proposed methods would be greatly appreciated, especially considering the balance between semantic integrity and computational efficiency.

Thank you for your guidance and contributions to this field.

Cheers,
Xule

MaartenGr (Owner, Author) commented

@linxule Thanks for sharing your use cases and suggestions!

I am following up on the use of seed_words in bertopic.vectorizer.ClassTfidfTransformer. I have a particular interest in refining the handling of seed words and their variants, such as "budget" and "budgeting".

My query revolves around the potential for regex pattern matching to group similar seed words, avoiding stemming or lemmatization of the original text, which might lead to loss of meaning or contextual ambiguity.

The way seed_words is currently handled is by taking the IDF values and multiplying them by a user-defined value. That way, the user can control the degree to which the seed_words impact the resulting topic representations. However, the ClassTfidfTransformer is actually not the one handling the words themselves, merely the bag-of-word representations. As a result, the vocabulary is created through the CountVectorizer before passing the bag-of-words, not the vocabulary, to the ClassTfidfTransformer.

In other words, the ClassTfidfTransformer does not currently have access to the vocabulary. Of course, it is possible to give it that access, but I am not sure whether we should, since parsing the vocabulary is the task of the tokenizer, not necessarily the weighting mechanism.

Another approach would be to apply one of the methods you mentioned yourself: extract the vocabulary from the input data with the CountVectorizer, group it, and then pass it back to a new CountVectorizer. The CountVectorizer actually has a vocabulary parameter, so you could build the vocabulary yourself with whatever method suits your use case.

Note that this method allows you to group the vocabulary before passing it to BERTopic. If we were to implement it within ClassTfidfTransformer, it would have to be re-calculated each time you run BERTopic.
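A rough sketch of that route, using stemming as one possible way to group word forms. Note that keeping a single surface form per stem simply drops the other forms from the vocabulary rather than merging their counts; merging counts would require a custom analyzer instead.

from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer
from bertopic import BERTopic

# A tiny corpus for illustration; in practice this would be your full set of documents.
docs = [
    "The annual budget was approved by the committee.",
    "Budgeting for the next fiscal year has already started.",
]

# 1. Extract the raw vocabulary (with a real corpus, a min_df cut-off keeps it manageable).
cv = CountVectorizer().fit(docs)
vocabulary = cv.get_feature_names_out()

# 2. Group word forms by stem and keep one representative per group,
#    so that "budget" and "budgeting" collapse into a single entry.
stemmer = PorterStemmer()
grouped = {}
for word in vocabulary:
    grouped.setdefault(stemmer.stem(word), word)
reduced_vocabulary = list(grouped.values())

# 3. Pass the reduced vocabulary to a new CountVectorizer and use it in BERTopic,
#    which would then be fit on the full corpus.
vectorizer_model = CountVectorizer(vocabulary=reduced_vocabulary)
topic_model = BERTopic(vectorizer_model=vectorizer_model)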

Having said that, let me briefly go through the examples you mentioned.

Regex Pattern Matching: Could regex pattern matching be implemented within ClassTfidfTransformer to efficiently group seed words such as "budget" and "budgeting" without reducing them to their root forms through preprocessing?

As mentioned above, I am not sure it should be implemented within the ClassTfidfTransformer, but for the sake of simplicity let's assume it could be. Using regex on potentially millions of tokens is quite the task and can take a while. Therefore, doing this before running BERTopic would be preferred.

Alternative Stemming Method: As an alternative, what are your thoughts on stemming an extensive list of English words to identify groups that share the same root? This could automate the grouping of similar seed words but may raise concerns regarding practicality and computational efficiency.

Note that whatever method you choose, you are likely still parsing a large vocabulary, which becomes more feasible if you select a subset of that vocabulary, for instance by keeping only words that appear at least n times. Stemming is a great option, but since the stemmed words are semantically similar, using something like MaximalMarginalRelevance seems more appropriate to reduce words with similar meaning.

Leveraging Synonyms and Antonyms: Additionally, would integrating external resources like WordNet from NLTK to find synonyms and antonyms be advisable for a broader semantic understanding in BERTopic?

This does not have a yes or no answer and depends highly on your use case, but external resources could indeed help find synonyms and antonyms. However, as mentioned above, removing similar words seems like a job for MaximalMarginalRelevance.
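For reference, using MaximalMarginalRelevance looks roughly like this (the diversity value is just an example; higher values remove more near-duplicate words such as stemmed variants):

from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance

# Re-rank the candidate topic words so that semantically similar words
# (e.g. "budget" and "budgeting") are less likely to appear together.
representation_model = MaximalMarginalRelevance(diversity=0.3)
topic_model = BERTopic(representation_model=representation_model)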

All in all, I would advise doing this grouping first and then supplying the grouped words to the vocabulary parameter in the CountVectorizer. That seems the most efficient route.

@MaartenGr MaartenGr merged commit 61a2cd2 into master Nov 27, 2023
2 checks passed
@MaartenGr MaartenGr deleted the zeroshot branch May 12, 2024 09:03