Zeroshot Topic Modeling #1572

Merged: 20 commits from zeroshot into master on Nov 27, 2023
Conversation

MaartenGr (Owner) commented Oct 11, 2023

Zeroshot Topic Modeling

Zeroshot Topic Modeling is a technique that allows you to find predefined topics in large amounts of documents. When faced with many documents, you often already have an idea of which topics will definitely be in there, whether that comes from simply knowing your data or from a domain expert who helps define those topics.

This method allows you to not only find those specific topics but also create new topics for documents that do not fit with your predefined topics.
This allows for extensive flexibility, as there are three scenarios to explore.

First, both zeroshot topics and clustered topics are detected. This means that some documents fit the predefined topics while others do not. For the latter, new topics are found.

Second, only zeroshot topics are detected. Here, we do not need to find additional topics since all original documents are assigned to one of the predefined topics.

Third, no zeroshot topics are detected. This means that none of the documents fit the predefined topics and a regular BERTopic model is run.

[Figure: overview of zeroshot topic modeling]

This method works as follows. First, we create a number of labels for our predefined topics and embed them using any embedding model. Then, we compare the embeddings of the documents with those of the predefined labels using cosine similarity. If the similarity passes a user-defined threshold, the zeroshot topic is assigned to the document. If it does not, that document, along with the other unassigned documents, is put through a regular BERTopic model.

This creates two models: one for the zeroshot topics and one for the non-zeroshot topics. We combine these two BERTopic models into a single model that contains both zeroshot and non-zeroshot topics.
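To give a rough idea of the assignment step (a simplified sketch, not the internal implementation), using sentence-transformers and scikit-learn:

# Minimal sketch of the zeroshot assignment step: embed the predefined labels
# and the documents, then assign a document to its best-matching label only if
# the cosine similarity passes the threshold.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

embedding_model = SentenceTransformer("thenlper/gte-small")
zeroshot_topic_list = ["Clustering", "Topic Modeling", "Large Language Models"]
docs = ["We cluster documents using HDBSCAN.", "A recipe for sourdough bread."]

label_embeddings = embedding_model.encode(zeroshot_topic_list)
doc_embeddings = embedding_model.encode(docs)
similarities = cosine_similarity(doc_embeddings, label_embeddings)

threshold = 0.85
for doc, sims in zip(docs, similarities):
    if sims.max() >= threshold:
        print(f"{doc!r} -> {zeroshot_topic_list[sims.argmax()]}")
    else:
        print(f"{doc!r} -> left for regular BERTopic clustering")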

Example

To demonstrate zeroshot topic modeling, we use a subset of ArXiv abstracts as our example. We define a number of topics that we know are in the documents, such as clustering, topic modeling, and large language models. Documents that match one of these predefined topics are assigned to it directly, while the remaining documents are clustered into new topics:

from bertopic import BERTopic
from datasets import load_dataset

# We select a subsample of 5000 abstracts from ArXiv
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
docs = dataset["abstract"][:5_000]

# We define a number of topics that we know are in the documents
zeroshot_topic_list = ["Clustering", "Topic Modeling", "Large Language Models"]

# We fit our model using the zeroshot topics
# and we define a minimum similarity. For each document,
# if the similarity does not exceed that value, it will be used
# for clustering instead.
topic_model = BERTopic(
    embedding_model="thenlper/gte-small", 
    min_topic_size=15,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=.85
)
topics, probs = topic_model.fit_transform(docs)

When we run topic_model.get_topic_info(), we see something like this:

[Figure: example get_topic_info() output]

The zeroshot_min_similarity parameter controls how many of the documents are assigned to the predefined zeroshot topics. Lower this value and more documents will be assigned to zeroshot topics while fewer documents will be clustered. Increase this value and fewer documents will be assigned to zeroshot topics while more documents will be clustered.
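For example, if too few documents end up in the zeroshot topics, you could refit with a lower threshold (the .7 below is purely illustrative):

# Refit with a lower similarity threshold so that more documents pass the cut-off
# and are assigned to the predefined zeroshot topics.
topic_model = BERTopic(
    embedding_model="thenlper/gte-small",
    min_topic_size=15,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=.7
)
topics, probs = topic_model.fit_transform(docs)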

Seed (domain-specific) Words

When performing topic modeling, you are often faced with data that you are familiar with to a certain extent or that speaks a very specific language. In those cases, topic modeling techniques might have difficulties capturing and representing the semantic nature of domain-specific abbreviations, slang, shorthand, acronyms, etc. For example, the "TNM" classification is a method for identifying the stage of most cancers. The word "TNM" is an abbreviation and might not be correctly captured in generic embedding models.

To make sure that certain domain-specific words are weighted higher and are more often used in topic representations, you can set any number of seed_words in the bertopic.vectorizers.ClassTfidfTransformer. The ClassTfidfTransformer is the base representation of BERTopic and essentially represents each topic as a bag of words. As such, we can choose to increase the importance of certain words, such as "TNM".

To do so, let's take a look at an example. We have a dataset of article abstracts and want to perform some topic modeling. Since we might be familiar with the data, there are certain words that we know should be generally important. Let's assume that we have in-depth knowledge about reinforcement learning and know that words like "agent" and "robot" should be important in such a topic were it to be found. Using the ClassTfidfTransformer, we can define those seed_words and also choose by how much their values are multiplied.

The full example is then as follows:

from umap import UMAP
from datasets import load_dataset
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer

# Let's take a subset of ArXiv abstracts as the training data
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
abstracts = dataset["abstract"][:5_000]

# For illustration purposes, we make sure the output is fixed when running this code multiple times
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

# We can choose any number of seed words for which we want their representation
# to be strengthened. We increase the importance of these words as we want them to be more
# likely to end up in the topic representations.
ctfidf_model = ClassTfidfTransformer(
    seed_words=["agent", "robot", "behavior", "policies", "environment"], 
    seed_multiplier=2
)

# We run the topic model with the seeded words
topic_model = BERTopic(
    umap_model=umap_model,
    min_topic_size=15,
    ctfidf_model=ctfidf_model,
).fit(abstracts)

Then, when we run topic_model.get_topic(0), we get the following output:

[('policy', 0.023413102511982354),
 ('reinforcement', 0.021796126795834238),
 ('agent', 0.021131601305431902),
 ('policies', 0.01888385271486409),
 ('environment', 0.017819874593917057),
 ('learning', 0.015321710504308708),
 ('robot', 0.013881115279230468),
 ('control', 0.013297705894983875),
 ('the', 0.013247933839985382),
 ('to', 0.013058208312484141)]

As we can see, the output includes some of the seed words that we assigned. However, if a word is not found to be important in a topic, then multiplying its importance will still leave it relatively low. This is a great feature as it allows you to boost the importance of seed words with less risk of making words important in topics where they really should not be.

A benefit of this method is that this often influences all other representation methods, like KeyBERTInspired and OpenAI. The reason for this is that each representation model uses the words generated by the ClassTfidfTransformer as candidate words to be further optimized. In many cases, words like "TNM" might not end up in the candidate words. By increasing their importance, they are more likely to end up as candidate words in representation models.
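As a rough sketch of how that chaining might look (the seed words here are illustrative, and KeyBERTInspired is just one of the representation models you could plug in):

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.vectorizers import ClassTfidfTransformer

# Seed words such as "tnm" are boosted in the c-TF-IDF step, which makes them
# more likely to appear among the candidate words. They are lowercased here to
# match the default, lowercased CountVectorizer vocabulary.
ctfidf_model = ClassTfidfTransformer(seed_words=["tnm", "staging"], seed_multiplier=2)

# KeyBERTInspired then re-ranks those candidate words when building the
# final topic representation.
representation_model = KeyBERTInspired()

topic_model = BERTopic(
    ctfidf_model=ctfidf_model,
    representation_model=representation_model,
).fit(abstracts)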

Another benefit of using this method is that it artificially increases the interpretability of topics. Sure, some words might be more important than others, but they might not mean anything to a domain expert. For them, certain words, like "TNM", are highly descriptive, and that is something difficult to capture using any method (embedding model, large language model, etc.).

Moreover, these seed_words can be defined together with the domain expert, as they can decide what type of words are generally important and might need a nudge from you, the algorithmic developer.

@MaartenGr mentioned this pull request Oct 23, 2023

linxule commented Nov 13, 2023

Hi Maarten,

I am following up on the use of seed_words in bertopic.vectorizer.ClassTfidfTransformer. I have a particular interest in refining the handling of seed words and their variants, such as "budget" and "budgeting".

My query revolves around the potential for regex pattern matching to group similar seed words, avoiding stemming or lemmatization of the original text, which might lead to loss of meaning or contextual ambiguity. Specifically, I am considering the following approaches:

  1. Regex Pattern Matching: Could regex pattern matching be implemented within ClassTfidfTransformer to efficiently group seed words such as "budget" and "budgeting" without reducing them to their root forms through preprocessing?

  2. Alternative Stemming Method: As an alternative, what are your thoughts on stemming an extensive list of English words to identify groups that share the same root? This could automate the grouping of similar seed words but may raise concerns regarding practicality and computational efficiency.

  3. Leveraging Synonyms and Antonyms: Additionally, would integrating external resources like WordNet from NLTK to find synonyms and antonyms be advisable for a broader semantic understanding in BERTopic?

Your expertise and insights on these proposed methods would be greatly appreciated, especially considering the balance between semantic integrity and computational efficiency.

Thank you for your guidance and contributions to this field.

Cheers,
Xule

MaartenGr (Owner, Author) commented

@linxule Thanks for sharing your use cases and suggestions!

I am following up on the use of seed_words in bertopic.vectorizer.ClassTfidfTransformer. I have a particular interest in refining the handling of seed words and their variants, such as "budget" and "budgeting".

My query revolves around the potential for regex pattern matching to group similar seed words, avoiding stemming or lemmatization of the original text, which might lead to loss of meaning or contextual ambiguity.

The way seed_words is currently handled is by taking the IDF values and multiplying them by a user-defined value. That way, the user can control the degree to which the seed_words impact the resulting topic representations. However, the ClassTfidfTransformer is actually not the one handling the words themselves, merely the bag-of-word representations. As a result, the vocabulary is created through the CountVectorizer before passing the bag-of-words, not the vocabulary, to the ClassTfidfTransformer.

In other words, the ClassTfidfTransformer does not currently have access to the vocabulary. Of course, it is possible to give it that access, but I am not sure whether we should, since parsing the vocabulary is the task of the tokenizer, not necessarily the weighting mechanism.

Another approach would be to apply one of the methods you mentioned yourself: extract the vocabulary from the input data with the CountVectorizer, group it, and then pass it back to a new CountVectorizer. The CountVectorizer actually has a vocabulary parameter, so you could build the vocabulary yourself with whatever method suits your use case.

Note that this method allows you to group the vocabulary before passing it to BERTopic. If we were to implement it within ClassTfidfTransformer, it would have to be re-calculated each time you run BERTopic.
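A rough sketch of that route, using stemming as one possible way to group word forms. Note that keeping a single surface form per stem simply drops the other forms from the vocabulary rather than merging their counts; merging counts would require a custom analyzer instead.

from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer
from bertopic import BERTopic

# A tiny corpus for illustration; in practice this would be your full set of documents.
docs = [
    "The annual budget was approved by the committee.",
    "Budgeting for the next fiscal year has already started.",
]

# 1. Extract the raw vocabulary (with a real corpus, a min_df cut-off keeps it manageable).
cv = CountVectorizer().fit(docs)
vocabulary = cv.get_feature_names_out()

# 2. Group word forms by stem and keep one representative per group,
#    so that "budget" and "budgeting" collapse into a single entry.
stemmer = PorterStemmer()
grouped = {}
for word in vocabulary:
    grouped.setdefault(stemmer.stem(word), word)
reduced_vocabulary = list(grouped.values())

# 3. Pass the reduced vocabulary to a new CountVectorizer and use it in BERTopic,
#    which would then be fit on the full corpus.
vectorizer_model = CountVectorizer(vocabulary=reduced_vocabulary)
topic_model = BERTopic(vectorizer_model=vectorizer_model)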

Having said that, let me briefly go through the examples you mentioned.

Regex Pattern Matching: Could regex pattern matching be implemented within ClassTfidfTransformer to efficiently group seed words such as "budget" and "budgeting" without reducing them to their root forms through preprocessing?

As mentioned above, I am not sure it should be implemented within the ClassTfidfTransformer, but for the sake of simplicity let's assume it could be. Using regex on potentially millions of tokens is quite the task and can take a while. Therefore, doing this before running BERTopic would be preferred.

Alternative Stemming Method: As an alternative, what are your thoughts on stemming an extensive list of English words to identify groups that share the same root? This could automate the grouping of similar seed words but may raise concerns regarding practicality and computational efficiency.

Note that whatever method you choose, you are likely still parsing a large vocabulary, which becomes more feasible if you select a subset of that vocabulary, for instance by keeping only words that appear at least n times. Stemming is a great option, but since the stemmed words are semantically similar, using something like MaximalMarginalRelevance seems more appropriate to reduce words with similar meaning.

Leveraging Synonyms and Antonyms: Additionally, would integrating external resources like WordNet from NLTK to find synonyms and antonyms be advisable for a broader semantic understanding in BERTopic?

This does not have a yes or no answer and depends highly on your use case, but external resources could indeed help find synonyms and antonyms. However, as mentioned above, removing similar words seems like a job for MaximalMarginalRelevance.
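For reference, using MaximalMarginalRelevance looks roughly like this (the diversity value is just an example; higher values remove more near-duplicate words such as stemmed variants):

from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance

# Re-rank the candidate topic words so that semantically similar words
# (e.g. "budget" and "budgeting") are less likely to appear together.
representation_model = MaximalMarginalRelevance(diversity=0.3)
topic_model = BERTopic(representation_model=representation_model)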

All in all, I would advise doing this grouping first and then supplying the grouped words to the vocabulary parameter in the CountVectorizer. That seems the most efficient route.

@MaartenGr MaartenGr merged commit 61a2cd2 into master Nov 27, 2023
2 checks passed
@MaartenGr MaartenGr deleted the zeroshot branch May 12, 2024 09:03