
Fix #2102 #2105

Merged: merged 6 commits into master on Aug 21, 2024

Conversation

MaartenGr (Owner) commented Aug 1, 2024

What does this PR do?

Fixes #2102

This fixes a number of things:

  • Incorrect ordering of the topic embeddings caused by using "Old_ID" instead of "ID" (see the sketch below)
  • An unneeded ValueError raised when nr_topics="auto" is combined with zero-shot topic modeling
  • Attempting to transform with UMAP even when no documents were clustered (only zero-shot topics)
  • Adding the TopicMapper at the right moment so that all topics are added to the mapper, not only a subset

I updated the way probabilities were returned as I faced some issues with selecting the correct topics. I might change it back after more testing.
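
As a rough illustration of the first point, here is a minimal, hypothetical sketch (not the actual BERTopic internals) of how ordering embeddings by "Old_ID" instead of the remapped "ID" misaligns them:

import numpy as np
import pandas as pd

# Hypothetical mapping table after topics have been remapped:
# "Old_ID" is the ID before remapping, "ID" is the final topic ID.
df = pd.DataFrame({"Old_ID": [2, 0, 1], "ID": [0, 1, 2]})

# One embedding per topic, keyed by the old topic ID.
embeddings = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0]), 2: np.array([0.5, 0.5])}

# Buggy ordering: stacking rows sorted by "Old_ID" means row i is not topic i's embedding.
buggy = np.vstack([embeddings[old] for old in df.sort_values("Old_ID")["Old_ID"]])

# Correct ordering: sort by the final "ID" so that row i belongs to final topic i.
fixed = np.vstack([embeddings[old] for old in df.sort_values("ID")["Old_ID"]])

print(buggy[0])  # old topic 0's embedding, but final topic 0 maps to old topic 2
print(fixed[0])  # old topic 2's embedding, correctly aligned with final topic 0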

Before submitting

  • This PR fixes a typo or improves the docs (if yes, ignore all other checks!).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes (if applicable)?
  • Did you write any new necessary tests?

Comment on lines +4533 to +4536
# No need to sort if it's the first pass of zero-shot topic modeling
nr_zeroshot = len(self._topic_id_to_zeroshot_topic_idx)
if self._is_zeroshot and not self.nr_topics and nr_zeroshot > 0:
return documents
ianrandman (Contributor):

I thought the plan was to update self._topic_id_to_zeroshot_topic_idx in ._sort_mappings_by_frequency if zeroshot, and then not use self.topic_mapper_.get_mappings() in .topic_labels_?

MaartenGr (Owner, Author):

Ah right, I wasn't familiar enough with self._topic_id_to_zeroshot_topic_idx so I wasn't sure of the best way to change it. This seemed faster and is working now, but if you have suggestions on how to change it, feel free to share.
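
For reference, a minimal, hypothetical sketch (names and shapes assumed, not BERTopic's actual code) of the kind of re-keying being discussed here: updating a {topic_id: zeroshot_topic_idx} dict whenever topic IDs are remapped, so it stays correct without applying the topic mapper afterwards:

# Hypothetical helper: re-key the zero-shot dict with the old-to-new topic ID mapping
# produced by a step such as sorting topics by frequency.
def remap_zeroshot_dict(topic_id_to_zeroshot_idx, old_to_new):
    return {
        old_to_new[old_id]: zeroshot_idx
        for old_id, zeroshot_idx in topic_id_to_zeroshot_idx.items()
        if old_id in old_to_new
    }

old_to_new = {0: 2, 1: 0, 2: 1}   # topic remapping from sorting
zeroshot = {0: 0, 2: 1}           # topic ID -> index into zeroshot_topic_list
print(remap_zeroshot_dict(zeroshot, old_to_new))  # {2: 0, 1: 1}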

@@ -467,7 +469,6 @@ def fit_transform(
# All documents matches zero-shot topics
documents = assigned_documents
embeddings = assigned_embeddings
topics_before_reduction = self.topics_

# Sort and Map Topic IDs by their frequency
if not self.nr_topics:
ianrandman (Contributor):

Why does sorting mappings not occur if nr_topics is passed?

MaartenGr (Owner, Author):

The topics_before_reduction were used for calculating the probabilities, and after I made a couple of changes, it started giving index errors when attempting to access the similarity matrix.

MaartenGr (Owner, Author):

Specifically, if I add nr_topics="auto" when training BERTopic using the latest commit in this PR with the following setup:

from bertopic.representation import KeyBERTInspired

# KeyBERT
keybert_model = KeyBERTInspired()

# Pass the above models to be used in BERTopic
topic_model = UpdatedBERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    verbose=True,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=.5,
    nr_topics="auto",
    calculate_probabilities=True,
    representation_model=keybert_model
)
topic_model = topic_model.fit(abstracts, embeddings)

Then that gives me the following error:

IndexError                                Traceback (most recent call last)
Cell In[13], line 18
      6 # Pass the above models to be used in BERTopic
      7 topic_model = UpdatedBERTopic(
      8     embedding_model=embedding_model,
      9     umap_model=umap_model,
   (...)
     16     representation_model=keybert_model
     17 )
---> 18 topic_model = topic_model.fit(abstracts, embeddings)

File ~\Documents\Projects\BERTopic\bertopic\_bertopic.py:364, in BERTopic.fit(self, documents, embeddings, images, y)
    322 def fit(
    323     self,
    324     documents: List[str],
   (...)
    327     y: Union[List[int], np.ndarray] = None,
    328 ):
    329     """Fit the models (Bert, UMAP, and, HDBSCAN) on a collection of documents and generate topics.
    330 
    331     Arguments:
   (...)
    362     ```
    363     """
--> 364     self.fit_transform(documents=documents, embeddings=embeddings, y=y, images=images)
    365     return self

Cell In[12], line 300, in UpdatedBERTopic.fit_transform(self, documents, embeddings, images, y)
    292     else:
    293         # Use `topics_before_reduction` because `self.topics_` may have already been updated from
    294         # reducing topics, and the original probabilities are needed for `self._map_probabilities()`
    295         probabilities = sim_matrix[
    296             np.arange(len(documents)),
    297             np.array(topics_before_reduction) + self._outliers,
    298         ]
--> 300 self.probabilities_ = self._map_probabilities(probabilities, original_topics=True)
    301 predictions = documents.Topic.to_list()
    303 return predictions, self.probabilities_

File ~\Documents\Projects\BERTopic\bertopic\_bertopic.py:4581, in _map_probabilities(self, probabilities, original_topics)
   4578             if to_topic != -1 and from_topic != -1:
   4579                 mapped_probabilities[:, to_topic] += probabilities[:, from_topic]
-> 4581         return mapped_probabilities
   4583 return probabilities

IndexError: index 71 is out of bounds for axis 1 with size 71
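
For intuition, a small, hypothetical reproduction of this failure mode (the numbers are made up to mirror the error): probabilities are computed for 71 topics, but the mapping still references topic 71, one past the last valid column:

import numpy as np

probabilities = np.random.rand(5, 71)          # 5 documents x 71 topics
mapped_probabilities = np.zeros_like(probabilities)
mappings = [(0, 0), (1, 1), (70, 71)]          # (from_topic, to_topic); column 71 does not exist

for from_topic, to_topic in mappings:
    if to_topic != -1 and from_topic != -1:
        mapped_probabilities[:, to_topic] += probabilities[:, from_topic]
# IndexError: index 71 is out of bounds for axis 1 with size 71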

MaartenGr (Owner, Author):

@ianrandman If you have the time, could you take a look at the open comments? If we can resolve them, I can start putting out a new minor version with the fix.

MaartenGr mentioned this pull request on Aug 15, 2024
ianrandman (Contributor) commented Aug 15, 2024

> @ianrandman If you have the time, could you take a look at the open comments? If we can resolve them, I can start putting out a new minor version with the fix.

Sorry about the delay. I can get back to you within a day with more detailed comments. Some quick comments I can give now: it looks like your solution uses my temporary fix for the topic labels property rather than changing the zeroshot idx dict as the topic mapping changes. Using the temp fix results in incorrectness (I think) with topic reduction, because the implementation there relies on changing the zeroshot idx dict as the topic mapping changes.

I think that anytime the zeroshot idx dict is looked at, it should be correct without needing to apply the mapping from the topic mapper. As far as I know, all that is needed to achieve that, given the current implementation, is mentioned in #2105 (comment).

Let me know your thoughts on this and whether I (hopefully don't) need to provide more detail.

MaartenGr (Owner, Author):

@ianrandman Thank you for the quick reply!

> Sorry about the delay. I can get back to you within a day with more detailed comments. Some quick comments I can give now: it looks like your solution uses my temporary fix for the topic labels property rather than changing the zeroshot idx dict as the topic mapping changes. Using the temp fix results in incorrectness (I think) with topic reduction, because the implementation there relies on changing the zeroshot idx dict as the topic mapping changes.

Hmm, I tested the topic reduction a couple of times, and although it works, it does indeed merge the zero-shot topics.

> I think that anytime the zeroshot idx dict is looked at, it should be correct without needing to apply the mapping from the topic mapper. As far as I know, all that is needed to achieve that, given the current implementation, is mentioned in #2105 (comment).

I've been looking through the zeroshot idx dict, but I'm just not sure where and how to update it for your proposed fix. It seems I do not have a sufficient grasp of the logic here to make the changes.

MaartenGr merged commit 0b4265a into master on Aug 21, 2024
6 checks passed

Successfully merging this pull request may close these issues.

incorrect result by topic_model.get_topic_info() due to zeroshot_topic_list was set