
Fix #2102 #2105

Merged: merged 6 commits into master on Aug 21, 2024

Conversation

MaartenGr (Owner) commented Aug 1, 2024

What does this PR do?

Fixes #2102

This fixes a number of things:

  • Incorrect ordering of the topic embeddings caused by using "Old_ID" instead of "ID" (see the sketch below)
  • An unneeded ValueError raised when nr_topics="auto" is combined with zero-shot topic modeling
  • Attempting to transform with UMAP even when no documents were clustered (only zero-shot topics)
  • Adding the TopicMapper at the right moment so that all topics are added to the mapper, not only a subset

I updated the way probabilities were returned as I faced some issues with selecting the correct topics. I might change it back after more testing.
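
As a rough illustration of the first point, here is a minimal, hypothetical sketch (not the actual BERTopic internals) of how ordering embeddings by "Old_ID" instead of the remapped "ID" misaligns them:

import numpy as np
import pandas as pd

# Hypothetical mapping table after topics have been remapped:
# "Old_ID" is the ID before remapping, "ID" is the final topic ID.
df = pd.DataFrame({"Old_ID": [2, 0, 1], "ID": [0, 1, 2]})

# One embedding per topic, keyed by the old topic ID.
embeddings = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0]), 2: np.array([0.5, 0.5])}

# Buggy ordering: stacking rows sorted by "Old_ID" means row i is not topic i's embedding.
buggy = np.vstack([embeddings[old] for old in df.sort_values("Old_ID")["Old_ID"]])

# Correct ordering: sort by the final "ID" so that row i belongs to final topic i.
fixed = np.vstack([embeddings[old] for old in df.sort_values("ID")["Old_ID"]])

print(buggy[0])  # old topic 0's embedding, but final topic 0 maps to old topic 2
print(fixed[0])  # old topic 2's embedding, correctly aligned with final topic 0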

Before submitting

  • This PR fixes a typo or improves the docs (if yes, ignore all other checks!).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes (if applicable)?
  • Did you write any new necessary tests?

Comment on lines +4533 to +4536
# No need to sort if it's the first pass of zero-shot topic modeling
nr_zeroshot = len(self._topic_id_to_zeroshot_topic_idx)
if self._is_zeroshot and not self.nr_topics and nr_zeroshot > 0:
return documents
ianrandman (Contributor):

I thought the plan was to update self._topic_id_to_zeroshot_topic_idx in ._sort_mappings_by_frequency if zeroshot, and then not use self.topic_mapper_.get_mappings() in .topic_labels_?

MaartenGr (Owner, Author):

Ah right, I wasn't familiar enough with self._topic_id_to_zeroshot_topic_idx so I wasn't sure of the best way to change it. This seemed faster and is working now, but if you have suggestions on how to change it, feel free to share.
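
For reference, a minimal, hypothetical sketch (names and shapes assumed, not BERTopic's actual code) of the kind of re-keying being discussed here: updating a {topic_id: zeroshot_topic_idx} dict whenever topic IDs are remapped, so it stays correct without applying the topic mapper afterwards:

# Hypothetical helper: re-key the zero-shot dict with the old-to-new topic ID mapping
# produced by a step such as sorting topics by frequency.
def remap_zeroshot_dict(topic_id_to_zeroshot_idx, old_to_new):
    return {
        old_to_new[old_id]: zeroshot_idx
        for old_id, zeroshot_idx in topic_id_to_zeroshot_idx.items()
        if old_id in old_to_new
    }

old_to_new = {0: 2, 1: 0, 2: 1}   # topic remapping from sorting
zeroshot = {0: 0, 2: 1}           # topic ID -> index into zeroshot_topic_list
print(remap_zeroshot_dict(zeroshot, old_to_new))  # {2: 0, 1: 1}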

@@ -467,7 +469,6 @@ def fit_transform(
# All documents matches zero-shot topics
documents = assigned_documents
embeddings = assigned_embeddings
topics_before_reduction = self.topics_

# Sort and Map Topic IDs by their frequency
if not self.nr_topics:
ianrandman (Contributor):

Why does sorting mappings not occur if nr_topics is passed?

MaartenGr (Owner, Author):

The topics_before_reduction were used for calculating the probabilities, and after I made a couple of changes, it started giving index errors when attempting to access the similarity matrix.

MaartenGr (Owner, Author):

Specifically, if I add nr_topics="auto" when training BERTopic using the latest commit in this PR with the following setup:

from bertopic.representation import KeyBERTInspired

# KeyBERT
keybert_model = KeyBERTInspired()

# Pass the above models to be used in BERTopic
topic_model = UpdatedBERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    verbose=True,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=.5,
    nr_topics="auto",
    calculate_probabilities=True,
    representation_model=keybert_model
)
topic_model = topic_model.fit(abstracts, embeddings)

Then that gives me the following error:

IndexError                                Traceback (most recent call last)
Cell In[13], line 18
      6 # Pass the above models to be used in BERTopic
      7 topic_model = UpdatedBERTopic(
      8     embedding_model=embedding_model,
      9     umap_model=umap_model,
   (...)
     16     representation_model=keybert_model
     17 )
---> 18 topic_model = topic_model.fit(abstracts, embeddings)

File ~\Documents\Projects\BERTopic\bertopic\_bertopic.py:364, in BERTopic.fit(self, documents, embeddings, images, y)
    322 def fit(
    323     self,
    324     documents: List[str],
   (...)
    327     y: Union[List[int], np.ndarray] = None,
    328 ):
    329     """Fit the models (Bert, UMAP, and, HDBSCAN) on a collection of documents and generate topics.
    330 
    331     Arguments:
   (...)
    362     ```
    363     """
--> 364     self.fit_transform(documents=documents, embeddings=embeddings, y=y, images=images)
    365     return self

Cell In[12], line 300, in UpdatedBERTopic.fit_transform(self, documents, embeddings, images, y)
    292     else:
    293         # Use `topics_before_reduction` because `self.topics_` may have already been updated from
    294         # reducing topics, and the original probabilities are needed for `self._map_probabilities()`
    295         probabilities = sim_matrix[
    296             np.arange(len(documents)),
    297             np.array(topics_before_reduction) + self._outliers,
    298         ]
--> 300 self.probabilities_ = self._map_probabilities(probabilities, original_topics=True)
    301 predictions = documents.Topic.to_list()
    303 return predictions, self.probabilities_

File ~\Documents\Projects\BERTopic\bertopic\_bertopic.py:4581, in _map_probabilities(self, probabilities, original_topics)
   4578             if to_topic != -1 and from_topic != -1:
   4579                 mapped_probabilities[:, to_topic] += probabilities[:, from_topic]
-> 4581         return mapped_probabilities
   4583 return probabilities

IndexError: index 71 is out of bounds for axis 1 with size 71
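
For intuition, a small, hypothetical reproduction of this failure mode (the numbers are made up to mirror the error): probabilities are computed for 71 topics, but the mapping still references topic 71, one past the last valid column:

import numpy as np

probabilities = np.random.rand(5, 71)          # 5 documents x 71 topics
mapped_probabilities = np.zeros_like(probabilities)
mappings = [(0, 0), (1, 1), (70, 71)]          # (from_topic, to_topic); column 71 does not exist

for from_topic, to_topic in mappings:
    if to_topic != -1 and from_topic != -1:
        mapped_probabilities[:, to_topic] += probabilities[:, from_topic]
# IndexError: index 71 is out of bounds for axis 1 with size 71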

MaartenGr (Owner, Author):

@ianrandman If you have the time, could you take a look at the open comments? If we can resolve them, I can start putting out a new minor version with the fix.

MaartenGr mentioned this pull request on Aug 15, 2024
ianrandman (Contributor) commented Aug 15, 2024

> @ianrandman If you have the time, could you take a look at the open comments? If we can resolve them, I can start putting out a new minor version with the fix.

Sorry about the delay. I can get back to you within a day with more detailed comments. Some quick comments I can give now: it looks like your solution uses my temporary fix for the topic labels property rather than changing the zeroshot idx dict as the topic mapping changes. Using the temp fix results in incorrectness (I think) with topic reduction, because the implementation there relies on changing the zeroshot idx dict as the topic mapping changes.

I think that anytime the zeroshot idx dict is looked at, it should be correct without needing to apply the mapping from the topic mapper. As far as I know, all that is needed to achieve that, given the current implementation, is mentioned in #2105 (comment).

Let me know your thoughts on this and whether I (hopefully don't) need to provide more detail.

MaartenGr (Owner, Author):

@ianrandman Thank you for the quick reply!

> Sorry about the delay. I can get back to you within a day with more detailed comments. Some quick comments I can give now: it looks like your solution uses my temporary fix for the topic labels property rather than changing the zeroshot idx dict as the topic mapping changes. Using the temp fix results in incorrectness (I think) with topic reduction, because the implementation there relies on changing the zeroshot idx dict as the topic mapping changes.

Hmm, I tested the topic reduction a couple of times, and although it works, it does indeed merge the zero-shot topics.

> I think that anytime the zeroshot idx dict is looked at, it should be correct without needing to apply the mapping from the topic mapper. As far as I know, all that is needed to achieve that, given the current implementation, is mentioned in #2105 (comment).

I've been looking through the zeroshot idx dict, but I'm just not sure where and how to update it for your proposed fix. It seems I do not have a sufficient grasp of the logic here to make the changes.

MaartenGr merged commit 0b4265a into master on Aug 21, 2024
6 checks passed

Successfully merging this pull request may close these issues.

incorrect result by topic_model.get_topic_info() due to zeroshot_topic_list was set