Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify zero-shot topic modeling #2060

Merged
merged 16 commits into from
Jul 1, 2024

Conversation

ianrandman
Copy link
Contributor

@ianrandman ianrandman commented Jun 21, 2024

  • Fix downstream operations after zero-shot topic modeling
    • No longer perform merging of models for zero-shot. The zero-shot topics act as a prefiltering step before clustering. The zero-shot topics are combined with the clustered topics immediately after clustering.
    • The C-TF-IDF model is fitted on all documents now, regardless of whether they belong to clustered or zero-shot topics.
  • Validate number of topics when using zero-shot topic modeling
  • Remove check for type(self.hdbscan_model) != BaseCluster when checking whether model is zero-shot
    • The HDBSCAN model is replaced with a BaseCluster during fit_transform() during zero-shot topic modeling.
  • Derive self._outliers rather than tracking it to maintain alignment using @property
  • Derive zero-shot labels when requested rather than tracking it using @property
  • Fix typos related to topic_to, topics_from for mapping
  • Validate existence of outliers in reduce_outliers()
  • Maintain zero-shot topics while reducing topics
    • If original topics contain one or more zero-shot topics, the new topic keeps the best zero-shot topic if the cosine similarity with the topic meets zeroshot_min_similarity. Otherwise, the calculated representation is used.

Fixes #1967

- zero-shot topic modeling is now only the equivalent of a clustering step
  - removed implementation where this functionality is done through merging two models
  - all documents are used at once when calculating representations
  - probability comes from cosine similarity when zeroshot topics are used
- validate `nr_topics` with respect to how many zero-shot topics matched
- track `self._outliers` and `self.topic_labels_` using `@property`, as they are derivatives of other attributes
- validate existence of outliers before outlier reduction
Copy link
Owner

@MaartenGr MaartenGr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your work on this! I left a couple of small comments. Other than that, can you run ruff? With a PR that was recently merged, we now use ruff for the formatting/linting.

bertopic/_bertopic.py Outdated Show resolved Hide resolved
bertopic/_bertopic.py Outdated Show resolved Hide resolved
bertopic/_bertopic.py Outdated Show resolved Hide resolved
# Conflicts:
#	bertopic/_bertopic.py
…strings, lower threshold zeroshot test, fix outliers for probabilities during zeroshot (#2)
@ianrandman
Copy link
Contributor Author

Thank you for your work on this! I left a couple of small comments. Other than that, can you run ruff? With a PR that was recently merged, we now use ruff for the formatting/linting.

Thanks for pointing out the recent incorporation with ruff. I'll have to start using that in my own projects.

I have fixed my changes with ruff, fixed the merge confict, and resolved a couple of your comments. Please mark them resolved if my changes look good. There is only the remaining discussion about probabilities.

@freddyheppell
Copy link
Contributor

It looks like this is still failing because I think you only ran one of the two Ruff commands.

Ruff has format, which replaces e.g. Black and check which replaces e.g. flake8. If you run ruff check --fix it should autofix the remaining issues and let you know which need manual fixing. You can also run make format and make lint as shortcuts.

@MaartenGr
Copy link
Owner

I believe the code check in python 3.8 failed because its not familiar with the tuple[..., ....] type hints. I believe replacing them with either from typing import Tuple should work or simply removing those type hints.

@MaartenGr
Copy link
Owner

@ianrandman Awesome, everything passed and I think we addressed all the comments we had. Just to be 100% sure, shall I go ahead and merge this?

@ianrandman
Copy link
Contributor Author

@ianrandman Awesome, everything passed and I think we addressed all the comments we had. Just to be 100% sure, shall I go ahead and merge this?

Yes, all good to merge if it looks good to you. Happy to be done with this :).

@MaartenGr MaartenGr merged commit d1ffb2f into MaartenGr:master Jul 1, 2024
6 checks passed
@MaartenGr
Copy link
Owner

@ianrandman Awesome, thank you for taking the time the last couple of works to work on this. It is greatly appreciated and hopefully this will also make it easier for you to use BERTopic instead of your own fork. If there are any other changes you would like to see, please let me know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Issues with Zero-shot Topic Modeling regarding outliers and future operations
3 participants