Simplify zero-shot topic modeling #2060

ianrandman · 2024-06-21T05:08:06Z

- Fix downstream operations after zero-shot topic modeling
  - No longer perform merging of models for zero-shot. The zero-shot topics act as a prefiltering step before clustering. The zero-shot topics are combined with the clustered topics immediately after clustering.
  - The C-TF-IDF model is now fitted on all documents, regardless of whether they belong to clustered or zero-shot topics.
- Validate number of topics when using zero-shot topic modeling
- Remove check for `type(self.hdbscan_model) != BaseCluster` when checking whether model is zero-shot
  - The HDBSCAN model is replaced with a `BaseCluster` during `fit_transform()` when zero-shot topic modeling is used.
- Derive `self._outliers` using `@property` rather than tracking it, to maintain alignment
- Derive zero-shot labels using `@property` when requested rather than tracking them
- Fix typos related to `topic_to`, `topics_from` for mapping
- Validate existence of outliers in `reduce_outliers()`
- Maintain zero-shot topics while reducing topics
  - If the original topics contain one or more zero-shot topics, the new topic keeps the best zero-shot topic if its cosine similarity with the new topic meets `zeroshot_min_similarity`. Otherwise, the calculated representation is used.
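The prefiltering idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function names `cosine_similarity` and `prefilter_zeroshot` are hypothetical, embeddings are plain lists of floats, and at least one zero-shot topic is assumed. Only the parameter name `zeroshot_min_similarity` comes from the description.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def prefilter_zeroshot(
    doc_embeddings: Sequence[Sequence[float]],
    zeroshot_embeddings: Sequence[Sequence[float]],
    zeroshot_min_similarity: float,
) -> Tuple[List[int], List[float], List[int]]:
    """Hypothetical helper: assign documents to zero-shot topics before clustering.

    Returns the zero-shot topic index per document (-1 if none matched),
    the best similarity per document, and the indices of the documents
    that are left over for the clustering step.
    Assumes zeroshot_embeddings is non-empty.
    """
    assigned: List[int] = []
    best_sims: List[float] = []
    to_cluster: List[int] = []
    for doc_idx, doc in enumerate(doc_embeddings):
        sims = [cosine_similarity(doc, topic) for topic in zeroshot_embeddings]
        best = max(range(len(sims)), key=sims.__getitem__)
        best_sims.append(sims[best])
        if sims[best] >= zeroshot_min_similarity:
            assigned.append(best)
        else:
            # Below the threshold: leave this document for clustering.
            assigned.append(-1)
            to_cluster.append(doc_idx)
    return assigned, best_sims, to_cluster
```

In this sketch, only the documents in the third return value would be passed on to clustering, and the resulting clustered topics would then be combined with the zero-shot assignments.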

Fixes #1967

- zero-shot topic modeling is now only the equivalent of a clustering step
- removed implementation where this functionality is done through merging two models
- all documents are used at once when calculating representations
- probability comes from cosine similarity when zeroshot topics are used
- validate `nr_topics` with respect to how many zero-shot topics matched
- track `self._outliers` and `self.topic_labels_` using `@property`, as they are derivatives of other attributes
- validate existence of outliers before outlier reduction
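As an illustration of the `@property` point, a derived attribute can be recomputed from the attribute it depends on instead of being stored and updated separately. The class below is a hypothetical stand-in, not BERTopic's actual implementation; `TopicModelSketch` and `validate_outliers_exist` are made-up names, and the assumption is that `-1` marks outlier documents.

```python
from typing import List


class TopicModelSketch:
    """Hypothetical stand-in for a topic model; not BERTopic's actual class."""

    def __init__(self, topics: List[int]) -> None:
        # Per-document topic assignments; -1 is assumed to mark an outlier.
        self.topics_ = topics

    @property
    def _outliers(self) -> int:
        # Derived on every access, so it cannot drift out of sync with topics_.
        return 1 if -1 in self.topics_ else 0

    def validate_outliers_exist(self) -> None:
        # Guard that an outlier-reduction routine could call first.
        if not self._outliers:
            raise ValueError("No outliers to reduce.")
```

Because `_outliers` is computed from `topics_`, any operation that changes the topic assignments is reflected automatically, with no separate bookkeeping.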


MaartenGr

Thank you for your work on this! I left a couple of small comments. Other than that, can you run ruff? With a PR that was recently merged, we now use ruff for the formatting/linting.


ianrandman · 2024-06-24T15:51:54Z

> Thank you for your work on this! I left a couple of small comments. Other than that, can you run ruff? With a PR that was recently merged, we now use ruff for the formatting/linting.

Thanks for pointing out the recent incorporation with ruff. I'll have to start using that in my own projects.

I have fixed my changes with ruff, fixed the merge conflict, and resolved a couple of your comments. Please mark them resolved if my changes look good. The only remaining discussion is the one about probabilities.

freddyheppell · 2024-06-25T13:07:51Z

It looks like this is still failing, I think because you only ran one of the two Ruff commands.

Ruff has `format`, which replaces e.g. Black, and `check`, which replaces e.g. flake8. If you run `ruff check --fix`, it should autofix the remaining issues and let you know which ones need manual fixing. You can also run `make format` and `make lint` as shortcuts.

MaartenGr · 2024-06-26T13:49:17Z

I believe the code check in Python 3.8 failed because it's not familiar with the `tuple[..., ...]` type hints. I believe either replacing them with `Tuple` (via `from typing import Tuple`) or simply removing those type hints should work.
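For context, subscripting the built-in `tuple` in an annotation (for example `tuple[int, int]`) only works from Python 3.9 onward. On 3.8 it raises a `TypeError` when the function is defined, unless annotations are deferred. A small sketch of the 3.8-compatible spelling follows; `split_outliers` is a hypothetical function used only for illustration and is not taken from this PR.

```python
from typing import List, Tuple


def split_outliers(topics: List[int]) -> Tuple[List[int], List[int]]:
    """Return (outlier document indices, non-outlier document indices)."""
    # typing.Tuple and typing.List work on Python 3.8, whereas
    # tuple[...] and list[...] in annotations need Python 3.9+.
    outliers = [i for i, topic in enumerate(topics) if topic == -1]
    assigned = [i for i, topic in enumerate(topics) if topic != -1]
    return outliers, assigned
```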

MaartenGr · 2024-06-29T07:04:37Z

@ianrandman Awesome, everything passed and I think we addressed all the comments we had. Just to be 100% sure, shall I go ahead and merge this?

ianrandman · 2024-06-29T08:13:17Z

> @ianrandman Awesome, everything passed and I think we addressed all the comments we had. Just to be 100% sure, shall I go ahead and merge this?

Yes, all good to merge if it looks good to you. Happy to be done with this :).

MaartenGr · 2024-07-01T09:26:30Z

@ianrandman Awesome, thank you for taking the time over the last couple of weeks to work on this. It is greatly appreciated, and hopefully this will also make it easier for you to use BERTopic instead of your own fork. If there are any other changes you would like to see, please let me know!

ianrandman added 11 commits May 28, 2024 08:19

Fix outliers for update topics (#2)

96b1d6f

Fix reduce topics with zeroshot and fix indexing of topic labels for zeroshot (#2)

69fa29c

Fix uninitialized topic_id_to_zeroshot_topic_idx (#2)

fc12e03

Combine merged zero-shot topics with '_' (#2)

02134d6

Remove check for embedding model when check if zero-shot (#2)

cfd75a4

Fix naming issue with topics_from and topic_to for merging (#2)

ecd0224

When merging zero-shot, keep single zero-shot label if meets threshold with new topic embedding (#2)

19af331

Fix validation check for num topics when zero-shot (#2)

9155a87

Fix typo (#2)

2a7b194

Add test for zero-shot (#2)

0f16c3d

MaartenGr reviewed Jun 23, 2024


ianrandman added 3 commits June 23, 2024 10:33

Merge branch 'master' into issue2-simplify-zero-shot

cecb683

# Conflicts: # bertopic/_bertopic.py

Format using ruff (#2)

7766277

Make self._topic_id_to_zeroshot_topic_idx private, add comments/docstrings, lower threshold zeroshot test, fix outliers for probabilities during zeroshot (#2)

fbc574b

Ruff lint fixes (#2)

e812366

Use type hint format supported by older versions of Python (#2)

e093b5b

MaartenGr merged commit d1ffb2f into MaartenGr:master Jul 1, 2024
6 checks passed

ianrandman mentioned this pull request Jul 3, 2024

Simplify zero-shot topic modeling to fix issues with downstream tasks semandex/BERTopic#2

Closed

ianrandman deleted the issue2-simplify-zero-shot branch July 3, 2024 16:53
