v0.16.3 (#2093)
MaartenGr authored Jul 22, 2024
1 parent e07be02 commit 2353f4c
Showing 5 changed files with 129 additions and 103 deletions.
3 changes: 3 additions & 0 deletions bertopic/_bertopic.py
@@ -2387,6 +2387,7 @@ def visualize_topics(
self,
topics: List[int] = None,
top_n_topics: int = None,
use_ctfidf: bool = False,
custom_labels: bool = False,
title: str = "<b>Intertopic Distance Map</b>",
width: int = 650,
@@ -2403,6 +2404,7 @@ def visualize_topics(
For example, if you want to visualize only topics 1 through 5:
`topics = [1, 2, 3, 4, 5]`.
top_n_topics: Only select the top n most frequent topics
use_ctfidf: Whether to use c-TF-IDF representations instead of the embeddings from the embedding model.
custom_labels: Whether to use custom topic labels that were defined using
`topic_model.set_topic_labels`.
title: Title of the plot.
@@ -2428,6 +2430,7 @@ def visualize_topics(
self,
topics=topics,
top_n_topics=top_n_topics,
use_ctfidf=use_ctfidf,
custom_labels=custom_labels,
title=title,
width=width,
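
The `use_ctfidf` flag added in the hunk above switches which topic representation feeds the visualization. The snippet below is only a minimal sketch of that kind of switch, not BERTopic's actual implementation: the helper name `pick_topic_vectors`, its arguments, and the fallback to c-TF-IDF when no embeddings exist are all assumptions made for illustration.

```python
def pick_topic_vectors(ctfidf_rows, topic_embeddings, use_ctfidf=False):
    """Hypothetical helper: choose the vectors that represent each topic.

    Returns a tuple (vectors, used_ctfidf).
    """
    # Assumption: fall back to c-TF-IDF when no topic embeddings are available.
    if use_ctfidf or topic_embeddings is None:
        return ctfidf_rows, True
    return topic_embeddings, False


# Toy data: two topics, each with a c-TF-IDF row and an embedding.
ctfidf_rows = [[0.2, 0.0, 0.8], [0.5, 0.5, 0.0]]
topic_embeddings = [[0.1, 0.9], [0.7, 0.3]]

vectors, used_ctfidf = pick_topic_vectors(ctfidf_rows, topic_embeddings, use_ctfidf=True)
```

With a fitted model, the flag itself would be passed as `topic_model.visualize_topics(use_ctfidf=True)`, following the signature shown in the diff.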
27 changes: 27 additions & 0 deletions docs/changelog.md
@@ -5,6 +5,33 @@ hide:

# Changelog


## **Version 0.16.3**
*Release date: 22 July, 2024*

<h3><b>Highlights:</b></h3>

* Simplify zero-shot topic modeling by [@ianrandman](https://github.com/ianrandman) in [#2060](https://github.com/MaartenGr/BERTopic/pull/2060)
* Option to choose between c-TF-IDF and Topic Embeddings in many functions by [@azikoss](https://github.com/azikoss) in [#1894](https://github.com/MaartenGr/BERTopic/pull/1894)
    * Use the `use_ctfidf` parameter in the following functions to choose between c-TF-IDF and topic embeddings:
        * `hierarchical_topics`, `reduce_topics`, `visualize_hierarchy`, `visualize_heatmap`, `visualize_topics`
* Linting with Ruff by [@afuetterer](https://github.com/afuetterer) in [#2033](https://github.com/MaartenGr/BERTopic/pull/2033)
* Switch from setup.py to pyproject.toml by [@afuetterer](https://github.com/afuetterer) in [#1978](https://github.com/MaartenGr/BERTopic/pull/1978)
* In multi-aspect context, allow Main model to be chained by [@ddicato](https://github.com/ddicato) in [#2002](https://github.com/MaartenGr/BERTopic/pull/2002)

<h3><b>Fixes:</b></h3>

* Added templates for [issues](https://github.com/MaartenGr/BERTopic/tree/master/.github/ISSUE_TEMPLATE) and [pull requests](https://github.com/MaartenGr/BERTopic/blob/master/.github/PULL_REQUEST_TEMPLATE.md)
* Update River documentation example by [@Proteusiq](https://github.com/Proteusiq) in [#2004](https://github.com/MaartenGr/BERTopic/pull/2004)
* Fix PartOfSpeech reproducibility by [@Greenpp](https://github.com/Greenpp) in [#1996](https://github.com/MaartenGr/BERTopic/pull/1996)
* Fix PartOfSpeech ignoring first word by [@Greenpp](https://github.com/Greenpp) in [#2024](https://github.com/MaartenGr/BERTopic/pull/2024)
* Make sklearn embedding backend auto-select more cautious by [@freddyheppell](https://github.com/freddyheppell) in [#1984](https://github.com/MaartenGr/BERTopic/pull/1984)
* Fix typos by [@afuetterer](https://github.com/afuetterer) in [#1974](https://github.com/MaartenGr/BERTopic/pull/1974)
* Fix hierarchical_topics(...) when the distances between three clusters are the same by [@azikoss](https://github.com/azikoss) in [#1929](https://github.com/MaartenGr/BERTopic/pull/1929)
* Fixes to chain strategy example in outlier_reduction.md by [@reuning](https://github.com/reuning) in [#2065](https://github.com/MaartenGr/BERTopic/pull/2065)
* Remove obsolete flake8 config and update line length by [@afuetterer](https://github.com/afuetterer) in [#2066](https://github.com/MaartenGr/BERTopic/pull/2066)


## **Version 0.16.2**
*Release date: 12 May, 2024*

14 changes: 5 additions & 9 deletions docs/getting_started/zeroshot/zeroshot.md
@@ -1,21 +1,17 @@
Zero-shot Topic Modeling is a technique that allows you to find predefined topics in large collections of documents. When faced with many documents, you often have an idea of which topics will definitely be in there, whether as a result of simply knowing your data or because a domain expert is involved in defining those topics.

This method allows you to not only find those specific topics but also create new topics for documents that would not fit with your predefined topics.
This allows for extensive flexibility as there are three scenarios to explore.
This allows for extensive flexibility as there are three scenarios to explore:

First, both zero-shot topics and clustered topics were detected. This means that some documents would fit with the predefined topics where others would not. For the latter, new topics were found.

Second, only zero-shot topics were detected. Here, we would not need to find additional topics since all original documents were assigned to one of the predefined topics.

Third, no zero-shot topics were detected. This means that none of the documents would fit with the predefined topics and a regular BERTopic would be run.
* First, both zero-shot topics and clustered topics were detected. This means that some documents would fit with the predefined topics where others would not. For the latter, new topics were found.
* Second, only zero-shot topics were detected. Here, we would not need to find additional topics since all original documents were assigned to one of the predefined topics.
* Third, no zero-shot topics were detected. This means that none of the documents would fit with the predefined topics and a regular BERTopic would be run.

<div class="svg_image">
--8<-- "docs/getting_started/zeroshot/zeroshot.svg"
</div>

This method works as follows. First, we create a number of labels for our predefined topics and embed them using any embedding model. Then, we compare the embeddings of the documents with the predefined labels using cosine similarity. If they pass a user-defined threshold, the zero-shot topic is assigned to a document. If it does not, then that document, along with others, will be put through a regular BERTopic model.

This creates two models. One for the zero-shot topics and one for the non-zero-shot topics. We combine these two BERTopic models to create a single model that contains both zero-shot and non-zero-shot topics.
This method works as follows. First, we create a number of labels for our predefined topics and embed them using any embedding model. Then, we compare the embeddings of the documents with the predefined labels using cosine similarity. If they pass a user-defined threshold, the zero-shot topic is assigned to a document. If it does not, then that document, along with others, will follow the regular BERTopic pipeline and attempt to find clusters that do not fit with the zero-shot topics.
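
The assignment step described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not BERTopic's own code: the function names, the toy two-dimensional embeddings, the choice of the single best-matching label, and the use of `>=` for the threshold comparison are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_zeroshot(doc_embeddings, label_embeddings, threshold):
    """Split documents into zero-shot assignments and leftovers.

    Returns (assigned, leftover): `assigned` maps a document index to the
    index of its best-matching label; `leftover` lists the documents that
    would go through the regular clustering pipeline instead.
    """
    assigned, leftover = {}, []
    for doc_idx, doc in enumerate(doc_embeddings):
        sims = [cosine_similarity(doc, label) for label in label_embeddings]
        best = max(range(len(sims)), key=sims.__getitem__)
        # Assumption: a similarity equal to the threshold counts as a match.
        if sims[best] >= threshold:
            assigned[doc_idx] = best
        else:
            leftover.append(doc_idx)
    return assigned, leftover


# Toy embeddings: two predefined labels and three documents.
labels = [[1.0, 0.0], [0.0, 1.0]]
docs = [[0.9, 0.1], [0.1, 0.8], [0.7, 0.7]]
assigned, leftover = assign_zeroshot(docs, labels, threshold=0.85)
# The third document sits between both labels, so it is left for clustering.
```

Raising the threshold sends more documents to the clustering step, while lowering it assigns more of them to the predefined topics.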

### **Example**
In order to use zero-shot BERTopic, we create a list of topics that we want to assign to our documents. However,