Commit
* Added parameters for UMAP and HDBSCAN
* Option to choose sentence-transformer model
* Method for transforming unseen documents
* Save and load trained models (umap and hdbscan)
* Extract topics and their sizes
* Optimized c-TF-IDF
* Improved documentation
Showing 15 changed files with 502 additions and 420 deletions.
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2020, Maarten P. Grootendorst

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -1,3 +1,156 @@
# BERTopic

[![PyPI - Status](https://img.shields.io/badge/status-beta-yellow.svg)](https://pypi.org/project/bertopic/)
[![PyPI - Python](https://img.shields.io/badge/python-3.6-blue.svg)](https://pypi.org/project/bertopic/)
[![PyPI - License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/MaartenGr/BERTopic/blob/master/LICENSE)
[![PyPI - PyPi](https://img.shields.io/badge/pypi-v0.0.1-EF6C00.svg)](https://pypi.org/project/bertopic/)

BERTopic is a topic modeling technique that leverages BERT embeddings and c-TF-IDF to create dense clusters,
allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

The corresponding Medium post can be found [here](https://medium.com/@maartengrootendorst).
<a name="toc"/></a>
## Table of Contents
<!--ts-->
1. [About the Project](#about)
2. [Getting Started](#gettingstarted)
2.1. [Installation](#installation)
2.2. [Usage](#usage)
2.3. [Overview](#overview)
3. [Algorithm](#algorithm)
3.1. [Sentence Transformer](#sentence)
3.2. [UMAP + HDBSCAN](#umap)
3.3. [c-TF-IDF](#ctfidf)
<!--te-->
<a name="about"/></a>
## 1. About the Project
[Back to ToC](#toc)

The initial purpose of this project was to generalize [Top2Vec](https://github.com/ddangelov/Top2Vec) such that it could be
used with state-of-the-art pre-trained transformer models. However, this proved difficult due to the different natures
of Doc2Vec and transformer models. Instead, I decided to come up with a different algorithm that could use BERT
and 🤗 transformers embeddings. The result is **BERTopic**, an algorithm for generating topics using state-of-the-art embeddings.
<a name="gettingstarted"/></a>
## 2. Getting Started
[Back to ToC](#toc)

<a name="installation"/></a>
### 2.1. Installation

Installation can be done through [PyPI](https://pypi.org/project/bertopic/):

``pip install bertopic``
<a name="usage"/></a>
### 2.2. Usage

Below is an example of how to use the model. The example uses the
[20 newsgroups](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) dataset.
```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Fetch the raw text of all ~18,000 newsgroup posts
docs = fetch_20newsgroups(subset='all')['data']

# Embed the documents with a pre-trained sentence-transformer and fit the model
model = BERTopic("distilbert-base-nli-mean-tokens", verbose=True)
topics = model.fit_transform(docs)
```
The resulting topics can be accessed through `model.get_topic(topic)`:

```python
>>> model.get_topic(9)
[('game', 0.005251396890032802),
 ('team', 0.00482651185323754),
 ('hockey', 0.004335032060690186),
 ('players', 0.0034782716706978963),
 ('games', 0.0032873248432630227),
 ('season', 0.003218987432255393),
 ('play', 0.0031855141725669637),
 ('year', 0.002962343114817677),
 ('nhl', 0.0029577648449943144),
 ('baseball', 0.0029245163154193524)]
```
<a name="overview"/></a>
### 2.3. Overview

| Methods | Code | Returns |
|-----------------------|---|---|
| Access single topic | `model.get_topic(12)` | List[Tuple[Word, Score]] |
| Access all topics | `model.get_topic()` | List[Tuple[Word, Score]] |
| Get single topic freq | `model.get_topic_freq(12)` | int |
| Get all topic freq | `model.get_topics_freq()` | DataFrame |
| Fit the model | `model.fit(docs)` | - |
| Predict new documents | `model.transform([new_doc])` | List[int] |
| Save model | `model.save("my_model")` | - |
| Load model | `BERTopic.load("my_model")` | - |
**NOTE**: The embeddings themselves are not preserved in the model, as they are only needed to create the clusters.
Therefore, it is advised to use `fit` followed by `transform` if you are looking to generalize the model to new documents.
For existing documents, it is best to use `fit_transform` directly, as it only needs to generate the document
embeddings once.
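A minimal sketch of that workflow, using only the methods from the overview table above (`docs` is the document list from the usage example; the unseen document is illustrative):

```python
from bertopic import BERTopic

# Fit once on the documents you already have (embeddings are generated here)
model = BERTopic("distilbert-base-nli-mean-tokens", verbose=True)
model.fit(docs)

# Persist the trained model (UMAP and HDBSCAN) to disk
model.save("my_model")

# Later: reload the model and assign topics to unseen documents
model = BERTopic.load("my_model")
new_topics = model.transform(["An unseen document about hockey"])
```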
<a name="algorithm"/></a>
## 3. Algorithm
[Back to ToC](#toc)

The algorithm consists, roughly, of three stages:
* Extract document embeddings with **Sentence Transformers**
* Cluster document embeddings to create groups of similar documents with **UMAP** and **HDBSCAN**
* Extract and reduce topics with **c-TF-IDF**
<a name="sentence"/></a>
### 3.1. Sentence Transformer
We start by creating document embeddings from a set of documents using
[sentence-transformers](https://github.com/UKPLab/sentence-transformers). These models are pre-trained for many
languages and are great for creating either document or sentence embeddings.

If you have long documents, I would advise you to split them up into paragraphs or sentences, as a BERT-based
model in `sentence-transformers` typically has a token limit.
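A sketch of this step on its own, assuming the `sentence-transformers` package is installed (`docs` and the model name come from the usage example above):

```python
from sentence_transformers import SentenceTransformer

# Load the same pre-trained sentence-transformer used in the usage example
embedder = SentenceTransformer("distilbert-base-nli-mean-tokens")

# One embedding vector per document; for long documents, pass their
# sentences or paragraphs instead to stay within the token limit
embeddings = embedder.encode(docs, show_progress_bar=True)
```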
<a name="umap"/></a>
### 3.2. UMAP + HDBSCAN
Next, in order to cluster the documents using a clustering algorithm such as HDBSCAN, we first need to
reduce the dimensionality of the embeddings, as HDBSCAN is prone to the curse of dimensionality.

<p align="center">
<img src="https://github.com/MaartenGr/BERTopic/raw/master/images/clusters.png"/>
</p>

Thus, we first lower the dimensionality with UMAP, as it preserves local structure well, after which we can
use HDBSCAN to cluster similar documents.
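A sketch of these two steps on the embeddings from the previous snippet, with illustrative parameter values rather than BERTopic's exact internal settings:

```python
import umap
import hdbscan

# Reduce the embeddings to a low-dimensional space while preserving local structure
umap_embeddings = umap.UMAP(n_neighbors=15,
                            n_components=5,
                            metric='cosine').fit_transform(embeddings)

# Cluster the reduced embeddings; documents labeled -1 are treated as outliers
clusterer = hdbscan.HDBSCAN(min_cluster_size=15,
                            metric='euclidean').fit(umap_embeddings)
labels = clusterer.labels_
```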
<a name="ctfidf"/></a>
### 3.3. c-TF-IDF
What we want to know from the clusters we generated is: what makes one cluster, based on its content,
different from another? To solve this, we can modify TF-IDF such that it extracts interesting words per topic
instead of per document.

When you apply TF-IDF as usual on a set of documents, what you are basically doing is comparing the importance of
words between documents. Now, what if we instead treat all documents in a single category (e.g., a cluster)
as a single document and then apply TF-IDF? The result would be importance scores for words within a cluster.
The more important a word is within a cluster, the more representative it is of that topic. In other words,
if we extract the most important words per cluster, we get descriptions of **topics**!

<p align="center">
<img src="https://github.com/MaartenGr/BERTopic/raw/master/images/ctfidf.png" height="50"/>
</p>

Each cluster is converted to a single document instead of a set of documents.
Then, the frequency of word `t` is extracted for each class `i` and divided by the total number of words `w`.
This action can be seen as a form of regularization of frequent words in the class.
Next, the total, unjoined, number of documents `m` is divided by the total frequency of word `t` across all `n` classes.
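Written out, the formula in the image above corresponds to the following (a reconstruction from the description; notation follows the text):

$$\text{c-TF-IDF}_i = \frac{t_i}{w_i} \times \log\frac{m}{\sum_{j=1}^{n} t_j}$$

And a minimal sketch of the computation with scikit-learn and NumPy (function and variable names are illustrative, not BERTopic's internal API):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def c_tf_idf(docs_per_class, m):
    """docs_per_class: one joined string per cluster; m: total number of unjoined documents."""
    # Word counts per class: shape (n_classes, n_words)
    vectorizer = CountVectorizer().fit(docs_per_class)
    t = vectorizer.transform(docs_per_class).toarray()
    # Term frequency: normalize each class's counts by its total word count
    w = t.sum(axis=1)
    tf = np.divide(t.T, w).T
    # Inverse class frequency: total documents over the word's frequency across classes
    idf = np.log(np.divide(m, t.sum(axis=0))).reshape(1, -1)
    return tf * idf, vectorizer
```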
@@ -0,0 +1 @@
from bertopic.model import BERTopic