Commit
* Added parameters for UMAP and HDBSCAN
* Option to choose sentence-transformer model
* Method for transforming unseen documents
* Save and load trained models (umap and hdbscan)
* Extract topics and their sizes
* Optimized c-TF-IDF
* Improved documentation
Showing 15 changed files with 502 additions and 420 deletions.
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2020, Maarten P. Grootendorst

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -1,3 +1,156 @@
# BERTopic

[![PyPI - Status](https://img.shields.io/badge/status-beta-yellow.svg)](https://pypi.org/project/bertopic/)
[![PyPI - Python](https://img.shields.io/badge/python-3.6-blue.svg)](https://pypi.org/project/bertopic/)
[![PyPI - License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/MaartenGr/BERTopic/blob/master/LICENSE)
[![PyPI - PyPi](https://img.shields.io/badge/pypi-v0.0.1-EF6C00.svg)](https://pypi.org/project/bertopic/)

BERTopic is a topic modeling technique that leverages BERT embeddings and c-TF-IDF to create dense clusters,
allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

The corresponding Medium post can be found [here](https://medium.com/@maartengrootendorst).
<a name="toc"/></a>
## Table of Contents
<!--ts-->
1. [About the Project](#about)
2. [Getting Started](#gettingstarted)
2.1. [Installation](#installation)
2.2. [Usage](#usage)
2.3. [Overview](#overview)
3. [Algorithm](#algorithm)
3.1. [Sentence Transformer](#sentence)
3.2. [UMAP + HDBSCAN](#umap)
3.3. [c-TF-IDF](#ctfidf)
<!--te-->
<a name="about"/></a>
## 1. About the Project
[Back to ToC](#toc)

The initial purpose of this project was to generalize [Top2Vec](https://github.com/ddangelov/Top2Vec) such that it could be
used with state-of-the-art pre-trained transformer models. However, this proved difficult due to the different natures
of Doc2Vec and transformer models. Instead, I decided to come up with a different algorithm that could use BERT
and 🤗 transformers embeddings. The result is **BERTopic**, an algorithm for generating topics using state-of-the-art embeddings.
<a name="gettingstarted"/></a>
## 2. Getting Started
[Back to ToC](#toc)

<a name="installation"/></a>
### 2.1. Installation

Installation can be done through [PyPI](https://pypi.org/project/bertopic/):

``pip install bertopic``
<a name="usage"/></a>
### 2.2. Usage

Below is an example of how to use the model. The example uses the
[20 newsgroups](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) dataset.
```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Fetch the raw text of all ~18,000 newsgroup posts
docs = fetch_20newsgroups(subset='all')['data']

# Embed the documents with a pre-trained sentence-transformer and fit the model
model = BERTopic("distilbert-base-nli-mean-tokens", verbose=True)
topics = model.fit_transform(docs)
```
The resulting topics can be accessed through `model.get_topic(topic)`:

```python
>>> model.get_topic(9)
[('game', 0.005251396890032802),
 ('team', 0.00482651185323754),
 ('hockey', 0.004335032060690186),
 ('players', 0.0034782716706978963),
 ('games', 0.0032873248432630227),
 ('season', 0.003218987432255393),
 ('play', 0.0031855141725669637),
 ('year', 0.002962343114817677),
 ('nhl', 0.0029577648449943144),
 ('baseball', 0.0029245163154193524)]
```
<a name="overview"/></a>
### 2.3. Overview

| Methods | Code | Returns |
|-----------------------|---|---|
| Access single topic | `model.get_topic(12)` | List[Tuple[Word, Score]] |
| Access all topics | `model.get_topic()` | List[Tuple[Word, Score]] |
| Get single topic freq | `model.get_topic_freq(12)` | int |
| Get all topic freq | `model.get_topics_freq()` | DataFrame |
| Fit the model | `model.fit(docs)` | - |
| Predict new documents | `model.transform([new_doc])` | List[int] |
| Save model | `model.save("my_model")` | - |
| Load model | `BERTopic.load("my_model")` | - |
**NOTE**: The embeddings themselves are not preserved in the model, as they are only needed to create the clusters.
Therefore, it is advised to use `fit` followed by `transform` if you are looking to generalize the model to new documents.
For existing documents, it is best to use `fit_transform` directly, as it only needs to generate the document
embeddings once.
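A minimal sketch of that workflow, using only the methods from the overview table above (`docs` is the document list from the usage example; the unseen document is illustrative):

```python
from bertopic import BERTopic

# Fit once on the documents you already have (embeddings are generated here)
model = BERTopic("distilbert-base-nli-mean-tokens", verbose=True)
model.fit(docs)

# Persist the trained model (UMAP and HDBSCAN) to disk
model.save("my_model")

# Later: reload the model and assign topics to unseen documents
model = BERTopic.load("my_model")
new_topics = model.transform(["An unseen document about hockey"])
```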
<a name="algorithm"/></a>
## 3. Algorithm
[Back to ToC](#toc)

The algorithm consists, roughly, of three stages:
* Extract document embeddings with **Sentence Transformers**
* Cluster document embeddings to create groups of similar documents with **UMAP** and **HDBSCAN**
* Extract and reduce topics with **c-TF-IDF**
<a name="sentence"/></a>
### 3.1. Sentence Transformer
We start by creating document embeddings from a set of documents using
[sentence-transformers](https://github.com/UKPLab/sentence-transformers). These models are pre-trained for many
languages and are great for creating either document or sentence embeddings.

If you have long documents, I would advise you to split them up into paragraphs or sentences, as a BERT-based
model in `sentence-transformers` typically has a token limit.
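A sketch of this step on its own, assuming the `sentence-transformers` package is installed (`docs` and the model name come from the usage example above):

```python
from sentence_transformers import SentenceTransformer

# Load the same pre-trained sentence-transformer used in the usage example
embedder = SentenceTransformer("distilbert-base-nli-mean-tokens")

# One embedding vector per document; for long documents, pass their
# sentences or paragraphs instead to stay within the token limit
embeddings = embedder.encode(docs, show_progress_bar=True)
```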
<a name="umap"/></a>
### 3.2. UMAP + HDBSCAN
Next, in order to cluster the documents using a clustering algorithm such as HDBSCAN, we first need to
reduce the dimensionality of the embeddings, as HDBSCAN is prone to the curse of dimensionality.

<p align="center">
<img src="https://github.com/MaartenGr/BERTopic/raw/master/images/clusters.png"/>
</p>

Thus, we first lower the dimensionality with UMAP, as it preserves local structure well, after which we can
use HDBSCAN to cluster similar documents.
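A sketch of these two steps on the embeddings from the previous snippet, with illustrative parameter values rather than BERTopic's exact internal settings:

```python
import umap
import hdbscan

# Reduce the embeddings to a low-dimensional space while preserving local structure
umap_embeddings = umap.UMAP(n_neighbors=15,
                            n_components=5,
                            metric='cosine').fit_transform(embeddings)

# Cluster the reduced embeddings; documents labeled -1 are treated as outliers
clusterer = hdbscan.HDBSCAN(min_cluster_size=15,
                            metric='euclidean').fit(umap_embeddings)
labels = clusterer.labels_
```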
<a name="ctfidf"/></a>
### 3.3. c-TF-IDF
What we want to know from the clusters we generated is: what makes one cluster, based on its content,
different from another? To solve this, we can modify TF-IDF such that it extracts interesting words per topic
instead of per document.

When you apply TF-IDF as usual on a set of documents, what you are basically doing is comparing the importance of
words between documents. Now, what if we instead treat all documents in a single category (e.g., a cluster)
as a single document and then apply TF-IDF? The result would be importance scores for words within a cluster.
The more important a word is within a cluster, the more representative it is of that topic. In other words,
if we extract the most important words per cluster, we get descriptions of **topics**!

<p align="center">
<img src="https://github.com/MaartenGr/BERTopic/raw/master/images/ctfidf.png" height="50"/>
</p>

Each cluster is converted to a single document instead of a set of documents.
Then, the frequency of word `t` is extracted for each class `i` and divided by the total number of words `w`.
This action can be seen as a form of regularization of frequent words in the class.
Next, the total, unjoined, number of documents `m` is divided by the total frequency of word `t` across all `n` classes.
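Written out, the formula in the image above corresponds to the following (a reconstruction from the description; notation follows the text):

$$\text{c-TF-IDF}_i = \frac{t_i}{w_i} \times \log\frac{m}{\sum_{j=1}^{n} t_j}$$

And a minimal sketch of the computation with scikit-learn and NumPy (function and variable names are illustrative, not BERTopic's internal API):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def c_tf_idf(docs_per_class, m):
    """docs_per_class: one joined string per cluster; m: total number of unjoined documents."""
    # Word counts per class: shape (n_classes, n_words)
    vectorizer = CountVectorizer().fit(docs_per_class)
    t = vectorizer.transform(docs_per_class).toarray()
    # Term frequency: normalize each class's counts by its total word count
    w = t.sum(axis=1)
    tf = np.divide(t.T, w).T
    # Inverse class frequency: total documents over the word's frequency across classes
    idf = np.log(np.divide(m, t.sum(axis=0))).reshape(1, -1)
    return tf * idf, vectorizer
```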
@@ -0,0 +1 @@
from bertopic.model import BERTopic