ChatGPT has been taking the world by storm, and if you haven't heard of it yet, you will soon. A type of Generative Pre-trained Transformer (GPT), ChatGPT is a language model trained to generate human-like text by predicting the next word in a sequence of words.
ChatGPT has been prominent in the Twittersphere, with thousands of tweets (and counting) on a wide variety of topics. But what do Twitter users say about this impressive new OpenAI model? This project aims to answer that question by training a Latent Dirichlet Allocation model on Tweets about #ChatGPT and deploying the model in Microsoft Power BI for exploration and interaction.
Latent Dirichlet Allocation (LDA) is a statistical model used to uncover the hidden topics present in a collection of documents, in this case, Tweets about #ChatGPT. It does this by identifying the words most strongly associated with each topic and using those words to represent it.
LDA assumes that each document in the collection is a mixture of a small number of latent topics, and that each word in the document is generated from one of these topics. The goal of LDA is to uncover the set of topics that are present in the documents and to estimate the proportion of each topic present in each document.
To do this, LDA uses a probabilistic model to estimate the likelihood that each word in each document was generated by each of the latent topics, then uses those estimates to infer the most likely set of topics and the proportion of each topic in each document.
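To make this concrete, here is a minimal sketch of the idea using scikit-learn's LatentDirichletAllocation. The tiny corpus and topic count are placeholders, not the project's actual data or settings (the project itself uses PyCaret, described below):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus; in this project, each document is a tweet.
docs = [
    "chatgpt wrote my essay for me",
    "asked chatgpt to explain transformers",
    "chatgpt is changing search engines",
]

# Bag-of-words counts for the corpus.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit LDA with an assumed number of latent topics.
lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topics = lda.fit_transform(X)  # per-document topic proportions

# lda.components_ holds per-topic word weights; the top-weighted
# words are those "most strongly associated" with each topic.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top_words = [terms[i] for i in weights.argsort()[-3:][::-1]]
    print(f"Topic {k}: {top_words}")

print(doc_topics)  # each row sums to ~1: the topic mix per document
```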
The initial dataset is prepared by Konrad Banachewicz, updated daily, and available here under the CC0: Public Domain license.
The dataset contains 60,504 unique tweets (simple retweets have been removed) referencing the hashtag #ChatGPT from 35,748 different users over the period December 5, 2022 through January 2, 2023. See this notebook for more details.
Natural language problems like this one require significant text pre-processing, and a bit of iteration to find a set of topics that appropriately captures the themes in the corpus. Luckily, the PyCaret package simplifies this process, turning what would otherwise take many lines of code into just a few.
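As a rough illustration, assuming PyCaret's NLP module (pycaret.nlp, available in PyCaret 2.x) and a dataframe with a column of raw tweet text, the whole pipeline boils down to something like this (the file and column names are hypothetical):

```python
import pandas as pd
from pycaret.nlp import setup, create_model, assign_model

# Hypothetical file/column names; the actual dataset is described above.
tweets = pd.read_csv("chatgpt_tweets.csv")

# setup() runs the text pre-processing pipeline: lowercasing,
# tokenization, stopword removal, lemmatization, and so on.
nlp = setup(data=tweets, target="tweet", session_id=42)

# Train an LDA model with an initial guess at the topic count.
lda = create_model("lda", num_topics=6)

# assign_model() appends per-tweet topic weights and the dominant
# topic label back onto the original dataframe.
results = assign_model(lda)
```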
In the first iteration, no custom stopwords were passed to the pre-processor, and the topic count was initially guessed at six. The resulting model contained 32,016 terms spread across the 6 topics. As would be expected of a model trained with no custom stopwords, many of the highest-frequency terms added no significant information to the model, such as:
- co
- ai
- https
- use
- chatgpt
- ask
- write
- make
- get
- answer
- question
- give
- say
- would
Further, an analysis of the t-distributed stochastic neighbor embedding (tSNE) plot shows substantial topic overlap, making it unclear whether a 6-topic model is the appropriate fit. For more information on iteration 1 training and analysis, see this notebook.
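In PyCaret, these diagnostics are one-liners. A sketch, continuing from the pipeline above:

```python
from pycaret.nlp import plot_model

# 2-D t-SNE projection of the tweets, colored by dominant topic;
# heavy cluster overlap suggests the topic count may be off.
plot_model(lda, plot="tsne")

# Corpus-wide word-frequency plot, handy for spotting high-frequency
# terms that carry no topical information (stopword candidates).
plot_model(lda, plot="frequency")
```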
In the second iteration, the uninformative terms identified above were removed as custom stopwords and the model was retrained, this time asserting five topics rather than six. The frequency distribution of the corpus did firm up a bit, with the dominant words being "good", "think", "go", "know", "new", "try", "see", "create", "time" and "work".
However, model tuning with coherence yielded a 400-topic model, suggesting that the tweets span a far wider set of topics than the handful initially asserted. This makes sense, given that people have, and talk about, a vast array of use cases for ChatGPT! See this notebook for more information on this iteration.
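A sketch of this second iteration, again assuming pycaret.nlp. With no supervised target, tune_model() scores candidate topic counts by coherence; the grid of candidate counts shown here is hypothetical:

```python
import pandas as pd
from pycaret.nlp import setup, create_model, tune_model

tweets = pd.read_csv("chatgpt_tweets.csv")  # same hypothetical file as above

# Uninformative terms from iteration 1, removed as custom stopwords.
extra_stopwords = ["co", "ai", "https", "use", "chatgpt", "ask", "write",
                   "make", "get", "answer", "question", "give", "say", "would"]

nlp = setup(data=tweets, target="tweet",
            custom_stopwords=extra_stopwords, session_id=42)

# Retrain with the revised topic-count guess.
lda_v2 = create_model("lda", num_topics=5)

# With no supervised target, tune_model() evaluates each candidate
# topic count by coherence and returns the best-scoring model.
tuned_lda = tune_model("lda", custom_grid=[5, 25, 100, 200, 400])
```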
The model was productionized in Power BI to facilitate exploration. The application is available here.
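One way such a hand-off can work (a sketch, not necessarily the exact deployment used here): write PyCaret's per-tweet topic assignments to a file that Power BI imports, or run the scoring code inside Power BI's Python connector. Continuing from the tuning sketch above:

```python
from pycaret.nlp import assign_model

# Per-tweet topic weights and dominant-topic labels from the tuned model.
assignments = assign_model(tuned_lda)

# Hypothetical export path; Power BI can import this CSV directly,
# or the scoring code can run in Power BI's Python connector.
assignments.to_csv("chatgpt_tweet_topics.csv", index=False)
```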