
Modelling: Topic Modelling

Lee Zhan Peng edited this page Apr 26, 2024 · 4 revisions

On this page, we explore the purpose of topic modelling, showcase its use cases, evaluate different methodologies for topic modelling, and outline the use of topic modelling in our project.

What is Topic Modelling?

Topic modelling is a technique used to uncover latent themes or topics within a collection of documents. The primary purpose of topic modelling in our project is to identify and categorise the main themes or topics present in bank reviews. By doing so, we aim to gain a deeper understanding of customer sentiments, preferences, and pain points related to banking applications. Additionally, topic modelling helps in organising and summarising large volumes of text data, enabling efficient analysis and decision-making (Peddireddi, 2023).

How does Topic Modelling work?

At its core, topic modelling assumes that each document in the collection is a mixture of various topics, and a distribution of words characterises each topic. The algorithm identifies these latent topics and their corresponding word distributions by analysing the word frequencies and co-occurrences across the documents. This allows researchers and analysts to uncover the underlying themes present in the corpus and gain insights into the main subjects discussed across the documents.
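The mixture assumption can be made concrete with a small sketch. The topic names, word probabilities, and document mixture below are invented for illustration and are not taken from our data. The sketch only shows how the probability of a word in a document follows from the two kinds of distribution.

```python
# Toy illustration of the mixture assumption behind topic models.
# All topic names and numbers here are hypothetical.

# Each topic is a probability distribution over words.
topic_word = {
    "login": {"password": 0.5, "otp": 0.3, "slow": 0.1, "crash": 0.1},
    "stability": {"password": 0.05, "otp": 0.05, "slow": 0.3, "crash": 0.6},
}

# Each document is a mixture of topics.
doc_topic = {"login": 0.7, "stability": 0.3}


def word_probability(word, doc_topic, topic_word):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        weight * topic_word[topic].get(word, 0.0)
        for topic, weight in doc_topic.items()
    )


for word in ("password", "otp", "slow", "crash"):
    print(word, round(word_probability(word, doc_topic, topic_word), 3))
```

A topic modelling algorithm works in the opposite direction: it observes only the words in each document and estimates both distributions from them.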

Why is Topic Modelling important?

  • Discovering Latent Themes: Topic modelling enables the discovery of latent themes or topics within a large collection of documents without the need for manual annotation or categorisation. This allows researchers and analysts to uncover hidden patterns and structures in the data that may not be immediately apparent.
  • Insights Generation: Topic modelling facilitates the generation of insights and understanding of complex datasets by revealing the main subjects and themes present in the corpus. These insights can inform decision-making, guide further analysis, and support various applications such as content recommendation, trend analysis, and sentiment analysis.
  • Personalisation and User Engagement: In applications such as content recommendation and personalised marketing, topic modelling can help tailor recommendations and content suggestions to individual preferences by identifying topics of interest based on user behaviour and interaction patterns.

Use Cases

Topic modelling has various applications across different domains, including:

  • Content Analysis: Topic modelling helps in organising and analysing large volumes of textual data, making it useful for content analysis in fields such as marketing, customer service, and social media monitoring.
  • Information Retrieval: Topic modelling aids in categorising and retrieving relevant documents or articles based on specific topics or themes, enhancing search and information retrieval systems.
  • Customer Feedback Analysis: In the context of our project, topic modelling is crucial for analysing and categorising customer reviews to identify common themes or topics, allowing banks to address customer concerns and improve their services.
  • Market Research: Topic modelling can be used to analyse trends, opinions, and discussions in online forums, social media platforms, or customer surveys, providing valuable insights for market research and product development.

Real-World Applications in the Banking Industry

  • Customer Support and Feedback Analysis: Banks receive a large volume of customer feedback through channels like surveys, emails, and social media. Topic modelling can be used to analyse this feedback and identify recurring themes or topics, such as complaints about specific products or services, customer satisfaction levels, or suggestions for improvement (Clark, 2024). By understanding customer sentiments and concerns, banks can prioritise areas for enhancement and tailor their services to meet customer needs more effectively.
  • Market Research and Customer Segmentation: Banks can use topic modelling to analyse market trends, competitor strategies, and customer preferences by mining textual data from sources like market reports, social media discussions, and customer surveys. By identifying key topics and segments within the market, banks can better understand customer needs, segment their customer base more accurately, and tailor their marketing strategies and product offerings accordingly (Chen et al., 2017).
  • Content Recommendation and Personalisation: Topic modelling can enhance customer engagement and satisfaction by providing personalised recommendations for banking products and services. By analysing customer transaction history, browsing behaviour, and demographic information, banks can identify relevant topics and recommend products or services that align with each customer's interests and needs (Gichere, 2023). This improves customer engagement, increases cross-selling opportunities, and enhances overall customer satisfaction.

Literature Review

For our project, our goal is to identify the best model for categorising each bank review. To achieve this, we evaluate topic modelling techniques for assigning reviews to distinct categories. We also explore the feasibility of supplying our own topics to guide the modelling process. The topics we use are as follows:

['login', 'interface', 'stability', 'update', 'notifications', 'speed', 'service', 'functions', 'security']

Techniques Evaluated

We evaluated three strategies for topic modelling in our project:

  1. Latent Dirichlet Allocation (LDA): LDA is a probabilistic generative model that assumes each document in a corpus is a mixture of underlying topics and that each word is attributable to one of these topics. It iteratively refines the assignment of words to topics through statistical inference (Kulshrestha, 2019), yielding a probability distribution of topics for each document and of words for each topic. Because LDA describes each topic only as a word distribution rather than a named label, we compute the cosine similarity between each inferred topic and our predefined topics and select the predefined topic with the highest similarity.

  2. BART-large-mnli: BART-large-mnli is a BART (Bidirectional and Auto-Regressive Transformers) model fine-tuned on the MultiNLI (Multi-Genre Natural Language Inference) dataset. As a natural language inference model, it judges the logical relationship between a pair of sentences (Shu et al., 2022). This makes it suitable for zero-shot classification: each of our predefined topics can be supplied directly as a candidate label, with no separate topic-matching step.

  3. BERTopic: BERTopic is a library that uses BERT embeddings to represent documents in a vector space and then applies a clustering method such as HDBSCAN to group similar documents into topics. Documents are allocated to topics according to their closeness to cluster centroids (Egger & Yu, 2022). As with LDA, we compute the cosine similarity between the inferred topics and our predefined topics to determine the most closely related topic.
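The matching step used with LDA and BERTopic needs a vector for each inferred topic and for each predefined topic. The sketch below assumes such vectors are already available. How they were produced is not specified on this page, so the three-dimensional vectors here are made up, and `closest_topic` is a hypothetical helper name. Only the cosine-similarity selection itself is illustrated.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def closest_topic(inferred_vector, predefined_vectors):
    """Return the predefined topic whose vector is most similar."""
    return max(
        predefined_vectors,
        key=lambda name: cosine_similarity(inferred_vector, predefined_vectors[name]),
    )


# Hypothetical vectors for three of our predefined topics.
predefined_vectors = {
    "login": [0.9, 0.1, 0.0],
    "speed": [0.1, 0.9, 0.1],
    "security": [0.6, 0.0, 0.8],
}

# Hypothetical vector for one topic inferred by LDA or BERTopic.
inferred_vector = [0.8, 0.2, 0.1]

print(closest_topic(inferred_vector, predefined_vectors))
```

Because the inferred topic is always mapped to some predefined topic, even a weak best match is accepted in this sketch; a minimum-similarity cut-off would be one way to guard against that.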

Results


Although every strategy scored an accuracy below 0.7, zero-shot classification with BART-large-mnli outperformed the cosine-similarity matching used with LDA and BERTopic. We therefore chose BART-large-mnli for zero-shot topic modelling in our project.

Usage in the Project

Using the pipeline API of the transformers library, we load the facebook/bart-large-mnli model for zero-shot topic modelling:

self.pipe = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

The pipeline is then called with the review to analyse, along with the topic words we want to use:

result = self.pipe(review, associated_words)

The results are then compiled and used within our data pipeline.
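The zero-shot pipeline returns a dictionary whose `labels` list is ordered by descending `scores`, so the best-matching topic is the first label. The helper below is a sketch of how that result might be reduced to a single topic. `classify_review` and its `min_score` cut-off are hypothetical and not part of our codebase, and `pipe` stands for any callable with the same interface as `self.pipe`.

```python
def classify_review(pipe, review, topics, min_score=0.0):
    """Return (topic, score) for the best-matching topic.

    If the best score is below min_score, return (None, score) instead.

    `pipe` is expected to behave like a Hugging Face zero-shot-classification
    pipeline: pipe(text, candidate_labels) returns a dict with "labels" and
    "scores", ordered from most to least likely.
    """
    result = pipe(review, topics)
    best_label = result["labels"][0]
    best_score = result["scores"][0]
    if best_score < min_score:
        return None, best_score
    return best_label, best_score
```

With the pipeline created above, this could be called as `classify_review(self.pipe, review, associated_words)`.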

References

Chen, Y., Rabbani, R. M., Gupta, A., & Zaki, M. J. (2017, November 1). Comparative text analytics via topic modeling in banking. IEEE Symposium Series on Computational Intelligence.

Clark, S. (2024, April 15). Unlocking the voice of customer with AI. CMSWire.com.

Egger, R., & Yu, J. (2022, May 6). A topic modeling comparison between LDA, NMF, Top2VEC, and BERTopic to demystify Twitter posts. Frontiers in Sociology, 7.

Gichere, F. (2023, December 1). Unifying Topic Modeling and Sentiment Analysis to Derive Actionable Insights from Kenyan Bank App Reviews. Medium.

Kulshrestha, R. (2021, December 10). A Beginner’s Guide to Latent Dirichlet Allocation (LDA). Medium.

Peddireddi, Y. (2023, September 12). Topic modelling in natural language processing. Analytics Vidhya.

Shu, L., Xu, H., Liu, B., & Chen, J. (2022, February 4). Zero-Shot Aspect-Based Sentiment Analysis. arXiv.org.