Reviews Data: Retrieval & Preparation

Lee Zhan Peng edited this page Apr 24, 2024 · 1 revision

On this page, we cover the sources of our data, the steps involved in data preprocessing, and an overview of the scraping process.

Data Source

GXS Bank Application

GXS Apple App Store Reviews GXS Google Play Store Reviews

We collect review data for the GXS Bank application from both the Apple App Store and the Google Play Store, so that iOS and Android users are both represented and demographic coverage stays broad.

Competitive Benchmarking

In addition to GXS, we gather review data from direct digital-bank competitors such as Maribank and Trust Bank, along with major banks in Singapore such as DBS, OCBC, and UOB. This dataset lets us compare GXS against its peers, showing its relative performance and the areas where it has room to improve.

Data Preprocessing

Streamlining Data Integrity

We clean the data by dropping redundant features and keeping only the essential ones: review content, title, rating, bank, source, and temporal data.
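The feature selection described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the field names in `KEPT_FEATURES` and the helper `keep_essential_features` are assumptions, since the page does not give the real column names.

```python
# Hypothetical field names; the dataset's real column names are not stated on this page.
KEPT_FEATURES = ("content", "title", "rating", "bank", "source", "date")


def keep_essential_features(raw_review: dict) -> dict:
    """Return a copy of the review that contains only the retained features.

    Missing fields are filled with None so every cleaned record has the same keys.
    """
    return {name: raw_review.get(name) for name in KEPT_FEATURES}


# Example record with one redundant field that the scraper might return.
raw_review = {
    "content": "Easy to set up.",
    "title": "Nice",
    "rating": 5,
    "bank": "GXS",
    "source": "Google Play Store",
    "date": "2024-01-15",
    "reviewer_avatar_url": "https://example.com/a.png",  # redundant feature
}
cleaned = keep_essential_features(raw_review)
```

Filling missing fields with `None` is one possible choice; it keeps records uniform, for example when a store does not provide a review title.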

Refinement through Text Preprocessing

We preprocess the review content by filtering out emoticons, non-English characters, punctuation, and line breaks, and by converting all text to lowercase for consistency.
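A minimal sketch of these text preprocessing steps, using only the Python standard library. The function name `preprocess_review_text` is hypothetical, and treating "non-English characters" as "non-ASCII characters" is an assumption made for this example; it also strips accented Latin letters.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def preprocess_review_text(text: str) -> str:
    """Apply the filtering steps described above to one review."""
    # Replace line breaks with spaces so words on adjacent lines stay separated.
    text = re.sub(r"[\r\n]+", " ", text)
    # Drop all non-ASCII characters (assumption: this covers emoji and non-English scripts).
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Delete punctuation; this also removes text emoticons such as ":)".
    text = text.translate(_PUNCTUATION_TABLE)
    # Collapse repeated whitespace, trim the ends, and lowercase.
    return re.sub(r"\s+", " ", text).strip().lower()


example = preprocess_review_text("Great app!! 😀\nLove it :) 很好")
```

Deleting punctuation outright turns "don't" into "dont"; replacing it with a space instead is an equally reasonable choice, depending on what the downstream models expect.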

Overall Scraping Process

Initialisation of Database

Before building an updatable database, we first retrieve the bulk of the reviews that already exist in the two review sources. Because the GXS banking application was first released in August 2022, we scraped all reviews available from then to the present for all selected banks.

We used existing Python scraping libraries to collect data from the Apple App Store and Google Play Store. The collected data then undergoes the cleaning and preprocessing described above. Next, using our production-ready models for sentiment analysis, topic modelling, and explainable AI, we generate new features from the review content: sentiment, review topic, and associated words.

We also keep track of the datetime of the latest review, which lets us scrape new reviews later without duplicating those already collected.
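One way to use the stored datetime is sketched below. The function `select_new_reviews` and the `"date"` key are assumptions for illustration; the page does not show how the project stores or compares the value.

```python
from datetime import datetime


def select_new_reviews(reviews, last_seen):
    """Keep only reviews strictly newer than last_seen.

    Returns the new reviews together with the updated latest datetime,
    which is unchanged when nothing new was found.
    """
    new_reviews = [review for review in reviews if review["date"] > last_seen]
    latest = max((review["date"] for review in new_reviews), default=last_seen)
    return new_reviews, latest


scraped = [
    {"date": datetime(2024, 3, 1, 9, 0), "content": "old review"},
    {"date": datetime(2024, 4, 2, 18, 30), "content": "new review"},
]
new_reviews, latest = select_new_reviews(scraped, last_seen=datetime(2024, 3, 15))
```

The strict comparison means a review sharing the exact stored datetime is treated as already collected. If several reviews can carry identical timestamps, a unique review identifier would be a safer deduplication key.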

Database Upkeep and Expansion

The same methods used during database initialisation apply here. When a database update is requested, we retrieve the previously stored datetime so that only newer reviews are scraped. The new data then undergoes the same cleaning and text preprocessing as before, and the stored datetime of the latest scraped review is updated. Finally, the processed data passes through the modelling pipeline before being stored in the database.
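The update flow can be outlined as a single function. This is a sketch under stated assumptions: `update_database` and its `scrape`, `clean`, and `enrich` parameters are hypothetical placeholders for the project's scrapers, preprocessing, and modelling pipeline, and a plain list stands in for the real database.

```python
from datetime import datetime


def update_database(database, last_seen, scrape, clean, enrich):
    """Run one update cycle and return the new latest-review datetime.

    scrape() returns review dicts with a "date" key (assumed shape),
    clean(review) applies cleaning and text preprocessing,
    enrich(review) stands in for the modelling pipeline.
    """
    # Skip anything at or before the previously stored datetime.
    fresh = [review for review in scrape() if review["date"] > last_seen]
    if not fresh:
        return last_seen
    cleaned = [clean(review) for review in fresh]
    # Update the stored datetime of the latest scraped review.
    new_last_seen = max(review["date"] for review in cleaned)
    # Pass the processed data through the modelling step, then store it.
    database.extend(enrich(review) for review in cleaned)
    return new_last_seen
```

Passing the three stages in as functions keeps the outline independent of any particular scraping library or model, which also makes the flow easy to exercise with stub functions.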

Conclusion

In conclusion, our data retrieval and preparation process aims for comprehensive coverage and consistent quality in the dataset. We hope this makes the dataset reliable and usable for the analysis that follows.