Reviews Data: Retrieval & Preparation
On this page, we cover the sources of our data, the steps involved in data preprocessing, and an overview of the scraping process.
We collect review data from the GXS bank application, available on both the Apple App Store and Google Play Store, to ensure broad demographic coverage.
In addition to GXS, we gather review data from direct digital competitors like Maribank and Trust Bank, along with major banks in Singapore such as DBS, OCBC, and UOB. This comprehensive dataset enables us to conduct a comparative analysis, shedding light on GXS's performance relative to its peers and the areas where it has room for improvement.
We clean the data by removing redundant features and retaining only the essential fields: review content, title, rating, bank, source, and temporal data.
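The column-pruning step can be sketched as follows; the raw field names (e.g. `reviewer_id`, `device`) are illustrative assumptions, not the actual schema returned by the scrapers:

```python
# Hypothetical raw review record as returned by a scraping library,
# including extra metadata fields we treat as redundant.
raw_review = {
    "content": "Great app, smooth sign-up!",
    "title": "Smooth",
    "rating": 5,
    "bank": "GXS",
    "source": "Google Play",
    "datetime": "2023-01-05 10:30:00",
    "reviewer_id": "u123",   # redundant, dropped
    "device": "Pixel 7",     # redundant, dropped
}

# The essential fields named in the text.
KEEP = ("content", "title", "rating", "bank", "source", "datetime")

# Retain only the essential fields, discarding everything else.
clean_review = {field: raw_review[field] for field in KEEP}
```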
We conduct text preprocessing on the review content by filtering emoticons, non-English characters, punctuation, and line breaks, and converting all text to lowercase for consistency.
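A minimal sketch of these cleaning rules, assuming "non-English characters" means anything outside basic ASCII letters and digits:

```python
import re

def preprocess(text: str) -> str:
    """Lowercase review content and strip emoticons, non-English
    characters, punctuation, and line breaks, as described above."""
    text = text.lower()
    # Replace line breaks with spaces before character filtering.
    text = text.replace("\r", " ").replace("\n", " ")
    # Keep only lowercase ASCII letters, digits, and spaces; this drops
    # emoji, accented characters, and punctuation in a single pass.
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    # Collapse the runs of spaces left behind by the substitutions.
    return re.sub(r"\s+", " ", text).strip()
```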
Before developing an updatable database, we first retrieve the bulk of existing reviews from both sources. As the GXS banking application was first released in August 2022, we scraped all reviews available from then to the present for every bank identified above.
We used existing scraping libraries in Python to collect data from the Apple App Store and Google Play Store. The collected data undergoes the above-mentioned cleaning and preprocessing. Subsequently, leveraging our production-ready models for sentiment analysis, topic modeling, and explainable AI, we utilise the review content to generate new features. These features include sentiment, review topic, and associated words.
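Play Store and App Store scraping libraries typically return reviews one page at a time with a continuation token. The helper below sketches the pagination loop under that assumption; the `Fetcher` shape and field names are illustrative, not the API of any specific library:

```python
from typing import Callable, Dict, List, Optional, Tuple

# A page fetcher: given a continuation token (None for the first page),
# return (batch_of_reviews, next_token). Mirrors the pagination style
# of common store-scraping libraries.
Fetcher = Callable[[Optional[object]], Tuple[List[Dict], Optional[object]]]

def scrape_all(fetch_page: Fetcher) -> List[Dict]:
    """Drain every page of reviews by following continuation tokens."""
    reviews: List[Dict] = []
    token: Optional[object] = None
    while True:
        batch, token = fetch_page(token)
        reviews.extend(batch)
        # Stop once the source returns an empty page or no further token.
        if not batch or token is None:
            return reviews
```

In practice `fetch_page` would wrap the scraping library's paged review call for a given app identifier; the same loop serves both stores.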
Additionally, we keep track of the datetime of the latest review, which is used to scrape new reviews without duplicating those already obtained.
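Tracking that watermark amounts to taking the maximum timestamp over the scraped batch; a sketch, assuming each review carries an ISO-format `datetime` field:

```python
from datetime import datetime

def latest_review_time(reviews: list) -> datetime:
    """Return the timestamp of the most recent review in a batch.
    This value is persisted so the next scrape can use it as a lower
    bound and skip everything already collected."""
    return max(datetime.fromisoformat(r["datetime"]) for r in reviews)
```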
Similar methods employed during the database initialisation are applied here. When a database update is requested, we retrieve the previously stored datetime to prevent scraping duplicate reviews. The data then undergoes the same cleaning and text preprocessing procedures as before, and the timestamp of the last scraped review is updated. Finally, the processed data progresses through the modelling pipeline before being stored in the database.
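The core of the update step, filtering out already-seen reviews and advancing the stored timestamp, can be sketched as below; the `datetime` field name is an assumption carried over from the earlier steps:

```python
from datetime import datetime

def incremental_update(scraped: list, last_seen: datetime):
    """Drop reviews at or before the stored timestamp, then advance it.
    Returns (fresh_reviews, new_last_seen) for the next update cycle."""
    fresh = [r for r in scraped
             if datetime.fromisoformat(r["datetime"]) > last_seen]
    # If nothing new arrived, keep the previous watermark unchanged.
    new_last_seen = max(
        (datetime.fromisoformat(r["datetime"]) for r in fresh),
        default=last_seen,
    )
    return fresh, new_last_seen
```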
In conclusion, our data retrieval and preparation process ensures comprehensive coverage and a high-quality dataset, improving its reliability and usability for the analyses that follow.