Technologies used: MySQL, MongoDB, PostgreSQL, DB2 on Cloud, IBM Cognos Analytics, Python, Apache Airflow, Hadoop, Spark
This capstone project is part of the IBM Data Engineering Professional Certificate.
The task is to build a data platform for retailer data analytics, assuming the role of an Associate Data Engineer who has recently joined SoftCart, an e-commerce organization.
- Design a data platform that uses MySQL as an OLTP database and MongoDB as a NoSQL database.
- Design and implement a data warehouse and generate reports from the data.
- Design a reporting dashboard that reflects the key metrics of the business.
- Create an ETL pipeline that extracts data from the OLTP and NoSQL databases, transforms it, and loads it into the data warehouse.
- Create a Spark connection to the data warehouse, and then deploy a machine learning model.
- SoftCart, an e-commerce company, uses a hybrid architecture, with some of its databases on premises and some on the cloud.
- SoftCart's online presence is primarily through its website, which customers access using a variety of devices like laptops, mobiles and tablets.
- All product catalog data is stored in the MongoDB NoSQL server, and all transactional data, such as inventory and sales, is stored in the MySQL database server. SoftCart's web server is driven entirely by these two databases.
- Data is periodically extracted from these two databases and loaded into the staging data warehouse, which runs on PostgreSQL.
- The production data warehouse runs on a cloud instance of IBM DB2.
- BI teams connect to IBM DB2 for operational dashboard creation. IBM Cognos Analytics is used to create the dashboards.
- SoftCart uses a Hadoop cluster as its big data platform, where all the data is collected for analytics purposes.
- Spark is used to analyze the data on the Hadoop cluster.
- ETL pipelines running on Apache Airflow move data between the OLTP database, the NoSQL database, and the data warehouse.
Documentation: OLTP Database
MySQL is used to design the OLTP database for the e-commerce website. The OLTP database design is based on the sales data provided. The data is loaded into the sales_data table, and a datadump.sh bash script takes backups using mysqldump.
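A rough sketch of the backup step, rendered here in Python by shelling out to mysqldump (the database name, credentials, and file-naming scheme are assumptions; the project itself does this in the datadump.sh bash script):

```python
# Sketch of the sales_data backup that datadump.sh performs with mysqldump.
# Database name, credentials, and output naming are assumptions.
import subprocess
from datetime import datetime

backup_file = f"sales-backup-{datetime.now():%Y%m%d}.sql"

with open(backup_file, "w") as out:
    subprocess.run(
        ["mysqldump", "--user", "root", "--password=<password>",
         "sales", "sales_data"],
        stdout=out,
        check=True,  # raise if mysqldump exits non-zero
    )
```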
Documentation: NoSQL Database
MongoDB is used to set up a NoSQL database for the e-commerce website. The catalog NoSQL database is created, and the catalog data is loaded into the electronics collection using mongoimport. An electronics.csv file containing a subset of the fields is then exported from the electronics collection using mongoexport.
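The load and export steps might look like the following, wrapped in Python for illustration (the input file name and the exported field list are assumptions):

```python
# Sketch of the catalog load (mongoimport) and subset export (mongoexport).
# The input file name and the exported field list are assumptions.
import subprocess

# Load the catalog data into the electronics collection.
subprocess.run(
    ["mongoimport", "--db", "catalog", "--collection", "electronics",
     "--file", "catalog.json"],
    check=True,
)

# Export a subset of fields from the collection to electronics.csv.
subprocess.run(
    ["mongoexport", "--db", "catalog", "--collection", "electronics",
     "--type", "csv", "--fields", "_id,type,model",
     "--out", "electronics.csv"],
    check=True,
)
```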
Documentation: Data Warehouse
PostgreSQL is used to design and create the data warehouse schema using pgAdmin's ERD Design Tool. The star schema design is based on the sample order data provided.
- Fact table: softcartFactSales
- Dimension tables: softcartDimDate, softcartDimCategory, softcartDimCountry, softcartDimItem
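A sketch of the kind of DDL the ERD translates to, executed against PostgreSQL with psycopg2 (column names and types are assumptions inferred from the table names):

```python
# Sketch of the star schema DDL behind the pgAdmin ERD.
# Column names and types are assumptions.
import psycopg2

DDL = """
CREATE TABLE softcartDimDate (
    dateid  INTEGER PRIMARY KEY,
    date    DATE,
    year    INTEGER,
    quarter INTEGER,
    month   INTEGER,
    day     INTEGER
);
CREATE TABLE softcartDimCategory (
    categoryid INTEGER PRIMARY KEY,
    category   VARCHAR(50)
);
CREATE TABLE softcartDimCountry (
    countryid INTEGER PRIMARY KEY,
    country   VARCHAR(60)
);
CREATE TABLE softcartDimItem (
    itemid INTEGER PRIMARY KEY,
    item   VARCHAR(100)
);
CREATE TABLE softcartFactSales (
    orderid    INTEGER PRIMARY KEY,
    dateid     INTEGER REFERENCES softcartDimDate (dateid),
    categoryid INTEGER REFERENCES softcartDimCategory (categoryid),
    countryid  INTEGER REFERENCES softcartDimCountry (countryid),
    itemid     INTEGER REFERENCES softcartDimItem (itemid),
    amount     DECIMAL(10, 2)
);
"""

with psycopg2.connect("dbname=softcart user=postgres") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```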
Documentation: Data Warehouse Reporting
IBM DB2 is used to generate reports from the data in the data warehouse. The warehouse is created from the schema above, and the sample data is loaded into it. The following queries and a materialized query table (MQT) are created for data analytics:
- Total sales by country and category, using GROUPING SETS
- Total sales by year and country, using ROLLUP
- Average sales by year and country, using CUBE
- An MQT for total sales per country
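These statements could look like the following in DB2, executed through the ibm_db driver; the ROLLUP query and the MQT are shown as representatives (the connection string, join keys, and measure column are assumptions):

```python
# Sketch of one analytics query (ROLLUP) and the MQT, run via ibm_db.
# Connection details, join keys, and column names are assumptions.
import ibm_db

conn = ibm_db.connect(
    "DATABASE=BLUDB;HOSTNAME=<host>;PORT=<port>;PROTOCOL=TCPIP;"
    "UID=<user>;PWD=<password>;SECURITY=SSL;", "", "")

# Total sales by year and country, with subtotals and a grand total.
rollup_sql = """
SELECT d.year, c.country, SUM(f.amount) AS totalsales
FROM softcartFactSales f
JOIN softcartDimDate d ON f.dateid = d.dateid
JOIN softcartDimCountry c ON f.countryid = c.countryid
GROUP BY ROLLUP (d.year, c.country)
"""
stmt = ibm_db.exec_immediate(conn, rollup_sql)
row = ibm_db.fetch_assoc(stmt)
while row:
    print(row)
    row = ibm_db.fetch_assoc(stmt)

# Materialized query table for total sales per country.
mqt_sql = """
CREATE TABLE total_sales_per_country (country, total_sales) AS (
    SELECT c.country, SUM(f.amount)
    FROM softcartFactSales f
    JOIN softcartDimCountry c ON f.countryid = c.countryid
    GROUP BY c.country
) DATA INITIALLY DEFERRED REFRESH DEFERRED MAINTAINED BY SYSTEM
"""
ibm_db.exec_immediate(conn, mqt_sql)
ibm_db.exec_immediate(conn, "REFRESH TABLE total_sales_per_country")
```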
IBM Cognos Analytics is used to design a reporting dashboard that reflects the key metrics of the business. Ecommerce.csv is loaded into the sales_history table in the DB2 warehouse and serves as the data source for the dashboard reports.
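A minimal sketch of the sales_history load, assuming a pandas-based approach (the column list and connection details are assumptions):

```python
# Sketch of loading Ecommerce.csv into the sales_history table on DB2.
# The column list and connection details are assumptions.
import pandas as pd
import ibm_db_dbi

df = pd.read_csv("Ecommerce.csv")

conn = ibm_db_dbi.connect(
    "DATABASE=BLUDB;HOSTNAME=<host>;PORT=<port>;PROTOCOL=TCPIP;"
    "UID=<user>;PWD=<password>;SECURITY=SSL;", "", "")
cur = conn.cursor()
cur.executemany(
    "INSERT INTO sales_history (year, month, category, country, amount) "
    "VALUES (?, ?, ?, ?, ?)",
    df.to_numpy().tolist(),  # one parameter sequence per CSV row
)
conn.commit()
```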
Documentation: ETL
Python is used to perform an incremental data load from the staging data warehouse to the production data warehouse and keep the two in sync. The automation.py script connects to both warehouses and loads all records added to staging since the last load into production.
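The incremental load likely follows the pattern below: find the high-water mark already in production, pull anything newer from staging, and insert it (table and column names, the four-column row shape, and connection details are assumptions):

```python
# Sketch of automation.py's incremental load from staging (PostgreSQL)
# to production (DB2). Names, row shape, and credentials are assumptions.
import psycopg2  # staging warehouse
import ibm_db    # production warehouse

staging = psycopg2.connect("dbname=staging user=postgres")
production = ibm_db.connect(
    "DATABASE=BLUDB;HOSTNAME=<host>;PORT=<port>;PROTOCOL=TCPIP;"
    "UID=<user>;PWD=<password>;SECURITY=SSL;", "", "")

def get_last_rowid():
    """Highest rowid already loaded into production."""
    stmt = ibm_db.exec_immediate(production, "SELECT MAX(rowid) FROM sales_data")
    return ibm_db.fetch_tuple(stmt)[0] or 0

def get_latest_records(rowid):
    """All staging records newer than the given rowid."""
    cur = staging.cursor()
    cur.execute("SELECT * FROM sales_data WHERE rowid > %s", (rowid,))
    return cur.fetchall()

def insert_records(records):
    """Append the new records to the production table."""
    stmt = ibm_db.prepare(production, "INSERT INTO sales_data VALUES (?, ?, ?, ?)")
    for record in records:
        ibm_db.execute(stmt, record)

insert_records(get_latest_records(get_last_rowid()))
```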
Documentation: Data Pipeline
Airflow is used to create a data pipeline that analyzes the web server log. The process_web_log DAG:
- Extracts the ipaddress field from the web server log file and saves it to a text file
- Filters out all occurrences of the IP address "198.46.149.143" from the text file and saves the output to a new text file
- Loads the data by archiving the transformed text file into a TAR file
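A sketch of what the process_web_log DAG could look like with BashOperator tasks (file paths, schedule, and default arguments are assumptions):

```python
# Sketch of the process_web_log DAG. Paths and schedule are assumptions.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "softcart",
    "start_date": datetime(2023, 1, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="process_web_log",
    default_args=default_args,
    schedule_interval=timedelta(days=1),
) as dag:
    # Extract: the ipaddress is the first space-delimited field of the log.
    extract_data = BashOperator(
        task_id="extract_data",
        bash_command="cut -d' ' -f1 /home/airflow/accesslog.txt "
                     "> /home/airflow/extracted_data.txt",
    )
    # Transform: drop every occurrence of the excluded address.
    transform_data = BashOperator(
        task_id="transform_data",
        bash_command="grep -v '198.46.149.143' /home/airflow/extracted_data.txt "
                     "> /home/airflow/transformed_data.txt",
    )
    # Load: archive the transformed file into a TAR file.
    load_data = BashOperator(
        task_id="load_data",
        bash_command="tar -cf /home/airflow/weblog.tar "
                     "-C /home/airflow transformed_data.txt",
    )

    extract_data >> transform_data >> load_data
```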
Documentation: Big Data Analytics
Spark is used to analyze the search terms from the e-commerce web server, provided in searchterms.csv. Spark.ipynb loads the data into a Spark DataFrame and queries it to answer questions such as how many times the term "gaming laptop" was searched and what the top five most frequently used search terms are. A pre-trained sales forecasting model is then used to predict the sales for 2023.
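The DataFrame queries in Spark.ipynb likely take this shape (the searchterm column name and CSV options are assumptions):

```python
# Sketch of the search-term analysis. The column name is an assumption.
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc

spark = SparkSession.builder.appName("SearchTermAnalysis").getOrCreate()
df = spark.read.csv("searchterms.csv", header=True, inferSchema=True)

# Number of times "gaming laptop" was searched.
print(df.filter(df.searchterm == "gaming laptop").count())

# Top 5 most frequently used search terms.
df.groupBy("searchterm").count().orderBy(desc("count")).show(5)
```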