Multinational Retail Data Centralisation
- Table of contents
- Installation and Usage Instructions
- File Structure
- License information
- Open source packages used in this project
This project collates retail sales data for a multinational company, where the data is spread across multiple sources, into one centralised database, making the data accessible and analysable so that team members can become more data-driven.
Class methods are used to connect to data sources, extract the data, and clean the downloaded data.
The cleaned data is uploaded to a centralised database. The star-based database schema is developed manually, ensuring data columns are of the correct data type.
The newly created centralised database is then queried with PostgreSQL to produce up-to-date business metrics.
- Advanced object-oriented programming in Python.
- Advanced development of Python classes and methods.
- Creation and importing of Python modules.
- Connecting to external databases - initiating connections, with database credentials accessed from hidden files.
- Querying and retrieving data from web APIs.
- Extracting data from PDF files with Python.
- Querying an AWS RDS database.
- Downloading data from AWS RDS and S3.
- Accessing data in multiple formats, e.g. .json, .csv, .yaml, and .pdf files.
- Data cleaning with pandas and NumPy.
- Connecting and uploading to a locally hosted PostgreSQL database.
- Developing a star-based schema for the PostgreSQL database.
- Querying the centralised database with PostgreSQL.
- Utilising CTEs in PostgreSQL queries (see the sketch after this list).
- Advanced project documentation and versioning with GitHub.
- Applying Conda environments to the project, and exporting them to facilitate collaboration.
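For example, a minimal sketch of a CTE-based query against the centralised database (the connection string and database name are hypothetical, and the query mirrors the dim_store_details example shown further below; the project's actual queries live in querying_database.ipynb):

```python
import pandas as pd
import sqlalchemy

# Hypothetical connection string - real credentials are read from a hidden YAML file.
engine = sqlalchemy.create_engine("postgresql://user:password@localhost:5432/sales_data")

# The CTE aggregates store counts per country; the outer query then ranks them.
query = """
WITH store_counts AS (
    SELECT country_code, COUNT(*) AS total_no_stores
    FROM dim_store_details
    WHERE store_type != 'Web Portal'
    GROUP BY country_code
)
SELECT country_code, total_no_stores
FROM store_counts
ORDER BY total_no_stores DESC;
"""

print(pd.read_sql(query, engine))
```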
Requires PostgreSQL with pgAdmin4
Requires Python 3
Requires the following packages:
pandas
numpy
tabula-py
requests
boto3
PyYAML
SQLAlchemy
(The re module is part of the Python standard library and does not need to be installed.)
Install each package by running:
pip install <package_name>
Alternatively, use amy_mrdc_env.yaml to recreate the project's conda environment, by running:
conda env create -f amy_mrdc_env.yaml -n <env_name>
git clone https://github.com/amysw13/multinational-retail-data-centralisation.git
cd multinational-retail-data-centralisation
- Configure the centralised PostgreSQL database, and the credentials needed to connect to AWS RDS, the web API, and PostgreSQL.
python Classes/main.py
or
python3 Classes/main.py
to extract, clean, and upload data to the centralised database.
Import class modules:
import Classes.database_utils as db_utils
import Classes.data_extraction as data_ext
import Classes.data_cleaning as data_clean
Create instances of each class:
connector = db_utils.DatabaseConnector()
extractor = data_ext.DataExtractor()
cleaning = data_clean.DataCleaning()
Read credentials and create connections to the AWS RDS and local centralised databases:
See credentials_template.yaml for an example of how to create your own credentials file.
# Read in the AWS RDS database credentials file.
AWS_RDS_credentials = connector.read_db_creds('db_creds')
# Create an engine and connect to the AWS RDS database.
AWS_RDS_engine = connector.init_db_engine(AWS_RDS_credentials)
# Read in the local centralised database credentials file.
local_credentials = connector.read_db_creds('local_creds')
# Create an engine and connect to the local centralised database.
local_engine = connector.init_db_engine(local_credentials)
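Under the hood, these helpers presumably look something like the following minimal sketch (the method bodies, the Credentials/ path, and key names such as RDS_HOST are assumptions based on credentials_template.yaml; see Classes/database_utils.py for the actual implementation):

```python
import yaml
import sqlalchemy

def read_db_creds(filename):
    # Load credentials from a hidden YAML file (key names are illustrative:
    # RDS_HOST, RDS_PORT, RDS_USER, RDS_PASSWORD, RDS_DATABASE).
    with open(f"Credentials/{filename}.yaml") as file:
        return yaml.safe_load(file)

def init_db_engine(creds):
    # Build a SQLAlchemy engine from the loaded credentials.
    url = (f"postgresql://{creds['RDS_USER']}:{creds['RDS_PASSWORD']}"
           f"@{creds['RDS_HOST']}:{creds['RDS_PORT']}/{creds['RDS_DATABASE']}")
    return sqlalchemy.create_engine(url)
```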
Data extraction/downloading:
# List the available table names in the AWS RDS database.
db_list = connector.list_db_tables(AWS_RDS_engine)
# Download data from the 'legacy_users' table, using the AWS RDS connection engine.
rds_df = extractor.read_rds_table('legacy_users', AWS_RDS_engine)
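A method like read_rds_table can be a thin wrapper around pandas - a sketch, not necessarily the project's exact code:

```python
import pandas as pd

def read_rds_table(table_name, engine):
    # Read an entire database table into a DataFrame.
    return pd.read_sql_table(table_name, engine)
```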
Data Cleaning:
# Data-specific cleaning methods.
clean_rds_df = cleaning.clean_user_data(rds_df)
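The exact cleaning rules are data-specific and live in Classes/data_cleaning.py; the following is only an illustrative sketch of the kind of pandas/NumPy cleaning involved (the column names and rules here are hypothetical):

```python
import numpy as np
import pandas as pd

def clean_user_data(df):
    # Replace null-like placeholder strings and drop empty rows (illustrative only).
    df = df.replace("NULL", np.nan).dropna()
    # Coerce a hypothetical date column; invalid entries become NaT and are dropped.
    df["date_of_birth"] = pd.to_datetime(df["date_of_birth"], errors="coerce")
    return df.dropna(subset=["date_of_birth"])
```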
Upload the dataframe to the centralised database:
# Upload the cleaned dataframe to the centralised database, as a table named 'dim_users'.
connector.upload_to_db(clean_rds_df, 'dim_users', local_engine)
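upload_to_db presumably delegates to pandas' DataFrame.to_sql, a sketch under that assumption (the if_exists behaviour is also assumed):

```python
def upload_to_db(df, table_name, engine):
    # Write the cleaned DataFrame to the target database,
    # replacing the table if it already exists.
    df.to_sql(table_name, engine, if_exists="replace", index=False)
```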
A complete run-through of the project can be found in testing_script.ipynb.
Development of the centralised database's star-based schema is documented in database_schema.ipynb.
Entity-relationship diagram of the star-based schema of the centralised database.
Applied PostgreSQL database querying is demonstrated in querying_database.ipynb.
Example:
SELECT country_code,
COUNT(country_code) AS total_no_stores
FROM
dim_store_details
WHERE
store_type != 'Web Portal'
GROUP BY
country_code
ORDER BY
total_no_stores DESC;
| country_code | total_no_stores |
| --- | --- |
| GB | 265 |
| DE | 141 |
| US | 34 |
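The same metric can also be pulled into a DataFrame from Python, reusing the local_engine created in the usage example above (a small sketch):

```python
import pandas as pd

query = """
SELECT country_code, COUNT(country_code) AS total_no_stores
FROM dim_store_details
WHERE store_type != 'Web Portal'
GROUP BY country_code
ORDER BY total_no_stores DESC;
"""

stores_per_country = pd.read_sql(query, local_engine)
print(stores_per_country)
```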
- 📁 Classes
  - 📄 __init__.py
  - 📄 data_cleaning.py
  - 📄 data_extraction.py
  - 📄 database_utils.py
  - 📄 main.py
- 📁 Credentials
- 📁 Data
  - 📄 date_details.json
  - 📄 products.csv
- 📁 images
- 📄 LICENSE
- 📄 README.md
- 📄 amy_mrdc_env.yaml
- 📄 database_schema.ipynb
- 📄 querying_database.ipynb
- 📄 testing_script.ipynb