Multinational Retail Data Centralisation 💼


Description

This project collates retail sales data for a multinational company, where data is spread across multiple sources, into one centralised database. This facilitates access and analysis, helping company team members become more data-driven.

[Project schematic]

Aim

Python classes and methods are used to connect to data sources, extract data, and clean the downloaded data.

Cleaned data is uploaded to a centralised PostgreSQL database. The star-based database schema is developed manually, ensuring data columns are cast to the correct data types.

The newly created centralised database is then queried with PostgreSQL to produce up-to-date business metrics.

Achievement Outcomes 📖

  • Advanced object-oriented programming in Python.
  • Advanced development of Python classes and methods.
  • Creating and importing Python modules.
  • Connecting to external databases, with database credentials read from hidden files.
  • Querying and retrieving data from web APIs.
  • Python methods for extracting data from PDF files.
  • Querying an AWS RDS database.
  • Downloading data from AWS RDS and S3.
  • Accessing data in multiple formats, e.g. .json, .csv, .yaml, and .pdf files.
  • Data cleaning with pandas and NumPy.
  • Connecting and uploading to a locally hosted PostgreSQL database.
  • Developing a star-based schema for the PostgreSQL database.
  • Querying the centralised database with PostgreSQL.
  • Utilising CTEs in PostgreSQL queries.
  • Advanced project documentation and versioning with GitHub.
  • Applying Conda environments to the project, and exporting them to facilitate collaboration.

Installation and Usage Instructions ⚙

Dependencies

Requires PostgreSQL with pgAdmin 4

Requires Python 3

Requires the following packages:

pandas
numpy
tabula-py
requests
boto3
PyYAML
SQLAlchemy

(The re module is part of the Python standard library and needs no installation.)

These can be installed individually by running:

pip install <package_name>
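For example, to install all of them in one go (assuming the PyPI package names listed above):

pip install pandas numpy tabula-py requests boto3 PyYAML SQLAlchemy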

Alternatively, use amy_mrdc_env.yaml to recreate the project conda environment, by running:

conda env create -f amy_mrdc_env.yaml -n <env_name>

Installation Instructions

  1. git clone https://github.com/amysw13/multinational-retail-data-centralisation.git
  2. cd multinational-retail-data-centralisation
  3. Configure the centralised PostgreSQL database, and the credentials for connecting to AWS RDS, the API, and PostgreSQL
  4. Run python Classes/main.py (or python3 Classes/main.py) to extract, clean, and upload data to the centralised database

Demo

1. Database connection, data extraction and data cleaning

Import class modules:

import Classes.database_utils as db_utils
import Classes.data_extraction as data_ext
import Classes.data_cleaning as data_clean

Create instances of each class:

connector = db_utils.DatabaseConnector()
extractor = data_ext.DataExtractor()
cleaning = data_clean.DataCleaning()

Read the credentials and create connections to the AWS RDS and local centralised databases:

See credentials_template.yaml for an example of how to create your own credentials file.
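As a rough illustration, a credentials file might contain fields like these (the key names here are assumptions for this sketch; follow credentials_template.yaml for the actual layout):

# db_creds.yaml - kept out of version control
HOST: your-database-host.rds.amazonaws.com
PORT: 5432
DATABASE: postgres
USER: your_username
PASSWORD: your_password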

# Read in the AWS RDS database credentials file.
AWS_RDS_credentials = connector.read_db_creds('db_creds')
# Create an engine and connect to the AWS RDS database.
AWS_RDS_engine = connector.init_db_engine(AWS_RDS_credentials)

# Read in the centralised local database credentials file.
local_credentials = connector.read_db_creds('local_creds')
# Create an engine and connect to the centralised local database.
local_engine = connector.init_db_engine(local_credentials)
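For context, read_db_creds and init_db_engine might be implemented roughly as below. This is a sketch only; the real methods live in Classes/database_utils.py, and the credential key names are assumptions:

import yaml
from sqlalchemy import create_engine

class DatabaseConnector:
    def read_db_creds(self, filename):
        # Load database credentials from a YAML file kept out of version control.
        with open(f'{filename}.yaml') as file:
            return yaml.safe_load(file)

    def init_db_engine(self, creds):
        # Build a SQLAlchemy engine from the credential fields (key names are assumed).
        url = (f"postgresql+psycopg2://{creds['USER']}:{creds['PASSWORD']}"
               f"@{creds['HOST']}:{creds['PORT']}/{creds['DATABASE']}")
        return create_engine(url)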

Data extraction/downloading:

# Get a list of available table names in the AWS RDS database.
db_list = connector.list_db_tables(AWS_RDS_engine)

# Download data from the 'legacy_users' table, using the AWS RDS connection engine.
rds_df = extractor.read_rds_table('legacy_users', AWS_RDS_engine)
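Under the hood, these extraction methods are likely thin wrappers around SQLAlchemy inspection and pandas.read_sql_table, along these lines (a sketch, not the project's exact code):

import pandas as pd
from sqlalchemy import inspect

def list_db_tables(engine):
    # Inspect the connected database and return its table names.
    return inspect(engine).get_table_names()

def read_rds_table(table_name, engine):
    # Read an entire table from the database into a pandas DataFrame.
    return pd.read_sql_table(table_name, engine)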

Data Cleaning:

# Table-specific cleaning method for user data.
clean_rds_df = cleaning.clean_user_data(rds_df)
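The cleaning methods apply table-specific pandas rules. A minimal sketch of the kind of operations involved (the column name and rules here are illustrative assumptions, not the project's actual logic):

import pandas as pd

def clean_user_data(df):
    # Remove exact duplicate rows.
    df = df.drop_duplicates()
    # Coerce a date column, turning unparseable values into NaT
    # ('date_of_birth' is an assumed column name).
    df['date_of_birth'] = pd.to_datetime(df['date_of_birth'], errors='coerce')
    # Drop rows where the key field failed to parse.
    df = df.dropna(subset=['date_of_birth'])
    return df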

Uploading dataframe to centralised database:

# Cleaned df object uploaded to centralised database, table named as 'dim_users'.
connector.upload_to_db(clean_rds_df, 'dim_users', local_engine)
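upload_to_db presumably delegates to pandas' DataFrame.to_sql, roughly as follows (a sketch, not the project's exact code):

def upload_to_db(df, table_name, engine):
    # Write the cleaned DataFrame to the centralised database,
    # replacing the table if it already exists.
    df.to_sql(table_name, engine, if_exists='replace', index=False)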

A complete run-through of the project is in testing_script.ipynb.


2. Database schema development

Centralised database star-based schema development is documented in database_schema.ipynb.

[Sales data ERD: entity relationship diagram of the star-based schema centralised database]

3. Database querying

Applied PostgreSQL database querying is documented in querying_database.ipynb.

Example:

SELECT country_code,
    COUNT(country_code) AS total_no_stores
FROM
    dim_store_details
WHERE
    store_type != 'Web Portal'
GROUP BY
    country_code
ORDER BY
    total_no_stores DESC;

Result

country_code   total_no_stores
GB             265
DE             141
US             34
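Queries like this can also be run from Python against the centralised database, for example via pandas (a sketch reusing local_engine from the demo above):

import pandas as pd

query = '''
SELECT country_code,
       COUNT(country_code) AS total_no_stores
FROM dim_store_details
WHERE store_type != 'Web Portal'
GROUP BY country_code
ORDER BY total_no_stores DESC;
'''

# Run the query through the engine connected to the centralised database.
stores_per_country = pd.read_sql(query, local_engine)
print(stores_per_country)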

File Structure 📂

License information 🗒

MIT

Open source packages and tools used in this project

Jupyter, VS Code, GitHub, Conda

Database connecting

SQLAlchemy, boto3

Data extracting/downloading

requests, PyYAML, tabula-py

Data cleaning

pandas, NumPy
