Datawrangling example

This repository is a toy example built for educational purposes, to practice using Python, Docker and SQL.

My goals for the project were:

  • Set up a local Postgres database with Docker
  • Clean and insert a large dataset into the database
  • Query the database to solve specific tasks
  • Do all of the above with clean code, reproducible steps and a command-line utility (CLI)

Setup

Below are the steps required to run this project. Prerequisites:

  • Python (v3.9.6 used during testing)
  • Docker

Virtual environment

Using a virtual environment (venv) is recommended, but not necessary:

  1. py -m venv venv
  2. Activate the venv in your IDE. Run pip list to check that you are in the correct venv; a fresh venv should only have pip and setuptools installed.
  3. pip install -r requirements.txt
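
As a complement to the pip list check in step 2, a short script can confirm that the interpreter is running inside a virtual environment. This is a minimal sketch; the function name in_virtualenv is illustrative and not part of the repository.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-created environment."""
    # Inside a venv, sys.prefix points at the environment directory, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```

Run it with the same interpreter you intend to use for main.py; if it reports False, the venv is not activated in that terminal.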

Dataset

  1. Get a copy of the raw dataset:

  2. Extract dataset.zip, for example to /Data

Environment variables

Make a copy of the .env-template file and rename it to .env, then:

  1. Supply it with credentials. These will be used both when setting up the database and when accessing it.
  2. Specify where you placed the dataset: DATASET_PATH should point to the parent directory of /000, /001 etc.
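
A quick way to check that DATASET_PATH points at the right directory is to list the numbered subdirectories under it. The sketch below is an assumption-laden helper, not code from this repository: find_dataset_dirs is a hypothetical name, and it assumes DATASET_PATH is already present in the process environment (how the project itself loads .env is not shown here).

```python
import os
from pathlib import Path


def find_dataset_dirs(dataset_path: str) -> list[str]:
    """Return the sorted names of numeric subdirectories (000, 001, ...)."""
    root = Path(dataset_path)
    if not root.is_dir():
        raise FileNotFoundError(f"DATASET_PATH is not a directory: {root}")
    # Keep only directories whose names consist entirely of digits.
    return sorted(p.name for p in root.iterdir() if p.is_dir() and p.name.isdigit())


if __name__ == "__main__":
    # Assumption: DATASET_PATH has been exported into the environment.
    path = os.environ.get("DATASET_PATH")
    if path is None:
        raise SystemExit("DATASET_PATH is not set")
    print(find_dataset_dirs(path))
```

An empty list suggests that DATASET_PATH points one level too high or too low.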

Running queries

Before running queries, start the database container with docker-compose up. The data insertion and querying can then be done by calling main.py from a separate terminal. Run py main.py --help for more detailed instructions.
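
Because the container takes a moment to start, it can help to wait until the database port accepts connections before calling main.py. The following is a minimal sketch using only the standard library; wait_for_port is a hypothetical helper, and port 5432 is the Postgres default rather than a value confirmed by this project's Docker configuration.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and retry.
            time.sleep(0.5)
    return False


if __name__ == "__main__":
    # Assumption: Postgres is published on its default port, 5432.
    print("database reachable:", wait_for_port("localhost", 5432))
```

An open port only shows that the container is listening; Postgres may still be initializing, so a failed first query shortly after startup is worth retrying.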

Code style

The project is set up to use Black for automatic formatting. Either configure your IDE to run it automatically, or run Black manually with the black command.