The goal of this project is to learn how to use the Delta Lake format to store tables in AWS that can be queried with AWS Athena. The data used for this project is fantasy football data covering 1970 to 2021 (source: bendominguez0111/fantasy-csv-data). By following the steps in this README, you should be able to set up the appropriate services in AWS and construct a query that returns the average fantasy points among top quarterbacks, running backs, and wide receivers for each year.
- Poetry: `curl -sSL https://install.python-poetry.org | python3 -`
- Java: `sudo apt install default-jre default-jdk`
- An AWS account
- Clone the repository to your local machine.
- Install the required dependencies: `poetry install`
- Create a new S3 bucket. Most of the default settings can be left as-is. Remember that bucket names must be globally unique across AWS.
- Create the following folder structure within the bucket:

  ```
  your-name-delta-lake-project-1/
  ├── athena
  ├── database
  │   └── raw
  └── scripts
      └── temp
  ```

- Upload all of the .csv files that sit directly under data/ to database/raw/. Do not upload the .csv files under archive/.
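If you would rather script the bucket setup than click through the console, a minimal boto3 sketch along these lines should do the same thing (the bucket name, region, and local data/ layout are placeholders to adjust):

```python
# Hypothetical boto3 equivalent of the console steps above.
from pathlib import Path

import boto3

BUCKET = "your-name-delta-lake-project-1"  # placeholder -- bucket names are globally unique
REGION = "us-east-1"                       # assumed region

s3 = boto3.client("s3", region_name=REGION)

# Outside us-east-1 you must also pass
# CreateBucketConfiguration={"LocationConstraint": REGION}.
s3.create_bucket(Bucket=BUCKET)

# S3 "folders" are just key prefixes; zero-byte objects make them visible in the console.
for prefix in ("athena/", "database/raw/", "scripts/temp/"):
    s3.put_object(Bucket=BUCKET, Key=prefix)

# Upload the .csv files that sit directly under data/ (skipping data/archive/).
for csv_path in Path("data").glob("*.csv"):
    s3.upload_file(str(csv_path), BUCKET, f"database/raw/{csv_path.name}")
```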
- Go to IAM and select Create Role.
- Configure the Trusted Entity (AWS Service) and Use Case (Glue).
- Attach the AdministratorAccess policy to the role.
- Give the role a name, e.g. your-name-glue-delta-lake-project-1.
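Equivalently, the role can be created with boto3; a sketch, using the example role name referenced later in this README (AdministratorAccess mirrors the console steps above, though it is broader than Glue strictly needs):

```python
# Hypothetical scripted version of the IAM role creation above.
import json

import boto3

iam = boto3.client("iam")
ROLE_NAME = "your-name-glue-delta-lake-project-1"

# Trust policy: allow the AWS Glue service to assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName=ROLE_NAME,
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.attach_role_policy(
    RoleName=ROLE_NAME,
    PolicyArn="arn:aws:iam::aws:policy/AdministratorAccess",
)
```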
We will create a table in our Delta Lake with the following schema.

```mermaid
erDiagram
    PERSON {
        int Year
        string Position
        string Player
        float FantasyPoints
    }
```
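Expressed as a PySpark schema, the diagram above translates roughly as follows (the authoritative definition is whatever init_delta_lake.ipynb actually builds):

```python
from pyspark.sql.types import (
    FloatType,
    IntegerType,
    StringType,
    StructField,
    StructType,
)

# The table schema from the diagram above, as a PySpark StructType.
top_performers_schema = StructType([
    StructField("Year", IntegerType()),
    StructField("Position", StringType()),
    StructField("Player", StringType()),
    StructField("FantasyPoints", FloatType()),
])
```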
- Review the init_delta_lake.ipynb notebook. In order to run the notebook successfully, you will need the following configured (see the session-configuration sketch after these steps):
  - Access key in ~/.aws/config
  - Secret key in ~/.aws/config
  - Correct hadoop-aws package version installed
  - `BUCKET_NAME` set
- Select the proper kernel (.venv) for the notebook.
- Run the cells in the notebook after configuration.
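For reference, the kind of local session setup the notebook needs looks roughly like the sketch below. The package version and the credentials provider are assumptions, not taken from the notebook: match them to your own Spark/Hadoop installation and to however the notebook actually wires up credentials.

```python
# Sketch: a local Spark session with Delta Lake and the S3A connector.
import pyspark
from delta import configure_spark_with_delta_pip

builder = (
    pyspark.sql.SparkSession.builder.appName("init_delta_lake")
    # Standard Delta Lake session extensions.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # One way to have S3A read the access/secret key from your AWS profile;
    # the notebook may configure credentials differently.
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.profile.ProfileCredentialsProvider")
)

# configure_spark_with_delta_pip adds the delta-spark jars; extra_packages pulls
# in hadoop-aws -- the version must line up with your Hadoop build.
spark = configure_spark_with_delta_pip(
    builder, extra_packages=["org.apache.hadoop:hadoop-aws:3.3.4"]
).getOrCreate()
```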
- Navigate to AWS Glue Crawlers and click Create crawler.
- Give the crawler a name, e.g. ff-delta-lake. Next.
- Add a data source and make sure to select Delta Lake. Provide the path to the folder containing the parquet files; it should be s3://bucket-name/database/top_performers_delta/. Next.
- Select the IAM role you created for the crawler. Next.
- In Output Configuration, select Add database and add a new database named delta_lake to the Glue Data Catalog. Close the tab, return to the crawler configuration, and select the newly created database, refreshing the list of databases if necessary. Next.
- Review and create the crawler, then run it. After about a minute, navigate to AWS Glue > Tables. You should see a new table named top_performers_delta. Select the table and its schema should be visible.
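You can also confirm the crawler's output from Python rather than the console; a small boto3 check against the Glue Data Catalog, using the database and table names created above:

```python
# Confirm the crawler registered the table in the Glue Data Catalog.
import boto3

glue = boto3.client("glue")
table = glue.get_table(DatabaseName="delta_lake", Name="top_performers_delta")

# Print the discovered column names and types.
for column in table["Table"]["StorageDescriptor"]["Columns"]:
    print(column["Name"], column["Type"])
```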
The job csv_to_delta.ipynb reads data from all of the CSV files, extracts the top 12 fantasy performers per position, and writes the data to the Delta Lake storage layer on S3. The Glue job can be run through a local notebook by connecting to an AWS Glue Interactive Session, or by uploading the notebook to the cloud; both sets of instructions are given below.
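For context, the core of that job amounts to a window-function ranking. A rough sketch is shown below; the column names follow the schema shown earlier and the exact logic in csv_to_delta.ipynb may differ (for example, the raw CSVs may need columns renamed or derived first):

```python
# Sketch of the "top 12 per position per year" transformation in a Glue job.
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = GlueContext(SparkContext.getOrCreate()).spark_session

# Assumes the raw CSVs expose Year, Position, Player and FantasyPoints columns.
df = spark.read.csv(
    "s3://your-name-delta-lake-project-1/database/raw/",
    header=True,
    inferSchema=True,
)

# Rank players by fantasy points within each year/position and keep the top 12.
ranking_window = Window.partitionBy("Year", "Position").orderBy(
    F.col("FantasyPoints").desc()
)
top_performers = (
    df.withColumn("rank", F.row_number().over(ranking_window))
    .filter(F.col("rank") <= 12)
    .select("Year", "Position", "Player", "FantasyPoints")
)

# Append the result to the Delta table created by init_delta_lake.ipynb.
top_performers.write.format("delta").mode("append").save(
    "s3://your-name-delta-lake-project-1/database/top_performers_delta/"
)
```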
- Add `glue_iam_role=arn:aws:iam::<AccountID>:role/<GlueServiceRole>` to your ~/.aws/config file. Obtain the proper ARN from IAM > Roles > your-name-glue-delta-lake-project-1. See Configuring sessions with named profiles for more details.
- In a new terminal in the project repo, run `poetry shell`.
- Run `jupyter notebook --no-browser` and copy the Jupyter server URL.
- Open csv_to_delta.ipynb. Select Kernel -> Select Another Kernel... -> Paste URL and select Glue PySpark.
- Review the notebook. Set `BUCKET_NAME` correctly, then run the notebook.
- To confirm the write was successful, navigate to s3://your-name-delta-lake-project-1/database/top_performers_delta/_delta_log/. You should observe two .json files: the first represents the initial write transaction and the second represents the Glue job's table append.
- Create a new notebook job in AWS Glue.
  - Engine: Spark (Python)
  - Upload Notebook: notebooks/csv_to_delta.ipynb
  - IAM Role: your-name-glue-delta-lake-project-1
- Save the Glue job under the name csv_to_delta. Run the cells in the notebook.
- To confirm the write was successful, navigate to s3://your-name-delta-lake-project-1/database/top_performers_delta/_delta_log/. You should observe two .json files: the first represents the initial write transaction and the second represents the Glue job's table append.
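Whichever path you followed, the same check can be made from Python instead of the S3 console; a sketch using boto3 (the bucket name is the placeholder used throughout this README):

```python
# List the Delta transaction log files to confirm both commits are present.
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket="your-name-delta-lake-project-1",
    Prefix="database/top_performers_delta/_delta_log/",
)

# Expect two commit files, e.g. ...00000000000000000000.json and ...01.json.
for obj in response.get("Contents", []):
    if obj["Key"].endswith(".json"):
        print(obj["Key"])
```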
- Open the Query Editor in Athena. Set the Data Source to AwsDataCatalog and the Database to delta_lake. Copy and paste the following SQL into the editor and run it.

  ```sql
  SELECT year, position, ROUND(AVG(fantasypoints), 1) AS "Fantasy Points"
  FROM "delta_lake"."top_performers_delta"
  GROUP BY year, position
  ORDER BY year DESC, position
  ```
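The query can also be submitted programmatically; a rough boto3 sketch that writes results to the athena/ folder created earlier:

```python
# Submit the query to Athena from Python and poll for completion.
import time

import boto3

athena = boto3.client("athena")

QUERY = """
SELECT year, position, ROUND(AVG(fantasypoints), 1) AS "Fantasy Points"
FROM "delta_lake"."top_performers_delta"
GROUP BY year, position
ORDER BY year DESC, position
"""

execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "delta_lake", "Catalog": "AwsDataCatalog"},
    ResultConfiguration={"OutputLocation": "s3://your-name-delta-lake-project-1/athena/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows[:5]:  # header row plus the first few data rows
        print([field.get("VarCharValue") for field in row["Data"]])
```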
See the qb_eda.ipynb notebook for some exploratory data analysis to understand the data.