The goal of this project is to learn how to use the Delta Lake format to store tables in AWS that can be queried with AWS Athena. The data used for this project is fantasy football data covering 1970 to 2021 (source: bendominguez0111/fantasy-csv-data). By following the steps in this README, you should be able to set up the appropriate services in AWS and construct a query that returns the average fantasy points among top quarterbacks, running backs, and wide receivers for each year.
- Poetry: `curl -sSL https://install.python-poetry.org | python3 -`
- Java: `sudo apt install default-jre default-jdk`
- An AWS account
- Clone the repository to your local machine.
- Install the required dependencies: `poetry install`
- Create a new S3 bucket. Most of the default settings can be left as-is. Remember that bucket names must be globally unique across AWS.
- Create the following folder structure within the bucket:

  ```
  your-name-delta-lake-project-1/
  ├── athena
  ├── database
  │   └── raw
  └── scripts
      └── temp
  ```

- Upload all of the .csv files that sit directly under data/ to database/raw/. Do not upload the .csv files under archive/.
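If you would rather script the bucket setup than click through the console, a minimal boto3 sketch along these lines should do the same thing (the bucket name, region, and local data/ layout are placeholders to adjust):

```python
# Hypothetical boto3 equivalent of the console steps above.
from pathlib import Path

import boto3

BUCKET = "your-name-delta-lake-project-1"  # placeholder -- bucket names are globally unique
REGION = "us-east-1"                       # assumed region

s3 = boto3.client("s3", region_name=REGION)

# Outside us-east-1 you must also pass
# CreateBucketConfiguration={"LocationConstraint": REGION}.
s3.create_bucket(Bucket=BUCKET)

# S3 "folders" are just key prefixes; zero-byte objects make them visible in the console.
for prefix in ("athena/", "database/raw/", "scripts/temp/"):
    s3.put_object(Bucket=BUCKET, Key=prefix)

# Upload the .csv files that sit directly under data/ (skipping data/archive/).
for csv_path in Path("data").glob("*.csv"):
    s3.upload_file(str(csv_path), BUCKET, f"database/raw/{csv_path.name}")
```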
- Go to IAM and select Create Role.
- Configure the Trusted Entity (AWS Service) and Use Case (Glue).
- Attach the AdministratorAccess policy to the role.
- Give the role a name, e.g. your-name-glue-delta-lake-project-1.
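Equivalently, the role can be created with boto3; a sketch, using the example role name referenced later in this README (AdministratorAccess mirrors the console steps above, though it is broader than Glue strictly needs):

```python
# Hypothetical scripted version of the IAM role creation above.
import json

import boto3

iam = boto3.client("iam")
ROLE_NAME = "your-name-glue-delta-lake-project-1"

# Trust policy: allow the AWS Glue service to assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName=ROLE_NAME,
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.attach_role_policy(
    RoleName=ROLE_NAME,
    PolicyArn="arn:aws:iam::aws:policy/AdministratorAccess",
)
```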
We will create a table in our Delta Lake with the following schema.

```mermaid
erDiagram
    PERSON {
        int Year
        string Position
        string Player
        float FantasyPoints
    }
```
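Expressed as a PySpark schema, the diagram above translates roughly as follows (the authoritative definition is whatever init_delta_lake.ipynb actually builds):

```python
from pyspark.sql.types import (
    FloatType,
    IntegerType,
    StringType,
    StructField,
    StructType,
)

# The table schema from the diagram above, as a PySpark StructType.
top_performers_schema = StructType([
    StructField("Year", IntegerType()),
    StructField("Position", StringType()),
    StructField("Player", StringType()),
    StructField("FantasyPoints", FloatType()),
])
```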
- Review the init_delta_lake.ipynb notebook. In order to run the notebook successfully, you will need the following configured (see the session-configuration sketch after these steps):
  - Access key in ~/.aws/config
  - Secret key in ~/.aws/config
  - Correct hadoop-aws package version installed
  - `BUCKET_NAME` set
- Select the proper kernel (.venv) for the notebook.
- Run the cells in the notebook after configuration.
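For reference, the kind of local session setup the notebook needs looks roughly like the sketch below. The package version and the credentials provider are assumptions, not taken from the notebook: match them to your own Spark/Hadoop installation and to however the notebook actually wires up credentials.

```python
# Sketch: a local Spark session with Delta Lake and the S3A connector.
import pyspark
from delta import configure_spark_with_delta_pip

builder = (
    pyspark.sql.SparkSession.builder.appName("init_delta_lake")
    # Standard Delta Lake session extensions.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # One way to have S3A read the access/secret key from your AWS profile;
    # the notebook may configure credentials differently.
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.profile.ProfileCredentialsProvider")
)

# configure_spark_with_delta_pip adds the delta-spark jars; extra_packages pulls
# in hadoop-aws -- the version must line up with your Hadoop build.
spark = configure_spark_with_delta_pip(
    builder, extra_packages=["org.apache.hadoop:hadoop-aws:3.3.4"]
).getOrCreate()
```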
- Navigate to AWS Glue Crawlers and click Create crawler.
- Give the crawler a name, e.g. ff-delta-lake. Next.
- Add a data source and make sure to select Delta Lake. Provide the path to the folder containing the parquet files; it should be s3://bucket-name/database/top_performers_delta/. Next.
- Select the IAM role you created for the crawler. Next.
- In Output Configuration, select Add database and add a new database named delta_lake to the Glue Data Catalog. Close the tab, return to the crawler configuration, and select the newly created database, refreshing the list of databases if necessary. Next.
- Review and create the crawler, then run it. After about a minute, navigate to AWS Glue > Tables. You should see a new table named top_performers_delta. Select the table and its schema should be visible.
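You can also confirm the crawler's output from Python rather than the console; a small boto3 check against the Glue Data Catalog, using the database and table names created above:

```python
# Confirm the crawler registered the table in the Glue Data Catalog.
import boto3

glue = boto3.client("glue")
table = glue.get_table(DatabaseName="delta_lake", Name="top_performers_delta")

# Print the discovered column names and types.
for column in table["Table"]["StorageDescriptor"]["Columns"]:
    print(column["Name"], column["Type"])
```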
The job csv_to_delta.ipynb reads data from all of the CSV files, extracts the top 12 fantasy performers per position, and writes the data to the Delta Lake storage layer on S3. The Glue job can be run through a local notebook by connecting to an AWS Glue Interactive Session, or by uploading the notebook to the cloud; both sets of instructions are given below.
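For context, the core of that job amounts to a window-function ranking. A rough sketch is shown below; the column names follow the schema shown earlier and the exact logic in csv_to_delta.ipynb may differ (for example, the raw CSVs may need columns renamed or derived first):

```python
# Sketch of the "top 12 per position per year" transformation in a Glue job.
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = GlueContext(SparkContext.getOrCreate()).spark_session

# Assumes the raw CSVs expose Year, Position, Player and FantasyPoints columns.
df = spark.read.csv(
    "s3://your-name-delta-lake-project-1/database/raw/",
    header=True,
    inferSchema=True,
)

# Rank players by fantasy points within each year/position and keep the top 12.
ranking_window = Window.partitionBy("Year", "Position").orderBy(
    F.col("FantasyPoints").desc()
)
top_performers = (
    df.withColumn("rank", F.row_number().over(ranking_window))
    .filter(F.col("rank") <= 12)
    .select("Year", "Position", "Player", "FantasyPoints")
)

# Append the result to the Delta table created by init_delta_lake.ipynb.
top_performers.write.format("delta").mode("append").save(
    "s3://your-name-delta-lake-project-1/database/top_performers_delta/"
)
```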
- Add `glue_iam_role=arn:aws:iam::<AccountID>:role/<GlueServiceRole>` to your ~/.aws/config file. Obtain the proper ARN from IAM > Roles > your-name-glue-delta-lake-project-1. See Configuring sessions with named profiles for more details.
- In a new terminal in the project repo, run `poetry shell`.
- Run `jupyter notebook --no-browser` and copy the Jupyter server URL.
- Open csv_to_delta.ipynb. Select Kernel -> Select Another Kernel... -> Paste URL and select Glue PySpark.
- Review the notebook. Set `BUCKET_NAME` correctly, then run the notebook.
- To confirm the write was successful, navigate to s3://your-name-delta-lake-project-1/database/top_performers_delta/_delta_log/. You should observe two .json files: the first represents the initial write transaction and the second represents the Glue job's table append.
- Create a new notebook job in AWS Glue.
  - Engine: Spark (Python)
  - Upload Notebook: notebooks/csv_to_delta.ipynb
  - IAM Role: your-name-glue-delta-lake-project-1
- Save the Glue job under the name csv_to_delta. Run the cells in the notebook.
- To confirm the write was successful, navigate to s3://your-name-delta-lake-project-1/database/top_performers_delta/_delta_log/. You should observe two .json files: the first represents the initial write transaction and the second represents the Glue job's table append.
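Whichever path you followed, the same check can be made from Python instead of the S3 console; a sketch using boto3 (the bucket name is the placeholder used throughout this README):

```python
# List the Delta transaction log files to confirm both commits are present.
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket="your-name-delta-lake-project-1",
    Prefix="database/top_performers_delta/_delta_log/",
)

# Expect two commit files, e.g. ...00000000000000000000.json and ...01.json.
for obj in response.get("Contents", []):
    if obj["Key"].endswith(".json"):
        print(obj["Key"])
```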
- Open the Query Editor in Athena. Set the Data Source to AwsDataCatalog and the Database to delta_lake. Copy and paste the following SQL into the editor and run it.

  ```sql
  SELECT year, position, ROUND(AVG(fantasypoints), 1) AS "Fantasy Points"
  FROM "delta_lake"."top_performers_delta"
  GROUP BY year, position
  ORDER BY year DESC, position
  ```
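The query can also be submitted programmatically; a rough boto3 sketch that writes results to the athena/ folder created earlier:

```python
# Submit the query to Athena from Python and poll for completion.
import time

import boto3

athena = boto3.client("athena")

QUERY = """
SELECT year, position, ROUND(AVG(fantasypoints), 1) AS "Fantasy Points"
FROM "delta_lake"."top_performers_delta"
GROUP BY year, position
ORDER BY year DESC, position
"""

execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "delta_lake", "Catalog": "AwsDataCatalog"},
    ResultConfiguration={"OutputLocation": "s3://your-name-delta-lake-project-1/athena/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows[:5]:  # header row plus the first few data rows
        print([field.get("VarCharValue") for field in row["Data"]])
```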
See the qb_eda.ipynb notebook for some exploratory data analysis to understand the data.