This is a simple project created to demonstrate the workflow of a data pipeline. The source data comes from the RSS feed of the Computer Science arXiv. The workflow is as follows:
- install Python 3.6+
- install `pip3`: `sudo apt install python3-pip`
- install the required modules: `pip3 install -r requirements.txt`
- set up a home for the `airflow` module: `export AIRFLOW_HOME=~/airflow`
- start the webserver: `airflow webserver`
- visit localhost:8080 in the browser and set the `arxiv-pipeline` DAG to `On` in the webserver UI
- open another terminal and start the scheduler: `airflow scheduler`
- this program is only tested on Ubuntu 18.04
- AWS credentials appear as `***` in the source code
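
This README does not show the pipeline code itself, so here is a minimal sketch of what the first step, reading items out of an RSS feed, might look like. It assumes a plain RSS 2.0 layout (`channel` > `item` with `title` and `link`); the real arXiv feed may use different elements or namespaces. The function name `parse_feed_items` and the sample feed are hypothetical and are not taken from this project.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for a downloaded feed; this is not real arXiv data.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First placeholder paper</title>
      <link>https://example.org/abs/0001</link>
    </item>
    <item>
      <title>Second placeholder paper</title>
      <link>https://example.org/abs/0002</link>
    </item>
  </channel>
</rss>
"""


def parse_feed_items(xml_text):
    """Return a list of {'title', 'link'} dicts, one per <item> in the feed."""
    root = ET.fromstring(xml_text)
    items = []
    # iter("item") walks the whole tree, so it finds items nested under <channel>.
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
        })
    return items


if __name__ == "__main__":
    for entry in parse_feed_items(SAMPLE_FEED):
        print(entry["title"], entry["link"])
```

In a real run, the sample string would be replaced by the text downloaded from the feed, and a function like this could be called from a task in the `arxiv-pipeline` DAG.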