arXiv Pipeline

This is a simple project that demonstrates the workflow of a data pipeline. The source data comes from the RSS feed of the Computer Science arXiv. The workflow is as follows:

[workflow diagram]
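To make the first stage of such a pipeline concrete, here is a minimal sketch of reading items from an RSS feed using only the Python standard library. It is not this project's code: the function name `parse_feed_items`, the sample XML, and the RSS 2.0 layout (`channel/item` with `title` and `link`) are all assumptions for illustration. The real arXiv feed may be structured differently and would be downloaded rather than read from a string.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for a downloaded feed (not real arXiv data).
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>First paper</title>
      <link>https://example.org/abs/1</link>
    </item>
    <item>
      <title>Second paper</title>
      <link>https://example.org/abs/2</link>
    </item>
  </channel>
</rss>
"""


def parse_feed_items(xml_text):
    """Return a list of {"title", "link"} dicts, one per <item> element."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns None when the child is missing, so fall back to "".
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
        })
    return items


if __name__ == "__main__":
    for entry in parse_feed_items(SAMPLE_FEED):
        print(entry["title"], entry["link"])
```

Keeping the parser as a pure function of the feed text makes it easy to wrap in a scheduled task later and to test without network access.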

Instructions

  1. install Python 3.6+

  2. install pip3

    sudo apt install python3-pip
  3. install required modules

    pip3 install -r requirements.txt
  4. set up a home directory for the airflow module

    export AIRFLOW_HOME=~/airflow
  5. start the webserver

    airflow webserver
  6. visit localhost:8080 in the browser and set the arxiv-pipeline DAG to On in the webserver UI

  7. open another terminal and start the scheduler

    airflow scheduler
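Before starting the webserver and scheduler, it can help to confirm the prerequisites from steps 1 and 4. The following is a small sketch of such a check, not part of this repository: the function names `airflow_home` and `preflight` are hypothetical. It relies only on the Python 3.6+ requirement stated above and on `~/airflow` being the location Airflow uses when `AIRFLOW_HOME` is unset.

```python
import os
import sys


def airflow_home(environ=None):
    """Resolve AIRFLOW_HOME, falling back to the default of ~/airflow."""
    environ = os.environ if environ is None else environ
    return os.path.expanduser(environ.get("AIRFLOW_HOME", "~/airflow"))


def preflight(version=None):
    """Return a list of human-readable problems; an empty list means OK."""
    # Compare only (major, minor) so the check reads the same as "3.6+".
    version = tuple(sys.version_info[:2]) if version is None else tuple(version)
    problems = []
    if version < (3, 6):
        problems.append("Python 3.6+ is required, found %d.%d" % version)
    return problems


if __name__ == "__main__":
    print("AIRFLOW_HOME resolves to:", airflow_home())
    for problem in preflight():
        print("problem:", problem)
```

Running the script in the same terminal where `AIRFLOW_HOME` was exported shows which directory the `airflow` commands will use.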

Note

  • this program has only been tested on Ubuntu 18.04
  • AWS credentials are masked as *** in the source code