kent-lee/arxiv-pipeline

arXiv Pipeline

This is a simple project that demonstrates the workflow of a data pipeline. The source data comes from the RSS feed of the Computer Science section of arXiv. The workflow is as follows:

[Workflow diagram]
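
As a rough illustration of the first stage of such a pipeline, the sketch below extracts titles and links from RSS items in an XML string. It is not the repository's code: the helper names `local_name` and `parse_feed_items` and the sample XML are hypothetical, and the real arXiv feed may use different namespaces or carry more fields.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix, e.g. '{http://...}title' -> 'title'."""
    return tag.rsplit("}", 1)[-1]


def parse_feed_items(xml_text):
    """Return a list of {'title', 'link'} dicts, one per <item> element.

    Hypothetical helper: elements are matched by local name, so it works
    whether or not the feed declares XML namespaces.
    """
    root = ET.fromstring(xml_text)
    items = []
    for element in root.iter():
        if local_name(element.tag) != "item":
            continue
        # Collect the text of each direct child, keyed by its local name.
        fields = {local_name(child.tag): (child.text or "").strip() for child in element}
        items.append({"title": fields.get("title", ""), "link": fields.get("link", "")})
    return items


# Made-up sample data for illustration only, not real arXiv output.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <item><title>Example paper A</title><link>https://example.org/a</link></item>
  <item><title>Example paper B</title><link>https://example.org/b</link></item>
</channel></rss>"""

if __name__ == "__main__":
    for item in parse_feed_items(SAMPLE_FEED):
        print(item["title"], item["link"])
```

In a real run the XML string would come from the feed itself rather than from a constant. Keeping the parsing step as a pure function makes it easy to test without network access.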

Instructions

  1. install Python 3.6+

  2. install pip3

    sudo apt install python3-pip
  3. install required modules

    pip3 install -r requirements.txt
  4. set up a home directory for Airflow

    export AIRFLOW_HOME=~/airflow
  5. start the webserver

    airflow webserver
  6. visit localhost:8080 in the browser and set the arxiv-pipeline DAG to On in the webserver UI

  7. open another terminal and start the scheduler

    airflow scheduler
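
Before starting the webserver, it can help to confirm the prerequisites from steps 1 and 4. The following is a minimal sketch of such a check, not part of the repository; the function names `resolve_airflow_home` and `python_is_supported` are hypothetical.

```python
import os
import sys


def resolve_airflow_home(environ):
    """Return the Airflow home directory implied by an environment mapping.

    Falls back to ~/airflow, the value exported in step 4.
    """
    return os.path.expanduser(environ.get("AIRFLOW_HOME", "~/airflow"))


def python_is_supported(version_info=sys.version_info):
    """True when the interpreter meets the Python 3.6+ requirement from step 1."""
    return tuple(version_info[:2]) >= (3, 6)


if __name__ == "__main__":
    print("Python 3.6+:", python_is_supported())
    print("AIRFLOW_HOME:", resolve_airflow_home(os.environ))
```

Note that `export AIRFLOW_HOME=~/airflow` only applies to the terminal where it is run, so the second terminal opened for the scheduler may need the same export.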

Note

  • this program has only been tested on Ubuntu 18.04
  • AWS credentials appear as the placeholder *** in the source code
