The ETL process should be executed from the `bootstrap` directory. `etl.sh` takes 5 arguments, as follows (a general invocation template is shown after this list):
- Input path containing WARC paths
- Output name (usually the month for which the ETL is executed)
- Type of input (specifies whether WARC files should be loaded from the local drive or from S3). Options are:
  - `S3`: use this if you want to access the crawl data through S3 buckets (does not need S3 credentials)
  - `file`: use this if you want to work on WARC data that exists on your local machine
- Crawl path. Should be the file path of your crawl data if the type of input is `file`, or the bucket (`commoncrawl`) in the case of S3.
- Batch size. Specifies how many batches of WARC files to process in a single run.
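Putting these together, the general form of a run is sketched below; the angle-bracket names are placeholders for the arguments described above, and `execute` is the subcommand used in the examples that follow.

```sh
# General form of an ETL run (angle brackets mark placeholders)
./etl.sh execute <input-path> <output-name> <S3|file> <crawl-path> <batch-size>
```

The input paths file itself is a plain-text list of WARC locations, presumably one per line following Common Crawl's warc.paths convention (entries typically look like `crawl-data/CC-MAIN-.../warc/....warc.gz`; this is illustrative, not taken from this repository).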
Note that if you are running the ETL from the local drive, you will need to download sample crawl data using the following command. This might take a couple of hours.

```sh
cd bootstrap
./get-data.sh
```
Example: running the ETL against S3 (reading from the `commoncrawl` bucket) for May, with a batch size of 10:

```sh
cd bootstrap
./etl.sh execute input_paths/may.warc.paths may S3 commoncrawl 10
```
Example: running the ETL against local WARC data, with a batch size of 1:

```sh
cd bootstrap
./etl.sh execute input_paths/may.warc.paths may file <path_to_root>/community-clusters/bootstrap 1
```
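Both examples use `may` as the output name; the analysis step below expects the corresponding ETL output to appear under `bootstrap/spark-warehouse/may`.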
The analysis process should be executed from the root directory. `autoAnalysis.sh` takes 3 arguments, as follows:
- Category of focus, e.g. shopping
- Focus domains, e.g. 'etsy ebay amazon'
- Month, e.g. may
```sh
./autoAnalysis.sh shopping 'etsy ebay amazon' may
```
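If the analysis completes successfully, it should generate the HTML report that is served in the viewing step below, in this case `public/index-shopping-may.html`.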
Note that the above command assumes that `bootstrap/spark-warehouse/may` exists (generated by the ETL process).
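As a quick sanity check before running the analysis (the command below is only an illustration, not part of the repository's scripts), you can confirm that directory is present:

```sh
# List the ETL output for May; an error here means the ETL step has not produced it yet
ls bootstrap/spark-warehouse/may
```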
To view the generated report, start a simple web server from the root directory:

```sh
python3 -m http.server 8080
```
Then navigate to `public/index-<category>-<month>.html` in your browser, e.g. `localhost:8080/public/index-shopping-may.html`.