Update requirements and readme
mrafayaleem committed Dec 8, 2018
1 parent 3a5ec7d commit 9dfd83c
Showing 4 changed files with 24 additions and 185 deletions.
15 changes: 11 additions & 4 deletions RUNNING.md
@@ -1,5 +1,5 @@

Steps to run ETL process:
#####Steps to run ETL process:

ETL process should be executed from the bootstrap directory.

@@ -9,16 +9,23 @@ ETL process should be executed from the bootstrap directory.
3. Type of input (specifies if warc files should be loaded from local drive or s3). Options are:
* s3
* file
4. Craw path. Should be file path of your crawl data if type of input is file. Should be bucket (commoncrawl) in case of s3.
4. Crawl path. Should be file path of your crawl data if type of input is file. Should be bucket (commoncrawl) in case of s3.
5. Batch size. Specifies how many batches of warc files to process in a single run.
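
A minimal sketch of how the arguments listed above could be checked before calling `./etl.sh`. The function name `validate_etl_args` is hypothetical and not part of the repository. The argument positions (input type third, crawl path fourth, batch size fifth after `execute`) are inferred from the numbered list and the example invocations in this file. The first two arguments are passed through unchecked because their descriptions are not shown here, and the requirement that the batch size be a positive integer is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): checks the five
# arguments that follow `execute` before they are handed to ./etl.sh.
validate_etl_args() {
  if [ "$#" -ne 5 ]; then
    echo "expected 5 arguments, got $#" >&2
    return 1
  fi

  # Arguments 1 and 2 are not inspected here.
  local input_type="$3" crawl_path="$4" batch_size="$5"

  # Argument 3: only the two input types listed above are accepted.
  case "$input_type" in
    s3|file) ;;
    *) echo "input type must be 's3' or 'file'" >&2; return 1 ;;
  esac

  # Argument 4: a local path (for file) or a bucket name (for s3).
  if [ -z "$crawl_path" ]; then
    echo "crawl path must not be empty" >&2
    return 1
  fi

  # Argument 5: assumed to be a positive integer.
  case "$batch_size" in
    ''|*[!0-9]*) echo "batch size must be a positive integer" >&2; return 1 ;;
  esac
  if [ "$batch_size" -le 0 ]; then
    echo "batch size must be a positive integer" >&2
    return 1
  fi

  return 0
}
```

Calling `validate_etl_args input_paths/may.warc.paths may s3 commoncrawl 10` succeeds, while an input type other than `s3` or `file` fails with a message on stderr.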

For file:
Note that if you are running ETL from local drive, you will need to download sample crawl data using the following command. This might take a couple of hours.
```
cd bootstrap
./get-data.sh
```


#####To run ETL from the file:
```bash
cd bootstrap
./etl.sh execute input_paths/may.warc.paths may file /Users/rafay/datalab/community-clusters/bootstrap 1
```

For S3:
#####To run ETL from S3 (For the month of may):
```bash
cd bootstrap
./etl.sh execute input_paths/may.warc.paths may s3 commoncrawl 10
34 changes: 0 additions & 34 deletions bootstrap/requirements.txt

This file was deleted.

160 changes: 13 additions & 147 deletions requirements.txt
@@ -1,172 +1,38 @@
apturl==0.5.2
asn1crypto==0.24.0
astroid==2.0.4
autopep8==1.4.3
appnope==0.1.0
backcall==0.1.0
bcolz==1.2.1
beautifulsoup4==4.6.3
bleach==2.1.4
boto3==1.9.23
botocore==1.12.23
Bottleneck==1.2.1
Brlapi==0.6.6
boto3==1.9.47
botocore==1.12.47
bs4==0.0.1
certifi==2018.11.29
certifi==2018.10.15
chardet==3.0.4
cloudpickle==0.6.1
colorama==0.3.9
command-not-found==0.3
cryptography==2.1.4
cupshelpers==1.0
cycler==0.10.0
cymem==2.0.2
cytoolz==0.9.0.1
dask==0.20.1
dataclasses==0.6
decorator==4.3.0
defer==1.0.6
defusedxml==0.5.0
dill==0.2.8.2
distro-info==0.18
docutils==0.14
en-core-web-sm==2.0.0
entrypoints==0.2.3
fastprogress==0.1.18
fire==0.1.3
ftfy==5.5.0
gitdb2==2.0.4
GitPython==2.1.11
graphviz==0.10.1
html5lib==1.0.1
httplib2==0.9.2
graphframes==0.6
idna==2.7
ipykernel==4.9.0
ipython==6.5.0
ipython==7.1.1
ipython-genutils==0.2.0
ipywidgets==7.4.2
isort==4.3.4
isoweek==1.3.3
jedi==0.12.1
Jinja2==2.10
jedi==0.13.1
jmespath==0.9.3
jsonschema==2.6.0
jupyter==1.0.0
jupyter-client==5.2.3
jupyter-console==5.2.0
jupyter-core==4.4.0
keyring==10.6.0
keyrings.alt==3.0
kiwisolver==1.0.1
language-selector==0.1
launchpadlib==1.10.6
lazr.restfulclient==0.13.5
lazr.uri==1.0.3
lazy-object-proxy==1.3.1
louis==3.5.0
macaroonbakery==1.1.3
Mako==1.0.7
MarkupSafe==1.0
matplotlib==3.0.2
mccabe==0.6.1
mistune==0.8.3
msgpack==0.6.0
msgpack-numpy==0.4.3.2
murmurhash==1.0.1
nbconvert==5.4.0
nbdime==1.0.2
nbformat==4.4.0
networkx==2.2
nltk==3.3
notebook==5.6.0
numexpr==2.6.8
nose==1.3.7
numpy==1.15.4
oauth==1.0.1
olefile==0.45.1
opencv-python==3.4.3.18
pandas==0.23.4
pandas-summary==0.0.5
pandocfilters==1.4.2
parso==0.3.1
pexpect==4.6.0
pickleshare==0.7.4
Pillow==5.3.0
plac==0.9.6
preshed==2.0.1
prometheus-client==0.3.1
prompt-toolkit==1.0.15
protobuf==3.0.0
psutil==5.4.7
pickleshare==0.7.5
prompt-toolkit==2.0.7
ptyprocess==0.6.0
py4j==0.10.7
pycairo==1.16.2
pycodestyle==2.4.0
pycorenlp==0.3.0
pycrypto==2.6.1
pycups==1.9.73
Pygments==2.2.0
pygobject==3.26.1
pylint==2.1.1
pymacaroons==0.13.0
PyNaCl==1.1.2
pyparsing==2.3.0
pyRFC3339==1.0
pyspark==2.3.1
python-apt==1.6.3
pyspark==2.4.0
python-dateutil==2.7.5
python-debian==0.1.32
pytz==2018.7
PyWavelets==1.0.1
pyxdg==0.25
PyYAML==3.13
pyzmq==17.1.2
qtconsole==4.4.1
regex==2018.11.22
reportlab==3.4.0
requests==2.20.1
requests-file==1.4.3
requests-unixsocket==0.1.5
s3transfer==0.1.13
scikit-image==0.14.1
scikit-learn==0.20.1
scipy==1.1.0
seaborn==0.9.0
SecretStorage==2.3.1
Send2Trash==1.5.0
simplegeneric==0.8.1
simplejson==3.13.2
six==1.11.0
sklearn==0.0
sklearn-pandas==1.7.0
smmap2==2.0.4
spacy==2.0.16
system-service==0.3
systemd-python==234
terminado==0.8.1
testpath==0.3.1
thinc==6.12.0
tldextract==2.2.0
toolz==0.9.0
torch-nightly==1.0.0.dev20181203
torchtext==0.3.1
torchvision-nightly==0.2.1
tornado==5.1.1
tqdm==4.28.1
traitlets==4.3.2
typed-ast==1.1.0
typing==3.6.6
ubuntu-drivers-common==0.0.0
ufw==0.35
ujson==1.35
unattended-upgrades==0.1
urllib3==1.24.1
usb-creator==0.3.3
wadllib==1.3.2
warcio==1.6.1
warcio==1.6.3
wcwidth==0.1.7
webencodings==0.5.1
widgetsnbextension==3.4.2
wordcloud==1.5.0
wrapt==1.10.11
xkit==0.0.0
xlrd==1.1.0
zope.interface==4.3.2
wordcount==1.0
Binary file removed wordcloud.png
Binary file not shown.
