This Hortonworks example extracts topics from the Enron emails via TF-IDF and serves them as a web service using Cassandra and Flask, with help from the Pygmalion project, CassandraStorage, and pycassa. It accompanies the blog post at <>.
Edit and run env.sh to inform CassandraStorage about your local Cassandra instance.
Install Cassandra according to the instructions in the post, and then create the schema by running cassandra.txt in the cassandra-cli.
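If you prefer to set up the schema programmatically, pycassa's SystemManager can do the same job. This is only a sketch, and the keyspace and column family names ('enron', 'topics') are assumptions, so use whatever cassandra.txt actually defines:

    # Hypothetical pycassa equivalent of running cassandra.txt in cassandra-cli.
    # The keyspace and column family names below are assumptions, not the real schema.
    from pycassa.system_manager import SystemManager, SIMPLE_STRATEGY, UTF8_TYPE

    sys_mgr = SystemManager('localhost:9160')
    sys_mgr.create_keyspace('enron', SIMPLE_STRATEGY, {'replication_factor': '1'})
    sys_mgr.create_column_family('enron', 'topics',
                                 comparator_type=UTF8_TYPE,
                                 default_validation_class=UTF8_TYPE)
    sys_mgr.close()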
Run test_pycassa.py to verify that pycassa can connect to your Cassandra instance and read and write data.
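For reference, a minimal round-trip check along the same lines as test_pycassa.py might look like this (the keyspace and column family names are assumptions; match whatever cassandra.txt creates):

    import pycassa

    # Connect to the local Cassandra instance and write/read one test column.
    pool = pycassa.ConnectionPool('enron', ['localhost:9160'])
    cf = pycassa.ColumnFamily(pool, 'topics')
    cf.insert('test_key', {'test_column': 'test_value'})
    print(cf.get('test_key'))   # expect: {'test_column': 'test_value'}
    pool.dispose()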
Grab the Enron emails (an Avro file) at https://s3.amazonaws.com/rjurney_public_web/hadoop/enron.avro
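To sanity-check the download, you can open the file with the avro Python package and inspect the first record. The exact field names depend on the Enron Avro schema, so treat this purely as a peek:

    from avro.datafile import DataFileReader
    from avro.io import DatumReader

    # Print the field names of the first record in enron.avro.
    with open('enron.avro', 'rb') as f:
        reader = DataFileReader(f, DatumReader())
        for record in reader:
            print(record.keys())   # should include a message id and a body field
            break
        reader.close()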
Run cassandra_enron.pig to extract topics from the email bodies and store them in Cassandra. Note: if you are running in local mode, you may want to adjust the LIMIT statement to process fewer emails; the entire corpus takes a LONG time to finish on one machine. This is where the utility of Hadoop comes in :)
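The Pig script does the real work; purely for intuition, here is a rough Python sketch of what per-email TF-IDF scoring amounts to. This is not the script's actual logic, and the tokenization and field handling are simplified assumptions:

    import math
    import re
    from collections import Counter

    def top_tfidf_terms(documents, n=20):
        """documents: dict of message_id -> email body.
        Returns dict of message_id -> top n (term, score) pairs."""
        tokenized = {mid: re.findall(r"[a-z']+", body.lower())
                     for mid, body in documents.items()}
        num_docs = len(tokenized)
        # Document frequency: in how many emails each term appears.
        df = Counter()
        for tokens in tokenized.values():
            df.update(set(tokens))
        results = {}
        for mid, tokens in tokenized.items():
            if not tokens:
                results[mid] = []
                continue
            tf = Counter(tokens)
            scores = {t: (count / float(len(tokens))) * math.log(num_docs / float(df[t]))
                      for t, count in tf.items()}
            results[mid] = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]
        return results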
Run index.py, then plug a message_id (which you can get via SAMPLE/LIMIT in Pig) into the URL in your favorite browser and you will see the top 20 topics, as determined by TF-IDF, published as a web service. Voilà!
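In case it helps to see the shape of it, here is a bare-bones sketch of what a Flask service like index.py does: fetch the topic columns for a message_id from Cassandra and return them as JSON. The keyspace, column family, and route are assumptions; the real names live in index.py and cassandra.txt:

    import json
    import pycassa
    from flask import Flask

    app = Flask(__name__)
    pool = pycassa.ConnectionPool('enron', ['localhost:9160'])
    topics = pycassa.ColumnFamily(pool, 'topics')

    @app.route('/topics/<message_id>')
    def get_topics(message_id):
        # Columns are topic words, values are their TF-IDF scores.
        return json.dumps(topics.get(message_id, column_count=20))

    if __name__ == '__main__':
        app.run(debug=True)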