dougneedham/Cloudera-Data-Scientist-Challenge-3: My solution for the Cloudera Data Science Challenge 3. Spark MLlib for Smartfly. Spark GraphX for Winklr. Python Streaming for web log analysis.

This is the submission package for Doug Needham

Data Science Challenge 3

The Cloudera Data Science Challenge 3 Description

The full write-up for this solution is in this directory as Doug_Needham_DSC3_Write_Up.pdf

This code assumes that it runs under the user ID "dln". The required HDFS directory structure is:

/user/dln/problem1
/user/dln/problem1/driver
/user/dln/problem1/svm
/user/dln/problem2
/user/dln/problem3
/user/dln/problem3/inGraph
/user/dln/problem3/OutGraph

The shell script "setup.sh" performs the appropriate hadoop fs -mkdir -p commands to create the directories.
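
As a rough illustration of the directory setup described above, the following is a minimal sketch and not the actual contents of setup.sh, which may differ. The function name list_mkdir_commands and the RUN switch are assumptions introduced here. By default the sketch only prints the commands; setting RUN=1 executes them.

```shell
#!/bin/sh
# Hypothetical sketch; the real setup.sh in this repository may be written differently.
TGT_DATA="${TGT_DATA:-/user/dln}"

# Print one "hadoop fs -mkdir -p" command per required HDFS directory.
list_mkdir_commands() {
    for dir in problem1 problem1/driver problem1/svm \
               problem2 \
               problem3 problem3/inGraph problem3/OutGraph; do
        echo "hadoop fs -mkdir -p ${TGT_DATA}/${dir}"
    done
}

# Dry run by default: only show the commands. Set RUN=1 to execute them.
if [ "${RUN:-0}" = "1" ]; then
    list_mkdir_commands | sh
else
    list_mkdir_commands
fi
```

Because -mkdir -p also creates missing parent directories, listing the parents explicitly is redundant but harmless.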

All of the code that follows assumes the source data for the challenge is in the following locations:

/user/dsc/famous/spam.log
/user/dsc/famous/web.log
/user/dsc/winklr/Winklr-network.csv
/user/dsc/winklr/Winklr-topClickPairs.csv
/user/dsc/smartfly/smartfly_historic.csv
/user/dsc/smartfly/smartfly_scheduled.csv

These two assumptions determine the environment variables set in the individual shell scripts:

SRC_DATA=/user/dsc
TGT_DATA=/user/dln
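
Before running a problem script, it can help to confirm that the source files are where these variables say they are. The following is a minimal sketch under that assumption; source_files and check_sources are hypothetical helpers, not part of the repository. The check uses hadoop fs -test -e, which returns a non-zero status when a path does not exist.

```shell
#!/bin/sh
# Defaults mirror the values stated above; override them in the environment if needed.
SRC_DATA="${SRC_DATA:-/user/dsc}"
TGT_DATA="${TGT_DATA:-/user/dln}"

# Print the six expected source files, one per line.
source_files() {
    for f in famous/spam.log famous/web.log \
             winklr/Winklr-network.csv winklr/Winklr-topClickPairs.csv \
             smartfly/smartfly_historic.csv smartfly/smartfly_scheduled.csv; do
        echo "${SRC_DATA}/${f}"
    done
}

# Hypothetical helper: report any source file missing from HDFS.
# Returns 0 when every file exists, 1 otherwise.
check_sources() {
    missing=0
    for path in $(source_files); do
        if ! hadoop fs -test -e "$path"; then
            echo "missing: $path" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Calling check_sources on a machine with a configured Hadoop client would list any missing inputs on standard error.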

The three requested deliverables are in the directory named "answer". These are the "master" answers; no automation copies the files from the individual code directories to the answer directory:

answer/
answer/problem1.csv
answer/problem2.json
answer/problem3.csv

The code directories are structured as follows (the output directories created by sbt are omitted for brevity):

answer
problem1
- analysis
- data
- log
- PredictFlights
problem2
- data
- json
- log
problem3
- AnalyzeGraph
- data
- final
- inGraph
- OutGraph
- log

The shell script to run each problem is in the individual problem directory.

problem1/problem1.sh
problem2/problem2.sh
problem3/problem3.sh

Each of these can be run as a background process (for example, problem1.sh &), since the shell script writes its logging to the log directory.

problem1.sh and problem3.sh each accept a single optional command line argument. Both scripts are data driven: each reads a file that drives the process. For problem1 the file is a list of airports; for problem3 it is a list of originating vertices. The command line argument "throttles" the process so that it runs only the given number of airports (problem 1) or from-vertices (problem 3).
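
The throttling described above could be implemented in several ways. The following is a minimal sketch of one of them, assuming the driver file holds one entry per line; the function name throttle and the file name airports.txt are hypothetical and may not match the actual scripts.

```shell
#!/bin/sh
# Hypothetical helper: emit the first N lines of a driver file,
# or every line when no limit is given.
throttle() {
    driver_file="$1"
    limit="${2:-}"
    if [ -n "$limit" ]; then
        head -n "$limit" "$driver_file"
    else
        cat "$driver_file"
    fi
}

# Illustrative use inside a problem script ("airports.txt" is a made-up name):
# throttle airports.txt "$1" | while read -r airport; do
#     echo "processing $airport"
# done
```

With a helper like this, a script invoked with the argument 5 would process only the first five entries of its driver file, and a script invoked with no argument would process all of them.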

Thank you,

Doug Needham

