PageRank implementation on the Hadoop platform and with Pig Latin.
- a working Hadoop installation
- (optional) define a variable $HADOOP_ROOT containing the path to the Hadoop installation:
export HADOOP_ROOT="/path/to/hadoop-2.7.1"
- make sure $HADOOP_CLASSPATH is set
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
### Compile
$HADOOP_ROOT/bin/hadoop com.sun.tools.javac.Main App.java FDMapper.java FDReducer.java FDCrawler.java
jar cf ForagingDumbo.jar App.class FDMapper.class FDReducer.class FDCrawler.class
### Run the crawler
$HADOOP_ROOT/bin/hadoop jar ForagingDumbo.jar App crawl <url> <max_depth> <output file>
### Run the page rank operation
$HADOOP_ROOT/bin/hadoop jar ForagingDumbo.jar App rank <input path> <output path>
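For readers unfamiliar with what one rank pass computes, here is a minimal in-memory sketch of a single PageRank iteration, split into a "map" step (each page shares its rank among its outgoing links) and a "reduce" step (each page sums what it receives). This is not the code in FDMapper.java or FDReducer.java, which is not shown here. The class name `PageRankSketch`, the damping factor of 0.85, and the choice to spread the rank of pages without outgoing links evenly over all pages are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of ForagingDumbo.jar.
public class PageRankSketch {

    /**
     * Runs one PageRank iteration and returns the new ranks.
     * Assumes every link target is also a key of `ranks`.
     */
    static Map<String, Double> iterate(Map<String, List<String>> links,
                                       Map<String, Double> ranks,
                                       double damping) {
        int n = ranks.size();
        Map<String, Double> contributions = new HashMap<>();
        for (String page : ranks.keySet()) {
            contributions.put(page, 0.0);
        }

        // "Map" step: each page emits rank / outDegree to every page it links to.
        double danglingMass = 0.0;
        for (Map.Entry<String, Double> e : ranks.entrySet()) {
            List<String> out = links.getOrDefault(e.getKey(), new ArrayList<>());
            if (out.isEmpty()) {
                // Assumption: a page without links spreads its rank over all pages.
                danglingMass += e.getValue();
                continue;
            }
            double share = e.getValue() / out.size();
            for (String target : out) {
                contributions.merge(target, share, Double::sum);
            }
        }

        // "Reduce" step: sum the received shares and apply the damping factor.
        Map<String, Double> next = new HashMap<>();
        for (Map.Entry<String, Double> e : contributions.entrySet()) {
            double incoming = e.getValue() + danglingMass / n;
            next.put(e.getKey(), (1.0 - damping) / n + damping * incoming);
        }
        return next;
    }

    public static void main(String[] args) {
        // Tiny example graph: a -> b, a -> c, b -> c, c -> a.
        Map<String, List<String>> links = new HashMap<>();
        links.put("a", Arrays.asList("b", "c"));
        links.put("b", Arrays.asList("c"));
        links.put("c", Arrays.asList("a"));

        // Start from a uniform distribution, then iterate a fixed number of times.
        Map<String, Double> ranks = new HashMap<>();
        for (String page : links.keySet()) {
            ranks.put(page, 1.0 / links.size());
        }
        for (int i = 0; i < 20; i++) {
            ranks = iterate(links, ranks, 0.85);
        }
        System.out.println(ranks);
    }
}
```

Because the ranks start as a probability distribution and dangling rank is redistributed, the values keep summing to 1 after every pass, which is a convenient sanity check when comparing the output of successive iterations.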
- a working Pig installation
- (optional) define a variable $PIG_ROOT containing the path to the Pig installation:
export PIG_ROOT="/path/to/pig-0.15.0"
$PIG_ROOT/bin/pig -x local -p INPUT=<input path> -p OUTPUT=<output path> pagerank.pl
cd input
python2 -m SimpleHTTPServer 8080
Then, when running the application, you can pass "http://localhost:8080" as the url.
The hadoop_res and pig_res directories contain the result of each PageRank iteration.
Both start from crawl.stallman.txt, which was crawled using Richard Stallman's personal page as the starting url.