Err uh... MapReduce Design Patterns.
As I work through the MapReduce Design Patterns book, I need a place to stash my source code. This is it.
I stayed moderately true to the book's examples, with some rearrangement here and there. Most notably, MRDPUtils#transformXmlToMap calls StringEscapeUtils#unescapeHtml itself, so the mappers that need unescaped values don't each have to call it separately.
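For the curious, here is a rough sketch of that idea. It is not the repository's actual code. The class name XmlRowSketch and the sample row are made up for illustration, splitting on double quotes is an assumed way to parse a row, and unescapeBasicHtml is a hypothetical stdlib-only stand-in for StringEscapeUtils#unescapeHtml that handles just five entities.

```java
import java.util.HashMap;
import java.util.Map;

class XmlRowSketch {

    // Hypothetical stand-in for StringEscapeUtils#unescapeHtml.
    // Handles only five common entities; the real method covers far more.
    static String unescapeBasicHtml(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&#39;", "'")
                .replace("&amp;", "&"); // last, so "&amp;lt;" becomes "&lt;", not "<"
    }

    // Parses one line like <row Id="1" Body="..." /> into attribute -> value,
    // unescaping every value on the way in.
    static Map<String, String> transformXmlToMap(String xml) {
        Map<String, String> map = new HashMap<>();
        String row = xml.trim();
        if (!row.startsWith("<row ") || !row.endsWith("/>")) {
            return map; // not a data row (e.g. the XML header); return an empty map
        }
        // Drop the leading "<row " and trailing "/>", then split on double quotes:
        // tokens alternate between an attribute name ending in '=' and its value.
        String[] tokens = row.substring(5, row.length() - 2).split("\"");
        for (int i = 0; i + 1 < tokens.length; i += 2) {
            String key = tokens[i].trim();
            if (!key.endsWith("=")) {
                continue; // malformed attribute; skip it
            }
            map.put(key.substring(0, key.length() - 1), unescapeBasicHtml(tokens[i + 1]));
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, String> row = transformXmlToMap(
                "<row Id=\"1\" Body=\"&lt;p&gt;hi&lt;/p&gt;\" />");
        System.out.println(row.get("Id"));   // prints 1
        System.out.println(row.get("Body")); // prints <p>hi</p>
    }
}
```

Unescaping inside the parser means every mapper gets clean values without remembering to ask for them. The catch in this sketch is that a literal double quote inside a value is assumed to arrive as &quot;, otherwise the split goes sideways.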
Build with Maven:
$ mvn package
I've placed a bunch of scripts in the ./bin/ directory. They make a few terrible assumptions about your environment, listed below. You can change ./bin/env.sh to be more accommodating.
- There is a $HADOOP_HOME, even though it's deprecated
- $DATADIR is set to /Users/$USER/Downloads/stack-overflow-dump-2009-09
- You have the CC data dump from StackOverflow (I used 2009 because it's smallish; you should be able to use any year)
- The launch scripts assume single node mode
Make sure Hadoop is running ($HADOOP_HOME/bin/start-all.sh) and execute the script of your choice.