Development Notes

Changing the server ports

In conf/macrobase.yaml:

        - type: http
          port: 6666
        - type: http
          port: 6667


To view test coverage: mvn cobertura:cobertura; open target/site/cobertura/index.html

Diagnostic tools live under the test tree, in macrobase.diagnostic. To add one, add a new instance of a class that extends ConfiguredCommand<MacroBaseConf> to the bootstrap in DiagnosticApplication, then run bin/ <your command name>.

Git Workflow

To merge a PR, don't just click Merge pull request. Instead, rebase as shown below and push directly to master:

# Say you're currently on branch wip; get latest changes from origin
git fetch origin
# Rebase changes in wip on top of the existing master
# ... or git pull --rebase
git rebase -i origin/master
# Get back to master
git checkout master
# Merge wip with master (using the "fast-forward only" option)
git merge --ff-only wip
# Now you can push to origin/master
git push origin master
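The --ff-only merge above succeeds only when master is an ancestor of wip, which is what the rebase is meant to guarantee. As a minimal sketch, assuming a reasonably recent git, the helper below (can_fast_forward is a hypothetical name, not part of the repo) checks that condition before merging:

```shell
# Sketch only: can_fast_forward is a hypothetical helper, not part of the repo.
# Succeeds (exit 0) when merging branch $2 into branch $1 would be a
# fast-forward, i.e. when $1 is an ancestor of $2.
can_fast_forward() {
    git merge-base --is-ancestor "$1" "$2"
}
```

For example, `can_fast_forward master wip && git merge --ff-only wip` would merge only when no merge commit is needed.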


Oracle recently released their Java Mission Control software, which is available on Mac OS X. To find the appropriate command, run find /Library/Java -name jmc. To run a Java program with their 'Flight Recorder' enabled, append -XX:+UnlockCommercialFeatures -XX:+FlightRecorder to the JVM arguments. bin/profile/ provides an example.
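As a rough sketch of appending those flags, the helper below (with_flight_recorder is a hypothetical name, not something in the repo) prints a set of JVM arguments with the two Flight Recorder flags from the paragraph above added at the end:

```shell
# Sketch only: with_flight_recorder is a hypothetical helper, not part of the repo.
# Prints the given JVM arguments with the two Flight Recorder flags appended.
with_flight_recorder() {
    jfr_flags="-XX:+UnlockCommercialFeatures -XX:+FlightRecorder"
    if [ -z "$*" ]; then
        # No existing arguments: emit only the Flight Recorder flags.
        printf '%s\n' "$jfr_flags"
    else
        printf '%s\n' "$* $jfr_flags"
    fi
}
```

One way to use it, assuming your launcher reads JAVA_OPTS, is `JAVA_OPTS="$(with_flight_recorder "$JAVA_OPTS")"`.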

YourKit supports open source projects with its full-featured Java Profiler. YourKit, LLC is the creator of YourKit Java Profiler and YourKit .NET Profiler, innovative and intelligent tools for profiling Java and .NET applications.

Adding Images to Wiki

Clone the repo (git clone), commit images to the img subdirectory, and link to them using a relative link in Markdown (img/myfolder/foo.png).


The MIT ISTC machines don't have Maven, Java 1.8, or the latest version of Postgres. As a possibly temporary workaround, I have installed binaries locally on the istc3 host in /data/pbailis/bin and have set up a Postgres 9.5 instance in /data/pbailis/pgdata running on port 5050. For everything to work correctly, please copy the following to your .bashrc, or just copy mine from ~/pbailis/.bashrc (and make sure to add source ~/.bashrc to .bash_profile):

export PATH="/data/pbailis/bin:$PATH"
export PGDATA=/data/pbailis/pgdata
export PGPORT=5050
export PGHOST=localhost
export JAVA_OPTS="-Dmacrobase.loader.db.url=localhost:$PGPORT -Xms128m -Xmx16G"
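After sourcing the exports above, a quick sanity check can catch a missing variable or tool before a build fails in a less obvious way. The following is a minimal sketch in POSIX shell; check_vars and check_tools are hypothetical helper names, and the tool list in the usage note (mvn, java, psql) is an assumption based on the software mentioned above:

```shell
# Sketch only: check_vars and check_tools are hypothetical helpers, not part of the repo.

# Print each named variable that is unset or empty; return 1 if any was missing.
check_vars() {
    rc=0
    for name in "$@"; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "unset: $name"
            rc=1
        fi
    done
    return "$rc"
}

# Print each named command that cannot be found on PATH; return 1 if any was missing.
check_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not on PATH: $tool"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, `check_vars PGDATA PGPORT PGHOST JAVA_OPTS && check_tools mvn java psql` prints nothing and succeeds when everything is in place.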

An rwx copy of the stanford-futuredata/macrobase repo is in /data/pbailis/macrobase. You should have your own directory in /data/ into which you can also clone the repo (git clone git@github.com:stanford-futuredata/macrobase.git /data/`whoami` ) and work privately.

A description of the Postgres tables is in /data/pbailis/dataset-descriptions.txt.

Parquet Conversion via Spark SQL

PostgreSQL is slow for sparse column accesses, so we use Spark SQL to convert each table to Parquet, a columnar storage format.

In the UNIX shell, run:

SPARK_CLASSPATH=postgresql-9.4.1207.jre6.jar bin/spark-shell --driver-memory 50G --executor-memory 50G --executor-cores 64

In the Spark shell, run (replacing List() with your list of tables):

for (table <- List("campaign_expenditures", "fed_disbursements", "hubway_trips", "milan_telecom", "sensor_data_demo", "joined_cmt_data", "uk_road_accidents")) {
    println(s"Loading $table")
    val jdbcDF = sqlContext.load("jdbc", Map("url" -> "jdbc:postgresql://localhost:5050/postgres", "dbtable" -> table))
    // Write every column of the table out as Parquet, named after the table
    jdbcDF.select("*").write.format("parquet").save(s"$table.parquet")
}

Running benchmarks and producing plots

  1. Running benchmarks is easy: just run python within the bench sub-directory. You can pick the workflows to run by specifying your own JSON workflow configuration file; take a look at bench/conf/workflow_config.json to get a sense of what this file looks like. For each workflow, specify whether it is a batch or streaming job, its name, the target attributes, the high and low metrics, and the base query. In addition, the script sweeps over the parameters listed in your own JSON sweeping parameters configuration file (see bench/conf/sweeping_parameters_config.json for an example): provide the name of each parameter along with the range of values that you want to explore. The file can be left empty if you do not want to sweep over any parameters. The names of the JSON configuration files are passed in through command line arguments. Use python -h to list all available arguments.

  2. To produce plots, pipe the output of the script to a file (e.g., output.out), then run python --output-file <output_file> --plot-directory <plot_directory>. Be sure to choose the workloads you're interested in through the plotting configuration JSON file (bench/conf/plotting_config.json). One graph is produced for every swept parameter, and each graph contains one line per chosen workload.

Note: All scripts should be run from the bench/ sub-directory within macrobase.