(c) 2016 Chris Hodapp, [email protected]
This is the source code for a project I did at the end of 2016, applying some machine learning techniques (mostly unsupervised learning) to the MIMIC-III Critical Care Database. The project was for a course I took as part of my CS master's: CSE 8803 - Big Data Analytics for Healthcare.
The paper describing this work in more detail is: https://arxiv.org/abs/1612.08425
- The MIMIC-III dataset
- SBT (the Scala Build Tool) >= 0.13; other versions may work, but I have not tried them.
- Apache Spark
- Python 2.7 or 3.x, and the following packages (`pip` versions should be fine):
  - Keras, ideally with a GPU-enabled backend (Theano or TensorFlow)
  - h5py (if you want to save and load trained networks from Keras)
  - scikit-learn
  - pydot-ng (optional)
`sbt compile` should handle pulling dependencies and building everything. `sbt package` should produce a JAR that `spark-submit` can handle.
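For reference, the two build steps in sequence, run from the repository root:

```sh
# Fetch dependencies and compile the Scala sources
sbt compile

# Package the compiled classes into a JAR under target/scala-2.11/
sbt package
```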
`pip install keras h5py scikit-learn pydot-ng` should handle the Python prerequisites, but note that you may need to configure Keras or its backend further in order to have GPU acceleration.
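The exact GPU setup depends on your Keras and backend versions, so treat the following only as a sketch of one possible configuration (not verified for this project): the backend can be selected via `~/.keras/keras.json` or the `KERAS_BACKEND` environment variable, and Theano can be pointed at the GPU via `THEANO_FLAGS`.

```sh
# Select the Keras backend for this shell session
# (alternatively, edit the "backend" field in ~/.keras/keras.json).
export KERAS_BACKEND=theano

# Ask Theano to run on the GPU with 32-bit floats
# (flag names may differ for newer Theano releases).
export THEANO_FLAGS="device=gpu,floatX=float32"
```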
To produce what was in the paper, run the commands below from the same directory as the code. For the first command, you will need to supply two paths: the path containing the `.csv.gz` files from MIMIC-III (for the `-i` option), and the full path to the `data` directory in this archive (for the `-o` option).
spark-submit --master "local[*]" \
--repositories https://oss.sonatype.org/content/groups/public/ \
--packages "com.github.scopt:scopt_2.11:3.5.0" \
target/scala-2.11/mimic3_phenotyping_2.11-1.0.jar \
-i "file:////mnt/dev/mimic3/" \
-o "file:///home/hodapp/source/bd4h-project-code/data/" \
-m -c -r -b --icd9a 428 --icd9b 571 -l "1742-6"
python timeseries_plots.py -d ./data -o ./data \
--icd9a 428 --icd9b 571 --loinc 1742-6
python feature_learning.py -d ./data -o ./data \
--icd9a 428 --icd9b 571 --loinc 1742-6 \
--activity_l1 0.0001 --weight_l2 0.001 \
--load_model 428_571_1742-6.h5 --tsne --logistic_regression
The `spark-submit` command still sometimes exhibits an issue in which it completes the job but fails to return to the prompt. Check Spark's web UI (i.e. http://localhost:4040) to verify that all jobs have actually finished.
For expediency, this will skip hyperparameter optimization (which can take 20-30 minutes depending on the machine) and use hyperparameters already estimated, and it will use weights from a pre-trained neural network instead of training one. To run through the full process, add `-h` to the first command, and remove the `--load_model` option from the `feature_learning.py` invocation.
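Concretely, the two modified invocations would look roughly like this (same paths and options as above, with `-h` added and `--load_model` removed; the `timeseries_plots.py` command is unchanged):

```sh
spark-submit --master "local[*]" \
    --repositories https://oss.sonatype.org/content/groups/public/ \
    --packages "com.github.scopt:scopt_2.11:3.5.0" \
    target/scala-2.11/mimic3_phenotyping_2.11-1.0.jar \
    -i "file:////mnt/dev/mimic3/" \
    -o "file:///home/hodapp/source/bd4h-project-code/data/" \
    -m -c -r -b -h --icd9a 428 --icd9b 571 -l "1742-6"

python feature_learning.py -d ./data -o ./data \
    --icd9a 428 --icd9b 571 --loinc 1742-6 \
    --activity_l1 0.0001 --weight_l2 0.001 \
    --tsne --logistic_regression
```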
All output will be in the `data` directory. This will include CSV and Parquet files from the Spark code, and PNG and EPS files from the Python code.