Merge pull request #193 from mmcdermott/dev
Release Candidate 0.0.7
mmcdermott authored Sep 1, 2024
2 parents 9549d7e + 3519769 commit 9588583
Showing 85 changed files with 4,446 additions and 2,208 deletions.
157 changes: 54 additions & 103 deletions MIMIC-IV_Example/README.md
@@ -6,33 +6,34 @@ up from this one).

## Step 0: Installation

Download this repository and install the requirements:
If you want to install via PyPI (note that, for now, you still need to copy some files locally even with a
PyPI installation, as covered below, so make sure you are in a suitable directory), use:

```bash
conda create -n MEDS python=3.12
conda activate MEDS
pip install "MEDS_transforms[local_parallelism]"
mkdir MIMIC-IV_Example
cd MIMIC-IV_Example
wget https://raw.githubusercontent.com/mmcdermott/MEDS_transforms/main/MIMIC-IV_Example/joint_script.sh
wget https://raw.githubusercontent.com/mmcdermott/MEDS_transforms/main/MIMIC-IV_Example/joint_script_slurm.sh
wget https://raw.githubusercontent.com/mmcdermott/MEDS_transforms/main/MIMIC-IV_Example/pre_MEDS.py
chmod +x joint_script.sh
chmod +x joint_script_slurm.sh
chmod +x pre_MEDS.py
cd ..
pip install "MEDS_transforms[local_parallelism,slurm_parallelism]"
```

If you want to install locally, use:

```bash
git clone git@github.com:mmcdermott/MEDS_transforms.git
cd MEDS_transforms
conda create -n MEDS python=3.12
conda activate MEDS
pip install .[local_parallelism]
```

If you want to profile the time and memory costs of your ETL, also install: `pip install hydra-profiler`.

## Step 0.5: Set-up

Set some environment variables and download the necessary files:

```bash
export MIMICIV_RAW_DIR=??? # set to the directory in which you want to store the raw MIMIC-IV data
export MIMICIV_PRE_MEDS_DIR=??? # set to the directory in which you want to store the intermediate pre-MEDS data
export MIMICIV_MEDS_COHORT_DIR=??? # set to the directory in which you want to store the final MEDS cohort

export VERSION=0.0.6 # or whatever version you want
export URL="https://raw.githubusercontent.com/mmcdermott/MEDS_transforms/$VERSION/MIMIC-IV_Example"

wget $URL/run.sh
wget $URL/pre_MEDS.py
wget $URL/local_parallelism_runner.yaml
wget $URL/slurm_runner.yaml
mkdir configs
cd configs
wget $URL/configs/extract_MIMIC.yaml
cd ..
chmod +x run.sh
chmod +x pre_MEDS.py
```
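
If you would rather track the development branch than a pinned release, note that `raw.githubusercontent.com`
serves branch names as well as tags (the `main` URLs in Step 0 rely on this), so you can set, for example:

```bash
export VERSION=main # track the development branch instead of a pinned release tag
export URL="https://raw.githubusercontent.com/mmcdermott/MEDS_transforms/$VERSION/MIMIC-IV_Example"
```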

## Step 1: Download MIMIC-IV
@@ -46,101 +47,51 @@ the root directory of where the resulting _core data files_ are stored -- e.g.,

```bash
cd $MIMICIV_RAW_DIR
wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/d_labitems_to_loinc.csv
wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/inputevents_to_rxnorm.csv
wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/lab_itemid_to_loinc.csv
wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/meas_chartevents_main.csv
wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/meas_chartevents_value.csv
wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/numerics-summary.csv
wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/outputevents_to_loinc.csv
wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/proc_datetimeevents.csv
wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/proc_itemid.csv
wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/waveforms-summary.csv
export MIMIC_URL=https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map
wget $MIMIC_URL/d_labitems_to_loinc.csv
wget $MIMIC_URL/inputevents_to_rxnorm.csv
wget $MIMIC_URL/lab_itemid_to_loinc.csv
wget $MIMIC_URL/meas_chartevents_main.csv
wget $MIMIC_URL/meas_chartevents_value.csv
wget $MIMIC_URL/numerics-summary.csv
wget $MIMIC_URL/outputevents_to_loinc.csv
wget $MIMIC_URL/proc_datetimeevents.csv
wget $MIMIC_URL/proc_itemid.csv
wget $MIMIC_URL/waveforms-summary.csv
```

## Step 2: Run the basic MEDS ETL

This step contains several sub-steps; luckily, all of these sub-steps can be run via a single script: either
`joint_script.sh`, which uses the Hydra `joblib` launcher to run things with local parallelism (make sure you
enable this feature by including the `[local_parallelism]` option during installation), or
`joint_script_slurm.sh`, which uses the Hydra `submitit` launcher to run things through slurm (make sure you
enable this feature by including the `[slurm_parallelism]` option during installation). This script entails
several steps:

### Step 2.1: Get the data ready for base MEDS extraction

This is a step in a few parts:

1. Join a few tables by `hadm_id` to get the right times in the right rows for processing. In particular, we
   need to join:
   - the `hosp/diagnoses_icd` table with the `hosp/admissions` table to get the `dischtime` for each
     `hadm_id`.
   - the `hosp/drgcodes` table with the `hosp/admissions` table to get the `dischtime` for each `hadm_id`.
2. Convert the patient's static data to a more parseable form. This entails:
   - getting the patient's DOB in a format that is usable for MEDS, rather than the integral `anchor_year`
     and `anchor_offset` fields;
   - merging the patient's `dod` with the `deathtime` from the `admissions` table.

After these steps, modified files or symlinks to the original files will be written in a new directory, which
will be used as the input to the actual MEDS extraction ETL. We'll use `$MIMICIV_PREMEDS_DIR` to denote this
directory.

This step is run by the `joint_script.sh` script or the `joint_script_slurm.sh` script, but in either case
the base command that is run is as follows (assumed to be run **not** from this directory but from the root
directory of this repository):

```bash
./MIMIC-IV_Example/pre_MEDS.py raw_cohort_dir=$MIMICIV_RAW_DIR output_dir=$MIMICIV_PREMEDS_DIR
```

In practice, on a machine with 150 GB of RAM and 10 cores, this step takes less than 5 minutes in total.

### Step 2.2: Run the MEDS extraction ETL

We will assume you want to output the final MEDS dataset into a directory we'll denote as
`$MIMICIV_MEDS_DIR`. Note this is a different directory than the pre-MEDS directory (though, of course, they
can both be subdirectories of the same root directory).

This is a step in 4 parts (plus an optional fifth):

1. Sub-shard the raw files. Run this command as many times simultaneously as you would like to have workers
   performing this sub-sharding step. See the sketch after this list for how to automate this parallelism
   using hydra launchers.

   This step uses the `./scripts/extraction/shard_events.py` script. See `joint_script*.sh` for the expected
   format of the command.

2. Extract and form the patient splits and sub-shards. The `./scripts/extraction/split_and_shard_patients.py`
   script is used for this step. See `joint_script*.sh` for the expected format of the command.

3. Extract patient sub-shards and convert to MEDS events. The
   `./scripts/extraction/convert_to_sharded_events.py` script is used for this step. See `joint_script*.sh`
   for the expected format of the command.

4. Merge the MEDS events into a single file per patient sub-shard. The
   `./scripts/extraction/merge_to_MEDS_cohort.py` script is used for this step. See `joint_script*.sh` for
   the expected format of the command.

5. (Optional) Generate preliminary code statistics and merge to external metadata. This is not currently
   performed in the `joint_script*.sh` scripts.
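
For concreteness, the following is an illustrative sketch of what these invocations can look like when
automated with the Hydra `joblib` launcher. The exact argument names and launcher overrides here are
assumptions for illustration; `joint_script.sh` remains the authoritative reference for the real commands.

```bash
# Illustrative sketch only -- see joint_script.sh for the authoritative commands.
export N_WORKERS=4 # how many parallel workers to use for the parallelizable stages

# 1. Sub-shard the raw files, with N_WORKERS workers doing so simultaneously.
./scripts/extraction/shard_events.py \
    --multirun worker="range(0,$N_WORKERS)" hydra/launcher=joblib \
    input_dir=$MIMICIV_PREMEDS_DIR cohort_dir=$MIMICIV_MEDS_DIR \
    event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml

# 2. Extract and form the patient splits and sub-shards (a single-worker step).
./scripts/extraction/split_and_shard_patients.py \
    input_dir=$MIMICIV_PREMEDS_DIR cohort_dir=$MIMICIV_MEDS_DIR \
    event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml

# 3. Convert each patient sub-shard to MEDS events, in parallel.
./scripts/extraction/convert_to_sharded_events.py \
    --multirun worker="range(0,$N_WORKERS)" hydra/launcher=joblib \
    input_dir=$MIMICIV_PREMEDS_DIR cohort_dir=$MIMICIV_MEDS_DIR \
    event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml

# 4. Merge the MEDS events into one file per patient sub-shard, in parallel.
./scripts/extraction/merge_to_MEDS_cohort.py \
    --multirun worker="range(0,$N_WORKERS)" hydra/launcher=joblib \
    input_dir=$MIMICIV_PREMEDS_DIR cohort_dir=$MIMICIV_MEDS_DIR \
    event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml
```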

## Limitations / TO-DOs:

Currently, some tables are ignored, including:

1. `hosp/emar_detail`
2. `hosp/microbiologyevents`
3. `hosp/services`
4. `icu/datetimeevents`
5. `icu/ingredientevents`

Lots of questions remain about how to appropriately handle the times in the data -- e.g., things like HCPCS
events are stored at the level of the _date_, not the _datetime_. How should those be slotted into a
timeline that is otherwise stored at _datetime_ resolution?

Other questions:

1. How to handle merging the deathtimes between the hosp table and the patients table?
2. How to handle the dob nonsense MIMIC has?

## Step 2: Run the MEDS ETL

To run the MEDS ETL, run the following command:

```bash
./run.sh $MIMICIV_RAW_DIR $MIMICIV_PRE_MEDS_DIR $MIMICIV_MEDS_COHORT_DIR do_unzip=true
```

To not unzip the `.csv.gz` files, set `do_unzip=false` instead of `do_unzip=true`.
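
For example, if the raw `.csv.gz` files have already been decompressed, the same command with the unzip stage
skipped is just:

```bash
./run.sh $MIMICIV_RAW_DIR $MIMICIV_PRE_MEDS_DIR $MIMICIV_MEDS_COHORT_DIR do_unzip=false
```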

To use a specific stage runner file (e.g., to set different parallelism options), you can specify it as an
additional argument:

```bash
export N_WORKERS=5
./run.sh $MIMICIV_RAW_DIR $MIMICIV_PRE_MEDS_DIR $MIMICIV_MEDS_COHORT_DIR do_unzip=true \
    stage_runner_fp=slurm_runner.yaml
```

The `N_WORKERS` environment variable set before the command controls the maximum number of parallel workers
to use.

The `slurm_runner.yaml` file (downloaded above) runs each stage across several workers on separate slurm
worker nodes using the `submitit` launcher. _**You will need to customize this file to your own slurm system
so that the partition names are correct before use.**_ The memory and time costs are viable in the current
configuration, but if your nodes are sufficiently different you may need to adjust those as well.

The `local_parallelism_runner.yaml` file (downloaded above) runs each stage via separate processes on the
launching machine. There are no additional arguments needed beyond the `N_WORKERS` environment variable, and
there is nothing to customize in this file.
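
For example, to run with local process-based parallelism instead of slurm:

```bash
export N_WORKERS=5
./run.sh $MIMICIV_RAW_DIR $MIMICIV_PRE_MEDS_DIR $MIMICIV_MEDS_COHORT_DIR do_unzip=true \
    stage_runner_fp=local_parallelism_runner.yaml
```
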
To profile the time and memory costs of your ETL, add the `do_profile=true` flag at the end.
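
For example (this assumes `hydra-profiler` has been installed, as described in Step 0):

```bash
./run.sh $MIMICIV_RAW_DIR $MIMICIV_PRE_MEDS_DIR $MIMICIV_MEDS_COHORT_DIR do_unzip=true do_profile=true
```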

## Notes

@@ -153,4 +104,4 @@ may need to run `unset SLURM_CPU_BIND` in your terminal first to avoid errors.

If you wanted, some other processing could also be done here, such as:

1. Converting the patient's dynamically recorded race into a static, most commonly recorded race field.
1. Converting the subject's dynamically recorded race into a static, most commonly recorded race field.
18 changes: 10 additions & 8 deletions MIMIC-IV_Example/configs/event_configs.yaml
@@ -1,4 +1,4 @@
patient_id_col: subject_id
subject_id_col: subject_id
hosp/admissions:
ed_registration:
code: ED_REGISTRATION
@@ -27,7 +27,7 @@ hosp/admissions:
time: col(dischtime)
time_format: "%Y-%m-%d %H:%M:%S"
hadm_id: hadm_id
# We omit the death event here as it is joined to the data in the patients table in the pre-MEDS step.
# We omit the death event here as it is joined to the data in the subjects table in the pre-MEDS step.

hosp/diagnoses_icd:
diagnosis:
@@ -42,7 +42,7 @@ hosp/diagnoses_icd:
_metadata:
hosp/d_icd_diagnoses:
description: "long_title"
parent_codes: "ICD{icd_version}CM/{icd_code}" # Single strings are templates of columns.
parent_codes: "ICD{icd_version}CM/{norm_icd_code}" # Single strings are templates of columns.

hosp/drgcodes:
drg:
@@ -100,6 +100,7 @@ hosp/labevents:
description: ["omop_concept_name", "label"] # List of strings are columns to be collated
itemid: "itemid (omop_source_code)"
parent_codes: "{omop_vocabulary_id}/{omop_concept_code}"
valueuom: "valueuom"

hosp/omr:
omr:
@@ -164,8 +165,8 @@ hosp/procedures_icd:
hosp/d_icd_procedures:
description: "long_title"
parent_codes: # List of objects are string labels mapping to filters to be evaluated.
- "ICD{icd_version}Proc/{icd_code}": { icd_version: 9 }
- "ICD{icd_version}PCS/{icd_code}": { icd_version: 10 }
- "ICD{icd_version}Proc/{norm_icd_code}": { icd_version: "9" }
- "ICD{icd_version}PCS/{norm_icd_code}": { icd_version: "10" }

hosp/transfers:
transfer:
@@ -218,6 +219,7 @@ icu/chartevents:
description: ["omop_concept_name", "label"] # List of strings are columns to be collated
itemid: "itemid (omop_source_code)"
parent_codes: "{omop_vocabulary_id}/{omop_concept_code}"
valueuom: "valueuom"

icu/procedureevents:
start:
@@ -295,9 +297,9 @@ icu/inputevents:
description: ["omop_concept_name", "label"] # List of strings are columns to be collated
itemid: "itemid (omop_source_code)"
parent_codes: "{omop_vocabulary_id}/{omop_concept_code}"
patient_weight:
subject_weight:
code:
- PATIENT_WEIGHT_AT_INFUSION
- SUBJECT_WEIGHT_AT_INFUSION
- KG
time: col(starttime)
time_format: "%Y-%m-%d %H:%M:%S"
@@ -306,7 +308,7 @@ icu/inputevents:
icu/outputevents:
output:
code:
- PATIENT_FLUID_OUTPUT
- SUBJECT_FLUID_OUTPUT
- col(itemid)
- col(valueuom)
time: col(charttime)
36 changes: 36 additions & 0 deletions MIMIC-IV_Example/configs/extract_MIMIC.yaml
@@ -0,0 +1,36 @@
defaults:
- _extract
- _self_

description: |-
This pipeline extracts the MIMIC-IV dataset in longitudinal, sparse form from an input dataset meeting
select criteria and converts them to the flattened, MEDS format. You can control the key arguments to this
pipeline by setting environment variables:
```bash
export EVENT_CONVERSION_CONFIG_FP=# Path to your event conversion config
export MIMICIV_PRE_MEDS_DIR=# Path to the output dir of the pre-MEDS step
export MIMICIV_MEDS_COHORT_DIR=# Path to where you want the dataset to live
```
# The event conversion configuration file is used throughout the pipeline to define the events to extract.
event_conversion_config_fp: ${oc.env:EVENT_CONVERSION_CONFIG_FP}

input_dir: ${oc.env:MIMICIV_PRE_MEDS_DIR}
cohort_dir: ${oc.env:MIMICIV_MEDS_COHORT_DIR}

etl_metadata:
dataset_name: MIMIC-IV
dataset_version: 2.2

stage_configs:
shard_events:
infer_schema_length: 999999999

stages:
- shard_events
- split_and_shard_subjects
- convert_to_sharded_events
- merge_to_MEDS_cohort
- extract_code_metadata
- finalize_MEDS_metadata
- finalize_MEDS_data
12 changes: 8 additions & 4 deletions MIMIC-IV_Example/configs/pre_MEDS.yaml
@@ -1,11 +1,15 @@
raw_cohort_dir: ???
output_dir: ???
input_dir: ${oc.env:MIMICIV_RAW_DIR}
cohort_dir: ${oc.env:MIMICIV_PRE_MEDS_DIR}

do_overwrite: false

log_dir: ${cohort_dir}/.logs

# Hydra
hydra:
job:
name: pre_MEDS_${now:%Y-%m-%d_%H-%M-%S}
run:
dir: ${output_dir}/.logs/${hydra.job.name}
dir: ${log_dir}
sweep:
dir: ${output_dir}/.logs/${hydra.job.name}
dir: ${log_dir}
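
With this configuration, the pre-MEDS script resolves its input and output locations from the environment
variables set in Step 0.5, so a run reduces to a minimal sketch like the following (paths are illustrative,
and we assume `pre_MEDS.py` downloaded above is the Hydra entry point that consumes this config):

```bash
export MIMICIV_RAW_DIR=/path/to/raw/mimic-iv # illustrative paths
export MIMICIV_PRE_MEDS_DIR=/path/to/pre_meds_output
./pre_MEDS.py # input_dir and cohort_dir resolve via the ${oc.env:...} entries above
```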
