forked from mmcdermott/MEDS_transforms
Merge pull request mmcdermott#193 from mmcdermott/dev
Release Candidate 0.0.7
Showing 85 changed files with 4,446 additions and 2,208 deletions.

`MIMIC-IV_Example/README.md`:

[...] up from this one).
## Step 0: Installation

If you want to install via PyPI (note that, for now, you still need to copy some files locally even with a
PyPI installation, as covered below, so make sure you are in a suitable directory), use:

```bash
conda create -n MEDS python=3.12
conda activate MEDS
pip install "MEDS_transforms[local_parallelism,slurm_parallelism]"
```
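As an optional sanity check that the install succeeded, you can confirm the package and its launcher
dependencies are present (a sketch; the `joblib` and `submitit` imports assume the `local_parallelism` and
`slurm_parallelism` extras install Hydra's joblib and submitit launchers, as the parallelism options described
below suggest):

```bash
# Optional: verify the package and (assumed) launcher dependencies are installed.
pip show MEDS_transforms            # metadata for the installed package
python -c "import joblib"           # used by the local-parallelism (joblib) launcher
python -c "import submitit"         # used by the slurm-parallelism (submitit) launcher
```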
If you want to profile the time and memory costs of your ETL, also install: `pip install hydra-profiler`.

## Step 0.5: Set-up

Set some environment variables and download the necessary files:
```bash
export MIMICIV_RAW_DIR=???         # set to the directory in which you want to store the raw MIMIC-IV data
export MIMICIV_PRE_MEDS_DIR=???    # set to the directory in which you want to store the intermediate, pre-MEDS data
export MIMICIV_MEDS_COHORT_DIR=??? # set to the directory in which you want to store the final MEDS cohort

export VERSION=0.0.6 # or whatever version you want
export URL="https://raw.githubusercontent.com/mmcdermott/MEDS_transforms/$VERSION/MIMIC-IV_Example"

wget $URL/run.sh
wget $URL/pre_MEDS.py
wget $URL/local_parallelism_runner.yaml
wget $URL/slurm_runner.yaml
mkdir configs
cd configs
wget $URL/configs/extract_MIMIC.yaml
cd ..
chmod +x run.sh
chmod +x pre_MEDS.py
```
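For example (these paths are purely illustrative; any writable locations on your system will do):

```bash
# Hypothetical directory layout; substitute your own paths.
export MIMICIV_RAW_DIR=$HOME/data/MIMIC-IV/raw
export MIMICIV_PRE_MEDS_DIR=$HOME/data/MIMIC-IV/pre_MEDS
export MIMICIV_MEDS_COHORT_DIR=$HOME/data/MIMIC-IV/MEDS_cohort
```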
## Step 1: Download MIMIC-IV

[...] the root directory of where the resulting _core data files_ are stored -- e.g.,

```bash
cd $MIMICIV_RAW_DIR
export MIMIC_URL=https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map
wget $MIMIC_URL/d_labitems_to_loinc.csv
wget $MIMIC_URL/inputevents_to_rxnorm.csv
wget $MIMIC_URL/lab_itemid_to_loinc.csv
wget $MIMIC_URL/meas_chartevents_main.csv
wget $MIMIC_URL/meas_chartevents_value.csv
wget $MIMIC_URL/numerics-summary.csv
wget $MIMIC_URL/outputevents_to_loinc.csv
wget $MIMIC_URL/proc_datetimeevents.csv
wget $MIMIC_URL/proc_itemid.csv
wget $MIMIC_URL/waveforms-summary.csv
```
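As a quick, optional check, the ten concept-map files fetched above should now be present alongside the raw
data:

```bash
# Lists the ten concept-map CSVs downloaded above (other CSVs may also appear).
ls -1 *.csv
```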
## Step 2: Run the MEDS ETL

To run the MEDS ETL, run the following command:

```bash
./run.sh $MIMICIV_RAW_DIR $MIMICIV_PRE_MEDS_DIR $MIMICIV_MEDS_COHORT_DIR do_unzip=true
```

To skip unzipping the `.csv.gz` files, set `do_unzip=false` instead of `do_unzip=true`.
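For example, if you have already unzipped the raw files yourself:

```bash
./run.sh $MIMICIV_RAW_DIR $MIMICIV_PRE_MEDS_DIR $MIMICIV_MEDS_COHORT_DIR do_unzip=false
```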
To use a specific stage runner file (e.g., to set different parallelism options), you can specify it as an
additional argument:

```bash
export N_WORKERS=5
./run.sh $MIMICIV_RAW_DIR $MIMICIV_PRE_MEDS_DIR $MIMICIV_MEDS_COHORT_DIR do_unzip=true \
    stage_runner_fp=slurm_runner.yaml
```
The `N_WORKERS` environment variable set before the command controls the maximum number of parallel workers
used.
The `slurm_runner.yaml` file (downloaded above) runs each stage across several workers on separate slurm
worker nodes using the `submitit` launcher. _**You will need to customize this file to your own slurm system
so that the partition names are correct before use.**_ The memory and time costs are viable in the current
configuration, but if your nodes are sufficiently different you may need to adjust those as well.
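Before editing, it can help to see which partition names the downloaded file currently references (this
assumes the file spells the setting "partition", which may not match your copy exactly):

```bash
# Locate the slurm partition settings that need to be customized.
grep -n "partition" slurm_runner.yaml
```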
The `local_parallelism_runner.yaml` file (downloaded above) runs each stage via separate processes on the
launching machine. There are no additional arguments needed for this stage beyond the `N_WORKERS` environment
variable, and there is nothing to customize in this file.
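A local-parallelism invocation might therefore look like the following (a sketch, assuming the files
downloaded in Step 0.5 are in the current directory):

```bash
# Run the full ETL with up to 4 concurrent local worker processes.
export N_WORKERS=4
./run.sh $MIMICIV_RAW_DIR $MIMICIV_PRE_MEDS_DIR $MIMICIV_MEDS_COHORT_DIR do_unzip=true \
    stage_runner_fp=local_parallelism_runner.yaml
```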
To profile the time and memory costs of your ETL, add the `do_profile=true` flag at the end.
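For example (this assumes `hydra-profiler` was installed as noted in Step 0):

```bash
# Same ETL invocation, with profiling enabled.
./run.sh $MIMICIV_RAW_DIR $MIMICIV_PRE_MEDS_DIR $MIMICIV_MEDS_COHORT_DIR do_unzip=true do_profile=true
```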
## Notes

[...] may need to run `unset SLURM_CPU_BIND` in your terminal first to avoid errors.

If you wanted, some other processing could also be done here, such as:

1. Converting the subject's dynamically recorded race into a static, most commonly recorded race field.
`MIMIC-IV_Example/configs/extract_MIMIC.yaml` (new file):
defaults:
  - _extract
  - _self_

description: |-
  This pipeline extracts the MIMIC-IV dataset in longitudinal, sparse form from an input dataset meeting
  select criteria and converts them to the flattened, MEDS format. You can control the key arguments to this
  pipeline by setting environment variables:
  ```bash
  export EVENT_CONVERSION_CONFIG_FP=# Path to your event conversion config
  export MIMICIV_PRE_MEDS_DIR=# Path to the output dir of the pre-MEDS step
  export MIMICIV_MEDS_COHORT_DIR=# Path to where you want the dataset to live
  ```

# The event conversion configuration file is used throughout the pipeline to define the events to extract.
event_conversion_config_fp: ${oc.env:EVENT_CONVERSION_CONFIG_FP}

input_dir: ${oc.env:MIMICIV_PRE_MEDS_DIR}
cohort_dir: ${oc.env:MIMICIV_MEDS_COHORT_DIR}

etl_metadata:
  dataset_name: MIMIC-IV
  dataset_version: 2.2

stage_configs:
  shard_events:
    infer_schema_length: 999999999

stages:
  - shard_events
  - split_and_shard_subjects
  - convert_to_sharded_events
  - merge_to_MEDS_cohort
  - extract_code_metadata
  - finalize_MEDS_metadata
  - finalize_MEDS_data
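A sketch of how these environment variables might be set before launching the extraction (the event
conversion config path is purely illustrative; the other two match the directories from Step 0.5):

```bash
# Hypothetical values; point these at your own files and directories.
export EVENT_CONVERSION_CONFIG_FP=$PWD/configs/event_configs.yaml
export MIMICIV_PRE_MEDS_DIR=$HOME/data/MIMIC-IV/pre_MEDS
export MIMICIV_MEDS_COHORT_DIR=$HOME/data/MIMIC-IV/MEDS_cohort
```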
`pre_MEDS.yaml`, the Hydra config for the pre-MEDS step:
input_dir: ${oc.env:MIMICIV_RAW_DIR}
cohort_dir: ${oc.env:MIMICIV_PRE_MEDS_DIR}

do_overwrite: false

log_dir: ${cohort_dir}/.logs

# Hydra
hydra:
  job:
    name: pre_MEDS_${now:%Y-%m-%d_%H-%M-%S}
  run:
    dir: ${log_dir}
  sweep:
    dir: ${log_dir}
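Since this is a standard Hydra config, any of these keys can be overridden on the command line when invoking
the pre-MEDS script; for example:

```bash
# Re-run the pre-MEDS step, overwriting any previous outputs.
./pre_MEDS.py do_overwrite=true
```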