Merge pull request #193 from mmcdermott/dev
Release Candidate 0.0.7
mmcdermott authored Sep 1, 2024
2 parents 9549d7e + 3519769 commit 9588583
Showing 85 changed files with 4,446 additions and 2,208 deletions.
157 changes: 54 additions & 103 deletions MIMIC-IV_Example/README.md
@@ -6,33 +6,34 @@ up from this one).

## Step 0: Installation

Download this repository and install the requirements:
If you want to install via PyPI (note that, for now, you still need to copy some files locally even with a
PyPI installation, as covered below, so make sure you are in a suitable directory), use:

```bash
conda create -n MEDS python=3.12
conda activate MEDS
pip install "MEDS_transforms[local_parallelism]"
mkdir MIMIC-IV_Example
cd MIMIC-IV_Example
wget https://raw.githubusercontent.com/mmcdermott/MEDS_transforms/main/MIMIC-IV_Example/joint_script.sh
wget https://raw.githubusercontent.com/mmcdermott/MEDS_transforms/main/MIMIC-IV_Example/joint_script_slurm.sh
wget https://raw.githubusercontent.com/mmcdermott/MEDS_transforms/main/MIMIC-IV_Example/pre_MEDS.py
chmod +x joint_script.sh
chmod +x joint_script_slurm.sh
chmod +x pre_MEDS.py
cd ..
pip install "MEDS_transforms[local_parallelism,slurm_parallelism]"
```

If you want to install locally, use:

```bash
git clone git@github.com:mmcdermott/MEDS_transforms.git
cd MEDS_transforms
conda create -n MEDS python=3.12
conda activate MEDS
pip install .[local_parallelism]
```

If you want to profile the time and memory costs of your ETL, also install: `pip install hydra-profiler`.

## Step 0.5: Set-up

Set some environment variables and download the necessary files:

```bash
export MIMICIV_RAW_DIR=??? # set to the directory in which you want to store the raw MIMIC-IV data
export MIMICIV_PRE_MEDS_DIR=??? # set to the directory in which you want to store the intermediate pre-MEDS data
export MIMICIV_MEDS_COHORT_DIR=??? # set to the directory in which you want to store the final MEDS cohort

export VERSION=0.0.6 # or whatever version you want
export URL="https://raw.githubusercontent.com/mmcdermott/MEDS_transforms/$VERSION/MIMIC-IV_Example"

wget $URL/run.sh
wget $URL/pre_MEDS.py
wget $URL/local_parallelism_runner.yaml
wget $URL/slurm_runner.yaml
mkdir configs
cd configs
wget $URL/configs/extract_MIMIC.yaml
cd ..
chmod +x run.sh
chmod +x pre_MEDS.py
```
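
If you would rather track the development branch than a pinned release, note that `raw.githubusercontent.com`
serves branch names as well as tags (the `main` URLs in Step 0 rely on this), so you can set, for example:

```bash
export VERSION=main # track the development branch instead of a pinned release tag
export URL="https://raw.githubusercontent.com/mmcdermott/MEDS_transforms/$VERSION/MIMIC-IV_Example"
```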

## Step 1: Download MIMIC-IV
@@ -46,101 +47,51 @@ the root directory of where the resulting _core data files_ are stored -- e.g.,

```bash
cd $MIMICIV_RAW_DIR
wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/d_labitems_to_loinc.csv
wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/inputevents_to_rxnorm.csv
wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/lab_itemid_to_loinc.csv
wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/meas_chartevents_main.csv
wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/meas_chartevents_value.csv
wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/numerics-summary.csv
wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/outputevents_to_loinc.csv
wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/proc_datetimeevents.csv
wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/proc_itemid.csv
wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/waveforms-summary.csv
export MIMIC_URL=https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map
wget $MIMIC_URL/d_labitems_to_loinc.csv
wget $MIMIC_URL/inputevents_to_rxnorm.csv
wget $MIMIC_URL/lab_itemid_to_loinc.csv
wget $MIMIC_URL/meas_chartevents_main.csv
wget $MIMIC_URL/meas_chartevents_value.csv
wget $MIMIC_URL/numerics-summary.csv
wget $MIMIC_URL/outputevents_to_loinc.csv
wget $MIMIC_URL/proc_datetimeevents.csv
wget $MIMIC_URL/proc_itemid.csv
wget $MIMIC_URL/waveforms-summary.csv
```

## Step 2: Run the basic MEDS ETL

This step contains several sub-steps; luckily, all of these sub-steps can be run via a single script: either
`joint_script.sh`, which uses the Hydra `joblib` launcher to run things with local parallelism (make sure you
enable this feature by including the `[local_parallelism]` option during installation), or
`joint_script_slurm.sh`, which uses the Hydra `submitit` launcher to run things through slurm (make sure you
enable this feature by including the `[slurm_parallelism]` option during installation). This script entails
several steps:

### Step 2.1: Get the data ready for base MEDS extraction

This is a step in a few parts:

1. Join a few tables by `hadm_id` to get the right times in the right rows for processing. In particular, we
   need to join:
   - the `hosp/diagnoses_icd` table with the `hosp/admissions` table to get the `dischtime` for each
     `hadm_id`.
   - the `hosp/drgcodes` table with the `hosp/admissions` table to get the `dischtime` for each `hadm_id`.
2. Convert the patient's static data to a more parseable form. This entails:
   - getting the patient's DOB in a format that is usable for MEDS, rather than the integral `anchor_year`
     and `anchor_offset` fields;
   - merging the patient's `dod` with the `deathtime` from the `admissions` table.

After these steps, modified files or symlinks to the original files will be written in a new directory, which
will be used as the input to the actual MEDS extraction ETL. We'll use `$MIMICIV_PREMEDS_DIR` to denote this
directory.

This step is run by the `joint_script.sh` script or the `joint_script_slurm.sh` script, but in either case
the base command that is run is as follows (assumed to be run **not** from this directory but from the root
directory of this repository):

```bash
./MIMIC-IV_Example/pre_MEDS.py raw_cohort_dir=$MIMICIV_RAW_DIR output_dir=$MIMICIV_PREMEDS_DIR
```

In practice, on a machine with 150 GB of RAM and 10 cores, this step takes less than 5 minutes in total.

### Step 2.2: Run the MEDS extraction ETL

We will assume you want to output the final MEDS dataset into a directory we'll denote as
`$MIMICIV_MEDS_DIR`. Note this is a different directory than the pre-MEDS directory (though, of course, they
can both be subdirectories of the same root directory).

This is a step in 4 parts (plus an optional fifth):

1. Sub-shard the raw files. Run this command as many times simultaneously as you would like to have workers
   performing this sub-sharding step. See the sketch after this list for how to automate this parallelism
   using hydra launchers.

   This step uses the `./scripts/extraction/shard_events.py` script. See `joint_script*.sh` for the expected
   format of the command.

2. Extract and form the patient splits and sub-shards. The `./scripts/extraction/split_and_shard_patients.py`
   script is used for this step. See `joint_script*.sh` for the expected format of the command.

3. Extract patient sub-shards and convert to MEDS events. The
   `./scripts/extraction/convert_to_sharded_events.py` script is used for this step. See `joint_script*.sh`
   for the expected format of the command.

4. Merge the MEDS events into a single file per patient sub-shard. The
   `./scripts/extraction/merge_to_MEDS_cohort.py` script is used for this step. See `joint_script*.sh` for
   the expected format of the command.

5. (Optional) Generate preliminary code statistics and merge to external metadata. This is not currently
   performed in the `joint_script*.sh` scripts.
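
For concreteness, the following is an illustrative sketch of what these invocations can look like when
automated with the Hydra `joblib` launcher. The exact argument names and launcher overrides here are
assumptions for illustration; `joint_script.sh` remains the authoritative reference for the real commands.

```bash
# Illustrative sketch only -- see joint_script.sh for the authoritative commands.
export N_WORKERS=4 # how many parallel workers to use for the parallelizable stages

# 1. Sub-shard the raw files, with N_WORKERS workers doing so simultaneously.
./scripts/extraction/shard_events.py \
    --multirun worker="range(0,$N_WORKERS)" hydra/launcher=joblib \
    input_dir=$MIMICIV_PREMEDS_DIR cohort_dir=$MIMICIV_MEDS_DIR \
    event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml

# 2. Extract and form the patient splits and sub-shards (a single-worker step).
./scripts/extraction/split_and_shard_patients.py \
    input_dir=$MIMICIV_PREMEDS_DIR cohort_dir=$MIMICIV_MEDS_DIR \
    event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml

# 3. Convert each patient sub-shard to MEDS events, in parallel.
./scripts/extraction/convert_to_sharded_events.py \
    --multirun worker="range(0,$N_WORKERS)" hydra/launcher=joblib \
    input_dir=$MIMICIV_PREMEDS_DIR cohort_dir=$MIMICIV_MEDS_DIR \
    event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml

# 4. Merge the MEDS events into one file per patient sub-shard, in parallel.
./scripts/extraction/merge_to_MEDS_cohort.py \
    --multirun worker="range(0,$N_WORKERS)" hydra/launcher=joblib \
    input_dir=$MIMICIV_PREMEDS_DIR cohort_dir=$MIMICIV_MEDS_DIR \
    event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml
```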

## Limitations / TO-DOs:

Currently, some tables are ignored, including:

1. `hosp/emar_detail`
2. `hosp/microbiologyevents`
3. `hosp/services`
4. `icu/datetimeevents`
5. `icu/ingredientevents`

Lots of questions remain about how to appropriately handle the times in the data -- e.g., things like HCPCS
events are stored at the level of the _date_, not the _datetime_. How should those be slotted into a
timeline that is otherwise stored at _datetime_ resolution?

Other questions:

1. How to handle merging the deathtimes between the hosp table and the patients table?
2. How to handle the dob nonsense MIMIC has?

## Step 2: Run the MEDS ETL

To run the MEDS ETL, run the following command:

```bash
./run.sh $MIMICIV_RAW_DIR $MIMICIV_PRE_MEDS_DIR $MIMICIV_MEDS_COHORT_DIR do_unzip=true
```

To not unzip the `.csv.gz` files, set `do_unzip=false` instead of `do_unzip=true`.
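
For example, if the raw `.csv.gz` files have already been decompressed, the same command with the unzip stage
skipped is just:

```bash
./run.sh $MIMICIV_RAW_DIR $MIMICIV_PRE_MEDS_DIR $MIMICIV_MEDS_COHORT_DIR do_unzip=false
```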

To use a specific stage runner file (e.g., to set different parallelism options), you can specify it as an
additional argument:

```bash
export N_WORKERS=5
./run.sh $MIMICIV_RAW_DIR $MIMICIV_PRE_MEDS_DIR $MIMICIV_MEDS_COHORT_DIR do_unzip=true \
    stage_runner_fp=slurm_runner.yaml
```

The `N_WORKERS` environment variable set before the command controls the maximum number of parallel workers
to use.

The `slurm_runner.yaml` file (downloaded above) runs each stage across several workers on separate slurm
worker nodes using the `submitit` launcher. _**You will need to customize this file to your own slurm system
so that the partition names are correct before use.**_ The memory and time costs are viable in the current
configuration, but if your nodes are sufficiently different you may need to adjust those as well.

The `local_parallelism_runner.yaml` file (downloaded above) runs each stage via separate processes on the
launching machine. There are no additional arguments needed beyond the `N_WORKERS` environment variable, and
there is nothing to customize in this file.
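
For example, to run with local process-based parallelism instead of slurm:

```bash
export N_WORKERS=5
./run.sh $MIMICIV_RAW_DIR $MIMICIV_PRE_MEDS_DIR $MIMICIV_MEDS_COHORT_DIR do_unzip=true \
    stage_runner_fp=local_parallelism_runner.yaml
```
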
To profile the time and memory costs of your ETL, add the `do_profile=true` flag at the end.
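
For example (this assumes `hydra-profiler` has been installed, as described in Step 0):

```bash
./run.sh $MIMICIV_RAW_DIR $MIMICIV_PRE_MEDS_DIR $MIMICIV_MEDS_COHORT_DIR do_unzip=true do_profile=true
```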

## Notes

@@ -153,4 +104,4 @@ may need to run `unset SLURM_CPU_BIND` in your terminal first to avoid errors.

If you wanted, some other processing could also be done here, such as:

1. Converting the patient's dynamically recorded race into a static, most commonly recorded race field.
1. Converting the subject's dynamically recorded race into a static, most commonly recorded race field.
18 changes: 10 additions & 8 deletions MIMIC-IV_Example/configs/event_configs.yaml
@@ -1,4 +1,4 @@
patient_id_col: subject_id
subject_id_col: subject_id
hosp/admissions:
ed_registration:
code: ED_REGISTRATION
@@ -27,7 +27,7 @@ hosp/admissions:
time: col(dischtime)
time_format: "%Y-%m-%d %H:%M:%S"
hadm_id: hadm_id
# We omit the death event here as it is joined to the data in the patients table in the pre-MEDS step.
# We omit the death event here as it is joined to the data in the subjects table in the pre-MEDS step.

hosp/diagnoses_icd:
diagnosis:
@@ -42,7 +42,7 @@ hosp/diagnoses_icd:
_metadata:
hosp/d_icd_diagnoses:
description: "long_title"
parent_codes: "ICD{icd_version}CM/{icd_code}" # Single strings are templates of columns.
parent_codes: "ICD{icd_version}CM/{norm_icd_code}" # Single strings are templates of columns.

hosp/drgcodes:
drg:
@@ -100,6 +100,7 @@ hosp/labevents:
description: ["omop_concept_name", "label"] # List of strings are columns to be collated
itemid: "itemid (omop_source_code)"
parent_codes: "{omop_vocabulary_id}/{omop_concept_code}"
valueuom: "valueuom"

hosp/omr:
omr:
@@ -164,8 +165,8 @@ hosp/procedures_icd:
hosp/d_icd_procedures:
description: "long_title"
parent_codes: # List of objects are string labels mapping to filters to be evaluated.
- "ICD{icd_version}Proc/{icd_code}": { icd_version: 9 }
- "ICD{icd_version}PCS/{icd_code}": { icd_version: 10 }
- "ICD{icd_version}Proc/{norm_icd_code}": { icd_version: "9" }
- "ICD{icd_version}PCS/{norm_icd_code}": { icd_version: "10" }

hosp/transfers:
transfer:
@@ -218,6 +219,7 @@ icu/chartevents:
description: ["omop_concept_name", "label"] # List of strings are columns to be collated
itemid: "itemid (omop_source_code)"
parent_codes: "{omop_vocabulary_id}/{omop_concept_code}"
valueuom: "valueuom"

icu/procedureevents:
start:
@@ -295,9 +297,9 @@ icu/inputevents:
description: ["omop_concept_name", "label"] # List of strings are columns to be collated
itemid: "itemid (omop_source_code)"
parent_codes: "{omop_vocabulary_id}/{omop_concept_code}"
patient_weight:
subject_weight:
code:
- PATIENT_WEIGHT_AT_INFUSION
- SUBJECT_WEIGHT_AT_INFUSION
- KG
time: col(starttime)
time_format: "%Y-%m-%d %H:%M:%S"
@@ -306,7 +308,7 @@ icu/inputevents:
icu/outputevents:
output:
code:
- PATIENT_FLUID_OUTPUT
- SUBJECT_FLUID_OUTPUT
- col(itemid)
- col(valueuom)
time: col(charttime)
36 changes: 36 additions & 0 deletions MIMIC-IV_Example/configs/extract_MIMIC.yaml
@@ -0,0 +1,36 @@
defaults:
- _extract
- _self_

description: |-
This pipeline extracts the MIMIC-IV dataset in longitudinal, sparse form from an input dataset meeting
select criteria and converts them to the flattened, MEDS format. You can control the key arguments to this
pipeline by setting environment variables:
```bash
export EVENT_CONVERSION_CONFIG_FP=# Path to your event conversion config
export MIMICIV_PRE_MEDS_DIR=# Path to the output dir of the pre-MEDS step
export MIMICIV_MEDS_COHORT_DIR=# Path to where you want the dataset to live
```
# The event conversion configuration file is used throughout the pipeline to define the events to extract.
event_conversion_config_fp: ${oc.env:EVENT_CONVERSION_CONFIG_FP}

input_dir: ${oc.env:MIMICIV_PRE_MEDS_DIR}
cohort_dir: ${oc.env:MIMICIV_MEDS_COHORT_DIR}

etl_metadata:
dataset_name: MIMIC-IV
dataset_version: 2.2

stage_configs:
shard_events:
infer_schema_length: 999999999

stages:
- shard_events
- split_and_shard_subjects
- convert_to_sharded_events
- merge_to_MEDS_cohort
- extract_code_metadata
- finalize_MEDS_metadata
- finalize_MEDS_data
12 changes: 8 additions & 4 deletions MIMIC-IV_Example/configs/pre_MEDS.yaml
@@ -1,11 +1,15 @@
raw_cohort_dir: ???
output_dir: ???
input_dir: ${oc.env:MIMICIV_RAW_DIR}
cohort_dir: ${oc.env:MIMICIV_PRE_MEDS_DIR}

do_overwrite: false

log_dir: ${cohort_dir}/.logs

# Hydra
hydra:
job:
name: pre_MEDS_${now:%Y-%m-%d_%H-%M-%S}
run:
dir: ${output_dir}/.logs/${hydra.job.name}
dir: ${log_dir}
sweep:
dir: ${output_dir}/.logs/${hydra.job.name}
dir: ${log_dir}
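
With this configuration, the pre-MEDS script resolves its input and output locations from the environment
variables set in Step 0.5, so a run reduces to a minimal sketch like the following (paths are illustrative,
and we assume `pre_MEDS.py` downloaded above is the Hydra entry point that consumes this config):

```bash
export MIMICIV_RAW_DIR=/path/to/raw/mimic-iv # illustrative paths
export MIMICIV_PRE_MEDS_DIR=/path/to/pre_meds_output
./pre_MEDS.py # input_dir and cohort_dir resolve via the ${oc.env:...} entries above
```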
