3. Running the Preprocessing Pipeline

Joshua Levy edited this page Jun 26, 2019 · 1 revision

At this point we have a pheno CSV (samplesheet.csv) and a directory of IDATs.
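For reference, the pheno sheet is just a CSV mapping each sample's IDAT basename to its phenotype. The exact columns come from the earlier download/formatting step; the column names and values below are illustrative only:

```
Basename,disease
geo_idats/200123456_R01C01,control
geo_idats/200123457_R02C01,case
```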

We can run the preprocessing pipeline by running:

pymethyl-preprocess preprocess_pipeline -i geo_idats/ -p minfi -noob -qc

This loads the IDATs in geo_idats/ using the minfi option (-p minfi) with noob normalization (-noob). With the -qc flag, the pipeline only loads and stores the RGSet objects; it has not yet performed the normalization. Dropping the -qc option runs the entire pipeline in one go, but stopping at the QC stage is useful for debugging or for tuning threshold parameters.

To finish the QC/normalization process, run:

pymethyl-preprocess preprocess_pipeline -i geo_idats/ -p minfi -noob -u

The -u option loads the saved QC/RGSet objects and completes the normalization.

The preprocessing pipeline also has options to remove, or fill with NA, CpGs or samples that fail detection p-value and bead-number thresholds. The meffil and enmix options (see docs) let you preprocess the data on multiple cores. minfi does not offer this, but as a workaround you can split the pheno CSV into batches (automatic splitting will be added as a future feature) and use:

pymethyl-preprocess split_preprocess_input_by_subtype -h # similar to a groupby statement
pymethyl-preprocess batch_deploy_preprocess -h # runs the preprocessing command for each new pheno csv from the above step
pymethyl-preprocess combine_methylation_arrays -h # combines and merges the resulting objects

These commands preprocess batches of IDATs in parallel. See each command's help documentation for more detail.
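As a rough illustration of the split step, here is a minimal pandas sketch of the "groupby" idea behind split_preprocess_input_by_subtype. The column name `disease` and the output file names are assumptions for this example; the actual command handles this (and more) for you:

```python
# Minimal sketch: split a pheno sheet into one CSV per subtype so each
# batch can be preprocessed independently and the results merged later.
# Assumes a "disease" column; adapt to your sheet's actual grouping key.
import pandas as pd

pheno = pd.DataFrame({
    "Basename": ["idat1", "idat2", "idat3", "idat4"],
    "disease": ["control", "case", "control", "case"],
})

# One pheno DataFrame per subtype, keyed by subtype name.
batches = {name: group for name, group in pheno.groupby("disease")}

# Write each batch out; each file becomes the -pc input of one
# preprocess_pipeline run.
for name, group in batches.items():
    group.to_csv(f"pheno_{name}.csv", index=False)
```

Each batch is then preprocessed separately (what batch_deploy_preprocess automates), and the resulting MethylationArray objects are merged back together with combine_methylation_arrays.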