3. Running the Preprocessing Pipeline

Joshua Levy edited this page Jun 26, 2019 · 1 revision

At this point we have a pheno CSV (samplesheet.csv) and a directory of IDATs.
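For reference, the pheno sheet is just a CSV mapping each sample's IDAT basename to its phenotype. The exact columns come from the earlier download/formatting step; the column names and values below are illustrative only:

```
Basename,disease
geo_idats/200123456_R01C01,control
geo_idats/200123457_R02C01,case
```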

We can run the preprocessing pipeline by running:

pymethyl-preprocess preprocess_pipeline -i geo_idats/ -p minfi -noob -qc

This loads the IDATs in geo_idats/ using the minfi option (-p minfi) with noob normalization (-noob). With the -qc flag, the pipeline only loads and stores the RGSet objects; it has not yet performed the normalization. Dropping the -qc option runs the entire pipeline in one go, but stopping at the QC stage is useful for debugging or for tuning threshold parameters.

To finish the QC/normalization process, run:

pymethyl-preprocess preprocess_pipeline -i geo_idats/ -p minfi -noob -u

The -u option loads the saved QC/RGSet objects and completes the normalization.

The preprocessing pipeline also has options to remove, or fill with NA, CpGs or samples that fail detection p-value and bead-number thresholds. The meffil and enmix options (see docs) let you preprocess the data on multiple cores. minfi does not offer this, but as a workaround you can split the pheno CSV into batches (automatic splitting will be added as a future feature) and use:

pymethyl-preprocess split_preprocess_input_by_subtype -h # similar to a groupby statement
pymethyl-preprocess batch_deploy_preprocess -h # runs the preprocessing command for each new pheno csv from the above step
pymethyl-preprocess combine_methylation_arrays -h # combines and merges the resulting objects

These commands preprocess batches of IDATs in parallel. See each command's help documentation for more detail.
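As a rough illustration of the split step, here is a minimal pandas sketch of the "groupby" idea behind split_preprocess_input_by_subtype. The column name `disease` and the output file names are assumptions for this example; the actual command handles this (and more) for you:

```python
# Minimal sketch: split a pheno sheet into one CSV per subtype so each
# batch can be preprocessed independently and the results merged later.
# Assumes a "disease" column; adapt to your sheet's actual grouping key.
import pandas as pd

pheno = pd.DataFrame({
    "Basename": ["idat1", "idat2", "idat3", "idat4"],
    "disease": ["control", "case", "control", "case"],
})

# One pheno DataFrame per subtype, keyed by subtype name.
batches = {name: group for name, group in pheno.groupby("disease")}

# Write each batch out; each file becomes the -pc input of one
# preprocess_pipeline run.
for name, group in batches.items():
    group.to_csv(f"pheno_{name}.csv", index=False)
```

Each batch is then preprocessed separately (what batch_deploy_preprocess automates), and the resulting MethylationArray objects are merged back together with combine_methylation_arrays.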