3. Running the Preprocessing Pipeline
We now have a pheno CSV file (samplesheet.csv) and a directory of IDATs.
We can run the preprocessing pipeline with:
pymethyl-preprocess preprocess_pipeline -i geo_idats/ -p minfi -noob -qc
This preprocesses the IDATs in geo_idats/ with the minfi pipeline (-p) and noob normalization (-noob). With the -qc flag, only the loaded RGSet objects are stored and normalization is not yet performed; this is useful for debugging or for setting threshold parameters. Omitting -qc executes the entire pipeline in one pass.
To finish the qc/norm process, run:
pymethyl-preprocess preprocess_pipeline -i geo_idats/ -p minfi -noob -u
This loads the saved QC/RGSet objects via the -u option and finishes normalization.
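For scripted runs, the two-stage flow above can be driven from Python. This is a minimal sketch, assuming the pymethyl-preprocess CLI from this tutorial is installed and on your PATH; the commands themselves are exactly the ones shown above.

```python
# Sketch: the two-stage QC-then-normalize flow, driven from Python.
# Assumes the pymethyl-preprocess CLI is installed (it is skipped otherwise).
import shutil
import subprocess

# Stage 1: load IDATs and store the RGSet objects only (-qc).
qc_cmd = ["pymethyl-preprocess", "preprocess_pipeline",
          "-i", "geo_idats/", "-p", "minfi", "-noob", "-qc"]
# Stage 2: resume from the saved RGSets (-u) and finish normalization.
norm_cmd = ["pymethyl-preprocess", "preprocess_pipeline",
            "-i", "geo_idats/", "-p", "minfi", "-noob", "-u"]

if shutil.which("pymethyl-preprocess"):  # only execute if the CLI is available
    subprocess.run(qc_cmd, check=True)
    subprocess.run(norm_cmd, check=True)
```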
The preprocessing pipeline also has options to remove, or fill with NA values, CpGs or samples that fail detection p-value and bead-number thresholds. The meffil and enmix options (see docs) let you preprocess the data on multiple cores. minfi does not offer this, but as a workaround you can split the pheno CSV into batches (planned as a future feature) and use the following commands:
pymethyl-preprocess split_preprocess_input_by_subtype -h # similar to a groupby statement
pymethyl-preprocess batch_deploy_preprocess -h # runs the preprocessing command for each new pheno csv from the above step
pymethyl-preprocess combine_methylation_arrays -h # combines and merges the resulting objects
These commands preprocess batches of the IDATs in parallel. See each command's help documentation for more detail.
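To make the batching idea concrete, here is a sketch of the groupby-style split that split_preprocess_input_by_subtype performs: group the pheno CSV rows by a phenotype column and write one CSV per group. The function name, the output file layout, and the example column are illustrative assumptions, not the actual CLI's behavior.

```python
# Illustrative sketch (not the real CLI): split a pheno CSV into one CSV
# per value of a chosen column, so each batch can be preprocessed separately.
import csv
from collections import defaultdict

def split_pheno_by_column(pheno_csv, column, out_prefix="pheno_batch"):
    """Group rows of pheno_csv by `column`; write one CSV per group."""
    groups = defaultdict(list)
    with open(pheno_csv, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        for row in reader:
            groups[row[column]].append(row)
    out_files = []
    for value, rows in groups.items():
        out = f"{out_prefix}_{value}.csv"  # assumed naming scheme
        with open(out, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        out_files.append(out)
    return out_files
```

Each output CSV can then be fed through batch_deploy_preprocess, and the resulting objects merged with combine_methylation_arrays.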