7. Splitting the Data into Training, Testing, and Validation Sets

Splitting the data in this way is key for machine learning analyses. The training set updates the parameters of the model, the validation set ensures the data generalizes to unseen data and helps with optimization of the model hyperparameters, and the testing set is the held out data to test or make new predictions. We can split the MethylationArray from the final_outputs/ directory into smaller MethylationArrays by similar denotations using:

pymethyl-utils train_test_val_split -tp .8 -vp .125

which creates 3 new MethylationArrays (70% training, 10% validation, and 20% testing) and outputs them as such:

ls train_val_test_sets/
-> test_methyl_array.pkl train_methyl_array.pkl  val_methyl_array.pkl

If you wanted to combine them again, you can add them to a MethylationArrays object and use the .combine() method.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

7. Splitting the Data into Training, Testing, and Validation Sets

Clone this wiki locally