-
Notifications
You must be signed in to change notification settings - Fork 8
7. Splitting the Data into Training, Testing, and Validation Sets
Joshua Levy edited this page Jun 26, 2019
·
1 revision
Splitting the data in this way is key for machine learning analyses. The training set updates the parameters of the model, the validation set ensures the data generalizes to unseen data and helps with optimization of the model hyperparameters, and the testing set is the held out data to test or make new predictions. We can split the MethylationArray from the final_outputs/ directory into smaller MethylationArrays by similar denotations using:
pymethyl-utils train_test_val_split -tp .8 -vp .125
which creates 3 new MethylationArrays (70% training, 10% validation, and 20% testing) and outputs them as such:
ls train_val_test_sets/
-> test_methyl_array.pkl train_methyl_array.pkl val_methyl_array.pkl
If you wanted to combine them again, you can add them to a MethylationArrays object and use the .combine() method.