2. Formatting the Sample Sheet

Joshua Levy edited this page Jun 26, 2019 · 1 revision

We now need to format this cumbersome clinical CSV file so that downstream steps can interpret and use it easily. This requires only two commands and one file that maps the column names from the table we just saw to cleaner names.

First, let's see which header names need changing. The key columns in the pheno CSV file are geo_accession, gender:ch1, tissue:ch1, "disease state:ch1" and age:ch1. We want the pheno CSV to contain only this information, with the columns renamed to AccNum, Sex, Tissue, disease and Age respectively. We supply this mapping in a TSV file (tab-separated because some of the column names contain spaces):

nano include_col.txt

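In include_col.txt, add the lines below, one mapping per line, with the old and new names separated by a tab (not by spaces, as they appear here). geo_accession and "disease state:ch1" are not listed because the -s geo and -d options of the command further down handle them: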

gender:ch1       Sex
tissue:ch1       Tissue
age:ch1       Age
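
Because an editor can silently turn tabs into spaces, it may be safer to generate the mapping file from a script. The following is a minimal sketch using only the Python standard library; the names COLUMN_MAP and write_column_map are hypothetical helpers for illustration and are not part of pymethyl-preprocess.

```python
import csv

# Original GEO column name -> name wanted in the sample sheet.
# geo_accession and "disease state:ch1" are left out on purpose, since the
# tutorial passes those through the -s geo and -d options instead.
COLUMN_MAP = {
    "gender:ch1": "Sex",
    "tissue:ch1": "Tissue",
    "age:ch1": "Age",
}


def write_column_map(path, column_map):
    """Write one 'old<TAB>new' pair per line to `path`."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(column_map.items())


if __name__ == "__main__":
    write_column_map("include_col.txt", COLUMN_MAP)
```

Running the script writes include_col.txt into the current directory with real tab characters between the two names.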

We pass this file to the first command below with -c:

pymethyl-preprocess create_sample_sheet -is ./geo_idats/GSE87571_clinical_info.csv -s geo -i geo_idats/ -os geo_idats/samplesheet.csv -d "disease state:ch1" -c include_col.txt
mkdir backup_clinical && mv ./geo_idats/GSE87571_clinical_info.csv backup_clinical

The first command formats the CSV to match our aims above. -d names the disease column, here "disease state:ch1". -s geo applies GEO-specific formatting, taking geo_accession into account and making the corresponding change. The command also scans the IDAT directory given by -i and adds a Basename column for minfi and meffil to search as we parallelize and preprocess. -is and -os set the input and output CSVs. At most one CSV may exist in the geo_idats directory for our preprocessing program; a CSV can instead live in another directory and reference the IDATs, which is why the second command moves the original clinical CSV into backup_clinical.

The command below makes sure the Sex column, if there is one, is properly formatted.
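
To make the effect of the first command concrete, here is a rough sketch of the same transformation in plain Python. It is only an illustration and not how pymethyl-preprocess is actually implemented: find_basenames and build_sample_sheet are hypothetical names, and the sketch assumes each IDAT file is named with the accession first, followed by an underscore and the usual _Grn.idat / _Red.idat suffix (optionally compressed).

```python
from pathlib import Path


def find_basenames(idat_dir):
    """Map each accession (text before the first '_') to its IDAT path prefix."""
    basenames = {}
    for path in sorted(Path(idat_dir).glob("*_Grn.idat*")):
        prefix = path.name.split("_Grn.idat")[0]
        basenames[prefix.split("_")[0]] = str(Path(idat_dir) / prefix)
    return basenames


def build_sample_sheet(rows, idat_dir, disease_col, column_map):
    """Keep and rename the wanted columns and add a Basename per sample.

    `rows` is a list of dicts, for example from csv.DictReader on the
    clinical CSV; `column_map` maps old column names to new ones.
    """
    basenames = find_basenames(idat_dir)
    sheet = []
    for row in rows:
        accession = row["geo_accession"]
        out = {
            "Basename": basenames.get(accession, ""),  # empty if no IDAT found
            "AccNum": accession,
            "disease": row[disease_col],
        }
        for old_name, new_name in column_map.items():
            out[new_name] = row[old_name]
        sheet.append(out)
    return sheet
```

Every column that is not named in the mapping, by disease_col, or as geo_accession is dropped, which mirrors the goal of keeping only the key information in the sample sheet.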

pymethyl-preprocess meffil_encode -is geo_idats/samplesheet.csv -os geo_idats/samplesheet.csv

Now we can look at what we've generated: geo_idats/samplesheet.csv

   Basename                                AccNum      disease  Age  Sex     Tissue
0  geo_idats/GSM2333901_9370847096_R05C02  GSM2333901  normal   72   Male    whole blood
1  geo_idats/GSM2333902_9376538120_R06C01  GSM2333902  normal   55   Male    whole blood
2  geo_idats/GSM2333903_7766148053_R06C02  GSM2333903  normal   23   Male    whole blood
3  geo_idats/GSM2333904_7766148077_R04C02  GSM2333904  normal   86   Male    whole blood
4  geo_idats/GSM2333905_9370847096_R06C02  GSM2333905  normal   74   Male    whole blood
5  geo_idats/GSM2333906_7766148116_R01C02  GSM2333906  normal   76   Female  whole blood

Looks good to me.
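
For a check that goes beyond eyeballing the first rows, a short script can confirm that the expected columns are present and that every Basename points at a pair of IDAT files. This is a minimal sketch under a few assumptions: check_sample_sheet is a hypothetical helper, the column names are the ones shown above, IDAT files follow the usual <Basename>_Grn.idat / <Basename>_Red.idat naming (optionally compressed), and the script is run from the directory that the Basename paths are relative to.

```python
import csv
from pathlib import Path

EXPECTED_COLUMNS = {"Basename", "AccNum", "disease", "Age", "Sex", "Tissue"}


def check_sample_sheet(csv_path):
    """Return a list of human-readable problems found in the sample sheet."""
    problems = []
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            return [f"missing columns: {sorted(missing)}"]
        # Data rows start on line 2 because line 1 is the header.
        for line_no, row in enumerate(reader, start=2):
            base = Path(row["Basename"])
            for channel in ("Grn", "Red"):
                pattern = f"{base.name}_{channel}.idat*"
                if not any(base.parent.glob(pattern)):
                    problems.append(
                        f"line {line_no}: no {channel} IDAT for {row['Basename']}"
                    )
    return problems


if __name__ == "__main__":
    for problem in check_sample_sheet("geo_idats/samplesheet.csv"):
        print(problem)
```

An empty result means every row has both channel files where the sample sheet says they are.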