2. Formatting the Sample Sheet
We now need to format this cumbersome clinical CSV file so that downstream steps can read it easily. This takes only two commands and one file that maps the column names from the table we just saw to cleaner names.
First, let's see which header names need changing. The key columns in the pheno CSV file are geo_accession, gender:ch1, tissue:ch1, "disease state:ch1" and age:ch1. We want the pheno CSV to contain only this information, with the columns renamed to AccNum, Sex, Tissue, disease and Age respectively. We define this mapping in a TSV file (tab-separated because some of the column names contain spaces):
nano include_col.txt
In include_col.txt, add the following lines, separating the old and new names with tabs (not spaces, as they appear below):
gender:ch1 Sex
tissue:ch1 Tissue
age:ch1 Age
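Since tabs are easy to get wrong in an editor, the mapping file can also be written programmatically. The following is a minimal sketch, not part of PyMethylProcess; the helper name `write_column_map` is hypothetical, and the pairs are the ones listed above.

```python
import csv

# Original GEO column name -> desired sample sheet column name,
# taken from the mapping shown above.
COLUMN_MAP = [
    ("gender:ch1", "Sex"),
    ("tissue:ch1", "Tissue"),
    ("age:ch1", "Age"),
]


def write_column_map(path, column_map=COLUMN_MAP):
    """Write one 'old<TAB>new' pair per line (hypothetical helper)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(column_map)
```

Calling `write_column_map("include_col.txt")` produces the same file as the nano step, with real tab characters between the two names.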
We add this file to the command below:
pymethyl-preprocess create_sample_sheet -is ./geo_idats/GSE87571_clinical_info.csv -s geo -i geo_idats/ -os geo_idats/samplesheet.csv -d "disease state:ch1" -c include_col.txt
mkdir backup_clinical && mv ./geo_idats/GSE87571_clinical_info.csv backup_clinical
The first command formats the CSV to match our aims above. Here we've specified "disease state:ch1" with the -d option. -s geo applies GEO-specific formatting, which picks up the geo_accession column and renames it accordingly. The command also scans the IDAT directory, given by -i, and adds a Basename column that minfi and meffil use to locate the files as we parallelize and preprocess. -is and -os set the input and output CSVs. At most one CSV may exist in the geo_idats directory (the CSV can instead live in another directory and reference the IDATs), which is why the second command moves the original clinical CSV into backup_clinical.
The command below makes sure the sex column, if there is one, is properly formatted.
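A misspelled column name is an easy mistake here, so it can help to confirm, before running create_sample_sheet, that every source column is actually present in the clinical CSV header. The sketch below is an independent check using only the Python standard library; `missing_columns` is a hypothetical helper, and the `extra` defaults are simply the two columns this tutorial handles through -s geo and -d.

```python
import csv


def missing_columns(clinical_csv, mapping_tsv,
                    extra=("geo_accession", "disease state:ch1")):
    """Return source column names that are absent from the clinical CSV header."""
    # Read only the header row of the clinical CSV.
    with open(clinical_csv, newline="") as handle:
        header = set(next(csv.reader(handle)))
    # The first field of each mapping line is the original column name.
    with open(mapping_tsv, newline="") as handle:
        wanted = [row[0] for row in csv.reader(handle, delimiter="\t") if row]
    return [name for name in wanted + list(extra) if name not in header]
```

For example, `missing_columns("./geo_idats/GSE87571_clinical_info.csv", "include_col.txt")` should return an empty list if all the names above match the header exactly.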
pymethyl-preprocess meffil_encode -is geo_idats/samplesheet.csv -os geo_idats/samplesheet.csv
Now we can look at what we've generated in geo_idats/samplesheet.csv:
| | Basename | AccNum | disease | Age | Sex | Tissue |
|---|---|---|---|---|---|---|
| 0 | geo_idats/GSM2333901_9370847096_R05C02 | GSM2333901 | normal | 72 | Male | whole blood |
| 1 | geo_idats/GSM2333902_9376538120_R06C01 | GSM2333902 | normal | 55 | Male | whole blood |
| 2 | geo_idats/GSM2333903_7766148053_R06C02 | GSM2333903 | normal | 23 | Male | whole blood |
| 3 | geo_idats/GSM2333904_7766148077_R04C02 | GSM2333904 | normal | 86 | Male | whole blood |
| 4 | geo_idats/GSM2333905_9370847096_R06C02 | GSM2333905 | normal | 74 | Male | whole blood |
| 5 | geo_idats/GSM2333906_7766148116_R01C02 | GSM2333906 | normal | 76 | Female | whole blood |
Looks good to me.
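Eyeballing the first few rows works for a quick look, but a short script can check the whole sheet. The sketch below is not part of PyMethylProcess; `summarize_sample_sheet` is a hypothetical helper that assumes the column names shown in the table above and that each Basename is a path prefix shared by that sample's IDAT files.

```python
import csv
import glob
from collections import Counter

# Column names as they appear in the generated sample sheet above.
EXPECTED = ["Basename", "AccNum", "disease", "Age", "Sex", "Tissue"]


def summarize_sample_sheet(path):
    """Check the expected columns exist and summarize the sheet."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        header = reader.fieldnames or []
    missing = [col for col in EXPECTED if col not in header]
    if missing:
        raise ValueError(f"sample sheet is missing columns: {missing}")
    # A Basename with no file starting with that prefix is suspicious.
    no_files = [row["Basename"] for row in rows
                if not glob.glob(glob.escape(row["Basename"]) + "*")]
    return {
        "n_samples": len(rows),
        "sex_counts": Counter(row["Sex"] for row in rows),
        "basenames_without_files": no_files,
    }
```

Run it from the directory that contains geo_idats/, since the Basename paths in the sheet are relative, for example `summarize_sample_sheet("geo_idats/samplesheet.csv")`.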