1. Downloading the Data
Let's assume that you've already followed all of the installation instructions on the README page.
We're first going to download the IDATs of a GEO dataset along with some covariate information. Normally, you'd have to run a few commands in R, using GEOquery or a similar package, to make this happen. Instead, all you have to do now is:
pymethyl-preprocess download_geo -g GSE87571 -o geo_idats/
-g specifies the GEO accession to download, here our dataset GSE87571. -o sets the output directory; by default, the IDAT files are stored in a directory called geo_idats.
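After the command finishes, it can be worth confirming that every sample arrived with both its green and red channel files. The following is a minimal sketch, not part of pymethyl-preprocess: the helper name `find_unpaired_idats` is hypothetical, and it assumes the files keep GEO's naming scheme of `<basename>_Grn.idat` and `<basename>_Red.idat`, optionally gzipped.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches names such as GSM2333901_9370847096_R05C02_Grn.idat (optionally .gz).
# Assumption: downloaded files keep GEO's <basename>_<channel>.idat naming.
IDAT_PATTERN = re.compile(r"^(?P<base>.+)_(?P<channel>Grn|Red)\.idat(\.gz)?$")


def find_unpaired_idats(idat_dir):
    """Return {basename: [missing channels]} for samples lacking a Grn or Red file."""
    channels = defaultdict(set)
    for path in Path(idat_dir).iterdir():
        match = IDAT_PATTERN.match(path.name)
        if match:  # non-IDAT files (e.g. the clinical CSV) are skipped
            channels[match["base"]].add(match["channel"])
    return {
        base: sorted({"Grn", "Red"} - seen)
        for base, seen in channels.items()
        if seen != {"Grn", "Red"}
    }


if __name__ == "__main__":
    # Only runs the check if the download directory from the command above exists.
    if Path("geo_idats").is_dir():
        print(find_unpaired_idats("geo_idats") or "all samples have both channels")
```

An empty result means every basename found in the directory has both channels; anything else lists the samples worth re-downloading.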
Once the download finishes, that directory should contain the IDAT files along with a CSV file containing the clinical covariates. Let's take a quick look at the clinical covariate CSV file that this step outputs (you can also obtain it in less time, and with less data transferred, using the download_pheno CWL tool):
geo_idats/GSE87571_clinical_info.csv
| | Unnamed: 0 | title | geo_accession | status | submission_date | last_update_date | type | channel_count | source_name_ch1 | organism_ch1 | characteristics_ch1 | characteristics_ch1.1 | characteristics_ch1.2 | characteristics_ch1.3 | molecule_ch1 | extract_protocol_ch1 | label_ch1 | label_protocol_ch1 | taxid_ch1 | description | platform_id | contact_department | contact_institute | contact_city | contact_state | contact_zip/postal_code | contact_country | supplementary_file | supplementary_file.1 | data_row_count | age:ch1 | disease state:ch1 | gender:ch1 | tissue:ch1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | GSM2333901 | X1 genomic DNA from whole blood | GSM2333901 | Public on Oct 04 2016 | Oct 03 2016 | Oct 04 2016 | genomic | 1 | whole blood | Homo sapiens | gender: Male | age: 72 | tissue: whole blood | disease state: normal | genomic DNA | phenol:chloroform protocol | Cy5 and Cy3 | Standard Illumina Protocol | 9606 | normal whole blood | GPL13534 | IGP | Uppsala University | Uppsala | Sweden | 75108 | Sweden | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333901/suppl/GSM2333901_9370847096_R05C02_Grn.idat.gz | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333901/suppl/GSM2333901_9370847096_R05C02_Red.idat.gz | 0 | 72 | normal | Male | whole blood |
1 | GSM2333902 | X2 genomic DNA from whole blood | GSM2333902 | Public on Oct 04 2016 | Oct 03 2016 | Oct 04 2016 | genomic | 1 | whole blood | Homo sapiens | gender: Male | age: 55 | tissue: whole blood | disease state: normal | genomic DNA | phenol:chloroform protocol | Cy5 and Cy3 | Standard Illumina Protocol | 9606 | normal whole blood | GPL13534 | IGP | Uppsala University | Uppsala | Sweden | 75108 | Sweden | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333902/suppl/GSM2333902_9376538120_R06C01_Grn.idat.gz | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333902/suppl/GSM2333902_9376538120_R06C01_Red.idat.gz | 0 | 55 | normal | Male | whole blood |
2 | GSM2333903 | X3 genomic DNA from whole blood | GSM2333903 | Public on Oct 04 2016 | Oct 03 2016 | Oct 04 2016 | genomic | 1 | whole blood | Homo sapiens | gender: Male | age: 23 | tissue: whole blood | disease state: normal | genomic DNA | phenol:chloroform protocol | Cy5 and Cy3 | Standard Illumina Protocol | 9606 | normal whole blood | GPL13534 | IGP | Uppsala University | Uppsala | Sweden | 75108 | Sweden | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333903/suppl/GSM2333903_7766148053_R06C02_Grn.idat.gz | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333903/suppl/GSM2333903_7766148053_R06C02_Red.idat.gz | 0 | 23 | normal | Male | whole blood |
3 | GSM2333904 | X4 genomic DNA from whole blood | GSM2333904 | Public on Oct 04 2016 | Oct 03 2016 | Oct 04 2016 | genomic | 1 | whole blood | Homo sapiens | gender: Male | age: 86 | tissue: whole blood | disease state: normal | genomic DNA | phenol:chloroform protocol | Cy5 and Cy3 | Standard Illumina Protocol | 9606 | normal whole blood | GPL13534 | IGP | Uppsala University | Uppsala | Sweden | 75108 | Sweden | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333904/suppl/GSM2333904_7766148077_R04C02_Grn.idat.gz | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333904/suppl/GSM2333904_7766148077_R04C02_Red.idat.gz | 0 | 86 | normal | Male | whole blood |
4 | GSM2333905 | X5 genomic DNA from whole blood | GSM2333905 | Public on Oct 04 2016 | Oct 03 2016 | Oct 04 2016 | genomic | 1 | whole blood | Homo sapiens | gender: Male | age: 74 | tissue: whole blood | disease state: normal | genomic DNA | phenol:chloroform protocol | Cy5 and Cy3 | Standard Illumina Protocol | 9606 | normal whole blood | GPL13534 | IGP | Uppsala University | Uppsala | Sweden | 75108 | Sweden | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333905/suppl/GSM2333905_9370847096_R06C02_Grn.idat.gz | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333905/suppl/GSM2333905_9370847096_R06C02_Red.idat.gz | 0 | 74 | normal | Male | whole blood |
5 | GSM2333906 | X6 genomic DNA from whole blood | GSM2333906 | Public on Oct 04 2016 | Oct 03 2016 | Oct 04 2016 | genomic | 1 | whole blood | Homo sapiens | gender: Female | age: 76 | tissue: whole blood | disease state: normal | genomic DNA | phenol:chloroform protocol | Cy5 and Cy3 | Standard Illumina Protocol | 9606 | normal whole blood | GPL13534 | IGP | Uppsala University | Uppsala | Sweden | 75108 | Sweden | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333906/suppl/GSM2333906_7766148116_R01C02_Grn.idat.gz | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333906/suppl/GSM2333906_7766148116_R01C02_Red.idat.gz | 0 | 76 | normal | Female | whole blood |
6 | GSM2333907 | X7 genomic DNA from whole blood | GSM2333907 | Public on Oct 04 2016 | Oct 03 2016 | Oct 04 2016 | genomic | 1 | whole blood | Homo sapiens | gender: Male | age: 18 | tissue: whole blood | disease state: normal | genomic DNA | phenol:chloroform protocol | Cy5 and Cy3 | Standard Illumina Protocol | 9606 | normal whole blood | GPL13534 | IGP | Uppsala University | Uppsala | Sweden | 75108 | Sweden | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333907/suppl/GSM2333907_7766148116_R06C01_Grn.idat.gz | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333907/suppl/GSM2333907_7766148116_R06C01_Red.idat.gz | 0 | 18 | normal | Male | whole blood |
7 | GSM2333908 | X8 genomic DNA from whole blood | GSM2333908 | Public on Oct 04 2016 | Oct 03 2016 | Oct 04 2016 | genomic | 1 | whole blood | Homo sapiens | gender: Male | age: 38 | tissue: whole blood | disease state: normal | genomic DNA | phenol:chloroform protocol | Cy5 and Cy3 | Standard Illumina Protocol | 9606 | normal whole blood | GPL13534 | IGP | Uppsala University | Uppsala | Sweden | 75108 | Sweden | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333908/suppl/GSM2333908_9376538120_R01C02_Grn.idat.gz | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333908/suppl/GSM2333908_9376538120_R01C02_Red.idat.gz | 0 | 38 | normal | Male | whole blood |
8 | GSM2333909 | X9 genomic DNA from whole blood | GSM2333909 | Public on Oct 04 2016 | Oct 03 2016 | Oct 04 2016 | genomic | 1 | whole blood | Homo sapiens | gender: Female | age: 33 | tissue: whole blood | disease state: normal | genomic DNA | phenol:chloroform protocol | Cy5 and Cy3 | Standard Illumina Protocol | 9606 | normal whole blood | GPL13534 | IGP | Uppsala University | Uppsala | Sweden | 75108 | Sweden | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333909/suppl/GSM2333909_9379082138_R01C01_Grn.idat.gz | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333909/suppl/GSM2333909_9379082138_R01C01_Red.idat.gz | 0 | 33 | normal | Female | whole blood |
9 | GSM2333910 | X10 genomic DNA from whole blood | GSM2333910 | Public on Oct 04 2016 | Oct 03 2016 | Oct 04 2016 | genomic | 1 | whole blood | Homo sapiens | gender: Female | age: 34 | tissue: whole blood | disease state: normal | genomic DNA | phenol:chloroform protocol | Cy5 and Cy3 | Standard Illumina Protocol | 9606 | normal whole blood | GPL13534 | IGP | Uppsala University | Uppsala | Sweden | 75108 | Sweden | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333910/suppl/GSM2333910_9379082138_R02C01_Grn.idat.gz | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333910/suppl/GSM2333910_9379082138_R02C01_Red.idat.gz | 0 | 34 | normal | Female | whole blood |
It's easy to see that there are a lot of columns we do not need, and some that we would like to change. See the next step for formatting the data.
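To get a feel for which columns matter before moving on, the file can be read with Python's standard `csv` module. This is only an illustrative sketch and not how pymethyl-preprocess formats the data: the helper names `summarize_covariates` and `idat_basename` and the choice of columns in `KEEP` are assumptions made for this example.

```python
import csv
import os
import re

# Columns that look useful downstream; this selection is illustrative only.
KEEP = ["geo_accession", "age:ch1", "gender:ch1", "tissue:ch1", "disease state:ch1"]


def summarize_covariates(csv_path, keep=KEEP):
    """Read the clinical CSV and return one dict per sample with only `keep` columns."""
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        missing = [c for c in keep if c not in (reader.fieldnames or [])]
        if missing:
            raise KeyError(f"columns not found: {missing}")
        return [{c: row[c] for c in keep} for row in reader]


def idat_basename(url):
    """Strip the directory and the _Grn/_Red .idat(.gz) suffix from a supplementary_file URL."""
    name = url.rsplit("/", 1)[-1]
    return re.sub(r"_(Grn|Red)\.idat(\.gz)?$", "", name)


if __name__ == "__main__":
    path = "geo_idats/GSE87571_clinical_info.csv"
    if os.path.exists(path):
        # Print the first few samples with the reduced set of columns.
        for row in summarize_covariates(path)[:3]:
            print(row)
```

Applied to the first row's `supplementary_file` value, `idat_basename` yields `GSM2333901_9370847096_R05C02`, the prefix shared by that sample's Grn and Red files.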
Note: This repository is currently oriented toward IDAT preprocessing. I am working on a higher-throughput workflow for stored unmethylated and methylated intensity files (tab- or comma-separated values), which are available for many studies on GEO. I may soon add this as an experimental component of the pipeline; it would take you from these matrix files to the MethylationArray object introduced later.