
1. Downloading the Data

Joshua Levy edited this page Jul 11, 2019 · 2 revisions

Let's assume that you've already followed all of the installation instructions on the README page.

We're first going to download the IDATs of a GEO dataset along with some covariate information. Normally, you'd have to run a few commands in R to make this happen, using GEOquery or a similar package. Instead, all you have to do now is:

pymethyl-preprocess download_geo -g GSE87571 -o geo_idats/

-g selects a particular GEO dataset, in this case GSE87571, and -o sets the output directory. By default, the IDAT files are stored in a directory called geo_idats.
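If you would rather drive this step from a Python script, the same command can be wrapped with the standard library. This is only a minimal sketch and not part of PyMethylProcess: the helper names `build_download_command` and `download_geo` are hypothetical, and the sketch relies on nothing beyond the `-g` and `-o` options shown above.

```python
import shutil
import subprocess


def build_download_command(accession, out_dir="geo_idats/"):
    """Return the argument list for the download_geo call shown above."""
    return ["pymethyl-preprocess", "download_geo", "-g", accession, "-o", out_dir]


def download_geo(accession, out_dir="geo_idats/"):
    """Run the download; raises if the CLI is missing or exits non-zero."""
    cmd = build_download_command(accession, out_dir)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} is not on PATH; see the README install steps")
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only prints the command; call download_geo("GSE87571") to actually run it.
    print(" ".join(build_download_command("GSE87571")))
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` turns a failed download into a Python exception instead of a silent non-zero exit code.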

Once the data is downloaded, you should find a list of IDAT files in that directory along with a CSV file containing the clinical covariates. Let's take a quick look at the clinical covariate CSV file that is output from this step (you can also obtain it in less time and with less data transfer using the download_pheno CWL tool): geo_idats/GSE87571_clinical_info.csv

| | Unnamed: 0 | title | geo_accession | status | submission_date | last_update_date | type | channel_count | source_name_ch1 | organism_ch1 | characteristics_ch1 | characteristics_ch1.1 | characteristics_ch1.2 | characteristics_ch1.3 | molecule_ch1 | extract_protocol_ch1 | label_ch1 | label_protocol_ch1 | taxid_ch1 | description | platform_id | contact_department | contact_institute | contact_city | contact_state | contact_zip/postal_code | contact_country | supplementary_file | supplementary_file.1 | data_row_count | age:ch1 | disease state:ch1 | gender:ch1 | tissue:ch1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | GSM2333901 | X1 genomic DNA from whole blood | GSM2333901 | Public on Oct 04 2016 | Oct 03 2016 | Oct 04 2016 | genomic | 1 | whole blood | Homo sapiens | gender: Male | age: 72 | tissue: whole blood | disease state: normal | genomic DNA | phenol:chloroform protocol | Cy5 and Cy3 | Standard Illumina Protocol | 9606 | normal whole blood | GPL13534 | IGP | Uppsala University | Uppsala | Sweden | 75108 | Sweden | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333901/suppl/GSM2333901_9370847096_R05C02_Grn.idat.gz | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333901/suppl/GSM2333901_9370847096_R05C02_Red.idat.gz | 0 | 72 | normal | Male | whole blood |
| 1 | GSM2333902 | X2 genomic DNA from whole blood | GSM2333902 | Public on Oct 04 2016 | Oct 03 2016 | Oct 04 2016 | genomic | 1 | whole blood | Homo sapiens | gender: Male | age: 55 | tissue: whole blood | disease state: normal | genomic DNA | phenol:chloroform protocol | Cy5 and Cy3 | Standard Illumina Protocol | 9606 | normal whole blood | GPL13534 | IGP | Uppsala University | Uppsala | Sweden | 75108 | Sweden | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333902/suppl/GSM2333902_9376538120_R06C01_Grn.idat.gz | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333902/suppl/GSM2333902_9376538120_R06C01_Red.idat.gz | 0 | 55 | normal | Male | whole blood |
| 2 | GSM2333903 | X3 genomic DNA from whole blood | GSM2333903 | Public on Oct 04 2016 | Oct 03 2016 | Oct 04 2016 | genomic | 1 | whole blood | Homo sapiens | gender: Male | age: 23 | tissue: whole blood | disease state: normal | genomic DNA | phenol:chloroform protocol | Cy5 and Cy3 | Standard Illumina Protocol | 9606 | normal whole blood | GPL13534 | IGP | Uppsala University | Uppsala | Sweden | 75108 | Sweden | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333903/suppl/GSM2333903_7766148053_R06C02_Grn.idat.gz | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333903/suppl/GSM2333903_7766148053_R06C02_Red.idat.gz | 0 | 23 | normal | Male | whole blood |
| 3 | GSM2333904 | X4 genomic DNA from whole blood | GSM2333904 | Public on Oct 04 2016 | Oct 03 2016 | Oct 04 2016 | genomic | 1 | whole blood | Homo sapiens | gender: Male | age: 86 | tissue: whole blood | disease state: normal | genomic DNA | phenol:chloroform protocol | Cy5 and Cy3 | Standard Illumina Protocol | 9606 | normal whole blood | GPL13534 | IGP | Uppsala University | Uppsala | Sweden | 75108 | Sweden | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333904/suppl/GSM2333904_7766148077_R04C02_Grn.idat.gz | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333904/suppl/GSM2333904_7766148077_R04C02_Red.idat.gz | 0 | 86 | normal | Male | whole blood |
| 4 | GSM2333905 | X5 genomic DNA from whole blood | GSM2333905 | Public on Oct 04 2016 | Oct 03 2016 | Oct 04 2016 | genomic | 1 | whole blood | Homo sapiens | gender: Male | age: 74 | tissue: whole blood | disease state: normal | genomic DNA | phenol:chloroform protocol | Cy5 and Cy3 | Standard Illumina Protocol | 9606 | normal whole blood | GPL13534 | IGP | Uppsala University | Uppsala | Sweden | 75108 | Sweden | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333905/suppl/GSM2333905_9370847096_R06C02_Grn.idat.gz | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333905/suppl/GSM2333905_9370847096_R06C02_Red.idat.gz | 0 | 74 | normal | Male | whole blood |
| 5 | GSM2333906 | X6 genomic DNA from whole blood | GSM2333906 | Public on Oct 04 2016 | Oct 03 2016 | Oct 04 2016 | genomic | 1 | whole blood | Homo sapiens | gender: Female | age: 76 | tissue: whole blood | disease state: normal | genomic DNA | phenol:chloroform protocol | Cy5 and Cy3 | Standard Illumina Protocol | 9606 | normal whole blood | GPL13534 | IGP | Uppsala University | Uppsala | Sweden | 75108 | Sweden | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333906/suppl/GSM2333906_7766148116_R01C02_Grn.idat.gz | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333906/suppl/GSM2333906_7766148116_R01C02_Red.idat.gz | 0 | 76 | normal | Female | whole blood |
| 6 | GSM2333907 | X7 genomic DNA from whole blood | GSM2333907 | Public on Oct 04 2016 | Oct 03 2016 | Oct 04 2016 | genomic | 1 | whole blood | Homo sapiens | gender: Male | age: 18 | tissue: whole blood | disease state: normal | genomic DNA | phenol:chloroform protocol | Cy5 and Cy3 | Standard Illumina Protocol | 9606 | normal whole blood | GPL13534 | IGP | Uppsala University | Uppsala | Sweden | 75108 | Sweden | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333907/suppl/GSM2333907_7766148116_R06C01_Grn.idat.gz | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333907/suppl/GSM2333907_7766148116_R06C01_Red.idat.gz | 0 | 18 | normal | Male | whole blood |
| 7 | GSM2333908 | X8 genomic DNA from whole blood | GSM2333908 | Public on Oct 04 2016 | Oct 03 2016 | Oct 04 2016 | genomic | 1 | whole blood | Homo sapiens | gender: Male | age: 38 | tissue: whole blood | disease state: normal | genomic DNA | phenol:chloroform protocol | Cy5 and Cy3 | Standard Illumina Protocol | 9606 | normal whole blood | GPL13534 | IGP | Uppsala University | Uppsala | Sweden | 75108 | Sweden | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333908/suppl/GSM2333908_9376538120_R01C02_Grn.idat.gz | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333908/suppl/GSM2333908_9376538120_R01C02_Red.idat.gz | 0 | 38 | normal | Male | whole blood |
| 8 | GSM2333909 | X9 genomic DNA from whole blood | GSM2333909 | Public on Oct 04 2016 | Oct 03 2016 | Oct 04 2016 | genomic | 1 | whole blood | Homo sapiens | gender: Female | age: 33 | tissue: whole blood | disease state: normal | genomic DNA | phenol:chloroform protocol | Cy5 and Cy3 | Standard Illumina Protocol | 9606 | normal whole blood | GPL13534 | IGP | Uppsala University | Uppsala | Sweden | 75108 | Sweden | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333909/suppl/GSM2333909_9379082138_R01C01_Grn.idat.gz | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333909/suppl/GSM2333909_9379082138_R01C01_Red.idat.gz | 0 | 33 | normal | Female | whole blood |
| 9 | GSM2333910 | X10 genomic DNA from whole blood | GSM2333910 | Public on Oct 04 2016 | Oct 03 2016 | Oct 04 2016 | genomic | 1 | whole blood | Homo sapiens | gender: Female | age: 34 | tissue: whole blood | disease state: normal | genomic DNA | phenol:chloroform protocol | Cy5 and Cy3 | Standard Illumina Protocol | 9606 | normal whole blood | GPL13534 | IGP | Uppsala University | Uppsala | Sweden | 75108 | Sweden | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333910/suppl/GSM2333910_9379082138_R02C01_Grn.idat.gz | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333910/suppl/GSM2333910_9379082138_R02C01_Red.idat.gz | 0 | 34 | normal | Female | whole blood |
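
Each row points to a pair of IDAT files, one per colour channel, through the supplementary_file and supplementary_file.1 columns. The following is a sketch of how the shared file prefix could be recovered from those two URLs and checked for consistency. The function names `idat_basename` and `paired_basename` are hypothetical helpers written for this illustration, not PyMethylProcess functions, and the sketch assumes every URL ends in `_Grn.idat.gz` or `_Red.idat.gz`, as in the rows shown.

```python
import posixpath
from urllib.parse import urlparse

# Channel suffixes as they appear in the supplementary_file URLs.
SUFFIXES = ("_Grn.idat.gz", "_Red.idat.gz")


def idat_basename(url):
    """Strip the directory and the channel suffix from a supplementary_file URL."""
    name = posixpath.basename(urlparse(url).path)
    for suffix in SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    raise ValueError(f"not an IDAT URL: {url}")


def paired_basename(grn_url, red_url):
    """Return the shared basename, or raise if the two channels disagree."""
    grn, red = idat_basename(grn_url), idat_basename(red_url)
    if grn != red:
        raise ValueError(f"channel mismatch: {grn} vs {red}")
    return grn


# The two URLs of the first sample (GSM2333901) in the table.
grn = "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333901/suppl/GSM2333901_9370847096_R05C02_Grn.idat.gz"
red = "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2333nnn/GSM2333901/suppl/GSM2333901_9370847096_R05C02_Red.idat.gz"
print(paired_basename(grn, red))  # GSM2333901_9370847096_R05C02
```
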

It's easy to see that there are a lot of columns we do not need, and some that we would like to change. See the next step for formatting the data.
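
One quick way to spot columns that carry no information is to look for columns whose value never changes across samples. The sketch below does this with the standard library on a small excerpt: a handful of columns from rows 0, 1, and 5 of the table above, rewritten as CSV text. The function name `constant_columns` is hypothetical, and a column that is constant in this excerpt is not necessarily constant across the full dataset.

```python
import csv
import io

# A few columns from rows 0, 1, and 5 of the table above, as CSV text.
SAMPLE = """geo_accession,type,organism_ch1,platform_id,age:ch1,gender:ch1
GSM2333901,genomic,Homo sapiens,GPL13534,72,Male
GSM2333902,genomic,Homo sapiens,GPL13534,55,Male
GSM2333906,genomic,Homo sapiens,GPL13534,76,Female
"""


def constant_columns(rows):
    """Return the column names whose value is identical in every row."""
    if not rows:
        return []
    return [col for col in rows[0] if len({row[col] for row in rows}) == 1]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(constant_columns(rows))  # ['type', 'organism_ch1', 'platform_id']
```

To run the same check on the real file, replace `io.StringIO(SAMPLE)` with `open("geo_idats/GSE87571_clinical_info.csv", newline="")`.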

Note: This repository is currently oriented toward IDAT preprocessing. I am working on a higher-throughput workflow for stored unmethylated and methylated intensity files (tab- or comma-separated values), which are available for many studies on GEO. I may soon add this component as an experimental part of the pipeline; it will take you from these matrix files to the MethylationArray, which is introduced later.