This workflow uses the TCGAbiolinks package to download data from the NCI's Genomic Data Commons (GDC).
All files are stored as <cohort>.RData in their respective analysis directories.
The following software is required to run this workflow:
- A recent version of R
- The TCGAbiolinks package from Bioconductor
- GNU make
Optionally, the following R packages for post-processing:
- edgeR - for log2 CPM transformation of RNA-seq reads
- DESeq2 - for variance stabilizing transformation of RNA-seq reads
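As a rough illustration of the optional post-processing step, the sketch below applies edgeR's log2 CPM transformation to a count matrix. The toy matrix and the variable names (`counts`, `log_cpm`) are assumptions made for this example and are not part of the workflow; with real data, the counts would come from the downloaded object instead.

```r
# Toy count matrix (genes x samples) standing in for raw RNA-seq counts.
# Assumption: with a downloaded SummarizedExperiment `se`, the counts
# could be extracted with SummarizedExperiment::assay(se) instead.
set.seed(1)
counts <- matrix(
  rpois(40, lambda = 50),
  nrow = 10,
  dimnames = list(paste0("gene", 1:10), paste0("sample", 1:4))
)

# Run the transformation only if edgeR is installed.
if (requireNamespace("edgeR", quietly = TRUE)) {
  # log = TRUE returns log2 counts per million; prior.count avoids log(0).
  log_cpm <- edgeR::cpm(counts, log = TRUE, prior.count = 2)
  print(dim(log_cpm))
}
```

The result has the same dimensions as the input matrix. DESeq2's `vst()` can be applied to a count matrix in a similar way, although it subsamples 1000 genes by default and therefore needs a realistically sized matrix.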
There are three options to download and save TCGA data:
# Download everything
make # add the -j<n> flag to run n data sets in parallel
# Selection by cohort
# - see projects.txt for valid cohorts
make <cohort> # e.g. 'TCGA-LUAD' for lung adenocarcinoma
# Selection by data type
# - valid types are: snv_mutect2, rna_seq_raw, cnv_segments, mirna_seq, clinical
make <data type> # e.g. 'clinical' for downloading clinical data
Data will be stored as RData files (containing a data.frame or SummarizedExperiment object) for each cohort in the respective data type directories.
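Since the names of the objects inside each RData file are not stated here, one cautious way to read a file is to load it into a separate environment and inspect what it contains. The following is a minimal sketch: the helper `load_rdata()` and the demo file `TCGA-DEMO.RData` are hypothetical and exist only to make the example self-contained.

```r
# Hypothetical helper: load an RData file into its own environment and
# return its contents as a named list, leaving the global workspace clean.
load_rdata <- function(path) {
  env <- new.env()
  obj_names <- load(path, envir = env)  # load() returns the object names
  mget(obj_names, envir = env)
}

# Self-contained demo: write a toy data.frame to a temporary RData file.
# A real file would live at <data type>/<cohort>.RData.
demo_path <- file.path(tempdir(), "TCGA-DEMO.RData")
clinical_demo <- data.frame(patient = c("P1", "P2"), age = c(61, 58))
save(clinical_demo, file = demo_path)

objs <- load_rdata(demo_path)
print(names(objs))       # names of the stored objects
print(class(objs[[1]]))  # data.frame here; may be SummarizedExperiment for real data
```

Loading into a dedicated environment avoids silently overwriting objects of the same name when several cohorts are read in one session.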
The data processing steps underlying the downloaded data are fully documented on the GDC website.