-
Notifications
You must be signed in to change notification settings - Fork 28
/
epic2-demo.rmd
183 lines (147 loc) · 4.71 KB
/
epic2-demo.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
```{r epic2-demo-init, echo=FALSE, message=F}
library(knitr)
library(Cairo)
opts_chunk$set(warning=FALSE, fig.width=6, fig.height=6, dev="CairoPNG", stop=TRUE)
```
# Normalize an EPIC v2 dataset
## Download example data set
```{r child = 'dataset-epic2-demo.rmd'}
```
```{r}
path <- download.epic2.demo.dataset()
```
## Normalize dataset
Create samplesheet
```{r}
library(meffil)
samplesheet <- meffil.read.samplesheet(base=path, pattern="SampleSheet")
```
Parameters.
```{r}
qc.file <- "epic2-demo/qc-report.html"
author <- "Illumina, et al."
study <- "EPIC demo dataset"
number.pcs <- 2
norm.file <- "epic2-demo/normalization-report.html"
```
Generate QC objects.
```{r epic2-demo-qc, cache=T}
qc.objects <- meffil.qc(samplesheet, cell.type.reference="blood gse35069 complete", verbose=T)
## If you get an error relating to a function from
## the 'preprocessCore' R package,
## then you may need to reinstall it as follows:
## BiocManager::install("preprocessCore", configure.args="--disable-threading", force = TRUE)
```
QC report.
```{r}
qc.summary <- meffil.qc.summary(qc.objects, verbose=T)
meffil.qc.report(
qc.summary,
output.file=qc.file,
author=author,
study=study)
```
Normalization dataset.
```{r epic2-demo-norm, cache=T}
norm.objects <- meffil.normalize.quantiles(
qc.objects,
number.pcs=number.pcs,
verbose=T)
beta.meffil <- meffil.normalize.samples(
norm.objects,
just.beta=T,
cpglist.remove=qc.summary$bad.cpgs$name,
verbose=T)
```
Compute principal components of the normalized methylation matrix.
```{r epic2-demo-pcs, cache=T}
pcs <- meffil.methylation.pcs(
beta.meffil,
sites=meffil.get.autosomal.sites("epic"),
verbose=T)
```
Normalization report.
```{r}
parameters <- meffil.normalization.parameters(norm.objects)
parameters$batch.threshold <- 0.01
norm.summary <- meffil.normalization.summary(
norm.objects=norm.objects,
pcs=pcs,
parameters=parameters,
verbose=T)
meffil.normalization.report(
norm.summary,
output.file=norm.file,
author=author,
study=study)
```
## Horvath's clock and the EPIC v2 microarray
```{r}
#require(RCurl)
#clock <- read.csv(textConnection(getURL("https://labs.genetics.ucla.edu/horvath/dnamage/AdditionalFile3.csv")), comment.char="#", stringsAsFactors=F)
clock <- read.csv("AdditionalFile3.csv", comment.char="#", stringsAsFactors=F)
length(setdiff(clock$CpGmarker, rownames(beta.meffil)))
nrow(clock)
length(intersect(clock$CpGmarker, rownames(beta.meffil)))/nrow(clock)
```
Some of the 'clock' sites are missing.
## Hannum predictor CpG sites and the EPIC microarray
71 CpG sites were used in the Hannum et al. age predictor:
> Hannum G, et al.
> Genome-wide methylation profiles reveal quantitative views of human aging rates.
> Mol Cell. 2013 Jan 24;49(2):359-67.
The list can be obtained from here:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3780611/bin/NIHMS418935-supplement-02.xlsx
```{r}
hannum.sites <- readLines("hannum.txt")
length(hannum.sites)
length(setdiff(hannum.sites, rownames(beta.meffil)))
length(intersect(hannum.sites, rownames(beta.meffil)))/length(hannum.sites)
```
Some but not all of the sites are missing.
## Duplicated probes
Several thousand probes appear in duplicate across the microarray.
By default, the `meffil.normalize.samples` function collapses
duplicated probes by taking their median.
It is possible to a different summarizing function by setting the
`dup.fun` argument.
It is also possible to keep the duplicate probes by
setting `dup.fun` to `NULL` as follows:
```{r epic2-demo-dups,cache=T}
beta.dup <- meffil.normalize.samples(
norm.objects,
just.beta=T,
cpglist.remove=qc.summary$bad.cpgs$name,
dup.fun=NULL,
verbose=T)
```
We can then collapse the duplicate probes as follows:
```{r}
beta.nodup <- meffil.collapse.dups(beta.dup)
```
This function also has a `dup.fun` argument allowing the user
to specify a different summarizing function.
For example, cg06373096 appears
`r length(grep("cg06373096",rownames(beta.dup)))` times.
```{r}
quantile(beta.nodup["cg06373096",]-beta.meffil["cg06373096",],na.rm=T)
quantile(colMedians(beta.dup[grepl("cg06373096",rownames(beta.dup)),],na.rm=T)-beta.meffil["cg06373096",],na.rm=T)
```
Here are the correlations between these duplicates:
```{r}
cor(t(beta.dup[grep("cg06373096",rownames(beta.dup)),]))
colMeans(cor(t(beta.dup[grep("cg06373096",rownames(beta.dup)),])))
```
How do matrices differ when we collapse
duplicates (`beta.meffil` and `beta.dup`)
or not (`beta.dup`)?
```{r}
identical(rownames(beta.nodup), rownames(beta.meffil))
all(rownames(beta.nodup) %in% sub("_.*", "", rownames(beta.dup)))
all(sub("_.*", "", rownames(beta.dup)) %in% rownames(beta.nodup))
```
```{r}
dim(beta.meffil)
dim(beta.nodup)
dim(beta.dup)
```