Nov 2021
Created by Matthew Suderman [email protected]
- Prepare the data for analysis
- Conduct an eQTL analysis using the MatrixEQTL R package
- Produce plots to visually inspect findings
- (Optional) Conduct an meQTL analysis
- (Optional) Identify trios of associated genes, CpG sites and SNPs
Clone a copy of this tutorial in your home directory.
git clone perishky/matrixeqtl-tutorial
Change to cloned repository for this practical:
cd ~/matrixeqtl-tutorial
It is good practice to save outputs to another directory so that it is easy to distinguish output files from other types of files such as data, script or documentation files. We create the output directory as follows:
mkdir output
This tutorial can be run on BlueCrystal Phase 4. Login to the system using PuTTY and the following server name: bc4login.acrc.bris.ac.uk
You shouldn't run this tutorial on the login node. Run the following command to request an interactive session on a compute node:
salloc --nodes=1 --ntasks=1 --mem=3G --time=02:00:00
srun --job-name="example" --pty bash -i
Change to the repository directory and create the output directory.
cd ~/matrixeqtl-tutorial
mkdir output
You are not ready to for analysis.
In this practical we will run eQTL and meQTL analyses on a small publicly available dataset:
Wagner, et al. The relationship between DNA methylation, genetic and expression inter-individual variation in untransformed human fibroblasts. Genome Biol. 2014 Feb 20;15(2):R37. GEO link
It includes:
- genotype (Illumina Human1M-Duov3 DNA Analysis BeadChip),
- DNA methylation (Illumina HumanMethylation450 BeadChip), and
- gene expression (Illumina HumanRef-8 v3.0 Expression Beadchip) profiles from 57 human skin fibroblast samples obtained from Coriell and McGill Cellbank.
- Preprocessing.
a. Population stratification.
b. Gene expression outliers.
c. Matching samples between datasets.
d. Non-genetic variation removal. - Software and data.
a. R package MatrixEQTL.
b. Genetic data.
c. Population stratification covariates.
d. Expression data.
e. Gene and SNP locations. - eQTL analysis.
- Aftermath.
a. Number of associations.
b. Top associations.
c. Association plotting. - (Optional) meQTL analysis.
- (Optional) Associated trios.
This is detected by visualizing genetic distances between all pairs of individuals using multi-dimensional scaling (MDS).
We will use PLINK for these computations. Please ensure that PLINK is installed before continuing.
On the Bluecrystal Phase 4 cluster, PLINK can be loaded as a module (so no need to install it).
module add apps/plink/2.00
Calculate genetic distances:
plink --bfile data/snp --genome --out output/plink
The output output/plink.genome
file in the current directory
gives all pairwise distances (see the DST
column).
You can type the following command to see the first few lines in the file:
head output/plink.genome
We represent these distances in two dimensions as follows:
plink --bfile data/snp --read-genome output/plink.genome --cluster --mds-plot 2 --out output/plink
This generates a file output/plink.mds
that looks something like this:
FID IID SOL C1 C2
1 GM02704 0 -0.00860421 -0.0114082
2 GM02706 0 -0.0145491 0.0005202
3 GM01650 0 -0.00431194 -0.00171059
4 GM01653 0 -0.00306369 -0.00175129
5 GM02640 0 -0.0035343 -0.00811608
6 GM02641 0 -0.00876442 -0.00616735
FID - Family ID;
IID - Individual ID (or sample name);
SOL - Assigned solution code;
C1 - Position on first dimension;
C2 - Position on second dimension.
Ensure that R is installed.
On the Bluecrystal Phase 4 cluster, R can be loaded as a module (so no need to install it).
module add languages/r/4.1.0
To start R, simply type R
.
R
We'll send outputs from R to the output directory we set up earlier. For convenience, we?ll save the location as a variable.
out.dir <- "output"
We now plot the coordinates in output/plink.mds
in R
as follows:
snp.mds <- read.table(file.path(out.dir, "plink.mds"), header=T, sep="", stringsAsFactors=F)
pdf(file.path(out.dir, "mds.pdf"))
plot(snp.mds[,"C1"], snp.mds[,"C2"], main="SNP MDS plot", xlab="first", ylab="second", pch=19)
dev.off()
Question: Can you identify the four potential outlier samples in the plot?
Answer:
snp.mds[which(snp.mds[,"C1"] > 0.06 | snp.mds[,"C2"] > 0.06),]
FID IID SOL C1 C2
39 39 WG2120 0 -0.0336041 0.0728445
40 40 WG2121 0 -0.0323253 0.0719037
43 43 WG1977 0 0.0725762 0.0344586
44 44 WG1978 0 0.0713593 0.0332588
Given how far these samples differ from the rest, we might actually consider removing them from the analysis. In this case, we'll attempt to adjust for these differences.
We could use a similar approach to identify gene expression outliers.
rna <- read.csv("data/rna-data.csv", row.names=1)
rna.dist <- 1-abs(cor(rna))
rna.mds <- cmdscale(rna.dist,eig=TRUE, k=2)
pdf(file.path(out.dir, "mds-rna.pdf"))
plot(rna.mds$points[,1], rna.mds$points[,2],
main="RNA MDS plot", xlab="first", ylab="second", pch=19)
dev.off()
All too often samples are mislabelled in the lab. Fortunately, it is possible to computationally detect mismatches between genetic and gene expression data using MixupMapper:
http://genenetwork.nl/wordpress/mixupmapper/
In this case, we will skip this step, assuming that the study authors haven't made such errors!
It is possible to do the same for genetic and DNA methylation data, particularly for data generated using the Illumina HumanMethylation450 BeadChip. It actually measures a selection of SNPs in addition to the methylation levels.
Gene expression measurements can be influenced by factors other than genetic variation. To maximize power to detect associations between genes and genetic variants, we can remove all non-genetic variation from the gene expression data. To save time, we will not do that here. In case you're curious, here are the basic steps:
- Calculate the principal components of the gene expression data.
- Test associations between principal components and genetic variants.
- Remove any components with strong genetic associations.
- Remove variation of the remaining principal components from the gene expression data.
If MatrixEQTL
has not already been installed, we install it as follows:
BiocManager::install("MatrixEQTL")
# Respond by typing 'yes' to any questions.
We then load the library in R as follows:
library(MatrixEQTL)
MatrixEQTL operates on SlicedData
objects.
These objects provide access to large datasets
without using huge amounts of computer memory.
The genetic data is contained in the file data/snp-data.csv
.
We load it as follows:
snp.data <- SlicedData$new() ## create a new SlicedData object
snp.data$fileDelimiter <- "," ## columns in the file are separated by commas
snp.data$fileOmitCharacters <- "NA" ## missing values are represented by "NA"
snp.data$fileSkipRows <- 1 ## the first column provides row names
snp.data$fileSkipColumns <- 1 ## the first row provides column names
snp.data$fileSliceSize <- 2000 ## only load the data 2000 rows at a time to save memory
snp.data$LoadFile("data/snp-data.csv")
Loading the data takes about half a minute.
When we test associations between genetic variants, we will need to adjust for population stratification. This is encoded by the multi-dimensional scaling coordinates that we created earlier.
As required by MatrixEQTL
,
we load them as SlicedData
objects:
covariates.data <- SlicedData$new()
covariates.data$initialize(t(snp.mds[,c("C1","C2")]))
Next we convert the gene expression data to SlicedData
objects
and verify that samples match the genotype data samples.
rna.data <- SlicedData$new() ## create a new SlicedData object
rna.data$fileDelimiter <- "," ## columns in the file are separated by commas
rna.data$fileOmitCharacters <- "NA" ## missing values are represented by "NA"
rna.data$fileSkipRows <- 1 ## the first column provides row names
rna.data$fileSkipColumns <- 1 ## the first row provides column names
rna.data$fileSliceSize <- 2000 ## only load the data 2000 rows at a time to save memory
rna.data$LoadFile("data/rna-data.csv")
Don't forget to make sure that the gene expression and genetic datasets have the samples in the same order!
stopifnot(all(colnames(snp.data) == colnames(rna.data)))
Question: If we tested associations between all SNP-gene pairs, how many tests would we have to perform?
Answer:
nrow(snp.data) * nrow(rna.data)
## [1] 869593928
That's a lot of tests. To save time, we will only consider
cis pairs (i.e. SNPs that are with 1 million bases of the gene).
MatrixEQTL
can do this for us if we provide
SNP and gene locations in the genome:
snp.loc <- read.csv("data/snp-features.csv", row.names=1)
rna.loc <- read.csv("data/rna-features.csv", row.names=1)
Question:
MatrixEQTL
requires a p-value threshold for significance. To adjust for the number of tests, we could divide 0.05 by the number of tests to be performed. Because we only test cis pairs, the exact number of difficult to estimate. Can you think of a simple lower bound and upper bound on the number of tests?
Answer:
lower.bound <- nrow(snp.data)
upper.bound <- nrow(snp.data) * nrow(rna.data)
We'll use the lower bound to estimate a p-value threshold so that we don't miss any associations with statistically significant p-values.
threshold <- 0.05/lower.bound
Finally we are ready to run the eQTL analysis.
eqtl <- Matrix_eQTL_main(snps = snp.data,
gene = rna.data,
cvrt = covariates.data, ## population stratification
pvOutputThreshold = 0, ## consider only cis pairs
pvOutputThreshold.cis = threshold,
output_file_name.cis = file.path(out.dir, "eqtl-cis.txt"),
snpspos = snp.loc,
genepos = rna.loc[,c("geneid","chr","left","right")],
cisDist = 1e6, ## define cis as < 1Mb
useModel = modelLINEAR, ## test using linear models
verbose = TRUE,
pvalue.hist = TRUE,
min.pv.by.genesnp = FALSE,
noFDRsaveMemory = FALSE)
This takes about 30 seconds.
Here is the exact number of tests that were performed:
eqtl$cis$ntests
## [1] 4863929
Question: Above we guessed 1 test per SNP. How many tests were performed on average for each SNP? What p-value threshold should we apply given this number?
Answer:
eqtl$cis$ntests/nrow(snp.data) ## number of tests per SNP
## [1] 27.28987
threshold <- 0.05/eqtl$cis$ntests ## correct p-value threshold
At this threshold (p < 1.03e-08), we identify hundreds of associations.
eqtls <- eqtl$cis$eqtls
eqtls <- eqtls[which(eqtls$pvalue < threshold),]
nrow(eqtls)
## [1] 468
Question: How many eQTLs did we identify?
Answer:
length(unique(eqtls$snps))
## [1] 403
We will plot the strongest association:
idx <- which.min(eqtls$pvalue)
eqtls[idx,]
## snps gene statistic pvalue FDR beta
## 1 rs1025996 ILMN_2100085 -21.33979 1.096262e-27 8.063919e-22 -0.4323303
We save the corresponding gene and SNP:
top.gene <- as.character(eqtls$gene[idx])
top.snp <- as.character(eqtls$snps[idx])
top.gene
## [1] "ILMN_2100085"
top.snp
## [1] "rs1025996"
To plot the association, we will need to extract the
SNP and gene expression values for the pair.
This can be a bit painful using SlicedData
objects
because they don't support direct access to individual
rows or columns.
Here we create our own function to do this.
Don't worry about how it works, just copy and paste into R.
sliced.data.row <- function(object, row) {
if (is.factor(row))
row <- as.character(row)
if (is.character(row)) {
row <- which(rownames(object) == row)
if (length(row) == 0)
stop("Invalid row")
}
slice.size <- nrow(object[[1]])
slice.idx <- floor(row/slice.size) + 1
slice <- object[[slice.idx]]
row <- row - slice.size * (slice.idx - 1)
ret <- slice[row,]
names(ret) <- colnames(object)
ret
}
We can use this function to obtain the genotypes for any given SNP. For example, below we extract data for the top SNP-gene pair.
genotypes <- sliced.data.row(snp.data, top.snp)
genotypes <- factor(genotypes, levels=0:2, labels=c("AA","AB","BB"))
expression.levels <- sliced.data.row(rna.data, top.gene)
We plot their association as follows:
pdf(file.path(out.dir, "top-snp-gene.pdf"))
boxplot(expression.levels ~ genotypes,
main=paste(top.snp,top.gene),
ylim=range(expression.levels),
outline=F)
stripchart(expression.levels ~ genotypes,
method="jitter", add=TRUE, vertical=TRUE,
col=c("blue","orange","black"), pch=19)
dev.off()
cpg.data <- SlicedData$new()
cpg.data$fileDelimiter <- "," ## columns are separated by commas
cpg.data$fileOmitCharacters <- "NA" ## missing values are represented by "NA"
cpg.data$fileSkipRows <- 1 ## the first column provides row names
cpg.data$fileSkipColumns <- 1 ## the first row provides column names
cpg.data$fileSliceSize <- 2000 ## only load the data 2000 rows at a time
cpg.data$LoadFile("data/cpg-data.csv")
cpg.loc <- read.csv("data/cpg-features.csv", row.names=1)
threshold <- 0.05/nrow(snp.data)
meqtl <- Matrix_eQTL_main(snps = snp.data,
gene = cpg.data,
cvrt = covariates.data,
pvOutputThreshold = 0, ## ignore trans associations
pvOutputThreshold.cis = threshold,
output_file_name.cis = file.path(out.dir, "meqtl-cis.txt"),
snpspos = snp.loc,
genepos = cpg.loc[,c("geneid","chr","left","right")],
cisDist = 1e6, ## define cis as < 1Mb
useModel = modelLINEAR,
verbose = TRUE,
pvalue.hist = TRUE,
min.pv.by.genesnp = FALSE,
noFDRsaveMemory = FALSE)
threshold <- 0.05/meqtl$cis$ntests
meqtls <- meqtl$cis$eqtls
meqtls <- meqtls[which(meqtls$pvalue < threshold),]
Such trios are of interest because they could indicate cases where a SNP modifies methylation levels which in turn modify gene expression levels.
Construct the trios:
trios <- merge(data.frame(snp=eqtls$snps, gene=eqtls$gene, eqtl.statistic=eqtls$statistic),
data.frame(snp=meqtls$snps, cpg=meqtls$gene, meqtl.statistic=meqtls$statistic))
There are 168 trios.
Calculate p-values for each gene/CpG site association:
trios$p.value <- sapply(1:nrow(trios), function(i) {
expr <- sliced.data.row(rna.data, trios$gene[i])
meth <- sliced.data.row(cpg.data, trios$cpg[i])
cor.test(meth, expr)$p.value
})
Identify the strongest gene/CpG site association.
idx <- which.min(trios$p.value)
trios[idx,]
## snp gene eqtl.statistic cpg meqtl.statistic
## 5 rs10220917 ILMN_1656045 9.664163 cg25118879 -13.50116
## p.value
## 5 2.513491e-15