bootcamp_intro.Rmd

---
title: SPEAQeasy bootcamp
author: 
  - name: Leonardo Collado-Torres
    affiliation:
    - &libd Lieber Institute for Brain Development, Johns Hopkins Medical Campus
    - &ccb Center for Computational Biology, Johns Hopkins University
    email: lcolladotor@gmail.com
output: 
  BiocStyle::html_document:
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: show
---

This document is part of the _R/Bioconductor-powered Team Data Science_ [LIBD bootcamps](https://lcolladotor.github.io/bioc_team_ds/rbioconductor-data-science-bootcamps.html#libd-bootcamps).

You can download the [**latest version of this Rmd document here**](https://github.com/LieberInstitute/SPEAQeasy-example/blob/master/bootcamp_intro.Rmd). ^[This was useful particularly during October 5-7 while I was updating this document frequently.]

# Setup

Please install the latest:

* R version from [CRAN](https://cran.r-project.org/)
* [RStudio Destkop](https://rstudio.com/products/rstudio/download/) ^[You might want to try out [RStudio Desktop Preview](https://rstudio.com/products/rstudio/download/preview/) as discussed on the [2020-10-02 LIBD rstats club session](https://twitter.com/LIBDrstats/status/1312187753371561984?s=20). ]

Once you have installed R and RStudio, please install the following R packages:

```{r "install_packages", eval = FALSE}
## For installing R packages from CRAN and GitHub
if (!requireNamespace("remotes", quietly = TRUE)) {
    install.packages("remotes")
}

## For installing Bioconductor packages
remotes::install_cran("BiocManager")

## Workflow-related R packages
remotes::install_cran(c(
    "rmarkdown",
    "sessioninfo",
    "getopt",
    "here",
    "tidyr",
    "usethis"
))
remotes::install_github("LieberInstitute/jaffelab")

## Differential-expression related R packages
BiocManager::install(c("limma", "edgeR", "SummarizedExperiment", "recount", "ExploreModelMatrix"))

## Genomics-related R packages
BiocManager::install(c(
    "clusterProfiler", "org.Hs.eg.db",
    "VariantAnnotation"
))

## Visualization-related R packages
remotes::install_cran(c(
    "pheatmap",
    "RColorBrewer",
    "ggplot2",
    "plotly"
))
BiocManager::install(c("iSEE", "enrichplot"))

## Extra ones
remotes::install_cran(c("knitcitations", "statmod"))
BiocManager::install(c("BiocStyle"))
```

# Bootcamp plan

* Day 1: Monday October 5th 2020, 3-5 pm
  - RNA sequencing primer
  - LIBD's RNA-seq processing pipeline: SPEAQeasy
  - `r BiocStyle::Biocpkg("SummarizedExperiment")` overview
  - Exploring quality control metrics
* Day 2: Tuesday October 6th 2020, 3-5 pm
  - Exploring gene expression
  - Identifying sample swaps
  - Statistical modeling
* Day 3: Wednesday October 7th 2020, 1-3 pm
  - Identifying differentially expressed genes
  - Visualizing differentially expressed genes
  - Gene ontology enrichment analyses
  - Beyond SPEAQeasy
  

# Day 1

## RNA sequencing primer

[Slides](https://docs.google.com/presentation/d/1X3U2lcf75Yh9BK_Scs-O3M8b9-JauUNN8ZO_xhlKs7w/edit?usp=sharing)

For more details, I have some older slides on the technology for RNA-seq from [PDCB-HTS 2010](http://lcolladotor.github.io/courses/PDCB-HTS.html).

## SPEAQeasy overview

Let's go to the [SPEAQeasy](http://research.libd.org/SPEAQeasy/) documentation website.

We'll mostly be using the [SPEAQeasy main output files](http://research.libd.org/SPEAQeasy/outputs.html). 

### SPEAQeasy-example

Example data using [SPEAQeasy-example](http://research.libd.org/SPEAQeasy-example/) with 40 samples from the Bipolar project with Peter Zandi et al.

It shows:

1. How to download the example data
1. How to run SPEAQeasy with this data
1. How to identify sample swaps
1. How to run differential expression analyses with this data

We'll view these links, but we'll start with:

* [Home](http://research.libd.org/SPEAQeasy-example/index.html)
* [Download data](http://research.libd.org/SPEAQeasy-example/prepare_data.html)

## SummarizedExperiment overview

### General

In general:

* `r BiocStyle::Biocpkg("SummarizedExperiment")`
  - [Introduction vignette](http://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html)
* Bioconductor 2015 paper
  - DOI: [10.1038/nmeth.3252](https://doi.org/10.1038/nmeth.3252)
  
### Example data

Let's look at the example data

```{r "load_data"}
## For loading the data
library("SummarizedExperiment")
library("here")

## Load the example data
load(here("rse_speaqeasy.RData"), verbose = TRUE)
```

Let's look at the sample metadata we have in this example data. Some of these variables are defined in the [SPEAQeasy documentation](http://research.libd.org/SPEAQeasy/outputs.html#quality-metrics) that we'll be using shortly.

```{r "metadata"}
colData(rse_gene)
```

SPEAQeasy can align data to the human or mouse genomes using specific Gencode annotations. The example data we are working with is from human using Gencode v32.

```{r "gene data"}
rowData(rse_gene)
```

The bulk of the data is stored in the gene count matrix generated using `featureCounts` by `SPEAQeasy`. This is a large matrix with one row per gene and one column per sample. As such, we'll just explore a tiny piece of it.

```{r "count data"}
## The names of the count matrices we have
assayNames(rse_gene)

## The full "counts" matrix
dim(assay(rse_gene, "counts"))

## The top left corner of the matrix counts
jaffelab::corner(assay(rse_gene, "counts"))
```


## Exploring quality control metrics

The [SPEAQeasy documentation](http://research.libd.org/SPEAQeasy/outputs.html#quality-metrics) explains the meaning of some of the quality metrics that are included in the RSE files generated by this processing pipeline. Let's explore these metrics and their relationship to some of our sample metadata we manually included, such as the age at time of death and primary diagnosis.

We can start with some boxplots comparing some metrics against primary diagnosis and brain region, as shown in the [SPEAQeasy-example differential expression script](http://research.libd.org/SPEAQeasy-example/de_analysis_speaqeasy.html#14_Stats_vs_diagnosis_and_brain_region).

We can also use tools like `r BiocStyle::CRANpkg("ggplot2")` and `r BiocStyle::CRANpkg("plotly")` to interactively explore some metrics. For doing so, it'll be useful to check some of our LIBD rstats club sessions on:

* [High quality graphics in R](https://twitter.com/LIBDrstats/status/1308934265267195905?s=20)
* [Interactive graphics with plotly](https://twitter.com/LIBDrstats/status/1304799631386308608?s=20)


### Exercise

Adapt the interactive graphics code from the LIBD rstats club session available [here](https://gist.github.com/lcolladotor/350ac8843e153d135f64880a1d029b14) for our data.


### Exercise hints

```{r "interactive_exploration"}
## First we load the required packages
library("ggplot2")
library("plotly")

## Next, we convert the sample metadata to a data.frame
## which will make it easier to work with
pd_df <- as.data.frame(colData(rse_gene))

## We need a "key" variable that is unique

## Let's make the key

## Let's make a "highlighted" table

## Make a plot using the highlight table

## Make a second plot

## Convert them to interactive plots

## Now group them together

## Fancy one
```

### Exercise solution

#### Step 1

As a first step, let's add the code from the gist that is listed below each of the comments from the previous [code](https://gist.github.com/lcolladotor/350ac8843e153d135f64880a1d029b14).

```{r "interactive_exploration_step1", eval = FALSE}
## First we load the required packages
library("ggplot2")
library("plotly")

## Next, we convert the sample metadata to a data.frame
## which will make it easier to work with
pd_df <- as.data.frame(colData(rse_gene))

## We need a "key" variable that is unique
length(unique(pd_df$RNum))
nrow(pd_df)

## Let's make the key
pd_df$key <- pd_df$RNum

## Let's make a "highlighted" table
pd_key <- highlight_key(pd_df, ~key)

## Make a plot using the highlight table
gg_mean_mito_vs_mean_gene <-
    ggplot(
        pd_key,
        aes(x = mean_mitoRate, y = mean_totalAssignedGene, color = Region)
    ) +
    geom_point()
gg_mean_mito_vs_mean_gene

## Make a second plot
gg_mean_mito_vs_mean_RIN <-
    ggplot(pd_key, aes(x = mean_mitoRate, y = mean_RIN, color = Region)) +
    geom_point()
gg_mean_mito_vs_mean_RIN

## Convert them to interactive plots
p_mean_mito_vs_mean_gene <- ggplotly(gg_mean_mito_vs_mean_gene)
p_mean_mito_vs_mean_RIN <- ggplotly(gg_mean_mito_vs_mean_RIN)

## Now group them together
p_merged <- subplot(
    p_mean_mito_vs_mean_gene,
    p_mean_mito_vs_mean_RIN,
    nrows = 1,
    shareX = TRUE,
    shareY = FALSE,
    which_layout = 2
)

## Fancy one
p_fancy <- highlight(
    p_merged,
    on = "plotly_click",
    selectize = TRUE,
    dynamic = TRUE,
    persistent = TRUE
)
```

#### Step 2

Now that we have the code, we need to adapt it. Some variable names are slightly different, some don't exist. This process will typically involve several error messages while we figure out what's there and what's different. The end result will look like the following code.


```{r "interactive_exploration_answer"}
## First we load the required packages
library("ggplot2")
library("plotly")

## Next, we convert the sample metadata to a data.frame
## which will make it easier to work with
pd_df <- as.data.frame(colData(rse_gene))

## We need a "key" variable that is unique
## We replaced RNum by SAMPLE_ID
length(unique(pd_df$SAMPLE_ID))
nrow(pd_df)

## Let's make the key
pd_df$key <- pd_df$SAMPLE_ID

## Let's make a "highlighted" table
pd_key <- highlight_key(pd_df, ~key)

## Make a plot using the highlight table
## Here we remoted "mean_" from the variable and object names
## and we changed Region to BrainRegion
gg_mito_vs_gene <-
    ggplot(
        pd_key,
        aes(x = mitoRate, y = totalAssignedGene, color = BrainRegion)
    ) +
    geom_point()

## Make a second plot
## We don't have RIN, so we'll use rRNA_rate here
gg_mito_vs_rRNA_rate <-
    ggplot(pd_key, aes(x = mitoRate, y = rRNA_rate, color = BrainRegion)) +
    geom_point()

## Convert them to interactive plots
p_mito_vs_gene <- ggplotly(gg_mito_vs_gene)
p_mito_vs_rRNA_rate <- ggplotly(gg_mito_vs_rRNA_rate)

## Now group them together
p_merged <- subplot(
    p_mito_vs_gene,
    p_mito_vs_rRNA_rate,
    nrows = 1,
    shareX = TRUE,
    shareY = FALSE,
    which_layout = 2
)

## Fancy one
p_fancy <- highlight(
    p_merged,
    on = "plotly_click",
    selectize = TRUE,
    dynamic = TRUE,
    persistent = TRUE
)

## End result
p_fancy
```


# Day 2

You can download the [**latest version of this Rmd document here**](https://github.com/LieberInstitute/SPEAQeasy-example/blob/master/bootcamp_intro.Rmd).

## Exploring gene expression

One way to explore gene expression data as a whole is to use a dimension reduction technique such as principal component analysis (PCA) or multidimensional scaling (MDS) ^[For details about them, check [this thread on CrossValidated](https://stats.stackexchange.com/questions/14002/whats-the-difference-between-principal-component-analysis-and-multidimensional).]. This is typically done with a subset of highly variable genes for computational speed, though with smaller datasets, you can do it with all the data.

In the [SPEAQeasy-example](http://research.libd.org/SPEAQeasy-example/de_analysis_speaqeasy.html#15_Explore_and_visualize_gene_expression), PCA was computed on the `log2(RPKM + 1)` data.

Once we have the top principal components (PCs), we can then make scatterplots or boxplots comparing them against some known covariates from our study as well as quality metrics generated by SPEAQeasy. It is also useful to compare the top PCs (the ones that explain the most variance) and color the samples (dots) by some grouping variables we are interested in; check [this example](http://bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html#pca-plot) where the main covariate is associated with PC1 that explains ~50% of the variance. This process can reveal potential batch effects that we need to consider in our statistical models.

### Exercise

* Compute the PCs just like in the [SPEAQeasy-example](http://research.libd.org/SPEAQeasy-example/de_analysis_speaqeasy.html#15_Explore_and_visualize_gene_expression).
* Adapt the PC1 vs PC2 code to make three new scatterplots:
  - PC1 vs `mitoRate`,
  - PC1 vs `totalAssignedGene`,
  - PC1 vs `rRNA_rate`
* Optional: make interactive versions of these graphs.

Here's some code to get you started.

```{r "pcs"}
## Load some required packages
library("recount")
library("jaffelab")

# Filter for expressed
rse_gene <- rse_gene[rowMeans(getRPKM(rse_gene, "Length")) > 0.2, ]

# Explore gene expression
geneExprs <- log2(getRPKM(rse_gene, "Length") + 1)
set.seed(20201006)
pca <- prcomp(t(geneExprs))
pca_vars <- getPcaVars(pca)
pca_vars_lab <- paste0(
    "PC", seq(along = pca_vars), ": ",
    pca_vars, "% Var Expl"
)
```


## Identifying sample swaps

One of the main outputs from SPEAQeasy is a VCF file with ~740 common coding variants that you can compare across RNA-seq samples to identify issues with sample identity, or even better, against DNA genotyping data. [SPEAQeasy-example](http://research.libd.org/SPEAQeasy-example/swap_speaqeasy.html) includes an example where we do the comparison and identify a few samples that are problematic.

Resolving sample identities is a headache-inducing process, since some issues are more complex than others to resolve and the decisions you make in how to prioritize the evidence will lead to different conclusions.

## Statistical modeling

There are many Bioconductor packages that enable us to perform the statistical tests for differential expression analysis. Some of the main ones are `r BiocStyle::Biocpkg("limma")`, `r BiocStyle::Biocpkg("edgeR")`, and `r BiocStyle::Biocpkg("DESeq2")`. In our recent projects we have mostly used `r BiocStyle::Biocpkg("limma")` due to its performance and speed, which matters most when analyzing exon or exon-exon junction expression data (count matrices with thousands of rows more than gene expression data).

### Normalization

Prior to running a differential expression analysis, we first have to make the data comparable. This is done through a process called _normalization_. Many methods exist across Bioconductor packages for doing this. Given that we like to use `limma-voom` for the modeling, we'll use the companion normalization methods from `r BiocStyle::Biocpkg("edgeR")`. The `RNAseq123` Bioconductor workflow explains why normalization is important, so let's [take a look](http://bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/limmaWorkflow.html#normalising-gene-expression-distributions).

Let's use `edgeR::calcNormFactors()` with the SPEAQeasy-example data.

```{r "calc_norm_factors"}
## For the normalization functions
library("edgeR")

## From
## http://research.libd.org/SPEAQeasy-example/de_analysis_speaqeasy.html#16_Modeling
dge <- DGEList(
    counts = assays(rse_gene)$counts,
    genes = rowData(rse_gene)
)
dge <- calcNormFactors(dge)
```


### Design matrix

Differential expression analysis methods rely on the user specifying a _design matrix_. The simplest way to think about it is in terms of a linear regression, although some models use other distributions. R provides a useful function for making design matrices called `model.matrix()` that can take a _formula_ input such as `~ group + lane` or `~ 0 + group + lane` as shown [in RNAseq123](http://bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/limmaWorkflow.html#differential-expression-analysis).

We can spend more time learning about regression starting with [linear regression](https://lcolladotor.github.io/bioc_team_ds/helping-others.html#linear-regression-example). However a user-friendly way that is quite powerful is implemented in the `r BiocStyle::Biocpkg("ExploreModelMatrix")` package. Let's look at the [first example](http://bioconductor.org/packages/release/bioc/vignettes/ExploreModelMatrix/inst/doc/ExploreModelMatrix.html) from their vignette.


#### Exercise

1. Create the following model matrix.

```{r "model"}
## Clean up the PrimaryDx variable
rse_gene$PrimaryDx <- factor(rse_gene$PrimaryDx)

## Modified from
## http://research.libd.org/SPEAQeasy-example/de_analysis_speaqeasy.html#16_Modeling
mod <- model.matrix(~ PrimaryDx + BrainRegion,
    data = colData(rse_gene)
)
```

2. Use `r BiocStyle::Biocpkg("ExploreModelMatrix")` to visualize the design matrix.


You might need to install `r BiocStyle::Biocpkg("ExploreModelMatrix")` first as shown below.

```{r "install_explore_model_matrix", eval = FALSE}
## Install a new package
BiocManager::install("ExploreModelMatrix")
```

Then adapt the code from the [first example](http://bioconductor.org/packages/release/bioc/vignettes/ExploreModelMatrix/inst/doc/ExploreModelMatrix.html) in their vignette documentation.

```{r "use_explore_model_matrix"}
library("ExploreModelMatrix")

## Adapt code from
## http://bioconductor.org/packages/release/bioc/vignettes/ExploreModelMatrix/inst/doc/ExploreModelMatrix.html
```

Note that `sampleData` cannot take `NA` values, so you'll have to choose some columns to work with. For example:

* `PrimaryDx`
* `BrainRegion`
* `Sex`
* `AgeDeath`
* `mitoRate`
* `totalAssignedGene`
* `rRNA_rate`

#### Solution

Here's one solution to the exercise.

```{r "explore_model_solution"}
## Here we select manually a few columns to work with
## and we use the same formula we used for making our model matrix
app <-
    ExploreModelMatrix(
        sampleData = colData(rse_gene)[, c(
            "PrimaryDx",
            "BrainRegion",
            "Sex",
            "AgeDeath",
            "mitoRate",
            "totalAssignedGene",
            "rRNA_rate"
        )],
        designFormula = ~ PrimaryDx + BrainRegion
    )

## We can the explore the result interactively through a nice
## shiny web application created by ExploreModelMatrix
if (interactive()) {
      shiny::runApp(app)
  }
```


# Day 3

You can download the [**latest version of this Rmd document here**](https://github.com/LieberInstitute/SPEAQeasy-example/blob/master/bootcamp_intro.Rmd).

## Identifying DEGs

Let's take a look at the `RNAseq123` workflow to learn about how to [remove heteroscedascity from count data](http://bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/limmaWorkflow.html#removing-heteroscedascity-from-count-data).

```{r "run_voom"}
## Load limma
library("limma")

## Let's run limma-voom
vGene <- invisible(voom(dge, mod, plot = TRUE))
```

Once we have removed the heteroscedascity from the data, we can proceed with the statistical tests. However, note that we have some repeated brains which means that we have non-independent data and need to take this into account.

```{r "dup_brains"}
table(table(rse_gene$BrNum))
```

We can do so using the function `limma::duplicateCorrelation()`.

```{r "dup_corr", warning=FALSE}
# Get duplicate correlation
gene_dupCorr <- duplicateCorrelation(vGene$E, mod,
    block = colData(rse_gene)$BrNum
)
gene_dupCorr$consensus.correlation
```

In this particular case, `duplicateCorrelation()` had several warnings, but in larger studies where we have more samples across individuals it will be more precise. 

In any case, we can next finally perform the statistical tests using a series of `limma` functions: `lmFit()`, `eBayes()` and `topTable()`. Some key arguments are:

* `lmFit(correlation, block)` which you won't need if you don't have a blocking covariate (duplicated brains or some other technical covariate like batch). 
* `topTable(coef)` which specifies which coefficient we want to test. If we specify more than one, we will compute an F-statistic instead of a t-statistic. The other arguments we specify below allow us to get the full table of results instead of just the very first.


```{r "limma"}
# Fit linear model
fitGeneDupl <- lmFit(
    vGene,
    correlation = gene_dupCorr$consensus.correlation,
    block = colData(rse_gene)$BrNum
)

# Here we perform an empirical Bayesian calculation to obtain
# our significant genes
ebGeneDupl <- eBayes(fitGeneDupl)
outGeneDupl <- topTable(
    ebGeneDupl,
    coef = 2,
    p.value = 1,
    number = nrow(rse_gene),
    sort.by = "none"
)
dim(outGeneDupl)

## Compare against the default topTable() output
## (we'll skip a few columns of output)
topTable(ebGeneDupl, coef = 2)[, -c(1, 3, 6:10)]
```

## Visualizing DEGs

Once we have our list of differentially expressed genes (DEGs), we can make some boxplots as in [SPEAQeasy-example](http://research.libd.org/SPEAQeasy-example/de_analysis_speaqeasy.html#17_Check_plots) to check that the model results make sense. We frequently use [`jaffelab::cleaningY()`](http://research.libd.org/jaffelab/reference/cleaningY.html) to remove (regress out) the effect of covariates in our model.

Another way to visualize DEGs is to make a heatmap which is where `r BiocStyle::CRANpkg("pheatmap")` comes into play. It can be useful to see how the top genes are related.

```{r "pheatmap"}
## For making a heatmap
library("pheatmap")

## Find the genes with a p-value less than 0.005
## (normally we would look at the FDR < 0.05 genes or 0.1)
sigGene <- outGeneDupl[outGeneDupl$P.Value < 0.005, ]

## Extract the normalized expression
exprs_heatmap <- vGene$E[rownames(sigGene), ]

## Build an annotation data.frame for our heatmap
df <- as.data.frame(colData(rse_gene)[, c("PrimaryDx", "Sex", "BrainRegion")])
rownames(df) <- colnames(exprs_heatmap) <- gsub("_.*", "", colnames(exprs_heatmap))
colnames(df) <- c("Diagnosis", "Sex", "Region")

#  Manually determine coloring for plot annotation
palette_names = c('Dark2', 'Paired', 'YlOrRd')
ann_colors = list()
for (i in 1:ncol(df)) {
    col_name = colnames(df)[i]
    n_uniq_colors = length(unique(df[,col_name]))
    
    #   Use a unique palette with the correct number of levels, named with
    #   those levels
    ann_colors[[col_name]] = RColorBrewer::brewer.pal(n_uniq_colors, palette_names[i])[1:n_uniq_colors]
    names(ann_colors[[col_name]]) = unique(df[,col_name])
}

# Display heatmap
pheatmap(
    exprs_heatmap,
    cluster_rows = TRUE,
    show_rownames = FALSE,
    cluster_cols = TRUE,
    annotation_col = df,
    annotation_colors = ann_colors
)
```

```{r pdf_heatmap, eval=FALSE, echo=FALSE}
## For some reason, I need to run this manually to save the PDF

# Write heatmap to PDF
pdf(file = here("intro_de_heatmap.pdf"))
pheatmap(
    exprs_heatmap,
    cluster_rows = TRUE,
    show_rownames = FALSE,
    cluster_cols = TRUE,
    annotation_col = df,
    annotation_colors = ann_colors
)
dev.off()
```

### Exercise

* Make a plot with 4 boxplots (brain region by primary diagnosis) for the top differentially expressed gene.

### Solution

```{r "top_gene_solution"}
## Let's find that top gene
i <- which.min(outGeneDupl$P.Value)

## We can use boxplot() from base R plots
boxplot(
    vGene$E[i,] ~ rse_gene$PrimaryDx + rse_gene$BrainRegion,
    las = 2,
    ylab = "Expression",
    xlab = ""
)

## Or we can use ggplot2
## First we need to build a temporary data frame with
## the data for ggplot2
df_temp <- data.frame(
    Expression = vGene$E[i,],
    Region = rse_gene$BrainRegion,
    Diagnosis = rse_gene$PrimaryDx
)
## Next we can make the boxplot, we'll use "fill" to color
## the boxes by the primary diagnosis variable
ggplot(df_temp, aes(y = Expression, x = Region, fill = Diagnosis)) +
    geom_boxplot() +
    theme_dark(base_size = 20)
```


## GO enrichment analyses

Josh Stolz will show us some [slides](https://docs.google.com/presentation/d/1sQdOThDTq1iu8SV8H-HOJ8nGwgTxd5NSXnyfR1w6c28/edit?usp=sharing).

### Demo

You might need to install `r BiocStyle::Biocpkg("enrichplot")` first as shown below.

```{r "install_explore_enrichplot", eval = FALSE}
## Install a new package
BiocManager::install("enrichplot")
```

Here's code for a GO demonstration by Josh.

```{r "go_demo"}
## Load required R packages
library("enrichplot")
library("clusterProfiler")
library("org.Hs.eg.db")

## Sample Data
data(geneList, package = "DOSE")

## make cutoff log2 fold change
gene <- names(geneList)[abs(geneList) > 2]

## We need get different kinds of gene labels
gene.df <- bitr(
    gene,
    fromType = "ENTREZID",
    toType = c("ENSEMBL", "SYMBOL"),
    OrgDb = org.Hs.eg.db
)

## do gene ontology enrichment for the CC ontology
ego <- enrichGO(
    gene = gene,
    universe = names(geneList),
    OrgDb = org.Hs.eg.db,
    ont = "CC",
    pAdjustMethod = "BH",
    pvalueCutoff = 0.01,
    qvalueCutoff = 0.05,
    readable = TRUE
)

## Visualize the GO enrichment results in different ways
barplot(ego, showCategory = 20)
dotplot(ego, showCategory = 30)
edox <- setReadable(ego, "org.Hs.eg.db", "ENTREZID")

## cnetplot() and heatplot() show the actual genes
## that are present in our data (geneList) for each
## of the GO terms that were significantly enriched
cnetplot(edox,
    foldChange = geneList,
    circular = TRUE,
    colorEdge = TRUE)
heatplot(edox, foldChange = geneList)
```

### Exercise

* Which genes from our dataset are involved in the _chromosome, centromeric region_ ontology?
* Make a `heatplot()` using the results of enrichment analysis from the biological process ontology.

### Solution

```{r "go_solution"}
## Finding the genes for _chromosome, centromeric region_
ego_df <- as.data.frame(ego)
strsplit(
    ego_df$geneID[
        ego_df$Description == "chromosome, centromeric region"
        ],
    "/")[[1]]

## Run the enrichment with the BP ontology
ego_bp <- enrichGO(
    gene = gene,
    universe = names(geneList),
    OrgDb = org.Hs.eg.db,
    ont = "BP",
    pAdjustMethod = "BH",
    pvalueCutoff = 0.01,
    qvalueCutoff = 0.05,
    readable = TRUE
)

## Then make the heatplot
ego_bp_x <- setReadable(ego_bp, "org.Hs.eg.db", "ENTREZID")
heatplot(ego_bp_x, foldChange = geneList)
```


## Beyond SPEAQeasy

* [recount3](http://research.libd.org/recount3/): over 750,000 human and mouse public RNA-seq processed with another aligner from the one used in SPEAQeasy: STAR instead of HISAT2.
* Other tools like Salmon and kallisto.

# Reproducibility

```{r reproducibility}
# Date this report was generated
message(Sys.time())

# Reproducibility info
options(width = 120)
sessioninfo::session_info()
```