analysis8hpa5RR.Rmd

---
title: "8hpa Control - Five Replicates Removed" 
author: "Kirsten Gotting"
output:
  html_document:
      code_download: true
      toc: true
      toc_depth: 2
      toc_float:
        collapsed: false
---


```{r initializeData8hpaControlFiveReplicatesRemoved, echo = FALSE, results = "hide", message = FALSE, error = FALSE, warning = FALSE, cache = FALSE}

## user input to 'DESeq2All.Rdata'

load("./output/DESeq2All5RR.Rdata")
setwd("./")

pval <- as.numeric(pval)
## First list the R libraries I want to use
libraries        <- c("DESeq2", "gplots", "RColorBrewer", 'knitr', 'grid', 'gridExtra', "tidyverse", 'kiRsten')

## Now read in the libraries
lapply(libraries, function(x){
    library(x, character.only = TRUE, quietly = TRUE)
    })

## Initialize knitr options
opts_chunk$set(echo=FALSE, message=FALSE, results="hide", fig.keep="all", warning=FALSE, error=FALSE, fig.path="figures/", cache = TRUE)
opts_knit$set(root.dir = "./")


```

# Overview

Differential expression analysis with 8hpa control and all replicates.


Replicated removed: '7hpf_2012_N1', '7hpf_2013_17', '2hpf_2012_N6', '1hpf_2012_N7', '6hpf_2012_N2'

#Spearman Correlation

Spearman's rank correlation coefficient is a nonparametric measure of rank correlation, recommended for data that does not necessarily come from a bivariate normal distribution (Wikipedia). Below, we plot the Spearman correlation of the RPM (reads per million) for all genes that have an average RPM &#8805; 1.


```{r spearman_correlation8hpaControlFiveReplicatesRemoved, fig.cap = ""}

## Make a spearman plot


filteredRPMs <- normCounts %>% group_by(ID) %>% mutate(rpmFlag = ifelse((mean(rpm) >= 1), yes = 'yes', no = 'no')) %>% ungroup() %>%
    filter(rpmFlag == 'yes') %>% select(ID, Sample, rpm) %>% spread(Sample, rpm)


rpmfilt        <- data.frame(select(filteredRPMs, -ID), row.names = filteredRPMs$ID, check.names = FALSE)
spearman       <- data.frame(cor(rpmfilt, method = 'spearman'), check.names = FALSE)
spearman$ID    <- rownames(spearman)
spearmanTidy   <- spearman %>% gather(Sample, correlation, -ID)

## and plot it!

ggplot(data = spearmanTidy, aes(x = ID, y = Sample, fill = correlation)) + geom_tile() +
    scale_fill_gradient2(low = "yellow", mid = 'white', high = "purple4", midpoint = 0.8, limit = c(0,1)) +
        ggtitle('Spearman Correlation of Samples') + xlab('') + ylab('') +
            theme(axis.text.x = element_text(angle = 90, hjust = 1))
```


# Pairwise PCA

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components (Wikipedia). Below is a PCA of the 8hpf control versus the other developmental timepoints. 

```{r PCA8hpaControlFiveReplicatesRemoved, fig.fullwidth = TRUE, fig.width = 10, fig.height = 10}

pcas_data <- lapply(contrast_names, function(x){
    y <- plotPCA(rld_all[[x]], returnData = TRUE, ntop = 10000)
    y$ID <- x
    y
})

pca_data <- bind_rows(pcas_data)

ggplot(data = pca_data, aes(x = PC1, y = PC2, label = name, colour = factor(condition))) +
            geom_point(size = 3) + theme_bw() + geom_text(hjust = 0, vjust = -0.4) +
            xlab("PC 1") + ylab("PC 2") + 
            scale_x_continuous(expand = c(0.98, 0))  +
            scale_y_continuous(expand = c(0.2, 0)) +
            theme(legend.position="none") +
            facet_wrap(~ ID, scales = "free")

```


# MA and Volcano plots {.tabset}

MA plot: The log2 fold change (M) plotted against the log2 average (A) of the normalized read count for each gene. Significantly differentially expressed genes are plotted in red.

Volcano plot: The log2 fold change (M) plotted against the -log10 (eg. 1e^-10 = 10) of the adjusted p-value.

```{r mavol8hpaControlFiveReplicatesRemoved}


x <- contrast_names[[1]]

plots <- lapply(contrast_names, function(x){
    df      <- all_results_tidyDF %>% filter(contrastID == x) %>% spread(dea_ID, dea_Value) %>% mutate(logbaseMean = log(baseMean))
    pval    <- as.numeric(pval)
    if(is.null(df)){
        message(paste0("Pvalue too stringent for ", contrast, ". MA/Volcano plot ommited.\n"))
    } else {
        pval                  <- as.numeric(pval)
        lower_label           <- paste0("Genes with Counts: ", nrow(df), "\nUp: ",#create the label of the legend
                                  nrow(subset(df, log2FoldChange > 0 & padj < pval)),
                                  ", Down: ", nrow(subset(df, log2FoldChange < 0 & padj < pval)), "\n")
  
        df <- df %>% mutate(significance_group = ifelse(padj <= pval, yes = 'yes', no = 'no'))
  
        legend_labels        <- c(paste0("padj > ", pval,": ", nrow(filter(df, significance_group == 'no'))),
                                  paste0("padj < ", pval,": ", nrow(filter(df, significance_group == 'yes'))))
        ma.plot <-   ggplot(df, aes(x = logbaseMean, y = log2FoldChange, group = significance_group, colour = significance_group)) +
                      geom_point() + theme_bw() +
                      ggtitle('MA plot') + scale_x_continuous(expand = c(0, 0)) +
                      xlab("A = Mean of log2(Abundance)") + ylab("M = log2(Fold Change)") +
                      scale_color_manual(values = c("gray24", "red"), name = lower_label, breaks = c("gray24", "red"), labels = legend_labels) +
                      geom_hline(aes(yintercept = 0), colour = "black", linetype = "dashed") +
                      geom_hline(aes(yintercept = 1), colour = "gray66", linetype = "dashed") +
                      geom_hline(aes(yintercept = -1), colour = "gray66", linetype="dashed") +
                      theme(legend.position = 'none', text = element_text(size=20))
        volcano.plot <-   ggplot(df, aes(x = log2FoldChange, y = -log(padj, base = c(10)), group = significance_group, colour = significance_group)) +
                          geom_point() +
                          ggtitle('Volcano plot') +
                          xlab("M = log2(Fold Change)") + ylab("-log10(Adjusted P-Value)") + 
                          scale_color_manual(values = c("gray24", "red"), labels = legend_labels, name=lower_label) +
                          theme_bw() +
                          theme(legend.title = element_text(size = 16, face = "bold"), legend.text = element_text(size = 20),  text = element_text(size=20))
     grob <-   marrangeGrob(list(ma.plot,
                          volcano.plot),
                          top = textGrob(x, gp = gpar(fonsize = 40, fontface = 'bold', cex = 2), hjust = 1.8),
                          nrow = 1, ncol = 2,
                          widths=c(0.38, 0.62))
     invisible(grob)
    }
})

names(plots) <- contrast_names
```

```{r plotmavol8hpaControlFiveReplicatesRemoved, fig.keep = "all", fig.height = 10, fig.width = 20, echo=FALSE, results='asis'}

for(i in contrast_names){
  cat(paste0("\n\n\n##", i, "\n"))
  print(plots[[i]])
}

```


#Heatmaps


The first heatmap was made by hierarchically clustering the euclidean distances of the log2 fold change for genes with p-values less than  1e^-5 at any time point.

The second heatmap uses the same hierarchical method to cluster timepoints as well as genes.


```{r pairwiseHeatmaps8hpaControlFiveReplicatesRemoved, results = "hide", fig.cap = "", fig.keep = "high"}


lfc_table <- all_results_tidyDF %>%
  mutate(significant = ifelse(dea_ID == 'padj' & dea_Value <= pval, yes = TRUE, no = FALSE)) %>% # label all significant genes
  group_by(contrastID, Gene) %>% # label the # of times genes that are significant
  ungroup()  %>%
  group_by(Gene) %>%
  filter(any(significant), dea_ID == 'log2FoldChange') %>%
  dplyr::select(-significant) %>% 
  unite(colname, contrastID, dea_ID) %>% # make the table wide formatted
  spread(colname, dea_Value)


## create a table of significant genes with the cluster colors assigned

sign.table <- all_results_tidyDF  %>% 
  spread(key = dea_ID, value = dea_Value)  %>% 
  filter(Gene %in% lfc_table$Gene) %>%
  group_by(contrastID, Gene) %>% 
  mutate(sortby = -log(padj, base = c(10))*sign(log2FoldChange)) %>% 
  ungroup() %>% gather(dea_ID, dea_Value, -Gene, -contrastID) %>% 
  unite(idAll, contrastID, dea_ID) %>% spread(key = idAll, value = dea_Value) %>%
  dplyr::rename('TranscriptID' = Gene)


heatmap.input.table           <- data.frame(lfc_table)
rownames(heatmap.input.table) <- lfc_table$Gene
heatmap.input.table$Gene      <- NULL
heatmap.input.table           <- na.omit(data.matrix(heatmap.input.table[which(rownames(heatmap.input.table) %in% sign.table$TranscriptID), ]))
colnames(heatmap.input.table) <- contrast_names


prettycolors  <- colorRampPalette(rev(brewer.pal(11, "Spectral")))
col.breaks    <- seq(-3, 3, length.out = ncol(heatmap.input.table))


## Make two heatmaps, allow custimizable graphing

hr <- hclust(dist(heatmap.input.table), method="complete") # Creates row dendrogram
hc <- hclust(dist(t(heatmap.input.table)), method="complete") # Creates column dendrogram

myclr     <- cutree(hr, k = 7) # set up where the groups will be, change 'k' for more or less groups
mycolr    <- c("Green", "Red", "Blue", "Orange", "Purple", "Pink", "Yellow") # must be the same length as 'k'
mycolr    <- mycolr[as.vector(myclr)] # pull out the colors assigned to each gene
mycolr.df <- data.frame(Cluster_Color = mycolr, TranscriptID = names(myclr)) # create a data frame of the cluster color each gene belongs to


if(ncol(heatmap.input.table) > 1){
    ## Make a heatmap without a column dendrogram
    col.breaks    <- seq(-3, 3, length.out = ncol(heatmap.input.table))
    heatmap       <- heatmap.2(heatmap.input.table, margin = c(15,5), col = prettycolors, dendrogram = 'row', trace = c("none"), Rowv=as.dendrogram(hr), Colv = NA, labRow = FALSE, key = TRUE, breaks = col.breaks, key.ylab = "Gene Count", key.xlab = "log2FoldChange", cexCol = 1.5, RowSideColors = mycolr)
    title(main = paste0("Clustering of ", nrow(heatmap.input.table)," Genes"), line = 0)
    par(cex.main=0.9)
}


if(ncol(heatmap.input.table) > 2) {
    k <- ifelse(ncol(heatmap.input.table < 5), 3, 5)
    myclc  <- cutree(hc, k = k) # create groupings for the column dendrogram
    mycolc <- c("Green", "Red", "Blue", "Orange", "Purple")
    mycolc <- mycolc[as.vector(myclc)]
    ## Make a heatmap of the logfold changes with column dendrogram.
    heatmap       <- heatmap.2(heatmap.input.table, margin = c(15,5), dendrogram = 'both', Rowv = as.dendrogram(hr), Colv = as.dendrogram(hc), col = prettycolors, trace = c("none"), labRow = FALSE, key = TRUE, breaks = col.breaks, main = paste0("Clustering of Timepoints on ", nrow(heatmap.input.table)," Genes"), key.ylab = "Gene Count", key.xlab = "log2FoldChange", cexCol = 1.5, RowSideColors = mycolr, ColSideColors = mycolc)
}

```


```{r session8hpaControlFiveReplicatesRemoved, echo=FALSE}
writeLines(capture.output(sessionInfo()), "sessionInfo/8hpa5RRSessionInfo.txt")
```

#[R-session information](sessionInfo/8hpa5RRSessionInfo.txt)