diff --git a/00-author.Rmd b/00-author.Rmd index 93bbf94..4d94774 100755 --- a/00-author.Rmd +++ b/00-author.Rmd @@ -1,20 +1,18 @@ -# About the Author(s) {-} +# About the Authors {-} -The authors have decades of combined experience in data analysis for genomics. They are developers of Bioconductor packages such as [**methylKit**](https://bioconductor.org/packages/release/bioc/html/methylKit.html), [**genomation**](https://bioconductor.org/packages/release/bioc/html/genomation.html), [**RCAS**](https://bioconductor.org/packages/release/bioc/html/RCAS.html) and [**netSmooth**](https://bioconductor.org/packages/release/bioc/html/netSmooth.html). In addition, they have played key roles in developing end-to-end genomics data analysis pipelines for RNA-seq, ChIP-seq, Bisulfite-seq, and single cell RNA-seq called [PiGx](http://bioinformatics.mdc-berlin.de/pigx/). +[*Dr. Altuna Akalin*](https://github.com/al2na) organized the book structure, wrote most of the book and edited the rest. He is a bioinformatics scientist and the head of Bioinformatics and Omics Data Science Platform at the Berlin Institute for Medical Systems Biology, Max Delbrück Center in Berlin. He has been developing computational methods for analyzing and integrating large-scale genomics data sets since 2002. He is interested in using machine learning and statistics to uncover patterns related to important biological variables such as disease state and type. He lived in the USA, Norway, Turkey, Japan and Switzerland in order to pursue research work and education related to computational genomics. The underlying aim of his current work is utilizing complex molecular signatures to provide decision support systems for disease diagnostics and biomarker discovery. In addition to the research efforts and the managing of a scientific lab, since 2015, he has been organizing and teaching at computational genomics courses in Berlin with participants from across the world. This book is mostly a result of material developed for those and previous teaching efforts at Weill Cornell Medical College in New York and Friedrich Miescher Institute in Basel, Switzerland. +Dr. Akalin and the following contributing authors have decades of combined experience in data analysis for genomics. They are developers of Bioconductor packages such as [**methylKit**](https://bioconductor.org/packages/release/bioc/html/methylKit.html), [**genomation**](https://bioconductor.org/packages/release/bioc/html/genomation.html), [**RCAS**](https://bioconductor.org/packages/release/bioc/html/RCAS.html) and [**netSmooth**](https://bioconductor.org/packages/release/bioc/html/netSmooth.html). In addition, they have played key roles in developing end-to-end genomics data analysis pipelines for RNA-seq, ChIP-seq, Bisulfite-seq, and single cell RNA-seq called [PiGx](http://bioinformatics.mdc-berlin.de/pigx/). -[*Dr. Altuna Akalin*](https://github.com/al2na) wrote most of the book and edited the rest. Altuna is a bioinformatics scientist and the head of Bioinformatics and Omics Data Science Platform at the Berlin Institute of Medical Systems Biology, Max Delbrück Center in Berlin. He has been developing computational methods for analyzing and integrating large-scale genomics data sets since 2002. He is interested in using machine learning and statistics to uncover patterns related to important biological variables such as disease state and type. 
He lived in the USA, Norway, Turkey, Japan, and Switzerland in order to pursue research work and education related to computational genomics. The underlying aim of his current work is utilizing complex molecular signatures to provide decision support systems for disease diagnostics and biomarker discovery. In addition to the research efforts and managing a scientific lab, since 2015, he has been organizing and teaching at computational genomics courses in Berlin with participants from across the world. This book is mostly a result of material developed for those and previous teaching efforts. +**Contributing authors** - -**Contributing Authors** +[*Dr. Bora Uyar*](https://github.com/borauyar) contributed Chapter 8, "RNA-seq Analysis". He started his bioinformatics training in Sabanci University (Istanbul/Turkey), from which he got his undergraduate degree. Later, he obtained an MSc from Simon Fraser University (Vancouver/Canada), then a PhD from the European Molecular Biology Laboratory in Heidelberg/Germany. Since 2015, he has been working as a bioinformatics scientist at the Bioinformatics Platform and Omics Data Science Platform at the Berlin Institute for Medical Systems Biology. He has been contributing to the bioinformatics platform through research, collaborations, services and data analysis method development. His current primary research interest is the integration of multiple types of omics datasets to discover prognostic/diagnostic biomarkers of cancers. -[*Dr. Bora Uyar*](https://github.com/borauyar) contributed "RNA-seq Analysis" chapter. Bora started his bioinformatics training in Sabanci University (Istanbul/Turkey), from where he got his undergraduate degree. Later, he obtained MSc from Simon Fraser University (Vancouver/Canada), then a PhD from the European Molecular Biology Laboratory in Heidelberg/Germany. Since 2015, he has been working as a bioinformatics scientist at the Bioinformatics Platform and Omics Data Science Platform at the Berlin Institute for Medical Systems Biology. He has been contributing to the bioinformatics platform through research, collaborations, services and data analysis method development. His current primary research interest is the integration of multiple types of omics datasets to discover prognostic/diagnostic biomarkers of cancers. +[*Dr. Vedran Franke*](https://github.com/frenkiboy) contributed Chapter 9, "ChIP-seq Analysis". He received his PhD from the University of Zagreb. His work focused on the biogenesis and function of small RNA molecules during early embryogenesis, and establishment of pluripotency. Prior to his PhD, he worked as a scientific researcher under Boris +Lenhard at the University of Bergen, Norway, focusing on principles of gene enhancer functions. He continues his research in the +Bioinformatics and Omics Data Science Platform at the Berlin Institute for Medical System Biology. He develops tools for multi-omics data integration, focusing on single-cell RNA sequencing, and epigenomics. His integrated knowledge of cellular physiology along with his proficiency in data analysis enable him to find creative solutions to difficult biological problems. -[*Dr. Vedran Franke*](https://github.com/frenkiboy) contributed "ChIP-seq Analysis" chapter. Vedran Franke received his PhD from University of Zagreb. His work focused on the biogenesis and function of small RNA molecules during early embryogenesis, and establishment of pluripotency. 
Prior to his PhD, he worked as a scientific researcher under Boris -Lenhard in University of Bergen, Norway, focusing on principles of gene enhancer functions. He continues his research in the -Bioinformatics and Omics Data Science Platform in Berlin Institute for Medical System Biology. He develops tools for multi-omics data integration, with the focus on single cell RNA sequencing, and epigenomics. His integrated knowledge of cellular physiology along with his proficiency in data analysis enables him to find creative solutions to difficult biological problems. - -[*Dr. Jonathan Ronen*](https://github.com/jonathanronen) contributed "Multi-omics Analysis" chapter. Jonathan got his MSc. in control engineering from the Norwegian University of Science and Technology in 2010. He then worked as a software developer in Oslo, Brussels, and Munich. During that time, he was also on the founding team of www.holderdeord.no, a website that links votes in the Norwegian parliament to pledges made in party manifestos. In 2014-2015, Jonathan worked as a data scientist in New York University's Social Media and Political Participation lab. During that time, he also launched www.lahadam.co.il, a website which tracked Israeli politician's facebook posts. Jonathan obtained a PhD in computational biology in 2020, where he has published tools for imputation for single cell RNA-seq using priors, and integrative analysis of multi-omics data using deep learning. +[*Dr. Jonathan Ronen*](https://github.com/jonathanronen) contributed Chapter 11, "Multi-omics Analysis". Dr. Ronen got his MSc in control engineering from the Norwegian University of Science and Technology in 2010. He then worked as a software developer in Oslo, Brussels, and Munich. During that time, he was also on the founding team of www.holderdeord.no, a website that links votes in the Norwegian parliament to pledges made in party manifestos. In 2014--2015, he worked as a data scientist in New York University's Social Media and Political Participation lab. During that time, he also launched www.lahadam.co.il, a website which tracked Israeli politicians' Facebook posts. He obtained a PhD in computational biology in 2020, where he has published tools for imputation for single cell RNA-seq using priors, and integrative analysis of multi-omics data using deep learning. diff --git a/01-intro2Genomics.Rmd b/01-intro2Genomics.Rmd index 18bb8a9..88bd829 100644 --- a/01-intro2Genomics.Rmd +++ b/01-intro2Genomics.Rmd @@ -14,8 +14,8 @@ knitr::opts_chunk$set(echo = TRUE, The aim of this chapter is to provide the reader with some of the fundamentals required for -understanding genome biology. By no means, this is a complete overview of the -subject but just a summary that will help the non-biologist reader understand +understanding genome biology. By no means is this a complete overview of the +subject, but just a summary that will help the non-biologist reader understand the recurring biological concepts in computational genomics. Readers that are well-versed in genome biology and modern genome-wide quantitative assays should feel @@ -24,16 +24,16 @@ free to skip this chapter or skim it through. ## Genes, DNA and central dogma A central concept that will come up again and again is "the gene". -Before we can explain that we need to +Before we can explain that, we need to introduce a few other concepts that are important to understand the gene concept. -Human body is made up of billions of cells.
These cells specialize in different +The human body is made up of billions of cells. These cells specialize in different tasks. For example, in the liver there are cells that help produce enzymes to break toxins. In the heart, there are specialized muscle cells that make -the heart beat. Yet, all these different kinds of cells come from a single celled +the heart beat. Yet, all these different kinds of cells come from a single-celled embryo. All the instructions to make different kinds of cells are contained within that single cell and with every division of that cell, those instructions -are transmitted to new cells. These instructions can be coded into a string - a +are transmitted to new cells. These instructions can be coded into a string -- a molecule of DNA, a polymer made of recurring units called nucleotides. The four nucleotides in DNA molecules, Adenine, Guanine, Cytosine and Thymine (coded as four letters: A, C, G, and T) in a specific sequence, store the information for @@ -49,7 +49,7 @@ In eukaryotic cells, DNA is wrapped around proteins (histones) \index{histone} f nucleosomes which make up chromatins \index{chromatin} and chromosomes (see Figure \@ref(fig:chromatinChr)). -```{r,chromatinChr,fig.cap="Chromosome structure in animals",fig.align = 'center',out.width='60%',echo=FALSE} +```{r,chromatinChr,fig.cap="Chromosome structure in animals.",fig.align = 'center',out.width='60%',echo=FALSE} knitr::include_graphics("images/chromatinChr.png" ) ``` @@ -58,22 +58,22 @@ knitr::include_graphics("images/chromatinChr.png" ) There might be several chromosomes \index{chromosome} depending on the organism. However, in some species (such as most prokaryotes) -DNA is stored in a circular form. The size of genome between species differs too. -Human genome has 46 chromosomes and over 3 billion base-pairs, whereas wheat genome -has 42 chromosomes and 17 billion base-pairs, both genome size and chromosome numbers +DNA is stored in a circular form. The size of the genome between species differs too. +The human genome has 46 chromosomes and over 3 billion base-pairs, whereas the wheat genome +has 42 chromosomes and 17 billion base-pairs; both genome size and chromosome numbers are variable between different organisms. Genome sequences of organisms are obtained using sequencing technology. With this technology, fragments of the DNA sequence from the genome, called reads, are obtained. Larger chunks of the genome sequence -is later obtained by stitching the -initial fragments to larger ones by using the overlapping reads. Latest, +are later obtained by stitching the +initial fragments to larger ones by using the overlapping reads. The latest sequencing technologies made genome sequencing cheaper and faster. These technologies output more reads, longer reads and more accurate reads. -Estimated cost -of the first human genome is $300 million in 1999-2000, today a high-quality human genome +The estimated cost +of the first human genome was $300 million in 1999--2000; today a high-quality human genome can be obtained for $1500. Since the costs are going down, researchers and clinicians -can generate more data. This drives up to costs for data storage and also drives +can generate more data. This drives up the costs for data storage and also drives up the demand for qualified people to analyze genomic data. This was one of the motivations behind writing this book. @@ -95,20 +95,20 @@ basic units of heredity in all living organisms. 
All cells use their hereditary information in the same way most of the time; the DNA is replicated to transfer the information to new cells. If activated, the genes are transcribed into -messenger RNAs (mRNAs) \index{mRNA} in nucleus (in eukaryotes), followed by mRNAs (if the +messenger RNAs (mRNAs) \index{mRNA} in the nucleus (in eukaryotes), followed by mRNAs (if the gene is protein coding) getting translated into proteins in the cytoplasm. This is -essentially a process of information transfer between information carrying +essentially a process of information transfer between information-carrying polymers; DNA, RNA and proteins, known as the “central dogma” \index{central dogma} of molecular biology (see Figure \@ref(fig:CentDog) for a summary). Proteins are essential elements for life. -The growth and repair, functioning and structure of all living cells depends on them. +The growth and repair, functioning and structure of all living cells depend on them. This is why the gene is a central concept in genome biology, because a gene can encode information for proteins and other functional molecules. How genes are controlled and activated dictates everything about an organism. From the identity of a cell to response to an infection, how cells develop and behave against certain stimuli is governed -by activity of the genes and functional molecules they encode. The liver cell becomes a liver cell because certain -genes are activated and their functional products are produced to help liver +by the activity of the genes and the functional molecules they encode. The liver cell becomes a liver cell because certain +genes are activated and their functional products are produced to help the liver cell achieve its tasks. ```{r,CentDog,fig.cap="Central Dogma: replication, transcription, translation",fig.align = 'center',out.width='100%',echo=FALSE} @@ -116,37 +116,37 @@ knitr::include_graphics("images/centDogma.png" ) ``` -### How genes are controlled? The transcriptional and the post-transcriptional regulation -In order to answer this question, we have to dig a little deeper on the +### How are genes controlled? Transcriptional and post-transcriptional regulation +In order to answer this question, we have to dig a little deeper into the transcription concept we introduced via the central dogma. -The first step in a process of information transfer - a production of an RNA \index{gene regulation} -copy of a part of the DNA sequence - is called transcription. This task is +The first step in a process of information transfer - the production of an RNA \index{gene regulation} +copy of a part of the DNA sequence - is called transcription. This task is carried out by the RNA polymerase enzyme. RNA polymerase-dependent initiation of transcription is enabled by the existence of a specific region in the sequence of DNA - a core promoter. Core promoters are regions of DNA that promote transcription and are found upstream from the start site of transcription. In -eukaryotes, several proteins, called general transcription factors recognize and +eukaryotes, several proteins, called general transcription factors, recognize and bind to core promoters and form a pre-initiation complex. RNA polymerases recognize these complexes and initiate synthesis of RNAs, the -polymerase travels along the template DNA and making an RNA -copy[@hager2009transcription]. After mRNA is -produced it is often spliced by spliceosome. The sections called 'introns' are removed and -sections called 'exons' left in. 
Then, the remaining mRNA translated into proteins. Which exons +polymerase travels along the template DNA and makes an RNA +copy [@hager2009transcription]. After mRNA is +produced it is often spliced by spliceosome. The sections, called 'introns', are removed and +sections called 'exons' left in. Then, the remaining mRNA is translated into proteins. Which exons will be part of the final mature transcript can also be regulated and creates diversity in protein structure and function (See Figure \@ref(fig:TransSplice)). -```{r,TransSplice,fig.cap="Transcription could be followed by splicing, which creates different transcript isoforms. This will in return create different protein isoforms since the information required to produce the protein is encoded in the transcripts. Differences in transcript of the same gene can give rise to different protein isoforms",fig.align = 'center',out.width='70%',ref.label='TransSplice',echo=FALSE} +```{r,TransSplice,fig.cap="Transcription can be followed by splicing, which creates different transcript isoforms. This will in return create different protein isoforms since the information required to produce the protein is encoded in the transcripts. Differences in transcripts of the same gene can give rise to different protein isoforms.",fig.align = 'center',out.width='70%',ref.label='TransSplice',echo=FALSE} knitr::include_graphics("images/TransSplice.png" ) ``` -On the contrary to protein coding genes, non-coding RNA (ncRNAs) +Contrary to protein coding genes, non-coding RNA (ncRNAs) genes are processed and assume their functional structures after transcription and without going into translation, hence the name: non-coding RNAs. Certain ncRNAs can also be spliced but still not translated. ncRNAs and other RNAs in general can form complementary base-pairs within the RNA molecule -which gives them additional complexity. This self-complementarity based -structure, termed RNA secondary structure, is often necessary for functions of many +which gives them additional complexity. This self-complementarity-based +structure, termed the RNA secondary structure, is often necessary for functions of many ncRNA species. In summary, the set of processes, from transcription initiation to production @@ -164,16 +164,16 @@ As someone interested in computational genomics, you will frequently encounter a gene on a computer screen, and how it is represented on the computer will be equivalent to what you imagine when you hear the word "gene". In the online databases, the genes will appear as a sequence of letters -or as a series of connected boxes showing exon-intron structure which may -include the direction of transcription as well (see Figure \@ref(fig:RealGene)). You will encounter more with the latter so this is likely what will +or as a series of connected boxes showing exon-intron structure, which may +include the direction of transcription as well (see Figure \@ref(fig:RealGene)). You will encounter more with the latter, so this is likely what will pop into your mind when you think of genes. -As we have mentioned DNA has two strands, and a gene can be located -on either of them, and direction of transcription will depend on that. In the -Figure you can see arrows on introns (lines connecting boxes) indicating the +As we have mentioned, DNA has two strands. A gene can be located +on either of them, and the direction of transcription will depend on that. 
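+In practice, this on-screen representation boils down to a handful of exon coordinates plus a strand. The following minimal sketch is purely illustrative (it assumes the Bioconductor package `GenomicRanges` is installed; the gene and its coordinates are hypothetical):
+
+```{r geneAsRanges, eval=FALSE}
+library(GenomicRanges)
+# a hypothetical two-exon gene on the minus strand of chromosome X
+gene <- GRanges(seqnames = "chrX",
+                ranges   = IRanges(start = c(100, 500), end = c(200, 650)),
+                strand   = "-")
+gene
+```
+
+Printing the object shows the exon coordinates together with the strand, which encodes the direction of transcription discussed above.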
In the +Figure \@ref(fig:RealGene), you can see arrows on introns (lines connecting boxes) indicating the direction of the gene. -```{r,RealGene,fig.cap="A) Representation of a gene at UCSC browser. Boxes indicate exons, and lines indicate introns. B) Partial sequence of FATE1 gene as shown in NCBI GenBank database.",fig.align = 'center',out.width='80%',ref.label='RealGene',echo=FALSE} +```{r,RealGene,fig.cap="A) Representation of a gene in the UCSC browser. Boxes indicate exons, and lines indicate introns. B) Partial sequence of FATE1 gene as shown in the NCBI GenBank database.",fig.align = 'center',out.width='80%',ref.label='RealGene',echo=FALSE} knitr::include_graphics("images/RealGene.png" ) ``` @@ -181,22 +181,22 @@ knitr::include_graphics("images/RealGene.png" ) ## Elements of gene regulation The mechanisms regulating gene expression \index{gene regulation} are essential for -all living organisms as they dictate where and how much of a gene product (may -it be protein or ncRNA) should be manufactured. This regulation could occur +all living organisms as they dictate where and how much of a gene product (it may +be protein or ncRNA) should be manufactured. This regulation could occur at the pre- and co-transcriptional level by controlling how many transcripts should be produced and/or which version of the transcript should be produced by regulating -splicing. Different versions of the same gene could encode for proteins by -regulating splicing the process can decide which parts will go into the final -mRNA that will code for the protein. +splicing. The same gene could encode for different versions of the same protein via +splicing regulation. This process defines which parts of the gene will go into the final +mRNA that will code for the protein variant. In addition, gene products can be regulated post-transcriptionally where certain molecules bind to RNA and mark them for degradation even before they can be used in protein production. Gene regulation drives cellular differentiation; a process during which different tissues and cell types are produced. It also -helps cells maintain differentiated states of cells/tissues. As a product of +helps cells maintain differentiated states of cells/tissues. As a result of this process, at the final stage of differentiation, different kinds of cells -maintain different expression profiles although they contain the same genetic -material. As mentioned above there are two main types of regulation and next we +maintain different expression profiles, although they contain the same genetic +material. As mentioned above, there are two main types of regulation and next we will provide information on those. ### Transcriptional regulation @@ -205,22 +205,22 @@ expression regulation. The rate is controlled by core promoter elements as well distant-acting regulatory elements such as enhancers. On top of that, processes like histone modifications and/or DNA methylation have a crucial regulatory impact on transcription. If a region is not accessible for the transcriptional -machinery, e.g. in the case when chromatin structure is compacted due to the +machinery, e.g. in the case where the chromatin structure is compacted due to the presence of specific histone modifications, or if the promoter DNA is -methylated, transcription may not start at all.
Last but not least, gene activity is also controlled post-transcriptionally by ncRNAs such as microRNAs -(miRNAs), as well as by cell signaling resulting in protein modification or +(miRNAs), as well as by cell signaling, resulting in protein modification or altered protein-protein interactions. #### Regulation by transcription factors through regulatory regions Transcription factors are proteins that \index{transcription factors (TFs)} recognize a specific DNA motif to bind on a regulatory region and regulate the transcription rate of \index{DNA motif} -the gene associated with that regulatory region (See Figure \@ref(fig:regSummary)) +the gene associated with that regulatory region (see Figure \@ref(fig:regSummary) for an illustration). These factors bind to a variety of regulatory regions summarized in Figure \@ref(fig:regSummary), and their concerted action controls the transcription rate. Apart from their binding preference, their -concentration, the availability of synergistic or competing transcription +concentration, and the availability of synergistic or competing transcription factors will also affect the transcription rate. ```{r,regSummary,fig.cap="Representation of regulatory regions in animal genomes",fig.align = 'center',out.width='70%',ref.label='regSummary',echo=FALSE} knitr::include_graphics("images/regulationSummary.png" ) ``` @@ -233,7 +233,7 @@ knitr::include_graphics("images/regulationSummary.png" ) ##### Core and proximal promoters Core promoters are the immediate neighboring regions around \index{promoter} -the transcription start site (TSS) \index{transcription start site (TSS)} that serves as a docking site for the +the transcription start site (TSS) \index{transcription start site (TSS)} that serve as a docking site for the transcriptional machinery and pre-initiation complex (PIC) assembly. The textbook model for transcription initiation is as follows: The core promoter has a TATA motif (referred as TATA-box) 30 bp upstream of an initiator sequence @@ -244,22 +244,22 @@ TATA-box and Inr, there are a number of sequence elements on the animal core promoters that are associated with transcription initiation and PIC assembly, such as downstream promoter elements (DPEs), the BRE elements and CpG islands. DPEs are found 28-32 bp downstream of the TSS in TATA-less promoters of -Drosophila melanogaster, it generally co-occurs with the Inr element, and is +_Drosophila melanogaster_. They generally co-occur with the Inr element, and are thought to have a similar function to the TATA-box. The BRE element is -recognized by TFIIB protein and lies upstream of the TATA-box. CpG islands +recognized by the TFIIB protein and lies upstream of the TATA-box. CpG islands are CG dinucleotide-enriched segments of vertebrate genomes, despite the general -depletion of CG dinucleotides in those genomes. 50-70% of promoters in +depletion of CG dinucleotides in those genomes. 50 to 70% of promoters in the human genome are associated with CpG islands. Proximal promoter elements are typically right upstream -of the core promoters and usually contain binding sites for activator -transcription factors and they +of the core promoters, usually contain binding sites for activator +transcription factors, and provide additional control over gene expression. -##### Enhancers: -Proximal regulation is not the only, \index{enhancer} nor the most important mode +##### Enhancers +Proximal regulation is not the only\index{enhancer} or the most important mode of gene regulation.
Most of the transcription factor binding sites in -the human genome are found in intergenic regions or in introns . +the human genome are found in intergenic regions or in introns. This indicates the widespread usage of distal regulatory elements in animal genomes. On a molecular function level, enhancers are similar to proximal promoters; they contain binding @@ -270,11 +270,11 @@ tissues. In addition, their activity is independent of their orientation and their distance to the promoter they interact with. A number of studies showed that enhancers can act upon their target genes over several kilobases away. According to a popular -model, enhancers achieve this by looping the DNA and coming to contact with +model, enhancers achieve this by looping the DNA and coming into contact with their target genes. -##### Silencers: +##### Silencers Silencers are similar to enhancers; however their effect is opposite of enhancers on the transcription of the target gene, and results in decreasing their level of transcription. They contain binding sites for @@ -283,15 +283,15 @@ block the binding of an activator , directly compete for the same binding site, or induce a repressive chromatin state in which no activator binding is possible. Silencer effects, similar to those of enhancers, are independent of orientation and distance to target genes. In contradiction to this general view, -in Drosophila there are two types of silencers, long-range and short-range. +in _Drosophila_ there are two types of silencers, long-range and short-range. Short-range silencers are close to promoters and long-range silencers can silence multiple promoters or enhancers over kilobases away. Like enhancers, silencers bound by repressors may also induce changes in DNA -structure by looping and creating higher order structures. One class of +structure by looping and creating higher-order structures. One class of such repressor proteins, which is thought to initiate higher-order structures by looping, is Polycomb group proteins (PcGs). -##### Insulators: +##### Insulators Insulator regions limit the effect of other regulatory elements to certain chromosomal boundaries; in other words, they create regulatory domains untainted by the regulatory elements in regions outside that domain. @@ -300,64 +300,63 @@ repressive chromatin domains. In vertebrates and insects, some of the well-studied insulators are bound by CTCF (CCCTC-binding factor). Genome-wide studies from different mammalian tissues confirm that CTCF binding is largely invariant of cell type, and CTCF \index{CTCF protein} motif locations are conserved in -vertebrates. At present, there are two models of explaining the insulator +vertebrates. At present, there are two models that explain the insulator function; the most prevalent model claims insulators create physically separate domains by modifying chromosome structure. This is thought to be achieved by CTCF-driven chromatin looping and recent evidence shows that CTCF can induce a higher-order chromosome structure through creating loops of chromatins. According to the second model, an insulator-bound activator cannot bind an enhancer; thus enhancer-blocking activity is achieved and insulators can also -recruit active histone domain, creating an active domain for enhancers to +recruit an active histone domain, creating an active domain for enhancers to function. 
-##### Locus control regions: +##### Locus control regions Locus control regions (LCRs) are clusters of -different regulatory elements that control entire set of genes on a locus. LCRs +different regulatory elements that control an entire set of genes on a locus. LCRs help genes achieve their temporal and/or tissue-specific expression programs. -LCRs may be composed of multiple cis-regulatory elements, such as insulators, -enhancers and they act upon their targets even from long distances. However -LCRs function with an orientation dependent manner, for example the activity of +LCRs may be composed of multiple cis-regulatory elements, such as insulators and +enhancers, and they act upon their targets even from long distances. However, +LCRs function in an orientation-dependent manner, for example the activity of beta-globin LCR is lost if inverted. The mechanism of LCR function otherwise seems similar to other long-range regulators described above. The evidence is mounting in the direction of a model where DNA-looping creates a chromosomal structure in which target genes are clustered together, which seems to be -essential for maintaining open chromatin domain. +essential for maintaining an open chromatin domain. #### Epigenetic regulation Epigenetics in biology usually refers to \index{gene regulation} -constructions (chromatin structure, DNA methylation etc.) other than DNA \index{epigenetics} +constructions (chromatin structure, DNA methylation, etc.) other than DNA \index{epigenetics} sequence that influence gene regulation. In essence, epigenetic regulation is the regulation of DNA packing and structure, the consequence of which is gene expression regulation. A typical example is that DNA packing inside the nucleus can directly influence gene expression by creating accessible regions for transcription factors to bind. There are two main mechanisms in -epigenetic regulation: i) DNA modifications ii) histone modifications. Below, +epigenetic regulation: i) DNA modifications and ii) histone modifications. Below, we will introduce these two mechanisms. -##### DNA modifications such as methylation: +##### DNA modifications such as methylation DNA methylation is usually associated with gene silencing. \index{DNA methylation} DNA methyltransferase enzyme catalyzes the addition of a methyl group to cytosine of CpG dinucleotides (while in mammals the addition of methyl group is largely -restricted to CpG dinucleotides, methylation can occur in other bases as well) -. This covalent modification either interferes with transcription factor +restricted to CpG dinucleotides, methylation can occur in other bases as well). This covalent modification either interferes with transcription factor binding on the region, or methyl-CpG binding proteins induce the spread of repressive chromatin domains, thus the gene is silenced if its promoter has methylated CG dinucleotides. DNA methylation usually occurs in repeat -sequences to repress transposable elements, these elements when active can +sequences to repress transposable elements. These elements, when active, can jump around and insert them to random parts of the genome, potentially disrupting the genomic functions.\index{CpG island} DNA methylation is also related to a key core and proximal promoter element: CpG islands. CpG islands are usually -unmethylated, however for some genes CpG island methylation accompanies their -silenced expression. 
For example, during X-chromosome inactivation many CpG +unmethylated, however, for some genes, CpG island methylation accompanies their +silenced expression. For example, during X-chromosome inactivation, many CpG islands are heavily methylated and the associated genes are silenced. In -addition, in embryonic stem cell differentiation pluripotency-associated genes +addition, in embryonic stem cell differentiation, pluripotency-associated genes are silenced due to DNA methylation. Apart from methylation, there are other -kinds of DNA modifications present in mamalian genomes, such as hydroxy-methylation and +kinds of DNA modifications present in mammalian genomes, such as hydroxy-methylation and formylcytosine. These are other modifications under current research that are either intermediate or stable modifications with distinct functional associations. There are at least a dozen distinct DNA modifications observed when we look across @@ -365,9 +364,9 @@ all studied species [@sood2019dnamod]. -##### Histone modifications: -Histones are proteins that constitute \index{histone} nucleosome. In -eukaryotes, eight histones nucleosomes are wrapped around by DNA and build +##### Histone modifications +Histones are proteins that constitute a \index{histone} nucleosome. In +eukaryotes, eight histone proteins are wrapped by DNA and make up the nucleosome. They help super-coiling of DNA and inducing high-order structure called chromatin. In chromatin, DNA is either densely packed (called heterochromatin or closed chromatin), or it is loosely packed (called @@ -378,7 +377,7 @@ machinery and might therefore harbor active genes. Histones have long and unstructured N-terminal tails which can be covalently modified. The most studied modifications include acetylation, methylation and phosphorylation [@strahl2000language]. Using their tails, histones interact with neighboring nucleosomes and the -modifications on the tail affect the nucleosomes affinity to bind DNA and +modifications on the tail affect the nucleosomes' affinity to bind DNA and therefore influence DNA packaging around nucleosomes. Different modifications on histones are used in different combinations to program the activity of the genes during differentiation. Histone modifications have a distinct nomenclature, for \index{histone modification} @@ -389,16 +388,17 @@ Modifications Effect ------------ ------ H3K9ac Active promoters and enhancers H3K14ac Active transcription - H3K4me3/me2/me1 Active promoters and enhancers, H3K4me1 and H3K27ac is enhancer-specific + H3K4me3/me2/me1 Active promoters and enhancers, + H3K4me1 and H3K27ac is enhancer-specific H3K27ac H3K27ac is enhancer-specific H3K36me3 Active transcribed regions H3K27me3/me2/me1 Silent promoters H3K9me3/me2/me1 Silent promoters -Table: Table 1 Histone modifications and their effects. If more than one histone modification has the same effect, they are separated by commas. +Table: (\#tab:histoneMod) Histone modifications and their effects. If more than one histone modification has the same effect, they are separated by commas. Histone modifications are associated with a number of different -transcription-related conditions; some of them are summarized in Table 1. +transcription-related conditions; some of them are summarized in Table \@ref(tab:histoneMod). Histone modifications can indicate where the regulatory regions are and they can also indicate activity of the genes. 
From a gene regulatory perspective, maybe the most important modifications are the ones associated with enhancers and @@ -412,7 +412,7 @@ of developmental genes, and trithorax group proteins (trxG) for maintaining their active state [@henikoff2008nucleosome ; @schwartz2007polycomb]. PcGs and trxGs induce repressed or active states by catalyzing histone modifications or DNA methylation. Both the proteins bind PREs that can be on promoters or several kilobases away. Another protein -that induces histone modifications is CTCF. CTCF is associated with boundaries between active and repressive histone marks \index{CTCF protein} [@phillips2009ctcf]. This is due to the role of CTCF in regulating the 3D genome structure. Two CTCF binding sites that are far away from each other in linear distance can bound together in 3D space thus forming chromatin loops. +that induces histone modifications is CTCF. CTCF is associated with boundaries between active and repressive histone marks \index{CTCF protein} [@phillips2009ctcf]. This is due to the role of CTCF in regulating the 3D genome structure. Two CTCF binding sites that are far away from each other in linear distance can bind together in 3D space thus forming chromatin loops. ```{block2, transReg, type='rmdtip'} __Want to know more?__ - Transcriptional regulatory elements in the human genome: http://www.ncbi.nlm.nih.gov/pubmed/16719718 -- On metazoan promoters: types and transcriptional properties: +- On metazoan promoters: Types and transcriptional properties: http://www.ncbi.nlm.nih.gov/pubmed/22392219 -- General principles of regulatory sequence function +- General principles of regulatory sequence function: http://www.nature.com/nrg/journal/v15/n7/abs/nrg3684.html -- DNA methylation: roles in mammalian development +- DNA methylation: Roles in mammalian development: http://www.nature.com/doifinder/10.1038/nrg3354 -- Histone modifications and organization of the genome +- Histone modifications and organization of the genome: http://www.nature.com/nrg/journal/v12/n1/full/nrg2905.html -- DNA methylation and histone modifications are linked +- DNA methylation and histone modifications are linked: http://www.nature.com/nrg/journal/v10/n5/abs/nrg2540.html ``` @@ -442,18 +442,19 @@ http://www.nature.com/nrg/journal/v10/n5/abs/nrg2540.html #### Regulation by non-coding RNAs Recent years have witnessed an explosion in non-coding \index{gene regulation} -RNA (ncRNA)-related research \index{non-coding RNA (ncRNA)}. Many publications implicated ncRNAs as important +RNA (ncRNA)-related research\index{non-coding RNA (ncRNA)}. Many publications implicated ncRNAs as important regulatory elements. Plants and animals produce many different types of ncRNAs such as long non-coding RNAs (lncRNAs), -small-interferring RNAs (siRNAs), microRNAs (miRNAs), promoter-associated RNAs -(PARs) and small nucleolar RNAs (snoRNAs) [@morris2014rise]. lncRNAs are typically >200 bp long, +small interfering RNAs (siRNAs), microRNAs (miRNAs), promoter-associated RNAs -(PARs) and small nucleolar RNAs (snoRNAs) [@morris2014rise]. lncRNAs are typically >200-bp long, they are involved in epigenetic regulation by interacting with chromatin remodeling factors and they function in gene regulation. siRNAs are short -double-stranded RNAs which are involved in gene-regulation and transposon -control, they silence their target genes by cooperating with Argonaute proteins.
miRNAs are short single-stranded RNA molecules that interact with their +double-stranded RNAs which are involved in gene regulation and transposon +control; they silence their target genes by cooperating with Argonaute proteins. miRNAs are short single-stranded RNA molecules that interact with their target genes by using their complementary sequence and mark them for quicker -degradation. PARs may regulate gene expression as well: they are ~18-200bp -long ncRNAs originating from promoters of coding genes [@morris2014rise]. snoRNAs are also shown +degradation. PARs may regulate gene expression as well: they are approximately +18- to 200-bp-long ncRNAs originating from promoters of coding genes [@morris2014rise]. +snoRNAs are also shown to play roles in gene regulation, although they are mostly believed to guide ribosomal RNA modifications [@morris2014rise]. @@ -463,16 +464,17 @@ Splicing is regulated by regulatory elements on the pre-mRNA and proteins \index binding to those elements. Regulatory elements are categorized as splicing enhancers and repressors. They can be located either in exons -or introns. Depending of their activity and their locations there are four types of regulatory elements: +or introns. Depending on their activity and their locations there are four types of regulatory elements for splicing: + - exonic splicing enhancers (ESEs) - exonic splicing silencers (ESSs) - intronic splicing enhancers (ISEs) - intronic splicing silencers (ISSs). The majority of splicing repressors are heterogeneous nuclear ribonucleoproteins (hnRNPs). If splicing repressor protein bind -silencer elements they reduce the chance of nearby site to be -used as splice junction. On the contrary, splicing enhancers are sites to which splicing activator proteins bind and binding -on that region increases the probability that a nearby site will be used as a splice junction [@wang2008splicing]. Most of the activator proteins that bind to splicing enhancers are members of the SR protein family. Such proteins can recognize specific RNA recognition motifs. By regulating splicing exons can be skipped or included +silencer elements, they reduce the chance of a nearby site being +used as a splice junction. On the contrary, splicing enhancers are sites to which splicing activator proteins bind and binding +on that region increases the probability that a nearby site will be used as a splice junction [@wang2008splicing]. Most of the activator proteins that bind to splicing enhancers are members of the SR protein family. Such proteins can recognize specific RNA recognition motifs. By regulating splicing, exons can be skipped or included, which creates protein diversity [@wang2008splicing]. @@ -480,10 +482,10 @@ which creates protein diversity [@wang2008splicing]. __Want to know more?__ -- On miRNAs, Their genesis and modes of regulation [@bartel2004micrornas]: +- On miRNAs, their genesis, and modes of regulation [@bartel2004micrornas]: http://www.sciencedirect.com/science/article/pii/S0092867404000455 -- Functions of non coding RNAs [@morris2014rise] : +- Functions of non-coding RNAs [@morris2014rise]: http://www.nature.com/nrg/journal/v15/n6/abs/nrg3722.html - On splicing and its regulation [@wang2008splicing]: https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/18369186/ ``` @@ -493,33 +495,33 @@ http://www.nature.com/nrg/journal/v15/n6/abs/nrg3722.html ## Shaping the genome: DNA mutation Human and chimpanzee genomes are 98.8% similar. The 1.2% difference is what separates \index{mutation} -us from chimpanzees.
The further you move away from human in terms of evolutionary -distance the higher the difference gets. However, even between the members of -the same species differences in genome sequences exist. These differences are +us from chimpanzees. The further you move away from human species in terms of evolutionary +distance, the higher the difference gets. However, even between the members of +the same species, differences in genome sequences exist. These differences are due to a process called mutation which drives differences between individuals but also provides the fuel for evolution as the source of the genetic variation. Individuals with beneficial mutations can adapt to their surroundings -better than others and in time these mutations which are beneficial for survival +better than others and in time, these mutations, which are beneficial for survival, spread in the population due to a process called "natural selection". Selection -acts upon individuals with beneficial features which gives them an edge for +acts upon individuals with beneficial features, which gives them an edge for survival in a given environment. Genetic variation created by the mutations -in individuals provide the material on which selection can act upon. If the +in individuals provides the material on which selection can act. If the selection process goes for a long time in a relatively isolated environment that requires adaptation, this population can evolve into a different species given enough time. This is the basic idea behind evolution in a nutshell, -and without mutations providing the genetic variation there will be no evolution. +and without mutations providing the genetic variation, there would be no evolution. Mutations in the genome occur due to multiple reasons. First, DNA replication is not an error-free process. Before a cell division, the DNA is replicated with 1 mistake per 10^8 to 10^10 base-pairs. Second, mutagens such as UV light -can induce mutations on the genome. Third factor that contributes to mutation -is imperfect DNA repair. Every day any human cell suffers multiple instances of DNA damage. +can induce mutations on the genome. The third factor that contributes to mutation +is imperfect DNA repair. Every day, any human cell suffers multiple instances of DNA damage. DNA repair enzymes are there to cope with this damage but they are also not -error-free, depending on which DNA repair mechanism is used (there are multiple) +error-free, depending on which DNA repair mechanism is used (there are multiple), mistakes will be made at varying rates. -Mutations are classified by how many bases they effect, their effect on +Mutations are classified by how many bases they affect, their effect on DNA structure and gene function. By their effect on DNA structure the mutations \index{mutation} are classified as follows: @@ -534,18 +536,18 @@ the mutations \index{mutation} are classified as follows: Mutations can also be classified by their size as follows: -- __Point mutations__: mutations that involve one base. Substitutions, deletions and +- __Point mutations__: Mutations that involve one base. Substitutions, deletions and insertions are point mutations. They are also termed as single nucleotide polymorphisms (__SNPs__).\index{SNP} -- __Small-scale mutations__: mutations that involve several bases. -- __Large-scale mutations__: mutations which involve larger chromosomal regions. +- __Small-scale mutations__: Mutations that involve several bases. 
+- __Large-scale mutations__: Mutations which involve larger chromosomal regions. Transposable element insertions (where a segment of the genome - jumps to another region in the genome) and segmental duplications ( a large + jumps to another region in the genome) and segmental duplications (a large region is copied multiple times in tandem) are typical large scale mutations. - __Aneuploidies__: Insertions or deletions of whole chromosomes. -- __Whole-genome polyploidies__: duplications involving whole genome. +- __Whole-genome polyploidies__: Duplications involving whole genome. -Mutations by their effect on gene function can be classified as follows: +Mutations can be classified by their effect on gene function as follows: - __Gain-of-function mutations__: A type of mutation in which the altered gene product possesses a new molecular function or a new pattern of gene @@ -572,11 +574,10 @@ __Want to know more?__ ## High-throughput experimental methods in genomics -Most of the biological phenomena described above relating to transcription -, gene regulation or DNA mutation can be measured over the entire genome using +Most of the biological phenomena described above relating to transcription, gene regulation or DNA mutation can be measured over the entire genome using high-throughput experimental techniques, which are quickly becoming the standard for studying genome biology. In addition, their -applications in the clinic are also gaining momemntum: there are already diagnostic +applications in the clinic are also gaining momentum as there are already diagnostic tests that are based on these techniques. Some of the things that can be measured by high-throughput assays are as follows: @@ -597,8 +598,8 @@ to answer a new question. However, one has to keep in mind that these methods are at varying degrees of maturity and they all come with technical limitations and are not noise-free. Despite this, they are extremely useful for research and clinical -purposes. And, thanks to these methods we are able to sequence and annotate -genomes at a massive scale. +purposes. And, thanks to these methods, we are able to sequence and annotate +genomes on a massive scale. ### The general idea behind high-throughput techniques High-throughput methods aim to quantify or locate all or most of the genome that harbors @@ -608,42 +609,42 @@ biological feature. For example, if you want to measure expression of protein co genes you need to be able to extract mRNA molecules with special post-transcriptional alterations that protein-coding genes acquire, as done in many RNA sequencing (RNA-seq) experiments\index{RNA-seq}. If you are looking for transcription factor binding, you need to enrich -for the DNA fragments that are bound by the protein of interest, as it is done in ChIP-seq experiments.\index{ChIP-seq}This part depends on available molecular biology and chemistry techniques, and the final +for the DNA fragments that are bound by the protein of interest, as it is done in ChIP-seq experiments. \index{ChIP-seq}This part depends on available molecular biology and chemistry techniques, and the final product of this part is RNA or DNA fragments. Next, you need to be able to tell where these fragments are coming from in the genome and how many of them -are there. Microarrays \index{microarray} were the standard tool for the quantification step -until spread of sequencing techniques. 
In microarrays, one had to design complementary bases, called "oligos" or "probes", to the genetic material enriched via the experimental protocol. +there are. Microarrays \index{microarray} were the standard tool for the quantification step +until the spread of sequencing techniques. In microarrays, one had to design complementary bases, called "oligos" or "probes", to the genetic material enriched via the experimental protocol. If the enriched material is complementary to the oligos, a light signal will be produced and the intensity of the signal will be proportional to the amount of the genetic material pairing with that oligo. There will be more probes available for -hybridization (process of complementary bases forming bonds ), so the more fragments -available stronger the signal. For this to be able to work, you need to know +hybridization (process of complementary bases forming bonds), so the more fragments +available, stronger the signal. For this to be able to work, you need to know at least part of your genome sequence, and design probes. If you want to measure gene expression, your probes should overlap with genes and should be unique enough to not to bind sequences from other genes. This technology is now being replaced with sequencing technology, where you directly sequence your genetic \index{high-throughput sequencing} material. If you have the sequence of your fragments, you can align them back -to genome, see where they are coming from and count them. This is a better +to the genome, see where they are coming from, and count them. This is a better technology where the quantification is based on the real identity of fragments rather than based on hybridization to designed probes. -In summary HT techniques have the following steps, and this also summarized in +In summary, HT techniques have the following steps, and this also summarized in Figure \@ref(fig:HTassays): - Extraction: This is the step where you extract the genetic material of interest, RNA or DNA. - Enrichment: In this step, you enrich for the event you are interested in. For example, protein binding sites. In some cases such as whole-genome DNA - sequencing there is no need for enrichment step. You just get fragments of + sequencing, there is no need for enrichment step. You just get fragments of genomic DNA and sequence them. - Quantification: This is where you quantify your enriched material. Depending on the protocol you may need to quantify a control set as well, where you should see no enrichment or only background enrichment. -```{r,HTassays,fig.cap="Common steps of High-throughput assays in genome biology",fig.align = 'center',out.width='70%',ref.label='HTassays',echo=FALSE} +```{r,HTassays,fig.cap="Common steps of high-throughput assays in genome biology.",fig.align = 'center',out.width='70%',ref.label='HTassays',echo=FALSE} knitr::include_graphics("images/HTassays.png" ) ``` @@ -654,7 +655,7 @@ High-throughput sequencing, or massively parallel sequencing, is a collection of methods and technologies that can sequence DNA thousands/millions \index{high-throughput sequencing} of fragments at a time. This is in contrast to older technologies that can produce a limited -number of fragments at a time. Here, throughput refers to number +number of fragments at a time. Here, throughput refers to the number of sequenced bases per hour. The older low-throughput sequencing methods have ~100 times less throughput compared to modern high-throughput methods. 
The increased throughput gives the ability to measure biological features on a @@ -662,20 +663,20 @@ genome-wide scale in a shorter time frame. Similar to other high-throughput methods, sequencing-based methods also require an enrichment step. This step enriches for the features we are interested in. -The main difference of the sequencing based methods is the quantification step. +The main difference of the sequencing-based methods is the quantification step. In high-throughput sequencing, enriched fragments are put through the sequencer which outputs the sequences for the fragments. Due to limitations in current leading technologies, only -limited number of bases can be sequenced from the input fragments. However, +a limited number of bases can be sequenced from the input fragments. However, the length is usually enough to uniquely map the reads to the genome and quantify the input fragments. #### High-throughput sequencing data If there is a genome available, the reads are aligned to the genome and based -on the library preparation protocol different strategies are applied for analysis. A sequencing library is composed of fragments of RNA or DNA ready to be sequenced. The library preparation primarily depends on the experiment of interest. There are a number of library preparation protocols aimed at quantifying different signals from the genome. Some of the potential analysis strategies for different library-prep protocols and processed output of read alignments +on the library preparation protocol, different strategies are applied for analysis. A sequencing library is composed of fragments of RNA or DNA ready to be sequenced. The library preparation primarily depends on the experiment of interest. There are a number of library preparation protocols aimed at quantifying different signals from the genome. Some of the potential analysis strategies for different library-prep protocols and processed output of read alignments are depicted in Figure \@ref(fig:HTseq).\index{high-throughput sequencing} -For example, we maybe interested to quantify the gene expression. -The experimental protocol, called RNA sequencing- +For example, we may be interested in quantifying the gene expression. +The experimental protocol, called RNA sequencing, RNA-seq, enriches for fragments of RNA that are coming from protein coding genes.\index{RNA-seq} Upon alignment, we can calculate the coverage profile which gives us a read count per base along the genome. This information can be stored in a text file or @@ -683,15 +684,15 @@ specialized file formats to be used in subsequent analysis or visualization. We can also just count how many reads overlap with exons of each gene and record read counts per gene for further analysis. This essentially produces a table with gene names and read counts for different samples. As we will see in later -chapters, this is an essential information for statistical models that model +chapters, this is an essential information for statistical models for RNA-seq data. Furthermore, we can stack up the reads and count how many times a base position in a read mismatches the base in the genome. Read aligners allow for mismatches, and for this reason we can see reads with mismatches. This information can be used to identify SNPs, and can be stored again in a tabular format with the information of position and mismatch type and number of reads supporting the mismatch. 
The original algorithms are a bit more complicated than just counting mismatches but the general idea is the -same, what they are doing differently is trying to minimize false positive -rates by using filters, so that not every mismatch is recorded as SNP.\index{SNP} +same; what they are doing differently is trying to minimize false positive +rates by using filters, so that not every mismatch is recorded as a SNP.\index{SNP} @@ -705,14 +706,14 @@ knitr::include_graphics("images/HTseq.png" ) The sequencing technology is still evolving. Obtaining longer single-molecule reads, and preferably, being able to call base modifications \index{high-throughput sequencing} on the fly is the next frontier. -With longer reads, the genome asssembly will be easier for the regions +With longer reads, the genome assembly will be easier for the regions that have high repeat content. With single-molecule sequencing, we will be able to tell how many transcripts are present in a given cell population without relying on fragment amplification methods which can introduce biases. Another recent development is single-cell sequencing. Current technologies usually work on genetic material from thousands to millions of cells. This means that the -results you receive represents the population of cells that were used in the +results you receive represent the population of cells that were used in the experiment. However, there is a lot of variation between the same type of cells, but this variation is not observed at all. Newer sequencing techniques can work on single cells and give quantitative information on each cell. @@ -721,9 +722,9 @@ on single cells and give quantitative information on each cell. __Want to know more?__ -- Current and the future high-throughput sequencing technologies http://www.sciencedirect.com/science/article/pii/S1097276515003408 +- Current and the future high-throughput sequencing technologies: http://www.sciencedirect.com/science/article/pii/S1097276515003408 -- Illumina repository for different library preparation protocols for sequencing http://www.illumina.com/techniques/sequencing/ngs-library-prep/library-prep-methods.html +- Illumina repository for different library preparation protocols for sequencing: http://www.illumina.com/techniques/sequencing/ngs-library-prep/library-prep-methods.html ``` @@ -737,33 +738,33 @@ There are ~100 animal genomes sequenced as of 2016. On top these, there are many research projects from either individual labs or consortia that produce petabytes of auxiliary genomics data, such as ChIP-seq, RNA-seq, etc. \index{ChIP-seq} \index{RNA-seq} -There are two requirements to be able to visualize genomes and its associated -data, 1) you need to be able to work with a species +There are two requirements to be able to visualize genomes and their associated +data: 1) you need to be able to work with a species that has a sequenced genome and 2) you want to have annotation on that genome, meaning, at the very least, you want to know where the genes are. Most genomes after sequencing are quickly annotated with gene-predictions or -known gene sequences are mapped on to them, you can also +known gene sequences are mapped on to them, and you can also have conservation to other species to filter functional elements. 
If you -are working with a model organism or human you will also have a lot of +are working with a model organism or human, you will also have a lot of auxiliary information to help demarcate the functional regions such -as regulatory regions, ncRNAs, SNPs that are common in the population. -Or you might have disease or tissue specific data available. -The more the organism is worked on the more auxiliary data you will have. +as regulatory regions, ncRNAs, and SNPs that are common in the population. +Or you might have disease- or tissue-specific data available. +The more the organism is worked on, the more auxiliary data you will have. #### Accessing genome sequences and annotations via genome browsers -As someone intends to work with genomics, you will need to visualize a +As someone who intends to work with genomics, you will need to visualize a large amount of data to make biological inferences or simply check regions of interest in the genome visually. Looking at the genome case by case with all -the additional datasets is a necessary step to develop hypothesis and understand +the additional datasets is a necessary step to develop a hypothesis and understand the data. Many genomes and their associated data are available through genome browsers. A genome browser is a website or an application that helps you visualize the genome and all the available data associated - with it. Via genome browsers \index{genome browser}, you will be able to see where genes are in + with it. Via genome browsers\index{genome browser}, you will be able to see where genes are in relation to each other and other functional elements. You will be able to see gene structure. You will be able to see auxiliary data such as conservation, repeat content and SNPs. Here we review some of the popular @@ -776,29 +777,28 @@ and annotations for many species. You can search for genes or genome coordinates for the species of your interest. It is usually very responsive and allows you to visualize large amounts of data. In addition, it has multiple other tools that can be used in connection with the browser. One of the most useful tools -is _UCSC Table Browser_, which lets you download all the data you see on the +is the _UCSC Table Browser_, which lets you download all the data you see on the browser, including sequence data, in multiple formats. Users can upload data or provide links to the data -to visualize user specific data. +to visualize user-specific data. -__Ensembl:__ This is another online browser maintained by -European Bioinformatics Institute and the Wellcome Trust Sanger Institute in +__Ensembl:__ This is another online browser maintained by the European Bioinformatics Institute and the Wellcome Trust Sanger Institute in the UK, http://www.ensembl.org. -Similar to UCSC browser, users can visualize genes or genomic coordinates +Similar to the UCSC browser, users can visualize genes or genomic coordinates from multiple species and it also comes with auxiliary data. Ensembl is -associated with _Biomart_ \index{Biomart} tool which is similar to UCSC Table browser, can +associated with the _Biomart_ \index{Biomart} tool which is similar to UCSC Table browser, and can download genome data including all the auxiliary data set in multiple formats.\index{Ensembl Genome Browser} __IGV:__ Integrated genomics viewer (IGV) is a desktop application developed by Broad institute (https://www.broadinstitute.org/igv/). 
It is -developed to deal with large amounts of high-throughput sequencing data which +developed to deal with large amounts of high-throughput sequencing data, which is harder to view in online browsers. IGV can integrate your local sequencing results with online annotation on your desktop machine. This is useful when viewing sequencing data, especially alignments. Other browsers mentioned above -have similar features however you will need to make your large +have similar features, however you will need to make your large sequencing data available online somewhere before it can be viewed by browsers. \index{IGV Browser} @@ -806,21 +806,21 @@ sequencing data available online somewhere before it can be viewed by browsers. Genome browsers contain lots of auxiliary high-throughput data. However, there are many more public high-throughput data sets available and they are certainly not available through genome browsers. Normally, every high-throughput dataset -associated with a publication should be deposited to public archives. There +associated with a publication should be deposited in public archives. There are two major public archives we use to deposit data. One of them is -_Gene expression Omnibus (GEO)_ hosted at http://www.ncbi.nlm.nih.gov/geo/, -and the other one is _European nucleotide archive (ENA)_ hosted at +_Gene Expression Omnibus (GEO)_ hosted at http://www.ncbi.nlm.nih.gov/geo/, +and the other one is _European Nucleotide Archive (ENA)_ hosted at http://www.ebi.ac.uk/ena. These repositories accept high-throughput datasets and users can freely download and use these public data sets for their own research. Many data sets in these repositories are in their raw format, -for example the format the sequencer provides mostly. Some data sets will also +for example, the format the sequencer provides mostly. Some data sets will also have processed data but that is not a norm. -Apart from these repositories, there are multiple multi-national consortia dedicated to certain genome biology or disease related problems and +Apart from these repositories, there are multiple multi-national consortia dedicated to certain genome biology or disease-related problems and they maintain their own databases and provide access to processed and raw data. Some of these consortia are mentioned below. -Consortium | what is it for? +Consortium | What is it for? ------------------- | ------------------------------------- [ENCODE](https://www.encodeproject.org/) | Transcription factor binding sites, gene expression and epigenomics data for cell lines | https://www.encodeproject.org/ [Epigenomics Roadmap](http://www.roadmapepigenomics.org/) | Epigenomics data for multiple cell types | diff --git a/02-intro2R.Rmd b/02-intro2R.Rmd deleted file mode 100644 index 0b8f7bc..0000000 --- a/02-intro2R.Rmd +++ /dev/null @@ -1,1092 +0,0 @@ -# Introduction to R for Genomic Data Analysis {#Rintro} - - -```{r setup_introtoR_seq, include=FALSE} -knitr::opts_chunk$set(echo = TRUE, - message = FALSE, - error = FALSE, - cache = TRUE, - out.width = "55%", - fig.width = 5, - fig.align = 'center') -``` - - -The aim of computational genomics is to provide biological interpretation and insights from high -dimensional genomics data. Generally speaking, it is similar to any other kind -of data analysis endeavor but often times doing computational genomics will require domain specific knowledge and tools. 
- -As new high-throughput experimental techniques are on the rise, data analysis -capabilities are sought-after features for researchers. The aim of this chapter is to first familiarize the readers with data analysis steps and then provide basics of R programming within the context of genomic data analysis. R is a free statistical programming language that is popular among researchers and data miners to build software and analyze data. Although -basic R programming tutorials are easily accessible, we are aiming to introduce -the subject with the genomic context in the background. The examples and -narrative will always be from real-life situations when you try to analyze -genomic data with R. We believe tailoring material to the context of genomics -makes a difference when learning this programming language for sake of analyzing -genomic data. - -## Steps of (genomic) data analysis -Regardless of the analysis type, the data analysis has a common pattern. We will -discuss this general pattern and how it applies to genomics problems. The data analysis steps typically include data collection, quality check and cleaning, processing, modeling, visualization and reporting. Although, one expects to go through these steps in a linear fashion, it is normal to go back and repeat the steps with different parameters or tools. In practice, data analysis requires going through the same steps over and over again in order to be able to do a combination of the following: a) answering other related questions, b) dealing with data quality issues that are later realized, and, c) including new data sets to the analysis. - -We will now go through a brief explanation of the steps within the context of genomic data analysis. - -### Data collection -Data collection refers to any source, experiment or survey that provides data for the data analysis question you have. In genomics, data collection is done by high-throughput assays introduced in chapter \@ref(intro). One can also use publicly available data sets and specialized databases also mentioned in chapter \@ref(intro). How much data and what type of data you should collect depends on the question you are trying to answer and the technical and biological variability of the system you are studying. - -### Data quality check and cleaning -In general, data analysis almost always deals with imperfect data. It is -common to have missing values or measurements that are noisy. Data quality check -and cleaning aims to identify any data quality issue and clean it from the dataset. - -High-throughput genomics data is produced by technologies that could embed -technical biases into the data. If we were to give an example from sequencing, -the sequenced reads do not have the same quality of bases called. Towards the -ends of the reads, you could have bases that might be called incorrectly. Identifying those low quality bases and removing them will improve read mapping step. - -### Data processing -This step refers to processing the data to a format that is suitable for -exploratory analysis and modeling. Often times, the data will not come in ready -to analyze format. You may need to convert it to other formats by transforming -data points (such as log transforming, normalizing etc), or subset the data set -with some arbitrary or pre-defined condition. In terms of genomics, processing -includes multiple steps. Following the sequencing analysis example above, -processing will include aligning reads to the genome and quantification over genes or regions of interest. 
This is simply counting how many reads are covering your regions of interest. This quantity can give you ideas about how much a gene is expressed if your experimental protocol was RNA sequencing \index{RNA-seq}. This can be followed by some normalization to aid the next step. - -### Exploratory data analysis and modeling -This phase usually takes in the processed or semi-processed data and applies machine-learning or statistical methods to explore the data. Typically, one needs to see relationship between variables measured, relationship between samples based on the variables measured. At this point, we might be looking to see if the samples group as expected by the experimental design, are there outliers or any other anomalies ? After this step you might want to do additional clean up or re-processing to deal with anomalies. - -Another related step is modeling. This generally refers to modeling your variable of interest based on other variables you measured. In the context of genomics, it could be that you are trying to predict disease status of the patients from expression of genes you measured from their tissue samples. Then your variable of interest is the disease status and . This is generally called predictive modeling and could be solved with regression based or any other machine-learning methods. This kind of approach is generally called "predictive modeling". - - -Statistical modeling would also be a part of this modeling step, this can cover predictive modeling as well where we use statistical methods such as linear regression. Other analyses such as hypothesis testing, where we have an expectation and we are trying to confirm that expectation is also related to statistical modeling. A good example of this in genomics is the differential gene expression analysis. This can be formulated as comparing two data sets, in this case expression values from condition A and condition B, with the expectation that condition A and condition B has similar expression values. You will see more on this in chapter \@ref(stats). - -### Visualization and reporting -Visualization is necessary for all the previous steps more or less. But in the final phase, we need final figures, tables and text that describes the outcome of your analysis. This will be your report. In genomics, we use common data visualization methods as well as specific visualization methods developed or popularized by genomic data analysis. You will see many popular visualization methods in chapters \@ref(stats) and \@ref(genomicIntervals). - -### Why use R for genomics ? -R, with its statistical analysis -heritage, plotting features and rich user-contributed packages is one of the -best languages for the task of analyzing genomic data. -High-dimensional genomics datasets are usually suitable to -be analyzed with core R packages and functions. On top of that, Bioconductor and CRAN have an -array of specialized tools for doing genomics-specific analysis. Here is a list of computational genomics tasks that can be completed using R. - -#### Data cleanup and processing - -Most of general data clean up, such as removing incomplete columns and values, reorganizing and transforming data, these tasks can be achieved using R. In addition, with the help of packages R can connect to databases in various formats such as mySQL, mongoDB, etc., and query and get the data to R environment using database specific tools. - -On top of these, genomic data specific processing and quality check can be achieved via R/Bioconductor packages. 
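
Returning to the general clean-up tasks mentioned above, a minimal sketch of what this can look like in base R is shown below (the small expression table is made up just for illustration):

```{r cleanupSketch}
# a made-up expression table with a missing value
expr=data.frame(gene=c("PAX6","ZIC2","OCT4"),
                sample1=c(5.1,NA,2.3),
                sample2=c(4.8,1.2,2.5))

# keep only the rows without missing values
expr.clean=expr[complete.cases(expr),]

# log-transform the numeric columns
expr.clean[,-1]=log2(expr.clean[,-1]+1)
expr.clean
```
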
For example, sequencing read quality checks and even \index{high-throughput sequencing} HT-read alignments \index{read alignment} can be achieved via R packages. - -#### General data anaylsis and exploration - -Most genomics data sets are suitable for application of general data analysis tools. In some cases, you may need to preprocess the data to get it to a state that is suitable for application of such tools. Here is a non-exhaustive list of what kind of things can be done via R. You will see popular data analysis methods in chapters \@ref(stats),\@ref(unsupervisedLearning) and \@ref(supervisedLearning). - - - unsupervised data analysis: clustering (k-means, hierarchical), matrix factorization -(PCA, ICA etc.) - - supervised data analysis: generalized linear models, support vector machines, random forests - -#### Genomics-specific data analysis methods - R/Bioconductor gives you access to multitude of other bioinformatics specific algorithms. Here are some of the things you can do. We will touch upon many od the following methods in chapter \@ref(genomicIntervals) and onwards. - - - Sequence analysis: TF binding motifs, GC content and CpG counts of a given DNA sequence - - Differential expression (or arrays and sequencing based measurements) - - Gene set/Pathway analysis: What kind of genes are enriched in my gene set - - Genomic Interval operations such as Overlapping CpG islands with transcription start sites, and filtering based on overlaps - - Overlapping aligned reads with exons and counting aligned reads per gene - -#### Visualization -Visualization is an important part of all data analysis techniques including computational genomics. Again, you can use core visualization techniques in R and also genomics specific ones with the help of specific packages. Here are some of the things you can do with R. - - - Basic plots: Histograms, scatter plots, bar plots, box plots, heatmaps - - ideograms and circos plots for genomics provides visualization of different features over the whole genome. - - meta-profiles of genomic features, such as read enrichment over all promoters - - Visualization of quantitative assays for given locus in the genome - -## Getting started with R -Download and install R (http://cran.r-project.org/) and RStudio (http://www.rstudio.com/) if you do not have them already. Rstudio is optional but it is a great tool if you are just starting to learn R. -You will need specific data sets to run the code snippets in this book, we have explained how to install and use the data in the [Data for the book] section in [Preface]. If you haven not use Rstudio before, we reccomend running it and familiarizing yourself with it first. To put it simply, this interface combines multiple features you will need while analyzing data. You can see your code, how it is executed, plots you make and your data all in one interface. - - -### Installing packages -R packages are add-ons to base R that help you achieve additional tasks that are not directly supported by base R. It is by the action of these extra functionality that R excels as a tool for computational genomics. Bioconductor project (http://bioconductor.org/) is a dedicated package repository for computational biology related packages. However main package repository of R, called CRAN, has also computational biology related packages. In addition, R-Forge (http://r-forge.r-project.org/), GitHub (https://github.com/), and Bitbucket (http://www.bitbucket.org) are some of the other locations where R packages might be hosted. 
The packages needed for the code snippets in this book and how to install them are explained in the [Packages needed to run the book code] section in the [Preface] of the book. - -You can install CRAN packages using `install.packages()`. (# is the comment character in R) -```{r installpack1,eval=FALSE} -# install package named "randomForests" from CRAN -install.packages("randomForests") -``` -You can install bioconductor packages with a specific installer script -```{r installpack2,eval=FALSE} -# get the installer package if you don't have -install.packages("BiocManager") - -# install bioconductor package "rtracklayer" -BiocManager::install("rtracklayer") -``` -You can install packages from github using `install_github()` function from `devtools` package. -```{r installpack3,eval=FALSE} -library(devtools) -install_github("hadley/stringr") -``` -Another way to install packages are from the source. -```{r installpack4,eval=FALSE} -# download the source file -download.file("http://goo.gl/3pvHYI", - destfile="methylKit_0.5.7.tar.gz") -# install the package from the source file -install.packages("methylKit_0.5.7.tar.gz", - repos=NULL,type="source") -# delete the source file -unlink("methylKit_0.5.7.tar.gz") -``` -You can also update CRAN and Bioconductor packages. -```{r installpack5,eval=FALSE} -# updating CRAN packages -update.packages() - -# updating bioconductor packages -if (!requireNamespace("BiocManager", quietly = TRUE)) - install.packages("BiocManager") -BiocManager::install() -``` - -### Installing packages in custom locations -If you will be using R on servers or computing clusters rather than your personal computer it is unlikely that you will have administrator access to install packages. In that case, you can install packages in custom locations by telling R where to look for additional packages. This is done by setting up an `.Renviron` file in your home directory and add the following line: -``` -R_LIBS=~/Rlibs -``` - -This tells R that “Rlibs” directory at your home directory will be the first choice of locations to look for packages and install packages (The directory name and location is up to you above is just an example). You should go and create that directory now. After that, start a fresh R session and start installing packages. From now on, packages will be installed to your local directory where you have read-write access. - -### Getting help on functions and packages -You can get help on functions by `help()` and `help.search()` functions. You can list the functions in a package with `ls()` function - -```{r getHelp,eval=FALSE} -library(MASS) -ls("package:MASS") # functions in the package -ls() # objects in your R enviroment -# get help on hist() function -?hist -help("hist") -# search the word "hist" in help pages -help.search("hist") -??hist - -``` -#### More help needed? -In addition, check package vignettes for help and practical understanding of the functions. All Bionconductor packages have vignettes that walk you through example analysis. Google search will always be helpful as well, there are many blogs and web pages that have posts about R. R-help mailing list (https://stat.ethz.ch/mailman/listinfo/r-help), Stackoverflow.com and R-bloggers.com are usually source of good and reliable information. - - -## Computations in R -R can be used as an ordinary calculator, some say it is an over-grown calculator. Here are some examples. Remember `#` is the comment character. The comments give details about the operations in case they are not clear. 
-```{r basics, eval=FALSE} -2 + 3 * 5 # Note the order of operations. -log(10) # Natural logarithm with base e -5^2 # 5 raised to the second power -3/2 # Division -sqrt(16) # Square root -abs(3-7) # Absolute value of 3-7 -pi # The number -exp(2) # exponential function -# This is a comment line - - -``` - -## Data structures -R has multiple data structures. If you are familiar with excel you can think of a single excel sheet as a table and data structures as building blocks of that table. Most of the time you will deal with tabular data sets or you will want to transform your raw data to a tabular data set, and you will try to manipulate this tabular data set in some way. For example, you may want to take sub-sections of the table or extract all the values in a column. For these and similar purposes, it is essential to know what are the common data structures in R and how they can be used. R deals with named data structures, this means you can give names to data structures and manipulate or operate on them using those names. It will be clear soon what we mean by this if "named data structures" does not ring a bell. - -### Vectors -Vectors are one of the core R data structures. It is basically a list of elements of the same type (numeric,character or logical). Later you will see that every column of a table will be represented as a vector. R handles vectors easily and intuitively. You can create vectors with the `c()` function, however that is not the only way. The operations on vectors will propagate to all the elements of the vectors.\index{R Programming Language!vector} - -```{r vectors} -x<-c(1,3,2,10,5) #create a vector named x with 5 components -x = c(1,3,2,10,5) -x -y<-1:5 #create a vector of consecutive integers y -y+2 #scalar addition -2*y #scalar multiplication -y^2 #raise each component to the second power -2^y #raise 2 to the first through fifth power -y #y itself has not been unchanged -y<-y*2 -y #it is now changed -r1<-rep(1,3) # create a vector of 1s, length 3 -length(r1) #length of the vector -class(r1) # class of the vector -a<-1 # this is actually a vector length one -``` - -The standard assignment operator in R is `<-`. This operator is preferentially used in books and documentation. However, it is also possible to use `=` operator for the assignment. We have an example in -the above code snippet and throughout the book we use `<-` and `=` interchangeably for assignment. - -### Matrices -A matrix refers to a numeric array of rows and columns. You can think of it as a stacked version of vectors where each row or column is a vector. One of the easiest ways to create a matrix is to combine vectors of equal length using `cbind()`, meaning 'column bind'.\index{R Programming Language!matrix} - -```{r matrices} -x<-c(1,2,3,4) -y<-c(4,5,6,7) -m1<-cbind(x,y);m1 -t(m1) # transpose of m1 -dim(m1) # 2 by 5 matrix -``` -You can also directly list the elements and specify the matrix: -```{r matrix2} -m2<-matrix(c(1,3,2,5,-1,2,2,3,9),nrow=3) -m2 -``` -Matrices and the next data structure **data frames** are tabular data structures. You can subset them using `[]` and providing desired rows and columns to subset. Figure \@ref(fig:slicingDataFrames) shows how that works conceptually. 
- -```{r,slicingDataFrames,fig.cap="slicing/subsetting of a matrix and a data frame",fig.align = 'center',out.width='80%',echo=FALSE} -knitr::include_graphics("images/slicingDataFrames.png" ) -``` - -### Data Frames - -A data frame is more general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.). A data frame can be constructed by `data.frame()` function. For example, we illustrate how to construct a data frame from genomic intervals or coordinates.\index{R Programming Language!data frame} - -```{r dfcreate} -chr <- c("chr1", "chr1", "chr2", "chr2") -strand <- c("-","-","+","+") -start<- c(200,4000,100,400) -end<-c(250,410,200,450) -mydata <- data.frame(chr,start,end,strand) -#change column names -names(mydata) <- c("chr","start","end","strand") -mydata # OR this will work too -mydata <- data.frame(chr=chr,start=start,end=end,strand=strand) -mydata -``` -There are a variety of ways to extract the elements of a data frame. You can extract certain columns using column numbers or names, or you can extract certain rows by using row numbers. You can also extract data using logical arguments, such as extracting all rows that has a value in a column larger than your threshold. - -```{r dfSlice} -mydata[,2:4] # columns 2,3,4 of data frame -mydata[,c("chr","start")] # columns chr and start from data frame -mydata$start # variable start in the data frame -mydata[c(1,3),] # get 1st and 3rd rows -mydata[mydata$start>400,] # get all rows where start>400 -``` - -### Lists -at list in R is an ordered collection of objects (components). A list allows you to gather a variety of (possibly unrelated) objects under one name. You can create a list with the `list()` function. Each object or element in the list has a numbered position and can have names. Below we show a few examples of how to create lists. -```{r makeList} -# example of a list with 4 components -# a string, a numeric vector, a matrix, and a scalar -w <- list(name="Fred", - mynumbers=c(1,2,3), - mymatrix=matrix(1:4,ncol=2), - age=5.3) -w -``` -You can extract elements of a list using the ``[[]]``, the double square-bracket, convention using either its position in the list or its name. -```{r sliceList} -w[[3]] # 3rd component of the list -w[["mynumbers"]] # component named mynumbers in list -w$age -``` - -### Factors -Factors are used to store categorical data. They are important for statistical modeling since categorical variables are treated differently in statistical models than continuous variables. This ensures categorical data treated accordingly in statistical models. -```{r makeFactor} -features=c("promoter","exon","intron") -f.feat=factor(features) -``` -Important thing to note is that when you are reading a data.frame with read.table() or creating a data frame with `data.frame()` character columns are stored as factors by default, to change this behavior you need to set `stringsAsFactors=FALSE` in `read.table()` and/or `data.frame()` function arguments.\index{R Programming Language!factor} - - -## Data types -There are four common data types in R, they are `numeric`, `logical`, `character` and `integer`. 
All these data types can be used to create vectors natively.\index{R Programming Language!data types}

```{r dataTypes}
#create a numeric vector x with 5 components
x<-c(1,3,2,10,5)
x
#create a logical vector x
x<-c(TRUE,FALSE,TRUE)
x
# create a character vector
x<-c("sds","sd","as")
x
class(x)
# create an integer vector
x<-c(1L,2L,3L)
x
class(x)
```

## Reading and writing data
Most of the genomics data sets are in the form of genomic intervals associated with a score. That means the data will mostly be in a table format with columns denoting chromosome, start positions, end positions, strand and score. One of the popular formats is the BED format, used primarily by the UCSC Genome Browser \index{UCSC Genome Browser}, but most other genome browsers and tools will support the BED file format \index{BED file}. We have all the annotation data in BED format. You will read more about data formats in Chapter \@ref(genomicIntervals). In R, you can easily read tabular format data with the `read.table()` function. \index{R Programming Language!reading in data}

```{r readData}
enhancerFilePath=system.file("extdata",
                             "subset.enhancers.hg18.bed",
                             package="compGenomRData")
cpgiFilePath=system.file("extdata",
                         "subset.cpgi.hg18.bed",
                         package="compGenomRData")
# read enhancer marker BED file
enh.df <- read.table(enhancerFilePath, header = FALSE)

# read CpG island BED file
cpgi.df <- read.table(cpgiFilePath, header = FALSE)

# check the first lines to see what the data looks like
head(enh.df)
head(cpgi.df)
```

You can save your data by writing it to disk as a text file. A data frame or matrix can be written out by using the `write.table()` function. Now let us write out `cpgi.df`; we will write it out as a tab-separated file. Pay attention to the arguments.

```{r writeData,tidy=FALSE,eval=FALSE}
write.table(cpgi.df,file="cpgi.txt",quote=FALSE,
            row.names=FALSE,col.names=FALSE,sep="\t")
```

You can save your R objects directly into a file using `save()` and `saveRDS()` and load them back in with `load()` and `readRDS()`. By using these functions you can save any R object, whether or not it is a data frame or a matrix.

```{r writeData2,eval=FALSE}
save(cpgi.df,enh.df,file="mydata.RData")
load("mydata.RData")
# saveRDS() can save one object at a time
saveRDS(cpgi.df,file="cpgi.rds")
x=readRDS("cpgi.rds")
head(x)
```

One important thing is that with `save()` you can save many objects at a time, and when they are loaded into memory with `load()` they retain their variable names. For example, in the above code, when you use `load("mydata.RData")` in a fresh R session, objects named `cpgi.df` and `enh.df` will be created. That means you have to remember what names you gave to the objects before saving them. In contrast, when you save an object with `saveRDS()` and read it back with `readRDS()`, the name of the object is not retained; you need to assign the output of `readRDS()` to a new variable (`x` in the above code chunk).\index{R Programming Language!writing data}

### Reading large files
Reading large files that contain tables with the base R function `read.table()` might take a very long time. Therefore, there are additional packages that provide faster functions to read the files. The `data.table` \index{R Packages!\texttt{data.table}} and `readr` \index{R Packages!\texttt{readr}} packages provide this functionality. Below, we show how to use them. With the parameters provided, these functions will return output equivalent to that of `read.table()`.

```{r fastreader, eval=FALSE}
library(data.table)
df.f=fread(enhancerFilePath, header = FALSE,data.table=FALSE)

library(readr)
df.f2=read_table(enhancerFilePath, col_names = FALSE)
```

## Plotting in R with base graphics
R has great support for plotting and customizing plots by default. This basic capability for plotting in R is referred to as "base graphics" or "R base graphics". We will show only a few examples below. Let us sample 50 values from the normal distribution \index{normal distribution} and plot them as a histogram. A histogram is an approximate representation of a distribution. Bars show how frequently we observe certain values in our sample.\index{R Programming Language!plotting} The resulting histogram from the code chunk below is shown in Figure \@ref(fig:sampleForPlots).

```{r sampleForPlots,out.width='40%',fig.width=5,fig.cap="Histogram of values sampled from the normal distribution."}
# sample 50 values from the normal distribution
# and store them in vector x
x<-rnorm(50)
hist(x) # plot the histogram of those values
```

We can modify all the plots by providing certain arguments to the plotting function. Now let's give a title to the plot using the `'main'` argument. We can also change the color of the bars using the `'col'` argument. You can simply provide the name of the color. Below, we are using `'red'` for the color. See Figure \@ref(fig:makeHist) for the result of this code chunk.

```{r makeHist,out.width='50%',fig.width=5, fig.cap="Histogram in red color"}
hist(x,main="Hello histogram!!!",col="red")
```

Next, we will make a scatter plot. Scatter plots are one of the most common plots you will encounter in data analysis. We will sample another set of 50 values and plot them against the ones we sampled earlier. A scatter plot shows the values of two variables for a set of data points. It is useful to visualize relationships between two variables. It is frequently used in connection with correlation and linear regression. There are other variants of scatter plots which show the density of the points with different colors. We will show examples of those in the following chapters. The scatter plot from our sampling experiment is shown in Figure \@ref(fig:makeScatter). Notice that, in addition to `main`, we used the `"xlab"` and `"ylab"` arguments to give labels to the plot. You can customize the plots even more than this. See `?plot` and `?par` for more arguments that can help you customize the plots.

```{r makeScatter,out.width='50%',fig.width=5, fig.cap="Scatterplot example"}
# randomly sample 50 points from the normal distribution
y<-rnorm(50)
# plot a scatter plot
# control x-axis and y-axis labels
plot(x,y,main="scatterplot of random samples",
     ylab="y values",xlab="x values")
```

We can also plot boxplots for the vectors `x` and `y`. Boxplots depict groups of numerical data through their quartiles. The edges of the box denote the 1st and 3rd quartiles, and the line that crosses the box is the median. Whiskers are usually defined using the interquartile range (IQR):

`lowerWhisker=Q1-1.5[IQR] and upperWhisker=Q3+1.5[IQR]`

In addition, outliers can be depicted as dots. In this case, outliers are the values that remain outside the whiskers. The resulting plot from the code snippet below is shown in Figure \@ref(fig:makeBoxplot).

```{r makeBoxplot,out.width='50%',fig.width=5,fig.cap="Boxplot example"}
boxplot(x,y,main="boxplots of random samples")
```

Next up is the bar plot, which you can create with the `barplot()` function.
We are going to plot four imaginary percentage values and color them with two colors, and this time we will also show how to draw a legend on the plot using `legend()` function. The resulting plot is in Figure \@ref(fig:makebarplot). - -```{r makebarplot,out.width='50%',fig.width=5,tidy=FALSE,fig.cap="barplot example"} -perc=c(50,70,35,25) -barplot(height=perc, - names.arg=c("CpGi","exon","CpGi","exon"), - ylab="percentages",main="imagine %s", - col=c("red","red","blue","blue")) -legend("topright",legend=c("test","control"), - fill=c("red","blue")) -``` - -### Combining multiple plots -In R, we can combine multiple plots in the same graphic. For this purpose, we use `par()` function for simple combinations. More complicated arrangements with different sizes of sub-plots can be created with `layout()` function. Below we will show how to combine two plot side-by-side using `par(mfrow=c(1,2))`. The `mfrow=c(nrows, ncols)` construct will create a matrix of `nrows` x `ncols` plots that are filled in by row. The following code will produce a histogram and a scatterplot stacked side by side. The results is shown in Figure \@ref(fig:combineBasePlots). If you want to see the plots on top of each other simply change `mfrow=c(1,2)` to `mfrow=c(2,1)`. - -```{r combineBasePlots,fig.cap="Combining two plots, a histogram and a scatterplot with `par()` function.",fig.height=3.5} -par(mfrow=c(1,2)) # - -# make the plots -hist(x,main="Hello histogram!!!",col="red") -plot(x,y,main="scatterplot", - ylab="y values",xlab="x values") - -``` - -### Saving plots -If you want to save your plots to an image file there are couple of ways of doing that. Normally, you will have to do the following: - 1. Open a graphics device - 2. Create the plot - 3. Close the graphics device - -```{r savePlot,eval=FALSE} -pdf("mygraphs/myplot.pdf",width=5,height=5) -plot(x,y) -dev.off() -``` - Alternatively, you can first create the plot then copy the plot to a graphic device. -```{r savePlot2,eval=FALSE} -plot(x,y) -dev.copy(pdf,"mygraphs/myplot.pdf",width=7,height=5) -dev.off() -``` - -## Plotting in R with ggplot2 -In R, there are other plotting systems besides “base graphics”, which is what we have shown until now. There is another popular plotting system called `ggplot2`\index{R Packages!\texttt{ggplot2}} which implements a different logic when constructing the plots. This system or logic is known as “grammar of graphics”. This system defines a plot or graphics as a combination of different components. For example, in the scatterplot in \@ref(fig:makeScatter), we have the points which are geometric shapes, we have the coordinate system and scales of data. In addition, data transformations are also part of a plot. In Figure \@ref(fig:makeHist), the histogram has a binning operation and it puts the data into bins before displaying it as geometric shapes, the bars. `ggplot2` system and its implementation of “grammar of graphics”^[This is a concept developed by Leland Wilkinson and popularized in R community by Hadley Wickham: https://doi.org/10.1198/jcgs.2009.07098] allows us to build the plot layer by layer using the predefined components. - - -Next we will see how this works in practice. Let’s start with a simple scatterplot using `ggplot2`. In order to make basic plots in `ggplot2` one needs to combine different components. First, we need the data and its transformation to a geometric object, for a scatter plot this would be mapping data to points, for histograms it would be binning the data and making bars. 
Second, we need the scales and coordinate system; this generates axes and legends so that we can see the values on the plot. The last component is the plot annotation, such as the plot title and the background.

The main `ggplot2` function, called `ggplot()`, requires a data frame to work with, and this data frame is its first argument, as shown in the code snippet below. The second thing you will notice is the `aes()` function inside the `ggplot()` call. This function defines which columns in the data frame map to x and y coordinates, and whether points should be colored or shaped based on the values in another column. These elements are the "aesthetic" elements; this is what we observe in the plot. The last line in the code represents the geometric object to be plotted. These geometric objects define the type of the plot. In this case, the object is a point, indicated by the `geom_point()` function. Another peculiar thing in the code is the `+` operation. In `ggplot2`, this operation is used to add layers and modify the plot. The resulting scatter plot from the code snippet below can be seen in Figure \@ref(fig:ggScatterchp3).

```{r ggScatterchp3,fig.cap="Scatter plot with ggplot2"}
library(ggplot2)

myData=data.frame(col1=x,col2=y)

# the data is myData and I'm using col1 and col2
# columns on x and y axes
ggplot(myData, aes(x=col1, y=col2)) +
  geom_point() # map x and y as points
```

Now, let's recreate the histogram we created before. For this, we will start again with the `ggplot()` function; we are interested only in the x-axis of the histogram, so we will use only one column of the data frame. Then, we will add the histogram layer with the `geom_histogram()` function. In addition, we will show how to modify your plot further by adding an additional layer with the `labs()` function, which controls the axis labels and titles. The resulting plot from the code chunk below is shown in Figure \@ref(fig:ggHistChp3).

```{r ggHistChp3,fig.cap="Histogram made with ggplot2, with additional modifications introduced by the `labs()` function.",fig.width=5}
ggplot(myData, aes(x=col1)) +
  geom_histogram() + # add a histogram layer
  labs(title="Histogram for a random variable", x="my variable", y="Count")
```

We can also plot boxplots using `ggplot2`. Let's recreate the boxplot we made in Figure \@ref(fig:makeBoxplot). This time we will have to put all our data into a single data frame, with an extra column denoting the group of our values. In the base graphics case, we could just input variables containing different vectors. However, `ggplot2` does not work like that, and we need to create a data frame in the right format to use the `ggplot()` function. Below, we first concatenate the `x` and `y` vectors and create a second column denoting the group for the vectors. In this case, the x-axis will be the "group" variable, which is just a character denoting the group, and the y-axis will be the numeric "values" for the `x` and `y` vectors. You can see how this is passed to the `aes()` function below. The resulting plot is shown in Figure \@ref(fig:ggBoxplotchp3).

```{r ggBoxplotchp3,fig.cap="Boxplots using ggplot2"}
# data frame with a group column showing which
# group the vectors x and y belong to
myData2=rbind(data.frame(values=x,group="x"),
              data.frame(values=y,group="y"))

# x-axis will be group and y-axis will be values
ggplot(myData2, aes(x=group,y=values)) +
  geom_boxplot()
```

### Combining multiple plots
-There are different options for combining multiple plots. If we are trying to make similar plots for the subsets of the same data set, we can use faceting. This is a built-in and very useful feature of `ggplot2`. This feature is frequently used when investigating whether patterns are the same or different in different conditions or subsets of the data. It can be used via `facet_grid()` function. Below, we will make two histograms facetted by the `group` variable in the input data frame. We will be using the same data frame we created for the boxplot in the previous section. The resulting plot is in Figure \@ref(fig:facetHistChp3). - -(ref:reffacetHistChp3) Combining two plots using `ggplot2::facet_grid()` function - - -```{r facetHistChp3,fig.cap='(ref:reffacetHistChp3)',fig.height=3} - -ggplot(myData2, aes(x=values)) + - geom_histogram() +facet_grid(.~group) -``` - -Facetting only works when you are using the subsets of the same data set. However, you may want to combine different type of plots from different data sets. The base R functions such as `par()` and `layout()` will not work with `ggplot2` because it uses a different graphics system and this system does not recognize base R functionality for plotting. However, there are multiple ways you can combine plots from `ggplot2`. One way is using the `cowplot` package. This package aligns the individual plots in a grid and will help you create publication ready compound plots. Below, we will show how to combine a histogram and a scatter plot side by side. The resulting plot is shown in Figure \@ref(fig:cowPlotChp3). - - -(ref:refcowPlotChp3) Combining a histogram and scatterplot using `cowplot` package. The plots are labeled as A and B using the arguments in `plot_grid()` function - -```{r cowPlotChp3,fig.cap='(ref:refcowPlotChp3)',fig.height=3.5,fig.width=7} -library(cowplot) -# histogram -p1 <- ggplot(myData2, aes(x=values,fill=group)) + - geom_histogram() -# scatterplot -p2 <- ggplot(myData, aes(x=col1, y=col2)) + - geom_point() - -# plot two plots in a grid and label them as A and B -plot_grid(p1, p2, labels = c('A', 'B'), label_size = 12) - -``` - - - -### ggplot2 and tidyverse -`ggplot2` is actually part of a larger ecosystem. You will need packages from this ecosytem when you want to use `ggplot2` in a more sophisticated manner or if you need additional functionality that is not readily available in base R or other packages. For example, when you want to make more complicated plots using `ggplot2`, you will need to modify your data frames to the formats required by the `ggplot()` function, and you will need to learn about `dplyr`\index{R Packages!\texttt{dplyr}} and `tidyr`\index{R Packages!\texttt{tidyr}} packages for data formatting purposes. If you are working with strings `stringr` package might have functionality that is not available in base R. There are many more packages that users find it useful in `tidyverse` and it could be important to know about this ecosystem of R packages. - - - -```{block2, ggplotNote, type='rmdtip'} - -__Want to know more ?__ - -- `ggplot2` has a free online book written by Hadley Wickham: https://ggplot2-book.org/ -- The `tidyverse` packages and the ecosystem is described in their website: https://www.tidyverse.org/. There you will find extensive documentation and resources on `tidyverse` packages. - -``` - -## Functions and control structures (for, if/else etc.) - -### User defined functions -Functions are useful for transforming larger chunks of code to re-usable pieces of code. 
Generally, if you need to execute certain tasks with variable parameters then it is time you write a function. A function in R takes different arguments and returns a definite output, much like mathematical functions. Here is a simple function takes two arguments, `x` and `y`, and returns the sum of their squares \index{R Programming Language!functions}. - -```{r makeOwnFunc} -sqSum<-function(x,y){ -result=x^2+y^2 -return(result) -} -# now try the function out -sqSum(2,3) -``` - - -Functions can also output plots and/or messages to the terminal. Here is a function that prints a message to the terminal: -```{r makeOwnFunc2} -sqSumPrint<-function(x,y){ -result=x^2+y^2 -cat("here is the result:",result,"\n") -} -# now try the function out -sqSumPrint(2,3) -``` - -Sometimes we would want to execute a certain part of the code only if certain condition is satisfied. This condition can be anything from the type of an object (Ex: if object is a matrix execute certain code), or it can be more complicated such as if object value is between certain thresholds. Let us see how they can be used. They can be used anywhere in your code, now we will use it in a function. - -```{r makeOwnFunc3,eval=FALSE} -cpgi.df <- read.table("intro2R_data/data/subset.cpgi.hg18.bed", header = FALSE) -# function takes input one row -# of CpGi data frame -largeCpGi<-function(bedRow){ - cpglen=bedRow[3]-bedRow[2]+1 - if(cpglen>1500){ - cat("this is large\n") - } - else if(cpglen<=1500 & cpglen>700){ - cat("this is normal\n") - } - else{ - cat("this is short\n") - } -} -largeCpGi(cpgi.df[10,]) -largeCpGi(cpgi.df[100,]) -largeCpGi(cpgi.df[1000,]) -``` - -### Loops and looping structures in R -When you need to repeat a certain task or execute a function multiple times, you can do that with the help of loops. A loop will execute the task until a certain condition is reached. The loop below is called a “for-loop” and it executes the task sequentially 10 times. -```{r forloop} -for(i in 1:10){ # number of repetitions -cat("This is iteration") # the task to be repeated -print(i) -} -``` -The task above is a bit pointless, normally in a loop, you would want to do something meaningful. Let us calculate the length of the CpG islands we read in earlier. Although this is not the most efficient way of doing that particular task, it serves as a good example for looping. The code below will be execute hundred times, and it will calculate the length of the CpG islands for the first 100 islands in -the data frame (by subtracting the end coordinate from the start coordinate).\index{R Programming Language!loops} - - -**Note:**If you are going to run a loop that has a lot of repetitions, it is smart to try the loop with few repetitions first and check the results. This will help you make sure the code in the loop works before executing it for thousands of times. - -```{r forloop2} -# this is where we will keep the lenghts -# for now it is an empty vector -result=c() -# start the loop -for(i in 1:100){ - #calculate the length - len=cpgi.df[i,3]-cpgi.df[i,2]+1 - #append the length to the result - result=c(result,len) -} -# check the results -head(result) -``` - -#### apply family functions instead of loops -R has other ways of repeating tasks that tend to be more efficient than using loops. They are known as the "apply" family of functions, which include `apply`, `lapply`,`mapply` and `tapply` (and some other variants). All of these functions apply a given function to a set of instances and returns the result of those functions for each instance. 
The difference between them is that they take different types of inputs. For example, `apply()` works on data frames or matrices and applies the function on each row or column of the data structure. `lapply()` works on lists or vectors and applies a function which takes the list element as an argument. Next we will demonstrate how to use `apply()` on a matrix. The example applies the sum function to the rows of a matrix; it basically sums up the values in each row of the matrix, which is conceptualized in Figure \@ref(fig:applyConcept).\index{R Programming Language!apply family functions}

```{r,applyConcept,fig.cap="apply concept in R",fig.align = 'center',out.width='80%',echo=FALSE}
knitr::include_graphics("images/apply.png" )
```

```{r showapply1}
mat=cbind(c(3,0,3,3),c(3,0,0,0),c(3,0,0,3),c(1,1,0,0),c(1,1,1,0),c(1,1,1,0))
result<-apply(mat,1,sum)
result
# OR you can define the function as an argument to apply()
result<-apply(mat,1,function(x) sum(x))
result
```

Notice that we used a second argument equal to 1, which indicates that the rows of the matrix/data frame will be the input for the function. If we change the second argument to 2, this will indicate that the columns should be the input for the function that will be applied. See Figure \@ref(fig:applyConcept2) for the visualization of `apply()` on columns.

```{r,applyConcept2,fig.cap="apply function on columns",fig.align = 'center',out.width='60%',echo=FALSE}
knitr::include_graphics("images/apply2.png" )
```

```{r showapply2}
result<-apply(mat,2,sum)
result
```

Next, we will use `lapply()`, which applies a function on a list or a vector. The function that will be applied is a simple function that takes the square of a given number.

```{r showapply3}
input=c(1,2,3)
lapply(input,function(x) x^2)
```

`mapply()` is another member of the apply family; it can apply a function on an unlimited set of vectors/lists, like a version of `lapply()` that can handle multiple vectors as arguments. In this case, the arguments to `mapply()` are the function to be applied and the sets of parameters to be supplied as arguments to that function. This is conceptualized in Figure \@ref(fig:mapplyConcept): the function to be applied takes two arguments and sums them up. The arguments to be summed up are provided as the vectors `Xs` and `Ys`. `mapply()` applies the summation function to each pair from the `Xs` and `Ys` vectors. Notice that the order of the input function and the extra arguments is different for `mapply()`.

```{r,mapplyConcept,fig.cap="mapply concept",fig.align = 'center',out.width='50%',echo=FALSE}
knitr::include_graphics("images/mapply.png" )
```

```{r showMapply1}
Xs=0:5
Ys=c(2,2,2,3,3,3)
result<-mapply(function(x,y) sum(x,y),Xs,Ys)
result
```

#### apply family functions on multiple cores
If you have large data sets, apply family functions can be slow (although probably still better than for loops). If that is the case, you can easily use the parallel versions of those functions from the `parallel` package. These functions essentially divide your tasks into smaller chunks, run them on separate CPUs, and merge the results from those parallel operations. This concept is visualized in the figure below; `mcmapply()` runs the summation function on three different processors.
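
As a minimal sketch of how this might look in code (assuming a Unix-like system, since forking with more than one core is not available on Windows), `mcmapply()` from the `parallel` package can be used as a drop-in replacement for `mapply()`:

```{r mcmapplySketch, eval=FALSE}
library(parallel)

# same summation as above, but the (x,y) pairs are
# distributed over 3 cores instead of a single one
Xs=0:5
Ys=c(2,2,2,3,3,3)
result<-mcmapply(function(x,y) sum(x,y),Xs,Ys,mc.cores=3)
result
```
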
Each processor executes the summation function on a part of the data set, and the results are merged and returned as a single vector that has the same order as the input parameters Xs and Ys.\index{R Programming Language!apply family functions} - -```{r,mcapplyConcept,fig.cap="mcapplyconcept",fig.align = 'center',out.width='50%',echo=FALSE} -knitr::include_graphics("images/mcmapply.png" ) -``` - -#### Vectorized Functions in R -The above examples have been put forward to illustrate functions and loops in R because functions using sum() are not complicated and easy to understand. You will probably need to use loops and looping structures with more complicated functions. In reality, most of the operations we used do not need the use of loops or looping structures because there are already vectorized functions that can achieve the same outcomes, meaning if the input arguments are R vectors the output will be a vector as well, so no need for loops or vectorization. - -For example, instead of using `mapply()` and `sum()` functions we can just use + operator and sum up Xs and Ys. -```{r vectorized1} -result=Xs+Ys -result -``` -In order to get the column or row sums, we can use the vectorized functions `colSums()` and `rowSums()`. -```{r vectorized2} -colSums(mat) -rowSums(mat) -``` -However, remember that not every function is vectorized in R, use the ones that are. But sooner or later, apply family functions will come in handy. - - -## Exercises - -### Computations in R - -1. Sum 2 and 3, use `+` operator. [Difficulty: **Beginner**] - -2. Take the square root of 36, use `sqrt()`. [Difficulty: **Beginner**] - -3. Take the log10 of 1000, use function `log10()`.[Difficulty: **Beginner**] - -4. Take the log2 of 32, use function `log2()`.[Difficulty: **Beginner**] - -5. Assign the sum of 2,3 and 4 to variable x. [Difficulty: **Beginner**] - -6. Find the absolute value of `5 - 145` using `abs()` function. [Difficulty: **Beginner**] - -7. Calculate the square root of 625, divide it by 5 and assign it to variable `x`. -Ex: `y= log10(1000)/5`, the previous statement takes log10 of 1000, divides it -by 5 and assigns the value to variable y. [Difficulty: **Beginner**] - -8. Multiply the value you get from previous exercise with 10000, assign it variable x -Ex: `y=y*5`, multiplies y with 5 and assigns the value to y. -**KEY CONCEPT:** results of computations or arbitrary values can be stored in variables we can re-use those variables later on and over-write them with new values. -[Difficulty: **Beginner**] - -### Data structures in R - - -10. Make a vector of 1,2,3,5 and 10 using `c()`, assign it to `vec` variable. [Difficulty: **Beginner**] -Ex: `vec1=c(1,3,4)` makes a vector out of 1,3,4. - -11. Check the length of your vector with length(). -Ex: `length(vec1)` should return 3. [Difficulty: **Beginner**] - -12. Make a vector of all numbers between 2 and 15. -Ex: `vec=1:6` makes a vector of numbers between 1 and 6, assigns to `vec` variable. [Difficulty: **Beginner**] - -13. Make a vector of 4s repeated 10 times using `rep()` function. Ex: `rep(x=2,times=5)` makes a vector of 2s repeated 5 times. [Difficulty: **Beginner**] - -14. Make a logical vector with TRUE, FALSE values of length 4, use `c()`. -Ex: `c(TRUE,FALSE)`. [Difficulty: **Beginner**] - -15. Make a character vector of gene names PAX6,ZIC2,OCT4 and SOX2. -Ex: `avec=c("a","b","c")` a makes a character vector of a,b and c. [Difficulty: **Beginner**] - -16. Subset the vector using `[]` notation, get 5th and 6th elements. 
-Ex: `vec1[1]` gets the first element. `vec1[c(1,3)]` gets 1st and 3rd elements. [Difficulty: **Beginner**] - -17. You can also subset any vector using a logical vector in `[]`. Run the following: - -```{r subsetLogicExercise, eval=FALSE} -myvec=1:5 -# the length of the logical vector -# should be equal to length(myvec) -myvec[c(TRUE,TRUE,FALSE,FALSE,FALSE)] -myvec[c(TRUE,FALSE,FALSE,FALSE,TRUE)] -``` -[Difficulty: **Beginner**] - -18. `==,>,<, >=, <=` operators create logical vectors. See the results of the following operations: - -```{r,eval=FALSE} -myvec > 3 -myvec == 4 -myvec <= 2 -myvec != 4 -``` -[Difficulty: **Beginner**] - -19. Use `>` operator in `myvec[ ]` to get elements larger than 2 in `myvec` which is described above. [Difficulty: **Beginner**] - - -20. make a 5x3 matrix (5 rows, 3 columns) using `matrix()`. -Ex: `matrix(1:6,nrow=3,ncol=2)` makes a 3x2 matrix using numbers between 1 and 6. [Difficulty: **Beginner**] - -21. What happens when you use `byrow = TRUE` in your matrix() as an additional argument? -Ex: `mat=matrix(1:6,nrow=3,ncol=2,byrow = TRUE)`. [Difficulty: **Beginner**] - -22. Extract first 3 columns and first 3 rows of your matrix using `[]` notation.[Difficulty: **Beginner**] - -23. Extract last two rows of the matrix you created earlier. -Ex: `mat[2:3,]` or `mat[c(2,3),]` extracts 2nd and 3rd rows. -[Difficulty: **Beginner**] - - -24. Extract the first two columns and run `class()` on the result. -[Difficulty: **Beginner**] - -25. Extract first column and run `class()` on the result, compare with the above exercise. -[Difficulty: **Beginner**] - -26. Make a data frame with 3 columns and 5 rows, make sure first column is sequence -of numbers 1:5, and second column is a character vector. -Ex: `df=data.frame(col1=1:3,col2=c("a","b","c"),col3=3:1) # 3x3 data frame`. -Remember you need to make 3x5 data frame. [Difficulty: **Beginner**] - -27. Extract first two columns and first two rows. -**HINT:** Use the same notation as matrices. [Difficulty: **Beginner**] - -28. Extract last two rows of the data frame you made. -**HINT:** Same notation as matrices. [Difficulty: **Beginner**] - -29. Extract last two columns using column names of the data frame you made. [Difficulty: **Beginner**] - - -30. Extract second column using column names. -You can use `[]` or `$` as in lists, use both in two different answers. [Difficulty: **Beginner**] - -31. Extract rows where 1st column is larger than 3. -**HINT:** you can get a logical vector using `>` operator -,logical vectors can be used in `[]` when subsetting. [Difficulty: **Beginner**] - -32. Extract rows where 1st column is larger than or equal to 3. -[Difficulty: **Beginner**] - -33. Convert data frame to the matrix. **HINT:** use `as.matrix()`. -Observe what happens to numeric values in the data frame. [Difficulty: **Beginner**] - - -34. Make a list using `list()` function, your list should have 4 elements -the one below has 2. Ex: `mylist= list(a=c(1,2,3),b=c("apple,"orange"))` -[Difficulty: **Beginner**] - -35. Select the 1st element of the list you made using `$` notation. -Ex: `mylist$a` selects first element named "a". -[Difficulty: **Beginner**] - - -36. Select the 4th element of the list you made earlier using `$` notation. [Difficulty: **Beginner**] - -```{r,echo=FALSE,eval=FALSE} -mylist$d -``` - -37. Select the 1st element of your list using `[ ]` notation. -Ex: `mylist[1]` selects first element named "a", you get a list with one element. 
`mylist["a"]` selects first element named "a", you get a list with one element. -[Difficulty: **Beginner**] - -38. select the 4th element of your list using `[ ]` notation. [Difficulty: **Beginner**] - - -39. Make a factor using factor(), with 5 elements. -Ex: `fa=factor(c("a","a","b"))`. [Difficulty: **Beginner**] - -40. Convert a character vector to factor using `as.factor()`. -First, make a character vector using `c()` then use `as.factor()`. -[Difficulty: **Intermediate**] - -41. Convert the factor you made above to character using `as.character()`. [Difficulty: **Beginner**] - - - -### Reading in and writing data out in R - -1. Read CpG island (CpGi) data from the compGenomRData package `CpGi.table.hg18.txt`, this is a tab-separated file, store it in a variable called `cpgi`. -Use `cpgFilePath=system.file("extdata", - "CpGi.table.hg18.txt", - package="compGenomRData")` -to get the file path within the installed `compGenomRData` package. -[Difficulty: **Beginner**] - -2. Use `head()` on CpGi to see first few rows. -[Difficulty: **Beginner**] - -3. Why doesn't the following work? see `sep` argument at `help(read.table)`.[Difficulty: **Beginner**] - -```{r readCpGex, eval=FALSE} -cpgtFilePath=system.file("extdata", - "CpGi.table.hg18.txt", - package="compGenomRData") -cpgtFilePath -cpgiSepComma=read.table(cpgtFilePath,header=TRUE,sep=",") -head(cpgiSepComma) -``` - -4. What happens when you set `stringsAsFactors=FALSE` in `read.table()` ? -``` -cpgiHF=read.table("intro2R_data/data/CpGi.table.hg18.txt", - header=FALSE,sep="\t", - stringsAsFactors=FALSE) -``` -[Difficulty: **Beginner**] - -5. Read only first 10 rows of the CpGi table. [Difficulty: **Beginner/Intermediate**] - -6. Use `cpgFilePath=system.file("extdata","CpGi.table.hg18.txt",package="compGenomRData")` to get the file path, then use -`read.table()` with argument `header=FALSE`. Use `head()` to see the results. [Difficulty: **Beginner**] - - -7. Write CpG islands to a text file called "my.cpgi.file.txt". Write the file -to your home folder, you can use `file="~/my.cpgi.file.txt"` in linux. `~/` denotes -home folder.[Difficulty: **Beginner**] - - -8. Same as above but this time make sure use `quote=FALSE`,`sep="\t"` and `row.names=FALSE` arguments. -Save the file to "my.cpgi.file2.txt" and compare it with "my.cpgi.file.txt". [Difficulty: **Beginner**] - - -9. Write out the first 10 rows of 'cpgi' data frame. -**HINT:** use subsetting for data frames we learned before. [Difficulty: **Beginner**] - - - -10. Write the first 3 columns of 'cpgi' data frame.[Difficulty: **Beginner**] - -11. Write CpG islands only on chr1. **HINT:** use subsetting with `[]`, feed a logical vector using `==` operator.[Difficulty: **Beginner/Intermediate**] - - -12. Read two other data sets "rn4.refseq.bed" and "rn4.refseq2name.txt" -with `header=FALSE`, assign them to df1 and df2 respectively. -They are again included in the compGenomRData package, and you -can use `system.file()` function to get the file paths. [Difficulty: **Beginner**] - - -13. Use `head()` to see what is inside of the the data frames above.[Difficulty: **Beginner**] - -14. Merge data sets using `merge()` and assign the results to variable named 'new.df', and use `head()` to see the results. [Difficulty: **Intermediate**] - - - -### Plotting in R - - -Please run the following code snippet for the rest of the exercises. -```{r plotExSeed} -set.seed(1001) -x1=1:100+rnorm(100,mean=0,sd=15) -y1=1:100 -``` - -1. Make a scatter plot using `x1` and `y1` vectors generated above. 
[Difficulty: **Beginner**] - - -2. Use `main` argument to give a title to `plot()` as in `plot(x,y,main="title")`.[Difficulty: **Beginner**] - - -3. Use `xlab` argument to set a label to x-axis.Use `ylab` argument to set a label to y-axis.[Difficulty: **Beginner**] -4. Once you have the plot, run the following expression in R console. `mtext(side=3,text="hi there")` does. *HINT:* `mtext` stands for margin text. [Difficulty: **Beginner**] - -5. See what `mtext(side=2,text="hi there")` does. check your plot after execution. [Difficulty: **Beginner**] - -6. You can use `paste()` as 'text' argument in mtext() try that, you need to re-plot. -your plot first. **HINT:** `mtext(side=3,text=paste(...))` -See how `paste()` is used for below. - -```{r pasteExample} -paste("Text","here") -myText=paste("Text","here") -myText -``` -Use *mtext()* and *paste()* to put a margin text on the plot. [Difficulty: **Beginner/Intermediate**] - -7. `cor()` calculates correlation between two vectors. -Pearson correlation is a measure of the linear correlation (dependence) -between two variables X and Y. Try using `cor()` function on `x1` and `y1` variables. [Difficulty: **Beginner/Intermediate**] - -8. Try use `mtext()`,`cor()` and `paste()` to display correlation coefficient on your scatterplot ? [Difficulty: **Intermediate**] - -9. Change the colors of your plot using `col` argument. -Ex: `plot(x,y,col="red")`[Difficulty: **Beginner**] - -10. Use `pch=19` as an argument in your `plot()` command.[Difficulty: **Beginner**] - - -11. Use `pch=18` as an argument to your `plot()` command.[Difficulty: **Beginner**] - -12. Make histogram of `x1` with `hist()` function.Histogram is a graphical representation of the data distribution.[Difficulty: **Beginner**] - - -13. You can change colors with 'col', add labels with 'xlab', 'ylab', and add a 'title' with 'main' arguments. Try all these in a histogram. -[Difficulty: **Beginner**] - ```{r,echo=FALSE,eval=FALSE} -hist(x1,main="title") -``` - -14. Make boxplot of y1 with `boxplot()`.[Difficulty: **Beginner**] - -15. Make boxplots of `x1` and `y1` vectors in the same plot.[Difficulty: **Beginner**] - - -16. In boxplot use `horizontal = TRUE` argument. [Difficulty: **Beginner**] - - -17. make multiple plots with `par(mfrow=c(2,1))` - - run `par(mfrow=c(2,1))` - - make a boxplot - - make a histogram -[Difficulty: **Beginner/Intermediate**] - - -18. Do the same as above but this time with `par(mfrow=c(1,2))`.[Difficulty: **Beginner/Intermediate**] - - -19. Save your plot using "Export" button in Rstudio.[Difficulty: **Beginner**] - -20. You can make a scatter plot showing density -of points rather than points themselves. If you use points it looks like this: - -```{r colorScatterEx,out.width='50%'} - -x2=1:1000+rnorm(1000,mean=0,sd=200) -y2=1:1000 -plot(x2,y2,pch=19,col="blue") -``` - -If you use `smoothScatter()` function, you get the densities. -```{r smoothScatterEx,out.width='50%'} -smoothScatter(x2,y2, - colramp=colorRampPalette(c("white","blue", - "green","yellow","red"))) -``` - -Now, plot with `colramp=heat.colors` argument and then use a custom color scale using the following argument. -``` -colramp = colorRampPalette(c("white","blue", "green","yellow","red"))) -``` -[Difficulty: **Beginner/Intermediate**] - -### Functions and control structures (for, if/else etc.) -Read CpG island data as shown below for the rest of the exercises. 
- -```{r CpGexReadchp2,eval=TRUE} -cpgtFilePath=system.file("extdata", - "CpGi.table.hg18.txt", - package="compGenomRData") -cpgi=read.table(cpgtFilePath,header=TRUE,sep="\t") -head(cpgi) -``` - -1. Check values at perGc column using a histogram. -'perGc' column in the data stands for GC percent => percentage of C+G nucleotides. [Difficulty: **Beginner**] - -2. Make a boxplot for 'perGc' column. [Difficulty: **Beginner**] - - - -3. Use if/else structure to decide if given GC percent high, low or medium. -If it is low, high, or medium. low < 60, high>75, medium is between 60 and 75 -use greater or less than operators < or > . -Fill in the values in the in code below, where it is written 'YOU_FILL_IN' -[Difficulty: **Intermediate**] -```{r functionEvExchp2,echo=TRUE,eval=FALSE} - -GCper=65 - - # check if GC value is lower than 60, - # assign "low" to result - if('YOU_FILL_IN'){ - result="low" - cat("low") - } - else if('YOU_FILL_IN'){ # check if GC value is higher than 75, - #assign "high" to result - result="high" - cat("high") - }else{ # if those two conditions fail then it must be "medium" - result="medium" - } - -result - -``` - -4. Write a function that takes a value of GC percent and decides -if it is low, high, or medium. low < 60, high>75, medium is between 60 and 75. -Fill in the values in the in code below, where it is written 'YOU_FILL_IN'. [Difficulty: **Intermediate/Advanced**] - - -``` -GCclass<-function(my.gc){ - - YOU_FILL_IN - - return(result) -} -GCclass(10) # should return "low" -GCclass(90) # should return "high" -GCclass(65) # should return "medium" -``` - - -5. Use a for loop to get GC percentage classes for `gcValues` below. Use the function -you wrote above.[Difficulty: **Intermediate/Advanced**] - -``` -gcValues=c(10,50,70,65,90) -for( i in YOU_FILL_IN){ - YOU_FILL_IN -} -``` - - -6. Use `lapply` to get to get GC percentage classes for `gcValues`. Example: - -```{r lapplyExExerciseChp2,eval=FALSE} -vec=c(1,2,4,5) -power2=function(x){ return(x^2) } - lapply(vec,power2) -``` -[Difficulty: **Intermediate/Advanced**] - - - -7. Use sapply to get values to get GC percentage classes for `gcValues`. [Difficulty: **Intermediate**] - -8. Is there a way to decide on the GC percentage class of given vector of `GCpercentages` -without using if/else structure and loops ? if so, how can you do it? -**HINT:** subsetting using < and > operators. -[Difficulty: **Intermediate**] diff --git a/03-StatsForGenomics.Rmd b/03-StatsForGenomics.Rmd index ea0da4e..a7bfad2 100644 --- a/03-StatsForGenomics.Rmd +++ b/03-StatsForGenomics.Rmd @@ -12,28 +12,26 @@ knitr::opts_chunk$set(echo = TRUE, This chapter will summarize statistics methods frequently used in computational genomics. As these fields are continuously evolving, the -techniques introduced here do not form an exhaustive list but mostly corner -stone methods +techniques introduced here do not form an exhaustive list but mostly cornerstone methods that are often and still being used. In addition, we focused on giving intuitive and -practical understanding of the methods with relevant examples from the field. -If you want to dig deeper into statistics and math, beyond what is described +practical understanding of the methods with relevant examples from the field. If you want to dig deeper into statistics and math, beyond what is described here, we included appropriate references with annotation after each major section. 
## How to summarize collection of data points: The idea behind statistical distributions -In biology and many other fields data is collected via experimentation. +In biology and many other fields, data is collected via experimentation. The nature of the experiments and natural variation in biology makes it impossible to get the same exact measurements every time you measure something. For example, if you are measuring gene expression values for a certain gene, say PAX6, and let's assume you are measuring expression -per sample and cell with any method( microarrays, rt-qPCR, etc.). You will not \index{gene expression} -get the same expression value even if your samples are homogeneous. Due +per sample and cell with any method (microarrays, rt-qPCR, etc.). You will not \index{gene expression} +get the same expression value even if your samples are homogeneous, due to technical bias in experiments or natural variation in the samples. Instead, we would like to describe this collection of data some other way -that represents the general properties of the data. The Figure \@ref(fig:pax6ReplicatesChp3) shows a sample of -20 expression values from PAX6 gene. +that represents the general properties of the data. Figure \@ref(fig:pax6ReplicatesChp3) shows a sample of +20 expression values from the PAX6 gene. -```{r pax6ReplicatesChp3,fig.align='center', out.width='50%',echo=FALSE,warning=FALSE,fig.height=5.6,fig.cap="Expression of PAX6 gene in 20 replicate experiments"} +```{r pax6ReplicatesChp3,fig.align='center', out.width='50%',echo=FALSE,warning=FALSE,fig.height=5.6,fig.cap="Expression of the PAX6 gene in 20 replicate experiments."} set.seed(1) old.par <- par() a=rnorm(20,mean=6,sd=0.7) @@ -51,30 +49,28 @@ par(old.par) ``` -### Describing the central tendency: mean and median -As seen in the figure above, the points from this sample are distributed around -a central value and the histogram below the dot plot shows number of points in +### Describing the central tendency: Mean and median +As seen in Figure \@ref(fig:pax6ReplicatesChp3), the points from this sample are distributed around +a central value and the histogram below the dot plot shows the number of points in each bin. Another observation is that there are some bins that have more points than others. If we want to summarize what we observe, we can try to represent the collection of data points with an expression value that is typical to get, something that represents the general tendency we observe on the dot plot and the histogram. This value is -sometimes called central +sometimes called the central value or central tendency, and there are different ways to calculate such a value. -In the figure above, we see that all the values are spread around 6.13 (red line), -and that is indeed what we call mean value of this sample of expression values. +In Figure \@ref(fig:pax6ReplicatesChp3), we see that all the values are spread around 6.13 (red line), +and that is indeed what we call the mean value of this sample of expression values. It can be calculated with the following formula $\overline{X}=\sum_{i=1}^n x_i/n$, where $x_i$ is the expression value of an experiment and $n$ is the number of -expression value obtained from the experiments. In R, `mean()` function will calculate the \index{mean} -mean of a provided vector of numbers. This is called a "sample mean". In reality -the possible values of PAX6 expression for all cells (provided each cell is of the -identical cell type and is in identical conditions) are much much more than 20. 
-If we had the time and the funding to sample all cells and measure PAX6 expression we would -get a collection values that would be called, in statistics, a "population". In -our case the population will look like the left hand side of the Figure \@ref(fig:pax6MorereplicatesChp3). What we have done with +expression values obtained from the experiments. In R, the `mean()` function will calculate the \index{mean} +mean of a provided vector of numbers. This is called a "sample mean". In reality, there are many more than 20 possible PAX6 expression values (provided each cell is of the +identical cell type and is in identical conditions). If we had the time and the funding to sample all cells and measure PAX6 expression we would +get a collection of values that would be called, in statistics, a "population". In +our case, the population will look like the left hand side of the Figure \@ref(fig:pax6MorereplicatesChp3). What we have done with our 20 data points is that we took a sample of PAX6 expression values from this population, and calculated the sample mean. -```{r pax6MorereplicatesChp3,out.width='75%',fig.width=6.5,echo=FALSE,warning=FALSE,fig.cap="Expression of all possible PAX6 gene expressions measures on all available biological samples (left). Expression of PAX6 gene from statistical sample, a random subset, from the population of biological samples (Right). "} +```{r pax6MorereplicatesChp3,out.width='75%',fig.width=6.5,echo=FALSE,warning=FALSE,fig.cap="Expression of all possible PAX6 gene expression measures on all available biological samples (left). Expression of the PAX6 gene from the statistical sample, a random subset from the population of biological samples (right). "} df=data.frame(x=rnorm(10000,6,0.7)) @@ -96,26 +92,23 @@ hist(a,xlim=c(2,10),col="red",border="white",main="", par(old.par) ``` -The mean of the population is calculated the same way but traditionally +The mean of the population is calculated the same way but traditionally the Greek letter $\mu$ is used to denote the population mean. Normally, we would not -have access to the population and we will use sample mean and other quantities +have access to the population and we will use the sample mean and other quantities derived from the sample to estimate the population properties. This is the basic -idea behind statistical inference which we will see this in action in later +idea behind statistical inference, which we will see in action in later sections as well. We estimate the population parameters from the sample parameters and there is some uncertainty associated with those estimates. We will be trying to assess those uncertainties and make decisions in the presence of those uncertainties. \index{mean} We are not yet done with measuring central tendency. -There are other ways to describe it, such as the median value. -Mean can be affected by outliers easily\index{outliers}. -If certain values are very high or low from the -bulk of the sample this will shift mean towards those outliers. However, median -is not affected by outliers. It is simply the value in a distribution where half -of the values are above and the other half is below. In R, `median()` function -will calculate the mean of a provided vector of numbers. \index{median} - -Let's create a set of random numbers and calculate their mean and median using +There are other ways to describe it, such as the median value. The +mean can be affected by outliers easily\index{outliers}. 
+If certain values are very high or low compared to the +bulk of the sample, this will shift the mean toward those outliers. However, the median is not affected by outliers. It is simply the value in a distribution where half +of the values are above and the other half are below. In R, the `median()` function +will calculate the median of a provided vector of numbers. \index{median} Let's create a set of random numbers and calculate their mean and median using R. ```{r runifMeanMedChp3} #create 10 random numbers from uniform distribution @@ -127,19 +120,18 @@ median(x) ``` -### Describing the spread: measurements of variation +### Describing the spread: Measurements of variation Another useful way to summarize a collection of data points is to measure -how variable the values are. You can simply describe the range of the values -, such as minimum and maximum values. You can easily do that in R with `range()` +how variable the values are. You can simply describe the range of the values, +such as the minimum and maximum values. You can easily do that in R with the `range()` function. A more common way to calculate variation is by calculating something called "standard deviation" or the related quantity called "variance". This is a -quantity that shows how variable the values are, a value around zero indicates +quantity that shows how variable the values are. A value around zero indicates there is not much variation in the values of the data points, and a high value indicates high variation in the values. The variance is the squared distance of data points from the mean. Population variance\index{variance} is again a quantity we usually -do not have access to and is simply calculate as follows $\sigma^2=\sum_{i=1}^n \frac{(x_i-\mu)^2}{n}$, where $\mu$ is the population mean, $x_i$ is the $i$th -data point in the population and $n$ is the population size. However, when the -we have only access to a sample this formulation is biased. It means that it +do not have access to and is simply calculated as follows $\sigma^2=\sum_{i=1}^n \frac{(x_i-\mu)^2}{n}$, where $\mu$ is the population mean, $x_i$ is the $i$th +data point in the population and $n$ is the population size. However, when we only have access to a sample, this formulation is biased. That means that it underestimates the population variance, so we make a small adjustment when we calculate the sample variance, denoted as $s^2$: @@ -151,13 +143,13 @@ $\overline{X}$ is the sample mean.} $$ -The sample standard deviation is simply the square-root of the sample variance, $s=\sqrt{\sum_{i=1}^n \frac{(x_i-\overline{X})^2}{n-1}}$. +The sample standard deviation is simply the square root of the sample variance, $s=\sqrt{\sum_{i=1}^n \frac{(x_i-\overline{X})^2}{n-1}}$. The good thing about standard deviation is that it has the same unit as the mean so it is more intuitive. -We can calculate sample standard deviation and variation with `sd()` and `var()` -functions in R. These functions take vector of numeric values as input and +We can calculate the sample standard deviation and variance with the `sd()` and `var()` +functions in R. These functions take a vector of numeric values as input and calculate the desired quantities. Below we use those functions on a randomly generated vector of numbers. 
```{r varSdChp3} @@ -168,19 +160,19 @@ sd(x) One potential problem with the variance is that it could be affected by outliers.\index{outliers} The points that are too far away from the mean will have a large -affect on the variance even though there might be few of them. +effect on the variance even though there might be few of them. A way to measure variance that could be less affected by outliers is -looking at where bulk of the distribution is. How do we define where the bulk is? -One common way is to look at the the difference between 75th percentile and 25th +looking at where the bulk of the distribution is. How do we define where the bulk is? +One common way is to look at the difference between 75th percentile and 25th percentile, this effectively removes a lot of potential outliers which will be\index{outliers} towards the edges of the range of values. -This is called interquartile range \index{interquartile range} , and -can be easily calculated using R via `IQR()` function and the quantiles of a vector -is calculated with `quantile()` function. +This is called the interquartile range\index{interquartile range}, and +can be easily calculated using R via the `IQR()` function and the quantiles of a vector +are calculated with the `quantile()` function. Let us plot the boxplot for a random vector and also calculate IQR using R. In the boxplot (Figure \@ref(fig:boxplot2Chp3)), 25th and 75th percentiles are the edges of the box, and -the median is marked with a thick line going through roughly middle the box. +the median is marked with a thick line cutting through the box. ```{r IQRChp3} x=rnorm(20,mean=6,sd=0.7) IQR(x) @@ -191,7 +183,7 @@ quantile(x) boxplot(x,horizontal = T) ``` -```{r boxplot2Chp3,fig.height=5.1,out.width='50%',echo=FALSE,warnings=FALSE,message=FALSE,fig.cap="Boxplot showing 25th percentile and 75th percentile and median for a set of points sample from a normal distribution with mean=6 and standard deviation=0.7"} +```{r boxplot2Chp3,fig.height=5.1,out.width='50%',echo=FALSE,warnings=FALSE,message=FALSE,fig.cap="Boxplot showing the 25th percentile and 75th percentile and median for a set of points sampled from a normal distribution with mean=6 and standard deviation=0.7."} a=quantile(x)[c(2:4)] boxplot(x,horizontal = T) @@ -201,20 +193,20 @@ text(a[3],1.25,"75th percentile") #### Frequently used statistical distributions The distributions have parameters (such as mean and variance) that -summarizes them but also they are functions that assigns each outcome of a \index{normal distribution} +summarize them, but also they are functions that assign each outcome of a \index{normal distribution} statistical experiment to its probability of occurrence. One distribution that you will frequently encounter is the normal distribution or Gaussian distribution. The normal distribution has a typical "bell-curve" shape -and, characterized by mean and standard deviation. A set of data points +and is characterized by mean and standard deviation. A set of data points that -follow normal distribution mostly will be close to the mean -but spread around it controlled by the standard deviation parameter. That -means if we sample data points from a normal distribution we are more -likely to sample nearby the mean and sometimes away from the mean. -Probability of an event occurring is higher if it is nearby the mean. +follow normal distribution will mostly be close to the mean +but spread around it, controlled by the standard deviation parameter. 
That +means that if we sample data points from a normal distribution, we are more +likely to sample data points near the mean and sometimes away from the mean. +The probability of an event occurring is higher if it is nearby the mean. The effect -of the parameters for normal distribution can be observed in the following +of the parameters for the normal distribution can be observed in the following plot. ```{r normDistChp3,echo=FALSE,out.width='50%',fig.width=5.1, fig.cap="Different parameters for normal distribution and effect of those on the shape of the distribution"} @@ -232,37 +224,33 @@ legend("topright",c(expression(paste(mu,"=0, ",sigma,"=0.5")), ``` -The normal distribution is often denoted by $\mathcal{N}(\mu,\,\sigma^2)$ When a random variable $X$ is distributed normally with mean $\mu$ and variance $\sigma^2$, we write: +The normal distribution is often denoted by $\mathcal{N}(\mu,\,\sigma^2)$. When a random variable $X$ is distributed normally with mean $\mu$ and variance $\sigma^2$, we write: -$$X\ \sim\ \mathcal{N}(\mu,\,\sigma^2).$$ +$$X\ \sim\ \mathcal{N}(\mu,\,\sigma^2)$$ The probability -density function of Normal distribution with mean $\mu$ and standard deviation -$\sigma$ is as follows +density function of the normal distribution with mean $\mu$ and standard deviation +$\sigma$ is as follows: $$P(x)=\frac{1}{\sigma\sqrt{2\pi} } \; e^{ -\frac{(x-\mu)^2}{2\sigma^2} } $$ The probability density function gives the probability of observing a value -on a normal distribution defined by $\mu$ and +on a normal distribution defined by the $\mu$ and $\sigma$ parameters. -Often times, we do not need the exact probability of a value but we need the +Oftentimes, we do not need the exact probability of a value, but we need the probability of observing a value larger or smaller than a critical value or reference point. For example, we might want to know the probability of $X$ being smaller than or -equal to -2 for a normal distribution with mean 0 and standard deviation 2. -,$P(X <= -2 \; | \; \mu=0,\sigma=2)$. In this case, what we want is the are under the -curve shaded in blue. To be able to that we need to integrate the probability +equal to -2 for a normal distribution with mean $0$ and standard deviation $2$: $P(X <= -2 \; | \; \mu=0,\sigma=2)$. In this case, what we want is the area under the +curve shaded in dark blue. To be able to do that, we need to integrate the probability density function but we will usually let software do that. Traditionally, one calculates a Z-score which is simply $(X-\mu)/\sigma=(-2-0)/2= -1$, and corresponds to how many standard deviations you are away from the mean. -This is also called "standardization", the corresponding value is distributed in "standard normal distribution" where $\mathcal{N}(0,\,1)$. +This is also called "standardization", the corresponding value is distributed in "standard normal distribution" where $\mathcal{N}(0,\,1)$. After calculating the Z-score, +we can look up the area under the curve for the left and right sides of the Z-score in a table, but again, we use software for that. +The tables are outdated when you can use a computer. -After calculating the Z-score, -we can go look up in a table, that contains the area under the curve for -the left and right side of the Z-score, but again we use software for that -tables are outdated. 
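As a quick sanity check of this standardization idea, we can let R do the lookup for us. The small chunk below is only an illustration: the probability computed directly from a normal distribution with mean 0 and standard deviation 2 matches the probability of the corresponding Z-score under the standard normal distribution $\mathcal{N}(0,\,1)$.

```{r zscoreStandardizeSketch}
# P(X <= -2) computed directly for X ~ N(mean=0, sd=2)
pnorm(-2,mean=0,sd=2)
# the same probability via standardization:
# Z = (X-mu)/sigma = (-2-0)/2 = -1, looked up in the standard normal N(0,1)
z=(-2-0)/2
pnorm(z)
```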
- -Below in Figure \@ref(fig:zscore), we are showing the Z-score and the associated probabilities derived +Below in Figure \@ref(fig:zscore), we show the Z-score and the associated probabilities derived from the calculation above for $P(X <= -2 \; | \; \mu=0,\sigma=2)$. ```{r zscore,echo=FALSE,message=FALSE,out.width='50%',fig.width=5.1,fig.cap='Z-score and associated probabilities for Z= -1'} @@ -272,10 +260,10 @@ xpnorm(c(-2), mean=0, sd=2,lower.tail = TRUE,invisible=T,verbose=FALSE) ``` -In R, family of `*norm` functions (`rnorm`,`dnorm`,`qnorm` and `pnorm`) can +In R, the family of `*norm` functions (`rnorm`,`dnorm`,`qnorm` and `pnorm`) can be used to -operate with normal distribution, such as calculating probabilities and -generating random numbers drawn from normal distribution. +operate with the normal distribution, such as calculating probabilities and +generating random numbers drawn from a normal distribution. We show some of those capabilities below. ```{r drnormChp3} # get the value of probability density function when X= -2, @@ -297,20 +285,20 @@ qnorm( 0.15, mean=0 , sd=2) ``` There are many other distribution functions in R that can be used the same -way. You have to enter the distribution specific parameters along -with your critical value, quantiles or number of random numbers depending -on which function you are using in the family.We will list some of those functions below. +way. You have to enter the distribution-specific parameters along +with your critical value, quantiles, or number of random numbers depending +on which function you are using in the family. We will list some of those functions below. -- `dbinom` is for binomial distribution \index{binomial disdistribution}. This distribution is usually used +- `dbinom` is for the binomial distribution\index{binomial disdistribution}. This distribution is usually used to model fractional data and binary data. Examples from genomics include methylation data. -- `dpois` is used for Poisson distribution and `dnbinom` is used for -negative binomial distribution. These distributions are used to model count \index{Poisson distribution} +- `dpois` is used for the Poisson distribution and `dnbinom` is used for +the negative binomial distribution. These distributions are used to model count \index{Poisson distribution} data such as sequencing read counts. - `df` (F distribution) and `dchisq` (Chi-Squared distribution) are used \index{F distribution} -in relation to distribution of variation. F distribution is used to model \index{Chi-Squared distribution} +in relation to the distribution of variation. The F distribution is used to model \index{Chi-Squared distribution} ratios of variation and Chi-Squared distribution is used to model distribution of variations. You will frequently encounter these in linear models and generalized linear models. @@ -320,24 +308,24 @@ When we take a random sample from a population and compute a statistic, such as the mean, we are trying to approximate the mean of the population. How well this \index{confidence intervals} sample statistic estimates the population value will always be a concern. A confidence interval addresses this concern because it provides a -range of values which is plausible to contain the population parameter of interest. -Normally, we would not have access to a population. If we did, we would not have to estimate the population parameters and its precision. +range of values which will plausibly contain the population parameter of interest. 
+Normally, we would not have access to a population. If we did, we would not have to estimate the population parameters and their precision. When we do not have access to the population, one way to estimate intervals is to repeatedly take samples from the -original sample with replacement, that is we take a data point from the sample +original sample with replacement, that is, we take a data point from the sample and replace it, and we take another data point until we reach the sample size of the -original sample. Then, we calculate the parameter of interest, in this case mean, and -repeat this step a large number of times, such as 1000. At this point, we would have a distribution of re-sampled -means, we can then calculate the 2.5th and 97.5th percentiles and these will +original sample. Then, we calculate the parameter of interest, in this case the mean, and +repeat this process a large number of times, such as 1000. At this point, we would have a distribution of re-sampled +means. We can then calculate the 2.5th and 97.5th percentiles and these will be our so-called 95% confidence interval. This procedure, resampling with replacement to -estimate the precision of population parameter estimates, is known as the __bootstrap__.\index{bootstrap resampling} +estimate the precision of population parameter estimates, is known as __bootstrap resampling__ or __bootstrapping__.\index{bootstrap resampling} Let's see how we can do this in practice. We simulate a sample coming from a normal distribution (but we pretend we don't know the population parameters). We will estimate the precision -of the mean of the sample using bootstrap to build confidence intervals, the resulting plot after this procedure is shown in Figure \@ref(fig:bootstrapChp3). +of the mean of the sample using bootstrapping to build confidence intervals, the resulting plot after this procedure is shown in Figure \@ref(fig:bootstrapChp3). ```{r bootstrapChp3,out.width='55%',fig.width=5.1,fig.cap="Precision estimate of the sample mean using 1000 bootstrap samples. Confidence intervals derived from the bootstrap samples are shown with red lines."} library(mosaic) @@ -359,23 +347,23 @@ text(x=q[2],y=200,round(q[2],3),adj=c(0,0)) ``` -If we had a convenient mathematical method to calculate confidence interval +If we had a convenient mathematical method to calculate the confidence interval, we could also do without resampling methods. It turns out that if we take repeated -samples from a population of with sample size $n$, the distribution of means -( $\overline{X}$) of those samples +samples from a population with sample size $n$, the distribution of means +($\overline{X}$) of those samples will be approximately normal with mean $\mu$ and standard deviation -$\sigma/\sqrt{n}$. This is also known as __Central Limit Theorem(CLT)__ and +$\sigma/\sqrt{n}$. This is also known as the __Central Limit Theorem(CLT)__ and is one of the most important theorems in statistics. This also means that $\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}$ has a standard normal -distribution and we can calculate the Z-score and then we can get +distribution and we can calculate the Z-score, and then we can get the percentiles associated with the Z-score. Below, we are showing the Z-score calculation for the distribution of $\overline{X}$, and then we are deriving the confidence intervals starting with the fact that -probability of Z being between -1.96 and 1.96 is 0.95. We then use algebra +the probability of Z being between $-1.96$ and $1.96$ is $0.95$. 
We then use algebra to show that the probability that unknown $\mu$ is captured between -$\overline{X}-1.96\sigma/\sqrt{n}$ and $\overline{X}+1.96\sigma/\sqrt{n}$ is 0.95, which is commonly known as 95% confidence interval. +$\overline{X}-1.96\sigma/\sqrt{n}$ and $\overline{X}+1.96\sigma/\sqrt{n}$ is $0.95$, which is commonly known as the 95% confidence interval. $$\begin{array}{ccc} Z=\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}\\ @@ -387,19 +375,18 @@ confint=[\overline{X}-1.96\sigma/\sqrt{n},\overline{X}+1.96\sigma/\sqrt{n}] \end{array}$$ -A 95% confidence interval for population mean is the most common -common interval to use, and would \index{confidence interval} +A 95% confidence interval for the population mean is the most common interval to use, and would \index{confidence interval} mean that we would expect 95% of the interval estimates to include the -population parameter, in this case the mean. However, we can pick any value +population parameter, in this case, the mean. However, we can pick any value such as 99% or 90%. We can generalize the confidence interval for $100(1-\alpha)$ as follows: $$\overline{X} \pm Z_{\alpha/2}\sigma/\sqrt{n}$$ -In R, we can do this using `qnorm()` function to get Z-scores associated +In R, we can do this using the `qnorm()` function to get Z-scores associated with ${\alpha/2}$ and ${1-\alpha/2}$. As you can see, the confidence intervals we calculated using CLT are very -similar to the ones we got from bootstrap for the same sample. For bootstrap we got $[19.21, 21.989]$ and for the CLT based estimate we got $[19.23638, 22.00819]$. +similar to the ones we got from the bootstrap for the same sample. For bootstrap we got $[19.21, 21.989]$ and for the CLT-based estimate we got $[19.23638, 22.00819]$. ```{r qnormchp3} alpha=0.05 sd=5 @@ -408,10 +395,10 @@ mean(sample1)+qnorm(c(alpha/2,1-alpha/2))*sd/sqrt(n) ``` -The good thing about CLT as long as the sample size is large regardless of \index{central limit theorem (CLT)} +The good thing about CLT is, as long as the sample size is large, regardless of \index{central limit theorem (CLT)} the population distribution, the distribution of sample means drawn from -that population will always be normal. In Figure \@ref(fig:sampleMeanschp3), we are repeatedly -drawing samples 1000 times with sample size $n$=10,30, and 100 from a bimodal, +that population will always be normal. In Figure \@ref(fig:sampleMeanschp3), we repeatedly +draw samples 1000 times with sample size $n=10$,$30$, and $100$ from a bimodal, exponential and a uniform distribution and we are getting sample mean distributions following normal distribution. @@ -482,21 +469,20 @@ hist(unif100,xlim=c(0,1),main="",xlab="",ylab="",breaks=20,col="gray", However, we should note that how we constructed the confidence interval -using standard normal distribution, $N(0,1)$, only works when the when we know the \index{normal distribution} +using standard normal distribution, $N(0,1)$, only works when we know the \index{normal distribution} population standard deviation. In reality, we usually have only access to a sample and have no idea about the population standard deviation. If -this is the case we should use estimate the standard deviation using -sample standard deviation and use something called _t distribution_ instead \index{t distribution} -of standard normal distribution in our interval calculation. 
Our confidence interval becomes +this is the case, we should estimate the standard deviation using +the sample standard deviation and use something called the _t distribution_ instead \index{t distribution} +of the standard normal distribution in our interval calculation. Our confidence interval becomes $\overline{X} \pm t_{\alpha/2}s/\sqrt{n}$, with t distribution parameter $d.f=n-1$, since now the following quantity is t distributed $\frac{\overline{X}-\mu}{s/\sqrt{n}}$ instead of standard normal distribution. -The t distribution is similar to standard normal distribution has mean 0 but its spread is larger than the normal distribution -especially when sample size is small, and has one parameter $v$ for +The t distribution is similar to the standard normal distribution and has mean $0$ but its spread is larger than the normal distribution +especially when the sample size is small, and has one parameter $v$ for the degrees of freedom, which is $n-1$ in this case. Degrees of freedom -is simply number of data points minus number of parameters estimated.\index{degrees of freedom} Here -we are estimating the mean from the data and the distribution is for the means, therefore degrees of freedom is $n-1$. The resulting distributions are shown in Figure \@ref(fig:tdistChp3). -```{r tdistChp3,echo=FALSE,warning=FALSE,message=FALSE,out.width='60%',fig.cap="Normal distribution and t distribution with different degrees of freedom. With increasing degrees of freedom, t distribution approximates the normal distribution better."} +is simply the number of data points minus the number of parameters estimated.\index{degrees of freedom}Here we are estimating the mean from the data, therefore the degrees of freedom is $n-1$. The resulting distributions are shown in Figure \@ref(fig:tdistChp3). +```{r tdistChp3,echo=FALSE,warning=FALSE,message=FALSE,out.width='60%',fig.cap="Normal distribution and t distribution with different degrees of freedom. With increasing degrees of freedom, the t distribution approximates the normal distribution better."} plot(function(x) dnorm(x,0,1), -4,4, main = "",col="red",lwd=2,ylab="P(x)") curve(dt(x,1),add=TRUE,col="orange",lwd=2) @@ -512,41 +498,39 @@ legend("topright",c(expression(paste("N(",mu,"=0, ",sigma,"=1)")), ``` ## How to test for differences between samples -Often times we would want to compare sets of samples. Such comparisons include +Oftentimes we would want to compare sets of samples. Such comparisons include if wild-type samples have different expression compared to mutants or if healthy samples are different from disease samples in some measurable feature (blood count, gene expression, methylation of certain loci). Since there is variability in our measurements, we need to take that into account when comparing the sets of samples. We can simply subtract the means of two samples, but given the variability of sampling, at the very least we need to decide a cutoff value for differences -of means, small differences of means can be explained by random chance due to +of means; small differences of means can be explained by random chance due to sampling. That means we need to compare the difference we get to a value that is typical to get if the difference between two group means were only due to sampling. 
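Before we formalize this, a small simulation can make the point concrete. The chunk below is only an illustration with made-up numbers: we repeatedly draw two groups of 30 values from the *same* distribution and record the difference of their means, so any difference we observe is produced by sampling alone.

```{r sameDistDiffSketch}
set.seed(10)
# both groups come from the same distribution, so differences in their
# means reflect sampling variation only
null.diffs=replicate(1000,
                     mean(rnorm(30,mean=2,sd=2))-mean(rnorm(30,mean=2,sd=2)))
summary(null.diffs)
# the range of differences we can easily get just by chance
quantile(null.diffs,c(0.025,0.975))
```

The randomization approach described next builds the same kind of null distribution, but by shuffling the group labels of the observed data rather than by drawing fresh samples.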
If you followed the logic above, here we actually introduced two core -ideas of something called "hypothesis testing", this is simply using +ideas of something called "hypothesis testing", which is simply using statistics to \index{hypothesis testing} -determine the probability that a given hypothesis (if two sample sets -are from the same population or not) is true. Formally, those two core -ideas are as follows: +determine the probability that a given hypothesis (Ex: if two sample sets +are from the same population or not) is true. Formally, expanded version of those two core ideas are as follows: -1. Decide on a hypothesis to test, often called "null hypothesis" ($H_0$). In our - case, the hypothesis is there is no difference between sets of samples. An the "Alternative hypothesis" ($H_1$) is there is a difference between the +1. Decide on a hypothesis to test, often called the "null hypothesis" ($H_0$). In our + case, the hypothesis is that there is no difference between sets of samples. An "alternative hypothesis" ($H_1$) is that there is a difference between the samples. 2. Decide on a statistic to test the truth of the null hypothesis. -3. Calculate the statistic -4. Compare it to a reference value to establish significance, the P-value. Based on that either reject or not reject the null hypothesis, $H_0$ +3. Calculate the statistic. +4. Compare it to a reference value to establish significance, the P-value. Based on that, either reject or not reject the null hypothesis, $H_0$. -### randomization based testing for difference of the means +### Randomization-based testing for difference of the means There is one intuitive way to go about this. If we believe there are no -differences between samples that means the sample labels (test-control or -healthy-disease) has no meaning. So, if we randomly assign labels to the -samples -that and calculate the difference of the mean, this creates a null -distribution for the $H_0$ where we can compare the real difference and +differences between samples, that means the sample labels (test vs. control or +healthy vs. disease) have no meaning. So, if we randomly assign labels to the +samples and calculate the difference of the means, this creates a null +distribution for $H_0$ where we can compare the real difference and measure how unlikely it is to get such a value under the expectation of the null hypothesis. We can calculate all possible permutations to calculate -the null distribution. However, sometimes that is not very feasible and +the null distribution. However, sometimes that is not very feasible and the equivalent approach would be generating the null distribution by taking a smaller number of random samples with shuffled group membership. @@ -560,7 +544,7 @@ often we would get the original difference we calculated under the assumption that $H_0$ is true. The resulting null distribution and the original value is shown in Figure \@ref(fig:randomTestchp3). -```{r randomTestchp3,out.width='60%',fig.cap="The null distribution for differences of means obtained via randomization. The original difference is marked via blue line. The red line marks the value that corresponds to P-value of 0.05"} +```{r randomTestchp3,out.width='60%',fig.cap="The null distribution for differences of means obtained via randomization. The original difference is marked via the blue line. 
The red line marks the value that corresponds to P-value of 0.05"} set.seed(100) gene1=rnorm(30,mean=4,sd=2) gene2=rnorm(30,mean=2,sd=2) @@ -588,18 +572,18 @@ directly related to the P-value calculation above. ### Using t-test for difference of the means between two samples -We can also calculate the difference between means using a t-test\index{t-test}. Sometimes we will have too few data points in a sample to do meaningful +We can also calculate the difference between means using a t-test\index{t-test}. Sometimes we will have too few data points in a sample to do a meaningful randomization test, also randomization takes more time than doing a t-test. This is a test that depends on the t distribution\index{t distribution}. The line of thought follows from the CLT and we can show differences in means are t distributed. -There are couple of variants of the t-test for this purpose. If we assume -the variances are equal we can use the following version +There are a couple of variants of the t-test for this purpose. If we assume +the population variances are equal we can use the following version $$t = \frac{\bar {X}_1 - \bar{X}_2}{s_{X_1X_2} \cdot \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}$$ where $$s_{X_1X_2} = \sqrt{\frac{(n_1-1)s_{X_1}^2+(n_2-1)s_{X_2}^2}{n_1+n_2-2}}$$ -In the first equation above the quantity is t distributed with $n_1+n_2-2$ degrees of freedom. We can calculate the quantity then use software -to look for the percentile of that value in that t distribution, which is our P-value. When we can not assume equal variances we use "Welch's t-test" +In the first equation above, the quantity is t distributed with $n_1+n_2-2$ degrees of freedom. We can calculate the quantity and then use software +to look for the percentile of that value in that t distribution, which is our P-value. When we cannot assume equal variances, we use "Welch's t-test" which is the default t-test in R and also works well when variances and the sample sizes are the same. For this test we calculate the following quantity: @@ -608,7 +592,7 @@ $$t = \frac{\overline{X}_1 - \overline{X}_2}{s_{\overline{X}_1 - \overline{X}_2} where $$s_{\overline{X}_1 - \overline{X}_2} = \sqrt{\frac{s_1^2 }{ n_1} + \frac{s_2^2 }{n_2}}$$ -and the degrees of freedom equals to: +and the degrees of freedom equals to $$\mathrm{d.f.} = \frac{(s_1^2/n_1 + s_2^2/n_2)^2}{(s_1^2/n_1)^2/(n_1-1) + (s_2^2/n_2)^2/(n_2-1)} $$ @@ -619,41 +603,40 @@ above. # Welch's t-test stats::t.test(gene1,gene2) -# t-test with equal varience assumption +# t-test with equal variance assumption stats::t.test(gene1,gene2,var.equal=TRUE) ``` A final word on t-tests: they generally assume a population where samples coming -from have normal +from them have a normal distribution, however it is been shown t-test can tolerate deviations from -normality. Especially, when two distributions are moderately skewed in the -same direction. This is due to central limit theorem which says means of +normality, especially, when two distributions are moderately skewed in the +same direction. This is due to the central limit theorem, which says that the means of samples will be distributed normally no matter the population distribution if sample sizes are large. -### multiple testing correction +### Multiple testing correction We should think of hypothesis testing as a non-error-free method of making \index{multiple testing correction} decisions. There will be times when we declare something significant and accept $H_1$ but we will be wrong. 
-These decisions are also called "false positives" or "false discoveries", this -is also known as "type I error". Similarly, we can fail to reject a hypothesis +These decisions are also called "false positives" or "false discoveries", and are also known as "type I errors". Similarly, we can fail to reject a hypothesis when we actually should. These cases are known as "false negatives", also known -as "type II error". +as "type II errors". The ratio of true negatives to the sum of true negatives and false positives ($\frac{TN}{FP+TN}$) is known as specificity. And we usually want to decrease the FP and get higher specificity. The ratio of true positives to the sum of true positives and false negatives ($\frac{TP}{TP+FN}$) is known as sensitivity. -And, again we usually want to decrease the FN and get higher sensitivity. -Sensitivity is also known as "power of a test" in the context of hypothesis -testing. More powerful tests will be highly sensitive and will do less type -II errors. For the t-test the power is positively associated with sample size -and the effect size. Higher the sample size, smaller the standard error and +And, again, we usually want to decrease the FN and get higher sensitivity. +Sensitivity is also known as the "power of a test" in the context of hypothesis +testing. More powerful tests will be highly sensitive and will have fewer type +II errors. For the t-test, the power is positively associated with sample size +and the effect size. The larger the sample size, the smaller the standard error, and looking for the larger effect sizes will similarly increase the power. -The general summary of these the different combination of the decisions are +The general summary of these different decision combinations are included in the table below. ------------------------------------------------------------- @@ -677,33 +660,33 @@ expressed) ------------------------------------------------------------- -We expect to make more type I errors as the number of tests increase, that +We expect to make more type I errors as the number of tests increase, which means we will reject the null hypothesis by mistake. For example, if we -perform a test the 5% significance level, there is a 5% chance of +perform a test at the 5% significance level, there is a 5% chance of incorrectly rejecting the null hypothesis if the null hypothesis is true. However, if we make 1000 tests where all null hypotheses are true for each of them, the average number of incorrect rejections is 50. And if we -apply the rules of probability, there are is almost a 100% chance that +apply the rules of probability, there is almost a 100% chance that we will have at least one incorrect rejection. There are multiple statistical techniques to prevent this from happening. -These techniques generally shrink the P-values obtained from multiple -tests to higher values, if the individual P-value is low enough it survives -this process. The most simple method is just to multiply the individual, -P-value ($p_i$) with the number of tests ($m$): $m \cdot p_i$, this is +These techniques generally push the P-values obtained from multiple +tests to higher values; if the individual P-value is low enough it survives +this process. The simplest method is just to multiply the individual +P-value ($p_i$) by the number of tests ($m$), $m \cdot p_i$. This is called "Bonferroni correction". However, this is too harsh if you have thousands of tests. Other methods are developed to remedy this. 
Those methods rely on ranking the P-values and dividing $m \cdot p_i$ by the -rank,$i$, :$\frac{m \cdot p_i }{i}$, this is derived from Benjamini–Hochberg \index{P-value} +rank, $i$, :$\frac{m \cdot p_i }{i}$, which is derived from the Benjamini–Hochberg \index{P-value} procedure. This procedure is developed to control for "False Discovery Rate (FDR)" -, which is proportion of false positives among all significant tests. And in -practical terms, we get the "FDR adjusted P-value" from the procedure described -above. This gives us an estimate of proportion of false discoveries for a given -test. To elaborate, p-value of 0.05 implies that 5% of all tests will be false positives. An FDR adjusted p-value of 0.05 implies that 5% of significant tests will be false positives. The FDR adjusted P-values will result in a lower number of false positives. +, which is the proportion of false positives among all significant tests. And in +practical terms, we get the "FDR-adjusted P-value" from the procedure described +above. This gives us an estimate of the proportion of false discoveries for a given +test. To elaborate, p-value of 0.05 implies that 5% of all tests will be false positives. An FDR-adjusted p-value of 0.05 implies that 5% of significant tests will be false positives. The FDR-adjusted P-values will result in a lower number of false positives. One final method that is also popular is called the "q-value" method and related to the method above. This procedure relies on estimating the proportion of true null hypotheses from the distribution of raw p-values and using that quantity -to come up with what is called a "q-value", which is also an FDR adjusted P-value [@Storey2003-nv]. That can be practically defined +to come up with what is called a "q-value", which is also an FDR-adjusted P-value [@Storey2003-nv]. That can be practically defined as "the proportion of significant features that turn out to be false leads." A q-value 0.01 would mean 1% of the tests called significant at this \index{q-value} level will be truly null on average. Within the genomics community @@ -712,9 +695,9 @@ calculated differently. In R, the base function `p.adjust()` implements most of the p-value correction methods described above. For the q-value, we can use the `qvalue` package from -Bioconductor. Below we are demonstrating how to use them on a set of simulated -p-values.The plot in Figure \@ref(fig:multtest) shows that Bonferroni correction does a terrible job. FDR(BH) and q-value -approach are better but q-value approach is more permissive than FDR(BH). +Bioconductor. Below we demonstrate how to use them on a set of simulated +p-values. The plot in Figure \@ref(fig:multtest) shows that Bonferroni correction does a terrible job. FDR(BH) and q-value +approach are better but, the q-value approach is more permissive than FDR(BH). ```{r multtest,out.width='60%',fig.cap="Adjusted P-values via different methods and their relationship to raw P-values"} library(qvalue) @@ -732,40 +715,35 @@ legend("bottomright",legend=c("q-value","FDR (BH)","Bonferroni"), fill=c("black","blue","red")) ``` -### moderated t-tests: using information from multiple comparisons +### Moderated t-tests: Using information from multiple comparisons In genomics, we usually do not do one test but many, as described above. That means we\index{moderated t-test} may be able to use the information from the parameters obtained from all comparisons to influence the individual parameters. 
For example, if you have many variances calculated for thousands of genes across samples, you can force individual
-variance estimates to shrunk towards the mean or the median of the distribution
+variance estimates to shrink toward the mean or the median of the distribution
 of variances. This usually creates better performance in individual variance
-estimates and therefore better performance in significance testing which
-depends on variance estimates. How much the values be shrunk towards a common
-value comes in many flavors. These tests in general are called moderated
+estimates and therefore better performance in significance testing, which
+depends on variance estimates. How much the values are shrunk toward a common
+value depends on the exact method used. These tests in general are called moderated
 t-tests or shrinkage t-tests. One approach popularized by Limma software is
-to use so-called "Empirical Bayes methods"
-\index{empirical Bayes methods}. The main formulation in these
-methods is $\hat{V_g} = aV_0 + bV_g$, where $V_0$ is the background variability
-and $V_g$ is the individual variability. Then, these methods estimate $a$ and $b$
-in various ways to come up with shrunk version of variability, $\hat{V_g}$. Bayesian inference can make use of prior knowledge to make inference about properties of the data. In a Bayesian viewpoint,
+to use so-called "Empirical Bayes methods"\index{empirical Bayes methods}. The main formulation in these
+methods is $\hat{V_g} = aV_0 + bV_g$, where $V_0$ is the background variability and $V_g$ is the individual variability. Then, these methods estimate $a$ and $b$ in various ways to come up with a "shrunk" version of the variability, $\hat{V_g}$. Bayesian inference can make use of prior knowledge to make inference about properties of the data. In a Bayesian viewpoint,
 the prior knowledge, in this case variability of other genes, can be used to calculate the variability of an individual gene. In our case, $V_0$ would be the prior knowledge we have on the variability of the genes and we use that knowledge to influence our estimate for the individual genes.

 Below we are simulating a gene expression matrix with 1000 genes, and 3 test
-and 3 control groups. Each row is a gene and in normal circumstances we would
-like to find out differentially expressed genes. In this case, we are simulating
-them from the same distribution so in reality we do not expect any differences.
-We then use the adjusted standard error estimates in empirical Bayesian spirit but
-in a very crude way. We just shrink the gene-wise standard error estimates towards the median with equal $a$ and $b$ weights. That is to say, we add individual estimate to the
-median of standard error distribution from all genes and divide that quantity by 2.
+and 3 control groups. Each row is a gene, and in normal circumstances we would
+like to find differentially expressed genes. In this case, we are simulating
+them from the same distribution, so in reality we do not expect any differences.
+We then use the adjusted standard error estimates in empirical Bayesian spirit, but in a very crude way. We just shrink the gene-wise standard error estimates towards the median with equal $a$ and $b$ weights. That is to say, we add the individual estimate to the
+median of the standard error distribution from all genes and divide that quantity by 2.
So if we plug that into the above formula, what we do is: $$ \hat{V_g} = (V_0 + V_g)/2 $$ In the code below, we are avoiding for loops or apply family functions -by using vectorized operations. The code below samples gene expression values from a hypotethical distribution. Since all the values come from the same distribution we do not expect differences between groups. We then calculate moderated and unmoderated t-test statistics and plot the P-value distributions for tests, the results are shown in Figure \@ref(fig:modTtestChp3). +by using vectorized operations. The code below samples gene expression values from a hypothetical distribution. Since all the values come from the same distribution, we do not expect differences between groups. We then calculate moderated and unmoderated t-test statistics and plot the P-value distributions for tests. The results are shown in Figure \@ref(fig:modTtestChp3). ```{r modTtestChp3, out.width='60%',fig.width=8,fig.cap="The distributions of P-values obtained by t-tests and moderated t-tests"} set.seed(100) @@ -816,25 +794,24 @@ mtext(paste("signifcant tests:",sum(p.mod<0.05)) ) __Want to know more ?__ -- basic statistical concepts - - "Cartoon guide to statistics" by Gonick & Smith [@gonick2005cartoon]. Provides central concepts are depicted as cartoons in funny but clear and accurate manner. +- Basic statistical concepts + - "Cartoon guide to statistics" by Gonick & Smith [@gonick2005cartoon]. Provides central concepts depicted as cartoons in a funny but clear and accurate manner. - "OpenIntro Statistics" [@diez2015openintro] (Free e-book http://openintro.org). This book provides fundamental statistical concepts in a clear and easy way. It includes R code. - Hands-on statistics recipes with R - "The R book" [@crawley2012r]. This is the main R book for anyone interested in statistical concepts and their application in R. It requires some background in statistics since the main focus is applications in R. -- moderated tests - - comparison of moderated tests for differential expression [@de2010benchmark] http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-17 - - limma method developed for testing differential expression between genes using a moderated test [@smyth2004linear] http://www.statsci.org/smyth/pubs/ebayes.pdf +- Moderated tests + - Comparison of moderated tests for differential expression [@de2010benchmark] http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-17 + - Limma method developed for testing differential expression between genes using a moderated test [@smyth2004linear] http://www.statsci.org/smyth/pubs/ebayes.pdf ``` -## Relationship between variables: linear models and correlation +## Relationship between variables: Linear models and correlation In genomics, we would often need to measure or model the relationship between variables. We might want to know about expression of a particular gene in liver in relation to the dosage of a drug that patient receives. Or, we may want to know -DNA methylation of certain locus in the genome in relation to age of the sample -donor's. Or, we might be interested in the relationship between histone +DNA methylation of a certain locus in the genome in relation to the age of the sample donor. Or, we might be interested in the relationship between histone modifications and gene expression\index{histone modification}. Is there a linear relationship, the more \index{gene expression} histone modification the more the gene is expressed ? 
@@ -842,25 +819,25 @@ In these situations and many more, linear regression or linear models can be used to \index{linear regression} model the relationship with a "dependent" or "response" variable (expression or methylation -in the above examples) and one or more "independent"" or "explanatory" variables (age, drug dosage or histone modification in the above examples). Our simple linear model has the +in the above examples) and one or more "independent" or "explanatory" variables (age, drug dosage or histone modification in the above examples). Our simple linear model has the following components. $$ Y= \beta_0+\beta_1X + \epsilon $$ In the equation above, $Y$ is the response variable and $X$ is the explanatory - variable. $\epsilon$ is the mean-zero error term. Since, the line fit will not + variable. $\epsilon$ is the mean-zero error term. Since the line fit will not be able to precisely predict the $Y$ values, there will be some error associated with each prediction when we compare it to the original $Y$ values. This error - is captured in $\epsilon$ term. We can alternatively write the model as - follows to emphasize that the model approximates $Y$, in this case notice that we removed the $\epsilon$ term: $Y \sim \beta_0+\beta_1X$ + is captured in the $\epsilon$ term. We can alternatively write the model as + follows to emphasize that the model approximates $Y$, in this case notice that we removed the $\epsilon$ term: $Y \sim \beta_0+\beta_1X$. The plot below in Figure \@ref(fig:histoneLmChp3) shows the relationship between histone modification (trimethylated forms of histone H3 at lysine 4, aka H3K4me3) and gene expression for 100 genes. The blue line is our model with estimated coefficients ($\hat{y}=\hat{\beta}_0 + \hat{\beta}_1X$, where $\hat{\beta}_0$ - and $\hat{\beta}_1$ the estimated values of $\beta_0$ and + and $\hat{\beta}_1$ are the estimated values of $\beta_0$ and $\beta_1$, and $\hat{y}$ indicates the prediction). The red lines indicate the individual errors per data point, indicated as $\epsilon$ in the formula above. @@ -891,19 +868,19 @@ segments(x1, y, x1, pre, col="red") ``` -There could be more than one explanatory variable, we then simply add more $X$ +There could be more than one explanatory variable. We then simply add more $X$ and $\beta$ to our model. If there are two explanatory variables our model will look like this: $$ Y= \beta_0+\beta_1X_1 +\beta_2X_2 + \epsilon $$ In this case, we will be fitting a plane rather than a line. However, the fitting -process which we will describe in the later sections will not change. For our +process which we will describe in the later sections will not change for our gene expression problem. We can introduce one more histone modification, H3K27me3. We will then have a linear model with 2 explanatory variables and the -fitted plane will look like the one below in Figure \@ref(fig:histoneLm2chp3). The gene expression values are shown -as dots below and above the fitted plane. Linear regression or linear models and their extensions which makes use of other distributions, generalized linear models,\index{generalized linear model} are central in computational genomics for statistical tests. We will see more of how regression is used in statistical hypothesis testing for computational genomics in Chapters \@ref(rnaseqanalysis) and \@ref(bsseq). +fitted plane will look like the one in Figure \@ref(fig:histoneLm2chp3). The gene expression values are shown +as dots below and above the fitted plane. 
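
As a quick sketch of how such a two-predictor model can be fit in practice (the simulated variables below are made-up stand-ins and not the data behind the figure), additional explanatory variables are simply added to the `lm()` formula:

```{r}
# simulated stand-ins for two histone modification scores and gene expression
set.seed(11)
h3k4me3.sim  <- runif(100, 0, 30)
h3k27me3.sim <- runif(100, 0, 30)
exp.sim      <- 2 + 0.5*h3k4me3.sim - 0.3*h3k27me3.sim + rnorm(100, sd = 2)

# one response, two explanatory variables: this fits a plane
mod.2var <- lm(exp.sim ~ h3k4me3.sim + h3k27me3.sim)
coef(mod.2var)  # intercept plus one slope per explanatory variable
```
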
Linear regression and its extensions which make use of other distributions (generalized linear models) \index{generalized linear model} are central in computational genomics for statistical tests. We will see more of how regression is used in statistical hypothesis testing for computational genomics in Chapters \@ref(rnaseqanalysis) and \@ref(bsseq).

-```{r histoneLm2chp3,echo=FALSE,out.width='65%',warning=FALSE,message=FALSE,fig.cap="Association of Gene expression with H3K4me3 and H27Kme3 histone modifications."}
+```{r histoneLm2chp3,echo=FALSE,out.width='65%',warning=FALSE,message=FALSE,fig.cap="Association of gene expression with H3K4me3 and H3K27me3 histone modifications."}
 set.seed(32)

 x2 <- runif(100,10,200)
@@ -947,7 +924,7 @@ scatter3D(z = y2, x = x1, y = x2, pch = 19, cex = 0.4,colvar=sign(residuals(mod1

 #### Matrix notation for linear models

-We can naturally have more explanatory variables than just two.The formula
+We can naturally have more explanatory variables than just two. The formula
 below has $n$ explanatory variables.

$$Y= \beta_0+\beta_1X_1+\beta_2X_2 + \beta_3X_3 + .. + \beta_nX_n +\epsilon$$

@@ -955,9 +932,9 @@ below has $n$ explanatory variables.

 If there are many variables, it would be easier to write the model in matrix notation.
 The matrix form of linear model with two explanatory variables will look like the one
-below. First matrix would be our data matrix. This contains our explanatory
+below. The first matrix would be our data matrix. This contains our explanatory
 variables and a column of 1s. The second term is a column vector of $\beta$
-values. We add a vector of error terms,$\epsilon$s to the matrix multiplication.
+values. We also add a vector of error terms, $\epsilon$s, to the matrix multiplication.

$$
\mathbf{Y} = \left[\begin{array}{rrr}
@@ -982,8 +959,8 @@ $$
\end{array}\right]
$$

-The multiplication of data matrix and $\beta$ vector and addition of the
-error terms simply results in the the following set of equations per data point:
+The multiplication of the data matrix and $\beta$ vector and addition of the
+error terms simply results in the following set of equations per data point:


$$
\begin{aligned}
@@ -1001,54 +978,54 @@ could be simply written as follows.

$$Y=X\beta + \epsilon$$

-In the equation above $Y$ is the vector of response variables and $X$ is the
-data matrix and $\beta$ is the vector of coefficients.
+In the equation above, $Y$ is the vector of response variables, $X$ is the
+data matrix, and $\beta$ is the vector of coefficients.
 This notation is more concise and often used in scientific papers. However, this
 also means you need some understanding of linear algebra to follow the math
 laid out in such resources.

### How to fit a line

-At this point a major questions is left unanswered: How did we fit this line?
+At this point a major question is left unanswered: How did we fit this line?
 We basically need to define $\beta$ values in a structured way.
-There are multiple ways or understanding how
-to do this, all of which converges to the same
+There are multiple ways of understanding how
+to do this, all of which converge to the same
 end point. We will describe them one by one.

#### The cost or loss function approach

This is the first approach and in my opinion is easiest to understand. \index{cost function}
-We try to optimize a function, often called "cost function" or "loss function". \index{loss function}
+We try to optimize a function, often called the "cost function" or "loss function".
\index{loss function}
The cost function is the sum of squared differences between the predicted $\hat{Y}$ values from our model
 and the original $Y$ values. The optimization procedure tries to find $\beta$ values \index{optimization}
-that minimizes this difference between reality and the predicted values.
+that minimize this difference between reality and the predicted values.

$$min \sum{(y_i-(\beta_0+\beta_1x_i))^2}$$

-Note that this is related to the the error term, $\epsilon$, we already mentioned
-above, we are trying to minimize the squared sum of $\epsilon_i$ for each data
+Note that this is related to the error term, $\epsilon$, we already mentioned
+above. We are trying to minimize the squared sum of $\epsilon_i$ for each data
 point.

 We can do this minimization by a bit of calculus.
 The rough algorithm is as follows:

-1. Pick a random starting point, random $\beta$ values
+1. Pick a random starting point, random $\beta$ values.
 2. Take the partial derivatives of the cost function to see which direction is
 the way to go in the cost function.
 3. Take a step toward the direction that minimizes the cost function.
-    - step size is parameter to choose, there are many variants.
-4. repeat step 2,3 until convergence.
+    - Step size is a parameter to choose; there are many variants.
+4. Repeat steps 2 and 3 until convergence.

-This is the basis of "gradient descent" algorithm.\index{gradient descent} With the help of partial
+This is the basis of the "gradient descent" algorithm.\index{gradient descent} With the help of partial
 derivatives we define a "gradient" on the cost function and follow that through
-multiple iterations and until convergence, meaning until the results do not
+multiple iterations until convergence, meaning until the results do not
 improve defined by a margin. The algorithm usually converges to optimum $\beta$ values. In Figure \@ref(fig:3dcostfunc), we show the cost function over various $\beta_0$ and $\beta_1$ values for the histone modification and gene expression data set. The algorithm will pick a point on this graph and traverse it incrementally based on the
-derivatives and converge on the bottom of the cost function "well". Such optimization methods is the core of machine learning methods we will cover later in Chapters \@ref(unsupervisedLearning) and
+derivatives and converge to the bottom of the cost function "well". Such optimization methods are the core of machine learning methods we will cover later in Chapters \@ref(unsupervisedLearning) and
\@ref(supervisedLearning).

-```{r 3dcostfunc,fig.height=3,echo=FALSE,warning=FALSE,message=FALSE,fig.cap="Cost function landscape for linear regression with changing beta values. The optimization process tries to find the lowest point in this landscape by implementing a strategy for updating beta values towards the lowest point in the landscape."}
+```{r 3dcostfunc,fig.height=4,echo=FALSE,warning=FALSE,message=FALSE,fig.cap="Cost function landscape for linear regression with changing beta values. The optimization process tries to find the lowest point in this landscape by implementing a strategy for updating beta values toward the lowest point in the landscape."}

require(plot3D)
@@ -1090,11 +1067,11 @@ par(mfrow=c(1,1))


#### Not cost function but maximum likelihood function

-We can also think of this problem from more a statistical point of view. In \index{maximum likelihood estimation}
+We can also think of this problem from a more statistical point of view.
In \index{maximum likelihood estimation}
 essence, we are looking for best statistical parameters, in this
 case $\beta$ values, for our model that are most likely to produce such a
-scatter of data points given the explanatory variables.This is called
-"Maximum likelihood" approach. The approach assumes that a given response variable $y_i$ follows a normal distribution with mean $\beta_0+\beta_1x_i$ and \index{variance} variance $s^2$. Therefore probability of observing any given $y_i$ value is dependent on $\beta_0$ and $\beta_1$ values. Since $x_i$, the explanatory variable, is fixed within our data set, by varying $\beta_0$ and $\beta_1$ values we can maximize the probability of observing any given $y_i$. The trick is to find $\beta_0$ and $\beta_1$ values that maximizes the probability of observing all the response variables in the dataset given the explanatory variables. The probability of observing a response variable $y_i$ with assumptions we described above is shown below. Note that this assumes variance is constant and $s^2=\frac{\sum{\epsilon_i}}{n-2}$ is an unbiased estimation for population variance, $\sigma^2$.\index{variance}
+scatter of data points given the explanatory variables. This is called the
+"maximum likelihood" approach. The approach assumes that a given response variable $y_i$ follows a normal distribution with mean $\beta_0+\beta_1x_i$ and \index{variance} variance $s^2$. Therefore the probability of observing any given $y_i$ value is dependent on the $\beta_0$ and $\beta_1$ values. Since $x_i$, the explanatory variable, is fixed within our data set, we can maximize the probability of observing any given $y_i$ by varying $\beta_0$ and $\beta_1$ values. The trick is to find $\beta_0$ and $\beta_1$ values that maximize the probability of observing all the response variables in the dataset given the explanatory variables. The probability of observing a response variable $y_i$ with assumptions we described above is shown below. Note that this assumes variance is constant and $s^2=\frac{\sum{\epsilon_i^2}}{n-2}$ is an unbiased estimation for population variance, $\sigma^2$.\index{variance}

$$P(y_{i})=\frac{1}{s\sqrt{2\pi} }e^{-\frac{1}{2}\left(\frac{y_i-(\beta_0 + \beta_1x_i)}{s}\right)^2}$$

@@ -1103,22 +1080,22 @@ linear regression is \index{linear regression} multiplication of $P(y_{i})$ for

$$L=P(y_1)P(y_2)P(y_3)..P(y_n)=\prod\limits_{i=1}^n{P_i}$$

-This can be simplified to the following equation by some algebra, assumption of normal distribution and taking logs (since it is
-easier to add than multiply)
+This can be simplified to the following equation by some algebra, assumption of normal distribution, and taking logs (since it is
+easier to add than multiply).

$$ln(L) = -nln(s\sqrt{2\pi}) - \frac{1}{2s^2} \sum\limits_{i=1}^n{(y_i-(\beta_0 + \beta_1x_i))^2} $$

 As you can see, the right part of the function is the negative of the cost function
-defined above. If we wanted to optimize this function we would need to take derivative of
-the function with respect to $\beta$ parameters. That means we can ignore the
-first part since there is no $\beta$ terms there. This simply reduces to the
+defined above. If we wanted to optimize this function we would need to take the derivative of
+the function with respect to the $\beta$ parameters. That means we can ignore the
+first part since there are no $\beta$ terms there. This simply reduces to the
 negative of the cost function. Hence, this approach produces exactly the same
 result as the cost function approach.
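
This equivalence can be checked numerically. Below is a small sketch (simulated data and a general-purpose optimizer; none of the numbers come from the text): minimizing the sum of squared residuals with `optim()` recovers, up to optimizer tolerance, the coefficients that `lm()` reports.

```{r}
set.seed(7)
x.sim <- runif(50, 1, 100)
y.sim <- 10 + 2*x.sim + rnorm(50, sd = 20)

# cost function: sum of squared residuals, which is the negative
# log-likelihood up to terms that do not depend on the beta values
cost <- function(b) sum((y.sim - (b[1] + b[2]*x.sim))^2)

optim(c(0, 0), cost)$par  # beta0 and beta1 from numerical minimization
coef(lm(y.sim ~ x.sim))   # approximately the same values from lm()
```
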
The difference is that we defined our problem within the domain of statistics. This particular function has still to be optimized. This can be done with some calculus without the need for an iterative approach.

-The maximum likelihood approach also opens up other possibilities for regression. For the case above we assumed that the points around the mean are distributed by normal distribution. However, there are other cases where this assumption may not hold. For example, the for the count data the mean and variance relationship is not constant, the higher the mean counts the higher the variance. In this cases, the regression framework with maximum likelihood estimation can still be used. We simply change the underlying assumptions about the distribution and calculate the likelihood with a new distribution in mind,
+The maximum likelihood approach also opens up other possibilities for regression. For the case above, we assumed that the points around the mean follow a normal distribution. However, there are other cases where this assumption may not hold. For example, for count data the mean and variance relationship is not constant; the higher the mean counts, the higher the variance. In these cases, the regression framework with maximum likelihood estimation can still be used. We simply change the underlying assumptions about the distribution and calculate the likelihood with a new distribution in mind,
 and maximize the parameters for that likelihood. This gives way to "generalized linear model"\index{generalized linear model} approach where errors for the response variables can have other distributions than normal distribution. We will see examples of these generalized linear models in Chapter \@ref(rnaseqanalysis) and \@ref(bsseq).

@@ -1126,13 +1103,13 @@ and maximize the parameters for that likelihood. This gives way to "generalized

#### Linear algebra and closed-form solution to linear regression

 The last approach we will describe is the minimization process using linear \index{linear regression}
-algebra. If you find this concept challenging, feel free to skip it but scientific publications and other books frequently use matrix notation and linear algebra to define and solve regression problems. In this case, we do not use an iterative approach. Instead, we will
-minimize cost function by explicitly taking its derivatives with respect to
+algebra. If you find this concept challenging, feel free to skip it, but scientific publications and other books frequently use matrix notation and linear algebra to define and solve regression problems. In this case, we do not use an iterative approach. Instead, we will
+minimize the cost function by explicitly taking its derivatives with respect to
 $\beta$'s and setting them to zero. This is doable by employing linear algebra
 and matrix calculus. This approach is also called "ordinary least squares". We \index{ordinary least squares regression} will not
-show the whole derivation here but the following expression
-is what we are trying to minimize in matrix notation, this is basically a
+show the whole derivation here, but the following expression
+is what we are trying to minimize in matrix notation, which is basically a
 different notation of the same minimization problem defined above.
Remember $\epsilon_i=Y_i-(\beta_0+\beta_1x_i)$ @@ -1149,14 +1126,13 @@ the following for estimated $\beta$ values, $\hat{\beta}$: $$\hat{\beta}=(X^TX)^{-1}X^TY$$ -This requires for you to calculate the inverse of the $X^TX$ term, which could -be slow for large matrices. Iterative approach over the cost function +This requires you to calculate the inverse of the $X^TX$ term, which could +be slow for large matrices. Using an iterative approach over the cost function derivatives will be faster for larger problems. The linear algebra notation is something you will see in the papers or other resources often. If you input the data matrix X and solve the $(X^TX)^{-1}$ , -you get the following values for $\beta_0$ and $\beta_1$ for simple regression \index{linear regression} -. However, we should note that this simple linear regression case can easily +you get the following values for $\beta_0$ and $\beta_1$ for simple regression \index{linear regression}. However, we should note that this simple linear regression case can easily be solved algebraically without the need for matrix operations. This can be done by taking the derivative of $\sum{(y_i-(\beta_0+\beta_1x_i))^2}$ with respect to $\beta_1$, rearranging the terms and equalizing the derivative to zero. @@ -1167,13 +1143,13 @@ $$\hat{\beta_0}=\overline{Y}-\hat{\beta_1}\overline{X}$$ #### Fitting lines in R After all this theory, you will be surprised how easy it is to fit lines in R. -This is achieved just by `lm()` command, stands for linear models. Let's do this -for a simulated data set and plot the fit. First step is to simulate the -data, we will decide on $\beta_0$ and $\beta_1$ values. The we will decide -on the variance parameter,$\sigma$ to be used in simulation of error terms,\index{variance} +This is achieved just by the `lm()` function, which stands for linear models. Let's do this +for a simulated data set and plot the fit. The first step is to simulate the +data. We will decide on $\beta_0$ and $\beta_1$ values. Then we will decide +on the variance parameter, $\sigma$, to be used in simulation of error terms,\index{variance} $\epsilon$. We will first find $Y$ values, just using the linear equation $Y=\beta0+\beta_1X$, for -a set of $X$ values. Then, we will add the error terms get our simulated values. +a set of $X$ values. Then, we will add the error terms to get our simulated values. ```{r getFittinLineData} # set random number seed, so that the random numbers from the text # is the same when you run the code. @@ -1182,7 +1158,7 @@ set.seed(32) # get 50 X values between 1 and 100 x = runif(50,1,100) -# set b0,b1 and varience (sigma) +# set b0,b1 and variance (sigma) b0 = 10 b1 = 2 sigma = 20 @@ -1193,10 +1169,10 @@ y = b0 + b1*x+ eps ``` -Now let us fit a line using `lm()` function. The function requires a formula, and -optionally a data frame. We need the pass the following expression within the -`lm()` function, `y~x`, where `y` is the simulated $Y$ values and `x` is the explanatory variables $X$. We will then use `abline()` function to draw the fit. The resulting plot is shown in Figure \@ref(fig:geneExpLinearModel). -```{r geneExpLinearModel,out.width='60%',fig.cap="Gene expression and histone modification score modelled by linear regression"} +Now let us fit a line using the `lm()` function. The function requires a formula, and +optionally a data frame. We need to pass the following expression within the +`lm()` function, `y~x`, where `y` is the simulated $Y$ values and `x` is the explanatory variables $X$. 
We will then use the `abline()` function to draw the fit. The resulting plot is shown in Figure \@ref(fig:geneExpLinearModel). +```{r geneExpLinearModel,out.width='60%',fig.cap="Gene expression and histone modification score modeled by linear regression."} mod1=lm(y~x) # plot the data points @@ -1208,15 +1184,14 @@ abline(mod1,col="blue") ### How to estimate the error of the coefficients -Since we are using a sample to estimate the coefficients they are -not exact, with every random sample they will vary. Below in Figure \@ref(fig:regCoeffRandomSamples), we -are taking multiple samples from the population and fitting lines to each -sample, with each sample the lines slightly change.We are overlaying the -points and the lines for each sample on top of the other samples -.When we take 200 samples and fit lines for each of them,the lines fit are +Since we are using a sample to estimate the coefficients, they are +not exact; with every random sample they will vary. In Figure \@ref(fig:regCoeffRandomSamples), we +take multiple samples from the population and fit lines to each +sample; with each sample the lines slightly change. We are overlaying the +points and the lines for each sample on top of the other samples. When we take 200 samples and fit lines for each of them, the line fits are variable. And, we get a normal-like distribution of $\beta$ values with a defined mean \index{linear regression} -and standard deviation a, which is called standard error of the +and standard deviation, which is called standard error of the coefficients. ```{r regCoeffRandomSamples,message=FALSE,warning=FALSE,echo=FALSE,fig.cap="Regression coefficients vary with every random sample. The figure illustrates the variability of regression coefficients when regression is done using a sample of data points. Histograms depict this variability for $b_0$ and $b_1$ coefficients."} @@ -1287,7 +1262,7 @@ hist(b1s,breaks=10,xlab=expression(beta[1]), ``` Normally, we will not have access to the population to do repeated sampling, -model fitting and estimation of the standard error for the coefficients. But +model fitting, and estimation of the standard error for the coefficients. But there is statistical theory that helps us infer the population properties from the sample. When we assume that error terms have constant variance and mean zero , we can model the uncertainty in the regression coefficients, $\beta$s. @@ -1304,25 +1279,25 @@ $$ Notice that that $SE(\beta_1)$ depends on the estimate of variance of residuals shown as $s$ or __Residual Standard Error (RSE)__.\index{variance} \index{residuals} -Notice also standard error depends on the spread of $X$. If $X$ values have more \index{Residual Standard Error (RSE)} +Notice also the standard error depends on the spread of $X$. If $X$ values have more \index{Residual Standard Error (RSE)} variation, the standard error will be lower. This intuitively makes sense since if the -spread of the $X$ is low, the regression line will be able to wiggle more +spread of $X$ is low, the regression line will be able to wiggle more compared to a regression line that is fit to the same number of points but covers a greater range on the X-axis. 
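
The effect of the spread of $X$ can be illustrated with a small simulation (all numbers below are arbitrary and only for illustration): the same model and the same noise level, fit once with a wide range of $X$ values and once with a narrow one, gives a larger standard error for the slope in the narrow case.

```{r}
set.seed(5)
x.wide   <- runif(50, 1, 100)   # X values covering a wide range
x.narrow <- runif(50, 40, 60)   # X values covering a narrow range
y.wide   <- 10 + 2*x.wide   + rnorm(50, sd = 20)
y.narrow <- 10 + 2*x.narrow + rnorm(50, sd = 20)

# compare the standard errors of the slope estimates
coef(summary(lm(y.wide ~ x.wide)))["x.wide", "Std. Error"]
coef(summary(lm(y.narrow ~ x.narrow)))["x.narrow", "Std. Error"]
```
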
The standard error estimates can also be used to calculate confidence intervals and test -hypotheses, since the following quantity called t-score approximately follows a +hypotheses, since the following quantity, called t-score, approximately follows a t-distribution with $n-p$ degrees of freedom, where $n$ is the number of data points and $p$ is the number of coefficients estimated. $$ \frac{\hat{\beta_i}-\beta_test}{SE(\hat{\beta_i})}$$ Often, we would like to test the null hypothesis if a coefficient is equal to -zero or not. For simple regression this could mean if there is a relationship -between explanatory variable and response variable. We would calculate the +zero or not. For simple regression, this could mean if there is a relationship +between the explanatory variable and the response variable. We would calculate the t-score as follows $\frac{\hat{\beta_i}-0}{SE(\hat{\beta_i})}$, and compare it -t-distribution with $d.f.=n-p$ to get the p-value. +to the t-distribution with $d.f.=n-p$ to get the p-value. We can also @@ -1334,11 +1309,11 @@ $t_{0.975}$ is the 97.5% percentile of the t-distribution with $d.f. = n – p$. -In R, `summary()` function will test all the coefficients for the null hypothesis +In R, the `summary()` function will test all the coefficients for the null hypothesis $\beta_i=0$. The function takes the model output obtained from the `lm()` function. To demonstrate this, let us first get some data. The procedure below simulates data to be used in a regression setting and it is useful to examine \index{linear regression} -what the linear model expect to model the data. +what the linear model expects to model the data. ```{r extraLinarModelDataGeneration,echo=FALSE,warning=FALSE,message=FALSE} # set random number seed, so that the random numbers from the text # is the same when you run the code. @@ -1346,7 +1321,7 @@ set.seed(32) # get 100 X values between 1 and 100 x = runif(100,10,200) -# set b0,b1 and varience (sigma) +# set b0,b1 and variance (sigma) b0 = 17 b1 = 0.5 sigma = 30 @@ -1359,8 +1334,8 @@ y = b0 + b1*x+ eps ``` Since we have the data, we can build our model and call the `summary` function. -We will then use `confint()` function to get the confidence intervals on the -coefficients and `coef()` function to pull out the estimated coefficients from +We will then use the `confint()` function to get the confidence intervals on the +coefficients and the `coef()` function to pull out the estimated coefficients from the model. ```{r confintLM} mod1=lm(y~x) @@ -1373,11 +1348,11 @@ confint(mod1) coef(mod1) ``` The `summary()` function prints out an extensive list of values. -The "Coefficients" section has the estimates, their standard error, t score +The "Coefficients" section has the estimates, their standard error, t score, and the p-value from the hypothesis test $H_0:\beta_i=0$. As you can see, the estimate we get for the coefficients and their standard errors are close to -the ones we get from the repeatedly sampling and getting a distribution of -coefficients. This is statistical inference at work, we can estimate the +the ones we get from repeatedly sampling and getting a distribution of +coefficients. This is statistical inference at work, so we can estimate the population properties within a certain error using just a sample. @@ -1385,106 +1360,104 @@ population properties within a certain error using just a sample. 
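
As a sketch of where the numbers in the `summary()` table come from (assuming the `mod1` object fit in the chunk above is still in the workspace), the t-score and p-value for the slope can be recomputed by hand:

```{r}
# estimate and standard error of the slope from the fitted model
est <- coef(summary(mod1))["x", "Estimate"]
se  <- coef(summary(mod1))["x", "Std. Error"]

# t-score for H0: beta_1 = 0 and the corresponding two-sided p-value;
# this should agree with the "Pr(>|t|)" column of summary(mod1)
tscore <- (est - 0) / se
2 * pt(-abs(tscore), df = df.residual(mod1))
```
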
### Accuracy of the model -If you have observed the table output by `summary()` function, you must have noticed there are some other outputs, such as "Residual standard error", +If you have observed the table output of the `summary()` function, you must have noticed there are some other outputs, such as "Residual standard error", "Multiple R-squared" and "F-statistic". These are metrics that are useful for assessing the accuracy of the model. We will explain them one by one. -_ (RSE)_ simply is the square-root of the \index{Residual Standard Error (RSE)} -the sum of squared error terms, divided by degrees of freedom, $n-p$, for simple -linear regression case, $n-2$. Sum of of the squares of the error terms is also -called __"Residual sum of squares"__, RSS. \index{Residual sum of squares (RSS)}So RSE is +__RSE__ is simply the square-root of\index{Residual Standard Error (RSE)} +the sum of squared error terms, divided by degrees of freedom, $n-p$. For the simple +linear regression case, degrees of freedom is $n-2$. Sum of the squares of the error terms is also +called the __"Residual sum of squares"__, RSS. \index{Residual sum of squares (RSS)}So the RSE is calculated as follows: $$ s=RSE=\sqrt{\frac{\sum{(y_i-\hat{Y_i})^2 }}{n-p}}=\sqrt{\frac{RSS}{n-p}}$$ -RSE is a way of assessing the model fit. The larger the RSE the worse the +The RSE is a way of assessing the model fit. The larger the RSE the worse the model is. However, this is an absolute measure in the units of $Y$ and we have nothing to -compare against. One idea is that we divide it by RSS of a simpler model +compare against. One idea is that we divide it by the RSS of a simpler model for comparative purposes. That simpler model is in this case is the model -with the intercept,$\beta_0$. A very bad model will have close zero +with the intercept, $\beta_0$. A very bad model will have close to zero coefficients for explanatory variables, and the RSS of that model will be close to the RSS of the model with only the intercept. In such -a model intercept will be equal to $\overline{Y}$. As it turns out, RSS of -the the model with -just the intercept is called _"Total Sum of Squares" or TSS_. A good model will have a low $RSS/TSS$. The metric $R^2$ uses these quantities to calculate a score between 0 and 1, and closer to 1 the better the model. Here is how +a model the intercept will be equal to $\overline{Y}$. As it turns out, the RSS of the model with +just the intercept is called the _"Total Sum of Squares" or TSS_. A good model will have a low $RSS/TSS$. The metric $R^2$ uses these quantities to calculate a score between 0 and 1, and the closer to 1, the better the model. Here is how it is calculated: $$R^2=1-\frac{RSS}{TSS}=\frac{TSS-RSS}{TSS}=1-\frac{RSS}{TSS}$$ -$TSS-RSS$ part of the formula often referred to as "explained variability" in -the model. The bottom part is for "total variability". With this interpretation, higher -the "explained variability" better the model. For simple linear regression +The $TSS-RSS$ part of the formula is often referred to as "explained variability" in +the model. The bottom part is for "total variability". With this interpretation, the higher +the "explained variability", the better the model. 
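
Here is a quick sketch of these quantities computed by hand and compared to what `summary()` reports, assuming the `mod1` fit and the simulated `y` values from the previous section are still available:

```{r}
RSS <- sum(residuals(mod1)^2)   # residual sum of squares
TSS <- sum((y - mean(y))^2)     # total sum of squares

1 - RSS/TSS                     # R-squared by hand
summary(mod1)$r.squared         # should match

sqrt(RSS / df.residual(mod1))   # RSE = sqrt(RSS/(n - p))
summary(mod1)$sigma             # should match
```
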
For simple linear regression with one explanatory variable, the square root of $R^2$ is a quantity known -as absolute value of the correlation coefficient, which can be calculated for any pair of variables, not only +as the absolute value of the correlation coefficient, which can be calculated for any pair of variables, not only the -response and the explanatory variables. _Correlation_ is a general measure of \index{correlation} +response and the explanatory variables. _Correlation_ is the general measure of \index{correlation} linear relationship between two variables. One -of the most popular flavors of correlation is the Pearson correlation coefficient. Formally, It is the +of the most popular flavors of correlation is the Pearson correlation coefficient. Formally, it is the _covariance_ of X and Y divided by multiplication of standard deviations of \index{covariance} -X and Y. In R, it can be calculated with `cor()` function. +X and Y. In R, it can be calculated with the `cor()` function. $$ r_{xy}=\frac{cov(X,Y)}{\sigma_x\sigma_y} =\frac{\sum\limits_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})} {\sqrt{\sum\limits_{i=1}^n (x_i-\bar{x})^2 \sum\limits_{i=1}^n (y_i-\bar{y})^2}} $$ -In the equation above, $cov$ is the covariance, this is again a measure of +In the equation above, $cov$ is the covariance; this is again a measure of how much two variables change together, like correlation. If two variables \index{covariance} -show similar behavior they will usually have positive covariance value, if they have opposite behavior, the covariance will have negative value. +show similar behavior, they will usually have a positive covariance value. If they have opposite behavior, the covariance will have a negative value. However, these values are boundless. A normalized way of looking at covariance is to divide covariance by the multiplication of standard errors of X and Y. This bounds the values to -1 and 1, and as mentioned -above called Pearson correlation coefficient. The values that change in a similar manner will have a positive coefficient, the values that change in \index{correlation} -opposite manner will have negative coefficient, and pairs do not have -a linear relationship will have 0 or near 0 correlation. In -the Figure \@ref(fig:CorCovar), we are showing $R^2$, correlation -coefficient and covariance for different scatter plots. +above, is called Pearson correlation coefficient. The values that change in a similar manner will have a positive coefficient, the values that change in \index{correlation} +an opposite manner will have a negative coefficient, and pairs that do not have +a linear relationship will have $0$ or near $0$ correlation. In +Figure \@ref(fig:CorCovar), we are showing $R^2$, the correlation +coefficient, and covariance for different scatter plots. 
-```{r CorCovar,fig.width=9,fig.height=3,echo=FALSE,warning=FALSE,message=FALSE, fig.cap="Correlation and covariance for different scatter plots"} -set.seed(32) +```{r CorCovar,fig.width=17,fig.height=5,out.width = "100%",echo=FALSE,warning=FALSE,message=FALSE, fig.cap="Correlation and covariance for different scatter plots."} +set.seed(31) x=runif(50,min=5,max=75) eps=rnorm(50,sd=50) par(mfrow=c(1,5)) par(mar=c(5.1,1.1,4.1,0.1)) y3=5+5*x -plot(x,y3,xlab="",xaxt="n",yaxt="n",col="cornflowerblue",pch=19,cex.main=0.7, +plot(x,y3,xlab="",xaxt="n",yaxt="n",col="cornflowerblue",pch=19, + cex.main=1.5, main= bquote(R^2 == .(cor(x,y3)^2) ~~ r == .(cor(x,y3)) ~~ Cov== .(cov(x,y3)) ) ) y3=5+5*x+eps -plot(x,y3,xaxt="n",yaxt="n",col="cornflowerblue",pch=19,cex.main=0.8, +plot(x,y3,xlab="",xaxt="n",yaxt="n",col="cornflowerblue",pch=19,cex.main=1.5, main= bquote(R^2 == .(round(cor(x,y3)^2,2)) ~~ r == .(round(cor(x,y3),2)) ~~ Cov== .(cov(x,y3)) ) ) y3=rep(5,length(x))+eps -plot(x,y3,xaxt="n",yaxt="n",col="cornflowerblue",pch=19,cex.main=0.8, +plot(x,y3,xlab="",xaxt="n",yaxt="n",col="cornflowerblue",pch=19,cex.main=1.5, main= bquote(R^2 == .(round(cor(x,y3)^2,2)) ~~ r == .(round(cor(x,y3),2)) ~~ Cov== .(cov(x,y3)) ) ) y3=5-5*x+eps -plot(x,y3,xaxt="n",yaxt="n",col="cornflowerblue",pch=19,cex.main=0.8, +plot(x,y3,xlab="",xaxt="n",yaxt="n",col="cornflowerblue",pch=19,cex.main=1.5, main= bquote(R^2 == .(round(cor(x,y3)^2,2)) ~~ r == .(round(cor(x,y3),2)) ~~ Cov== .(cov(x,y3)) ) ) y3=5-5*x -plot(x,y3,xaxt="n",yaxt="n",col="cornflowerblue",pch=19,cex.main=0.8, +plot(x,y3,xlab="",xaxt="n",yaxt="n",col="cornflowerblue",pch=19,cex.main=1.5, main= bquote(R^2 == .(round(cor(x,y3)^2,2)) ~~ r == .(round(cor(x,y3),2)) ~~ Cov== .(cov(x,y3)) ) ) ``` -For simple linear regression, correlation can be used to asses the model. However, this becomes useless as a measure of general accuracy -if the there are more than one explanatory -variable as in multiple linear regression. In that case, $R^2$ is a measure -of accuracy for the model. Interestingly, square of the +For simple linear regression, correlation can be used to assess the model. However, this becomes useless as a measure of general accuracy +if there is more than one explanatory variable as in multiple linear regression. In that case, $R^2$ is a measure +of accuracy for the model. Interestingly, the square of the correlation of predicted values -and original response variables ($(cor(Y,\hat{Y}))^2$ ) equals to $R^2$ for \index{$R^2$} +and original response variables ($(cor(Y,\hat{Y}))^2$ ) equals $R^2$ for \index{$R^2$} multiple linear regression.\index{linear regression} -The last accuracy measure or the model fit in general we are going to explain is _F-statistic_. This is a quantity that depends on RSS and TSS again. It can also answer one important question that other metrics can -not easily answer. That question is whether or not any of the explanatory +The last accuracy measure, or the model fit in general we are going to explain is _F-statistic_. This is a quantity that depends on the RSS and TSS again. It can also answer one important question that other metrics cannot easily answer. That question is whether or not any of the explanatory variables have predictive value or in other words if all the explanatory variables are zero. 
We can write the null hypothesis as follows: $$H_0: \beta_1=\beta_2=\beta_3=...=\beta_p=0 $$ @@ -1493,29 +1466,27 @@ where the alternative is: $$H_1: \text{at least one } \beta_i \neq 0 $$ -Remember $TSS-RSS$ is analogous to "explained variability" and the RSS is -analogous to "unexplained variability". For the F-statistic, we divide explained variance to +Remember that $TSS-RSS$ is analogous to "explained variability" and the RSS is +analogous to "unexplained variability". For the F-statistic, we divide explained variance by unexplained variance. Explained variance is just the $TSS-RSS$ divided by degrees of freedom, and unexplained variance is the RSE. The ratio will follow the F-distribution with two parameters, the degrees of freedom for the explained variance and -the degrees of freedom for the the unexplained variance.F-statistic for a linear model is calculated as follows. +the degrees of freedom for the unexplained variance. The F-statistic for a linear model is calculated as follows. $$F=\frac{(TSS-RSS)/(p-1)}{RSS/(n-p)}=\frac{(TSS-RSS)/(p-1)}{RSE} \sim F(p-1,n-p)$$ If the variances are the same, the ratio will be 1, and when $H_0$ is true, then -it can be shown that expected value of $(TSS-RSS)/(p-1)$ will be $\sigma^2$ -which is estimated by RSE. So, if the variances are significantly different, +it can be shown that expected value of $(TSS-RSS)/(p-1)$ will be $\sigma^2$, which is estimated by the RSE. So, if the variances are significantly different, the ratio will need to be significantly bigger than 1. -If the ratio is large enough we can reject the null hypothesis. To asses that +If the ratio is large enough we can reject the null hypothesis. To assess that, we need to use software or look up the tables for F statistics with calculated parameters. In R, function `qf()` can be used to calculate critical value of the ratio. Benefit of the F-test over looking at significance of coefficients one by one is that we circumvent multiple testing problem. If there are lots of explanatory variables at least 5% of the time (assuming we use 0.05 as P-value significance -cutoff), p-values from coefficient t-tests will be wrong\index{t-test}. In summary, -F-test is a better choice for testing if there is any association +cutoff), p-values from coefficient t-tests will be wrong\index{t-test}. In summary, F-test is a better choice for testing if there is any association between the explanatory variables and the response variable. @@ -1524,22 +1495,21 @@ between the explanatory variables and the response variable. ### Regression with categorical variables An important feature of linear regression is that categorical variables can be used as explanatory variables, this feature is very useful in genomics -where explanatory variables often could be categorical. To put it in +where explanatory variables can often be categorical. To put it in context, in our histone modification \index{histone modification} example we can also include if promoters have CpG islands or not as a variable. 
In addition, in differential gene expression, we usually test the difference between
-different condition which can be encoded as categorical variables in
-a linear regression.\index{linear regression} We can sure use t-test for that as well if there
-are only 2 conditions, but if there are more conditions and other variables
-to control for such as Age or sex of the samples, we need to take those
-into account for our statistics, and t-test alone can not handle such
+different conditions, which can be encoded as categorical variables in
+a linear regression.\index{linear regression} We can certainly use the t-test for that as well if there are only 2 conditions, but if there are more conditions and other variables
+to control for, such as age or sex of the samples, we need to take those
+into account for our statistics, and the t-test alone cannot handle such
 complexity. In addition, when we have categorical variables we can also have numeric variables in the model and we certainly do not have to include only one type of variable in a model.

-The simplest model with categorical variables include two levels that
-can be encoded in 0 and 1. Below, we are showing linear regression with categorical variable. We then plot the fitted line. This plot is shown in Figure \@ref(fig:LMcategorical).
-```{r LMcategorical,out.width='50%',fig.cap="Linear model with a categorical variable coded as 0 and 1"}
+The simplest model with categorical variables includes two levels that
+can be encoded as 0 and 1. Below, we show linear regression with a categorical variable. We then plot the fitted line. This plot is shown in Figure \@ref(fig:LMcategorical).
+```{r LMcategorical,out.width='50%',fig.cap="Linear model with a categorical variable coded as 0 and 1."}
 set.seed(100)
 gene1=rnorm(30,mean=4,sd=2)
 gene2=rnorm(30,mean=2,sd=2)
@@ -1552,8 +1522,8 @@ require(mosaic)
 plotModel(mod2)
```

-we can even compare more levels, we do not even have to encode them
-ourselves. We can pass categorical variables to `lm()` function.
+We can even compare more levels, and we do not even have to encode them
+ourselves. We can pass categorical variables to the `lm()` function.

```{r LMcatcompare}
 gene.df=data.frame(exp=c(gene1,gene2,gene2),
@@ -1566,49 +1536,49 @@ summary(mod3)


### Regression pitfalls

-In most cases one should look at the error terms (residuals) vs fitted
+In most cases one should look at the error terms (residuals) vs. the fitted
 values plot. Any structure in this plot indicates problems such as
-non-linearity, correlation of error terms \index{correlation}, non-constant variance or
+non-linearity, correlation of error terms\index{correlation}, non-constant variance or
 unusual values driving the fit. Below we briefly explain the potential issues with the linear regression.

-##### non-linearity
+##### Non-linearity
 If the true relationship is far from linearity, prediction
 accuracy is reduced and all the other conclusions are questionable. In some cases,
-transforming the data with $logX$, $\sqrt{X}$ and $X^2$ could resolve
+transforming the data with $logX$, $\sqrt{X}$, and $X^2$ could resolve
 the issue.

-##### correlation of explanatory variables
+##### Correlation of explanatory variables
 If the explanatory variables are correlated that could lead to something
 known as multicolinearity. When this happens SE estimates of the coefficients will be too large. This is usually observed in time-course
 data.
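
The effect of correlated explanatory variables on the coefficient standard errors can be sketched with a small simulation (the variable names and numbers below are made up for illustration): once a second predictor that is almost a copy of the first is added, the standard error of the first coefficient inflates.

```{r}
set.seed(3)
pred1 <- rnorm(50)
pred2 <- pred1 + rnorm(50, sd = 0.05)  # nearly identical to pred1
resp  <- 2 + 3*pred1 + rnorm(50)

# standard error of the pred1 coefficient without and with the collinear pred2
coef(summary(lm(resp ~ pred1)))["pred1", "Std. Error"]
coef(summary(lm(resp ~ pred1 + pred2)))["pred1", "Std. Error"]
```
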
-##### correlation of error terms
-This assumes that the errors of the response variables are uncorrelated with each other. If they are confidence intervals in the coefficients
-might too narrow.
+##### Correlation of error terms
+This assumes that the errors of the response variables are uncorrelated with each other. If they are correlated, the confidence intervals of the coefficients
+might be too narrow.

 ##### Non-constant variance of error terms
 This means that different response variables have the same variance in their errors, regardless of the values of the predictor variables. If
-the errors are not constant, if for the errors grow as X grows this
+the errors are not constant (e.g., the errors grow as X values increase), this
 will result in unreliable estimates in standard errors as the model
 assumes constant variance. Transformation of data, such as
-$logX$ and $\sqrt{X}$ could help in some cases.
+$logX$ and $\sqrt{X}$, could help in some cases.

-##### outliers and high leverage points
+##### Outliers and high leverage points
 Outliers are extreme values for Y and high leverage points are unusual\index{outliers}
-X values. Both of these extremes have power to affect the fitted line
-and the standard errors. In some cases (measurement error), they can be
+X values. Both of these extremes have the power to affect the fitted line
+and the standard errors. In some cases (e.g., if there are measurement errors), they can be
 removed from the data for a better fit.

```{block2, statsLinMod, type='rmdtip'}

__Want to know more ?__

-- linear models and derivations of equations including matrix notation
-    - Applied Linear Statistical Models by Kutner, Nachtsheim, et al. [@kutner2003applied]
-    - Elements of statistical learning by Hastie & Tibshirani [@friedman2001elements]
-    - An Introduction to statistical learning by James, Witten, et al.[@james2013introduction]
+- Linear models and derivations of equations including matrix notation
+    - _Applied Linear Statistical Models_ by Kutner, Nachtsheim, et al. [@kutner2003applied]
+    - _Elements of Statistical Learning_ by Hastie & Tibshirani [@friedman2001elements]
+    - _An Introduction to Statistical Learning_ by James, Witten, et al. [@james2013introduction]
```


@@ -1617,7 +1587,7 @@ __Want to know more ?__

### How to summarize collection of data points: The idea behind statistical distributions

 1. Calculate the means and variances
-of the rows of the following simulated data set, plot the distributions
+of the rows of the following simulated data set, and plot the distributions
 of means and variances using `hist()` and `boxplot()` functions. [Difficulty: **Beginner/Intermediate**]
```{r getDataChp3Ex,eval=FALSE}
 set.seed(100)
@@ -1629,26 +1599,27 @@ data=matrix(gset,ncol=6)


 2. Using the data generated above, calculate the standard deviation of the
-distribution of the means using `sd()` function. Compare that to the expected
-standard error obtained from central limit theorem keeping in mind the
+distribution of the means using the `sd()` function. Compare that to the expected
+standard error obtained from the central limit theorem keeping in mind the
 population parameters were $\sigma=70$ and $n=6$. How does the estimate
 from the random samples change if we simulate more data with
`data=matrix(rnorm(6000,mean=200,sd=70),ncol=6)`? [Difficulty: **Beginner/Intermediate**]

-3. simulate 30 random variables using `rpois()` function, do this 1000 times and calculate means of each sample. Plot the sampling distributions of the means
+3. 
Simulate 30 random variables using the `rpois()` function. Do this 1000 times and calculate the mean of each sample. Plot the sampling distributions of the means
using a histogram. Get the 2.5th and 97.5th percentiles of
the distribution. [Difficulty: **Beginner/Intermediate**]
-4. Use `t.test` function to calculate confidence intervals
-of the mean on the first random sample `pois1` simulated from`rpois()` function below.[Difficulty: **Intermediate**]
+4. Use the `t.test()` function to calculate confidence intervals
+of the mean on the first random sample `pois1` simulated from the `rpois()` function below. [Difficulty: **Intermediate**]
```{r exRpoisChp3,eval=FALSE}
+#HINT
set.seed(100)

#sample 30 values from poisson dist with lamda paramater =30
pois1=rpois(30,lambda=5)

```

-5. Use bootstrap confidence interval for the mean on `pois1`[Difficulty: **Intermediate/Advanced**]
-6. compare the theoretical confidence interval of the mean from `t.test` and the bootstrap confidence interval. Are they similar ?[Difficulty: **Intermediate/Advanced**]
-7. Try to recreate the following figure, which demonstrates the CLT concept.[Difficulty: **Advanced**]
+5. Use the bootstrap confidence interval for the mean on `pois1`. [Difficulty: **Intermediate/Advanced**]
+6. Compare the theoretical confidence interval of the mean from the `t.test` and the bootstrap confidence interval. Are they similar? [Difficulty: **Intermediate/Advanced**]
+7. Try to re-create the following figure, which demonstrates the CLT concept. [Difficulty: **Advanced**]
```{r,echo=FALSE,message=FALSE,warning=FALSE}
set.seed(101)
require(mosaic)
@@ -1716,8 +1687,8 @@ hist(unif100,xlim=c(0,1),main="",xlab="",ylab="",breaks=20,col="gray",



### How to test for differences in samples
1. Test the difference of means of the following simulated genes
-using the randomization, t-test and `wilcox.test()` functions.
-Plot the distributions using histograms and boxplots.[Difficulty: **Intermediate/Advanced**]
+using the randomization, `t.test()`, and `wilcox.test()` functions.
+Plot the distributions using histograms and boxplots. [Difficulty: **Intermediate/Advanced**]
```{r exRnorm1chp3,eval=FALSE}
set.seed(101)
gene1=rnorm(30,mean=4,sd=3)
gene2=rnorm(30,mean=3,sd=3)

```


-2. Test the difference of means of the following simulated genes
-using the randomization, t-test and `wilcox.test()` functions.
-Plot the distributions using histograms and boxplots.[Difficulty: **Intermediate/Advanced**]
+2. Test the difference of the means of the following simulated genes
+using the randomization, `t.test()` and `wilcox.test()` functions.
+Plot the distributions using histograms and boxplots. [Difficulty: **Intermediate/Advanced**]
```{r exRnorm2chp3,eval=FALSE}
set.seed(100)
gene1=rnorm(30,mean=4,sd=2)
gene2=rnorm(30,mean=2,sd=2)

```

3. We need an extra data set for this exercise. Read the gene expression data set as follows:
-`gexpFile=system.file("extdata","geneExpMat.rds",package="compGenomRData") data=readRDS(gexpFile)` The data has 100 differentially expressed genes.First 3 columns are the test samples, and the last 3 are the control samples. Do
-a t-test for each gene (each row is a gene), record the p-values.
-Then, do a moderated t-test, as shown in the lecture notes and record
-the p-values. Do a p-value histogram and compare two approaches in terms of the number of significant tests with 0.05 threshold.
+`gexpFile=system.file("extdata","geneExpMat.rds",package="compGenomRData") data=readRDS(gexpFile)`. 
The data has 100 differentially expressed genes. The first 3 columns are the test samples, and the last 3 are the control samples. Do +a t-test for each gene (each row is a gene), and record the p-values. +Then, do a moderated t-test, as shown in section "Moderated t-tests" in this chapter, and record +the p-values. Make a p-value histogram and compare two approaches in terms of the number of significant tests with the $0.05$ threshold. On the p-values use FDR (BH), Bonferroni and q-value adjustment methods. Calculate how many adjusted p-values are below 0.05 for each approach. [Difficulty: **Intermediate/Advanced**] -### Relationship between variables: linear models and correlation +### Relationship between variables: Linear models and correlation Below we are going to simulate X and Y values that are needed for the rest of the exercise. @@ -1756,7 +1727,7 @@ set.seed(32) # get 50 X values between 1 and 100 x = runif(50,1,100) -# set b0,b1 and varience (sigma) +# set b0,b1 and variance (sigma) b0 = 10 b1 = 2 sigma = 20 @@ -1768,29 +1739,33 @@ y = b0 + b1*x+ eps 1. Run the code then fit a line to predict Y based on X. [Difficulty:**Intermediate**] -2. Plot the scatter plot and the fitted line.[Difficulty:**Intermediate**] +2. Plot the scatter plot and the fitted line. [Difficulty:**Intermediate**] 3. Calculate correlation and R^2. [Difficulty:**Intermediate**] 4. Run the `summary()` function and try to extract P-values for the model from the object -returned by `summary`. see `?summary.lm`.[Difficulty:**Intermediate/Advanced**] -5. Plot the residuals vs fitted values plot, by calling `plot` +returned by `summary`. See `?summary.lm`. [Difficulty:**Intermediate/Advanced**] +5. Plot the residuals vs. the fitted values plot, by calling the `plot()` function with `which=1` as the second argument. First argument -is the model returned by `lm`.[Difficulty:**Advanced**] +is the model returned by `lm()`. [Difficulty:**Advanced**] 6. For the next exercises, read the data set histone modification data set. Use the following to get the path to the file: -`hmodFile=system.file("extdata","HistoneModeVSgeneExp.rds",package="compGenomRData")`. There -are 3 columns in the data set these are measured levels of H3K4me3, -H3K27me3 and gene expression per gene. Once you read in the data, plot the scatter plot for H3K4me3 vs expression.[Difficulty:**Beginner**] -7. plot the scatter plot for H3K27me3 vs expression. [Difficulty:**Beginner**] -8. fit the model model for prediction of expression data using: - - only H3K4me3 as explanatory variable - - only H3K27me3 as explanatory variable - - using both H3K4me3 and H3K27me3 as explanatory variables -Inspect summary() function output in each case, which terms are significant. [Difficulty:**Beginner/Intermediate**] -10. Is using H3K4me3 and H3K27me3 better than the model with only H3K4me3. [Difficulty:**Intermediate**] -11. Plot H3k4me3 vs H3k27me3. Inspect the points that does not +``` +hmodFile=system.file("extdata", + "HistoneModeVSgeneExp.rds", + package="compGenomRData")` +``` +There are 3 columns in the dataset. These are measured levels of H3K4me3, +H3K27me3 and gene expression per gene. Once you read in the data, plot the scatter plot for H3K4me3 vs. expression. [Difficulty:**Beginner**] + +7. Plot the scatter plot for H3K27me3 vs. expression. [Difficulty:**Beginner**] + +8. 
Fit the model for prediction of expression data using: 1) Only H3K4me3 as explanatory variable, 2) Only H3K27me3 as explanatory variable, and 3) Using both H3K4me3 and H3K27me3 as explanatory variables. Inspect the `summary()` function output in each case, which terms are significant. [Difficulty:**Beginner/Intermediate**] + +10. Is using H3K4me3 and H3K27me3 better than the model with only H3K4me3? [Difficulty:**Intermediate**] + +11. Plot H3k4me3 vs. H3k27me3. Inspect the points that do not follow a linear trend. Are they clustered at certain segments -of the plot. Bonus: Is there any biological or technical interpretation -for those points ?[Difficulty:**Intermediate/Advanced**] +of the plot? Bonus: Is there any biological or technical interpretation +for those points? [Difficulty:**Intermediate/Advanced**] diff --git a/04-unsupervisedLearning.Rmd b/04-unsupervisedLearning.Rmd index ba39260..0976b7e 100644 --- a/04-unsupervisedLearning.Rmd +++ b/04-unsupervisedLearning.Rmd @@ -1,5 +1,5 @@ # Exploratory Data Analysis with Unsupervised Machine Learning {#unsupervisedLearning} -In this chapter, we will focus on using some of the machine learning techniques to explore genomics data. The goals of data exploration is usually many. Generally, we want to understand how the variables in our data set relate to each other and how the samples defined by those variables relate to each other. These points of information can be used to generate hypothesis, find outliers \index{outliers}in the samples or identify sample groups that need more data points. In this chapter, we will focus on two main classes of techniques: "clustering" and "dimension reduction". We will show how to use these techniques and how to visualize them using R. As these techniques are fundamental for data analysis, we will see more of their use cases in chapters \@ref(rnaseqanalysis), \@ref(chipseq), \@ref(bsseq) and \@ref(multiomics). +In this chapter, we will focus on using some of the machine learning techniques to explore genomics data. The goals of data exploration are usually many. Generally, we want to understand how the variables in our data set relate to each other and how the samples defined by those variables relate to each other. These points of information can be used to generate a hypothesis, find outliers \index{outliers}in the samples or identify sample groups that need more data points. In this chapter, we will focus on two main classes of techniques: "clustering" and "dimension reduction". We will show how to use these techniques and how to visualize them using R. As these techniques are fundamental for data analysis, we will see more of their use cases in Chapters \@ref(rnaseqanalysis), \@ref(chipseq), \@ref(bsseq) and \@ref(multiomics). ```{r setupML, include=FALSE} knitr::opts_chunk$set(echo = TRUE, @@ -13,15 +13,15 @@ knitr::opts_chunk$set(echo = TRUE, ``` -## Clustering: grouping samples based on their similarity +## Clustering: Grouping samples based on their similarity -In genomics, we would very frequently want to assess how our samples relate to each other. Are our replicates similar to each other? Do the samples from the same treatment group have the similar genome-wide signals ? Do the patients with similar diseases have similar gene expression profiles ? -Take the last question for example. We need to define a distance or similarity metric between patients' expression profiles and use that metric to find groups of patients that are more similar to each other than the rest of the patients. 
This, in essence, is the general idea behind clustering. We need a distance metric and a method to utilize that distance metric to find self-similar groups. Clustering is a ubiquitous procedure in bioinformatics as well as any field that deals with high-dimensional data. It is very likely every genomics paper containing multiple samples have some sort of clustering. Due to this ubiquity and general usefulness, it is an essential technique to learn. +In genomics, we would very frequently want to assess how our samples relate to each other. Are our replicates similar to each other? Do the samples from the same treatment group have similar genome-wide signals? Do the patients with similar diseases have similar gene expression profiles? +Take the last question for example. We need to define a distance or similarity metric between patients' expression profiles and use that metric to find groups of patients that are more similar to each other than the rest of the patients. This, in essence, is the general idea behind clustering. We need a distance metric and a method to utilize that distance metric to find self-similar groups. Clustering is a ubiquitous procedure in bioinformatics as well as any field that deals with high-dimensional data. It is very likely that every genomics paper containing multiple samples has some sort of clustering. Due to this ubiquity and general usefulness, it is an essential technique to learn. ### Distance metrics -The first required step for clustering is the distance metric. This is simply a measurement of how similar gene expressions to each other are. There are many options for distance metrics and the choice of the metric is quite important for clustering. Consider a simple example where we have four patients and expression of three genes measured in Table \@ref(tab:expTable). Which patients look similar to each other based on their gene expression profiles \index{gene expression}? +The first required step for clustering is the distance metric. This is simply a measurement of how similar gene expressions are to each other. There are many options for distance metrics and the choice of the metric is quite important for clustering. Consider a simple example where we have four patients and expression of three genes measured in Table \@ref(tab:expTable). Which patients look similar to each other based on their gene expression profiles \index{gene expression}? ```{r expTable,echo=FALSE} df=data.frame( @@ -37,9 +37,9 @@ knitr::kable( ) ``` -It may not be obvious from the table at first sight but if we plot the gene expression profile for each patient (shown in Figure \@ref(fig:expPlot)), we will see that expression profiles of patient 1 and patient 2 is more similar to each other than patient 3 or patient 4. +It may not be obvious from the table at first sight, but if we plot the gene expression profile for each patient (shown in Figure \@ref(fig:expPlot)), we will see that expression profiles of patient 1 and patient 2 are more similar to each other than patient 3 or patient 4. -```{r expPlot,echo=FALSE,out.width='50%',fig.cap="Gene expression values for different patients. Certain patients have similar gene expression values to each other."} +```{r expPlot,echo=FALSE,out.width='50%',fig.cap="Gene expression values for different patients. 
Certain patients have gene expression values that are similar to each other."} library(ggplot2) df2=tidyr::gather(cbind(patient=rownames(df),df),key="gene",value="expression",IRX4,PAX6,OCT4) @@ -47,28 +47,28 @@ ggplot(df2, aes(gene,expression, fill = patient)) + geom_bar(stat = "identity", ``` -But how can we quantify what see by eye ? A simple metric for distance between gene expression vectors between a given patient pair is the sum of absolute difference between gene expression values This can be formulated as follows: $d_{AB}={\sum _{i=1}^{n}|e_{Ai}-e_{Bi}|}$, where $d_{AB}$ is the distance between patient A and B, and $e_{Ai}$ and $e_{Bi}$ expression value of the $i$th gene for patient A and B. This distance metric is called **"Manhattan distance"** or **"L1 norm"**. \index{Manhattan distance} +But how can we quantify what we see? A simple metric for distance between gene expression vectors between a given patient pair is the sum of the absolute difference between gene expression values. This can be formulated as follows: $d_{AB}={\sum _{i=1}^{n}|e_{Ai}-e_{Bi}|}$, where $d_{AB}$ is the distance between patients A and B, and the $e_{Ai}$ and $e_{Bi}$ are expression values of the $i$th gene for patients A and B. This distance metric is called the **"Manhattan distance"** or **"L1 norm"**. \index{Manhattan distance} \index{L1 norm} -Another distance metric using sum of squared distances and taking a square root of resulting value, that can be formulated as: $d_{AB}={{\sqrt {\sum _{i=1}^{n}(e_{Ai}-e_{Bi})^{2}}}}$. This distance is called **"Euclidean Distance"** or **"L2 norm"**. This is usually the default distance metric for many clustering algorithms. due to squaring operation values that are very different get higher contribution to the distance. Due to this, compared to Manhattan distance it can be more affected by outliers\index{outliers} but generally if the outliers are rare this distance metric works well. +Another distance metric uses the sum of squared distances and takes the square root of resulting value; this metric can be formulated as: $d_{AB}={{\sqrt {\sum _{i=1}^{n}(e_{Ai}-e_{Bi})^{2}}}}$. This distance is called **"Euclidean Distance"** or **"L2 norm"**. This is usually the default distance metric for many clustering algorithms. Due to the squaring operation, values that are very different get higher contribution to the distance. Due to this, compared to the Manhattan distance, it can be affected more by outliers\index{outliers}. But, generally if the outliers are rare, this distance metric works well. -The last metric we will introduce is the **"correlation distance"**. This is simply $d_{AB}=1-\rho$, where $\rho$ is the Pearson correlation coefficient between two vectors, in our case those vectors are gene expression profiles of patients. Using this distance the gene expression vectors that have a similar pattern will have a small distance whereas when the vectors have different patterns they will have a large distance. In this case, the linear correlation between vectors matters, the the scale of the vectors might be different.\index{correlation distance} +The last metric we will introduce is the **"correlation distance"**. This is simply $d_{AB}=1-\rho$, where $\rho$ is the Pearson correlation coefficient between two vectors; in our case those vectors are gene expression profiles of patients. 
Using this distance the gene expression vectors that have a similar pattern will have a small distance, whereas when the vectors have different patterns they will have a large distance. In this case, the linear correlation between vectors matters, although the scale of the vectors might be different.\index{correlation distance} -Now let's see how we can calculate these distance in R. First, we have our gene expression per patient table. +Now let's see how we can calculate these distances in R. First, we have our gene expression per patient table. ```{r dists1} df ``` -Next, we calculate the distance metrics using `dist` function and `1-cor()`. +Next, we calculate the distance metrics using the `dist()` function and `1-cor()` expression. ```{r distMethodChp4} dist(df,method="manhattan") dist(df,method="euclidean") -as.dist(1-cor(t(df))) +as.dist(1-cor(t(df))) # correlation distance ``` #### Scaling before calculating the distance -Before we proceed to the clustering, one more thing we need to take care. Should we normalize our data ? Scale of the vectors in our expression matrix can affect the distance calculation. Gene expression tables are usually have some sort of normalization, so the values are in comparable scales. But somehow if a gene's expression values were on much higher scale than the other genes, that gene will effect the distance more than other when using Euclidean or Manhattan distance. If that is the case we can scale the variables.The traditional way of scaling variables is to subtract their mean, and divide by their standard deviation, this operation is also called "standardization". If this is done on all genes, each gene will have the same affect on distance measures. The decision to apply scaling ultimately depends on our data and what you want to achieve. If the gene expression values are previously normalized between patients, having genes that dominate the distance metric could have a biological meaning and therefore it may not be desirable to further scale variables. In R, the standardization is done via `scale()` function. Here we scale the gene expression values.\index{scaling} +Before we proceed to the clustering, there is one more thing we need to take care of. Should we normalize our data? The scale of the vectors in our expression matrix can affect the distance calculation. Gene expression tables might have some sort of normalization, so the values are in comparable scales. But somehow, if a gene's expression values are on a much higher scale than the other genes, that gene will affect the distance more than others when using Euclidean or Manhattan distance. If that is the case we can scale the variables. The traditional way of scaling variables is to subtract their mean, and divide by their standard deviation, this operation is also called "standardization". If this is done on all genes, each gene will have the same effect on distance measures. The decision to apply scaling ultimately depends on our data and what you want to achieve. If the gene expression values are previously normalized between patients, having genes that dominate the distance metric could have a biological meaning and therefore it may not be desirable to further scale variables. In R, the standardization is done via the `scale()` function. Here we scale the gene expression values.\index{scaling} ```{r scaling} df scale(df) @@ -76,18 +76,18 @@ scale(df) ### Hiearchical clustering -This is one of the most ubiquitous clustering algorithms. 
Using this algorithm you can see the relationship of individual data points and relationships of clusters. This is achieved successively joining small clusters to each other based on the inter-cluster distance. Eventually, you get a tree structure or a dendrogram that shows the relationship between the individual data points and clusters. The height of the dendrogram is the distance between clusters. Here we can show how to use this on our toy data set from four patients. The base function in R to do hierarchical clustering in `hclust()`. Below, we apply that function on Euclidean distances between patients.The resulting clustering tree or dendrogram is shown in Figure \@ref(fig:expPlot).\index{clustering!hierarchical clustering} +This is one of the most ubiquitous clustering algorithms. Using this algorithm you can see the relationship of individual data points and relationships of clusters. This is achieved by successively joining small clusters to each other based on the inter-cluster distance. Eventually, you get a tree structure or a dendrogram that shows the relationship between the individual data points and clusters. The height of the dendrogram is the distance between clusters. Here we can show how to use this on our toy data set from four patients. The base function in R to do hierarchical clustering in `hclust()`. Below, we apply that function on Euclidean distances between patients. The resulting clustering tree or dendrogram is shown in Figure \@ref(fig:expPlot).\index{clustering!hierarchical clustering} ```{r toyClust,fig.cap="Dendrogram of distance matrix",out.width='50%'} d=dist(df) hc=hclust(d,method="complete") plot(hc) ``` -In the above code snippet, we have used `method="complete"` argument without explaining it. The `method` argument defines the criteria that directs how the sub-clusters are merged. During clustering starting with single-member clusters, the clusters are merged based on the distance between them. There are many different ways to define distance between clusters and based on which definition you use the hierarchical clustering results change. So the `method` argument controls that. There are a couple of values this argument can take, we list them and their description below: +In the above code snippet, we have used the `method="complete"` argument without explaining it. The `method` argument defines the criteria that directs how the sub-clusters are merged. During clustering, starting with single-member clusters, the clusters are merged based on the distance between them. There are many different ways to define distance between clusters, and based on which definition you use, the hierarchical clustering results change. So the `method` argument controls that. There are a couple of values this argument can take; we list them and their description below: -- **"complete"** stands for "Complete Linkage" and the distance between two clusters is defined as largest distance between any members of the two clusters. -- **"single"** stands for "Single Linkage" and the distance between two clusters is defined as smallest distance between any members of the two clusters. -- **"average"** stands for "Average Linkage" or more precisely UPGMA (Unweighted Pair Group Method with Arithmetic Mean) method. In this case, the distance between two clusters is defined as average distance between any members of the two clusters. -- **"ward.D2"** and **"ward.D"** stands for different implementations of Ward's minimum variance method. 
This method aims to find compact, spherical clusters by selecting clusters to merge based on the change in the cluster variances. The clusters are merged if the increase in the combined variance over the sum of the cluster specific variances is minimum compared to alternative merging operations. +- **"complete"** stands for "Complete Linkage" and the distance between two clusters is defined as the largest distance between any members of the two clusters. +- **"single"** stands for "Single Linkage" and the distance between two clusters is defined as the smallest distance between any members of the two clusters. +- **"average"** stands for "Average Linkage" or more precisely the UPGMA (Unweighted Pair Group Method with Arithmetic Mean) method. In this case, the distance between two clusters is defined as the average distance between any members of the two clusters. +- **"ward.D2"** and **"ward.D"** stands for different implementations of Ward's minimum variance method. This method aims to find compact, spherical clusters by selecting clusters to merge based on the change in the cluster variances. The clusters are merged if the increase in the combined variance over the sum of the cluster-specific variances is the minimum compared to alternative merging operations. ```{r setupData,eval=FALSE,echo=FALSE} library(leukemiasEset) @@ -99,9 +99,9 @@ saveRDS(mat,"leukemiaExpression.rds") mat=readRDS("leukemiaExpressionSubset.rds") ``` -In real life, we would get expression profiles from thousands of genes and we will typically have many more patients than our toy example. One such data set is gene expression values from 60 bone marrow samples of patients with one of the four main types of leukemia (ALL, AML, CLL, CML) or no-leukemia controls. We trimmed that data set down to top 1000 most variable genes to be able to work with it easier and in addition genes that are not very variable do not contribute much to the distances between patients. We will now use this data set to cluster the patients and display the values as a heatmap and a dendrogram. The heatmap shows the expression values of genes across patients in a color coded manner. The heatmap function, `pheatmap()`, we will use performs the clustering as well. The matrix that contains gene expressions has the genes in the rows and the patients in the columns. Therefore, we will also use a column-side color code to mark the patients based on their leukemia type. For the hierarchical clustering, we will use Ward's method designated by `clustering_method` argument to `pheatmap()` function. The resulting heatmap is shown in Figure \@ref(fig:heatmap1). \index{heatmap} +In real life, we would get expression profiles from thousands of genes and we will typically have many more patients than our toy example. One such data set is gene expression values from 60 bone marrow samples of patients with one of the four main types of leukemia (ALL, AML, CLL, CML) or no-leukemia controls. We trimmed that data set down to the top 1000 most variable genes to be able to work with it more easily, since genes that are not very variable do not contribute much to the distances between patients. We will now use this data set to cluster the patients and display the values as a heatmap and a dendrogram. The heatmap shows the expression values of genes across patients in a color coded manner. The heatmap function, `pheatmap()`, that we will use performs the clustering as well. 
The matrix that contains gene expressions has the genes in the rows and the patients in the columns. Therefore, we will also use a column-side color code to mark the patients based on their leukemia type. For the hierarchical clustering, we will use Ward's method designated by the `clustering_method` argument to the `pheatmap()` function. The resulting heatmap is shown in Figure \@ref(fig:heatmap1). \index{heatmap} -```{r heatmap1,eval=TRUE,out.width='50%',fig.cap="Heatmap of gene expression values from Leukemia patients. Each column represents a patient. Columns are clustered using gene expression and color coded by disease type:ALL, AML, CLL, CML or no-leukemia "} +```{r heatmap1,eval=TRUE,out.width='50%',fig.cap="Heatmap of gene expression values from leukemia patients. Each column represents a patient. Columns are clustered using gene expression and color coded by disease type: ALL, AML, CLL, CML or no-leukemia "} library(pheatmap) expFile=system.file("extdata","leukemiaExpressionSubset.rds", package="compGenomRData") @@ -119,29 +119,29 @@ pheatmap(mat,show_rownames=FALSE,show_colnames=FALSE, clustering_distance_cols="euclidean") ``` -As we can observe in the heatmap each cluster has a distinct set of expression values. The main clusters almost perfectly distinguish the leukemia types. Only one CML patient is clustered as a non-leukemia sample. This could mean that gene expression profiles are enough to classify leukemia type. More detailed analysis and experiments are needed to verify that but by looking at this exploratory analysis we can decide where to focus our efforts next. +As we can observe in the heatmap, each cluster has a distinct set of expression values. The main clusters almost perfectly distinguish the leukemia types. Only one CML patient is clustered as a non-leukemia sample. This could mean that gene expression profiles are enough to classify leukemia type. More detailed analysis and experiments are needed to verify that, but by looking at this exploratory analysis we can decide where to focus our efforts next. -#### where to cut the tree ? -The example above seems like a clear cut example where we can pick by eye clusters from the dendrogram. This is mostly due to the Ward's method where compact clusters are preferred. However, as it is usually the case we do not have patient labels and it would be difficult to tell which leaves (patients) in the dendrogram we should consider as part of the same cluster. In other words, how deep we should cut the dendrogram so that every patient sample still connected via the remaining sub-dendrograms constitute clusters. The `cutree()` function provides the functionality to output either desired number of clusters or clusters obtained from cutting the dendrogram at a certain height. Below, we will cluster the patients with hierarchical clustering using the default method "complete linkage" and cut the dendrogram at a certain height. In this case, you will also observe that, changing from Ward's distance to complete linkage had an effect on clustering. Now the two clusters that are defined by Ward's distance are closer to each other and harder to separate from each other, shown in Figure \@ref(fig:hclustNcut). -```{r hclustNcut,out.width='50%',fig.cap="Dendrogram of Leukemia patients clustered by hieararchical clustering. Rectangles show the cluster we will get if we cut the tree at `height=80`"} +#### Where to cut the tree ? +The example above seems like a clear-cut example where we can pick clusters from the dendrogram by eye. 
This is mostly due to Ward's method, where compact clusters are preferred. However, as is usually the case, we do not have patient labels and it would be difficult to tell which leaves (patients) in the dendrogram we should consider as part of the same cluster. In other words, how deep we should cut the dendrogram so that every patient sample still connected via the remaining sub-dendrograms constitute clusters. The `cutree()` function provides the functionality to output either desired number of clusters or clusters obtained from cutting the dendrogram at a certain height. Below, we will cluster the patients with hierarchical clustering using the default method "complete linkage" and cut the dendrogram at a certain height. In this case, you will also observe that, changing from Ward's distance to complete linkage had an effect on clustering. Now the two clusters that are defined by Ward's distance are closer to each other and harder to separate from each other, shown in Figure \@ref(fig:hclustNcut). +```{r hclustNcut,out.width='50%',fig.cap="Dendrogram of Leukemia patients clustered by hierarchical clustering. Rectangles show the cluster we will get if we cut the tree at `height=80`."} hcl=hclust(dist(t(mat))) plot(hcl,labels = FALSE, hang= -1) rect.hclust(hcl, h = 80, border = "red") -clu.k5=cutree(hcl,k=5) # cut tree so that there are 4 clusters +clu.k5=cutree(hcl,k=5) # cut tree so that there are 5 clusters clu.h80=cutree(hcl,h=80) # cut tree/dendrogram from height 80 table(clu.k5) # number of samples for each cluster ``` -Apart from the arbitrary values for the height or the number of the clusters, how can we define clusters more systematically? As this is a general question, we will show later how to decide the optimal number of clusters later in this chapter. +Apart from the arbitrary values for the height or the number of clusters, how can we define clusters more systematically? As this is a general question, we will show how to decide the optimal number of clusters later in this chapter. ### K-means clustering -Another, very common clustering algorithm is k-means.This method divides or partitions the data points, our working example patients, into a pre-determined, "k" number of clusters \index{clustering!k-means} [@hartigan1979algorithm]. Hence, this type of methods are generally called "partitioning" methods. The algorithm is initialized with randomly chosen $k$ centers or centroids. In a sense, a centroid is a data point with multiple values. In our working example, it is a hypothetical patient with gene expression values. But in the initialization phase, those gene expression values are chosen randomly within the boundaries of the gene expression distributions from real patients. As the next step in the algorithm, each patient is assigned to the closest centroid and in the next iteration centroids are set to the mean of values of the genes in the cluster. This process of setting centroids and assigning patients to the clusters repeats itself until sum of squared distances to cluster centroids is minimized. +Another very common clustering algorithm is k-means. This method divides or partitions the data points, our working example patients, into a pre-determined, "k" number of clusters \index{clustering!k-means} [@hartigan1979algorithm]. Hence, these types of methods are generally called "partitioning" methods. The algorithm is initialized with randomly chosen $k$ centers or centroids. In a sense, a centroid is a data point with multiple values. 
In our working example, it is a hypothetical patient with gene expression values. But in the initialization phase, those gene expression values are chosen randomly within the boundaries of the gene expression distributions from real patients. As the next step in the algorithm, each patient is assigned to the closest centroid, and in the next iteration, centroids are set to the mean of values of the genes in the cluster. This process of setting centroids and assigning patients to the clusters repeats itself until the sum of squared distances to cluster centroids is minimized. -As you might see, the cluster algorithm starts with random initial centroids. This feature might yield different results for each run of the algorithm. We will know show how to use k-means method on the gene expression data set. We will use `set.seed()` for reproducibility. In the wild, you might want to run this algorithm multiple times to see if your clustering results are stable. +As you might see, the cluster algorithm starts with random initial centroids. This feature might yield different results for each run of the algorithm. We will now show how to use the k-means method on the gene expression data set. We will use `set.seed()` for reproducibility. In the wild, you might want to run this algorithm multiple times to see if your clustering results are stable. ```{r kmeans} set.seed(101) @@ -153,7 +153,7 @@ kclu=kmeans(t(mat),centers=5) # number of data points in each cluster table(kclu$cluster) ``` -Now let us check the percentage of each leukemia type in each cluster. We can visualize this as a table. Looking at the table below, we see that each of the 5 clusters are predominantly representing one of the 4 leukemia types or the control patients without leukemia. +Now let us check the percentage of each leukemia type in each cluster. We can visualize this as a table. Looking at the table below, we see that each of the 5 clusters predominantly represents one of the 4 leukemia types or the control patients without leukemia. ```{r type2kclu} type2kclu = data.frame( LeukemiaType =substr(colnames(mat),1,3), @@ -163,7 +163,7 @@ table(type2kclu) ``` -Another related and maybe more robust algorithm is called **"k-medoids"** clustering [@reynolds2006clustering]. The procedure is almost identical to k-means clustering with a couple of differences. \index{clustering!k-medoids} In this case, centroids chosen are real data points in our case patients, and the metric we are trying to optimize in each iteration is based on Manhattan distance to the centroid. In k-means this was based on sum of squared distances so euclidean distance. Below we are showing how to use k-medoids clustering function `pam()` \index{clustering!pam} from the `cluster` package.\index{R Packages!\texttt{cluster}} +Another related and maybe more robust algorithm is called **"k-medoids"** clustering [@reynolds2006clustering]. The procedure is almost identical to k-means clustering with a couple of differences. \index{clustering!k-medoids} In this case, centroids chosen are real data points in our case patients, and the metric we are trying to optimize in each iteration is based on the Manhattan distance to the centroid. In k-means this was based on the sum of squared distances, so Euclidean distance. 
Below we show how to use the k-medoids clustering function `pam()` \index{clustering!pam} from the `cluster` package.\index{R Packages!\texttt{cluster}} ```{r kmed} kmclu=cluster::pam(t(mat),k=5) # cluster using k-medoids @@ -175,9 +175,9 @@ type2kmclu = data.frame( table(type2kmclu) ``` -We can not visualize the clustering from partitioning methods with a tree like we did for hierarchical clustering. Even if we can get the distances between patients the algorithm does not return the distances between clusters out of the box. However, if we had a way to visualize the distances between patients in 2 dimensions we could see the how patients and clusters relate each other. It turns out, that there is a way to compress between patient distances to a 2-dimensional plot. There are many ways to do this and we introduce these dimension reduction methods including the one we will use now later in this chapter. For now, we are going to use a method called "multi-dimensional scaling" and plot the patients in a 2D plot color coded by their cluster assignments shown in Figure \@ref(fig:kmeansmds). We will explain this method in more detail at [Multi-dimensional scaling] section below. +We cannot visualize the clustering from partitioning methods with a tree like we did for hierarchical clustering. Even if we can get the distances between patients the algorithm does not return the distances between clusters out of the box. However, if we had a way to visualize the distances between patients in 2 dimensions we could see the how patients and clusters relate to each other. It turns out that there is a way to compress between patient distances to a 2-dimensional plot. There are many ways to do this, and we introduce these dimension-reduction methods including the one we will use later in this chapter. For now, we are going to use a method called "multi-dimensional scaling" and plot the patients in a 2D plot color coded by their cluster assignments shown in Figure \@ref(fig:kmeansmds). We will explain this method in more detail in the [Multi-dimensional scaling] section below. -```{r, kmeansmds,out.width='50%',fig.cap="K-means cluster memberships are shown in multi-dimensional scaling plot"} +```{r, kmeansmds,out.width='50%',fig.cap="K-means cluster memberships are shown in a multi-dimensional scaling plot"} # Calculate distances dists=dist(t(mat)) @@ -194,30 +194,28 @@ legend("bottomright", border=NA,box.col=NA) ``` -The plot we obtained shows the separation between clusters. However, it does not do a great job showing the separation between cluster 3 and 4, which represent CML and "no leukemia" patients. We might need another dimension to properly visualize that separation. In addition, those two clusters were closely related in the hierarchical clustering as well. +The plot we obtained shows the separation between clusters. However, it does not do a great job showing the separation between clusters 3 and 4, which represent CML and "no leukemia" patients. We might need another dimension to properly visualize that separation. In addition, those two clusters were closely related in the hierarchical clustering as well. -### how to choose "k", the number of clusters +### How to choose "k", the number of clusters Up to this point, we have avoided the question of selecting optimal number clusters. How do we know where to cut our dendrogram or which k to choose ? -First of all, this is a difficult question. Usually, clusters have different granularity. 
Some clusters are tight and compact and some are wide,and both these types of clusters can be in the same data set. When visualized, some large clusters may look like they may have sub-clusters. So should we consider the large cluster as one cluster or should we consider the sub-clusters as individual clusters? There are some metrics to help but there is no definite answer. We will show a couple of them below. +First of all, this is a difficult question. Usually, clusters have different granularity. Some clusters are tight and compact and some are wide, and both these types of clusters can be in the same data set. When visualized, some large clusters may look like they may have sub-clusters. So should we consider the large cluster as one cluster or should we consider the sub-clusters as individual clusters? There are some metrics to help but there is no definite answer. We will show a couple of them below. -#### Silhouhette -One way to determine how well the clustering is to measure the expected self-similar nature of the points in a set of clusters. The silhouette value does just that and it is a measure of how similar a data point is to its own cluster compared to other clusters [@rousseeuw1987silhouettes]. The silhouette value ranges from -1 to +1, where values that are positive indicates that the data point is well matched to its own cluster, if the value is zero it is a borderline case and if the value is minus it means that the data point might be mis-clustered because it is more similar to a neighboring cluster. If most data points have a high value, then the clustering is appropriate. Ideally, one can create many different clusterings with different parameters such as $k$,number of clusters and assess their appropriateness using the average -silhouette values. In R, silhouette values are referred to as silhouette widths in the documentation.\index{silhouhette} +#### Silhouette +One way to determine the quality of the clustering is to measure the expected self-similar nature of the points in a set of clusters. The silhouette value does just that and it is a measure of how similar a data point is to its own cluster compared to other clusters [@rousseeuw1987silhouettes]. The silhouette value ranges from -1 to +1, where values that are positive indicate that the data point is well matched to its own cluster, if the value is zero it is a borderline case, and if the value is minus it means that the data point might be mis-clustered because it is more similar to a neighboring cluster. If most data points have a high value, then the clustering is appropriate. Ideally, one can create many different clusterings with each with a different $k$ parameter indicating the number of clusters, and assess their appropriateness using the average +silhouette values. In R, silhouette values are referred to as silhouette widths in the documentation.\index{silhouette} -A silhouette value is calculated for each data point. In our working example, each patient will get silhouette values showing how well they are matched to their assigned clusters. Formally this calculated as follows. For each data point $i$, we calculate ${\displaystyle a(i)}$, which denotes the average distance between $i$ and all other data points within the same cluster. This shows how well the point fits into that cluster. For the same data point, we also calculate ${\displaystyle b(i)}$ b(i) denotes the lowest average distance of ${\displaystyle i}$ to all points in any other cluster, of which ${\displaystyle i}$ is not a member. 
The cluster with this lowest average $b(i)$ is the "neighboring cluster" of data point ${\displaystyle i}$ since it is the next best fit cluster for that data point. Then, the silhouette value for a given data point is: - -$s(i) = \frac{b(i) - a(i)}{\max\{a(i),b(i)\}}$ +A silhouette value is calculated for each data point. In our working example, each patient will get silhouette values showing how well they are matched to their assigned clusters. Formally this calculated as follows. For each data point $i$, we calculate ${\displaystyle a(i)}$, which denotes the average distance between $i$ and all other data points within the same cluster. This shows how well the point fits into that cluster. For the same data point, we also calculate ${\displaystyle b(i)}$, which denotes the lowest average distance of ${\displaystyle i}$ to all points in any other cluster, of which ${\displaystyle i}$ is not a member. The cluster with this lowest average $b(i)$ is the "neighboring cluster" of data point ${\displaystyle i}$ since it is the next best fit cluster for that data point. Then, the silhouette value for a given data point is $s(i) = \frac{b(i) - a(i)}{\max\{a(i),b(i)\}}$. As described, this quantity is positive when $b(i)$ is high and $a(i)$ is low, meaning that the data point $i$ is self-similar to its cluster. And the silhouette value, $s(i)$, is negative if it is more similar to its neighbors than its assigned cluster. -In R, we can calculate silhouette values using `cluster::silhouette()` function. Below, we calculate the silhouette values for k-medoids clustering with `pam()` function with `k=5`. The resulting silhouette values are shown in Figure \@ref(fig:sill). +In R, we can calculate silhouette values using the `cluster::silhouette()` function. Below, we calculate the silhouette values for k-medoids clustering with the `pam()` function with `k=5`. The resulting silhouette values are shown in Figure \@ref(fig:sill). ```{r sill,out.width='50%',fig.cap="Silhouette values for k-medoids with `k=5`"} library(cluster) set.seed(101) pamclu=cluster::pam(t(mat),k=5) plot(silhouette(pamclu),main=NULL) ``` -Now, let us calculate average silhouette value different $k$ values and compare. We will use `sapply()` function to get average silhouette values across $k$ values between 2 and 7. Within `sapply()` there is an anonymous function that that does the clustering and calculates average silhouette values for each $k$. The plot showing average silhouette values for different $k$ values is shown in Figure \@ref(fig:sillav). +Now, let us calculate the average silhouette value for different $k$ values and compare. We will use `sapply()` function to get average silhouette values across $k$ values between 2 and 7. Within `sapply()` there is an anonymous function that that does the clustering and calculates average silhouette values for each $k$. The plot showing average silhouette values for different $k$ values is shown in Figure \@ref(fig:sillav). ```{r sillav,out.width='40%',fig.cap="Average silhouette values for k-medoids clustering for `k` values between 2 and 7"} Ks=sapply(2:7, @@ -226,18 +224,18 @@ Ks=sapply(2:7, plot(2:7,Ks,xlab="k",ylab="av. silhouette",type="b", pch=19) ``` -In this case, it seems the best value for $k$ is 4. The k-medoids function `pam()` will usually cluster CML and "no Leukemia" cases together when `k=4`, which are also related clusters according to hierarchical clustering we did earlier. +In this case, it seems the best value for $k$ is 4. 
The k-medoids function `pam()` will usually cluster CML and "no Leukemia" cases together when `k=4`, which are also related clusters according to the hierarchical clustering we did earlier. #### Gap statistic -As clustering aims to find self-similar data points, it would be reasonable to expect with the correct number of clusters the total within-cluster variation is minimized. Within-cluster variation for a single cluster can simply be defined as sum of squares from the cluster mean, which in this case is the centroid we defined in k-means algorithm. The total within-cluster variation is then sum of within-cluster variations for each cluster. This can be formally defined as follows:\index{gap statistic} +As clustering aims to find self-similar data points, it would be reasonable to expect with the correct number of clusters the total within-cluster variation is minimized. Within-cluster variation for a single cluster can simply be defined as the sum of squares from the cluster mean, which in this case is the centroid we defined in the k-means algorithm. The total within-cluster variation is then the sum of within-cluster variations for each cluster. This can be formally defined as follows:\index{gap statistic} $\displaystyle W_k = \sum_{k=1}^K \sum_{\mathrm{x}_i \in C_k} (\mathrm{x}_i - \mu_k )^2$ -Where $\mathrm{x}_i$ is data point in cluster $k$, and $\mu_k$ is the cluster mean, and $W_k$ is the total within-cluster variation quantity we described. However, the problem is that the variation quantity decreases with number of clusters. The more centroids we have, the smaller the distances to the centroids get. A more reliable approach would be somehow calculating the expected variation from a reference null distribution and compare that to the observed variation for each $k$. In gap statistic approach, the expected distribution is calculated via sampling points from the boundaries of the original data and calculating within-cluster variation quantity for multiple rounds of sampling [@tibshirani2001estimating]. This way we have an expectation about the variability when there is no clustering, and then compare that expected variation to the observed within-cluster variation. The expected variation should also go down with increasing number of clusters, but for the optimal number of clusters the expected variation will be furthest away from observed variation. This distance is called the **"gap statistic"** and defined as follows: -$\displaystyle \mathrm{Gap}_n(k) = E_n^*\{\log W_k\} - \log W_k$, where $E_n^*\{\log W_k\}$ is the expected variation in log-scale under a sample size $n$ from the reference distribution and $\log W_k$ is the observed variation. Our aim is choose the $k$, number of clusters, that maximizes $\mathrm{Gap}_n(k)$. +where $\mathrm{x}_i$ is a data point in cluster $k$, and $\mu_k$ is the cluster mean, and $W_k$ is the total within-cluster variation quantity we described. However, the problem is that the variation quantity decreases with the number of clusters. The more centroids we have, the smaller the distances to the centroids become. A more reliable approach would be somehow calculating the expected variation from a reference null distribution and compare that to the observed variation for each $k$. In the gap statistic approach, the expected distribution is calculated via sampling points from the boundaries of the original data and calculating within-cluster variation quantity for multiple rounds of sampling [@tibshirani2001estimating]. 
This way we have an expectation about the variability when there is no clustering, and then compare that expected variation to the observed within-cluster variation. The expected variation should also go down with the increasing number of clusters, but for the optimal number of clusters, the expected variation will be furthest away from observed variation. This distance is called the **"gap statistic"** and defined as follows: +$\displaystyle \mathrm{Gap}_n(k) = E_n^*\{\log W_k\} - \log W_k$, where $E_n^*\{\log W_k\}$ is the expected variation in log-scale under a sample size $n$ from the reference distribution and $\log W_k$ is the observed variation. Our aim is to choose the $k$ number of clusters that maximizes $\mathrm{Gap}_n(k)$. -We can easily calculate the gap statistic with `cluster::clusGap()` function. We will now use that function to calculate the gap statistic for our patient gene expression data. The resulting gap statistics are shown in Figure \@ref(fig:clusGap). -```{r clusGap,out.width='50%',fig.cap="Gap Statistic for clustering leukemia dataset with k-medoids (pam) algorithm"} +We can easily calculate the gap statistic with the `cluster::clusGap()` function. We will now use that function to calculate the gap statistic for our patient gene expression data. The resulting gap statistics are shown in Figure \@ref(fig:clusGap). +```{r clusGap,out.width='50%',fig.cap="Gap statistic for clustering the leukemia dataset with k-medoids (pam) algorithm."} library(cluster) set.seed(101) # define the clustering function @@ -252,13 +250,13 @@ plot(pam.gap, main = "Gap statistic for the 'Leukemia' data") ``` -In this case, gap statistic shows that $k=7$ is the best if we take the maximum value as the best. However, after $k=6$ the statistic has more or less a stable curve. This observation is Incorporated into algorithms that can select the best $k$ value based on gap statistic. A reasonable way is to take the simulation error (error bars in \@ref(fig:clusGap)) into account, and take the smallest $k$ whose gap statistic is larger or equal to the one of $k+1$ minus the simulation error. Formally written we would pick the smallest $k$ satisfying the following condition: $\mathrm{Gap}(k) \geq \mathrm{Gap}(k+1) - s_{k+1}$, where $s_{k+1}$ is the simulation error for $\mathrm{Gap}(k+1)$. +In this case, the gap statistic shows that $k=7$ is the best if we take the maximum value as the best. However, after $k=6$, the statistic has more or less a stable curve. This observation is incorporated into algorithms that can select the best $k$ value based on the gap statistic. A reasonable way is to take the simulation error (error bars in \@ref(fig:clusGap)) into account, and take the smallest $k$ whose gap statistic is larger or equal to the one of $k+1$ minus the simulation error. Formally written, we would pick the smallest $k$ satisfying the following condition: $\mathrm{Gap}(k) \geq \mathrm{Gap}(k+1) - s_{k+1}$, where $s_{k+1}$ is the simulation error for $\mathrm{Gap}(k+1)$. -Using this procedure gives us $k=6$ as the optimum number of clusters. Biologically, we know that there are 5 main patient categories but this does not mean there is no sub-categories or sub-types for the cancers we are looking at. +Using this procedure gives us $k=6$ as the optimum number of clusters. Biologically, we know that there are 5 main patient categories but this does not mean there are no sub-categories or sub-types for the cancers we are looking at. 
#### Other methods -There are several other methods that provide insight into how many clusters. In fact, the package `NbClust` provides 30 different ways to determine the number of optimal clusters and can offer a voting mechanism to pick the best number. Below, we are showing how to use this function for some of the optimal number of cluster detection methods.\index{R Packages!\texttt{NbClust}} +There are several other methods that provide insight into how many clusters. In fact, the package `NbClust` provides 30 different ways to determine the number of optimal clusters and can offer a voting mechanism to pick the best number. Below, we show how to use this function for some of the optimal number of cluster detection methods.\index{R Packages!\texttt{NbClust}} ```{r nbclustall, eval=FALSE,echo=TRUE, cache=TRUE} library(NbClust) nb = NbClust(data=t(mat), @@ -272,14 +270,14 @@ nb = NbClust(data=t(mat), table(nb$Best.nc[1,]) # consensus seems to be 3 clusters ``` -However, the readers should keep in mind that clustering is an exploratory technique. If you have solid labels for your data points maybe clustering is just a sanity check, and you should just do predictive modeling instead. However, in biology there are rarely solid labels and things have different granularity. Take the leukemia patients case we have been using for example, it is know that leukemia types have subtypes and those sub-types that have different mutation profiles and consequently have different molecular signatures. Because of this, it is not surprising that some optimal cluster number techniques will find more clusters to be appropriate. On the other hand, CML (Chronic myeloid leukemia ) is a slow progressing disease and maybe as molecular signatures goes could be the closest to no leukemia patients, clustering algorithms may confuse the two depending on what granularity they are operating with. It is always good to look at the heatmaps after clustering, if you have meaningful self-similar data points even if the labels you have do not agree that there can be different clusters you can perform downstream analysis to understand the sub-clusters better. As we have seen, we can estimate optimal number of clusters but we can not take that estimation as the absolute truth, given more data points or different set of expression signatures you may have different optimal clusterings, or the supposed optimal clustering might overlook previously known sub-groups of your data. +However, readers should keep in mind that clustering is an exploratory technique. If you have solid labels for your data points, maybe clustering is just a sanity check, and you should just do predictive modeling instead. However, in biology there are rarely solid labels and things have different granularity. Take the leukemia patients case we have been using for example, it is known that leukemia types have subtypes and those sub-types that have different mutation profiles and consequently have different molecular signatures. Because of this, it is not surprising that some optimal cluster number techniques will find more clusters to be appropriate. On the other hand, CML (chronic myeloid leukemia) is a slow progressing disease and maybe their molecular signatures are closer to "no leukemia" patients, so clustering algorithms may confuse the two depending on what granularity they are operating with. 
It is always good to look at the heatmaps after clustering, if you have meaningful self-similar data points, even if the labels you have do not agree that there can be different clusters, you can perform downstream analysis to understand the sub-clusters better. As we have seen, we can estimate the optimal number of clusters but we cannot take that estimation as the absolute truth. Given more data points or a different set of expression signatures, you may have different optimal clusterings, or the supposed optimal clustering might overlook previously known sub-groups of your data. -## Dimensionality reduction techniques: visualizing complex data sets in 2D +## Dimensionality reduction techniques: Visualizing complex data sets in 2D In statistics, dimension reduction techniques are a set of processes for reducing the number of random variables by obtaining a set of principal variables. For example, in the context of a gene expression matrix across different patient samples, this might mean getting a set of new variables that cover the variation in sets of genes. This way samples can be represented by a couple of principal variables instead of thousands of genes. This is useful for visualization, clustering and predictive modeling.\index{dimensionality reduction} ### Principal component analysis -Principal component analysis (PCA)\index{principal component analysis (PCA)} is maybe the most popular technique to examine high-dimensional data. There are multiple interpretations of how PCA reduces dimensionality. We will first focus on geometrical interpretation, where this operation can be interpreted as rotating the original dimensions of the data. For this, we go back to our example gene expression data set. In this example, we will represent our patients with expression profiles of just two genes, CD33 (ENSG00000105383) and PYGL (ENSG00000100504) genes. This way we can visualize them in a scatterplot (See Figure \@ref(fig:scatterb4PCA)). +Principal component analysis (PCA)\index{principal component analysis (PCA)} is maybe the most popular technique to examine high-dimensional data. There are multiple interpretations of how PCA reduces dimensionality. We will first focus on geometrical interpretation, where this operation can be interpreted as rotating the original dimensions of the data. For this, we go back to our example gene expression data set. In this example, we will represent our patients with expression profiles of just two genes, CD33 (ENSG00000105383) and PYGL (ENSG00000100504). This way we can visualize them in a scatter plot (see Figure \@ref(fig:scatterb4PCA)). ```{r scatterb4PCA,out.width='60%',fig.width=5.5, fig.cap="Gene expression values of CD33 and PYGL genes across leukemia patients."} plot(mat[rownames(mat)=="ENSG00000100504",], mat[rownames(mat)=="ENSG00000105383",],pch=19, @@ -287,8 +285,8 @@ plot(mat[rownames(mat)=="ENSG00000100504",], xlab="PYGL (ENSG00000100504)") ``` -PCA rotates the original data space such that the axes of the new coordinate system point into the directions of highest variance of the data. The axes or new variables are termed principal components (PCs) and are ordered by variance: The first component, PC 1, represents the direction of the highest variance of the data. The direction of the second component, PC 2, represents the highest of the remaining variance orthogonal to the first component. This can be naturally extended to obtain the required number of components which together span a component space covering the desired amount of variance. 
In our toy example with only two genes, the principal components are drawn over the original scatter plot and in the next plot we show the new coordinate system based on the principal components. We will calculate the PCA with the `princomp()` function, this function returns the new coordinates as well. These new coordinates are simply a projection of data over the new coordinates. We will decorate the scatter plots with eigenvectors showing the direction of greatest variation. Then, we will plot the new coordinates (The resulting plot is shown in Figure \@ref(fig:pcaRot)). These are automatically calculated by `princomp()` function. Notice that we are using the `scale()` function when plotting coordinates and also before calculating PCA. This function centers the data, meaning subtracts the mean of the each column vector from the elements in the vector. This essentially gives the columns a zero mean. It also divides the data by the standard deviation of the centered columns. These two operations helps bring the data to a common scale which is important for PCA not to be affected by different scales in the data. -```{r pcaRot,out.width='60%',fig.width=8.5,fig.cap="Geometric interpretation of PCA finding eigenvectors that point to direction of highest variance. Eigenvectors can be used as a new coordinate system."} +PCA rotates the original data space such that the axes of the new coordinate system point to the directions of highest variance of the data. The axes or new variables are termed principal components (PCs) and are ordered by variance: The first component, PC 1, represents the direction of the highest variance of the data. The direction of the second component, PC 2, represents the highest of the remaining variance orthogonal to the first component. This can be naturally extended to obtain the required number of components, which together span a component space covering the desired amount of variance. In our toy example with only two genes, the principal components are drawn over the original scatter plot and in the next plot we show the new coordinate system based on the principal components. We will calculate the PCA with the `princomp()` function; this function returns the new coordinates as well. These new coordinates are simply a projection of data over the new coordinates. We will decorate the scatter plots with eigenvectors showing the direction of greatest variation. Then, we will plot the new coordinates (the resulting plot is shown in Figure \@ref(fig:pcaRot)). These are automatically calculated by the `princomp()` function. Notice that we are using the `scale()` function when plotting coordinates and also before calculating the PCA. This function centers the data, meaning it subtracts the mean of each column vector from the elements in the vector. This essentially gives the columns a zero mean. It also divides the data by the standard deviation of the centered columns. These two operations help bring the data to a common scale, which is important for PCA not to be affected by different scales in the data. +```{r pcaRot,out.width='60%',fig.width=8.5,fig.cap="Geometric interpretation of PCA finding eigenvectors that point to the direction of highest variance. 
Eigenvectors can be used as a new coordinate system."}
par(mfrow=c(1,2))
# create the subset of the data with two genes only
@@ -336,9 +334,9 @@ arrows(x0=0, y0=0, x1 = 0,
```
-As you can see, the new coordinate system is useful by itself.The X-axis which represents the first component separates the data along the lympoblastic and myeloid leukemias.\index{principal component analysis (PCA)}
+As you can see, the new coordinate system is useful by itself. The X-axis, which represents the first component, separates the data along the lymphoblastic and myeloid leukemias.\index{principal component analysis (PCA)}
-PCA in this case is obtained by calculating eigenvectors of the covariance matrix via an operation called eigen decomposition. Covariance matrix is obtained by covariance of pairwise variables of our expression matrix, which is simply ${ \operatorname{cov} (X,Y)={\frac {1}{n}}\sum _{i=1}^{n}(x_{i}-\mu_X)(y_{i}-\mu_Y)}$, where $X$ and $Y$ expression values of genes in a sample in our example. This is a measure of how things vary together, if high expressed genes in sample A are also highly expressed in sample B and lowly expressed in sample A are also lowly expressed in sample B, then sample A and B will have positive covariance. If the opposite is true then they will have negative covariance. This quantity is related to correlation and in fact correlation is standardized covariance. Covariance of variables can be obtained with `cov()` function, and eigen decomposition of such a matrix will produce a set of orthogonal vectors that span the directions of highest variation. In 2D, you can think of this operation as rotating two perpendicular lines together until they point to the directions where most of the variation in the data lies on, similar to the figure \@ref(fig:pcaRot).
+PCA, in this case, is obtained by calculating eigenvectors of the covariance matrix via an operation called eigen decomposition. The covariance matrix is obtained by covariance of pairwise variables of our expression matrix, which is simply ${ \operatorname{cov} (X,Y)={\frac {1}{n}}\sum _{i=1}^{n}(x_{i}-\mu_X)(y_{i}-\mu_Y)}$, where $X$ and $Y$ are expression values of genes in a sample in our example. This is a measure of how things vary together: if highly expressed genes in sample A are also highly expressed in sample B and lowly expressed genes in sample A are also lowly expressed in sample B, then samples A and B will have positive covariance. If the opposite is true, then they will have negative covariance. This quantity is related to correlation, and as we saw in the previous chapter, correlation is standardized covariance. Covariance of variables can be obtained with the `cov()` function, and eigen decomposition of such a matrix will produce a set of orthogonal vectors that span the directions of highest variation. In 2D, you can think of this operation as rotating two perpendicular lines together until they point to the directions where most of the variation in the data lies, similar to Figure \@ref(fig:pcaRot).
An important intuition is that, after the rotation prescribed by eigenvectors is complete, the covariance between variables in this rotated dataset will be zero. There is a proper mathematical relationship between covariances of the rotated dataset and the original dataset. That's why operating on the covariance matrix is related to the rotation of the original dataset.
```{r eigenOnCovMat,eval=FALSE}
cov.mat=cov(sub.mat) # calculate covariance matrix
cov.mat
eigen(cov.mat) # obtain eigen decomposition for eigen values and vectors
```
@@ -346,18 +344,18 @@
-Eigenvectors and eigenvalues of the covariance matrix indicates the direction and the magnitute of variation of the data. In our visual example the eigenvectors are so-called principal components. The eigenvector indicates the direction and the eigen values indicate the variation in that direction. Eigenvectors and values exist in pairs: every eigenvector has a corresponding eigenvalue and the eigenvectors are linearly independent from each other, this means they are orthogonal or uncorrelated in the our working example above. The eigenvectors are ranked by their corresponding eigen value, the higher the eigen value the more important the eigenvector is, because it explains more of the variation compared to the other eigenvectors. This feature of PCA makes the dimension reduction possible. We can sometimes display data sets that have many variables only in 2D or 3D because the these top eigenvectors are sometimes enough to capture most of variation in the data. `screeplot()` function takes the output of `princomp()` or `prcomp()` functions as input and plots the variance explained by eigen vectors.
+Eigenvectors and eigenvalues of the covariance matrix indicate the direction and the magnitude of variation of the data. In our visual example, the eigenvectors are so-called principal components. The eigenvector indicates the direction and the eigenvalues indicate the variation in that direction. Eigenvectors and values exist in pairs: every eigenvector has a corresponding eigenvalue and the eigenvectors are linearly independent from each other, which means they are orthogonal or uncorrelated as in our working example above. The eigenvectors are ranked by their corresponding eigenvalue; the higher the eigenvalue, the more important the eigenvector, because it explains more of the variation compared to the other eigenvectors. This feature of PCA makes the dimension reduction possible. We can sometimes display data sets that have many variables only in 2D or 3D because these top eigenvectors are sometimes enough to capture most of the variation in the data. The `screeplot()` function takes the output of the `princomp()` or `prcomp()` functions as input and plots the variance explained by eigenvectors.
#### Singular value decomposition and principal component analysis
-A more common way to calculate PCA is through something called singular value decomposition (SVD). \index{singular value decomposition (SVD)}This results in another interpretation of PCA, which is called "latent factor" or "latent component" interpretation. In a moment, it \index{principal component analysis (PCA)} will be more clear what we mean by "latent factors". SVD is a matrix factorization or decomposition algorithm that decomposes an input matrix,$X$, to three matrices as follows: $\displaystyle \mathrm{X} = USV^T$. In essence many matrices can be decomposed as a product of multiple matrices and we will come to other techniques later in this chapter.
Singular Value Decomposition is shown in figure \@ref(fig:SVDcartoon). $U$ is the matrix with eigenarrays on the columns and this has the same dimensions as the input matrix, you might see elsewhere the columns are named as eigenassays. $S$ is the matrix that contain the singular values on the diagonal. The singular values are also known as eigenvalues and their square is proportional to explained variation by each eigenvector. Finally, the matrix $V^T$ contains the eigenvectors on its rows. It is interpretation is still the same. Geometrically, eigenvectors point to the direction of highest variance in the data. They are uncorrolated or geometrically orthogonal to each other. These interpretations are identical to the ones we made before. The slight difference is that the decomposition seem to output $V^T$ which is just the transpose of the matrix $V$. However, the SVD algorithms in R usually return the matrix $V$. If you want the eigenvectors, you either simply use the columns of matrix $V$ or rows of $V^T$.
-```{r SVDcartoon,echo=FALSE,fig.align='center',out.width='60%',fig.cap="Singular Value Decomposition (SVD) explained in a diagram. "}
+A more common way to calculate PCA is through something called singular value decomposition (SVD). \index{singular value decomposition (SVD)}This results in another interpretation of PCA, which is called "latent factor" or "latent component" interpretation. In a moment, it \index{principal component analysis (PCA)} will be clearer what we mean by "latent factors". SVD is a matrix factorization or decomposition algorithm that decomposes an input matrix, $X$, into three matrices as follows: $\displaystyle \mathrm{X} = USV^T$. In essence, many matrices can be decomposed as a product of multiple matrices and we will come to other techniques later in this chapter. Singular value decomposition is shown in Figure \@ref(fig:SVDcartoon). $U$ is the matrix with eigenarrays on the columns and this has the same dimensions as the input matrix; you might see elsewhere the columns are called eigenassays. $S$ is the matrix that contains the singular values on the diagonal. The singular values are also known as eigenvalues and their square is proportional to the variation explained by each eigenvector. Finally, the matrix $V^T$ contains the eigenvectors on its rows. Its interpretation is still the same. Geometrically, eigenvectors point to the direction of highest variance in the data. They are uncorrelated or geometrically orthogonal to each other. These interpretations are identical to the ones we made before. The slight difference is that the decomposition seems to output $V^T$, which is just the transpose of the matrix $V$. However, the SVD algorithms in R usually return the matrix $V$. If you want the eigenvectors, you either simply use the columns of matrix $V$ or rows of $V^T$.
+```{r SVDcartoon,echo=FALSE,fig.align='center',out.width='60%',fig.cap="Singular value decomposition (SVD) explained in a diagram. "}
knitr::include_graphics("images/SVDcartoon.png")
```
-One thing that is new in the figure \@ref(fig:SVDcartoon) is the concept of eigenarrays. The eigenarrays or sometimes called eigenassays represent the sample space and can be used to plot the relationship between samples rather than genes. In this way, SVD offers additional information than the PCA using the covariance matrix. It offers us a way to summarize both genes and samples.
As we can project the gene expression profiles over the top two eigengenes and get a 2D representation of genes, but with SVD we can also project the samples over the the top two eigenarrays and get a representation of samples in 2D scatterplot. Eigenvector could represent independent expression programs across samples, such as cell-cycle if we had time-based expression profiles. However, there is no guarantee that each eigenvector will be biologically meaningful. Similarly each eigenarray represent samples with specific expression characteristics. For example, the samples that have a particular pathway activated might be correlated to an eigenarray returned by SVD.
+One thing that is new in Figure \@ref(fig:SVDcartoon) is the concept of eigenarrays. The eigenarrays, sometimes called eigenassays, represent the sample space and can be used to plot the relationship between samples rather than genes. In this way, SVD offers additional information compared to PCA using the covariance matrix. It offers us a way to summarize both genes and samples. We can project the gene expression profiles over the top two eigengenes and get a 2D representation of genes, but with SVD we can also project the samples over the top two eigenarrays and get a representation of samples in a 2D scatter plot. The eigenvectors could represent independent expression programs across samples, such as the cell cycle, if we had time-based expression profiles. However, there is no guarantee that each eigenvector will be biologically meaningful. Similarly, each eigenarray represents samples with specific expression characteristics. For example, the samples that have a particular pathway activated might be correlated to an eigenarray returned by SVD.
-Previously, in order to map samples to the reduced 2D space we had to transpose the genes-by-samples matrix when using `princomp()` function. We will now first use SVD on genes-by-samples matrix to get eigenarrays and use that to plot samples on the reduced dimensions. We will project the columns in our original expression data on eigenarrays and use the first two dimensions in the scatter plot. If you look at the code you will see that for the projection we use $U^T X$ operation, which is just $V^T$ if you follow the linear algebra. We will also perform the PCA this time with `prcomp()` function on the transposed genes-by-samples matrix to get a similar information, and plot the samples on the reduced coordinates.
+Previously, in order to map samples to the reduced 2D space we had to transpose the genes-by-samples matrix before using the `princomp()` function. We will now first use SVD on the genes-by-samples matrix to get eigenarrays and use that to plot samples on the reduced dimensions. We will project the columns in our original expression data on eigenarrays and use the first two dimensions in the scatter plot. If you look at the code you will see that for the projection we use the $U^T X$ operation, which is just $S V^T$ if you follow the linear algebra (a quick numerical check of this identity is sketched below). We will also perform the PCA this time with the `prcomp()` function on the transposed genes-by-samples matrix to get similar information, and plot the samples on the reduced coordinates.
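+The following is a minimal sketch, not part of the original text, of how the identity $U^TX = SV^T$ mentioned above can be verified numerically. It assumes the expression matrix `mat` used throughout this chapter is still in memory; the chunk name `svdProjectionCheck` is our own.
+```{r svdProjectionCheck,eval=FALSE}
+d=svd(scale(mat)) # SVD of the scaled genes-by-samples matrix
+proj1=t(d$u) %*% scale(mat) # projection of the data on the eigenarrays, U^T X
+proj2=diag(d$d) %*% t(d$v)  # S V^T
+# the two projections should agree up to numerical error
+all.equal(proj1, proj2, check.attributes=FALSE)
+```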
-```{r svd,out.width='65%',fig.width=8.5,fig.cap="SVD on matrix and its transpose"}
+```{r svd,out.width='65%',fig.width=8.5,fig.cap="SVD on the matrix and its transpose"}
par(mfrow=c(1,2))
d=svd(scale(mat)) # apply SVD
assays=t(d$u) %*% scale(mat) # projection on eigenassays
@@ -372,32 +370,32 @@
pr=prcomp(t(mat),center=TRUE,scale=TRUE) # apply PCA on transposed matrix
plot(pr$x[,1],pr$x[,2],col=as.factor(annotation_col$LeukemiaType))
```
-As you can see in the figure \@ref(fig:svd), the two approaches yield separation of samples, although they are slightly different. The difference comes from the centering and scaling. In the first case, we scale and center columns and the second case we scale and center rows since the matrix is transposed. If we do not do any scaling or centering we would get identical plots.
+As you can see in Figure \@ref(fig:svd), the two approaches yield separation of samples, although they are slightly different. The difference comes from the centering and scaling. In the first case, we scale and center columns and in the second case we scale and center rows since the matrix is transposed. If we do not do any scaling or centering we would get identical plots.
##### Eigenvectors as latent factors/variables
-Finally, we can introduce the latent factor interpretation of PCA via SVD. As we have already mentioned eigenvectors can also be interpreted as expression programs that are shared by several genes such as cell cycle expression program when measuring gene expression across samples taken in different time points. In this interpretation, linear combination of expression programs makes up the expression profile of the genes. Linear combination simply means multiplying the expression program with a weight and adding them up. Our $USV^T$ matrix multiplication can be rearranged to yield such an understanding, we can multiply eigenarrays $U$ with the diagonal eigenvalues $S$, to produce a m-by-n weights matrix called $W$, so $W=US$ and we can re-write the equation as just weights by eigenvectors matrix, $X=WV^T$ as shown in figure \@ref(fig:SVDasWeigths).
-```{r SVDasWeigths,echo=FALSE,out.width='70%',fig.cap="Singular Value Decomposition (SVD) reorgonized as multiplication of m-by-n weights matrix and eigenvectors "}
+Finally, we can introduce the latent factor interpretation of PCA via SVD. As we have already mentioned, eigenvectors can also be interpreted as expression programs that are shared by several genes, such as a cell cycle expression program when measuring gene expression across samples taken at different time points. In this interpretation, a linear combination of expression programs makes up the expression profile of the genes. A linear combination simply means multiplying each expression program by a weight and adding them up. Our $USV^T$ matrix multiplication can be rearranged to yield such an understanding: we can multiply the eigenarrays $U$ with the diagonal eigenvalues $S$ to produce an m-by-n weights matrix called $W$, so $W=US$, and we can re-write the equation as just the weights matrix times the eigenvectors matrix, $X=WV^T$, as shown in Figure \@ref(fig:SVDasWeigths).
+```{r SVDasWeigths,echo=FALSE,out.width='70%',fig.cap="Singular value decomposition (SVD) reorganized as multiplication of m-by-n weights matrix and eigenvectors "}
knitr::include_graphics("images/SVDasWeights.png")
```
-This simple transformation now makes it clear that indeed if eigenvectors are representing expression programs, their linear combination is making up individual gene expression profiles.
As an example, we can show the liner combination of the first two eigenvectors can approximate the expression profile of an hypothetical gene in the gene expression matrix. The figure \@ref(fig:SVDlatentExample) shows eigenvector 1 and eigenvector 2 combined with certain weights in $W$ matrix can approximate gene expression pattern our example gene.
-```{r SVDlatentExample,echo=FALSE,fig.cap="Gene expression of a gene can be thought as linear combination of eigenvectors. "}
+This simple transformation now makes it clear that indeed, if eigenvectors represent expression programs, their linear combination makes up individual gene expression profiles. As an example, we can show that the linear combination of the first two eigenvectors can approximate the expression profile of a hypothetical gene in the gene expression matrix. Figure \@ref(fig:SVDlatentExample) shows how eigenvector 1 and eigenvector 2, combined with certain weights in the $W$ matrix, can approximate the gene expression pattern of our example gene.
+```{r SVDlatentExample,echo=FALSE,fig.cap="Gene expression of a gene can be regarded as a linear combination of eigenvectors. "}
knitr::include_graphics("images/SVDlatentExample.png")
```
However, SVD does not care about biology. The eigenvectors are just obtained from the data with constraints of orthogonality and the direction of variation. There are examples of eigenvectors representing
-real expression programs but that does not mean eigenvectors will always be biologically meaningful. Sometimes combination of them might make more sense in biology than single eigenvectors. This is also the same for the other matrix factorization techniques we describe below.
+real expression programs but that does not mean eigenvectors will always be biologically meaningful. Sometimes a combination of them might make more sense in biology than single eigenvectors. This is also the same for the other matrix factorization techniques we describe below.
### Other matrix factorization methods for dimensionality reduction
-We must mention a few other techniques that are similar to SVD in spirit. Remember we mentioned that every matrix can be decomposed to other matrices where matrix multiplication operations reconstruct the original matrix, which is in general called "matrix factorization"\index{matrix factorization}. In the case of SVD/PCA, the constraint is that eigenvectors/arrays are orthogonal, however there are other decomposition algorithms with other constraints.
+We must mention a few other techniques that are similar to SVD in spirit. Remember, we mentioned that every matrix can be decomposed into other matrices where matrix multiplication operations reconstruct the original matrix, which is in general called "matrix factorization"\index{matrix factorization}. In the case of SVD/PCA, the constraint is that eigenvectors/arrays are orthogonal; however, there are other decomposition algorithms with other constraints.
#### Independent component analysis (ICA)
-We will first start with independent component analysis (ICA)\index{Independent component analysis} which is an extension of PCA. ICA algorithm decomposes a given matrix $X$ as follows: $X=SA$ [@hyvarinen2013independent]. The rows of $A$ could be interpreted similar to the eigengenes and columns of $S$ could be interpreted as eigenarrays, these components are sometimes called metagenes and metasamples in the literature. Traditionally, $S$ is called source matrix and $A$ is called mixing matrix. ICA is developed for a problem called "blind-source separation".
In this problem, multiple microphones record sound from multiple instruments, and the task is to disentangle sounds from original instruments since each microphone is recording a combination of sounds. In this respect, the matrix $S$ contains the original signals (sounds from different instruments) and their linear combinations identified by the weights in $A$, and the product of $A$ and $S$ makes up the matrix $X$, which is the observed signal from different microphones. With this interpretation in mind, if the interest is strictly expression patterns similar that represent the hidden expression programs we see that genes-by-samples matrix is transposed to a samples-by-genes matrix, so that the columns of $S$ represent these expression patterns , here referred to as "metagenes", hopefully representing distinct expression programs (Figure \@ref(fig:ICAcartoon) ). \index{independent component analyis (ICA)}
+We will first start with independent component analysis (ICA)\index{Independent component analysis}, which is an extension of PCA. The ICA algorithm decomposes a given matrix $X$ as follows: $X=SA$ [@hyvarinen2013independent]. The rows of $A$ could be interpreted similarly to the eigengenes and the columns of $S$ could be interpreted as eigenarrays. These components are sometimes called metagenes and metasamples in the literature. Traditionally, $S$ is called the source matrix and $A$ is called the mixing matrix. ICA was developed for a problem called "blind-source separation". In this problem, multiple microphones record sound from multiple instruments, and the task is to disentangle sounds from original instruments since each microphone is recording a combination of sounds. In this respect, the matrix $S$ contains the original signals (sounds from different instruments) and their linear combinations identified by the weights in $A$, and the product of $A$ and $S$ makes up the matrix $X$, which is the observed signal from different microphones. With this interpretation in mind, if the interest is strictly expression patterns that represent the hidden expression programs, we see that the genes-by-samples matrix is transposed to a samples-by-genes matrix, so that the columns of $S$ represent these expression patterns, here referred to as "metagenes", hopefully representing distinct expression programs (Figure \@ref(fig:ICAcartoon)). \index{independent component analysis (ICA)}
```{r ICAcartoon,echo=FALSE,fig.cap="Independent Component Analysis (ICA)"}
knitr::include_graphics("images/ICAcartoon.png")
```
-ICA requires that the columns of $S$ matrix, the "metagenes" in our example above to be statistical independent. This is a stronger constraint than uncorrelatedness. In this case, there should be no relationship between non-linear transformation of the data either. There are different ways of ensuring this statistical indepedence and this is the main constraint when finding the optimal $A$ and $S$ matrices. The various ICA algorithms use different proxies for statistical independence, and the definition of that proxy is the main difference between many ICA algorithms. The algorithm we are going to use requires that metagenes or sources in the $S$ matrix are non-gaussian (non-normal) as possible. Non-gaussianity is shown to be related to statistical independence [@hyvarinen2013independent]. Below, we are using `fastICA::fastICA()` function to extract 2 components and plot the rows of matrix $A$ which represents metagenes shown in Figure \@ref(fig:fastICAex).
This way, we can visualize samples in a 2D plot. If we wanted to plot the relationship between genes we would use the the columns of matrix $S$.
+ICA requires that the columns of the $S$ matrix, the "metagenes" in our example above, are statistically independent. This is a stronger constraint than uncorrelatedness. In this case, there should be no relationship between non-linear transformations of the data either. There are different ways of ensuring this statistical independence and this is the main constraint when finding the optimal $A$ and $S$ matrices. The various ICA algorithms use different proxies for statistical independence, and the definition of that proxy is the main difference between many ICA algorithms. The algorithm we are going to use requires that metagenes or sources in the $S$ matrix are as non-Gaussian (non-normal) as possible. Non-Gaussianity is shown to be related to statistical independence [@hyvarinen2013independent]. Below, we are using the `fastICA::fastICA()` function to extract 2 components and plot the rows of matrix $A$, which represent the metagenes, as shown in Figure \@ref(fig:fastICAex). This way, we can visualize samples in a 2D plot. If we wanted to plot the relationship between genes we would use the columns of matrix $S$.
```{r fastICAex, out.width='50%',fig.width=5,fig.cap="Leukemia gene expression values per patient on reduced dimensions by ICA."}
library(fastICA)
ica.res=fastICA(t(mat),n.comp=2) # apply ICA
plot(ica.res$S[,1],ica.res$S[,2],col=as.factor(annotation_col$LeukemiaType))
@@ -409,20 +407,20 @@ #### Non-negative matrix factorization (NMF)
Non-negative matrix factorization
-\index{non-negative matrix factorization (NMF)}algorithms are series of algorithms that aim to decompose the matrix $X$ into the product or matrices $W$ and $H$, $X=WH$ (Figure \@ref(fig:NMFcartoon)) [@lee2001algorithms]. The constraint is that $W$ and $H$ must contain non-negative values, so must $X$. This is well suited for data sets that can not contain negative values such as gene expression. This also implies additivity of components, in our example expression of a gene across samples are addition of multiple metagenes. Unlike ICA and SVD/PCA, the metagenes can never be combined in subtractive way. In this sense, expression programs potentially captured by metagenes are combined additively.
+\index{non-negative matrix factorization (NMF)}algorithms are a series of algorithms that aim to decompose the matrix $X$ into the product of matrices $W$ and $H$, $X=WH$ (Figure \@ref(fig:NMFcartoon)) [@lee2001algorithms]. The constraint is that $W$ and $H$ must contain non-negative values, and so must $X$. This is well suited for data sets that cannot contain negative values such as gene expression. This also implies additivity of components or latent factors. This is in line with the idea that the expression pattern of a gene across samples is the weighted sum of multiple metagenes. Unlike ICA and SVD/PCA, the metagenes can never be combined in a subtractive way. In this sense, expression programs potentially captured by metagenes are combined additively.
```{r NMFcartoon,echo=FALSE,fig.cap="Non-negative matrix factorization summary",out.width='70%'}
knitr::include_graphics("images/NMFcartoon.png")
```
-The algorithms that compute NMF tries to minimize the cost function $D(X,WH)$, which is the distance between $X$ and $WH$.
The early algorithms just use the euclidean distance which translates to $\sum(X-WH)^2$, this is also known as Frobenious norm and you will see in the literature it is written as :$\||V-WH||_{F}$
-However this is not the only distance metric, other distance metrics are also used in NMF algorithms. In addition, there could be other parameters to optimize that relates to sparseness of the $W$ and $H$ matrices. With sparse $W$ and $H$, each entry in the $X$ matrix is expressed as the sum of a small number of components. This makes the interpretation easier, if the weights are 0 than there is not contribution from the corresponding factors.
+The algorithms that compute NMF try to minimize the cost function $D(X,WH)$, which is the distance between $X$ and $WH$. The early algorithms just use the Euclidean distance, which translates to $\sum(X-WH)^2$; this is also known as the Frobenius norm and you will see in the literature it is written as $\|X-WH\|_{F}$.
+However, this is not the only distance metric; other distance metrics are also used in NMF algorithms. In addition, there could be other parameters to optimize that relate to the sparseness of the $W$ and $H$ matrices. With sparse $W$ and $H$, each entry in the $X$ matrix is expressed as the sum of a small number of components. This makes the interpretation easier: if the weights are $0$, then there is no contribution from the corresponding factors.
-Below, we are plotting the values of metagenes (rows of $H$) for component 1 and 3, shown in Figure \@ref(fig:nmfCode). In this context, these values can also be interpreted as relationship between samples. If we wanted to plot the relationship between genes we would plot the columns of $W$ matrix.
+Below, we are plotting the values of metagenes (rows of $H$) for components 1 and 3, shown in Figure \@ref(fig:nmfCode). In this context, these values can also be interpreted as the relationship between samples. If we wanted to plot the relationship between genes we would plot the columns of the $W$ matrix.
```{r nmfCode,out.width='60%',fig.width=5,fig.cap="Leukemia gene expression values per patient on reduced dimensions by NMF. Components 1 and 3 are used for the plot."}
library(NMF)
-res=nmf(mat,rank=3,seed="nndsvd") # nmf with 3 components/factors
+res=NMF::nmf(mat,rank=3,seed="nndsvd") # nmf with 3 components/factors
w <- basis(res) # get W
h <- coef(res) # get H
plot(h[1,],h[3,],col=as.factor(annotation_col$LeukemiaType),pch=19)
```
@@ -431,42 +429,42 @@
-We should add the note that due to random starting points of the optimization algorithm, NMF is usually run multiple times and a consensus clustering approach is used when clustering samples. This simply means that samples are clustered together if they cluster together in multiple runs of the NMF. The NMF package we used above has built-in ways to achieve this. In addition, NMF is a family of algorithms the choice of cost function to optimize the difference between $X$ and $WH$ and the methods used for optimization creates multiple variants of NMF. The "method" parameter in the above `nmf()` function controls the which algorithm for NMF. \index{R Packages!\texttt{NMF}}
+We should add the note that, due to random starting points of the optimization algorithm, NMF is usually run multiple times and a consensus clustering approach is used when clustering samples. This simply means that samples are clustered together if they cluster together in multiple runs of the NMF. The NMF package we used above has built-in ways to achieve this.
In addition, NMF is a family of algorithms. The choice of cost function to optimize the difference between $X$ and $WH$, and the methods used for optimization create multiple variants of NMF. The "method" parameter in the above `nmf()` function controls the algorithm choice for NMF. \index{R Packages!\texttt{NMF}}
-#### chosing the number of components and ranking components in importance
-In both ICA and NMF, there is no well-defined way to rank components or to select the number of components. There are couple of approaches that might suit to both ICA and NMF for ranking components. One can use the norms of columns/rows in mixing matrices. This could simply mean take the sum of absolute values in mixing matrices. In our examples above, For our ICA example above, ICA we would take the sum of the absolute values of the rows of $A$ since we transposed the input matrix $X$ before ICA. And for the NMF, we would use the columns of $W$. These ideas assume that the larger coefficients in the weight or mixing matrices indicate more important components.
+#### Choosing the number of components and ranking components in importance
+In both ICA and NMF, there is no well-defined way to rank components or to select the number of components. There are a couple of approaches that might suit both ICA and NMF for ranking components. One can use the norms of columns/rows in mixing matrices. This could simply mean taking the sum of absolute values in the mixing matrices. For our ICA example above, we would take the sum of the absolute values of the rows of $A$ since we transposed the input matrix $X$ before ICA. And for the NMF, we would use the columns of $W$. These ideas assume that the larger coefficients in the weight or mixing matrices indicate more important components.
-For selecting the optimal number of components, NMF package provides different strategies. One way is to calculate RSS for each $k$, number of components, and take the $k$ where the RSS curve starts to stabilize.However, these strategies require that you run the algorithm with multiple possible component numbers. `nmf` function will run these automatically when the `rank` argument is a vector of numbers. For ICA there is no straightforward way to choose the right number of components, a common strategy is to start with as many components as variables and try to rank them by their usefulness.
+For selecting the optimal number of components, the NMF package provides different strategies. One way is to calculate the RSS for each $k$, the number of components, and take the $k$ where the RSS curve starts to stabilize. However, these strategies require that you run the algorithm with multiple possible component numbers. The `nmf` function will run these automatically when the `rank` argument is a vector of numbers (see the sketch below). For ICA there is no straightforward way to choose the right number of components. A common strategy is to start with as many components as variables and try to rank them by their usefulness.
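+As a hedged illustration that is not part of the original text, the rank survey just described might look like the following minimal sketch; it assumes the `NMF` package and the `mat` matrix used above, and the chunk name `nmfRankSurvey` is our own.
+```{r nmfRankSurvey,eval=FALSE}
+library(NMF)
+# run NMF for a range of ranks; nrun repeats each factorization from
+# different random starts so the quality measures are more stable
+estim=NMF::nmf(mat,rank=2:6,nrun=10,seed=123)
+plot(estim) # shows RSS and other quality measures against the rank
+```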
```{block2, nmfica, type='rmdtip'}
__Want to know more?__
-NMF package vignette has extensive information on how to run NMF to get stable resuts and getting an estimate of components https://cran.r-project.org/web/packages/NMF/vignettes/NMF-vignette.pdf
+The NMF package vignette has extensive information on how to run NMF to get stable results and an estimate of components: https://cran.r-project.org/web/packages/NMF/vignettes/NMF-vignette.pdf
```
### Multi-dimensional scaling
-MDS is a set of data analysis techniques that display the structure of distance data in a high dimensional space into a lower dimensional space without much loss of information [@cox2000multidimensional]. The overall goal of MDS is to faithfully represent these distances with the lowest possible dimensions. So called "classical multi-dimensional scaling" algorithm, tries to minimize the following function:\index{Multi-dimensional scaling (MDS)}
+MDS is a set of data analysis techniques that display the structure of distance data from a high-dimensional space in a lower-dimensional space without much loss of information [@cox2000multidimensional]. The overall goal of MDS is to faithfully represent these distances with the lowest possible dimensions. The so-called "classical multi-dimensional scaling" algorithm tries to minimize the following function:\index{Multi-dimensional scaling (MDS)}
${\displaystyle Stress_{D}(z_{1},z_{2},...,z_{N})={\Biggl (}{\frac {\sum _{i,j}{\bigl (}d_{ij}-\|z_{i}-z_{j}\|{\bigr )}^{2}}{\sum _{i,j}d_{ij}^{2}}}{\Biggr )}^{1/2}}$
-Here the function compares the new data points on lower dimension $(z_{1},z_{2},...,z_{N})$ to the input distances between data points or distance between samples in our gene expression example. It turns out, this problem can be efficiently solved with SVD/PCA on the scaled distance matrix, the projection on eigenvectors will be the most optimal solution for the equation above. Therefore, classical MDS is sometimes called Principal Coordinates Analysis in the literature. However, later variants improve on classical MDS this by using this as a starting point and optimize a slightly different cost function that again measures how well the low-dimensional distances correspond to high-dimensional distances. This variant is called non-metric MDS and due to the nature of the cost function, it assumes a less stringent relationship between the low-dimensional distances $\|z_{i}-z_{j}\| and input distances $d_{ij}$. Formally, this procedure tries to optimize the following function.
+Here the function compares the new data points on the lower dimension $(z_{1},z_{2},...,z_{N})$ to the input distances between data points, or the distances between samples in our gene expression example. It turns out that this problem can be efficiently solved with SVD/PCA on the scaled distance matrix; the projection on eigenvectors will be the optimal solution for the equation above. Therefore, classical MDS is sometimes called Principal Coordinates Analysis in the literature. However, later variants improve on classical MDS by using this as a starting point and optimize a slightly different cost function that again measures how well the low-dimensional distances correspond to high-dimensional distances. This variant is called non-metric MDS and due to the nature of the cost function, it assumes a less stringent relationship between the low-dimensional distances $\|z_{i}-z_{j}\|$ and input distances $d_{ij}$. Formally, this procedure tries to optimize the following function.
${\displaystyle Stress_{D}(z_{1},z_{2},...,z_{N})={\Biggl (}{\frac {\sum _{i,j}{\bigl (}\|z_{i}-z_{j}\|-\theta(d_{ij}){\bigr )}^{2}}{\sum _{i,j}\|z_{i}-z_{j}\|^{2}}}{\Biggr )}^{1/2}}$
-The core of a non-metric MDS algorithm is a twofold optimization process. First the optimal monotonic transformation of the distances has to be found, this is shown in the above formula as $\theta(d_{ij})$. Secondly, the points on a low dimension configuration have to be optimally arranged, so that their distances match the scaled distances as closely as possible. This two steps are repeated until some convergence criteria is reached. This usually means that the cost function does not improve much after certain number of iterations. The basic steps in a non-metric MDS algorithm are:
+The core of a non-metric MDS algorithm is a two-fold optimization process. First, the optimal monotonic transformation of the distances has to be found, which is shown in the above formula as $\theta(d_{ij})$. Secondly, the points on a low dimension configuration have to be optimally arranged, so that their distances match the scaled distances as closely as possible. These two steps are repeated until some convergence criterion is reached. This usually means that the cost function does not improve much after a certain number of iterations. The basic steps in a non-metric MDS algorithm are:
-1) Find a random low dimensional configuration of points, or in the variant we will be using below we start with the configuration returned by classical MDS
-2) Calculate the distances between the points in the low dimension $\|z_{i}-z_{j}\|$, $z_{i}$ and $z_{j}$ are vector of positions for sample $i$ and $j$.
+1) Find a random low-dimensional configuration of points, or in the variant we will be using below we start with the configuration returned by classical MDS.
+2) Calculate the distances between the points in the low dimension, $\|z_{i}-z_{j}\|$, where $z_{i}$ and $z_{j}$ are the vectors of positions for samples $i$ and $j$.
3) Find the optimal monotonic transformation of the input distance, ${\textstyle \theta(d_{ij})}$, to approximate input distances to low-dimensional distances. This is achieved by isotonic regression, where a monotonically increasing free-form function is fit. This step practically ensures that the ranking of low-dimensional distances is similar to the ranking of input distances.
4) Minimize the stress function by re-configuring the low-dimensional space and keeping the $\theta$ function constant.
-5) repeat from step 2 until convergence.
+5) Repeat from Step 2 until convergence.
We will now demonstrate both classical MDS and Kruskal's isometric MDS.
```{r mds2,out.width='60%',fig.width=8.5,fig.cap="Leukemia gene expression values per patient on reduced dimensions by classical MDS and isometric MDS."}
@@ -482,37 +480,37 @@ plot(isomds$points,pch=19,col=as.factor(annotation_col$LeukemiaType),
     main="isotonic MDS")
```
-The resulting plot is shown in Figure \@ref(fig:mds2). In this example, there is not much difference between isotonic MDS and classical MDS. However, there might be cases where different MDS methods provides visible changes in the scatter plots.
+The resulting plot is shown in Figure \@ref(fig:mds2). In this example, there is not much difference between isotonic MDS and classical MDS. However, there might be cases where different MDS methods provide visible changes in the scatter plots.
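+As a hedged aside that is not part of the original text, the stress value returned by `isoMDS()` gives a rough sense of how much information is lost at a given number of dimensions; the minimal sketch below assumes the `MASS` package and the `mat` matrix used above, and the chunk name `mdsStress` is our own.
+```{r mdsStress,eval=FALSE}
+library(MASS)
+d=dist(t(mat)) # distances between samples
+# stress (in percent) reported by isotonic MDS for 2, 3 and 4 dimensions;
+# lower stress means the low-dimensional distances fit the input distances better
+sapply(2:4, function(k) isoMDS(d, k=k, trace=FALSE)$stress)
+```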
### t-Distributed Stochastic Neighbor Embedding (t-SNE)
-t-SNE maps the distances in high-dimensional space to lower dimensions and it is similar to MDS method in this respect. But the benefit of this particular method is that it tries to preserve the local structure of the data so the distances and grouping of the points we observe in a lower dimensions such as a 2D scatter plot is as close as possible to the distances we observe in the high-dimensional space [@maaten2008visualizing]. As with other dimension reduction methods, you can choose how many lower dimensions you need. The main difference of t-SNE is that it tries to preserve the local structure of the data. This kind of local structure embedding is missing in the MDS algorithm which also has a similar goal. MDS tries to optimize the distances as a whole, whereas t-SNE optimizes the distances with the local structure in mind. This is defined by the "perplexity" parameter in the arguments. This parameter controls how much the local structure influences the distance calculation. The lower the value the more the local structure is take into account. Similar to MDS, the process is an optimization algorithm. Here, we also try to minimize the divergence between observed distances and lower dimension distances. However, in the case of t-SNE, the observed distances and lower dimensional distances are transformed using a probabilistic framework with their local variance in mind.\index{t-Distributed Stochastic Neighbor Embedding (t-SNE)}
+t-SNE maps the distances in high-dimensional space to lower dimensions and it is similar to the MDS method in this respect. But the benefit of this particular method is that it tries to preserve the local structure of the data so the distances and grouping of the points we observe in lower dimensions such as a 2D scatter plot are as close as possible to the distances we observe in the high-dimensional space [@maaten2008visualizing]. As with other dimension reduction methods, you can choose how many lower dimensions you need. The main difference of t-SNE, as mentioned above, is that it tries to preserve the local structure of the data. This kind of local structure embedding is missing in the MDS algorithm, which also has a similar goal. MDS tries to optimize the distances as a whole, whereas t-SNE optimizes the distances with the local structure in mind. This is defined by the "perplexity" parameter in the arguments. This parameter controls how much the local structure influences the distance calculation. The lower the value, the more the local structure is taken into account. Similar to MDS, the process is an optimization algorithm. Here, we also try to minimize the divergence between observed distances and lower dimension distances. However, in the case of t-SNE, the observed distances and lower dimensional distances are transformed using a probabilistic framework with their local variance in mind.\index{t-Distributed Stochastic Neighbor Embedding (t-SNE)}
-From here on, we will provide a bit more detail on how the algorithm works in case conceptual description above is too shallow. In t-SNE the euclidean distances between data points are transformed into a conditional similarity between points. This is done by assuming a normal distribution on each data point with a variance calculated ultimately by the use of "perplexity" parameter. The perplexity parameter is, in a sense, a guess about the number of the closest neighbors each point has. Setting it to higher values gives more weight to global structure.
Given $d_{ij}$ is the euclidean distance between point $i$ and $j$, the similarity score $p_{ij}$ is calculated as shown below.
+From here on, we will provide a bit more detail on how the algorithm works in case the conceptual description above is too shallow. In t-SNE the Euclidean distances between data points are transformed into a conditional similarity between points. This is done by assuming a normal distribution on each data point with a variance calculated ultimately by the use of the "perplexity" parameter. The perplexity parameter is, in a sense, a guess about the number of the closest neighbors each point has. Setting it to higher values gives more weight to global structure. Given $d_{ij}$ is the Euclidean distance between point $i$ and $j$, the similarity score $p_{ij}$ is calculated as shown below.
-$p_{j | i} = \frac{\exp(-\|d_{ij}\|^2 / 2 σ_i^2)}{∑_{k \neq i} \exp(-\|d_{ik}\|^2 / 2 σ_i^2)}$
+$$p_{j | i} = \frac{\exp(-\|d_{ij}\|^2 / 2 \sigma_i^2)}{\sum_{k \neq i} \exp(-\|d_{ik}\|^2 / 2 \sigma_i^2)}$$
This distance is symmetrized by incorporating $p_{i | j}$ as shown below.
-$p_{i j}=\frac{p_{j|i} + p_{i|j}}{2n}$
+$$p_{i j}=\frac{p_{j|i} + p_{i|j}}{2n}$$
-For the distances in the reduced dimension, we use t-distribution with one degree of freedom. In the formula below, $| y_i-y_j\|^2$ is euclidean distance between points $i$ and $j$ in the reduced dimensions.
+For the distances in the reduced dimension, we use the t-distribution with one degree of freedom. In the formula below, $\| y_i-y_j\|^2$ is the Euclidean distance between points $i$ and $j$ in the reduced dimensions.
$$ q_{i j} = \frac{(1+ \| y_i-y_j\|^2)^{-1}}{\sum_{k \neq l} (1+ \| y_k-y_l\|^2)^{-1} } $$
-As most of the algorithms we have seen in this section, t-SNE is an optimization process in essence. In every iteration the points along lower dimensions are re-arranged to minimize the formulated difference between the the observed joint probabilities ($p_{i j}$) and low-dimensional joint probabilities ($q_{i j}$). Here we are trying to compare probability distributions. In this case, this is done using a method called Kullback-Leibler divergence, or KL-divergence. In the formula below, since the $p_{i j}$ is pre-defined using original distances, only way to optimize is to play with $q_{i j}$) because it depends on the configuration of points in the lower dimensional space. This configuration is optimized to minimize the KL-divergence between $p_{i j}$ and $q_{i j}$.
+As with most of the algorithms we have seen in this section, t-SNE is an optimization process in essence. In every iteration the points along lower dimensions are re-arranged to minimize the formulated difference between the observed joint probabilities ($p_{i j}$) and low-dimensional joint probabilities ($q_{i j}$). Here we are trying to compare probability distributions. In this case, this is done using a method called Kullback-Leibler divergence, or KL-divergence. In the formula below, since the $p_{i j}$ is pre-defined using original distances, the only way to optimize is to play with $q_{i j}$ because it depends on the configuration of points in the lower dimensional space. This configuration is optimized to minimize the KL-divergence between $p_{i j}$ and $q_{i j}$.
$$ KL(P||Q) = \sum_{i, j} p_{ij} \, \log \frac{p_{ij}}{q_{ij}}. $$
-Strictly speaking, KL-divergence measures how well the distribution $P$ which is observed using the original data points can be approximated by distribution $Q$, which is modeled using points on the lower dimension.
If the distributions are identical KL-divergence would be 0. Naturally, the more divergent the distributions are the higher the KL-divergence will be.
+Strictly speaking, KL-divergence measures how well the distribution $P$, which is observed using the original data points, can be approximated by distribution $Q$, which is modeled using points on the lower dimension. If the distributions are identical, KL-divergence would be $0$. Naturally, the more divergent the distributions are, the higher the KL-divergence will be.
-We will now show how to use t-SNE on our gene expression data set using `Rtsne` package \index{R Packages!\texttt{Rtsne}}. We are setting the random seed because again t-SNE optimization algorithm have random starting points and this might create non-identical results in every run. After calculating the t-SNE lower dimension embeddings we plot the points in a 2D scatter plot, shown in Figure \@ref(fig:tsne).
+We will now show how to use t-SNE on our gene expression data set using the `Rtsne` package \index{R Packages!\texttt{Rtsne}}. We are setting the random seed because, again, the t-SNE optimization algorithm has random starting points and this might create non-identical results in every run. After calculating the t-SNE lower dimension embeddings we plot the points in a 2D scatter plot, shown in Figure \@ref(fig:tsne).
```{r tsne,eval=TRUE, out.width='60%',fig.width=5, fig.cap="t-SNE of leukemia expression dataset"}
library("Rtsne")
set.seed(42) # Set a seed if you want reproducible results
@@ -529,23 +527,23 @@ legend("bottomleft",
       border=NA,box.col=NA)
```
-As you might have noticed, we set again a random seed with `set.seed()` function. The optimization algorithm starts with random configuration of points in the lower dimension space, and each iteration it tries to improve on the previous lower dimension conflagration, that is why starting points can result in different final outcomes.
+As you might have noticed, we again set a random seed with the `set.seed()` function. The optimization algorithm starts with a random configuration of points in the lower-dimensional space, and in each iteration it tries to improve on the previous lower-dimensional configuration, which is why starting points can result in different final outcomes.
```{block2, t-sne, type='rmdtip'}
__Want to know more?__
-- How perplexity effects t-sne, interactive examples https://distill.pub/2016/misread-tsne/
-- more on perplexity: https://blog.paperspace.com/dimension-reduction-with-t-sne/
-- Intro to t-SNE https://www.oreilly.com/learning/an-illustrated-introduction-to-the-t-sne-algorithm
+- How perplexity affects t-SNE, interactive examples: https://distill.pub/2016/misread-tsne/
+- More on perplexity: https://blog.paperspace.com/dimension-reduction-with-t-sne/
+- Intro to t-SNE: https://www.oreilly.com/learning/an-illustrated-introduction-to-the-t-sne-algorithm
```
## Exercises
-For this set of exercises we will be using the expression data as shown below:
+For this set of exercises we will be using the expression data shown below:
```{r dataLoadClu,eval=FALSE}
expFile=system.file("extdata",
                    "leukemiaExpressionSubset.rds",
@@ -556,21 +554,21 @@
mat=readRDS(expFile)
```
### Clustering
-1. We want to observe the effect of data transformation in this exercise. Scale the expression matrix with `scale()` function. In addition, try taking the logarithm of the data with `log2()` function prior to scaling. Make box plots of the unscaled and scaled data sets using `boxplot()` function.
[Difficulty: **Beginner/Intermediate**]
+1. We want to observe the effect of data transformation in this exercise. Scale the expression matrix with the `scale()` function. In addition, try taking the logarithm of the data with the `log2()` function prior to scaling. Make box plots of the unscaled and scaled data sets using the `boxplot()` function. [Difficulty: **Beginner/Intermediate**]
-2. For the same problem above using the unscaled data and different data transformation strategies, use `ward.d` distance in hierarchical clustering and plot multiple heatmap. You can try to use `pheatmap` library or any other library that can plot a heatmap with a dendrogram.Which data scaling strategy provides more homogeneous clusters with respect to disease types? [Difficulty: **Beginner/Intermediate**]
+2. For the same problem above using the unscaled data and different data transformation strategies, use the `ward.d` distance in hierarchical clustering and plot multiple heatmaps. You can try to use the `pheatmap` library or any other library that can plot a heatmap with a dendrogram. Which data-scaling strategy provides more homogeneous clusters with respect to disease types? [Difficulty: **Beginner/Intermediate**]
-3. For the transformed and untransformed data sets used in exercise above, use the silhouette for deciding number of clusters using hierarchical clustering [Difficulty: **Intermediate/Advanced**]
+3. For the transformed and untransformed data sets used in the exercise above, use the silhouette for deciding the number of clusters using hierarchical clustering. [Difficulty: **Intermediate/Advanced**]
-4. Now, use Gap Statistic for deciding number of clusters in hierarchical clustering. Is it the same number of clusters identified by two methods? Is it similar to the number of clusters obtained using the k-means algorithm in the chapter [Difficulty: **Intermediate/Advanced**]
+4. Now, use the Gap Statistic for deciding the number of clusters in hierarchical clustering. Is it the same number of clusters identified by the two methods? Is it similar to the number of clusters obtained using the k-means algorithm earlier in the chapter? [Difficulty: **Intermediate/Advanced**]
-### Dimension Reduction
-We will be using the leukemia expression data set again. You can use it as shown in clustering exercises.
+### Dimension reduction
+We will be using the leukemia expression data set again. You can use it as shown in the clustering exercises.
-1. Do PCA on the expression matrix using `princomp()` function and then use `screeplot()` function to visualize the explained variation by eigenvectors. How many top components explain the 95% of the variation ? [Difficulty: **Beginner**]
+1. Do PCA on the expression matrix using the `princomp()` function and then use the `screeplot()` function to visualize the explained variation by eigenvectors. How many top components explain 95% of the variation? [Difficulty: **Beginner**]
-2. Our next tasks is to remove eigenvectors and reconstruct the matrix using SVD, then calculating reconstruction error as the difference between original and reconstructed matrix. HINT: You have to use `svd()` function and equalize eigen value to 0 for the component you want to remove for the eigen vector you want to remove [Difficulty: **Intermediate/Advanced**]
+2. Our next tasks are to remove eigenvectors and reconstruct the matrix using SVD, then calculate the reconstruction error as the difference between the original and reconstructed matrices.
HINT: You have to use the `svd()` function and equalize eigenvalue to $0$ for the component you want to remove. [Difficulty: **Intermediate/Advanced**] -3. Produce a 10 component ICA from the expression data set. Remove each component one-by-one and measure the reconstruction error. Rank the components by decreasing reconstruction error.[Difficulty: **Advanced**] +3. Produce a 10-component ICA from the expression data set. Remove each component and measure the reconstruction error without that component. Rank the components by decreasing reconstruction-error. [Difficulty: **Advanced**] -4. In this exercise we use the `Rtsne()` function on the leukemia expression data set. Try to increase and decrease perplexity t-sne, and describe observed the changes in 2D tsne plots. [Difficulty: **Beginner**] +4. In this exercise we use the `Rtsne()` function on the leukemia expression data set. Try to increase and decrease perplexity t-sne, and describe the observed changes in 2D plots. [Difficulty: **Beginner**] diff --git a/05-supervisedLearning.Rmd b/05-supervisedLearning.Rmd index b075e22..34eaeed 100644 --- a/05-supervisedLearning.Rmd +++ b/05-supervisedLearning.Rmd @@ -12,33 +12,33 @@ knitr::opts_chunk$set(echo = TRUE, In this chapter we will introduce supervised machine learning applications for predictive modeling. In genomics, we are often faced with biological questions to answer using lots of data. Some of those questions can easily fit in the domain of machine learning, where algorithms will learn a mathematical model of the input data in order to make decisions about similar data, previously unseen by the model. Often we are trying to predict a medical or biological variable of interest using molecular signatures obtained via genomics methods. To give you a better idea, we listed some of the machine learning applications in genomics: -- predicting gene expression from epigenetic modifications (@pmid22950368) -- predicting gene locations (@pmid12364589) -- predicting enhancer or other regulatory regions (@pmid22328731) -- predicting drug response based on genomics (@pmid21428770) -- predicting healthy/disease status or disease subtypes based on genomics (@pmid25750696) -- predicting the effect of SNPs on gene regulation (@pmid26301843) -- calling SNPs (@pmid30247488) +- Predicting gene expression from epigenetic modifications [@pmid22950368]. +- Predicting gene locations [@pmid12364589]. +- Predicting enhancer or other regulatory regions [@pmid22328731]. +- Predicting drug response based on genomics [@pmid21428770]. +- Predicting healthy/disease status or disease subtypes based on genomics [@pmid25750696]. +- Predicting the effect of SNPs on gene regulation [@pmid26301843]. +- Calling SNPs [@pmid30247488]. -Apart from prediction of an outcome, machine learning can be used to understand which predictor variables are the most important for the prediction performance. This often gives insights into the biology as well. Many machine learning algorithms have either built-in variable importance assessment or can be wrapped around a model-agnostic variable importance method. For example, we may want to find which epigenetic modifications are most important for gene expression prediction. Although decades of molecular biology gives a pretty good idea for this, we could arrive at similar conclusions by building a machine learning model to predict gene expression using histone modifications H3K27ac, H3K27me, H3K4me1, H3K4me3 and DNA methylation. 
\index{histone modification} \index{DNA methylation} We can then check which of these are most important for gene expression prediction using variable importance metrics. +Apart from prediction of an outcome, machine learning can be used to understand which predictor variables are the most important for prediction performance. This often gives insights into the biology as well. Many machine learning algorithms have either built-in variable importance assessment or can be wrapped around a model-agnostic variable importance method. For example, we may want to find which epigenetic modifications are most important for gene expression prediction. Although decades of molecular biology gives a pretty good idea for this, we could arrive at similar conclusions by building a machine learning model to predict gene expression using histone modifications H3K27ac, H3K27me, H3K4me1, H3K4me3, and DNA methylation. \index{histone modification} \index{DNA methylation} We can then check which of these are most important for gene expression prediction using variable importance metrics. In this chapter, we will show how to use supervised machine learning models to solve problems in genomics. We will go over general steps in machine learning applications. In addition, we will introduce how to use some of the most popular supervised machine learning models in practice. -## How machine learning models are fit? -We have already have quite an insight on how machine learning models are fit. We have previously seen clustering methods, which are unsupervised machine learning models, and we have seen linear regression which is a simple machine learning model if we disregard its objectives for statistical inference. +## How are machine learning models fit? +We already have quite an insight on how machine learning models are fit. We have previously seen clustering methods, which are unsupervised machine learning models, and we have seen linear regression which is a simple machine learning model if we disregard its objectives for statistical inference. -Machine learning models are optimization methods in their core. They all depend on defining a "cost" or "loss" function to minimize\index{loss function} \index{cost function}. For example, in linear regression the difference between predicted and the original values are being minimized. When we have a data set with the correct answer such as original values or class labels, this is called supervised learning. We use the structure in the data the predict a value and optimization methods help us use the right structure or patterns in the data. The supervised machine learning methods use predictor variables such as gene expression values or other genomic scores to build a mathematical function, or a mapping method if you will. This function maps a predictor variable vector or matrix from a given sample to the response variable: labels/classes or numeric values. The response variable is also called "dependent variable". Then, the predictions are simply output of mathematical functions, $f(X)$. These functions take predictor variables, $X$, as input. The variables in $X$ are also \index{independent variables} called "independent variables","explanatory variables" or "features". \index{explanatory variables}The functions also have internal parameters that help map $X$ to the predicted values. The optimization works on the parameters of $f(X)$ and tries to minimize the difference between the function output and original response variables ($Y$): $\sum(Y-f(X))^2$. 
Now, this is just a simplification of the actual "cost" or "loss" function. Especially, in classification problems cost functions can take different forms but the idea behind is the same. You have a mathematical expression you can minimize by searching for the optimal parameter values. The core ingredients of a machine learning algorithm are the same and they are listed as follows: +Machine learning models are optimization methods at their core. They all depend on defining a "cost" or "loss" function to minimize\index{loss function}\index{cost function}. For example, in linear regression the difference between the predicted and the original values are being minimized. When we have a data set with the correct answer such as original values or class labels, this is called supervised learning. We use the structure in the data to predict a value, and optimization methods help us use the right structure or patterns in the data. The supervised machine learning methods use predictor variables such as gene expression values or other genomic scores to build a mathematical function, or a mapping method if you will. This function maps a predictor variable vector or matrix from a given sample to the response variable: labels/classes or numeric values. The response variable is also called the "dependent variable". Then, the predictions are simply output of mathematical functions, $f(X)$. These functions take predictor variables, $X$, as input. The variables in $X$ are also \index{independent variables} called "independent variables", "explanatory variables" or "features". \index{explanatory variables}The functions also have internal parameters that help map $X$ to the predicted values. The optimization works on the parameters of $f(X)$ and tries to minimize the difference between the function output and original response variables ($Y$): $\sum(Y-f(X))^2$. Now, this is just a simplification of the actual "cost" or "loss" function. Especially in classification problems, cost functions can take different forms, but the idea is the same. You have a mathematical expression you can minimize by searching for the optimal parameter values. The core ingredients of a machine learning algorithm are the same and they are listed as follows: 1) Define a prediction function or method $f(X)$. -2) Devise a function (called loss or cost function) to optimize the difference between your predictions and observed values, such as $\sum (Y-f(X))^2$. -3) Apply mathematical optimization methods to find best parameter values for $f(X)$ in relation to the cost/loss function.\index{optimization} +2) Devise a function (called the loss or cost function) to optimize the difference between your predictions and observed values, such as $\sum (Y-f(X))^2$. +3) Apply mathematical optimization methods to find the best parameter values for $f(X)$ in relation to the cost/loss function.\index{optimization} -Similarly, clustering and dimension reduction techniques can use optimization methods but they do so without having a correct answer to predict or train with. In this case, they find patterns or structure in the data without trying to estimate a correct answer. These patterns are groupings of samples or variables, such as common gene expression patterns that can be obtained from dimension reduction techniques such as PCA. In general, dimension reduction algorithms can be thought as optimization procedures that are trying to minimize $X-WH$. 
Here, $X$ is our original data set and $WH$ is the product of potentially two lower dimension matrices, $W$ and $H$. In this case, the optimization procedure hopefully gives us the lower dimensional space we can represent our data without loosing too much information. +Similarly, clustering and dimension reduction techniques can use optimization methods, but they do so without having a correct answer to predict or train with. In this case, they find patterns or structure in the data without trying to estimate a correct answer. These patterns are groupings of samples or variables, such as common gene expression patterns, that can be obtained from dimension reduction techniques such as PCA. In general, dimension reduction algorithms can be thought of as optimization procedures that are trying to minimize $X-WH$. Here, $X$ is our original data set and $WH$ is the product of potentially two lower dimension matrices, $W$ and $H$. In this case, the optimization procedure hopefully gives us the lower-dimensional space so that we can represent our data without losing too much information. -### Machine learning vs Statistics -Machine learning and statistics are related and sometimes overlapping fields. Statistical inference is the main purpose of statistics. The aim of inference is to find statistical properties of the underlying data and also estimate the uncertainty about those properties. However, while doing so, the field of statistics developed dimension reduction and regression techniques that are the corner stone of machine learning applications. +### Machine learning vs. statistics +Machine learning and statistics are related and sometimes overlapping fields. Statistical inference is the main purpose of statistics. The aim of inference is to find statistical properties of the underlying data and to estimate the uncertainty about those properties. However, while doing so, the field of statistics developed dimension reduction and regression techniques that are the cornerstone of machine learning applications. Both machine learning and statistics share the same overarching goal, which is learning from the data. The difference between the two is that machine learning emphasizes optimization and performance over statistical inference. Statistics is also concerned about performance but would like to calculate the uncertainty associated with parameters of the model. It will try to model the population statistics from the sample data points to assess that uncertainty. Having said that, many machine learning algorithms, including a couple we will introduce below, are developed by scientists who will define themselves as statisticians, and work at statistics departments of universities. @@ -47,16 +47,16 @@ The difference between the two is that machine learning emphasizes optimization ## Steps in supervised machine learning There are many methods to use for supervised learning problems. However, there are similar steps that you will need to follow whatever machine learning method you choose to train. These steps are briefly described below and we will get back to these in detail later in the chapter: -- pre-processing data: We might have to use normalization and data transformation procedures. -- training and test data split: Decide which strategy you want to use for evaluation purposes. You need to use a test set to evaluate your model later on. 
-- training the model: This is where your choice of supervised learning algorithm becomes relevant."Training" generally means your data set is used in optimization of the loss function to find parameters for $f(x)$. +- Pre-processing data: We might have to use normalization and data transformation procedures. +- Training and test data split: Decide which strategy you want to use for evaluation purposes. You need to use a test set to evaluate your model later on. +- Training the model: This is where your choice of supervised learning algorithm becomes relevant. "Training" generally means your data set is used in optimization of the loss function to find parameters for $f(x)$. - Estimating performance of the model: This is about which metrics to use to evaluate performance and how to calculate those metrics. - Model tuning and selection: We try different parameters and select the best model. -Many of these steps are identical for different supervised learning methods. Therefore, we will use [`caret`](http://topepo.github.io/caret/index.html) package to\index{R Packages!\texttt{caret}} perform these steps, which streamlines the steps and provides a similar interface for different supervised learning methods. There are other similar packages such as [`mlr`](https://mlr.mlr-org.com/) \index{R Packages!\texttt{mlr}}that can provide similar functionality. For now, we will focus on classification models which is a subset of supervised learning models. In these types of models, we try to predict a categorical response variable, such as if a patient has the disease or not, or what type of disease the patient has based on predictor variables. +Many of these steps are identical for different supervised learning methods. Therefore, we will use the [`caret`](http://topepo.github.io/caret/index.html) package to\index{R Packages!\texttt{caret}} perform these steps, which streamlines the steps and provides a similar interface for different supervised learning methods. There are other similar packages, such as [`mlr`](https://mlr.mlr-org.com/), \index{R Packages!\texttt{mlr}}that can provide similar functionality. For now, we will focus on classification models, which is a subset of supervised learning models. In these types of models, we try to predict a categorical response variable, such as if a patient has the disease or not, or what type of disease the patient has based on predictor variables. ## Use case: Disease subtype from genomics data -We will start our illustration of machine learning using a real dataset from tumor biopsies. We will use the gene expression data of glioblastoma tumor samples from “The Cancer Genome Atlas”\index{The Cancer Genome Atlas (TCGA)} project. We will try to predict the subtype of this disease using molecular markers. \index{CpG island}This subtype is characterized by large scale epigenetic alterations called "CpG island methylator phenotype" or "CIMP" ( @pmid20399149), half of the patients in our data set have this subtype and the rest do not, and we will try to predict which ones have CIMP subtype. There two data objects we need for this exercise, one for gene expression values per tumor sample and the other one is subtype annotation per patient. In the expression data set, every row is a patient and every column is a gene expression value\index{gene expression}. There are 184 tumor samples. 
This data set might a bit small for real world applications, however it is very relevant for the genomics focus of this book and the small datasets takes less time to train, which is useful for reproduciblity purposes. We will read these data sets from **compGenomRData** package now with `readRDS()` function. +We will start our illustration of machine learning using a real dataset from tumor biopsies. We will use the gene expression data of glioblastoma tumor samples from The Cancer Genome Atlas\index{The Cancer Genome Atlas (TCGA)} project. We will try to predict the subtype of this disease using molecular markers. \index{CpG island}This subtype is characterized by large-scale epigenetic alterations called the "CpG island methylator phenotype" or "CIMP" [@pmid20399149]; half of the patients in our data set have this subtype and the rest do not, and we will try to predict which ones have the CIMP subtype. There are two data objects we need for this exercise: one for gene expression values per tumor sample, and the other for the subtype annotation per patient. In the expression data set, every row is a patient and every column is a gene expression value\index{gene expression}. There are 184 tumor samples. This data set might be a bit small for real-world applications; however, it is very relevant for the genomics focus of this book, and small datasets take less time to train, which is useful for reproducibility purposes. We will read these data sets from the **compGenomRData** package now with the `readRDS()` function. ```{r,readMLdata} # get file paths fileLGGexp=system.file("extdata", @@ -77,33 +77,33 @@ dim(patient) ``` ## Data preprocessing -We will have to preprocess the data before we start training. This might include exploratory data analysis to see how variables and samples relate to each other. For example, we might want to check correlation between predictor variables and keep only one variable from that group. In addition, some training algorithms might be sensitive to data scales or outliers\index{outliers}. We should deal with those issues in this step. In some cases, the data might have missing values. We can choose the remove the samples that have missing values or try to impute them. Many machine learning algorithms will not be able to deal with missing values. +We will have to preprocess the data before we start training. This might include exploratory data analysis to see how variables and samples relate to each other. For example, we might want to check the correlation between predictor variables and keep only one variable from that group. In addition, some training algorithms might be sensitive to data scales or outliers\index{outliers}. We should deal with those issues in this step. In some cases, the data might have missing values. We can choose to remove the samples that have missing values or try to impute them. Many machine learning algorithms will not be able to deal with missing values. -We will show how to do this in practice using the `caret::preProcess()` function and base R functions. Please note that there are more preprocessing options available than what we will show both via `caret::preProcess()` and base R functions, we are just going to cover a few basics. +We will show how to do this in practice using the `caret::preProcess()` function and base R functions. Please note that there are more preprocessing options available than we will show here.
There are more possibilities in `caret::preProcess()`function and base R functions, we are just going to cover a few basics in this section. -### data transformation -First thing we will do is data normalization and transformation. We have to take care of data scale issues that might come from how the experiments are performed and the potential problems that might occur during data collection. Ideally, each tumor sample has a similar distribution of gene expression values. Systematic differences between tumor samples must be corrected. We check if there are such differences using box plots. +### Data transformation +The first thing we will do is data normalization and transformation. We have to take care of data scale issues that might come from how the experiments are performed and the potential problems that might occur during data collection. Ideally, each tumor sample has a similar distribution of gene expression values. Systematic differences between tumor samples must be corrected. We check if there are such differences using box plots. We will only plot the first 50 tumor samples so that the figure is not too squished. The resulting boxplot is shown in Figure \@ref(fig:boxML). -```{r boxML,out.width='60%',fig.width=5,fig.cap="boxplots for gene expression values"} +```{r boxML,out.width='60%',fig.width=5,fig.cap="Boxplots for gene expression values."} boxplot(gexp[,1:50],outline=FALSE,col="cornflowerblue") ``` -It seems there was some normalization done on this data. Gene expression values per sample looks to have the same scale. However, it looks like they have long tailed distributions, a log transformation may fix that. These long tailed distributions have outliers \index{outliers}and this might adversely affect the models. We show how to distributions looks like for one of the patients samples without and with log transformation. We add a pseudo count of 1 to avoid `log(0)`. +It seems there was some normalization done on this data. Gene expression values per sample seem to have the same scale. However, it looks like they have long-tailed distributions, so a log transformation may fix that. These long-tailed distributions have outliers \index{outliers}and this might adversely affect the models. Below, we show the effect of log transformation on the gene expression profile of a patient. We add a pseudo count of 1 to avoid `log(0)`. The resulting histograms are shown in Figure \@ref(fig:logTransform). -```{r logTransform,out.width='60%',fig.width=8,fig.cap="Gene expression distribution for the 5th patient (left). log transformed Gene expression distribution for the same patient (right)"} +```{r logTransform,out.width='60%',fig.width=8,fig.cap="Gene expression distribution for the 5th patient (left). Log transformed gene expression distribution for the same patient (right)."} par(mfrow=c(1,2)) hist(gexp[,5],xlab="gene expression",main="",border="blue4", col="cornflowerblue") hist(log10(gexp+1)[,5], xlab="gene expression log scale",main="", border="blue4",col="cornflowerblue") ``` -Since taking a log seems to work to tame the extreme values we do that below and also add 1 pseudo-count to be able to deal with 0 values.: +Since taking a log seems to work to tame the extreme values, we do that below and also add $1$ pseudo-count to be able to deal with $0$ values: ```{r takeLog} gexp=log10(gexp+1) ``` -Other things we can do in combination with this is to winsorize the data which caps extreme values to 1st and 99th percentiles. But before we go forward, we should transpose our data. 
In this case, the predictor variables are gene expression values and they should be on the column side. It was OK to leave them on the row side, to check systematic errors with box plots, but machine learning algorithms require predictor variables are on the column side. +Another thing we can do in combination with this is to winsorize the data, which caps extreme values to the 1st and 99th percentiles or to other user-defined percentiles. But before we go forward, we should transpose our data. In this case, the predictor variables are gene expression values and they should be on the column side. It was OK to leave them on the row side, to check systematic errors with box plots, but machine learning algorithms require that predictor variables are on the column side. ```{r transposeML} # transpose the data set tgexp <- t(gexp) @@ -111,7 +111,7 @@ tgexp <- t(gexp) ``` ### Filtering data and scaling -We can filter predictor variables which has low variation. They are not likely to have any predictive importance since there is not much variation and they will just slow our algorithms. The more variables the slower the algorithms will be generally. `caret::preProcess()` function can help filter the predictor variables with near zero variance. +We can filter predictor variables which have low variation. They are not likely to have any predictive importance since there is not much variation and they will just slow our algorithms. The more variables, the slower the algorithms will be generally. The `caret::preProcess()` function can help filter the predictor variables with near zero variance. ```{r nzv,eval=FALSE} library(caret) # remove near zero variation for the columns at least @@ -131,15 +131,15 @@ topPreds=order(SDs,decreasing = TRUE)[1:1000] tgexp=tgexp[,topPreds] ``` -We can also center the data which is as we have seen in chapter 4 is subtracting the mean. Following this, the predictor variables will have zero means. In addition, we can scale the data. When we scale, each value of the predictor -variable is divided by its standard deviation. Therefore predictor variables will have the same standard deviation. These manipulations are generally used to improve the numerical stability of some calculations. In distance based metrics, it could be beneficial to at least center the data. We will now center the data. We are centering using `preProcess`. This is more practical than `scale()` function because when we get a new data point we can use `predict()` function and `processCenter` object to process it just like we did for the training samples. +We can also center the data, which as we have seen in Chapter 4, is subtracting the mean. Following this, the predictor variables will have zero means. In addition, we can scale the data. When we scale, each value of the predictor +variable is divided by its standard deviation. Therefore predictor variables will have the same standard deviation. These manipulations are generally used to improve the numerical stability of some calculations. In distance-based metrics, it could be beneficial to at least center the data. We will now center the data using the `preProcess()` function. This is more practical than the `scale()` function because when we get a new data point, we can use the `predict()` function and `processCenter` object to process it just like we did for the training samples. 
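+Before applying the packaged version, it may help to see what the centering step actually computes. The short sketch below is only an illustration (it assumes the `tgexp` matrix defined above and is not part of the pipeline); the `preProcess()`-based chunk that follows is the recommended way, because the returned object can later be applied to new samples with `predict()`.
+```{r centerSketchExtra, eval=FALSE}
+# centering by hand: subtract each predictor's (column's) mean
+colMu <- colMeans(tgexp)
+tgexpCenteredManual <- sweep(tgexp, 2, colMu, FUN = "-")
+
+# after centering, every column mean is (numerically) zero
+summary(colMeans(tgexpCenteredManual))
+```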
```{r, preCenter} library(caret) processCenter=preProcess(tgexp, method = c("center")) tgexp=predict(processCenter,tgexp) ``` -We will next filter the predictor variables that are highly correlated. You may choose not to do this as some methods can handle correlation between predictor variables.\index{collinearity} However, the less predictor variables we have the faster the model fitting can be done. +We will next filter the predictor variables that are highly correlated. You may choose not to do this as some methods can handle correlation between predictor variables.\index{collinearity} However, the fewer predictor variables we have, the faster the model fitting can be done. ```{r filterCorr} # create a filter for removing higly correlated variables @@ -150,14 +150,14 @@ tgexp=predict(corrFilt,tgexp) ``` ### Dealing with missing values -In real life situations, there will be missing values in our data. In genomics, we might not have values for certain genes or genomic locations due to technical problems during experiments. We have to be able to deal with these missing values\index{missing values}. For demonstration purposes, we will now introduce NA values in our data, the "NA" values is normally used to encode missing values in R. We then show how to check and deal with those. One way is to impute them, here we use again a machine learning algorithm to guess the missing values. Another option is to discard the samples with missing values or discard the predictor variables with missing values. First, we replace one of the values as NA and check if it is there. +In real-life situations, there will be missing values in our data. In genomics, we might not have values for certain genes or genomic locations due to technical problems during experiments. We have to be able to deal with these missing values\index{missing values}. For demonstration purposes, we will now introduce NA values in our data, the "NA" value is normally used to encode missing values in R. We then show how to check and deal with those. One way is to impute them; here, we again use a machine learning algorithm to guess the missing values. Another option is to discard the samples with missing values or discard the predictor variables with missing values. First, we replace one of the values as NA and check if it is there. ```{r,checkNA} missing_tgexp=tgexp missing_tgexp[1,1]=NA anyNA(missing_tgexp) # check if there are NA values ``` -Next, we will try to remove that gene from the set. Removing genes or samples have both downsides. You might be removing a predictor variable that could be important for the prediction. Removing the samples with missing values will decrease the number of samples in the training set. The code below checks which values are NA in the matrix, then runs a column sum and keeps everything where column sum is equal to 0. The column sums where there are NA values will be higher than 0 depending on how many NA values there are in a column. +Next, we will try to remove that gene from the set. Removing genes or samples both have downsides. You might be removing a predictor variable that could be important for the prediction. Removing samples with missing values will decrease the number of samples in the training set. The code below checks which values are NA in the matrix, then runs a column sum and keeps everything where the column sum is equal to 0. The column sums where there are NA values will be higher than 0 depending on how many NA values there are in a column. 
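+Before dropping anything, it can also be useful to quantify how much is actually missing, since that often decides between removing predictors, removing samples, or imputing. The sketch below is a small optional check (it assumes the `missing_tgexp` matrix created above); the removal itself is done in the next chunk.
+```{r naCountsSketch, eval=FALSE}
+# number of NA values per predictor (column) and per sample (row)
+naPerGene   <- colSums(is.na(missing_tgexp))
+naPerSample <- rowSums(is.na(missing_tgexp))
+
+# how many predictors and samples contain at least one NA
+sum(naPerGene > 0)
+sum(naPerSample > 0)
+```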
```{r,removeNA, eval=FALSE} gexpnoNA=missing_tgexp[ , colSums(is.na(missing_tgexp)) == 0] @@ -172,7 +172,7 @@ imputedGexp=predict(mImpute,missing_tgexp) ``` -Another imputation\index{imputation} methods that is more precise than the median imputation is to impute the missing values based on the nearest neighbors of the samples. In this case, the algorithm finds most similar samples to the sample vector with NA values. Next, the algorithm averages the non-missing values from those neighbors and replaces the missing value with that value. +Another imputation\index{imputation} method that is more precise than the median imputation is to impute the missing values based on the nearest neighbors of the samples. In this case, the algorithm finds samples that are most similar to the sample vector with NA values. Next, the algorithm averages the non-missing values from those neighbors and replaces the missing value with that value. ```{r knnimpute,eval=FALSE} library(RANN) knnImpute=preProcess(missing_tgexp,method="knnImpute") @@ -181,10 +181,10 @@ knnimputedGexp=predict(knnImpute,missing_tgexp) ``` ## Splitting the data -At this point we might choose to split the data to the test and the training partitions. The reason for this is that we need an independent test we did not train on. This will become more clear in the following sections but without having a separate test set we can not assess the performance of our model or tune it properly. +At this point we might choose to split the data into the test and the training partitions. The reason for this is that we need an independent test set that we did not train on. This will become clearer in the following sections, but without having a separate test set, we cannot assess the performance of our model or tune it properly. ### Holdout test dataset -There are multiple data split strategies. For starters, we will split 30% of the data as test. This method is the gold standard for testing performance of our model. By doing this, we have a separate data set that the model has never seen. First, we create a single data frame with predictors and response variables +There are multiple data split strategies. For starters, we will set aside 30% of the data as the test set. This method is the gold standard for testing the performance of our model. By doing this, we have a separate data set that the model has never seen. First, we create a single data frame with predictors and response variables. ```{r mergeLabelTrain} tgexp=merge(patient,tgexp,by="row.names") @@ -192,7 +192,7 @@ rownames(tgexp)=tgexp[,1] tgexp=tgexp[,-1] ``` -Now the response variable or the class label is merged with our dataset we can split it to test and training set with `caret::createPartition()`. +Now that the response variable or the class label is merged with our dataset, we can split it into test and training sets with the `caret::createDataPartition()` function. ```{r datapart} set.seed(3031) # set the random number seed for reproducibility @@ -207,25 +207,25 @@ testing <- tgexp[-intrain,] ``` ### Cross-validation -In some cases, we might have too few data points and it might be to costly to set aside a significant portion of the data set as a holdout test set. In these cases a resampling based techniques such as cross-validation may be useful.\index{cross-validation} +In some cases, we might have too few data points and it might be too costly to set aside a significant portion of the data set as a holdout test set.
In these cases a resampling-based technique such as cross-validation may be useful.\index{cross-validation} -Cross-validation works by splitting the data into randomly sampled $k$ subsets, called k-folds. So, for example, in the case of 5-fold cross-validation with 100 data points, we would create 5 folds each containing 20 data points. We would then build models and estimate errors 5 times. Each time four of the groups are combined (resulting in 80 data points) and used to train your model. Then the 5th group of 20 points that was not used to construct the model is used to estimate the test error. In the case of 5-fold cross-validation, we would have 5 error estimates that could be averaged to obtain a more robust estimate of the test error. +Cross-validation works by splitting the data into randomly sampled $k$ subsets, called k-folds. So, for example, in the case of 5-fold cross-validation with 100 data points, we would create 5 folds, each containing 20 data points. We would then build models and estimate errors 5 times. Each time, four of the groups are combined (resulting in 80 data points) and used to train your model. Then the 5th group of 20 points that was not used to construct the model is used to estimate the test error. In the case of 5-fold cross-validation, we would have 5 error estimates that could be averaged to obtain a more robust estimate of the test error. -An extreme case of k-fold cross-validation, is to equalize the $k$ to the number of data points or in our case number of tumor samples. This is called leave-one-out cross-validation (LOOCV). This could be better than k-fold cross-validation but it takes too much time to train that many models if number of data points are large. +An extreme case of k-fold cross-validation, is to equalize the $k$ to the number of data points or in our case, the number of tumor samples. This is called leave-one-out cross-validation (LOOCV). This could be better than k-fold cross-validation but it takes too much time to train that many models if the number of data points is large. The `caret` package\index{R Packages!\texttt{caret}} has built-in cross-validation functionality for all the machine learning methods and we will be using that in the later sections. ### Bootstrap resampling -Another method to be used to estimate the prediction error is to use bootstrap resampling. This is a general method we have already introduced in Chapter \@ref(stats). It can be used the estimate variability of any statistical parameter. In this case, that parameter is the test error or test accuracy.\index{bootstrap resampling} +Another method to estimate the prediction error is to use bootstrap resampling. This is a general method we have already introduced in Chapter \@ref(stats). It can be used to estimate variability of any statistical parameter. In this case, that parameter is the test error or test accuracy.\index{bootstrap resampling} -The training set is drawn from the original set with replacement (same size as the original set), then we build a model with this bootstrap resampled set. Next, we take the data points that are not selected for the random sample and predict labels for them. These data points are called "out-of-the-bag (OOB) sample". We repeat this process many times and record the error for the OOB samples. We can take the average of OOB error to estimate the real test error. 
This is a powerful method that is not only used to estimate test error but incorporated into the training part of some machine learning methods such as random forests. Normally, we should repeat the process hundreds or up to a thousand times to get good estimates. However, limiting factor would be the time it takes construct and test that many models. 20-30 repetitions might be enough if the time cost of training is too high. Again, `caret` \index{R Packages!\texttt{caret}} package provides the bootstrap interface for many machine learning models for sampling before training and estimating the error on OOB samples. +The training set is drawn from the original set with replacement (same size as the original set), then we build a model with this bootstrap resampled set. Next, we take the data points that are not selected for the random sample and predict labels for them. These data points are called the "out-of-the-bag (OOB) sample". We repeat this process many times and record the error for the OOB samples. We can take the average of the OOB errors to estimate the real test error. This is a powerful method that is not only used to estimate test error but incorporated into the training part of some machine learning methods such as random forests. Normally, we should repeat the process hundreds or up to a thousand times to get good estimates. However, the limiting factor would be the time it takes to construct and test that many models. Twenty to 30 repetitions might be enough if the time cost of training is too high. Again, the `caret` \index{R Packages!\texttt{caret}} package provides the bootstrap interface for many machine learning models for sampling before training and estimating the error on OOB samples. ## Predicting the subtype with k-nearest neighbors -One of the easiest things to wrap our heads around when we are trying to predict a label such as disease subtype is to look for similar samples and assign the labels of the those similar samples to our sample. +One of the easiest things to wrap our heads around when we are trying to predict a label such as disease subtype is to look for similar samples and assign the labels of those similar samples to our sample. Conceptually, k-nearest neighbors (k-NN) is very similar to\index{k-nearest neighbors (k-NN)} clustering algorithms we have seen earlier. If we have a measure of distance between the samples, we can find the nearest $k$ samples to our new sample and use a voting method to decide on the label of our new sample. -Let us run the k-NN algorithm with our cancer data. For illustrative purposes, we provide the same data set for training and test data. Providing the training data as the test data shows us the training error or accuracy, which is how the model is doing on the data it is trained with. Below we are running k-NN with `caret:knn3()` function. The most important argument is `k` which is the number of nearest neighbors to consider. In this case, we set it to 5. We will later discuss how to find the best `k`. +Let us run the k-NN algorithm with our cancer data. For illustrative purposes, we provide the same data set for training and test data. Providing the training data as the test data shows us the training error or accuracy, which is how the model is doing on the data it is trained with. Below we are running k-NN with the `caret::knn3()` function. The most important argument is `k`, which is the number of nearest neighbors to consider. In this case, we set it to 5. We will later discuss how to find the best `k`.
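+To make the voting idea concrete before we call the packaged implementation, here is a minimal, self-contained sketch of the k-NN decision rule on a made-up toy matrix; it is not part of the tumor analysis, and the `knn3()` call in the next chunk is what we actually use, since it also returns proper class probabilities.
+```{r knnToySketch, eval=FALSE}
+set.seed(1)
+# toy training data: 6 samples, 2 predictor variables, two classes
+toyX <- matrix(rnorm(12), ncol = 2)
+toyY <- factor(c("A", "A", "A", "B", "B", "B"))
+newX <- c(0.5, 0.5)  # a new sample we want to classify
+k <- 3
+
+# Euclidean distance from the new sample to every training sample
+d <- sqrt(colSums((t(toyX) - newX)^2))
+
+# labels of the k nearest training samples, then a majority vote
+nearestLabels <- toyY[order(d)[1:k]]
+names(which.max(table(nearestLabels)))
+```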
```{r knn} library(caret) @@ -238,7 +238,7 @@ trainPred=predict(knnFit,training[,-1]) ``` ## Assessing the performance of our model -We have to define some metrics to see if our model worked. The algorithm is trying to reduce the classification error, or in other words it is trying to increase the training accuracy. For the assessment of performance, there are other different metrics to consider. All the metrics for 2-class classification depend on the table below. Which shows the number of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN), similar to a table we used in hypothesis testing section in the statistics chapter previously. +We have to define some metrics to see if our model worked. The algorithm is trying to reduce the classification error, or in other words it is trying to increase the training accuracy. For the assessment of performance, there are other different metrics to consider. All the metrics for 2-class classification depend on the table below, which shows the number of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN), similar to a table we used in the hypothesis testing section in the statistics chapter previously. ------------------------------------------------------------- Actual CIMP Actual noCIMP @@ -251,18 +251,18 @@ noCIMP False negatives (FN) True negatives (TN) ------------------------------------------------------------- -Accuracy is the first metric to look at. This metric is is simply +Accuracy is the first metric to look at. This metric is simply $(TP+TN)/(TP+TN+FP+FN)$ and shows the proportion of times we were right. There are other accuracy metrics that are important and output by `caret` functions. We will go over some of them here.\index{accuracy} -Precision, $TP/(TP+FP)$, is about the confidence we have on our CIMP calls. If our method is very precise, we will have low false positives. That means every time we call a CIMP event, we would relatively certain it is not a false positive.\index{precision} +Precision, $TP/(TP+FP)$, is about the confidence we have on our CIMP calls. If our method is very precise, we will have low false positives. That means every time we call a CIMP event, we would be relatively certain it is not a false positive.\index{precision} -Sensitivity, $TP/(TP+FN)$, is how often we miss CIMP cases and call them as noCIMP. Making less mistakes in noCIMP cases will increase our sensitivity. You can think of sensitivity also as in sick/healthy context. A highly sensitive method will be good at classifying sick people when they are indeed sick.\index{sensitivity} +Sensitivity, $TP/(TP+FN)$, measures how well we capture the actual CIMP cases rather than missing them and calling them noCIMP. Making fewer mistakes on the actual CIMP cases will increase our sensitivity. You can think of sensitivity also in a sick/healthy context. A highly sensitive method will be good at classifying sick people when they are indeed sick.\index{sensitivity} -Specificity,$TN/(TN+FP)$, is about how sure we are when we call something as noCIMP. If our method is not very specific, we would call many patients as CIMP while in fact they did not have the subtype. In the sick/healthy context, a highly specific method, will be good at not calling healthy people sick.\index{specificity} +Specificity, $TN/(TN+FP)$, is about how sure we are when we call something noCIMP. If our method is not very specific, we would call many patients CIMP, while in fact, they did not have the subtype.
In the sick/healthy context, a highly specific method will be good at not calling healthy people sick.\index{specificity} -An alternative to accuracy we showed earlier is "balanced accuracy". Accuracy does not perform well when classes have very different number of samples (class imbalance). For example, if you have 90 CIMP cases and 10 noCIMP cases, classifying all the samples as CIMP gives 0.9 accuracy score by default. Using the "balanced accuracy" metric can help in such situations. This is simply $(Precision+Sensitivity)/2$. In this case above with the class imbalance scenario, the "balanced accuracy" would be 0.5. Another metric that takes into account accuracy that could be generated by chance is "Kappa statistic" or "Cohen's Kappa". This metric includes expected accuracy which is affected by class imbalance in the training set and provides a metric corrected by that. +An alternative to accuracy we showed earlier is "balanced accuracy". Accuracy does not perform well when classes have very different numbers of samples (class imbalance). For example, if you have 90 CIMP cases and 10 noCIMP cases, classifying all the samples as CIMP gives 0.9 accuracy score by default. Using the "balanced accuracy" metric can help in such situations. This is simply $(Sensitivity+Specificity)/2$. In the class imbalance scenario above, sensitivity would be 1 but specificity would be 0, so the "balanced accuracy" would be 0.5. Another metric that takes into account accuracy that could be generated by chance is the "Kappa statistic" or "Cohen's Kappa". This metric includes expected accuracy, which is affected by class imbalance in the training set and provides a metric corrected by that. -In the k-NN example above, we trained and tested on the same data. The model returned the predicted labels for our training. We can calculate the accuracy metrics using `caret::confusionMatrix()` function. This is sometimes what it is called training accuracy. If you take $1-accuracy$, it will be the "training error". +In the k-NN example above, we trained and tested on the same data. The model returned the predicted labels for our training. We can calculate the accuracy metrics using the `caret::confusionMatrix()` function. This is sometimes called training accuracy. If you take $1-accuracy$, it will be the "training error". ```{r knnConfusionMatrix} # get k-NN prediction on the training data itself, with k=5 @@ -290,16 +290,16 @@ testPred=predict(knnFit,testing[,-1],type="class") confusionMatrix(data=testing[,1],reference=testPred) ``` -Test set accuracy is not as good as the training accuracy, which is usually the case. That is why the best way to evaluate performance is to use test data that is not used by the model for training. That gives you an idea about real world performance where the model will be used to predict data that is not previously seen. +Test set accuracy is not as good as the training accuracy, which is usually the case. That is why the best way to evaluate performance is to use test data that is not used by the model for training. That gives you an idea about real-world performance where the model will be used to predict data that is not previously seen. -### Receiver Operating Characteristic (ROC) Curves -One important and popular metric when evaluating performance is looking at receiver operating characteristic (ROC) curves.\index{receiver operating characteristic (ROC) curve} The ROC curve is created by evaluating the class probabilities for the model across a continuum of thresholds.
Typically, in the case of two class classification the methods return a probability for one of the classes. If that probability is higher than 0.5, you call the label as, for example, class A. If less than 0.5, we call the label class B. However, we can move that threshold and change what we call class A or B. For each candidate threshold, the resulting the sensitivity and 1-specificity are plotted against each other. The best possible prediction would result a point in the upper left corner, representing 100% sensitivity (no false negatives) and 100% specificity (no false positives). For the best model, the curve will be almost like a square. Since this is an important information, area under the curve (AUC) is \index{area under the curve (AUC)}calculated. This is a quantity between 0 and 1, and closer to 1 better the performance of your classifier in terms of sensitivity and specificity. For an uninformative classification model AUC will be 0.5. Although, ROC curves are initially designed fro two class problems, later extensions made it possible to use ROC curves for multi-class problems. +### Receiver Operating Characteristic (ROC) curves +One important and popular metric when evaluating performance is looking at receiver operating characteristic (ROC) curves.\index{receiver operating characteristic (ROC) curve} The ROC curve is created by evaluating the class probabilities for the model across a continuum of thresholds. Typically, in the case of two-class classification, the methods return a probability for one of the classes. If that probability is higher than $0.5$, you call the label, for example, class A. If less than $0.5$, we call the label class B. However, we can move that threshold and change what we call class A or B. For each candidate threshold, the resulting sensitivity and 1-specificity are plotted against each other. The best possible prediction would result in a point in the upper left corner, representing 100% sensitivity (no false negatives) and 100% specificity (no false positives). For the best model, the curve will be almost like a square. Since this is important information, area under the curve (AUC) is \index{area under the curve (AUC)}calculated. This is a quantity between 0 and 1, and the closer to 1, the better the performance of your classifier in terms of sensitivity and specificity. For an uninformative classification model, AUC will be $0.5$. Although, ROC curves are initially designed for two-class problems, later extensions made it possible to use ROC curves for multi-class problems. -ROC curves can also be used to determine alternate cutoffs for class probabilities for two class problems. However, this will always result in a trade-off between sensitivity and specificity. Sometimes it might be desirable to limit number of false positives as making such mistakes would be too costly for the individual cases. For example, if predicted with a certain disease, you might be recommended to go under surgery. However, if your classifier has a relatively high false positive rate, low specificity, you might go under the surgery for no reason. Typically, you want your classification model to have high specificity and sensitivity which may not be always possible in the real world. You might have to choose what is more important for a specific problem and try to increase that. +ROC curves can also be used to determine alternate cutoffs for class probabilities for two-class problems. However, this will always result in a trade-off between sensitivity and specificity. 
Sometimes it might be desirable to limit the number of false positives because making such mistakes would be too costly for the individual cases. For example, if predicted with a certain disease, you might be recommended to have surgery. However, if your classifier has a relatively high false positive rate, low specificity, you might have surgery for no reason. Typically, you want your classification model to have high specificity and sensitivity, which may not always be possible in the real world. You might have to choose what is more important for a specific problem and try to increase that. -Next, we will show how to use ROC curves for our k-NN application\index{k-nearest neighbors (k-NN)}. The method requires classification probabilities in the format where 0 probability denotes class "noCIMP" and probability 1 denotes class "CIMP". This way ROC curve can be drawn by varying the probability cutoff for calling class "noCIMP" or "CIMP". Below we are getting a similar probability from k-NN but we have to transform it to the format we described above. Then, we feed those class probabilities to `pROC::roc()` function to calculate ROC curve and area-under-the-curve. The resulting ROC curve is shown in Figure \@ref(fig:ROC). -```{r ROC,message=FALSE,warning=FALSE,out.width='60%',fig.width=5,fig.cap="ROC curve for k-NN"} +Next, we will show how to use ROC curves for our k-NN application\index{k-nearest neighbors (k-NN)}. The method requires classification probabilities in the format where 0 probability denotes class "noCIMP" and probability 1 denotes class "CIMP". This way the ROC curve can be drawn by varying the probability cutoff for calling class a "noCIMP" or "CIMP". Below we are getting a similar probability from k-NN, but we have to transform it to the format we described above. Then, we feed those class probabilities to the `pROC::roc()` function to calculate the ROC curve and the area-under-the-curve. The resulting ROC curve is shown in Figure \@ref(fig:ROC). +```{r ROC,message=FALSE,warning=FALSE,out.width='60%',fig.width=5,fig.cap="ROC curve for k-NN."} library(pROC) # get k-NN class probabilities @@ -322,8 +322,8 @@ pROC::auc(rocCurve) ``` ## Model tuning and avoiding overfitting -How can we know we pick the best $k$? One straightforward way is that we can try many different $k$ values and check accuracy of our model. We will first check the effect of different $k$ on training accuracy. Below, we will go through many $k$ values and calculate the training accuracy for each. -```{r trainingErrork,out.width='60%',fig.width=5, fig.cap="Training error for k-NN classification of glioma tumor samples"} +How can we know that we picked the best $k$? One straightforward way is that we can try many different $k$ values and check the accuracy of our model. We will first check the effect of different $k$ values on training accuracy. Below, we will go through many $k$ values and calculate the training accuracy for each. +```{r trainingErrork,out.width='60%',fig.width=5, fig.cap="Training error for k-NN classification of glioma tumor samples."} set.seed(101) k=1:12 # set k values trainErr=c() # set vector for training errors @@ -348,10 +348,10 @@ plot(k,trainErr,type="p",col="#CC0000",pch=20) lines(loess.smooth(x=k, trainErr,degree=2),col="#CC0000") ``` -The resulting training error plot is shown in Figure \@ref(fig:trainingErrork). We can see the effect of $k$ in training error, as $k$ increases the model tends to a bit worse on training. 
This makes sense because with large $k$ we take into account more and more neighbors, at some point we start considering data points from the other classes as well and that decreases our accuracy. +The resulting training error plot is shown in Figure \@ref(fig:trainingErrork). We can see the effect of $k$ in the training error; as $k$ increases the model tends to be a bit worse on training. This makes sense because with large $k$ we take into account more and more neighbors, and at some point we start considering data points from the other classes as well and that decreases our accuracy. -However, looking at the training accuracy is not the right way to test the model as we have mentioned. The models are generally tested on the datasets that are not used when building model. There are different strategies to do this. We have already split part of our data set as test, so let us see how we do it on the test data using the code below. The resulting plot is shown in Figure \@ref(fig:testTrainErr). -```{r testTrainErr,out.width='60%',fig.width=5, fig.cap="Training and test error for k-NN classification of glioma tumor samples"} +However, looking at the training accuracy is not the right way to test the model as we have mentioned. The models are generally tested on the datasets that are not used when building model. There are different strategies to do this. We have already split part of our dataset as test set, so let us see how we do it on the test data using the code below. The resulting plot is shown in Figure \@ref(fig:testTrainErr). +```{r testTrainErr,out.width='60%',fig.width=5, fig.cap="Training and test error for k-NN classification of glioma tumor samples."} set.seed(31) k=1:12 @@ -387,44 +387,45 @@ legend("bottomright",fill=c("#CC0000","#00CC66"), ``` -The test data shows a different thing of course. It is not the best strategy to increase the $k$ indefinitely. The test error rate increases after a while. Increasing $k$ results in too many data points influencing the decision about the class of the new sample, this may not be desirable since it this strategy might include points from other classes eventually. On the other hand, if we set $k$ too low, we are restricting the model to only look for few neighbors. +The test data show a different thing, of course. It is not the best strategy to increase the $k$ indefinitely. The test error rate increases after a while. Increasing $k$ results in too many data points influencing the decision about the class of the new sample, this may not be desirable since this strategy might include points from other classes eventually. On the other hand, if we set $k$ too low, we are restricting the model to only look for a few neighbors. -In addition, $k$ values that gives the best performance for the training set is not the best $k$ for the test set.In fact, if we stick with $k=1$ as the best $k$ obtained from the training set, we would obtain a worse performance on the test set. In this case, we can talk about the concept of overfitting. This happens when our models are fitting the data in the training set extremely well but can not perform well in the test data, in other words they can not generalize. Similarly, underfitting could occur when our models do not learn well from the training data and they are overly simplistic. Ideally, we should use methods that helps us estimate the real test error when tuning the models such as cross-validation, bootstrap or holdout test set. 
+In addition, $k$ values that give the best performance for the training set are not the best $k$ for the test set. In fact, if we stick with $k=1$ as the best $k$ obtained from the training set, we would obtain a worse performance on the test set. In this case, we can talk about the concept of overfitting. This happens when our models fit the data in the training set extremely well but cannot perform well in the test data; in other words, they cannot generalize. Similarly, underfitting could occur when our models do not learn well from the training data and they are overly simplistic. Ideally, we should use methods that help us estimate the real test error when tuning the models such as cross-validation, bootstrap or holdout test set. ### Model complexity and bias variance trade-off -The case of over- and underfitting is closely related to the model complexity and the related bias-variance trade-off.\index{overfitting} We will introduce these concepts now. First, let us point out that prediction error depends on the real value of the class label of the test case and predicted value. The test case label or value is not dependent on the prediction, the only thing that is variable here is the model. Therefore, if we could train multiple models with different data sets for the same problem, our predictions for the test set would vary. That means, our prediction error would also vary. Now, with this setting we can talk about expected prediction error for a given machine learning model. This is the average error you would get for a test set if you were able to train multiple models. This expected prediction error can largely be decomposed into the variability of the predictions due to the model variability (Variance) and the difference between the expected prediction values and the correct value of the response (Bias). Formally, the expected prediction error,$E[Error]$ is decomposed as follows: +The case of over- and underfitting is closely related to the model complexity and the related bias-variance trade-off.\index{overfitting} We will introduce these concepts now. First, let us point out that prediction error depends on the real value of the class label of the test case and predicted value. The test case label or value is not dependent on the prediction; the only thing that is variable here is the model. Therefore, if we could train multiple models with different data sets for the same problem, our predictions for the test set would vary. That means our prediction error would also vary. Now, with this setting we can talk about expected prediction error for a given machine learning model. This is the average error you would get for a test set if you were able to train multiple models. This expected prediction error can largely be decomposed into the variability of the predictions due to the model variability (variance) and the difference between the expected prediction values and the correct value of the response (bias). Formally, the expected prediction error, $E[Error]$ is decomposed as follows: + $$ E[Error]=Bias^2 + Variance + \sigma_e^2 $$ -Note that in the above equation $\sigma_e^2$ is the irreducible error. This is the noise term that cannot fundamentally be accounted by any model. The bias is formally the difference between the expected prediction value and the correct response value, $Y$: $Bias=(Y-E[PredictedValue])$. 
The variance is simply the variability of the prediction values when we construct models multiple times with different training sets for the same problem: $Variance=E[(PredictedValue-E[PredictedValue])^2]$. Note that this the value of the variance does not depend of the correct value of the test cases. +Note that in the above equation $\sigma_e^2$ is the irreducible error. This is the noise term that cannot fundamentally be accounted for by any model. The bias is formally the difference between the expected prediction value and the correct response value, $Y$: $Bias=(Y-E[PredictedValue])$. The variance is simply the variability of the prediction values when we construct models multiple times with different training sets for the same problem: $Variance=E[(PredictedValue-E[PredictedValue])^2]$. Note that this value of the variance does not depend on the correct value of the test cases. -The models that have high variance are generally more complex models that have many knobs or parameters than can fit the training data well. These models, due to their flexibility, can fit training data too much that it creates poor prediction performance in a new data set. On the other hand, simple, less complex models do not have the flexibility to fit every data set that well, so they can avoid overfitting. However, they can underfit if they are not flexible enough to model or at least approximate the true relationship between predictors and the response variable. The bias term is mostly about the general model performance (expected or average value of predictions ) that can be attributed to approximating a real life problem with simpler models. These simple models can have less variability in their predictions, then the prediction error will be mostly composed of the bias term. +The models that have high variance are generally more complex models that have many knobs or parameters that can fit the training data well. These models, due to their flexibility, can fit the training data so closely that this creates poor prediction performance in a new data set. On the other hand, simple, less complex models do not have the flexibility to fit every data set that well, so they can avoid overfitting. However, they can underfit if they are not flexible enough to model or at least approximate the true relationship between predictors and the response variable. The bias term is mostly about the general model performance (expected or average value of predictions) that can be attributed to approximating a real-life problem with simpler models. These simple models can have less variability in their predictions, so the prediction error will be mostly composed of the bias term. -In reality, there is always a trade-off between bias and variance (See Figure \@ref(fig:varBias)). Increasing the variance with complex models will decrease the bias but that might overfit. Conversely, simple models will increase the bias in the expense of the model variance and that might underfit. There is an optimal point for model complexity, a balance between overfitting and underfitting.\index{overfitting} \index{underfitting} In practice, there is no analytical way to find this optimal complexity. Instead we must use an accurate measure of prediction error and explore different levels of model complexity and choose the complexity level that minimizes the overall error. Another approach to this is to use "The one standard error rule".
Instead of choosing the parameter that minimizes the error estimate, we can choose the simplest model whose error estimate is within one standard error of the best model (see chapter 7 of [@friedman2001elements]). The rationale behind that is to choose a simple model with the hope that it would perform better in the unseen data since its performance is not different from the best model in a statistically significant way. You might see the option to choose "one-standard-error" model in some machine learning packages. +In reality, there is always a trade-off between bias and variance (See Figure \@ref(fig:varBias)). Increasing the variance with complex models will decrease the bias, but that might overfit. Conversely, simple models will increase the bias at the expense of the model variance, and that might underfit. There is an optimal point for model complexity, a balance between overfitting and underfitting.\index{overfitting} \index{underfitting} In practice, there is no analytical way to find this optimal complexity. Instead we must use an accurate measure of prediction error and explore different levels of model complexity and choose the complexity level that minimizes the overall error. Another approach to this is to use "the one standard error rule". Instead of choosing the parameter that minimizes the error estimate, we can choose the simplest model whose error estimate is within one standard error of the best model (see Chapter 7 of [@friedman2001elements]). The rationale behind that is to choose a simple model with the hope that it would perform better in the unseen data since its performance is not different from the best model in a statistically significant way. You might see the option to choose the "one-standard-error" model in some machine learning packages. -```{r,varBias,fig.cap="Variance-Bias trade-off visualized as components of total prediction error in relation to model complexity",fig.align = 'center',out.width='80%',echo=FALSE} +```{r,varBias,fig.cap="Variance-bias trade-off visualized as components of total prediction error in relation to model complexity.",fig.align = 'center',out.width='80%',echo=FALSE} knitr::include_graphics("images/Variance-bias.png" ) ``` -In our k-NN example\index{k-nearest neighbors (k-NN)}, lower $k$ values creates a more flexible model. This might be counter intuitive but as we have explained before having small $k$ values will fit the data in very data specific manner. It will probably not generalize well. Therefore in this respect, lower $k$ values will result in more complex models with high variance\index{model complexity}. On the other hand, higher $k$ values will result in less variance but higher bias. Figure \@ref(fig:kNNboundary) shows the decision boundary for two different k-NN models with $k=2$ and $k=12$, to be able to plot this in 2D we ran the model on principal component 1 and 2 of the training data set, and predicted the class label of many points in this 2D space. As you can see, $k=2$ creates a more variable model which tries aggressively to include all training samples in the correct class. This creates a high variance model because the model could change drastically from data set to the data set. On the other hand, setting $k=12$ creates a model with a smoother decision boundary. This model will have less variance since it considers many points for a decision, therefore the decision boundary is smoother. +In our k-NN example\index{k-nearest neighbors (k-NN)}, lower $k$ values create a more flexible model. 
This might be counterintuitive, but as we have explained before having small $k$ values will fit the data in a very data-specific manner. It will probably not generalize well. Therefore in this respect, lower $k$ values will result in more complex models with high variance\index{model complexity}. On the other hand, higher $k$ values will result in less variance but higher bias. Figure \@ref(fig:kNNboundary) shows the decision boundary for two different k-NN models with $k=2$ and $k=12$. To be able to plot this in 2D we ran the model on principal component 1 and 2 of the training data set, and predicted the class label of many points in this 2D space. As you can see, $k=2$ creates a more variable model which tries aggressively to include all training samples in the correct class. This creates a high-variance model because the model could change drastically from dataset to dataset. On the other hand, setting $k=12$ creates a model with a smoother decision boundary. This model will have less variance since it considers many points for a decision, and therefore the decision boundary is smoother. -```{r,kNNboundary,fig.cap="Decision boundary for different k values in k-NN models. k=12 creates a smooth decision boundary and ignores certain data points on either side of the boundary. k=2 is less smooth and more variable",fig.align = 'center',out.width='70%',echo=FALSE} +```{r,kNNboundary,fig.cap="Decision boundary for different k values in k-NN models. k=12 creates a smooth decision boundary and ignores certain data points on either side of the boundary. k=2 is less smooth and more variable.",fig.align = 'center',out.width='70%',echo=FALSE} knitr::include_graphics("images/knnDecisionBoundPCA.png" ) ``` ### Data split strategies for model tuning and testing The data split strategy is essential for accurate prediction of the test error. As we have seen in the model complexity/bias-variance discussion\index{model complexity}, estimating the prediction error is central for model tuning in order to find the model with the right complexity. Therefore, we will revisit this and show how to build and test models, and measure their prediction error in practice. -#### training-validation-test -This data split strategy is creates three partitions of the data set, training, validation and test sets. In this strategy, training set is used to train the data and the validation set is used to tune the model to the best possible model. The final partition, "test", is only used for the final test and should not be used to tune the model, this is regarded as the real world prediction error for your model. This strategy works when you a lot of data to do a three way split. The test set we used above is most likely too small to measure the prediction error with just using a test set. In such cases, bootstrap or cross-validation should yield more stable results. +#### Training-validation-test +This data split strategy creates three partitions of the dataset, training, validation, and test sets. In this strategy, the training set is used to train the data and the validation set is used to tune the model to the best possible model. The final partition, "test", is only used for the final test and should not be used to tune the model. This is regarded as the real-world prediction error for your model. This strategy works when you have a lot of data to do a three-way split. The test set we used above is most likely too small to measure the prediction error with just using a test set. 
In such cases, bootstrap or cross-validation should yield more stable results. -#### cross-validation -A more realistic approach when you do not have a lot of data to do the three way split is cross-validation. You can use\index{cross-validation} cross-validation in the model tuning phase as well, instead of going a single train-validation split. As with the three-way split, the final prediction error could be estimated with the test set. In other words, we can separate 80% of the data for model building with cross-validation, and the final model performance will be measured on the test set. +#### Cross-validation +A more realistic approach when you do not have a lot of data to do the three-way split is cross-validation. You can use\index{cross-validation} cross-validation in the model-tuning phase as well, instead of going with a single train-validation split. As with the three-way split, the final prediction error could be estimated with the test set. In other words, we can separate 80% of the data for model building with cross-validation, and the final model performance will be measured on the test set. -We have already split our glioma data set into training and test set. Now, we will show how to use run a k-NN \index{k-nearest neighbors (k-NN)}model with cross-validation using `caret::train()` function. This function will use cross-validation to train models for different $k$ values. Every $k$ value will be trained and tested with cross-validation to estimate prediction performance for each $k$. We will then plot the cross-validation error and the resulting plot is shown in Figure \@ref(fig:kknCv). -```{r, kknCv,eval=TRUE,out.width='60%',fig.width=5,fig.cap="Cross-validated estimate of Prediction error of k in k-NN models"} +We have already split our glioma dataset into training and test sets. Now, we will show how to run a k-NN \index{k-nearest neighbors (k-NN)}model with cross-validation using the `caret::train()` function. This function will use cross-validation to train models for different $k$ values. Every $k$ value will be trained and tested with cross-validation to estimate prediction performance for each $k$. We will then plot the cross-validation error and the resulting plot is shown in Figure \@ref(fig:kknCv). +```{r, kknCv,eval=TRUE,out.width='60%',fig.width=5,fig.cap="Cross-validated estimate of prediction error of k in k-NN models."} set.seed(17) # this method controls everything about training # we will just set up 10-fold cross validation @@ -446,7 +447,7 @@ lines(loess.smooth(x=1:12,1-knn_fit$results[,2],degree=2), col="#CC0000") ``` -Based on figure \@ref(fig:kknCv) the cross-validation accuracy reveals that $k=5$ is the best $k$ value. On the other hand, we can also try bootstrap resampling and check the prediction error that way. We will again use `caret::trainControl()` function to do the bootstrap sampling and estimate OOB based error. However, for small number of samples like we have in our example the difference between the estimated and the true value of the prediction error can be large. Below we show how to use bootstrapping for k-NN model. +Based on Figure \@ref(fig:kknCv) the cross-validation accuracy reveals that $k=5$ is the best $k$ value. On the other hand, we can also try bootstrap resampling and check the prediction error that way. We will again use the `caret::trainControl()` function to do the bootstrap sampling and estimate OOB-based error. 
However, for a small number of samples like we have in our example, the difference between the estimated and the true value of the prediction error can be large. Below we show how to use bootstrapping for the k-NN model. ```{r knnboot,eval=FALSE,out.width='60%',fig.width=5,fig.cap="bootstrap estimate of Prediction error of k in k-NN models"} set.seed(17) # this method controls everything about training @@ -467,14 +468,14 @@ knn_fit <- train(subtype~., data = training, ## Variable importance Another important purpose of machine learning models could be to learn which variables are more important for the prediction. This information could lead to potential biological insights or could help design better data collection methods or experiments.\index{variable importance} -Variable importance metrics can be separated into two groups: those that are model dependent and those are not. Many machine-learning methods comes with built-in variable importance measures. These may be able to incorporate the correlation structure between the predictors into the importance calculation. Model independent methods are not able to use any internal model data. We will go over some model independent strategies below. The model dependent importance measures will be mentioned when we introduce machine learning methods that have built-in variable importance measures. +Variable importance metrics can be separated into two groups: those that are model dependent and those that are not. Many machine-learning methods come with built-in variable importance measures. These may be able to incorporate the correlation structure between the predictors into the importance calculation. Model-independent methods are not able to use any internal model data. We will go over some model-independent strategies below. The model-dependent importance measures will be mentioned when we introduce machine learning methods that have built-in variable importance measures. -One simple method for variable importance is to correlate or apply statistical tests to test the association with the predictor variable with the response variable. Variables can be ranked based on the strength of those associations. For classification problems, ROC curves can be computed by thresholding the predictor variable, and for each variable an AUC can be computed. The variables can be ranked based on these values. However, these methods completely ignores how variables would behave in the presence of other variables. `caret::filterVarImp()` function implements some of these strategies. +One simple method for variable importance is to correlate or apply statistical tests to test the association of the predictor variable with the response variable. Variables can be ranked based on the strength of those associations. For classification problems, ROC curves can be computed by thresholding the predictor variable, and for each variable an AUC can be computed. The variables can be ranked based on these values. However, these methods completely ignore how variables would behave in the presence of other variables. The `caret::filterVarImp()` function implements some of these strategies. -If a variable important for prediction removing that variable before model training will cause a drop in performance. With this understanding, we can remove the variables one by one and train models without them and rank them by the loss of performance. The most important variables must cause the largest loss of performance. 
This strategy requires training and testing models as many times as the number of predictor variables. This will consume a lot of time. A related but more practical approach have been put forward to measure variable importance in a model independent manner but without re-training [@dalex; @mcr]. In this case, instead of removing the variables at training, variables are permuted at the test phase. The loss in prediction performance is calculated by comparing the labels/values from the original response variable to the labels/values obtained by running the permuted test data through the model. This is called "variable dropout loss". In this case, we are not really dropping out variables but by permuting them we destroy their relationship to the response variable. The dropout loss is compared to the "worst case" scenario where response variable is permuted and compared against the original response variables, this is called "baseline loss". The algorithm ranks the variables by their variable dropout loss or by their ratio of variable dropout to baseline loss. Both quantities are proportional but the second one contains information about the baseline loss. +If a variable is important for prediction, removing that variable before model training will cause a drop in performance. With this understanding, we can remove the variables one by one and train models without them and rank them by the loss of performance. The most important variables must cause the largest loss of performance. This strategy requires training and testing models as many times as the number of predictor variables. This will consume a lot of time. A related but more practical approach has been put forward to measure variable importance in a model-independent manner but without re-training [@dalex; @mcr]. In this case, instead of removing the variables at training, variables are permuted at the test phase. The loss in prediction performance is calculated by comparing the labels/values from the original response variable to the labels/values obtained by running the permuted test data through the model. This is called "variable dropout loss". In this case, we are not really dropping out variables, but by permuting them, we destroy their relationship to the response variable. The dropout loss is compared to the "worst case" scenario where the response variable is permuted and compared against the original response variables, which is called "baseline loss". The algorithm ranks the variables by their variable dropout loss or by their ratio of variable dropout to baseline loss. Both quantities are proportional but the second one contains information about the baseline loss. \index{variable importance} -Below, we are running `DALEX::explain()` function to do the permutation drop-out strategy for the variables. The function needs the machine learning model, and new data and its labels to do the permutation-based dropout strategy. In this case, we are feeding the function with the data we used for training. -For visualization we can use `DALEX::feature_importance()` function which plots the loss. Although, in this case we are not plotting the results. In following sections, we will discuss method specific variable importance measures. +Below, we run the `DALEX::explain()` function to do the permutation drop-out strategy for the variables. The function needs the machine learning model, and new data and its labels to do the permutation-based dropout strategy. In this case, we are feeding the function with the data we used for training. 
+For visualization we can use the `DALEX::feature_importance()` function, which plots the loss. In this case, however, we are not plotting the results. In the following sections, we will discuss method-specific variable importance measures. ```{r dalex,eval=FALSE,out.width='50%',fig.cap="Variable importance as loss from variable drop out."} library(DALEX) set.seed(102) @@ -492,41 +493,41 @@ plot(viknn) ``` Although the variable drop-out strategy will still be slow if you have a lot of variables, the upside is that you can use any black-box model as long as you have access to the model to run new predictions. Later sections in this chapter will show methods with built-in variable importance metrics; since these are calculated during training, they come with less of an additional compute cost. ## How to deal with class imbalance -A common hurdle in many applications of machine learning on genomic data is the large class imbalance. The imbalance refers to relative difference in the sizes of the groups being classified. For example, if we had class imbalance in our example data set we could have much more CIMP samples in the training than noCIMP samples, or the other way around. Another example with severe class imbalance would be enhancer prediction [@enhancerImbalance]. Depending on which training data set you use you can have a couple of hundred to thousands of positive examples for enhancer locations in the human genome. In either case, the negative set, "not enhancer", set will overwhelm the training, because the human genome is 3 billion base-pairs long and most of that do not overlap with an enhancer annotation. In whatever strategy you pick to build a negative set it will contain much more data points than the positive set. As we have mentioned in the model performance section above, if we have a severe imbalance in the class sizes, the training algorithm may get better accuracy just by calling everything one class. This will be evident in specificity and sensitivity metrics, and the related balanced accuracy metric. Below, we will discuss a couple of techniques that might help when the training set has class imbalance. +A common hurdle in many applications of machine learning on genomic data is the large class imbalance. The imbalance refers to the relative difference in the sizes of the groups being classified. For example, if we had class imbalance in our example data set we could have many more CIMP samples in the training set than noCIMP samples, or the other way around. Another example with severe class imbalance would be enhancer prediction [@enhancerImbalance]. Depending on which training data set you use, you can have a couple of hundred to thousands of positive examples for enhancer locations in the human genome. In either case, the negative set, "not enhancer", will overwhelm the training, because the human genome is 3 billion base-pairs long and most of that does not overlap with an enhancer annotation. In whatever strategy you pick to build a negative set, it will contain many more data points than the positive set. As we have mentioned in the model performance section above, if we have a severe imbalance in the class sizes, the training algorithm may get better accuracy just by calling everything one class. This will be evident in specificity and sensitivity metrics, and the related balanced accuracy metric. Below, we will discuss a couple of techniques that might help when the training set has class imbalance.
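To make the point above concrete, here is a minimal sketch, using simulated labels that are not part of the chapter's data, of how plain accuracy can look good on an imbalanced test set while sensitivity and balanced accuracy from `caret::confusionMatrix()` reveal the problem. The 90/10 class split and the "predict everything as the majority class" classifier are made up purely for illustration.

```{r imbalanceMetricsSketch, eval=FALSE}
library(caret)

# hypothetical imbalanced test set: 10 CIMP vs 90 noCIMP samples
truth <- factor(c(rep("CIMP", 10), rep("noCIMP", 90)),
                levels = c("CIMP", "noCIMP"))

# a lazy classifier that calls every sample the majority class
pred <- factor(rep("noCIMP", 100), levels = c("CIMP", "noCIMP"))

cm <- confusionMatrix(data = pred, reference = truth)
cm$overall["Accuracy"]      # 0.9, looks impressive
cm$byClass[c("Sensitivity", # 0, the minority class is never found
             "Specificity",
             "Balanced Accuracy")] # (0 + 1)/2 = 0.5
```

The same check can be run on real predictions by swapping in the actual test labels and predicted classes.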
### Sampling for class balance -If we think class imbalance is a problem based on looking at the relative sizes of the classes, and relevant accuracy metrics of a model there are a couple of things that might help. First, we can try sampling or "stratified" sampling when we are constructing our training set. This simply means that before training we can we build the classification model with samples of the data so we have the same size classes. This could be down-sampling the classes with too many data points. For this purpose, you can simply use `sample()` or `caret::downSample()` function and create your training set prior to modeling. In addition, minority class could be up-sampled for the missing number of data points using sampling with replacement similar to bootstrap sampling with `caret::upSample()` function. There are more advanced up-sampling methods such as synthetic up-sampling method SMOTE [@smote]. In this method, each data point from the minority class is up-sampled synthetically by adding variability to the predictor variable vector from the one of the k-nearest neighbors of the data point. Specifically, one neighbor is randomly chosen and the difference between predictor variables of the neighbor and the original data point is added to the original predictor variables after multiplying the difference values with a random number between 0 and 1. This creates synthetic data points data are similar to original data points but not identical. This method and other similar methods of synthetic sampling is available at [`smotefamily`](https://cran.r-project.org/web/packages/smotefamily/index.html) package \index{R Packages!\texttt{smotefamily}} in CRAN. +If we think class imbalance is a problem based on looking at the relative sizes of the classes and relevant accuracy metrics of a model, there are a couple of things that might help. First, we can try sampling or "stratified" sampling when we are constructing our training set. This simply means that before training we can build the classification model with samples of the data so we have the same size classes. This could be down-sampling the classes with too many data points. For this purpose, you can simply use the `sample()` or `caret::downSample()` function and create your training set prior to modeling. In addition, the minority class could be up-sampled for the missing number of data points using sampling with replacement similar to bootstrap sampling with the `caret::upSample()` function. There are more advanced up-sampling methods such as the synthetic up-sampling method SMOTE [@smote]. In this method, each data point from the minority class is up-sampled synthetically by adding variability to the predictor variable vector from one of the k-nearest neighbors of the data point. Specifically, one neighbor is randomly chosen and the difference between predictor variables of the neighbor and the original data point is added to the original predictor variables after multiplying the difference values with a random number between $0$ and $1$. This creates synthetic data points that are similar to original data points but not identical. This method and other similar methods of synthetic sampling are available in the [`smotefamily`](https://cran.r-project.org/web/packages/smotefamily/index.html) package \index{R Packages!\texttt{smotefamily}} on CRAN.
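As a quick illustration of the down- and up-sampling calls mentioned above, here is a minimal sketch on a simulated, deliberately imbalanced data frame; the toy class sizes and predictor columns are invented for this example and are not the chapter's training data.

```{r classBalanceSamplingSketch, eval=FALSE}
library(caret)

set.seed(42)
# hypothetical imbalanced training data: 20 CIMP vs 80 noCIMP samples
toyPredictors <- data.frame(gene1 = rnorm(100),
                            gene2 = rnorm(100))
toyClass <- factor(c(rep("CIMP", 20), rep("noCIMP", 80)))

# down-sample the majority class so both classes end up with 20 samples
down <- downSample(x = toyPredictors, y = toyClass)
table(down$Class)

# up-sample the minority class with replacement so both classes have 80
up <- upSample(x = toyPredictors, y = toyClass)
table(up$Class)
```

Both functions return a data frame with the class labels in a `Class` column, which can then be passed to the model training functions in place of the original imbalanced set.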
In addition to the strategies above, some methods can do sampling during training to cope with the effects of class imbalance. For example, random forests have a sampling step during training, and this step can be altered to do stratified sampling. We will be introducing random forests later in the chapter. -However, even if we are doing the sampling on the training set to avoid problems. The test set proportions should have original class label proportions to evaluate the performance in a real world situation. +However, even if we are doing the sampling on the training set to avoid problems, the test set should keep the original class label proportions to evaluate the performance in a real-world situation. ### Altering case weights -For some methods, we can use different case weights proportional to the imbalance suffered by the minority class. This means cases from minority class will have higher case weights, this causes an effect as if we are up-sampling the minority class. Logistic regression based methods and boosting methods are examples of algorithms that can utilize case weights. Both of which will be introduced later. +For some methods, we can use different case weights proportional to the imbalance suffered by the minority class. This means cases from the minority class will have higher case weights, which causes an effect as if we are up-sampling the minority class. Logistic regression-based methods and boosting methods are examples of algorithms that can utilize case weights, both of which will be introduced later. -### selecting different classification score cutoffs -Another simple approach for dealing with class imbalance is to select a prediction score cutoff that minimizes the excess true positives or false positives depending on the direction of the class imbalance. This can simply be done using ROC curves. For example, classical prediction cutoff for 2-class classification problems is 0.5. We can alter this cutoff to optimize sensitivity and specificity. +### Selecting different classification score cutoffs +Another simple approach for dealing with class imbalance is to select a prediction score cutoff that minimizes the excess true positives or false positives depending on the direction of the class imbalance. This can simply be done using ROC curves. For example, the classical prediction cutoff for a 2-class classification problem is 0.5. We can alter this cutoff to optimize sensitivity and specificity. ## Dealing with correlated predictors -Highly correlated predictors can lead to collinearity issues and this can greatly increase the model variance especially in the context of regression. In some cases, there could be relationships between multiple predictor variables and this is called multicollinearity. Having correlated variables will result in unnecessarily complex models with more than necessary predictor variables. From a data collection point of view, spending time and money of collecting correlated variables could be a waste of effort. In terms of linear regression or the models that are based on regression, the collinearity problem is more severe because it creates unstable models where statistical inference becomes difficult or unreliable. On the other hand, correlation between variables may not be a problem for the predictive performance if the correlation structure in the training and the future tests data sets are the same. However, more often correlated structures within the training set might lead to overfitting. +Highly correlated predictors can lead to collinearity issues and this can greatly increase the model variance, especially in the context of regression.
In some cases, there could be relationships between multiple predictor variables and this is called multicollinearity. Having correlated variables will result in unnecessarily complex models with more predictor variables than necessary. From a data collection point of view, spending time and money for collecting correlated variables could be a waste of effort. In terms of linear regression or the models that are based on regression, the collinearity problem is more severe because it creates unstable models where statistical inference becomes difficult or unreliable. On the other hand, correlation between variables may not be a problem for the predictive performance if the correlation structure in the training and the future test data sets are the same. However, more often, correlated structures within the training set might lead to overfitting. Here are a couple of things to do if collinearity \index{collinearity} is a problem: - We can do PCA on the training data which creates new variables removing the collinearity between them. We can then training models on these new dimensions. Downside is it is harder to interpret these variables. They are now linear combinations of original variables. The variable importance would be harder to interpret. +- We can do PCA on the training data, which creates new variables removing the collinearity between them. We can then train models on these new dimensions. The downside is that it is harder to interpret these variables. They are now linear combinations of original variables. The variable importance would be harder to interpret (see the short sketch after this list). -- As we have already shown in the data preprocessing section, we can try variable filtering and reduce the number of correlated variables. However, this will may not address the multicollinearity issue where linear combinations of variables might be correlated while themselves are not directly correlated. +- As we have already shown in the data preprocessing section, we can try variable filtering and reduce the number of correlated variables. However, this may not address the multicollinearity issue where linear combinations of variables might be correlated while they are not directly correlated themselves. -- Method specific techniques such as regularization \index{regularization}can decrease the effects of collinearity. Regularization as we will see in the later chapter is a technique that is used to prevent overfitting and it can also dampen the effects of collinearity. In addition, decision tree based methods could suffer less from the effects of collinearity. +- Method-specific techniques such as regularization \index{regularization}can decrease the effects of collinearity. Regularization, as we will see in the later chapter, is a technique that is used to prevent overfitting and it can also dampen the effects of collinearity. In addition, decision-tree-based methods could suffer less from the effects of collinearity.
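Here is a minimal sketch of the first two options on simulated data; the toy predictors, the 0.9 correlation cutoff, and the variable names are our own choices for illustration and not part of the chapter's data set. `caret::preProcess()` with `method = "pca"` builds decorrelated components, and `caret::findCorrelation()` flags members of highly correlated pairs for removal.

```{r collinearitySketch, eval=FALSE}
library(caret)

set.seed(7)
# simulated predictors: gene2 is almost an exact copy of gene1
toy <- data.frame(gene1 = rnorm(50), gene3 = rnorm(50))
toy$gene2 <- toy$gene1 + rnorm(50, sd = 0.01)

# option 1: replace the predictors with principal components
pcaProc <- preProcess(toy, method = "pca")
toyPCs  <- predict(pcaProc, toy)
head(toyPCs)

# option 2: filter out one member of each highly correlated pair
highCor     <- findCorrelation(cor(toy), cutoff = 0.9)
toyFiltered <- toy[, -highCor, drop = FALSE]
colnames(toyFiltered)
```

In a real analysis these steps would be applied to the training predictors only, and the same transformation would then be applied to the test set.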
## Trees and forests: Random forests in action -### decision trees -Decision trees are a popular method for various machine learning tasks mostly because their interpretability is very high. A decision tree is a series of filters on the predictor variables. The series of filters end up in a class prediction. Each filter is binary a yes/no question, this creates bifurcations in the series of filters thus leading to a tree like structure. The filters are dependent on the type of the predictor variables. If the variables are categorical such as gender then the filters could be "is gender female" type of questions. If the variables are continuous such as gene expression, the filter could be "is PIGX expression larger than 210?". Every point where we filter samples based on these questions are called "decision nodes". The tree fitting algorithm finds the best variables at decision nodes depending on how well they split the samples into classes after the application of the decision node. Decision trees handle both categorical and numeric predictor variables, they are easy to interpret and they can deal with missing variables. Despite their advantages, decision trees tend to overfit if they are grown very deep and can learn irregular patterns. +### Decision trees +Decision trees are a popular method for various machine learning tasks mostly because their interpretability is very high. A decision tree is a series of filters on the predictor variables. The series of filters end up in a class prediction. Each filter is a binary yes/no question, which creates bifurcations in the series of filters, thus leading to a treelike structure. The filters are dependent on the type of predictor variables. If the variables are categorical, such as gender, then the filters could be "is gender female" type of questions. If the variables are continuous, such as gene expression, the filter could be "is PIGX expression larger than 210?". Every point where we filter samples based on these questions is called a "decision node". The tree-fitting algorithm finds the best variables at decision nodes depending on how well they split the samples into classes after the application of the decision node. Decision trees handle both categorical and numeric predictor variables, they are easy to interpret, and they can deal with missing variables. Despite their advantages, decision trees tend to overfit if they are grown very deep and can learn irregular patterns. \index{decision tree}
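To make the idea of decision nodes concrete, the sketch below fits a single classification tree with the `rpart` package on the built-in `iris` data; the choice of package and data set is ours purely for illustration and is unrelated to the tumor subtype example. Printing the fitted object lists the yes/no filters at each decision node and the class predicted at each leaf.

```{r decisionTreeSketch, eval=FALSE}
library(rpart)

# fit a single classification tree on a built-in data set
treeFit <- rpart(Species ~ ., data = iris, method = "class")

# each printed line is a decision node: a filter on one predictor,
# the number of samples reaching it, and the predicted class
print(treeFit)

# draw the tree structure
plot(treeFit, margin = 0.1)
text(treeFit, use.n = TRUE)
```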
-There are many variants of tree based machine learning algorithms. However, most algorithms construct decision nodes in a top down manner. They select the best variables to use in decision nodes based on how homogeneous the sample sets are after the split. One measure of homogeneity is "Gini impurity". This measure is calculated for each subset after the split and later summed up as weighted average. For a decision node that splits the data perfectly in a two class problem, the gini impurity will be 0, and for a node that splits the data to a subset that that has 50% class 2 and 50% class two the impurity will be 0.5. Formally, gini impurity,${I}_{G}(p)$, of a set of samples with known class labels for $K$ classes is the following, where $p_{i}$ is the probability of observing class $i$ in the subset: +There are many variants of tree-based machine learning algorithms. However, most algorithms construct decision nodes in a top-down manner. They select the best variables to use in decision nodes based on how homogeneous the sample sets are after the split. One measure of homogeneity is "Gini impurity". This measure is calculated for each subset after the split and later summed up as a weighted average. For a decision node that splits the data perfectly in a two-class problem, the gini impurity will be $0$, and for a node that splits the data into a subset that has 50% class A and 50% class B the impurity will be $0.5$. Formally, the gini impurity, ${I}_{G}(p)$, of a set of samples with known class labels for $K$ classes is the following, where $p_{i}$ is the probability of observing class $i$ in the subset: $$ {\displaystyle {I}_{G}(p)=\sum _{i=1}^{K}p_{i}(1-p_{i})=\sum _{i=1}^{K}p_{i}-\sum _{i=1}^{K}{p_{i}}^{2}=1-\sum _{i=1}^{K}{p_{i}}^{2}} $$ -For example, if a subset of data after split has 75% class A and 25% class B for that subset the impurity would be $1-(0.75^2+0.25^2)=0.375$. If the other subset had 5% class A and 95% class B its impurity would be $1-(0.95^2+0.05^2)=0.095$. If the subset sizes after the split were equal total weighted impurity would be $0.5*0.375+0.5*0.095= 0.235$. These calculations will be done for each potential variable and the split, and every node will be constructed based on gini impurity decrease. If the variable is continuous the cutoff value will be decided based on the best impurity. For example, gene expression values will have splits such as "PIGX expression < 2.1". Here 2.1 is the cutoff value is the one that produces the best impurity. There are other homogeneity measures, however gini impurity is the one that is used for the random forests which we will introduce next. +For example, if a subset of data after a split has 75% class A and 25% class B for that subset, the impurity would be $1-(0.75^2+0.25^2)=0.375$. If the other subset had 5% class A and 95% class B, its impurity would be $1-(0.95^2+0.05^2)=0.095$. If the subset sizes after the split were equal, the total weighted impurity would be $0.5*0.375+0.5*0.095= 0.235$. These calculations will be done for each potential variable and the split, and every node will be constructed based on gini impurity decrease. If the variable is continuous, the cutoff value will be decided based on the best impurity. For example, gene expression values will have splits such as "PIGX expression < 2.1". Here $2.1$ is the cutoff value that produces the best impurity. There are other homogeneity measures; however, gini impurity is the one that is used for random forests, which we will introduce next.
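The arithmetic above is easy to verify in a couple of lines of R; the tiny helper function below is our own and simply implements the gini impurity formula for a vector of class proportions.

```{r giniSketch, eval=FALSE}
# gini impurity for a vector of class proportions
gini <- function(p) 1 - sum(p^2)

gini(c(0.75, 0.25))  # 0.375
gini(c(0.05, 0.95))  # 0.095
gini(c(0.5, 0.5))    # 0.5, the worst case for two classes
gini(c(1, 0))        # 0, a perfectly pure subset

# weighted impurity of the split when the two subsets are equal-sized
0.5 * gini(c(0.75, 0.25)) + 0.5 * gini(c(0.05, 0.95))  # 0.235
```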
### Trees to forests -Random forests are devised to counter the short comings of decision trees. They are simply ensembles of decision trees. Each tree is trained with a different randomly selected part of the data with randomly selected predictor variables. The goal of introducing randomness is to reduce the variance of the model so it does not overfit, with the expense of a small increase in the bias and some loss of interpretability. This strategy generally boosts the performance of the final model.\index{random forest} +Random forests are devised to counter the shortcomings of decision trees. They are simply ensembles of decision trees. Each tree is trained with a different randomly selected part of the data with randomly selected predictor variables. The goal of introducing randomness is to reduce the variance of the model so it does not overfit, at the expense of a small increase in the bias and some loss of interpretability. This strategy generally boosts the performance of the final model.\index{random forest} -The random forests algorithm tries to decorrelate the trees so that they learn different things about the data. It does this by selecting a random subset of variables. If one or a few predictor variables are very strong predictors for the response variable, these features will be selected in many of the trees, causing them to become correlated. By random subsampling of predictor variables it ensures that not always the best predictors overall will be selected for every tree and the model has a chance to learn other features of the data. +The random forests algorithm tries to decorrelate the trees so that they learn different things about the data. It does this by selecting a random subset of variables. If one or a few predictor variables are very strong predictors for the response variable, these features will be selected in many of the trees, causing them to become correlated. Random subsampling of predictor variables ensures that the best overall predictors are not always selected for every tree, and the model has a chance to learn other features of the data. -Another sampling method introduce when building random forest models is the bootstrap resampling before constructing each tree. This brings the advantage of out-of-the-bag (OOB) error prediction. In this case, the prediction error can be estimated for training samples that were OOB, meaning they were not used in the training, for some percentage of the trees. The prediction error for each sample can be estimated from the trees where that sample was OOB. OOB estimates claimed to be a good alternative to cross-validation estimated errors [@breiman2001random]. +Another sampling method introduced when building random forest models is bootstrap resampling before constructing each tree. This brings the advantage of out-of-the-bag (OOB) error prediction. In this case, the prediction error can be estimated for training samples that were OOB, meaning they were not used in the training, for some percentage of the trees. The prediction error for each sample can be estimated from the trees where that sample was OOB. OOB estimates are claimed to be a good alternative to cross-validation estimated errors [@breiman2001random]. -```{r,RFcartoon,fig.cap="Random forest concept. Individual decision trees are built with sampling strategies. Votes from each tree defines the final class",fig.align = 'center',out.width='70%',echo=FALSE} +```{r,RFcartoon,fig.cap="Random forest concept. Individual decision trees are built with sampling strategies. Votes from each tree define the final class.",fig.align = 'center',out.width='70%',echo=FALSE} knitr::include_graphics("images/ml-random-forest-features.png" ) ``` -For the demonstration purposes, we will use `caret` \index{R Packages!\texttt{caret}}package interface to `ranger` random forest package\index{R Packages!\texttt{ranger}}. This is a fast implementation of the original random forest algorithm. For random forests, we have two critical arguments. One is the most critical argument for random forest is the predictor variables to sample in each split of the tree. As this parameter controls the independence between the trees, and as explain before this limits the overfitting. Below, we are going to fit a random forest model to our tumor subtype problem. We will set `mtry=100` and do not let select the training procedure to find the best `mtry` value for simplicity. However, it is good practice -to run the model with cross-validation let it pick best parameters based on the cross-validation performance. It defaults to the square root of number of predictor variables. Another variable we can tune is the minimum node size of terminal nodes in the trees (`min.node.size`). This controls the depth of the trees grown. Setting this to larger numbers might cost a small loss in accuracy but the algorithm will run faster.
+For demonstration purposes, we will use the `caret` \index{R Packages!\texttt{caret}}package interface to the `ranger` random forest package\index{R Packages!\texttt{ranger}}. This is a fast implementation of the original random forest algorithm. For random forests, we have two critical arguments. One is the number of predictor variables to sample in each split of the tree (`mtry`). This parameter controls the independence between the trees, and as explained before, this limits overfitting. Below, we are going to fit a random forest model to our tumor subtype problem. We will set `mtry=100` and will not let the training procedure search for the best `mtry` value, for simplicity. However, it is good practice +to run the model with cross-validation and let it pick the best parameters based on the cross-validation performance. It defaults to the square root of the number of predictor variables. Another variable we can tune is the minimum node size of terminal nodes in the trees (`min.node.size`). This controls the depth of the trees grown. Setting this to larger numbers might cost a small loss in accuracy but the algorithm will run faster. ```{r,RFex,fig.cap="Fitting a random forest model to the tumor subtype data.",fig.align = 'center',out.width='70%'} set.seed(17) @@ -575,21 +577,21 @@ rfFit$finalModel$prediction.error ``` ### Variable importance -Random forests comes with built-in variable importance metrics. One of the metrics is similar to the "variable dropout metric" where the predictor variables are permuted. In this case, OOB samples are used and the variables are permuted one at a time. Every time, the samples with the permuted variables are fed to the network and decrease in accuracy with the permuted variable is measured. Using this quantity, the variables can be ranked. \index{variable importance} +Random forests come with built-in variable importance metrics. One of the metrics is similar to the "variable dropout metric" where the predictor variables are permuted. In this case, OOB samples are used and the variables are permuted one at a time. Every time, the samples with the permuted variables are fed to the model and the decrease in accuracy is measured. Using this quantity, the variables can be ranked. \index{variable importance} -A less costly method with similar performance is to use gini impurity. Every time a variable is used in a tree to make a split, the gini impurity is less than the parent node. This method adds up these gini impurity decreases for each individual variable across the trees and divided by number of the trees in the forest. This metric is often consistent with the permutation importance measure [@breiman2001random]. Below, we are going to plot the "permutation" based importance metric. This metric has been calculated during the run of the model above. We will use the `caret::varImp()` function to access the importance values and plot them using the `plot()` function, the result is shown in Figure \@ref(fig:RFvarImp). +A less costly method with similar performance is to use gini impurity. Every time a variable is used in a tree to make a split, the gini impurity of the resulting subsets is less than that of the parent node. This method adds up these gini impurity decreases for each individual variable across the trees and divides the total by the number of trees in the forest. This metric is often consistent with the permutation importance measure [@breiman2001random]. Below, we are going to plot the permutation-based importance metric.
This metric has been calculated during the run of the model above. We will use the `caret::varImp()` function to access the importance values and plot them using the `plot()` function; the result is shown in Figure \@ref(fig:RFvarImp). -```{r,RFvarImp,fig.cap="Top 10 important variables based on permutation-based method for the random forest classification",fig.align = 'center',out.width='50%'} +```{r,RFvarImp,fig.cap="Top 10 important variables based on permutation-based method for the random forest classification.",fig.align = 'center',out.width='50%'} plot(varImp(rfFit),top=10) ``` ## Logistic regression and regularization -Logistic regression \index{logistic regression} is a statistical method that is used to model a binary response variable based on predictor variables. Although initially devised for two-class or binary response problems, this method can be generalized to multiclass problems. However, our example tumor sample data is a binary response or two-class problem, therefore we will not go into multiclass case in this chapter. +Logistic regression \index{logistic regression} is a statistical method that is used to model a binary response variable based on predictor variables. Although initially devised for two-class or binary response problems, this method can be generalized to multiclass problems. However, our example tumor sample data is a binary response or two-class problem, therefore we will not go into the multiclass case in this chapter. -Logistic regression is very similar to the linear regression as a concept and it can be thought of as a "maximum likelihood estimation" problem where we are trying to find statistical parameters that maximizes the likelihood of the observed data being sampled from the statistical distribution of interest. Which is also very related to the general cost/loss function approach we see in supervised machine learning algorithms. In the case of binary response variables, simple linear regression model, such as $y_i \sim \beta _{0}+\beta _{1}x_i$, would be a poor choice because it can easily generate values outside of 0 to 1 boundary. What we need is a +Logistic regression is very similar to linear regression as a concept and it can be thought of as a "maximum likelihood estimation" problem where we are trying to find statistical parameters that maximize the likelihood of the observed data being sampled from the statistical distribution of interest. This is also very related to the general cost/loss function approach we see in supervised machine learning algorithms. In the case of binary response variables, the simple linear regression model, such as $y_i \sim \beta _{0}+\beta _{1}x_i$, would be a poor choice because it can easily generate values outside of the $0$ to $1$ boundary. What we need is a model that restricts the lower bound of the prediction to zero and an upper -bound to 1. First thing towards this requirement is to formulate the problem differently. If $y_i$ can only be 0 or 1, we can formulate $y_i$ as a realization of a random variable that can take the values one and zero with probabilities $p_i$ and $1-{p_i}$, respectively. This random variable follows the Bernoulli distribution, and instead of predicting the binary variable we can formulate the problem as $p_i \sim \beta _{0}+\beta _{1}x_i$. However, our initial problem still stands, simple linear regression will still result in values that are beyond 0 and 1 boundary. A model that satisfies the boundary requirement is the logistic equation shown below. +bound to $1$. 
The first thing towards this requirement is to formulate the problem differently. If $y_i$ can only be $0$ or $1$, we can formulate $y_i$ as a realization of a random variable that can take the values one and zero with probabilities $p_i$ and $1-{p_i}$, respectively. This random variable follows the Bernoulli distribution, and instead of predicting the binary variable we can formulate the problem as $p_i \sim \beta _{0}+\beta _{1}x_i$. However, our initial problem still stands: simple linear regression will still result in values that are beyond the $0$ and $1$ boundaries. A model that satisfies the boundary requirement is the logistic equation shown below. $$ {\displaystyle p_i={\frac {e^{(\beta _{0}+\beta _{1}x_i)}}{1+e^{(\beta _{0}+\beta_{1}x_i)}}}} $$ This equation can be linearized by the following transformation $$ {\displaystyle \operatorname{logit} (p_i)=\ln \left({\frac {p_i}{1-p_i}}\right)=\beta _{0}+\beta _{1}x_i} $$ -The left-hand side is termed the logit, which stands for “logistic unit.” It is also known as the log odds. In this case, our model will produce values on the log scale and with the logistic equation above, we can transform the values to 0-1 range. Now, the question remains: "What are the best parameter estimates for our training set". Within the maximum likelihood framework\index{maximum likelihood estimation} we have touched upon in Chapter \@ref(stats), the best parameter estimates are the ones that maximizes the likelihood of the statistical the model actually producing the observed data. You can think of this fitting a probability distribution to an observed data set. The parameters of the probability distribution should maximize the likelihood that the observed data came from the distribution of in question. If we were using a Gaussian distribution we would change the mean and variance parameters until the observed data is more plausible to be drawn from that specific Gaussian distribution. +The left-hand side is termed the logit, which stands for "logistic unit". It is also known as the log odds. In this case, our model will produce values on the log scale and with the logistic equation above, we can transform the values to the $0-1$ range. Now, the question remains: "What are the best parameter estimates for our training set?". Within the maximum likelihood framework\index{maximum likelihood estimation} we have touched upon in Chapter \@ref(stats), the best parameter estimates are the ones that maximize the likelihood of the statistical model actually producing the observed data. You can think of this as fitting a probability distribution to an observed data set. The parameters of the probability distribution should maximize the likelihood that the observed data came from the distribution in question. If we were using a Gaussian distribution we would change the mean and variance parameters until the observed data was more plausible to be drawn from that specific Gaussian distribution. -In logistic regression, \index{logistic regression}the response variable is modeled with a binomial distribution or its special case Bernoulli distribution. The value of each response variable, $y_i$, is 0 or 1, and we need to figure out parameter $p_i$ values that could generate such a distribution of 0s and 1s. If we can find the best $p_i$ values for each tumor sample $i$, we would be maximizing the log-likelihood function of the model over the observed data.
The maximum log-likelihood function for our binary response variable case is shown as equation \@ref(eq:logLik). +In logistic regression, \index{logistic regression}the response variable is modeled with a binomial distribution or its special case Bernoulli distribution. The value of each response variable, $y_i$, is 0 or 1, and we need to figure out parameter $p_i$ values that could generate such a distribution of 0s and 1s. If we can find the best $p_i$ values for each tumor sample $i$, we would be maximizing the log-likelihood function of the model over the observed data. The maximum log-likelihood function for our binary response variable case is shown as Equation \@ref(eq:logLik). \begin{equation} \operatorname{\ln} (L)=\sum_{i=1}^N\bigg[{\ln(1-p_i)+y_i\ln \left({\frac {p_i}{1-p_i}}\right)\bigg]} (\#eq:logLik) \end{equation} -In order to maximize this equation we have to find optimum $p_i$ values which are dependent on parameters $\beta_0$ and $\beta_1$, and also dependent on the values of predictor variables $x_i$. We can rearrange the equation replacing $p_i$ with the logistic equation. In addition, many optimization functions minimize rather than maximize. Therefore, we will be using negative log likelihood, this also called "log loss" or "logistic loss" function. The function below is the "log loss" function. We substituted $p_i$ with the logistic equation and simplified the expression. +In order to maximize this equation we have to find optimum $p_i$ values which are dependent on parameters $\beta_0$ and $\beta_1$, and also dependent on the values of predictor variables $x_i$. We can rearrange the equation replacing $p_i$ with the logistic equation. In addition, many optimization functions minimize rather than maximize. Therefore, we will be using negative log likelihood, which is also called the "log loss" or "logistic loss" function. The function below is the "log loss" function. We substituted $p_i$ with the logistic equation and simplified the expression. \begin{equation} \operatorname L_{log}=-{\ln}(L)=-\sum_{i=1}^N\bigg[-{\ln(1+e^{(\beta _{0}+\beta _{1}x_i)})+y_i \left(\beta _{0}+\beta _{1}x_i\right)\bigg]} @@ -617,9 +619,9 @@ In order to maximize this equation we have to find optimum $p_i$ values which ar \end{equation} -Now, let us see how this works in practice. First, as in the example above we will use one predictor variable, the expression of one gene to classify tumor samples to "CIMP" and "noCIMP" subtypes. We will be using PDPN gene expression, which was one of the most important variables in our random forest model. We will use the formula interface in `caret`, where we will supply the names of response and and predictor variables in a formula. In this case, the we will be using a core R function `glm()` from `stats` package.\index{R Packages!\texttt{stats}} "glm" stands for generalized linear models, and it is the main interface for different types of regression +Now, let us see how this works in practice. First, as in the example above we will use one predictor variable, the expression of one gene to classify tumor samples to "CIMP" and "noCIMP" subtypes. We will be using PDPN gene expression, which was one of the most important variables in our random forest model. We will use the formula interface in `caret`, where we will supply the names of the response and predictor variables in a formula. 
In this case, we will be using a core R function, `glm()`, from the `stats` package.\index{R Packages!\texttt{stats}} "glm" stands for generalized linear models, and it is the main interface for different types of regression in R. -```{r, logReg1,out.width='60%',fig.width=5,fig.cap="Sigmoid curve for prediction of subtype based on one predictor variable"} +```{r, logReg1,out.width='60%',fig.width=5,fig.cap="Sigmoid curve for prediction of subtype based on one predictor variable."} # fit logistic regression model # method and family defines the type of regression @@ -643,7 +645,7 @@ plot(ifelse(subtype=="CIMP",1,0) ~ PDPN, lines(subtype ~ PDPN, newdat, col="green4", lwd=2) ``` -The figure \@ref(fig:logReg1) shows the sigmoidal curve that is fitted by the logistic regression. "noCIMP" subtype has higher expression of PDPN gene than the "CIMP" subtype. In other words, the higher the values of PDPN, the more likely that the tumor sample will be classified as "noCIMP". We can also assess the performance of our model with the test set and the training set. Let us try to do that with again `caret::predict()` and `caret::confusionMatrix()` functions. +Figure \@ref(fig:logReg1) shows the sigmoidal curve that is fitted by the logistic regression. "noCIMP" subtype has higher expression of the PDPN gene than the "CIMP" subtype. In other words, the higher the values of PDPN, the more likely that the tumor sample will be classified as "noCIMP". We can also assess the performance of our model with the test set and the training set. Let us try to do that again with the `caret::predict()` and `caret::confusionMatrix()` functions. ```{r confusionLR2} # training accuracy @@ -655,7 +657,7 @@ class.res=predict(lrFit,testing[,-1]) confusionMatrix(testing[,1],class.res)$overall[1] ``` -The test accuracy\index{accuracy} is slightly worse than the training accuracy. Overall this is not as good as k-NN\index{k-nearest neighbors (k-NN)}, but remember we used only one predictor variable. We have thousands of genes as predictor variables. Now we will try to use all of them in the classification problem. After fitting model we will check training and test accuracy. We fit the model again with `caret::train()` function. +The test accuracy\index{accuracy} is slightly worse than the training accuracy. Overall this is not as good as k-NN\index{k-nearest neighbors (k-NN)}, but remember we used only one predictor variable. We have thousands of genes as predictor variables. Now we will try to use all of them in the classification problem. After fitting the model, we will check training and test accuracy. We fit the model again with the `caret::train()` function. ```{r logRegMulti, warning=FALSE,message=FALSE} lrFit2 = train(subtype ~ ., data=training, @@ -673,29 +675,29 @@ confusionMatrix(testing[,1],class.res)$overall[1] ``` -Training accuracy is 1 so training error is 0, nothing is misclassified in the training set. However, test accuracy/error is close to terrible. It does only little better than a random guess. If we randomly assigned class labels we would get 0.5 accuracy. The test set accuracy is 0.55 despite the 100% training accuracy. This is because the model overfits to the training data. There are too many variables in the model. The number of predictor variables is ~6.5 times more than the number of samples. The excess of predictor variables makes the model very flexible (high variance), and this leads to overfitting. 
+Training accuracy is $1$, so training error is $0$, and nothing is misclassified in the training set. However, test accuracy/error is close to terrible. It does only little better than a random guess. If we randomly assigned class labels we would get 0.5 accuracy. The test set accuracy is 0.55 despite the 100% training accuracy. This is because the model overfits to the training data. There are too many variables in the model. The number of predictor variables is ~6.5 times more than the number of samples. The excess of predictor variables makes the model very flexible (high variance), and this leads to overfitting. -### regularization in order to avoid overfitting +### Regularization in order to avoid overfitting If \index{regularization}we can limit the flexibility of the model, this might help with performance on the unseen, new data sets. Generally, any modification of the learning method to improve performance on the unseen datasets is called regularization. We need regularization to introduce bias to the model and to decrease the variance. This can be achieved by modifying the loss function with a penalty term which effectively shrinks the estimates of the coefficients. Therefore these types of methods within the framework of regression are also called "shrinkage" methods or "penalized regression" methods.\index{overfitting} -One way to ensure shrinkage is to add the penalty term, $\lambda\sum{\beta_j}^2$, to the loss function. This penalty term is also known as The L2 norm or L2 penalty. It is calculated as the square root of the sum of the squared vector values. This term will help shrink the coefficients in the regression towards zero. The new loss function is as follows, where $j$ is the number of parameters/coefficients in the model and $L_{log}$ is the log loss function in Eq. \@ref(eq:llog). +One way to ensure shrinkage is to add the penalty term, $\lambda\sum{\beta_j}^2$, to the loss function. This penalty term is also known as the L2 norm or L2 penalty. It is calculated as the square root of the sum of the squared vector values. This term will help shrink the coefficients in the regression towards zero. The new loss function is as follows, where $j$ is the number of parameters/coefficients in the model and $L_{log}$ is the log loss function in Eq. \@ref(eq:llog). \begin{equation} L_{log}+\lambda\sum_{j=1}^p{\beta_j}^2 (\#eq:L2norm) \end{equation} -This penalized loss function is called "ridge regression" [@hoerl1970ridge].\index{ridge regression} When we add the penalty, the only way the optimization procedure keeps the overall loss function minimum is to assign smaller values to the coefficients. The $\lambda$ parameter controls how much emphasis is given to the penalty term. Higher the $\lambda$ value, the more coefficients in the regression will be pushed towards zero. However, they will never be exactly zero. This is not desirable if we want the model to select important variables. A small modification to the penalty is to use the absolute values of $B_j$ instead of squared values. This penalty is called "L1 norm" or "L1 penalty". The regression method that uses the L1 penalty is known as "Lasso regression"\index{lasso regression} [@tibshirani1996regression]. +This penalized loss function is called "ridge regression" [@hoerl1970ridge].\index{ridge regression} When we add the penalty, the only way the optimization procedure keeps the overall loss function minimum is to assign smaller values to the coefficients. 
The $\lambda$ parameter controls how much emphasis is given to the penalty term. The higher the $\lambda$ value, the more coefficients in the regression will be pushed towards zero. However, they will never be exactly zero. This is not desirable if we want the model to select important variables. A small modification to the penalty is to use the absolute values of $\beta_j$ instead of squared values. This penalty is called the "L1 norm" or "L1 penalty". The regression method that uses the L1 penalty is known as "Lasso regression"\index{lasso regression} [@tibshirani1996regression]. $$ L_{log}+\lambda\sum_{j=1}^p{|\beta_j}| $$ -However, the L1 penalty tends to pick one variable at random when predictor variables are correlated. In this case, it looks like one of the variables are not important although it might still have predictive power. The Ridge regression on the other hand shrinks coefficients of correlated variables towards each other, keeping all of them. It has been shown that both Lasso and Ridge regression has their drawbacks and advantages [friedman2010regularization]. More recently, a method called "elastic net" \index{elastic net}is proposed to include best of the both worlds [@zou2005regularization]. This method uses both L1 and L2 penalties. The equation below shows the modified loss function by this penalty. As you can see the $\lambda$ parameter still controls the weight that is given to the penalty. This time the additional parameter $\alpha$ controls the weight given to L1 or L2 penalty and it is a value between 0 and 1. +However, the L1 penalty tends to pick one variable at random when predictor variables are correlated. In this case, it looks like one of the variables is not important although it might still have predictive power. Ridge regression, on the other hand, shrinks coefficients of correlated variables towards each other, keeping all of them. It has been shown that both Lasso and Ridge regression have their drawbacks and advantages [@friedman2010regularization]. More recently, a method called "elastic net" \index{elastic net}was proposed to include the best of both worlds [@zou2005regularization]. This method uses both L1 and L2 penalties. The equation below shows the loss function modified by this penalty. As you can see, the $\lambda$ parameter still controls the weight that is given to the penalty. This time, the additional parameter $\alpha$ controls the weight given to the L1 or L2 penalty, and it is a value between 0 and 1. $$ L_{log}+\lambda\sum_{j=1}^p{(\alpha\beta_j^2+(1-\alpha)|\beta_j}|) $$ -We have now got the concept behind regularization and we can see how it works in practice. We are going to use elastic net on our tumor subtype prediction problem. We will let cross-validation select the best $\lambda$ and we will fix the $\alpha$ parameter at 0.5. +Now that we have the concept behind regularization, we can see how it works in practice. We are going to use elastic net on our tumor subtype prediction problem. We will let cross-validation select the best $\lambda$ and we will fix the $\alpha$ parameter at $0.5$. ```{r} set.seed(17) library(glmnet) @@ -721,12 +723,12 @@ class.res=predict(enetFit,testing[,-1]) confusionMatrix(testing[,1],class.res)$overall[1] ``` -As you can see regularization worked, the tuning step selected $\lambda=1$ and we were able to get a satisfactory test set accuracy with the best model. 
+As you can see, regularization worked: the tuning step selected $\lambda=1$, and we were able to get a satisfactory test set accuracy with the best model. -### variable importance -The variable importance\index{variable importance} for the penalized regression especially for lasso and elastic net is more or less out of the box. As discussed, these methods will set regression coefficients for irrelevant variables to zero. This provides a system for selecting important variables but it does not necessarily provide a way to rank them. Using the size of the regression coefficients is a way to rank predictor variables, however if the data is not normalized you will get different scales for different variables. In our case, we normalized the data and we know that the variables have the same scale before they went into the training. We can use this fact and rank them based on the regression coefficients. The `caret::varImp()` function uses the coefficients to rank the variables from the elastic net model. Below, were going to plot top 10 important variables which are normalized to the importance of the most important variable. -```{r varImpEnet,out.width='60%',fig.width=5,fig.cap="Variable importance metric for elastic net. This metric is using regression coefficients as importance"} +### Variable importance +The variable importance\index{variable importance} of penalized regression, especially for lasso and elastic net, comes more or less out of the box. As discussed, these methods will set regression coefficients for irrelevant variables to zero. This provides a system for selecting important variables but it does not necessarily provide a way to rank them. Using the size of the regression coefficients is a way to rank predictor variables; however, if the data is not normalized, you will get different scales for different variables. In our case, we normalized the data and we know that the variables have the same scale before they went into the training. We can use this fact and rank them based on the regression coefficients. The `caret::varImp()` function uses the coefficients to rank the variables from the elastic net model. Below, we are going to plot the top 10 important variables, which are normalized to the importance of the most important variable. +```{r varImpEnet,out.width='60%',fig.width=5,fig.cap="Variable importance metric for elastic net. This metric uses regression coefficients as importance."} plot(varImp(enetFit),top=10) ``` @@ -735,25 +737,25 @@ plot(varImp(enetFit),top=10) __Want to know more ?__ -- Lecture by Trevor Hastie on regularized regression. You probably need to understand basics of regression and its terminology to follow this. However, the lecture is not very heavy on math. https://youtu.be/BU2gjoLPfDc +- Lecture by Trevor Hastie on regularized regression. You probably need to understand the basics of regression and its terminology to follow this. However, the lecture is not very heavy on math. https://youtu.be/BU2gjoLPfDc ``` ## Other supervised algorithms -We will next introduce a couple of other supervised algorithms for completeness but in less detail. These algorithms are also as popular as the others we introduced above and people who are interested in computational genomics see them used in the field for different problems. These algorithms also fit to the general framework of optimization of a cost/loss function. However, the approaches to the construction of the cost function and the cost function itself is different in each case. 
+We will next introduce a couple of other supervised algorithms for completeness but in less detail. These algorithms are also as popular as the others we introduced above and people who are interested in computational genomics see them used in the field for different problems. These algorithms also fit to the general framework of optimization of a cost/loss function. However, the approaches to the construction of the cost function and the cost function itself are different in each case. ### Gradient boosting -Gradient boosting is a prediction model that uses an ensemble of decision trees similar to random forest. However, the decision trees are added sequentially. These models is why this models are also called "Multiple Additive Regression Trees (MART)" [@friedman2003mart]. Apart from this, you will see similar methods named as "Gradient boosting machines (GBM)"[@friedman2001gbm] or "Boosted regression trees (BRT)" [@elith2008brt] in the literature. +Gradient boosting is a prediction model that uses an ensemble of decision trees similar to random forest. However, the decision trees are added sequentially, which is why these models are also called "Multiple Additive Regression Trees (MART)" [@friedman2003mart]. Apart from this, you will see similar methods called "Gradient boosting machines (GBM)"[@friedman2001gbm] or "Boosted regression trees (BRT)" [@elith2008brt] in the literature. -Generally, "boosting" \index{gradient boosting} refers to an iterative learning approach where each new model tries to focus on data points where previous ensemble of simple models did not predict well. Gradient boosting is an improvement over that, where each new model tries to focus on the residual errors (prediction error for the current ensemble of models) of the previous model. Specifically in gradient boosting, the simple models are trees. As in random forests, many trees are grown but in this case, trees are sequentially grown and each tree is focusing on fixing the shortcomings of the previous trees. Figure \@ref(fig:GBMcartoon) shows this concept. One of the most widely used algorithms for gradient boosting is `XGboost` which stands for "extreme gradient boosting"[@chen2016xgboost]. Below we will demonstrate how to use this on our problem. `XGboost`\index{R Packages!\texttt{XGboost}} as well as other gradient boosting methods has many parameters to regularize and optimize the complexity of the model. Finding the best parameters for your problem might take some time. However, this flexibility comes with benefits, methods depending on `XGboost` have won many machine learning competitions boosting [@chen2016xgboost]. -```{r,GBMcartoon,fig.cap="Gradient boosting machines concept. Individual decision trees are built sequentially in order to fix the errors from the previous trees",fig.align = 'center',out.width='70%',echo=FALSE} +Generally, "boosting" \index{gradient boosting} refers to an iterative learning approach where each new model tries to focus on data points where the previous ensemble of simple models did not predict well. Gradient boosting is an improvement over that, where each new model tries to focus on the residual errors (prediction error for the current ensemble of models) of the previous model. Specifically in gradient boosting, the simple models are trees. As in random forests, many trees are grown but in this case, trees are sequentially grown and each tree focuses on fixing the shortcomings of the previous trees. Figure \@ref(fig:GBMcartoon) shows this concept. 
One of the most widely used algorithms for gradient boosting is `XGboost`, which stands for "extreme gradient boosting" [@chen2016xgboost]. Below we will demonstrate how to use this on our problem. `XGboost`\index{R Packages!\texttt{XGboost}} as well as other gradient boosting methods has many parameters to regularize and optimize the complexity of the model. Finding the best parameters for your problem might take some time. However, this flexibility comes with benefits; methods depending on `XGboost` have won many machine learning competitions [@chen2016xgboost]. +```{r,GBMcartoon,fig.cap="Gradient boosting machines concept. Individual decision trees are built sequentially in order to fix the errors from the previous trees.",fig.align = 'center',out.width='70%',echo=FALSE} knitr::include_graphics("images/ml-GBM-features.png" ) ``` -The most important parameters are number of trees (`nrounds`), tree depth (`max_depth`) and learning rate or shrinkage (`eta`). Generally, the more trees we have the better the algorithm will learn because each tree tries to fix misclassification errors that previous tree ensemble could not perform. Having too many trees might cause overfitting. However, learning rate parameter, eta, combats that by shrinking the contribution of each new tree. This can be set to lower values if you have many trees. You can either set a large number of trees and then tune the model with the learning rate parameter or set the learning rate low, say to 0.01 or 0.1 and tune the number of trees. Similarly, tree depth is also controls for overfitting. The deeper the tree, the more usually it will overfit. This has to be tuned as well, the default is at 6. You can try to explore a range around the default. Apart from these, as in random forests, you can subsample the training data and/or the predictive variables. These strategies also can help you counter overfitting. +The most important parameters are the number of trees (`nrounds`), tree depth (`max_depth`), and learning rate or shrinkage (`eta`). Generally, the more trees we have, the better the algorithm will learn, because each tree tries to fix the misclassification errors that the previous tree ensemble could not fix. Having too many trees might cause overfitting. However, the learning rate parameter, eta, combats that by shrinking the contribution of each new tree. This can be set to lower values if you have many trees. You can either set a large number of trees and then tune the model with the learning rate parameter or set the learning rate low, say to $0.01$ or $0.1$, and tune the number of trees. Similarly, tree depth also controls for overfitting. The deeper the tree, the more likely it is to overfit. This has to be tuned as well; the default is at 6. You can try to explore a range around the default. Apart from these, as in random forests, you can subsample the training data and/or the predictive variables. These strategies can also help you counter overfitting. -We are now going to use `XGboost` with caret package on our cancer subtype classification problem. We are going to try different learning rate parameters. In this instance, we also subsample the dataset before we train each tree. The "subsample" parameter controls this and we set this to be 0.5, which means that before we train a tree we will sample 50% of the data and use only that portion to train the tree. +We are now going to use `XGboost` with the caret package on our cancer subtype classification problem. We are going to try different learning rate parameters. 
In this instance, we also subsample the dataset before we train each tree. The "subsample" parameter controls this and we set this to be 0.5, which means that before we train a tree we will sample 50% of the data and use only that portion to train the tree. ```{r,xgboost} library(xgboost) set.seed(17) @@ -780,30 +782,30 @@ gbFit <- train(subtype~., data = training, gbFit$bestTune ``` -Similar to random forests, we can estimate the variable importance for gradient boosting using the improvement in gini impurity or other performance relates metrics every time a variable is selected in a tree. Again, `caret::varImp()` function can be used to plot the importance metrics. +Similar to random forests, we can estimate the variable importance for gradient boosting using the improvement in Gini impurity or other performance-related metrics every time a variable is selected in a tree. Again, the `caret::varImp()` function can be used to plot the importance metrics. ```{block2, xgboostMore, type='rmdtip'} __Want to know more ?__ -- [More background on gradient boosting and XGboost](https://xgboost.readthedocs.io/en/latest/tutorials/model.html). This explains the cost/loss function and regularization in more detail. -- [Lecture on Gradient boosting and random forests by Trevor Hastie](https://youtu.be/wPqtzj5VZus) +- More background on gradient boosting and XGboost: (https://xgboost.readthedocs.io/en/latest/tutorials/model.html). This explains the cost/loss function and regularization in more detail. +- Lecture on Gradient boosting and random forests by Trevor Hastie: (https://youtu.be/wPqtzj5VZus) ``` ### Support Vector Machines (SVM) -Support vector machines (SVM) \index{Support vector machines (SVM)} are popularized in the 90s due the efficiency and the performance of the algorithm [@boser1992svm]. The algorithm works by identifying the optimal decision boundary that separates the data points to different groups (or classes), and then predicts the class of new observations based on this separation boundary. Depending on the situation, the different groups might be separable by a linear straight line or by a non-linear boundary line or plane. If you review k-NN decision boundaries in figure \@ref(fig:kNNboundary), you can see that the decision boundary is not linear. SVM can deal with the linear or non-linear decision boundaries. +Support vector machines (SVM) \index{Support vector machines (SVM)} were popularized in the 90s due to the efficiency and performance of the algorithm [@boser1992svm]. The algorithm works by identifying the optimal decision boundary that separates the data points into different groups (or classes), and then predicts the class of new observations based on this separation boundary. Depending on the situation, the different groups might be separable by a linear straight line or by a non-linear boundary line or plane. If you review k-NN decision boundaries in Figure \@ref(fig:kNNboundary), you can see that the decision boundary is not linear. SVM can deal with linear or non-linear decision boundaries. -First, SVM can map the data to higher dimensions where the decision boundary can be linear. This is achieved by applying certain mathematical functions, called "kernel functions", to the predictor variable space. For example, a second degree polynomial can be applied to predictor variables which creates new variables and in this new space the problem is linearly separable. 
Figure \@ref(fig:SVMcartoon) demonstrates this concept where points in feature space is mapped to quadratic space where linear separation is possible. -```{r,SVMcartoon,fig.cap="Support vector machine concept. With the help of a kernel function points in feature space are mapped to higher dimensions where linear separation is possible.",fig.align = 'center',out.width='70%',echo=FALSE} +First, SVM can map the data to higher dimensions where the decision boundary can be linear. This is achieved by applying certain mathematical functions, called "kernel functions", to the predictor variable space. For example, a second-degree polynomial can be applied to predictor variables which creates new variables and in this new space the problem is linearly separable. Figure \@ref(fig:SVMcartoon) demonstrates this concept where points in feature space are mapped to quadratic space where linear separation is possible. +```{r,SVMcartoon,fig.cap="Support vector machine concept. With the help of a kernel function,points in feature space are mapped to higher dimensions where linear separation is possible.",fig.align = 'center',out.width='80%',echo=FALSE} knitr::include_graphics("images/kernelSVM.png" ) ``` -Second, SVM not only tries to find a decision boundary but tries to find the boundary with largest buffer zone on the sides of the boundary. Having a boundary with a large buffer or "margin" ,as it is formally called, will perform better for the new data points not used in the model training (Margin is marked in Figure \@ref(fig:SVMcartoon) ). In addition, SVM calculates the decision boundary with some error toleration. As we have seen it may not be always possible to find a linear boundary that perfectly separates the classes. SVM tolerates some degree of error, as in data points on the wrong side of the decision boundary. +Second, SVM not only tries to find a decision boundary, but tries to find the boundary with the largest buffer zone on the sides of the boundary. Having a boundary with a large buffer or "margin", as it is formally called, will perform better for the new data points not used in the model training (margin is marked in Figure \@ref(fig:SVMcartoon) ). In addition, SVM calculates the decision boundary with some error toleration. As we have seen it may not always be possible to find a linear boundary that perfectly separates the classes. SVM tolerates some degree of error, as in data points on the wrong side of the decision boundary. -Another important feature of the algorithm is that SVM decides on the decision boundary by only relying on the "landmark" data points, formally known as "support vectors". These are points that are closest to the decision boundary and harder to classify. By keeping track of such points only for decision boundary creation, the computational complexity of the algorithm is reduced. However, this depends on the margin or the buffer zone. If we have large margin than there are many landmark points. The extent of the margin is also related to the variance-bias trade off, the small margin, the classification will try to find a boundary that makes less errors in the training set therefore might overfit. If the margin is larger, it will tolerate more errors in the training set and might generalize better. Practically, this is controlled by the "C" or "Cost" parameter in SVM example we will show below. Another important choice we will make is the kernel function. Below we are using the radial basis kernel function. 
This function provides extra predictor dimension where the problem is linearly separable. The model we will use has only one parameter, which is "C". It is recommended that $C$ is in the form of $2^k$ where $k$ is in the range of -5 and 15 [@hsu2003practical]. Another parameter that can be tuned is related to the radial basis function called "sigma". Smaller sigma means less bias and more variance while larger sigma means less variance and more bias. Again, exponential sequences are recommended for tuning that [@hsu2003practical]. We will set it to 1 for demonstration purposes below. +Another important feature of the algorithm is that SVM decides on the decision boundary by only relying on the "landmark" data points, formally known as "support vectors". These are points that are closest to the decision boundary and harder to classify. By keeping track of such points only for decision boundary creation, the computational complexity of the algorithm is reduced. However, this depends on the margin or the buffer zone. If we have a large margin then there are many landmark points. The extent of the margin is also related to the variance-bias trade-off. If the allowed margin is small the classification will try to find a boundary that makes fewer errors in the training set therefore might overfit. If the margin is larger, it will tolerate more errors in the training set and might generalize better. Practically, this is controlled by the "C" or "Cost" parameter in the SVM example we will show below. Another important choice we will make is the kernel function. Below we use the radial basis kernel function. This function provides an extra predictor dimension where the problem is linearly separable. The model we will use has only one parameter, which is "C". It is recommended that $C$ is in the form of $2^k$ where $k$ is in the range of -5 and 15 [@hsu2003practical]. Another parameter that can be tuned is related to the radial basis function called "sigma". A smaller sigma means less bias and more variance, while a larger sigma means less variance and more bias. Again, exponential sequences are recommended for tuning that [@hsu2003practical]. We will set it to 1 for demonstration purposes below. ```{r, SVMcode} #svm code here library(kernlab) @@ -827,21 +829,21 @@ svmFit <- train(subtype~., data = training, __Want to know more ?__ -- MIT lecture by Patrick Winston on SVM(https://youtu.be/_PwhiWxHK8o). This lecture explains the concept with some mathematical background. It is not hard to follow. You should be able to follow this if you know what vectors are and if you have some knowledge on derivatives and basic algebra. -- Online demo for SVM (https://cs.stanford.edu/people/karpathy/svmjs/demo/). You can play with sigma and C parameters for radial basis SVM and see how they affect the decision boundary. +- MIT lecture by Patrick Winston on SVM: https://youtu.be/_PwhiWxHK8o. This lecture explains the concept with some mathematical background. It is not hard to follow. You should be able to follow this if you know what vectors are and if you have some knowledge on derivatives and basic algebra. +- Online demo for SVM: (https://cs.stanford.edu/people/karpathy/svmjs/demo/). You can play with sigma and C parameters for radial basis SVM and see how they affect the decision boundary. ``` ### Neural networks and deep versions of it -Neural networks \index{neural network} are another popular machine learning method which is recently regaining popularity. 
The earlier versions of the algorithm were popularized in the 80s and 90s. The advantage of neural networks is like SVM, they can model non-linear decision boundaries. The basic idea of neural networks is to combine the predictor variables in order to model the response variable as a non-linear function. In a neural network, input variables pass through several layers that combine the variables and transform those combinations and recombine outputs depending on how many layers the network has. In the conceptual example in Figure \@ref(fig:neuralNetDiagram) the input nodes receive predictor variables and make linear combinations of them in the form of $\sum ( w_ixi +b)$. Simply put, the variables are multiplied with weights and summed up. This is what we by "linear combination". These quantities are further fed into another layer called hidden layer where an activation function is applied on the sums. And these results are further fed into an output node which outputs class probabilities assuming we are working on a classification algorithm. There could be many more hidden layers that will even further combine the output from hidden layers before them. The algorithm in the end also has a cost function similar to the logistic regression cost function but it now has to estimate all the weight parameters: $w_i$. This is a more complicated problem than logistic regression because of number of parameters to be estimated but neural networks are able to fit complex functions due their parameter space flexibility as well. +Neural networks \index{neural network} are another popular machine learning method which has recently been regaining popularity. The earlier versions of the algorithm were popularized in the 80s and 90s. The advantage of neural networks is that, like SVM, they can model non-linear decision boundaries. The basic idea of neural networks is to combine the predictor variables in order to model the response variable as a non-linear function. In a neural network, input variables pass through several layers that combine the variables and transform those combinations and recombine outputs depending on how many layers the network has. In the conceptual example in Figure \@ref(fig:neuralNetDiagram), the input nodes receive predictor variables and make linear combinations of them in the form of $\sum ( w_ix_i +b)$. Simply put, the variables are multiplied with weights and summed up. This is what we call a "linear combination". These quantities are further fed into another layer called the hidden layer, where an activation function is applied on the sums. And these results are further fed into an output node, which outputs class probabilities assuming we are working on a classification problem. There could be many more hidden layers that will even further combine the output from hidden layers before them. The algorithm in the end also has a cost function similar to the logistic regression cost function, but it now has to estimate all the weight parameters: $w_i$. This is a more complicated problem than logistic regression because of the number of parameters to be estimated, but neural networks are able to fit complex functions due to their parameter space flexibility as well. -```{r,neuralNetDiagram,fig.cap="Diagram for a simple neural network, their combinations pass through hidden layers and combined again for the output. 
Predictor variables are fed to the network and weights are adjusted to optimize the cost function",fig.align = 'center',out.width='80%',echo=FALSE} +```{r,neuralNetDiagram,fig.cap="Diagram for a simple neural network. Predictor variables are fed to the network, their combinations pass through hidden layers and are combined again for the output, and weights are adjusted to optimize the cost function.",fig.align = 'center',out.width='80%',echo=FALSE} knitr::include_graphics("images/neuralNetDiagram.png" ) ``` -In practical sense, the number of nodes in the hidden layer (size) and some regularization on the weights can be applied to control for overfitting. This is called the calculated (decay) parameter controls for overfitting. +In a practical sense, the number of nodes in the hidden layer (size) and some regularization on the weights can be applied to control for overfitting. The latter is controlled by the so-called decay parameter. -We will train a simple neural network on our cancer data set. In this simple example, the network architecture is somewhat fixed. We can only choose number of nodes (denoted by "size") in the hidden layer and a regularization parameter ( denoted by "decay"). Increasing the number of nodes in hidden layer or in other implementations increasing the number of hidden layers will help model non-linear relationships but can overfit. One way to combat that is to limit the number of nodes in the hidden layer, another way is to regularize the weights. The decay parameter does just that, it penalizes the loss function by $decay(weigths^2)$. In the example below, we are trying 1 or 2 nodes in the hidden layer in the interest of simplicity and run-time. In addition, we are setting `decay=0`, which will correspond to not doing any regularization. +We will train a simple neural network on our cancer data set. In this simple example, the network architecture is somewhat fixed. We can only choose the number of nodes (denoted by "size") in the hidden layer and a regularization parameter (denoted by "decay"). Increasing the number of nodes in the hidden layer (or, in other implementations, the number of hidden layers) will help model non-linear relationships but can overfit. One way to combat that is to limit the number of nodes in the hidden layer; another way is to regularize the weights. The decay parameter does just that: it penalizes the loss function by $decay(weights^2)$. In the example below, we try 1 or 2 nodes in the hidden layer in the interest of simplicity and run-time. In addition, we set `decay=0`, which will correspond to not doing any regularization. ```{r, nnet, eval=FALSE} #svm code here library(nnet) @@ -862,20 +864,20 @@ nnetFit <- train(subtype~., data = training, MaxNWts=2000) ``` -The example we used above is a bit outdated. The modern "deep" neural networks provides much more flexibility in the number of nodes, number of layers and regularization options. In many areas especially computer vision deep neural networks are the state-of-the-art [@lecun2015deep]. These modern implementations of neural networks are available in R via `keras`\index{R Packages!\texttt{keras}} package and also can be trained via the `caret`\index{R Packages!\texttt{caret}} package with the similar interface we have shown until now. +The example we used above is a bit outdated. The modern "deep" neural networks provide much more flexibility in the number of nodes, number of layers, and regularization options. 
In many areas, especially computer vision deep neural networks are the state-of-the-art [@lecun2015deep]. These modern implementations of neural networks are available in R via the `keras`\index{R Packages!\texttt{keras}} package and can also be trained via the `caret`\index{R Packages!\texttt{caret}} package with the similar interface we have shown until now. ```{block2, DL, type='rmdtip'} __Want to know more ?__ -- Deep neural networks in R (https://keras.rstudio.com/). There are examples and background information on deep neural networks. -- Online demo for neural networks(https://cs.stanford.edu/~karpathy/svmjs/demo/demonn.html). You can see the affect of number of hidden layers and number of nodes on the decision boundary. +- Deep neural networks in R: (https://keras.rstudio.com/). There are examples and background information on deep neural networks. +- Online demo for neural networks: (https://cs.stanford.edu/~karpathy/svmjs/demo/demonn.html). You can see the effect of the number of hidden layers and number of nodes on the decision boundary. ``` ### Ensemble learning -Ensemble learning \index{ensemble learning}models are simply combinations of different machine learning models. By now, we already introduced the concept of ensemble learning in random forests and gradient boosting. However, this concept can be generalized to combining all kinds of different models. "random forests" is an ensemble of the same type of models, decision trees. We can also have ensembles of different types of models. For example, we can combine random Forest, k-NN and elastic net models, and make class predictions based on the votes from those different models. Below, we are showing how to do this. We are going to get predictions for three different models on the test set, use majority voting to decide on the class label and then check performance using `caret::confusionMatrix()`. +Ensemble learning \index{ensemble learning}models are simply combinations of different machine learning models. By now, we already introduced the concept of ensemble learning in random forests and gradient boosting. However, this concept can be generalized to combining all kinds of different models. "Random forests" is an ensemble of the same type of models, decision trees. We can also have ensembles of different types of models. For example, we can combine random forest, k-NN and elastic net models, and make class predictions based on the votes from those different models. Below, we are showing how to do this. We are going to get predictions for three different models on the test set, use majority voting to decide on the class label, and then check performance using `caret::confusionMatrix()`. ```{r, simpleEnsembl} # predict with k-NN model knnPred=as.character(predict(knnFit,testing[,-1],type="class")) @@ -895,21 +897,21 @@ confusionMatrix(data=testing[,1], ``` -In the test set, we were able to obtain perfect accuracy after voting. More complicated and accurate ways to build ensembles exist. We could also use mean of class probabilities instead of voting for final class predictions. We can even combine models in a regression based scheme to assign weights to the votes or to the predicted class probabilities of each model. In these cases, the prediction performance of the ensembles can also be tested with sampling techniques such as cross-validation. You can think of this as another layer of optimization or modeling for combining results from different models. 
We will not pursue this further in this chapter but packages such as [`caretEnsemble`](https://cran.r-project.org/web/packages/caretEnsemble/), [`SuperLearner`](https://cran.r-project.org/web/packages/SuperLearner/index.html) or [`mlr`](https://mlr.mlr-org.com/) can combine models in various ways described above. \index{R Packages!\texttt{caretEnsemble}} +In the test set, we were able to obtain perfect accuracy after voting. More complicated and accurate ways to build ensembles exist. We could also use the mean of class probabilities instead of voting for final class predictions. We can even combine models in a regression-based scheme to assign weights to the votes or to the predicted class probabilities of each model. In these cases, the prediction performance of the ensembles can also be tested with sampling techniques such as cross-validation. You can think of this as another layer of optimization or modeling for combining results from different models. We will not pursue this further in this chapter but packages such as [`caretEnsemble`](https://cran.r-project.org/web/packages/caretEnsemble/), [`SuperLearner`](https://cran.r-project.org/web/packages/SuperLearner/index.html) or [`mlr`](https://mlr.mlr-org.com/) can combine models in various ways described above. \index{R Packages!\texttt{caretEnsemble}} \index{R Packages!\texttt{SuperLearner}} \index{R Packages!\texttt{mlr}} -## Predicting continuous variables: regression with machine learning -Until now, we only considered methods that can help us predict class labels. However, all the methods we have shown can also be used to predict continuous variables. In this case, the methods will try to optimize the prediction in error which is usually in the form of sum of squared errors (SSE): $SSE=\sum (Y-f(X))^2$, where $Y$ is the continuous response variable and $f(X)$ is the outcome of the machine learning model. \index{sum of squared errors (SSE)} +## Predicting continuous variables: Regression with machine learning +Until now, we only considered methods that can help us predict class labels. However, all the methods we have shown can also be used to predict continuous variables. In this case, the methods will try to optimize the prediction error, which is usually in the form of the sum of squared errors (SSE): $SSE=\sum (Y-f(X))^2$, where $Y$ is the continuous response variable and $f(X)$ is the outcome of the machine learning model. \index{sum of squared errors (SSE)} -In this section, we are going to show how to use a supervised learning method for regression. All the methods we have introduced previously in the context of classification can also do regression. Technically, this is just a simple change in the cost function format and optimization step still tries to optimize the parameters of the cost function. In many cases, if your response variable is numeric methods in the `caret` package will automatically apply regression. +In this section, we are going to show how to use a supervised learning method for regression. All the methods we have introduced previously in the context of classification can also do regression. Technically, this is just a simple change in the cost function format and the optimization step still tries to optimize the parameters of the cost function. In many cases, if your response variable is numeric, methods in the `caret` package will automatically apply regression. 
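To make the SSE cost function above concrete, here is a minimal sketch that computes SSE (and the closely related root mean squared error) for a handful of made-up observed and predicted values; the `observed` and `predicted` vectors are hypothetical and serve only to illustrate the formula.
```{r, sseToySketch}
# hypothetical continuous response values (Y) and model predictions (f(X))
observed  <- c(34, 51, 28, 45, 62)
predicted <- c(36, 47, 30, 41, 66)

# sum of squared errors, the quantity the regression methods try to minimize
sum((observed - predicted)^2)

# root mean squared error, a related and more interpretable error metric
sqrt(mean((observed - predicted)^2))
```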
### Use case: Predicting age from DNA methylation -We will demonstrate random forest regression using a different data set which has a continuous response variable. This time we are going to try to predict age of individuals from their DNA methylation \index{DNA methylation} levels. Methylation is a DNA modification which has implications in gene regulation and cell state. We have introduced DNA methylation in depth in Chapters \@ref(intro) and \@ref(bsseq), however for now what we need to know is that there are 24 million CpG dinucleotides in the human genome. Their methylation status can be measured with quantitative assays and the value is between 0 and 1. If it is 0, the CpG is not methylated in any of the cells in the sample, and if it is 1, the CpG is methylated in all the cells of the sample. It has been shown that methylation is predictive of the age of the individual that the sample is taken from [@numata2012dna; @horvath2013dna]. Now, we will try to test that with a data set containing hundreds of individuals, their age and methylation values for ~27000 CpGs. We first read in the files and construct a training set. +We will demonstrate random forest regression using a different data set which has a continuous response variable. This time we are going to try to predict the age of individuals from their DNA methylation \index{DNA methylation} levels. Methylation is a DNA modification which has implications in gene regulation and cell state. We have introduced DNA methylation in depth in Chapters \@ref(intro) and \@ref(bsseq), however for now, what we need to know is that there are about 24 million CpG dinucleotides in the human genome. Their methylation status can be measured with quantitative assays and the value is between 0 and 1. If it is 0, the CpG is not methylated in any of the cells in the sample, and if it is 1, the CpG is methylated in all the cells of the sample. It has been shown that methylation is predictive of the age of the individual that the sample is taken from [@numata2012dna; @horvath2013dna]. Now, we will try to test that with a data set containing hundreds of individuals, their age, and methylation values for ~27000 CpGs. We first read in the files and construct a training set. -### reading and processing the data -Let us first read in the data. When we run the summary and histogram we see that the methylation values are between 0 and 1 and there are 108 samples (See Figure \@ref(fig:readMethAge) ). Typically, methylation values have bimodal distribution. In this case many of them have values around 0 and second most frequent value bracket is around 0.9. +### Reading and processing the data +Let us first read in the data. When we run the summary and histogram we see that the methylation values are between $0$ and $1$ and there are $108$ samples (see Figure \@ref(fig:readMethAge) ). Typically, methylation values have bimodal distribution. In this case many of them have values around $0$ and the second-most frequent value bracket is around $0.9$. ```{r, readMethAge,out.width='60%',fig.width=4.5,fig.cap="Histogram of methylation values in the training set for age prediction."} # file path for CpG methylation and age fileMethAge=system.file("extdata", @@ -925,16 +927,16 @@ hist(unlist(ameth[,-1]),border="white", col="cornflowerblue",main="",xlab="methylation values") ``` -There are ~27000 predictor variables. We should remove the ones that have low variation across samples. 
In this case, the methylation values are between 0 and 1, the CpGs that have low variation are not likely to have any association with age, they could simply be technical variation of the experiment. We will remove CpGs that have less than 0.1 standard deviation. +There are $~27000$ predictor variables. We can remove the ones that have low variation across samples. In this case, the methylation values are between $0$ and $1$. The CpGs that have low variation are not likely to have any association with age; they could simply be technical variation of the experiment. We will remove CpGs that have less than 0.1 standard deviation. ```{r, readMethAgeremove} ameth=ameth[,c(TRUE,matrixStats::colSds(as.matrix(ameth[,-1]))>0.1)] dim(ameth) ``` ### Running random forest regression -Now we can use random forest regression to predict the age from methylation values. We are then going to plot predicted vs observed ages and see how well our predictions are. The resulting plots are shown in Figure \@ref(fig:predictAge). \index{random forest regression} +Now we can use random forest regression to predict the age from methylation values. We are then going to plot the predicted vs. observed ages and see how good our predictions are. The resulting plots are shown in Figure \@ref(fig:predictAge). \index{random forest regression} -```{r, predictAge, fig.width=11,out.width='70%', fig.cap="Observed vs predicted age (Left). Residual plot showing for older people the error increases (Right)"} +```{r, predictAge, fig.width=11,out.width='80%', fig.cap="Observed vs. predicted age (Left). Residual plot showing that for older people the error increases (Right)."} set.seed(18) @@ -970,7 +972,7 @@ abline(h=0,col="red4",lty=2) ``` -In this instance, we are using OOB errors and $R^2$ value which shows how the model performs on OOB samples. The model can capture the general trend and it has acceptable OOB performance. It is not perfect as it makes errors on average close to 10 years when predicting the age, and the errors are more severe for older people (Figure \@ref(fig:predictAge)). This could be due to having fewer older people to model or missing/inadequate predictor variables. However, everything we discussed in classification applies here. We had even fewer data points than the classification problem we did not do a split for a test data set. This should also be done for regression problems especially when we are going to compare the performance of different models or want to have a better idea of the real world performance of our model. We might be also interested in which variables are most important as in the classification problem, we can use `caret:varImp()` function to get access to random forest specific variable importance metrics. +In this instance, we are using OOB errors and $R^2$ value which shows how the model performs on OOB samples. The model can capture the general trend and it has acceptable OOB performance. It is not perfect as it makes errors on average close to 10 years when predicting the age, and the errors are more severe for older people (Figure \@ref(fig:predictAge)). This could be due to having fewer older people to model or missing/inadequate predictor variables. However, everything we discussed in classification applies here. We had even fewer data points than the classification problem, so we did not do a split for a test data set. 
However, this should also be done for regression problems, especially when we are going to compare the performance of different models or want to have a better idea of the real-world performance of our model. As in the classification problem, we might also be interested in which variables are most important; we can use the `caret::varImp()` function to get access to random-forest-specific variable importance metrics. ## Exercises @@ -994,13 +996,13 @@ patient=readRDS(fileLGGann) ``` -1. Our first task is to not use any data transformation and do classification. Run the k-NN classifier on the data without any transformation or scaling, what is the effect on classification accuracy for k-NN predicting the CIMP and noCIMP status of the patient?[Difficulty: **Beginner**] +1. Our first task is to not use any data transformation and do classification. Run the k-NN classifier on the data without any transformation or scaling. What is the effect on classification accuracy for k-NN predicting the CIMP and noCIMP status of the patient? [Difficulty: **Beginner**] -2. Bootstrap resampling can be used to measure the variability of the prediction error. Use bootstrap resampling with k-NN for the prediction accuracy. How different is it from cross-validation for different $k$s?[Difficulty: **Intermediate**] +2. Bootstrap resampling can be used to measure the variability of the prediction error. Use bootstrap resampling with k-NN for the prediction accuracy. How different is it from cross-validation for different $k$s? [Difficulty: **Intermediate**] -3. There are a number of ways to get get variable importance for a classification problem. Run Random Forests on the classification problem above. Compare the variable importance metrics from random forest and the one obtained from DALEX. How many variables are the same in the top 10?[Difficulty: **Advanced**] +3. There are a number of ways to get variable importance for a classification problem. Run random forests on the classification problem above. Compare the variable importance metrics from random forest and the one obtained from DALEX. How many variables are the same in the top 10? [Difficulty: **Advanced**] -4. Come up with a unified importance score by normalizing importance scores from Random Forests and DALEX, followed by taking the average of those scores.[Difficulty: **Advanced**] +4. Come up with a unified importance score by normalizing importance scores from random forests and DALEX, followed by taking the average of those scores. [Difficulty: **Advanced**] ### Regression For this set of problems we will use the regression data set where we tried to predict the age of the sample from the methylation values. The data can be loaded as shown below: @@ -1014,9 +1016,9 @@ fileMethAge=system.file("extdata", ameth=readRDS(fileMethAge) ``` -1. Run random forest regression and plot the importance metrics.[Difficulty: **Beginner**] +1. Run random forest regression and plot the importance metrics. [Difficulty: **Beginner**] -2. Split 20% of the methylation-age data as test data and run elastic net regression on the training portion to tune parameters and test it on the test portion.[Difficulty: **Intermediate**] +2. Split 20% of the methylation-age data as test data and run elastic net regression on the training portion to tune parameters and test it on the test portion. [Difficulty: **Intermediate**] -3. Run an ensemble model for regression using **caretEnsemble** or **mlr** package and compare the results with the elastic net and random forest model. 
Did the test accuracy increase? -**HINT:** You need to install these extra packages and learn how to use them in the context of ensemble models.[Difficulty: **Advanced**] \ No newline at end of file +3. Run an ensemble model for regression using the **caretEnsemble** or **mlr** package and compare the results with the elastic net and random forest model. Did the test accuracy increase? +**HINT:** You need to install these extra packages and learn how to use them in the context of ensemble models. [Difficulty: **Advanced**] \ No newline at end of file diff --git a/06-genomicIntervals.Rmd b/06-genomicIntervals.Rmd index f37d452..899a63e 100644 --- a/06-genomicIntervals.Rmd +++ b/06-genomicIntervals.Rmd @@ -12,32 +12,32 @@ knitr::opts_chunk$set(echo = TRUE, fig.align = 'center') ``` -A considerable time in computational genomics is spent on overlapping different +Considerable time in computational genomics is spent on overlapping different features of the genome. Each feature can be represented with a genomic interval within the chromosomal coordinate system. In addition, each interval can carry different sorts of information. An interval may for instance represent exon coordinates or a transcription factor binding site. On the other hand, -you can have base-pair resolution, continuous scores over the genome such as read coverage or +you can have base-pair resolution, continuous scores over the genome such as read coverage, or scores that could be associated with only certain bases such as in the case of CpG -methylation (See Figure \@ref(fig:gintsum) ). -Typically, you will need to overlap intervals on interest with other features of +methylation (see Figure \@ref(fig:gintsum) ). +Typically, you will need to overlap intervals of interest with other features of the genome, again represented as intervals. For example, you may want to overlap -transcription factor binding sites with CpG islands or promoters to quantify what percentage of binding sites overlap with your regions of interest. Overlapping mapped reads from high-throughput sequencing experiments with genomic features such as exons, promoters, enhancers can also be classified as operations on genomic intervals. You can think of a million other ways that involves overlapping two sets of different features on the genome. This chapter aims to show how to do analysis involving operations on genomic intervals. +transcription factor binding sites with CpG islands or promoters to quantify what percentage of binding sites overlap with your regions of interest. Overlapping mapped reads from high-throughput sequencing experiments with genomic features such as exons, promoters, and enhancers can also be classified as operations on genomic intervals. You can think of a million other ways that involve overlapping two sets of different features on the genome. This chapter aims to show how to do analysis involving operations on genomic intervals. 
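Before diving into the details, here is a minimal sketch of the kind of overlap operation this chapter revolves around; the `GRanges` objects and the `findOverlaps()` function used here are introduced step by step in the following sections, and the coordinates are made up.

```{r, overlapTeaserSketch, eval=FALSE}
library(GenomicRanges)
# two toy interval sets on the same chromosome
peaks <- GRanges("chr1", IRanges(start = c(100, 500), end = c(200, 600)))
cpgi  <- GRanges("chr1", IRanges(start = 150, end = 250))
# which peak overlaps which CpG island?
findOverlaps(peaks, cpgi)
```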
-```{r,gintsum,fig.cap="Summary of genomic intervals with different kinds of information",fig.align = 'center',out.width='75%',echo=FALSE} +```{r,gintsum,fig.cap="Summary of genomic intervals with different kinds of information.",fig.align = 'center',out.width='75%',echo=FALSE} knitr::include_graphics("images/genomeIntervalSummary.png" ) ``` -## Operations on Genomic Intervals with GenomicRanges package -The [Bioconductor](http://bioconductor.org) project has a dedicated package called [`GenomicRanges`](http://www.bioconductor.org/packages/release/bioc/html/GenomicRanges.html) to deal with genomic intervals. In this section, we will provide use cases involving operations on genomic intervals. The main reason we will stick to this package is that it provides tools to do overlap operations. However package requires that users operate on specific data types that are conceptually similar to a tabular data structure implemented in a way that makes overlapping and related operations easier. The main object we will be using is called `GRanges` object and we will also see some other related objects from the `GenomicRanges` package.\index{R Packages!\texttt{GenomicRanges}} +## Operations on genomic intervals with `GenomicRanges` package +The [Bioconductor](http://bioconductor.org) project has a dedicated package called [`GenomicRanges`](http://www.bioconductor.org/packages/release/bioc/html/GenomicRanges.html) to deal with genomic intervals. In this section, we will provide use cases involving operations on genomic intervals. The main reason we will stick to this package is that it provides tools to do overlap operations. However, the package requires that users operate on specific data types that are conceptually similar to a tabular data structure implemented in a way that makes overlapping and related operations easier. The main object we will be using is called the `GRanges` object and we will also see some other related objects from the `GenomicRanges` package.\index{R Packages!\texttt{GenomicRanges}} ### How to create and manipulate a GRanges object -`GRanges` (from `GenomicRanges` package) is the main object that holds the genomic intervals and extra information about those intervals. Here we will show how to create one. Conceptually, it is similar to a data frame and some operations such as using `[ ]` notation to subset the table will work also on `GRanges`, but keep in mind that not everything that works for data frames will work on `GRanges` objects. +`GRanges` (from `GenomicRanges` package) is the main object that holds the genomic intervals and extra information about those intervals. Here we will show how to create one. Conceptually, it is similar to a data frame and some operations such as using `[ ]` notation to subset the table will also work on `GRanges`, but keep in mind that not everything that works for data frames will work on `GRanges` objects. ```{r,createGR} library(GenomicRanges) @@ -50,9 +50,9 @@ gr # subset like a data frame gr[1:2,] ``` -As you can see it looks a bit like a data frame. Also, note that the peculiar second argument “ranges” which basically contains start and end positions of the genomic intervals. However, you can not just give start and end positions you actually have to provide another object of `IRanges`. Do not let this confuse you, `GRanges` actually depends on another object that is very similar to itself called `IRanges` and you have to provide the “ranges” argument as an `IRanges` object. 
In its simplest for, an `IRanges` object can be constructed by providing start and end positions to `IRanges()` function. Think of it as something you just have to provide in order to construct the `GRanges` object. +As you can see, it looks a bit like a data frame. Also, note that the peculiar second argument “ranges” basically contains the start and end positions of the genomic intervals. However, you cannot just give start and end positions, you actually have to provide another object of `IRanges`. Do not let this confuse you; `GRanges` actually depends on another object that is very similar to itself called `IRanges` and you have to provide the “ranges” argument as an `IRanges` object. In its simplest form, an `IRanges` object can be constructed by providing start and end positions to the `IRanges()` function. Think of it as something you just have to provide in order to construct the `GRanges` object. -`GRanges` can also contain other information about the genomic interval such as scores, names, etc. You can provide extra information at the time of the construction or you can add it later. Here is how you can do those: +`GRanges` can also contain other information about the genomic interval such as scores, names, etc. You can provide extra information at the time of the construction or you can add it later. Here is how you can do that: ```{r,createGRwMetadata} gr=GRanges(seqnames=c("chr1","chr2","chr2"), @@ -87,7 +87,7 @@ gr ### Getting genomic regions into R as GRanges objects -There are multiple ways you can read in your genomic features into R and create a `GRanges` object. Most genomic interval data comes as a tabular format that has the basic information about the location of the interval and some other information. We already showed how to read BED files as a data frame in Chapter \@ref(Rintro). Now we will show how to convert it to `GRanges` object. This is one way of doing it, but there are more convenient ways described further in the text. +There are multiple ways you can read your genomic features into R and create a `GRanges` object. Most genomic interval data comes in a tabular format that has the basic information about the location of the interval and some other information. We already showed how to read BED files as a data frame in Chapter \@ref(Rintro). Now we will show how to convert it to the `GRanges` object. This is one way of doing it, but there are more convenient ways described further in the text. ```{r,convertDataframe2gr} # read CpGi data set @@ -130,11 +130,11 @@ start(tss.gr[strand(tss.gr)=="-",])=end(tss.gr[strand(tss.gr)=="-",]) tss.gr=tss.gr[!duplicated(tss.gr),] ``` -Another way of doing this is from a BED file is to use `readTranscriptfeatures()` +Another way of doing this from a BED file is to use the `readTranscriptfeatures()` function from the `genomation` package. This function takes care of the steps described in the code chunk above. -Reading the genomic features as text files and converting to `GRanges` is not the only way to create `GRanges` object. With the help of the [`rtracklayer`](http://www.bioconductor.org/packages/release/bioc/html/rtracklayer.html) package we can directly import BED files.\index{R Packages!\texttt{rtracklayer}} +Reading the genomic features as text files and converting to `GRanges` is not the only way to create a `GRanges` object. 
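As an aside, if your intervals are already sitting in a data frame with recognizable column names, the conversion can also be done in a single call with `GenomicRanges::makeGRangesFromDataFrame()`. The sketch below uses made-up column names and values.

```{r, df2grSketch, eval=FALSE}
library(GenomicRanges)
# a hypothetical data frame of intervals with an extra score column
df <- data.frame(chr   = c("chr1", "chr2"),
                 start = c(100, 200),
                 end   = c(150, 260),
                 score = c(5, 10))
# "chr", "start" and "end" are recognized automatically;
# keep.extra.columns=TRUE carries the score over as metadata
gr <- makeGRangesFromDataFrame(df, keep.extra.columns = TRUE)
gr
```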
With the help of the [`rtracklayer`](http://www.bioconductor.org/packages/release/bioc/html/rtracklayer.html) package we can directly import BED files.\index{R Packages!\texttt{rtracklayer}} ```{r,importbed_rtracklayer,eval=FALSE} require(rtracklayer) @@ -145,7 +145,7 @@ import.bed(filePathRefseq) ``` -Next, we will show how to use other methods to automatically obtain the data in `GRanges` format from online databases. But you will not be able to use these methods for every data set so it is good to now how to read data from flat files as well. We will use `rtracklayer` package to download data from the UCSC Genome Browser \index{UCSC Genome Browser}. We will download CpG islands as `GRanges` objects. The `rtracklayer` workflow we show below works like using the UCSC table browser. You need to select which species you are working with, then you need to select which dataset you need to download and lastly you download the UCSC dataset or track as `GRanges` object. +Next, we will show how to use other methods to automatically obtain the data in the `GRanges` format from online databases. But you will not be able to use these methods for every data set, so it is good to know how to read data from flat files as well. We will use the `rtracklayer` package to download data from the UCSC Genome Browser\index{UCSC Genome Browser}. We will download CpG islands as `GRanges` objects. The `rtracklayer` workflow we show below works like using the UCSC table browser. You need to select which species you are working with, then you need to select which dataset you need to download and lastly you download the UCSC dataset or track as a `GRanges` object. ```{r,importFromUCSC,eval=FALSE} require(rtracklayer) @@ -157,23 +157,23 @@ query <- ucscTableQuery(session, track="CpG Islands",table="cpgIslandExt", ## get the GRanges object for the track track(query) ``` -There is also an interface to Ensembl database called [biomaRt](https://bioconductor.org/packages/release/bioc/html/biomaRt.html). \index{R Packages!\texttt{biomaRt}} +There is also an interface to the Ensembl database called [biomaRt](https://bioconductor.org/packages/release/bioc/html/biomaRt.html). \index{R Packages!\texttt{biomaRt}} This package will enable you to access and import all of the datasets included in Ensembl. Another similar package is [AnnotationHub](https://bioconductor.org/packages/release/bioc/html/AnnotationHub.html).\index{R Packages!\texttt{AnnotationHub}} This package is an aggregator for different datasets from various sources. -Using `AnnotationHub` one can access data sets from UCSC browser, Ensembl browser -and data sets from genomics consortia such as ENCODE and Roadmap Epigenomics\index{ENCODE}.\index{Roadmap Epigenomics} -We provide examples of using `Biomart` package further into the chapter. In addition, `AnnotationHub` package is used in Chapter \@ref(chipseq). +Using `AnnotationHub` one can access data sets from the UCSC browser, Ensembl browser +and datasets from genomics consortia such as ENCODE and Roadmap Epigenomics\index{ENCODE}.\index{Roadmap Epigenomics} +We provide examples of using `Biomart` package further into the chapter. In addition, the `AnnotationHub` package is used in Chapter \@ref(chipseq). #### Frequently used file formats and how to read them into R as a table There are multiple file formats in genomics but some of them you will see more frequently than others. We already mentioned some of them. 
Here is a list of files -and functions to read them into R as `GRanges` objects or something coercible to +and functions that can read them into R as `GRanges` objects or something coercible to `GRanges` objects. -1) **BED**: These are used and popularized by UCSC browser, and can hold a variety of -information including exon/intron structure of transcripts in a single line. We will be using BED files in this chapter. In its simplest form, the BED file contains chromosome name, start positions and end position for a genomic feature of interest.\index{BED file} +1) **BED**: This format is used and popularized by the UCSC browser, and can hold a variety of +information including exon/intron structure of transcripts in a single line. We will be using BED files in this chapter. In its simplest form, the BED file contains the chromosome name, the start position and end position for a genomic feature of interest.\index{BED file} - `genomation::readBed()` - `genomation::readTranscriptFeatures()` good for getting intron/exon/promoters from BED12 files - `rtracklayer::import.bed()` @@ -183,10 +183,10 @@ it is a more flexible format than BED, which makes it harder to parse at times. - `genomation::gffToGranges()` - `rtracklayer::import.gff()` -3) **BAM/SAM**: BAM format is compressed and indexed tabular file format designed for aligned sequencing reads. SAM is the uncompressed version of BAM file. We will touch upon BAM files in this chapter. The uncompressed SAM file is similar in spirit to BED file where you have the basic location on chromosome information plus additional columns that are related to the quality of alignment or other relevant information. We will introduce this format in detail later in this chapter.\index{BAM file} +3) **BAM/SAM**: BAM format is a compressed and indexed tabular file format designed for aligned sequencing reads. SAM is the uncompressed version of the BAM file. We will touch upon BAM files in this chapter. The uncompressed SAM file is similar in spirit to a BED file where you have the basic chromosomal location information plus additional columns that are related to the quality of alignment or other relevant information. We will introduce this format in detail later in this chapter.\index{BAM file} \index{SAM file} - `GenomicAlignments::readGAlignments` - - `Rsamtools::scanBam` returns a data frame with columns from SAM/BAM file. + - `Rsamtools::scanBam` returns a data frame with columns from a SAM/BAM file. 4) **BigWig**: This is used for storing scores associated with genomic intervals. It is an indexed format. Similar to BAM, this makes it easier to query, and only the necessary portions of the file can be loaded into memory. @@ -205,9 +205,9 @@ formats are mostly used to store genomic variation data such as SNPs and indels. ### Finding regions that do/do not overlap with another set of regions -This is one of the most common tasks in genomics. Usually, you have a set of regions that you are interested in and you want to see if they overlap with another set of regions or see how many of them overlap. A good example is transcription factor binding sites determined by [ChIP-seq](http://en.wikipedia.org/wiki/ChIP-sequencing) experiments. We will introduce ChIP-seq in more detail in Chapter \@ref(chipseq). However, in these types of experiments and followed analysis, one usually ends up with genomic regions that are bound by transcription factors.
One of the standard next questions would be to annotate binding sites with genomic annotations such as promoter,exon,intron and/or CpG islands, which are important for gene regulation. Below is a demonstration of how transcription factor binding sites can be annotated using CpG islands\index{CpG island}. First, we will get the subset of binding sites that overlap with the CpG islands. In this case, binding sites are ChIP-seq peaks.\index{ChIP-seq} +This is one of the most common tasks in genomics. Usually, you have a set of regions that you are interested in and you want to see if they overlap with another set of regions or see how many of them overlap. A good example is transcription factor binding sites determined by [ChIP-seq](http://en.wikipedia.org/wiki/ChIP-sequencing) experiments. We will introduce ChIP-seq in more detail in Chapter \@ref(chipseq). However, in these types of experiments and the following analysis, one usually ends up with genomic regions that are bound by transcription factors. One of the standard next questions would be to annotate binding sites with genomic annotations such as promoter, exon, intron and/or CpG islands, which are important for gene regulation. Below is a demonstration of how transcription factor binding sites can be annotated using CpG islands\index{CpG island}. First, we will get the subset of binding sites that overlap with the CpG islands. In this case, binding sites are ChIP-seq peaks.\index{ChIP-seq} -In the code snippet below, we read the ChIP-seq analysis output files using `genomation::readBroadPeak()` function. This function directly outputs a `GRanges` object. These output files are similar to BED files, where the location of the predicted binding sites are written out in a tabular format with some analysis related scores and/or P-values. After reading the files, we can find the subset of peaks that overlap with the CpG islands using the `subsetByoverlaps()` function. +In the code snippet below, we read the ChIP-seq analysis output files using the `genomation::readBroadPeak()` function. This function directly outputs a `GRanges` object. These output files are similar to BED files, where the locations of the predicted binding sites are written out in a tabular format with some analysis-related scores and/or P-values. After reading the files, we can find the subset of peaks that overlap with the CpG islands using the `subsetByOverlaps()` function. ```{r,findPeakwithCpGi} library(genomation) @@ -229,16 +229,16 @@ counts=countOverlaps(pk1.gr,cpgi.gr) head(counts) ``` -The `GenomicRanges::findOverlaps()` function can be used to see one-to-one overlaps between peaks and CpG islands. It returns a matrix showing which peak overlaps with which CpGi island. +The `GenomicRanges::findOverlaps()` function can be used to see one-to-one overlaps between peaks and CpG islands. It returns a `Hits` object that records which peak overlaps which CpG island. ```{r,findOverlaps} findOverlaps(pk1.gr,cpgi.gr) ``` -Another interesting thing would be to look at the distances to nearest CpG islands for each peak. In addition, just finding the nearest CpG island could also be interesting. Often times, you will need to find nearest TSS\index{transcription start site (TSS)} or gene to your regions of interest, and the code below is handy for doing that using `nearest()` and `distanceToNearest()` functions, the resulting plot is shown in Figure \@ref(fig:findNearest).
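Note that the `Hits` object returned by `findOverlaps()` is not a plain table; if you need the underlying indices for downstream work, they can be pulled out as integer vectors. A short sketch using the objects defined above:

```{r, hitsIndexSketch, eval=FALSE}
hits <- findOverlaps(pk1.gr, cpgi.gr)
# row numbers of the overlapping peaks and CpG islands, respectively
head(queryHits(hits))
head(subjectHits(hits))
```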
+Another interesting thing would be to look at the distances to the nearest CpG islands for each peak. In addition, just finding the nearest CpG island could also be interesting. Oftentimes, you will need to find the nearest TSS\index{transcription start site (TSS)} or gene to your regions of interest, and the code below is handy for doing that using the `nearest()` and `distanceToNearest()` functions; the resulting plot is shown in Figure \@ref(fig:findNearest). -```{r,findNearest,fig.cap="histogram of distances"} +```{r,findNearest,fig.cap="Histogram of distances of CpG islands to the nearest TSSes."} # find nearest CpGi to each TSS n.ind=nearest(pk1.gr,cpgi.gr) # get distance to nearest @@ -253,12 +253,12 @@ hist(log10(dist2plot),xlab="log10(dist to nearest TSS)", ``` ## Dealing with mapped high-throughput sequencing reads -The reads from sequencing machines are usually pre-proccessed and aligned to the genome with the help of specific bioinformatics tools. We have introduced the details of general read processing , quality check and alignment methods in chapter \@ref(processingReads). In this section we will deal with mapped reads. Since each mapped read has a start and end position the genome, mapped reads can be thought as genomic intervals stored in a file. After mapping, the next task is to quantify the enrichment of those aligned reads in the regions of interest. You may want to count how many reads overlapping with your promoter set of interest or you may want to quantify RNA-seq reads\index{RNA-seq} overlapping with exons. This is similar to operations on genomic intervals which are described previously. If you can read all your alignments into the memory and create a `GRanges` object, you can apply the previously described operations. However, most of the time we can not read all mapped reads into the memory, so we have to use specialized tools to query and quantify alignments on a given set of regions. One of the most common alignment formats is SAM/BAM format, most aligners will produce SAM/BAM output or you will be able to convert your specific alignment format to SAM/BAM format. The BAM format is a binary version of the human readable SAM format. The SAM format has specific columns that contain different kind of information about the alignment such as mismatches, qualities etc. (see [http://samtools.sourceforge.net/SAM1.pdf] for SAM format specification). +The reads from sequencing machines are usually pre-processed and aligned to the genome with the help of specific bioinformatics tools. We have introduced the details of general read processing, quality check and alignment methods in Chapter \@ref(processingReads). In this section we will deal with mapped reads. Since each mapped read has a start and end position on the genome, mapped reads can be thought of as genomic intervals stored in a file. After mapping, the next task is to quantify the enrichment of those aligned reads in the regions of interest. You may want to count how many reads overlap with your promoter set of interest or you may want to quantify how many RNA-seq reads\index{RNA-seq} overlap with exons. This is similar to operations on genomic intervals which are described previously. If you can read all your alignments into memory and create a `GRanges` object, you can apply the previously described operations. However, most of the time we cannot read all mapped reads into memory, so we have to use specialized tools to query and quantify alignments on a given set of regions. One such tool is sketched below.
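For example, `GenomicAlignments::summarizeOverlaps()` produces a count table over a set of regions directly from BAM files, and if a `yieldSize` is set on the `BamFile` it processes the alignments in chunks rather than all at once. The sketch below is generic: `regions` stands for any `GRanges` of intervals and `"my.bam"` for any indexed BAM file; both names are placeholders, and the BAM machinery itself is introduced next.

```{r, summarizeOverlapsSketch, eval=FALSE}
library(GenomicAlignments)
library(Rsamtools)
# 'regions' is a GRanges of intervals; "my.bam" is a placeholder path
counts.se <- summarizeOverlaps(features = regions,
                               reads    = BamFile("my.bam", yieldSize = 2000000),
                               mode     = "Union")
head(assay(counts.se))  # read counts per region
```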
One of the most common alignment formats is SAM/BAM format, most aligners will produce SAM/BAM output or you will be able to convert your specific alignment format to SAM/BAM format. The BAM format is a binary version of the human-readable SAM format. The SAM format has specific columns that contain different kinds of information about the alignment such as mismatches, qualities etc. (see [http://samtools.sourceforge.net/SAM1.pdf] for SAM format specification). ### Counting mapped reads for a set of regions -`Rsamtools` package has functions to query BAM files\index{R Packages!\texttt{Rsamtools}}. The function we will use in the first example is countBam which takes input of the BAM file and param argument. “param” argument takes a ScanBamParam object. The object is instantiated using `ScanBamParam()` and contains parameters for scanning the BAM file. The example below is a simple example where `ScanBamParam()` only includes regions of interest, promoters on chr21. +The `Rsamtools` package has functions to query BAM files\index{R Packages!\texttt{Rsamtools}}. The function we will use in the first example is the `countBam()` function, which takes input of the BAM file and param argument. The `param` argument takes a `ScanBamParam` object. The object is instantiated using `ScanBamParam()` and contains parameters for scanning the BAM file. The example below is a simple example where `ScanBamParam()` only includes regions of interest, promoters on chr21. ```{r,countBam} promoter.gr=tss.gr @@ -277,7 +277,7 @@ counts=countBam(bamfilePath, param=param) ``` -Alternatively, aligned reads can be read in using `GenomicAlignments` package (which on this occasion relies on `Rsamtools` package).\index{R Packages!\texttt{GenomicAlignments}} +Alternatively, aligned reads can be read in using the `GenomicAlignments` package (which on this occasion relies on the `Rsamtools` package).\index{R Packages!\texttt{GenomicAlignments}} ```{r,readGAlignments} library(GenomicAlignments) @@ -285,17 +285,17 @@ alns <- readGAlignments(bamfilePath, param=param) ``` ## Dealing with continuous scores over the genome -Most high-throughput data can be viewed as a continuous score over the bases of the genome. In case of RNA-seq or ChIP-seq experiments the data can be represented as read coverage values per genomic base position\index{RNA-seq}\index{ChIP-seq}. In addition, other information (not necessarily from high-throughput experiments) can be represented this way. The GC content and conservation scores per base are prime examples of other data sets that can be represented as scores. This sort of data can be stored as a generic text file or can have special formats such as Wig (stands for wiggle) from UCSC, or the bigWig format is which is indexed binary format of the wig files\index{wig file}\index{bigWig file}. The bigWig format is great for data that covers large fraction of the genome with varying scores, because the file is much smaller than regular text files that have the same information and it can be queried easier since it is indexed. +Most high-throughput data can be viewed as a continuous score over the bases of the genome. In case of RNA-seq or ChIP-seq experiments, the data can be represented as read coverage values per genomic base position\index{RNA-seq}\index{ChIP-seq}. In addition, other information (not necessarily from high-throughput experiments) can be represented this way. 
The GC content and conservation scores per base are prime examples of other data sets that can be represented as scores over the genome. This sort of data can be stored as a generic text file or can have special formats such as Wig (stands for wiggle) from UCSC, or the bigWig format, which is an indexed binary format of the wig files\index{wig file}\index{bigWig file}. The bigWig format is great for data that covers a large fraction of the genome with varying scores, because the file is much smaller than regular text files that have the same information and it can be queried more easily since it is indexed. -In R/Bioconductor, the continuous data can also be represented in a compressed format, in a format called Rle vector, which stands for run-length encoded vector. This gives superior memory performance over regular vectors because repeating consecutive values are represented as one value in the Rle vector (See Figure \@ref(fig:Rle) ). +In R/Bioconductor, continuous data can also be represented in a compressed format, called Rle vector, which stands for run-length encoded vector. This gives superior memory performance over regular vectors because repeating consecutive values are represented as one value in the Rle vector (see Figure \@ref(fig:Rle)). -```{r,Rle,fig.cap="Rle encoding explained",fig.align = 'center',out.width='100%',echo=FALSE} +```{r,Rle,fig.cap="Rle encoding explained.",fig.align = 'center',out.width='100%',echo=FALSE} knitr::include_graphics("images/Rle_demo.png" ) ``` -Typically, for genome-wide data you will have a `RleList` object which is a list of Rle vectors per chromosome. You can obtain such vectors by reading the reads in and calling `coverage()` function from `GenomicRanges` package. Let's try that on the above data set.\index{R Packages!\texttt{GenomicRanges}} +Typically, for genome-wide data you will have an `RleList` object, which is a list of Rle vectors per chromosome. You can obtain such vectors by reading the reads in and calling the `coverage()` function from the `GenomicRanges` package. Let's try that on the above data set.\index{R Packages!\texttt{GenomicRanges}} ```{r,getCoverageFromAln} covs=coverage(alns) # get coverage vectors @@ -308,7 +308,7 @@ covs=coverage(bamfilePath, param=param) # get coverage vectors ``` -One of the most common ways of storing score data is, as mentioned, wig or bigWig format. Most of the ENCODE project\index{ENCODE} data can be downloaded in bigWig format. In addition, conservation scores can also be downloaded as wig/bigWig format. You can import bigWig files into R using `import()` function from `rtracklayer` package. However, it is generally not advisable to read the whole bigWig file in memory as it was the case with BAM files. Usually, you will be interested in only a fraction of the genome, such as promoters, exons etc. So it is best you extract the data for those regions and read those into memory rather than the whole file. Below we read the a bigWig file only for promoters. The operation returns an `GRanges` object with score column which indicates the scores in the BigWig file per genomic region. +One of the most common ways of storing score data is, as mentioned, the wig or bigWig format. Most of the ENCODE project\index{ENCODE} data can be downloaded in bigWig format. In addition, conservation scores can also be downloaded in the wig/bigWig format. You can import bigWig files into R using the `import()` function from the `rtracklayer` package. 
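In its simplest form, such an import reads the whole file into a `GRanges` object with a score column. A sketch is shown below; `bwFile` stands for a path to any bigWig file, and one such path is defined in the code just after this.

```{r, importBwWholeSketch, eval=FALSE}
library(rtracklayer)
# read an entire bigWig file; every interval carries a "score" column
bw.all <- import.bw(bwFile)
bw.all
```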
However, it is generally not advisable to read the whole bigWig file in memory as was the case with BAM files. Usually, you will be interested in only a fraction of the genome, such as promoters, exons etc. So it is best that you extract the data for those regions and read those into memory rather than the whole file. Below we read a bigWig file only for the bases on promoters. The operation returns a `GRanges` object with the score column which indicates the scores in the bigWig file per genomic region. ```{r,getRleFromBigWig} library(rtracklayer) @@ -319,7 +319,7 @@ bwFile=system.file("extdata", bw.gr=import(bwFile, which=promoter.gr) # get coverage vectors bw.gr ``` -Following this we can create an `RleList` object from the `GRanges` with `coverage()` function. +Following this we can create an `RleList` object from the `GRanges` with the `coverage()` function. ```{r,BigWigCov} cov.bw=coverage(bw.gr,weight = "score") @@ -332,10 +332,10 @@ Frequently, we will need to extract subsections of the Rle vectors or `RleList` We will need to do this to visualize that subsection or get some statistics out of those sections. For example, we could be interested in average coverage per base for the regions we are interested in. We have to extract those regions -from `RleList` object and apply summary statistics. Below, we show how to extract -subsections of `RleList` object. We are extracting promoter regions from ChIP-seq\index{ChIP-seq} -read coverage `RleList`. Following that, we will plot the one of the promoters associated coverage values. -```{r,getViews,fig.cap="Coverage vector extracted from RleList via Views() function is plotted as a line plot."} +from the `RleList` object and apply summary statistics. Below, we show how to extract +subsections of the `RleList` object. We are extracting promoter regions from the ChIP-seq\index{ChIP-seq} +read coverage `RleList`. Following that, we will plot one of the promoter's coverage values. +```{r,getViews,fig.cap="Coverage vector extracted from the RleList via the Views() function is plotted as a line plot."} myViews=Views(cov.bw,as(promoter.gr,"IRangesList")) # get subsets of coverage # there is a views object for each chromosome myViews @@ -346,7 +346,7 @@ plot(myViews[[1]][[5]],type="l") ``` Next, we are interested in average coverage per base for the promoters using summary -functions that works on Views object. +functions that work on the `Views` object. ```{r, viewMeans} # get the mean of the views head( @@ -361,32 +361,32 @@ head( ## Genomic intervals with more information: SummarizedExperiment class As we have seen, genomic intervals can be mainly contained in a `GRanges` object. -It can also contain additional columns associated with each interval, here +It can also contain additional columns associated with each interval. Here you can save information such as read counts or other scores associated with the interval. However, -genomics data is often have many layers. With `GRanges` you can have a table +genomic data often have many layers. With `GRanges` you can have a table associated with the intervals, but what happens if you have many tables and each table has some metadata associated with it. In addition, rows and columns might -have additional annotation that can not be contained by row or column names. -For these cases, `SummarizedExperiment` class is ideal. It can hold multi-layered +have additional annotation that cannot be contained by row or column names. +For these cases, the `SummarizedExperiment` class is ideal. 
It can hold multi-layered tabular data associated with each genomic interval and the meta-data associated with rows and columns, or associated with each table. For example, genomic intervals associated with the `SummarizedExperiment` object can be gene locations, and each tabular data structure can be RNA-seq read counts in a time course experiment. -Each table could represent different conditions in which experiments performed. +Each table could represent different conditions in which experiments are performed. The `SummarizedExperiment` class is outlined in the figure below (Figure \@ref(fig:SumExpOv) ). -```{r,SumExpOv,fig.cap="Overview of SummarizedExperiment class and functions. Adapted from SummerizedExperiment package vignette",fig.align = 'center',out.width='100%',echo=FALSE} +```{r,SumExpOv,fig.cap="Overview of SummarizedExperiment class and functions. Adapted from the SummarizedExperiment package vignette.",fig.align = 'center',out.width='100%',echo=FALSE} knitr::include_graphics("images/Summarized.Experiment.png" ) ``` ### Create a SummarizedExperiment object Here we show how to create a basic `SummarizedExperiment` object. We will first create a matrix of read counts. This matrix will represent read counts from -a series RNA-seq experiments from different time points. Following that, -we create `GRanges` object to represent the locations of the genes, and a table +a series of RNA-seq experiments from different time points. Following that, +we create a `GRanges` object to represent the locations of the genes, and a table for column annotations. This will include the names for the columns and any other value we want to represent. Finally, we will create a `SummarizedExperiment` object by combining all those pieces. @@ -421,7 +421,7 @@ Now that we have a `SummarizedExperiment` object, we can subset it and extract/c parts of it. #### Extracting parts of the object -`colData()` and `rowData()` extract the column associated and row associated +`colData()` and `rowData()` extract the column-associated and row-associated tables. `metaData()` extracts the meta-data table if there is any table associated. ```{r,extractSe} colData(se) # extract column associated data @@ -430,7 +430,7 @@ rowData(se) # extrac row associated data ``` To extract the main table or tables that contain the values of interest such -as read counts. We must use the `assays()` function. This returns a list of +as read counts, we must use the `assays()` function. This returns a list of `DataFrame` objects associated with the object. ```{r,assaysSe} assays(se) # extract list of assays @@ -446,7 +446,7 @@ assays(se)[[1]] # get the first table ``` -#### subsetting +#### Subsetting Subsetting is easy using `[ ]` notation. This is similar to the way we subset data frames or matrices. ```{r,subsetSe1} @@ -454,8 +454,8 @@ subset data frames or matrices. se[1:5, 1:3] ``` -One can also use `$` operator to subset based on `colData()` columns. You can -extract certain samples or in our case time points. +One can also use the `$` operator to subset based on `colData()` columns. You can +extract certain samples or in our case, time points. ```{r,subsetSe2,eval=FALSE} se[, se$timepoint == 1] ``` @@ -470,22 +470,22 @@ subsetByOverlaps(se, roi) ``` ## Visualizing and summarizing genomic intervals -Data integration and visualization is corner stone of genomic data analysis. Below, we will +Data integration and visualization is cornerstone of genomic data analysis. 
Below, we will show different ways of integrating and visualizing genomic intervals. These methods -can be use to visualize large amounts of data in a locus-specific or multi-loci +can be used to visualize large amounts of data in a locus-specific or multi-loci manner. ### Visualizing intervals on a locus of interest -Often times, we will be interested in particular genomic locus and try to visualize +Oftentimes, we will be interested in a particular genomic locus and try to visualize different genomic datasets over that locus. This is similar to looking at the data over one of the genome browsers. Below we will display genes, GpG islands and read \index{R Packages!\texttt{Gviz}} -coverage from a ChIP-seq experiment using `Gviz` package\index{ChIP-seq}.For `Gviz` package, we first need to +coverage from a ChIP-seq experiment using the `Gviz` package\index{ChIP-seq}. For the `Gviz` package, we first need to set the tracks to display. The tracks can be in various formats. They can be R objects such as `IRanges`,`GRanges` and `data.frame`, or they can be in flat file formats -such as BigWig,BED and BAM. After the tracks are set, we can display them with +such as bigWig, BED, and BAM. After the tracks are set, we can display them with the `plotTracks` function, the resulting plot is shown in Figure \@ref(fig:GvizExchp6). -```{r GvizExchp6,fig.cap="tracks visualized using Gviz"} +```{r GvizExchp6,fig.cap="Genomic data tracks visualized using the Gviz functions."} library(Gviz) # set tracks to display @@ -518,11 +518,11 @@ plotTracks(track.list,from=27698681,to=28083310,chromsome="chr21") ### Summaries of genomic intervals on multiple loci Looking at data one region at a time could be inefficient. One can summarize different data sets over thousands of regions of interest and identify patterns. -This summaries can include different data types such as motifs, read coverage +These summaries can include different data types such as motifs, read coverage and other scores associated with genomic intervals. The `genomation` package can summarize and help identify patterns in the datasets. The datasets can have -different kinds of information and multiple file types can be used such as BED, GFF, BAM and bigWig. We will look at H3K4me3 ChIP-seq \index{ChIP-seq} \index{histone modification}and DNAse-seq signals from H1 embryonic stem cell line. H3K4me3 is usually associated with promoters and regions with high DNAse-seq signal are associated with accessible regions, that means mostly regulatory regions. We will summarize those datasets around the transcription start sites (TSS)\index{transcription start site (TSS)} of genes on chromosome 20 of human hg19 assembly. We will first read the genes and extract the region around TSS, 500bp upstream and downstream. We will then create a matrix of ChIP-seq scores for those regions, each row will represent a region around a specific TSS and columns will be the scores per base. We will then plot, average enrichment values around the TSSes of genes on chromosome 20. -```{r metaRegionchp6,fig.cap="meta region plot using genomation"} +different kinds of information and multiple file types can be used such as BED, GFF, BAM and bigWig. We will look at H3K4me3 ChIP-seq \index{ChIP-seq} \index{histone modification}and DNAse-seq signals from the H1 embryonic stem cell line. H3K4me3 is usually associated with promoters and regions with high DNAse-seq signal are associated with accessible regions, which means mostly regulatory regions. 
We will summarize those datasets around the transcription start sites (TSS)\index{transcription start site (TSS)} of genes on chromosome 20 of the human hg19 assembly. We will first read the genes and extract the region around the TSS, 500bp upstream and downstream. We will then create a matrix of ChIP-seq scores for those regions. Each row will represent a region around a specific TSS and columns will be the scores per base. We will then plot average enrichment values around the TSS of genes on chromosome 20. +```{r metaRegionchp6,fig.cap="Meta-region plot using genomation."} # get transcription start sites on chr20 library(genomation) @@ -551,23 +551,23 @@ plotMeta(sm, profile.names = "H3K4me3", xcoords = c(-500,500), xlab="bases around TSS") ``` -The resulting plot is shown in Figure \@ref(fig:metaRegionchp6). The pattern we see is expected, there is a dip just around TSS \index{transcription start site (TSS)}and signal is more -intense on the downstream of the TSS. +The resulting plot is shown in Figure \@ref(fig:metaRegionchp6). The pattern we see is expected, there is a dip just around TSS \index{transcription start site (TSS)}and the signal is more +intense downstream of the TSS. We can also plot a heatmap where each row is a -region around TSS and color coded by enrichment. This can show us not only the -general pattern as in the meta-region -plot but also how many of the regions produce such a pattern. The `heatMatrix()` function shown below achieves that. The resulting heatmap plot is shown in Figure \@ref(fig:heatmatrix1Chp6). -```{r heatmatrix1Chp6,fig.cap="Heatmap of enrichment of H3K4me2 around TSS"} +region around the TSS and color coded by enrichment. This can show us not only the +general pattern, as in the meta-region +plot, but also how many of the regions produce such a pattern. The `heatMatrix()` function shown below achieves that. The resulting heatmap plot is shown in Figure \@ref(fig:heatmatrix1Chp6). +```{r heatmatrix1Chp6,fig.cap="Heatmap of enrichment of H3K4me2 around the TSS."} heatMatrix(sm,order=TRUE,xcoords = c(-500,500), xlab="bases around TSS") ``` -Here we saw that about half of the regions do not have any signal. In addition it seems the multi-modal profile we have observed earlier is more complicated. Certain regions seems to have signal on both sides of the TSS, \index{transcription start site (TSS)}whereas others have signal mostly on the downstream side. +Here we saw that about half of the regions do not have any signal. In addition it seems the multi-modal profile we have observed earlier is more complicated. Certain regions seem to have signal on both sides of the TSS, \index{transcription start site (TSS)}whereas others have signal mostly on the downstream side. Normally, there would be more than one experiment or we can integrate datasets from -public repositories. In this case, we can see how different signals look like on the regions we are interested in. Now, we will also use DNAse-seq data and create a list of matrices with our datasets and plot the average profile of the signals from both datasets. The resulting meta-region plot is shown in Figure \@ref(fig:heatmatrixlistchp6) -```{r heatmatrixlistchp6,fig.cap= "Average profiles of DNAse and H3K4me3 ChIP-seq",out.width='50%'} +public repositories. In this case, we can see how different signals look in the regions we are interested in. Now, we will also use DNAse-seq data and create a list of matrices with our datasets and plot the average profile of the signals from both datasets. 
The resulting meta-region plot is shown in Figure \@ref(fig:heatmatrixlistchp6). +```{r heatmatrixlistchp6,fig.cap= "Average profiles of DNAse and H3K4me3 ChIP-seq.",out.width='50%'} DNAseFile=system.file("extdata", "H1.ESC.dnase.chr20.bw", @@ -581,10 +581,10 @@ plotMeta(sml) ``` We should now look at the heatmaps side by side and we should also cluster the rows -based on their similarity. We will be using `multiHeatMatrix` since we have multiple `ScoreMatrix` objects in the list. In this case, we will also use `winsorize` argument to limit extreme values, -every score above 95th percentile will be equalized the the value of the 95th percentile. In addition, `heatMatrix` and `multiHeatMatrix` can cluster the rows. +based on their similarity. We will be using `multiHeatMatrix` since we have multiple `ScoreMatrix` objects in the list. In this case, we will also use the `winsorize` argument to limit extreme values, +every score above 95th percentile will be equalized the value of the 95th percentile. In addition, `heatMatrix` and `multiHeatMatrix` can cluster the rows. Below, we will be using k-means clustering with 3 clusters. -```{r,multiHeatMatrix,fig.cap= "Heatmaps of H3K4me3 and DNAse data",out.width='40%'} +```{r,multiHeatMatrix,fig.cap= "Heatmaps of H3K4me3 and DNAse data.",out.width='40%'} set.seed(1029) multiHeatMatrix(sml,order=TRUE,xcoords = c(-500,500), xlab="bases around TSS",winsorize = c(0,95), @@ -594,33 +594,33 @@ multiHeatMatrix(sml,order=TRUE,xcoords = c(-500,500), ``` -The resulting heatmaps are shown in \@ref(fig:multiHeatMatrix). These plots revealed a different picture than we have observed before. Almost half of the promoters have no signal for DNAse or H3K4me3; these\index{histone modification} regions are probably not active and associated genes are not expressed. For regions with H3K4me3 signal, there are two major patterns. One pattern where both downstream and upstream of the TSS are enriched. On the other pattern, mostly downstream of the TSS is enriched.\index{transcription start site (TSS)} +The resulting heatmaps are shown in Figure \@ref(fig:multiHeatMatrix). These plots revealed a different picture than we have observed before. Almost half of the promoters have no signal for DNAse or H3K4me3; these\index{histone modification} regions are probably not active and associated genes are not expressed. For regions with the H3K4me3 signal, there are two major patterns: one pattern where both downstream and upstream of the TSS are enriched, and on the other pattern, mostly downstream of the TSS is enriched.\index{transcription start site (TSS)} ### Making karyograms and circos plots Chromosomal karyograms and circos plots are beneficial for displaying data over the -whole genome of chromosomes of interest. Although,the information that can be +whole genome of chromosomes of interest, although the information that can be displayed over these large regions are usually not very clear and only large trends can be discerned by eye, such as loss of methylation in large regions or genome-wide. -Below, we are showing how to use `ggbio` package for plotting. +Below, we show how to use the `ggbio` package for plotting. This package has a slightly different syntax than base graphics. The syntax follows "grammar of graphics" logic, and depends on the `ggplot2` package we introduced in Chapter \@ref(Rintro). It is a deconstructed way of thinking about the plot. You add your data and apply mappings and transformations in order to achieve the final output. 
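As a reminder of what that layered logic looks like in plain **ggplot2**, here is a bare-bones sketch with made-up data: the data and the aesthetic mapping are declared first, and a geometry layer is then added on top.

```{r, ggLayerSketch, eval=FALSE}
library(ggplot2)
toy <- data.frame(x = 1:10, y = rnorm(10))
ggplot(toy, aes(x = x, y = y)) +  # data plus aesthetic mappings
  geom_point()                    # a layer that draws the points
```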
In `ggbio`, things are -relatively easy since a high-level function `autoplot` function will recognize \index{R Packages!\texttt{ggbio}} +relatively easy since a high-level function, the `autoplot` function, will recognize \index{R Packages!\texttt{ggbio}} most of the datatypes and guess the most appropriate plot type. You can change -it is behavior by applying low-level functions. We first get the sizes of chromosomes -and make a karyogram template. The empty karyogram is shown in \@ref(fig:karyo1). +its behavior by applying low-level functions. We first get the sizes of chromosomes +and make a karyogram template. The empty karyogram is shown in Figure \@ref(fig:karyo1). -```{r,karyo1,fig.cap= "Karyogram example"} +```{r,karyo1,fig.cap= "Karyogram example."} library(ggbio) data(ideoCyto, package = "biovizBase") p <- autoplot(seqinfo(ideoCyto$hg19), layout = "karyogram") p ``` -Next, we would like to plot CpG islands on this Karyogram. We simply do this -by adding a layer with `layout_karyogram` function. The resulting karyogram is shown in \@ref(fig:karyo2). -```{r,karyo2,fig.cap= "Karyogram of CpG islands over the human genome"} +Next, we would like to plot CpG islands on this karyogram. We simply do this +by adding a layer with the `layout_karyogram()` function. The resulting karyogram is shown in Figure \@ref(fig:karyo2). +```{r,karyo2,fig.cap= "Karyogram of CpG islands over the human genome."} # read CpG islands from a generic text file CpGiFile=filePath=system.file("extdata", @@ -635,12 +635,12 @@ p + layout_karyogram(cpgi.gr) ``` -Next, we would like to plot some data over the chromosomes. This could be ChIP-seq \index{ChIP-seq} +Next, we would like to plot some data over the chromosomes. This could be the ChIP-seq \index{ChIP-seq} signal -or any other signal over the genome, we will use CpG island scores from the data set -we read earlier. We will plot a point proportional to "obsExp" column in the data set. We use `ylim` argument to squish the chromosomal rectangles and plot on top of those. `aes` argument defines how the data is mapped to geometry. In this case, -the argument indicates that the points will have x coordinate from CpG island start positions and y coordinate from obsExp score of CpG islands. The resulting karyogram is shown in \@ref(fig:karyoCpG). -```{r,karyoCpG,fig.cap="Karyogram of CpG islands and their observed/expected scores over the human genome"} +or any other signal over the genome; we will use CpG island scores from the data set +we read earlier. We will plot a point proportional to "obsExp" column in the data set. We use the `ylim` argument to squish the chromosomal rectangles and plot on top of those. The `aes` argument defines how the data is mapped to geometry. In this case, +the argument indicates that the points will have an x coordinate from CpG island start positions and a y coordinate from the obsExp score of CpG islands. The resulting karyogram is shown in Figure \@ref(fig:karyoCpG). +```{r,karyoCpG,fig.cap="Karyogram of CpG islands and their observed/expected scores over the human genome."} p + layout_karyogram(cpgi.gr, aes(x= start, y = obsExp), geom="point", @@ -651,9 +651,9 @@ p + layout_karyogram(cpgi.gr, aes(x= start, y = obsExp), ``` -Another way to depict regions or quantitative signals on the chromosomes is circos plots. These are circular plots usually used for showing chromosomal rearrangements, but can also be used for depicting signals.`ggbio` package can produce all kinds of circos plots. 
Below, we will show how to use that for our CpG island score example, and the resulting plot is shown in \@ref(fig:circosCpG). +Another way to depict regions or quantitative signals on the chromosomes is circos plots. These are circular plots usually used for showing chromosomal rearrangements, but can also be used for depicting signals. The `ggbio` package can produce all kinds of circos plots. Below, we will show how to use that for our CpG island score example, and the resulting plot is shown in Figure \@ref(fig:circosCpG). -```{r,"circosCpG",fig.cap="circos plot for CpG islands scores"} +```{r,"circosCpG",fig.cap="Circos plot for CpG island scores."} # set the chromsome in a circle # color set to white to look transparent @@ -677,7 +677,7 @@ p ## Exercises -The data for the exercises is within `compGenomRData` package. +The data for the exercises is within the `compGenomRData` package. Run the following to see the data files. ``` @@ -686,7 +686,7 @@ dir(system.file("extdata", ``` You will need some of those files to complete the exercises. -### Operations on Genomic Intervals with GenomicRanges package +### Operations on genomic intervals with the `GenomicRanges` package 1. Create a `GRanges` object using the information in the table below:[Difficulty: **Beginner**] @@ -697,59 +697,57 @@ You will need some of those files to complete the exercises. | chr2 | 20000 | 20030 | + | 15 | -2. use `start()`, `end()`, `strand()`,`seqnames()` and `width()` functions on the `GRanges` -object you created. Figure out what they are doing. Can you get a subset of `GRanges` object for intervals that are only on + strand? If you can do that, try getting intervals that are on chr1. *HINT:* `GRanges` objects can be subset using `[ ]` operator similar to data frames but you may need -to use `start()`, `end()` and `strand()`,`seqnames()` within the `[]`.[Difficulty: **Beginner/Intermediate**] +2. Use the `start()`, `end()`, `strand()`,`seqnames()` and `width()` functions on the `GRanges` +object you created. Figure out what they are doing. Can you get a subset of the `GRanges` object for intervals that are only on the + strand? If you can do that, try getting intervals that are on chr1. *HINT:* `GRanges` objects can be subset using the `[ ]` operator, similar to data frames, but you may need +to use `start()`, `end()` and `strand()`,`seqnames()` within the `[]`. [Difficulty: **Beginner/Intermediate**] -3. Import mouse (mm9 assembly) CpG islands and RefSeq transcripts for chr12 from UCSC browser as `GRanges` objects using `rtracklayer` functions. HINT: Check chapter content and modify the code there as necessary. If that somehow does not work, go to UCSC browser and download it as a BED file. The track name for Refseq genes is "RefSeq Genes" and table name is "refGene". [Difficulty: **Beginner/Intermediate**] +3. Import mouse (mm9 assembly) CpG islands and RefSeq transcripts for chr12 from the UCSC browser as `GRanges` objects using `rtracklayer` functions. HINT: Check chapter content and modify the code there as necessary. If that somehow does not work, go to the UCSC browser and download it as a BED file. The track name for Refseq genes is "RefSeq Genes" and the table name is "refGene". [Difficulty: **Beginner/Intermediate**] -4. Following from the exercise above, get the promoters of Refseq transcripts (-1000bp and +1000 bp of the TSS) and calculate what percentage of them overlap with CpG islands. 
HINT: You have to get the promoter coordinates and use `findOverlaps()` or `subsetByOverlaps()` from `GenomicRanges` package. To get promoters, type `?promoters` on the R console and see how to use that function to get promoters or calculate their coordinates as shown in the lecture material.[Difficulty: **Beginner/Intermediate**] +4. Following from the exercise above, get the promoters of Refseq transcripts (-1000bp and +1000 bp of the TSS) and calculate what percentage of them overlap with CpG islands. HINT: You have to get the promoter coordinates and use the `findOverlaps()` or `subsetByOverlaps()` from the `GenomicRanges` package. To get promoters, type `?promoters` on the R console and see how to use that function to get promoters or calculate their coordinates as shown in the chapter. [Difficulty: **Beginner/Intermediate**] 5. Plot the distribution of CpG island lengths for CpG islands that overlap with the -promoters.[Difficulty: **Beginner/Intermediate**] +promoters. [Difficulty: **Beginner/Intermediate**] -6. Get canonical peaks for SP1 (peaks that are in both replicates) on chr21. Peaks for each replicate are located in `wgEncodeHaibTfbsGm12878Sp1Pcr1xPkRep1.broadPeak.gz` and `wgEncodeHaibTfbsGm12878Sp1Pcr1xPkRep2.broadPeak.gz` files. HINT: You need to use `findOverlaps()` or `subsetByOverlaps()` to get the subset of peaks that occur in both replicates (canonical peaks). You can try to read *broadPeak.gz files using genomation function `readBroadPeak`, broadPeak is just an extended BED format. In addition, you can try to use `coverage()` and `slice()` functions to get more precise canonical peak locations.[Difficulty: **Intermediate/Advanced**] +6. Get canonical peaks for SP1 (peaks that are in both replicates) on chr21. Peaks for each replicate are located in the `wgEncodeHaibTfbsGm12878Sp1Pcr1xPkRep1.broadPeak.gz` and `wgEncodeHaibTfbsGm12878Sp1Pcr1xPkRep2.broadPeak.gz` files. **HINT**: You need to use `findOverlaps()` or `subsetByOverlaps()` to get the subset of peaks that occur in both replicates (canonical peaks). You can try to read "...broadPeak.gz" files using the `genomation::readBroadPeak()` function; broadPeak is just an extended BED format. In addition, you can try to use `the coverage()` and `slice()` functions to get more precise canonical peak locations. [Difficulty: **Intermediate/Advanced**] ### Dealing with mapped high-throughput sequencing reads -1. Count the reads overlapping with canonical SP1 peaks using the BAM file for one of the replicates. The following file in `compGenomRData` package contains the alignments for SP1 ChIP-seq reads: `wgEncodeHaibTfbsGm12878Sp1Pcr1xAlnRep1.chr21.bam`. **HINT**: Use functions from `GenomicAlignments` package.[Difficulty: **Beginner/Intermediate**] +1. Count the reads overlapping with canonical SP1 peaks using the BAM file for one of the replicates. The following file in the `compGenomRData` package contains the alignments for SP1 ChIP-seq reads: `wgEncodeHaibTfbsGm12878Sp1Pcr1xAlnRep1.chr21.bam`. **HINT**: Use functions from the `GenomicAlignments` package. [Difficulty: **Beginner/Intermediate**] ### Dealing with contiguous scores over the genome -1. Extract Views object for the promoters on chr20 from `H1.ESC.H3K4me1.chr20.bw` file available at `CompGenomRData` package. Plot the first "View" as a line plot. **HINT**: see the code in the relevant section in the chapter and adapt the code from there.[Difficulty: **Beginner/Intermediate**] +1. 
Extract the `Views` object for the promoters on chr20 from the `H1.ESC.H3K4me1.chr20.bw` file available in the `compGenomRData` package. Plot the first "View" as a line plot. **HINT**: See the code in the relevant section in the chapter and adapt the code from there. [Difficulty: **Beginner/Intermediate**] -2. Make a histogram of the maximum signal for the Views in the object you extracted above. You can use any of the view summary functions or use `lapply()` and write your own summary function.[Difficulty: **Beginner/Intermediate**] +2. Make a histogram of the maximum signal for the Views in the object you extracted above. You can use any of the view summary functions or use `lapply()` and write your own summary function. [Difficulty: **Beginner/Intermediate**] -3. Get the genomic positions of maximum signal in each view and make a `GRanges` object. **HINT**: See `?viewRangeMaxs` help page. Try to make a `GRanges` object out of the returned object.[Difficulty: **Intermediate**] +3. Get the genomic positions of maximum signal in each view and make a `GRanges` object. **HINT**: See the `?viewRangeMaxs` help page. Try to make a `GRanges` object out of the returned object. [Difficulty: **Intermediate**] ### Visualizing and summarizing genomic intervals -1. Extract -500,+500 bp regions around TSSes on chr21, there are refseq files for hg19 human genome assembly. in the `compGenomRData` package. As an example here is how you can get the file path to refseq annotation on chr21. +1. Extract -500,+500 bp regions around the TSSes on chr21; there are refseq files for the hg19 human genome assembly in the `compGenomRData` package. Use the SP1 ChIP-seq data in the `compGenomRData` package and access the file path via the `system.file()` function; the file name is: +`wgEncodeHaibTfbsGm12878Sp1Pcr1xAlnRep1.chr21.bam`. Create an average profile of read coverage around the TSSes. Following that, visualize the read coverage with a heatmap. **HINT**: All of these are possible using the `genomation` package functions. Check `help(ScoreMatrix)` to see how you can use bam files. As an example, here is how you can get the file path to the refseq annotation on chr21. [Difficulty: **Intermediate/Advanced**] ```{r example,eval=FALSE} transcriptFilechr21=system.file("extdata", "refseq.hg19.chr21.bed", package="compGenomRData") ``` -Use SP1 ChIP-seq data in the `compGenomRData` package, access the file path via `system.file()` command, the file name is: -`wgEncodeHaibTfbsGm12878Sp1Pcr1xAlnRep1.chr21.bam`. Create an average profile of read coverage around TSSes. Following that, visualize the read coverage with a heatmap. **HINT**: All of these possible using `genomation` package functions. Check `help(ScoreMatrix)` to see how you can use bam files.[Difficulty: **Intermediate/Advanced**] +2. Extract -500,+500 bp regions around the TSSes on chr20. Use H3K4me3 (`H1.ESC.H3K4me3.chr20.bw`) and H3K27ac (`H1.ESC.H3K27ac.chr20.bw`) ChIP-seq enrichment data in the `compGenomRData` package and create heatmaps and average signal profiles for regions around the TSSes. [Difficulty: **Intermediate/Advanced**] -2. Extract -500,+500 bp regions around TSSes on chr20. Use H3K4me3 (`H1.ESC.H3K4me3.chr20.bw`) and H3K27ac (`H1.ESC.H3K27ac.chr20.bw`) ChIP-seq enrichment data in the `compGenomRData` package and create heatmaps and average signal profiles for regions around the TSSes.[Difficulty: **Intermediate/Advanced**] +3. Download P300 ChIP-seq peaks data from the UCSC browser. The peaks are locations where P300 binds.
The P300 binding marks enhancer regions in the genome. (**HINT**: group: "regulation", track: "Txn Factor ChIP", table:"wgEncodeRegTfbsClusteredV3", you need to filter the rows for the "EP300" name.) Check the enrichment of H3K4me3, H3K27ac and DNase-seq (`H1.ESC.dnase.chr20.bw`) experiments on chr20, on and around the P300 binding sites, using data from the `compGenomRData` package. Make multi-heatmaps and metaplots. What is different from the TSS profiles? [Difficulty: **Advanced**] -3. Download P300 ChIP-seq peaks data from UCSC browser. The peaks are locations where P300 binds. The P300 binding marks enhancer regions in the genome. (HINT: group: "regulation", track: "Txn Factor ChIP", table:"wgEncodeRegTfbsClusteredV3", you need to filter the rows for "EP300" name )Check enrichment of H3K4me3, H3K27ac and DNase-seq (`H1.ESC.dnase.chr20.bw`) experiments on chr20, use data from `compGenomRData` package. Make multi-heatmaps and metaplots, what is different from TSS profiles ? [Difficulty: **Advanced**] -4. Cluster the rows of multi-heatmaps? Are there obvious clusters ? HINT: check arguments of `multiHeatMatrix()` function.[Difficulty: **Advanced**] +4. Cluster the rows of the multi-heatmaps for the task above. Are there obvious clusters? **HINT**: Check arguments of the `multiHeatMatrix()` function. [Difficulty: **Advanced**] -5. Visualize one of the -500,+500 bp regions around TSS using `Gviz` functions. You should visualize both H3K4me3 and H3K27ac and the gene models.[Difficulty: **Advanced**] +5. Visualize one of the -500,+500 bp regions around the TSS using `Gviz` functions. You should visualize both H3K4me3 and H3K27ac and the gene models. [Difficulty: **Advanced**] diff --git a/07-Read_Processing.Rmd b/07-Read_Processing.Rmd index 2f635c4..c502202 100644 --- a/07-Read_Processing.Rmd +++ b/07-Read_Processing.Rmd @@ -11,29 +11,29 @@ knitr::opts_chunk$set(echo = TRUE, ``` -Advances in sequencing technology are helping researchers sequence the genome deeper than ever. These sequencing experiments typically yield millions of reads. These reads have to be further processed, quality checked and aligned before we can quantify the genomic signal of interest and apply statistics and/or machine learning methods. For example, you may want to count how many reads overlapping with your promoter set of interest or you may want to quantify RNA-seq reads overlapping with exons. Post-alignment operations are usually but not always similar to operations on genomic intervals. Dealing with mapped reads are described previously in chapter \@ref(genomicIntervals). In addition, we have introduced high-throughput sequencing and its applications in general in chapter \@ref(intro). In this chapter we will introduce the fundamentals of read processing and quality check, and we will show how to do those tasks in R. The read quality check and processing is a fundemental step in all high-throughput sequencing analyses. For example, RNA-seq, ChIP-seq and BS-seq analyses shown in Chapters \@ref(rnaseqanalysis), \@ref(chipseq) and \@ref(bsseq) require these quality check and processing steps prior to further analysis. For a long time, quality check and mapping tasks were outside the R domain. However, nowadays certain packages in R/Bioconductor can accomplish those tasks. +Advances in sequencing technology are helping researchers sequence the genome deeper than ever. These sequencing experiments typically yield millions of reads.
These reads have to be further processed, quality checked and aligned before we can quantify the genomic signal of interest and apply statistics and/or machine learning methods. For example, you may want to count how many reads overlap with your promoter set of interest or you may want to quantify RNA-seq reads that overlap with exons. Post-alignment operations are usually, but not always, similar to operations on genomic intervals. Dealing with mapped reads is described previously in Chapter \@ref(genomicIntervals). In addition, we have introduced high-throughput sequencing and its applications in general in Chapter \@ref(intro). In this chapter we will introduce the fundamentals of read processing and quality check, and we will show how to do those tasks in R. The read quality check and processing is a fundamental step in all high-throughput sequencing analyses. For example, RNA-seq, ChIP-seq and BS-seq analyses shown in Chapters \@ref(rnaseqanalysis), \@ref(chipseq) and \@ref(bsseq) require these quality check and processing steps prior to further analysis. For a long time, quality check and mapping tasks were outside the R domain. However, nowadays certain packages in R/Bioconductor can accomplish those tasks. ## FASTA and FASTQ formats -High-throughput sequencing reads are usually output from sequencing facilities as text files in a format called "FASTQ" or "fastq". This format depends on an earlier format called FASTA. The FASTA format is developed as a text-based format to represent nucleotide or protein sequences (See Figure \@ref(fig:fasta) for an example). +High-throughput sequencing reads are usually output from sequencing facilities as text files in a format called "FASTQ" or "fastq". This format depends on an earlier format called FASTA. The FASTA format was developed as a text-based format to represent nucleotide or protein sequences (see Figure \@ref(fig:fasta) for an example). -```{r,fasta,fig.cap="An example fasta file showing first part of PAX6 gene",fig.align = 'center',out.width='80%',echo=FALSE} +```{r,fasta,fig.cap="An example fasta file showing the first part of the PAX6 gene.",fig.align = 'center',out.width='80%',echo=FALSE} knitr::include_graphics("images/fastaPic.png" ) ``` -The first line in a FASTA file usually starts with a ">" (greater-than) symbol. This first line is called the "description line", and can contain descriptive information about the sequence in the subsequent lines. The description can be id or name of the sequence such as gene names. However, very infrequently you may see lines starting with a ";" (semicolon). These lines will be taken as a comment, and can hold additional descriptive information about the sequence in subsequent lines. +The first line in a FASTA file usually starts with a ">" (greater-than) symbol. This first line is called the "description line", and can contain descriptive information about the sequence in the subsequent lines. The description can be the ID or name of the sequence such as gene names. However, very infrequently you may see lines starting with a ";" (semicolon). These lines will be taken as a comment, and can hold additional descriptive information about the sequence in subsequent lines. -An extension of the FASTA format is FASTQ format. This format is designed to handle base quality metrics output from sequencing machines. In this format, both the sequence and quality scores are represented as single ASCII characters. 
The format uses for lines for each sequence, and these four lines are stacked on top of each other in text files output by sequencing workflows. Each of the 4 lines will represent a read. Figure \@ref(fig:fastq) shows those four lines with brief explanations for each line. +An extension of the FASTA format is FASTQ format. This format is designed to handle base quality metrics output from sequencing machines. In this format, both the sequence and quality scores are represented as single ASCII characters. The format uses four lines for each sequence, and these four lines are stacked on top of each other in text files output by sequencing workflows. Each of the 4 lines will represent a read. Figure \@ref(fig:fastq) shows those four lines with brief explanations for each line. -```{r,fastq,fig.cap="FASTQ format and brief explanation of each line in the format",fig.align = 'center',out.width='80%',echo=FALSE} +```{r,fastq,fig.cap="FASTQ format and a brief explanation of each line in the format.",fig.align = 'center',out.width='80%',echo=FALSE} knitr::include_graphics("images/fastqPic.png" ) ``` -__Line 1__ begins with a '@' character and is followed by a sequence identifier and an optional description. This line is utilized by the sequencing technology, and usually contains specific information for the technology. It can contain flow cell ids, lane numbers, information on read pairs. __Line 2__ is the sequence letters. __Line 3__ begins with a '+' character, it marks the end of sequence and is optionally followed by the same sequence identifier again in line 1. __Line 4__ encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence. Each letter corresponds to a quality score. Although there might be different definitions of the quality scores, a *de facto* standard in the field is to use "Phred quality scores". These scores represent the likelihood of base being called wrong. Formally, ${\displaystyle Q_{\text{phred}}=-10\log _{\text{10}}e}$, where $e$ is probability that the base is called wrong.Since the score is in minus log scale, the higher the score, the more unlikely that the base is called wrong. +__Line 1__ begins with the '@' character and is followed by a sequence identifier and an optional description. This line is utilized by the sequencing technology, and usually contains specific information for the technology. It can contain flow cell IDs, lane numbers, and information on read pairs. __Line 2__ is the sequence letters. __Line 3__ begins with a '+' character; it marks the end of the sequence and is optionally followed by the same sequence identifier again in line 1. __Line 4__ encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence. Each letter corresponds to a quality score. Although there might be different definitions of the quality scores, a *de facto* standard in the field is to use "Phred quality scores". These scores represent the likelihood of the base being called wrong. Formally, ${\displaystyle Q_{\text{phred}}=-10\log _{\text{10}}e}$, where $e$ is the probability that the base is called wrong. Since the score is in minus log scale, the higher the score, the more unlikely that the base is called wrong. ## Quality check on sequencing reads -The sequencing technologies usually produce basecalls with varying quality. In addition, there could be sample specific issues in your sequencing run, such as adapter contamination. 
It is standard procedure to check the quality of the reads and identify problems before doing further analysis. Checking the quality and making some decisions for the downstream analysis can influence the outcome of your project. +The sequencing technologies usually produce basecalls with varying quality. In addition, there could be sample-specific issues in your sequencing run, such as adapter contamination. It is standard procedure to check the quality of the reads and identify problems before doing further analysis. Checking the quality and making some decisions for the downstream analysis can influence the outcome of your project. -Below, we will walk you through the quality check steps using [`Rqc`](https://bioconductor.org/packages/release/bioc/html/Rqc.html) package\index{R Packages!\texttt{Rqc}}. First, we need to feed fastq files to `rqc()` function and obtain an object with sequence quality related results. We are using example fastq files from `ShortRead` package{R Packages!\texttt{ShortRead}}. +Below, we will walk you through the quality check steps using the [`Rqc`](https://bioconductor.org/packages/release/bioc/html/Rqc.html) package\index{R Packages!\texttt{Rqc}}. First, we need to feed fastq files to the `rqc()` function and obtain an object with sequence quality-related results. We are using example fastq files from the `ShortRead` package\index{R Packages!\texttt{ShortRead}}. ```{r,rqcStart,echo=TRUE, warning=FALSE} library(Rqc) folder = system.file(package="ShortRead", "extdata/E-MTAB-1147") @@ -46,35 +46,35 @@ qcRes=rqc(path = folder, pattern = ".fastq.gz", openBrowser=FALSE) ### Sequence quality per base/cycle -Now that we have `qcRes` object, we can plot various sequence quality metrics for our fastq files. We will first plot " sequence quality per base/cycle". This plot, shown in Figure \@ref(fig:CycleQualityBoxPlot), depicts the quality scores across all bases at each position in the reads. +Now that we have the `qcRes` object, we can plot various sequence quality metrics for our fastq files. We will first plot "sequence quality per base/cycle". This plot, shown in Figure \@ref(fig:CycleQualityBoxPlot), depicts the quality scores across all bases at each position in the reads. -```{r,CycleQualityBoxPlot,fig.cap="Per base sequence quality boxplot",fig.align = 'center',out.width='80%',echo=TRUE, warning=FALSE,fig.height=3,fig.width=5} +```{r,CycleQualityBoxPlot,fig.cap="Per base sequence quality boxplot.",fig.align = 'center',out.width='80%',echo=TRUE, warning=FALSE,fig.height=3,fig.width=5} rqcCycleQualityBoxPlot(qcRes) ``` -In our case, the x-axis in the plot is labeled as "cycle". This is because in each sequencing "cycle" a fluorescently labeled nucleotide is added to complement the template sequence, and the sequencing machine identifies which nucleotide is added. Therefore, cycles corresponds to bases/nucleotides along the read, and the number of cycles is equivalent to the read length. +In our case, the x-axis in the plot is labeled as "cycle". This is because in each sequencing "cycle" a fluorescently labeled nucleotide is added to complement the template sequence, and the sequencing machine identifies which nucleotide is added. Therefore, cycles correspond to bases/nucleotides along the read, and the number of cycles is equivalent to the read length. Long sequences can have degraded quality towards the ends of the reads. Looking at quality distribution over base positions can help us decide to do trimming towards the end of the reads or not. 
A good sample will have median quality scores per base above 28. If scores are below 20 towards the ends, you can think about trimming the reads. ### Sequence content per base/cycle -Per base sequence content shows nucleotide proportions for each position. In a random sequencing library there should be no nucleotide bias and the lines should be almost parallel with each other. The code below shows how to get this plot. The resulting plot is shown in Figure \@ref(fig:baseCallFreq). +Per-base sequence content shows nucleotide proportions for each position. In a random sequencing library there should be no nucleotide bias and the lines should be almost parallel with each other. The code below shows how to get this plot. The resulting plot is shown in Figure \@ref(fig:baseCallFreq). -```{r,baseCallFreq,fig.cap="Percentage of nucleotide bases per position accross different FASTQ files",fig.align = 'center',out.width='80%',echo=TRUE, warning=FALSE,fig.height=3,fig.width=5} +```{r,baseCallFreq,fig.cap="Percentage of nucleotide bases per position across different FASTQ files.",fig.align = 'center',out.width='80%',echo=TRUE, warning=FALSE,fig.height=3,fig.width=5} rqcCycleBaseCallsLinePlot(qcRes) ``` -However some types of sequencing libraries can produce a biased sequence composition. For example, in RNA-Seq , it is common to have bias at the beginning of the reads. This happens because of random primers annealing to the start of reads during RNA-Seq library preparation. These primers are not truly random, and it leads to a variation at the beginning of the reads. Although RNA-seq experiments will usually have these biases, this will not affect the ability of measuring gene expression. +However, some types of sequencing libraries can produce a biased sequence composition. For example, in RNA-Seq, it is common to have bias at the beginning of the reads. This happens because of random primers annealing to the start of reads during RNA-Seq library preparation. These primers are not truly random, which leads to a variation at the beginning of the reads. Although RNA-seq experiments will usually have these biases, this will not affect the ability of measuring gene expression. -In addition, some libraries are inherently biased in their sequence composition. For example, in bisulfite sequencing experiments most of the cytosines will be converted to thymines. This will create a difference in C and T base compositions over the read, however this type of difference is normal for bisulfite sequencing experiments. +In addition, some libraries are inherently biased in their sequence composition. For example, in bisulfite sequencing experiments, most of the cytosines will be converted to thymines. This will create a difference in C and T base compositions over the read, however this type of difference is normal for bisulfite sequencing experiments. ### Read frequency plot -This plot shows the degree of duplication for every read in the library. We show how to get this plot in the code snippet below and the resulting plot is in Figure \@ref(fig:ReadFrequencyPlot). A high level of duplication, non-unique reads, is likely to indicate an enrichment bias. Technical duplicates arising from PCR artefacts could cause this. PCR is a common step in library preparation which creates many copies of the sequence fragment. In RNA-seq \index{RNA-seq}data, non-unique read proportion can reach more than 20%. However, these duplications may stem from simply genes being expressed at high levels. 
This means that there will many copies of transcripts and many copies of the same fragment. Since we can not be sure these duplicated reads are due to PCR bias or an effect of high transcription, we should not remove duplicated reads in RNA-seq analysis. However, in ChIP-seq experiments duplicated reads are more likely to be due to PCR bias. +This plot shows the degree of duplication for every read in the library. We show how to get this plot in the code snippet below and the resulting plot is in Figure \@ref(fig:ReadFrequencyPlot). A high level of duplication, non-unique reads, is likely to indicate an enrichment bias. Technical duplicates arising from PCR artifacts could cause this. PCR is a common step in library preparation which creates many copies of the sequence fragment. In RNA-seq \index{RNA-seq}data, the non-unique read proportion can reach more than 20%. However, these duplications may stem from genes simply being expressed at high levels. This means that there will be many copies of transcripts and many copies of the same fragment. Since we cannot be sure these duplicated reads are due to PCR bias or an effect of high transcription, we should not remove duplicated reads in RNA-seq analysis. However, in ChIP-seq experiments duplicated reads are more likely to be due to PCR bias. ```{r,ReadFrequencyPlot,fig.height=3,fig.width=5,fig.cap="The percent of different duplication levels in FASTQ files. Most of the reads in all libraries have only one copy in this case. ",fig.align = 'center',out.width='60%',echo=TRUE, warning=FALSE} rqcReadFrequencyPlot(qcRes) @@ -82,7 +82,7 @@ rqcReadFrequencyPlot(qcRes) ``` ### Other quality metrics and QC tools -Over-represented k-mers along the reads can be an additional check. If there are such sequences it may point to adapter contamination and should be trimmed. Adapters are known sequences that are added to the ends of the reads. This kind of contamination could also be visible at "sequence content per base" plots. In addition, if you know the adapter sequences you can match it to the end of the reads and trim them. The most popular tool for sequencing quality control is the fastQC tool [@noauthor_babraham_nodate], which is written in Java. It produces the plots that we described above in addition to k-mer overrepresentation and adapter overrepresentation plots. The R package [fastqcr](https://cran.r-project.org/web/packages/fastqcr/index.html) can run this Java tool\index{R Packages!\texttt{fastqcr}} and produce R based plots and reports. This package simply calls the Java tool and parses its results. Below, we are showing how to do that. +Over-represented k-mers along the reads can be an additional check. If there are such sequences it may point to adapter contamination and should be trimmed. Adapters are known sequences that are added to the ends of the reads. This kind of contamination could also be visible at "sequence content per base" plots. In addition, if you know the adapter sequences, you can match it to the end of the reads and trim them. The most popular tool for sequencing quality control is the fastQC tool [@noauthor_babraham_nodate], which is written in Java. It produces the plots that we described above in addition to k-mer overrepresentation and adapter overrepresentation plots. The R package [fastqcr](https://cran.r-project.org/web/packages/fastqcr/index.html) can run this Java tool\index{R Packages!\texttt{fastqcr}} and produce R-based plots and reports. This package simply calls the Java tool and parses its results. 
Below, we show how to do that. ```{r,fastqcr,eval=FALSE} library(fastqcr) @@ -95,7 +95,7 @@ fastqc_install() fastqc(fq.dir = folder,qc.dir = "fastqc_results") ``` -Now that we have run FastQC on our fastq files, we can read the results to R and construct plots or reports. `gc_report` function can create an Rmarkdown based report from FastQC output. +Now that we have run FastQC on our fastq files, we can read the results to R and construct plots or reports. The `gc_report()` function can create an Rmarkdown-based report from FastQC output. ```{r, fastqcr2,eval=FALSE} # view the report rendered by R functions qc_report(qc.path="fastqc_results", @@ -112,12 +112,12 @@ qc_plot(qc, "Per base sequence quality") ``` -Apart from this, the bioconductor packages Rqc [@Rqc] (see `Rqc::rqcReport` function), QuasR [@gaidatzis_quasr:_2015] (see `QuasR::qQCReport` function), systemPipeR [@backman_systempiper:_2016] (see `systemPipeR::seeFastq` function), and ShortRead [@morgan_shortread:_2009] (see `ShortRead::report` function) packages can all generate quality reports in a similar fashion to FastQC with some differences in plot content and number. +Apart from this, the bioconductor packages Rqc [@Rqc] (see `Rqc::rqcReport` function), QuasR [@gaidatzis_quasr:_2015] (see `QuasR::qQCReport` function), systemPipeR [@backman_systempiper:_2016] (see `systemPipeR::seeFastq` function), and ShortRead [@morgan_shortread:_2009] (see `ShortRead::report` function) can all generate quality reports in a similar fashion to FastQC with some differences in plot content and number. ## Filtering and trimming reads -Based on the results of the quality check, you may want to trim or filter the reads. Quality check might have shown number of reads that have low quality scores. These reads will probably not align very well because of the potential mistakes in base calling, or they may align to wrong places in the genome. Therefore, you may want to remove these reads from your fastq file. Another potential scenario is that part of your reads needs to be trimmed in order align the read. In some cases, adapters will be present in either side of the read, in other cases technical errors will lead to decreasing base quality towards the ends of the reads. Both in these cases, portion of the read should be trimmed so that read can align or better align the genome. We will show how to use `QuasR` package to trim the reads. Other packages such as `ShortRead` also have capabilities to trim and filter reads. However, `QuasR::preprocessReads()` function provides a single interface to multiple preprocessing possibilities. With this function, we match adapter sequences and remove them. We can remove low-complexity reads (reads containing repetitive sequences). We can trim start or ends of the reads by a pre-defined length. +Based on the results of the quality check, you may want to trim or filter the reads. The quality check might have shown the number of reads that have low quality scores. These reads will probably not align very well because of the potential mistakes in base calling, or they may align to wrong places in the genome. Therefore, you may want to remove these reads from your fastq file. Another potential scenario is that parts of your reads need to be trimmed in order to align the reads. In some cases, adapters will be present in either side of the read; in other cases technical errors will lead to decreasing base quality towards the ends of the reads. 
In both of these cases, a portion of the read should be trimmed so that the read can align, or align better, to the genome. We will show how to use the `QuasR` package to trim the reads. Other packages such as `ShortRead` also have capabilities to trim and filter reads. However, the `QuasR::preprocessReads()` function provides a single interface to multiple preprocessing possibilities. With this function, we match adapter sequences and remove them. We can remove low-complexity reads (reads containing repetitive sequences). We can trim the start or ends of the reads by a pre-defined length. -Below we will first set up the file paths to fastq files and filter them based on their length and whether or not they contain "N" character, which stands for unidentified base. With the same function we will also trim 3 bases from the end of the reads and also trim segments from the start of the reads if they match the "ACCCGGGA" sequence. +Below we will first set up the file paths to fastq files and filter them based on their length and whether or not they contain the "N" character, which stands for unidentified base. With the same function we will also trim 3 bases from the end of the reads and also trim segments from the start of the reads if they match the "ACCCGGGA" sequence. ```{r,preprocRead,eval=FALSE} library(QuasR) @@ -146,7 +146,7 @@ preprocessReads(fastqFiles, outfiles, ``` -As we have mentioned, `ShortRead` package has low-level functions, which `QuasR::preprocessReads()` also depends on. We can use these low level functions to filter reads in ways that are not possible using `QuasR::preprocessReads()` function. Below we are going to read in a fastq file and filter the reads where every quality score is below 20. +As we have mentioned, the `ShortRead` package has low-level functions, which `QuasR::preprocessReads()` also depends on. We can use these low-level functions to filter reads in ways that are not possible using the `QuasR::preprocessReads()` function. Below we are going to read in a fastq file and filter the reads where every quality score is below 20. ```{r,shortreadQual} library(ShortRead) @@ -170,7 +170,7 @@ qcount = rowSums( qPerBase <= 20) fq[qcount == 0] ``` -We can finally write out the filtered fastq file with `ShortRead::writeFastq()` function. +We can finally write out the filtered fastq file with the `ShortRead::writeFastq()` function. ```{r,eval=FALSE} # write out fastq file with only reads where all # quality scores per base are above 20 @@ -178,7 +178,7 @@ writeFastq(fq[qcount == 0], paste(fastqFile, "Qfiltered", sep="_")) ``` -As fastq files can be quite large, it may not be feasible to read a 30 Gigabyte file into memory. A more memory efficient way would be to read the file piece by piece. We can do our filtering operations for each piece, write the filtered part out and read a new piece. Fortunately, this is possible by `ShortRead::FastqStreamer()` function. This function enables "streaming" the fastq file in pieces, which are blocks of the fastq file with a pre-defined number of reads . We can access the successive blocks with `yield()` function. Each time we call `yield()` function after opening the fastq file with `FastqStreamer()`, a new part of the file will be read to the memory. +As fastq files can be quite large, it may not be feasible to read a 30-Gigabyte file into memory. A more memory-efficient way would be to read the file piece by piece. We can do our filtering operations for each piece, write the filtered part out, and read a new piece.
Fortunately, this is possible using the `ShortRead::FastqStreamer()` function. This function enables "streaming" the fastq file in pieces, which are blocks of the fastq file with a pre-defined number of reads. We can access the successive blocks with the `yield()` function. Each time we call the `yield()` function after opening the fastq file with `FastqStreamer()`, a new part of the file will be read to the memory. ```{r,fastqStreamer, eval=FALSE} @@ -209,14 +209,14 @@ while(length(fq <- yield(f))) { ## Mapping/aligning reads to the genome -After quality check and potential pre-processing, the reads are ready to be mapped or aligned to the reference genome. This process simply finds most probable the origin of each read in the genome. Since there might be errors in sequencing and mutations in the genomes, we may not find exact matches of reads in the genomes. An important feature of the alignment algorithms is to tolerate potential mismatches between reads and the reference genome. In addition, effienct algorithms and data structures are needed for the alignment to be completed in a reasonable amount of time. Alignment methods usually create data structures to store and efficiently search the genome for matching reads. These data structures are called genome indices and creating these indices is the first step for the read alignment. Based on how indices are created, there are two major types of methods. One class of methods rely on "hash tables", to store and search the genomes. Hash tables are simple lookup tables, in which all possible k-mers point to locations in the genome. The general idea is that overlapping k-mers constructed from a read goes through this look up table. Each k-mer points to potential locations in the genome. Then, final location for the read is obtained by optimizing k-mer chain by their distances in the genome and in the read. This optimization process removes k-mer locations that are distant from other k-mers that map nearby each other. +After the quality check and potential pre-processing, the reads are ready to be mapped or aligned to the reference genome. This process simply finds the most probable origin of each read in the genome. Since there might be errors in sequencing and mutations in the genomes, we may not find exact matches of reads in the genomes. An important feature of the alignment algorithms is to tolerate potential mismatches between reads and the reference genome. In addition, efficient algorithms and data structures are needed for the alignment to be completed in a reasonable amount of time. Alignment methods usually create data structures to store and efficiently search the genome for matching reads. These data structures are called genome indices and creating these indices is the first step for the read alignment. Based on how indices are created, there are two major types of methods. One class of methods relies on "hash tables", to store and search the genomes. Hash tables are simple lookup tables in which all possible k-mers point to locations in the genome. The general idea is that overlapping k-mers constructed from a read go through this lookup table. Each k-mer points to potential locations in the genome. Then, the final location for the read is obtained by optimizing the k-mer chain by their distances in the genome and in the read. This optimization process removes k-mer locations that are distant from other k-mers that map nearby each other. 
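
To make the hash-table idea above more concrete, here is a minimal base R sketch on a toy genome and a toy read (the object names `genome`, `read` and `index` are made up for illustration; real aligners use far more compact index structures and also handle mismatches):

```{r,kmerHashToy}
# toy genome and a short read sequenced from it
genome <- "ACGTACGTGGCTAGCTAACGTT"
read   <- "GGCTAGCTAA"
k <- 5

# "hash table": every k-mer in the genome mapped to its start position(s)
genomeKmers <- sapply(1:(nchar(genome) - k + 1),
                      function(i) substr(genome, i, i + k - 1))
index <- split(seq_along(genomeKmers), genomeKmers)

# look up the overlapping k-mers of the read in the index
readKmers <- sapply(1:(nchar(read) - k + 1),
                    function(i) substr(read, i, i + k - 1))
hits <- lapply(readKmers, function(km) index[[km]])

# k-mer hits whose genomic positions are consistent with their offsets in the
# read all point to the same origin; here every k-mer points to position 9
mapply(function(pos, offset) pos - offset + 1, hits, seq_along(readKmers))
```

The chaining step described above is essentially the last line generalized to noisy data: k-mer hits that do not agree with the majority start position are discarded.
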
-Another class of algorithms build genome indices by creating Burrows-Wheeler transformation of the genome. This in essence creates a compact and searchable data structure for all reads. Although, details are out of scope for this section, these alignment tools provide faster alignment and use less memory. BWA[@li2009fast], Bowtie1/2[@langmead2012fast] and SOAP[@li2009soap2] are examples of such algorithms. +Another class of algorithms builds genome indices by creating a Burrows-Wheeler transformation of the genome. This in essence creates a compact and searchable data structure for all reads. Although details are out of the scope of this section, these alignment tools provide faster alignment and use less memory. BWA[@li2009fast], Bowtie1/2[@langmead2012fast] and SOAP[@li2009soap2] are examples of such algorithms. -The read mapping in R can be done with `gmapR` [@gmapR], `QuasR` [@gaidatzis_quasr:_2015], `Rsubread` [@liao_subread_2013], and `systemPipeR` [@backman_systempiper:_2016] packages. We will demonstrate read mapping with QuasR which uses `Rbowtie` package, which wraps the Bowtie aligner. Below, we show how to map reads from a ChIP-seq experiment using QuasR/bowtie. +The read mapping in R can be done with the `gmapR` [@gmapR], `QuasR` [@gaidatzis_quasr:_2015], `Rsubread` [@liao_subread_2013], and `systemPipeR` [@backman_systempiper:_2016] packages. We will demonstrate read mapping with QuasR which uses the `Rbowtie` package, which wraps the Bowtie aligner. Below, we show how to map reads from a ChIP-seq experiment using QuasR/bowtie. -We will use `qAlign()` function which requires two mandatory arugments: 1) a genome file either in fasta format or as a `BSgenome` package 2) a sample file which is a text file and contains file paths to fastq files and sample names. In the case, below sample file looks like this: +We will use the `qAlign()` function which requires two mandatory arguments: 1) a genome file in either fasta format or as a `BSgenome` package and 2) a sample file which is a text file and contains file paths to fastq files and sample names. In the case below, sample file looks like this: ``` FileName SampleName @@ -240,31 +240,31 @@ sampleFile <- "extdata/samples_chip_single.txt" proj <- qAlign(sampleFile, genomeFile) ``` -It is good to explain what is going on here as the `qAlign()` function makes things look simple. This function is designed to be easy. For example, it creates a genome index automatically if it does not exist, and will look for existing indices before it creates one. We provided only two arguments, a text file containing sample names and fastq file paths and a reference genome file. In fact, this function also has many knobs and you can change its behavior by supplying different arguments in order to affect the behavior of Bowtie. For example, you can supply parameters to Bowtie using `alignmentParameter` argument. However the `qAlign()` function is optimized for different types of alignment problems and selects alignment parameters automatically. It is designed to work with alignment and quantification tasks for RNA-seq\index{RNA-seq}, ChIP-seq\index{ChIP-seq}, small-RNA sequencing, Bisulfite sequencing (DNA methylation)\index{Bisulfite sequencing} and allele specific analysis. If you want to change default bowtie parameters only do it for simple alignment problems such as ChIP-seq and RNA-seq. +It is good to explain what is going on here as the `qAlign()` function makes things look simple. This function is designed to be easy. 
For example, it creates a genome index automatically if it does not exist, and will look for existing indices before it creates one. We provided only two arguments: a text file containing sample names and fastq file paths, and a reference genome file. In fact, this function also has many knobs and you can change its behavior by supplying different arguments in order to affect the behavior of Bowtie. For example, you can supply parameters to Bowtie using the `alignmentParameter` argument. However, the `qAlign()` function is optimized for different types of alignment problems and selects alignment parameters automatically. It is designed to work with alignment and quantification tasks for RNA-seq\index{RNA-seq}, ChIP-seq\index{ChIP-seq}, small-RNA sequencing, Bisulfite sequencing (DNA methylation)\index{Bisulfite sequencing} and allele-specific analysis. If you want to change the default bowtie parameters, only do it for simple alignment problems such as ChIP-seq and RNA-seq. ```{block2, mappingKnowledge, type='rmdtip'} __Want to know more ?__ -- More on hash tables and Burrows-Wheeler based aligners - - A survey of sequence alignment algorithms for next-generation sequencin](https://academic.oup.com/bib/article/11/5/473/264166) H Li, N Homer - Briefings in bioinformatics, 2010 -- More on QuasR and all the alignment and post-processing capabilities.(https://bioconductor.org/packages/release/bioc/vignettes/QuasR/inst/doc/QuasR.html) +- More on hash tables and Burrows-Wheeler-based aligners + - [A survey of sequence alignment algorithms for next-generation sequencing](https://academic.oup.com/bib/article/11/5/473/264166), H Li, N Homer - Briefings in Bioinformatics, 2010 +- More on QuasR and all the alignment and post-processing capabilities: (https://bioconductor.org/packages/release/bioc/vignettes/QuasR/inst/doc/QuasR.html) ``` ## Further processing of aligned reads -After alignment some further processing might be necessary. However, these steps are usually sequencing protocol specific. For example, for methylation obtained via bisulfite sequencing C->T mismatches should be counted \index{Bisulfite sequencing}. For gene expression measurements, reads that overlap with transcripts should be counted. These further processing tasks are either done by specialized alignment related software or can be done in R in some cases. We will explain these further processing steps when they become relevant in the context of following chapters. +After alignment, some further processing might be necessary. However, these steps are usually sequencing protocol specific. For example, for methylation obtained via bisulfite sequencing, C->T mismatches should be counted\index{Bisulfite sequencing}. For gene expression measurements, reads that overlap with transcripts should be counted. These further processing tasks are either done by specialized alignment-related software or can be done in R in some cases. We will explain these further processing steps when they become relevant in the context of the following chapters. ## Exercises -For this set of we will use the `chip_1_1.fq.bz2` and `chip_2_1.fq.bz2` files from the `QuasR` package. You can reach the folder that contains the files as follows, you need to: +For this set of exercises, we will use the `chip_1_1.fq.bz2` and `chip_2_1.fq.bz2` files from the `QuasR` package.
You can reach the folder that contains the files as follows: ```{r seqProcessEx,eval=FALSE} folder=(system.file(package="QuasR", "extdata")) dir(folder) # will show the contents of the folder ``` -1. Plot base quality distributions of chip-seq samples `Rqc` package. -**HINT**: You need to provide a regular expression pattern for extracting the right files from the folder. `"^chip"` matches the files begining with "chip".[Difficulty: **Beginner/Intermediate**] +1. Plot the base quality distributions of the ChIP-seq samples using the `Rqc` package. +**HINT**: You need to provide a regular expression pattern for extracting the right files from the folder. `"^chip"` matches the files beginning with "chip". [Difficulty: **Beginner/Intermediate**] ```{r seqProcessEx2,eval=FALSE,echo=FALSE} folder=(system.file(package="QuasR", "extdata")) @@ -277,6 +277,6 @@ rqcCycleQualityBoxPlot(qcRes) ``` -2. Now we will trim the reads based on the quality scores. Let's trim 2-4 bases on the 3' end depending on the quality scores. You can use Trim the ends of the samples `QuasR::preprocessReads()` function for this purpose.[Difficulty: **Beginner/Intermediate**] +2. Now we will trim the reads based on the quality scores. Let's trim 2-4 bases on the 3' end depending on the quality scores. You can use the `QuasR::preprocessReads()` function for this purpose. [Difficulty: **Beginner/Intermediate**] -3. Align the trimmed and untrimmed reads using `QuasR` and plot alignment statistics, did the trimming improve alignments? [Difficulty: **Intermediate/Advanced**] \ No newline at end of file +3. Align the trimmed and untrimmed reads using `QuasR` and plot the alignment statistics. Did the trimming improve the alignments? [Difficulty: **Intermediate/Advanced**] diff --git a/08-rna-seq-analysis.Rmd b/08-rna-seq-analysis.Rmd index 631e47e..f863ecc 100644 --- a/08-rna-seq-analysis.Rmd +++ b/08-rna-seq-analysis.Rmd @@ -19,19 +19,20 @@ suppressMessages(suppressWarnings(library(gage))) *Chapter Author*: **Bora Uyar** -RNA sequencing (RNA-seq) \index{RNA-seq}has proven as a revolutionary tool since the time it has been introduced. The throughput, accuracy, and resolution of data produced with RNA-seq has been instrumental in the study of transcriptomics in the last decade [@wang_rna-seq:_2009]. There is a variety of applications of transcriptome sequencing and each application may consist of different chains of tools each with many alternatives [@conesa_survey_2016]. In this chapter, we are going to demonstrate a common workflow of how to do differential expression analysis with downstream applications such as GO term and gene set enrichment analysis. We assume that the sequencing data was generated using one of the NGS sequencing platforms. Where applicable, we will try to provide alternatives to the reader in terms of both the tools to carry out a demonstrated analysis and also the other applications of the same sequencing data depending on the different biological questions. + +RNA sequencing (RNA-seq) \index{RNA-seq}has proven to be a revolutionary tool since the time it was introduced. The throughput, accuracy, and resolution of data produced with RNA-seq has been instrumental in the study of transcriptomics in the last decade [@wang_rna-seq:_2009]. There is a variety of applications of transcriptome sequencing and each application may consist of different chains of tools each with many alternatives [@conesa_survey_2016].
In this chapter, we are going to demonstrate a common workflow for doing differential expression analysis with downstream applications such as GO term and gene set enrichment analysis. We assume that the sequencing data was generated using one of the NGS sequencing platforms. Where applicable, we will try to provide alternatives to the reader in terms of both the tools to carry out a demonstrated analysis and also the other applications of the same sequencing data depending on the different biological questions. ## What is gene expression? -`Gene expression` is a term used to describe the contribution of a gene to the overall functions and phenotype of a cell through the activity of the molecular products, which are encoded in the specific nucleotide sequence of the gene. RNA is the primary product encoded in a gene, which is transcribed in the nucleus of a cell. A class of RNA molecules, messenger RNAs, are transported from the nucleus to the cytoplasm, where the translation machinery of the cell translates the nucleotide sequence of the mRNA into proteins. The functional protein repertoire in a given cell is the primary factor that dictates the shape, function, and phenotype of a cell. Due to the prime roles of proteins for a cell's fate, most molecular biology literature is focused on protein-coding genes. However, a bigger proportion of a eukaryotic gene repertoire is reserved for non-coding genes, which code for RNA molecules that are not translated into proteins, yet carry out many important cellular functions. All in all, the term `gene expression` \index{gene expression}refers to the combined activity of protein-coding or non-coding products of a gene. +Gene expression is a term used to describe the contribution of a gene to the overall functions and phenotype of a cell through the activity of the molecular products, which are encoded in the specific nucleotide sequence of the gene. RNA is the primary product encoded in a gene, which is transcribed in the nucleus of a cell. A class of RNA molecules, messenger RNAs, are transported from the nucleus to the cytoplasm, where the translation machinery of the cell translates the nucleotide sequence of the mRNA into proteins. The functional protein repertoire in a given cell is the primary factor that dictates the shape, function, and phenotype of a cell. Due to the prime roles of proteins for a cell's fate, most molecular biology literature is focused on protein-coding genes. However, a bigger proportion of a eukaryotic gene repertoire is reserved for non-coding genes, which code for RNA molecules that are not translated into proteins, yet carry out many important cellular functions. All in all, the term gene expression \index{gene expression}refers to the combined activity of protein-coding or non-coding products of a gene. -In a cell, there are many layers of quality controls and modifications that act upon a gene's product until the end-product attains a particular function. These layers of regulation include epigenetic, transcriptional, post-transcriptional, translational, and post-translational control mechanisms, the latter two applying only to protein-coding genes. 
A protein or RNA molecule, is only functional if it is produced at the right time, at the right cellular compartment, with the neccessary base or amino-acid modifications, with the correct secondary/tertiary structure (or unstructure wherever applicable), among the availability of other metabolites or molecules, which are needed to form complexes to altogether accomplish a certain cellular function. However, traditionally, the number of copies of a gene's products is considered a quantitative measure of a gene's activity. Although this approach does not reflect all of the complexity of what defines a functional molecule, quantification of the abundance of transcripts from a gene has proven to be a cost-effective method in understanding genes' functions. +In a cell, there are many layers of quality controls and modifications that act upon a gene's product until the end-product attains a particular function. These layers of regulation include epigenetic, transcriptional, post-transcriptional, translational, and post-translational control mechanisms, the latter two applying only to protein-coding genes. A protein or RNA molecule is only functional if it is produced at the right time, in the right cellular compartment, with the necessary base or amino-acid modifications, with the correct secondary/tertiary structure (or unstructure wherever applicable), among the availability of other metabolites or molecules, which are needed to form complexes to altogether accomplish a certain cellular function. However, traditionally, the number of copies of a gene's products is considered a quantitative measure of a gene's activity. Although this approach does not reflect all of the complexity that defines a functional molecule, quantification of the abundance of transcripts from a gene has proven to be a cost-effective method of understanding genes' functions. ## Methods to detect gene expression -Quantification of how much expression levels of genes deviate from a baseline gives clues about which genes are actually important for, for instance, disease outcome or cell/tissue identity. The methods of detecting and quantifying gene expression has evolved from low-throughput methods such as the usage of a reporter gene with a fluorescent protein product to find out if a single gene is expressed at all, to high-throughput methods such as massively parallel RNA-sequencing that can profile -at a single-nucleotide resolution- the abundance of tens of thousands of distinct transcripts encoded in the largest eukaryotic genomes. +Quantification of how much gene expression levels deviate from a baseline gives clues about which genes are actually important for, for instance, disease outcome or cell/tissue identity. The methods of detecting and quantifying gene expression have evolved from low-throughput methods such as the usage of a reporter gene with a fluorescent protein product to find out if a single gene is expressed at all, to high-throughput methods such as massively parallel RNA-sequencing that can profile -at single-nucleotide resolution- the abundance of tens of thousands of distinct transcripts encoded in the largest eukaryotic genomes.
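
As a toy illustration of the baseline comparison mentioned above (made-up counts, not real data; proper normalization and statistical testing are covered later in this chapter), one can contrast the read counts of a few genes in a hypothetical disease sample against a normal baseline with a log2 fold change:

```{r,toyFoldChange}
# made-up read counts for three genes in two conditions
normal  <- c(geneA = 100, geneB = 50, geneC = 80)
disease <- c(geneA = 400, geneB = 45, geneC = 10)

# log2 fold change relative to the baseline; adding a pseudo-count of 1
# avoids dividing by zero for genes with no reads
log2((disease + 1) / (normal + 1))
```

Genes with large positive or negative values, such as geneA and geneC here, are the kind of candidates that the differential expression methods shown in the rest of this chapter identify in a statistically sound way.
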
-## Gene Expression Analysis Using High-throughput Sequencing Technologies +## Gene expression analysis using high-throughput sequencing technologies With the advent of the second-generation (a.k.a next-generation or high-throughput) sequencing technologies, the number of genes that can be profiled for expression levels with a single experiment has increased to the order of tens of thousands of genes. Therefore, the bottleneck in this process has become the data analysis rather than the data generation. Many statistical methods and computational tools are required for getting meaningful results from the data, which comes with a lot of valuable information along with a lot of sources of noise. Fortunately, most of the steps of RNA-seq analysis have become quite mature over the years. Below we will first describe how to reach a read count table from raw fastq reads obtained from an Illumina sequencing run. We will then demonstrate in R how to process the count table, make a case-control differential expression analysis, and do some downstream functional enrichment analysis. @@ -40,44 +41,43 @@ the number of genes that can be profiled for expression levels with a single exp #### Quality check and read processing -The first step in any experiment that involves high-throughput short-read sequencing should be to check the sequencing quality of the reads before starting to do any downstream analysis. The quality of the input sequences holds fundamental importance in the confidence for the biological conclusions drawn from the experiment. We have introduced quality check and processing in Chapter \@ref(processingReads), those tools and workflows also apply in RNA-seq analysis. +The first step in any experiment that involves high-throughput short-read sequencing should be to check the sequencing quality of the reads before starting to do any downstream analysis. The quality of the input sequences holds fundamental importance in the confidence for the biological conclusions drawn from the experiment. We have introduced quality check and processing in Chapter \@ref(processingReads), and those tools and workflows also apply in RNA-seq analysis. #### Improving the quality -The second step in the RNA-seq analysis workflow is to improve the quality of the input reads. This step could be regarded as an optional step when the sequencing quality is very good. However, even with the highest quality sequencing datasets, this step may still improve the quality of the input sequences. The most common technical artifacts that can be filtered out are the adapter sequences that contaminate the sequenced reads, and the low quality bases that are usually found at the ends of the sequences. Commonly used tools in the field (trimmomatic [@bolger_trimmomatic:_2014], trimGalore [@noauthor_babraham_nodate]) are again not written in R, however there are alternative R libraries for carrying out the same functionality, for instance, QuasR [@gaidatzis_quasr:_2015] (see `QuasR::preprocessReads` function) and ShortRead [@morgan_shortread:_2009] (see `ShortRead::filterFastq` function). Some of these approaches are introduced in Chapter \@ref(processingReads). +The second step in the RNA-seq analysis workflow is to improve the quality of the input reads. This step could be regarded as an optional step when the sequencing quality is very good. However, even with the highest-quality sequencing datasets, this step may still improve the quality of the input sequences. 
The most common technical artifacts that can be filtered out are the adapter sequences that contaminate the sequenced reads, and the low-quality bases that are usually found at the ends of the sequences. Commonly used tools in the field (trimmomatic [@bolger_trimmomatic:_2014], trimGalore [@noauthor_babraham_nodate]) are again not written in R, however there are alternative R libraries for carrying out the same functionality, for instance, QuasR [@gaidatzis_quasr:_2015] (see `QuasR::preprocessReads` function) and ShortRead [@morgan_shortread:_2009] (see `ShortRead::filterFastq` function). Some of these approaches are introduced in Chapter \@ref(processingReads). -The sequencing quality control and read pre-processing steps can be visited multiple times until achieving a satisfactory level of quality in the sequence data before moving onto the dowstream analysis steps. +The sequencing quality control and read pre-processing steps can be visited multiple times until achieving a satisfactory level of quality in the sequence data before moving on to the downstream analysis steps. ### Alignment -Once a decent level of quality in the sequences is reached, the expression level of the genes can be quantified by first mapping the sequences to a reference genome, and secondly matching the aligned reads to the gene annotations, in order to count the number of reads mapping to each gene. If the species under study has a well annotated transcriptome, the reads can be aligned to the transcript sequences instead of the reference genome. In cases where there is no good quality reference genome or transcriptome, it is possible to de novo assemble the transcriptome from the sequences and then quantify the expression levels of genes/transcripts. +Once a decent level of quality in the sequences is reached, the expression level of the genes can be quantified by first mapping the sequences to a reference genome, and secondly matching the aligned reads to the gene annotations, in order to count the number of reads mapping to each gene. If the species under study has a well-annotated transcriptome, the reads can be aligned to the transcript sequences instead of the reference genome. In cases where there is no good quality reference genome or transcriptome, it is possible to de novo assemble the transcriptome from the sequences and then quantify the expression levels of genes/transcripts. -For RNA-seq read alignments, apart from the availability of reference genomes and annotations, probably the most important factor to consider when choosing an alignment tool is whether the alignment method considers the absence of intronic regions in the sequenced reads, while the target genome may contain introns. Therefore, it is important to choose alignment tools that take into account alternative splicing. In the basic setting where a read, which originates from a cDNA sequence corresponding to an exon-exon junction, needs to be split into two parts when aligned against the genome. There are various tools that consider this factor such as STAR [@dobin_star:_2013], Tophat2 [@kim_tophat2:_2013], Hisat2 [@kim_hisat:_2015], GSNAP [@wu_gmap_2016]. Most alignment tools are written in C/C++ languages because of performance concerns. There are also R libraries that can do short read alignments, these are discussed in Chapter \@ref(processingReads). 
+For RNA-seq read alignments, apart from the availability of reference genomes and annotations, probably the most important factor to consider when choosing an alignment tool is whether the alignment method considers the absence of intronic regions in the sequenced reads, while the target genome may contain introns. Therefore, it is important to choose alignment tools that take into account alternative splicing. In the basic setting, a read that originates from a cDNA sequence corresponding to an exon-exon junction needs to be split into two parts when aligned against the genome. There are various tools that consider this factor, such as STAR [@dobin_star:_2013], Tophat2 [@kim_tophat2:_2013], Hisat2 [@kim_hisat:_2015], and GSNAP [@wu_gmap_2016]. Most alignment tools are written in C/C++ languages because of performance concerns. There are also R libraries that can do short read alignments; these are discussed in Chapter \@ref(processingReads). ### Quantification -After the reads are aligned to the target, a SAM/BAM file sorted by coordinates should have been obtained. The BAM file \index{BAM file}contains all alignment related information of all the reads that have been attempted to be aligned to the target sequence. This information consists of - most basically - the genomic coordinates (chromosome, start, end, strand) of where a sequence was matched (if at all) in the target, specific insertions/deletions/mismatches that describes the differences between the input and target sequences. These pieces of information are used along with the genomic coordinates of genome annotations such as gene/transcript models in order to count how many reads have been sequenced from a gene/transcript. As simple as it may sound, it is not a trivial task to assign reads to a gene/transcript just by comparing the genomic coordinates of the annotations and the sequences, because of the confounding factors such as overlapping gene annotations, overlapping exon annotations from different transcript isoforms of a gene, overlapping annotations from opposite DNA strands in the absence of a strand-specific sequencing protocol. Therefore, for read counting, it is important to consider: +After the reads are aligned to the target, a SAM/BAM file sorted by coordinates should have been obtained. The BAM file \index{BAM file}contains all alignment-related information of all the reads that have been attempted to be aligned to the target sequence. This information consists of - most basically - the genomic coordinates (chromosome, start, end, strand) of where a sequence was matched (if at all) in the target, and the specific insertions/deletions/mismatches that describe the differences between the input and target sequences. These pieces of information are used along with the genomic coordinates of genome annotations such as gene/transcript models in order to count how many reads have been sequenced from a gene/transcript. As simple as it may sound, it is not a trivial task to assign reads to a gene/transcript just by comparing the genomic coordinates of the annotations and the sequences, because of confounding factors such as overlapping gene annotations, overlapping exon annotations from different transcript isoforms of a gene, and overlapping annotations from opposite DNA strands in the absence of a strand-specific sequencing protocol. Therefore, for read counting, it is important to consider: - 1.
Strand specificity of the sequencing protocol: are the reads expected to originate from the forward strand, reverse strand, or unspecific? + 1. Strand specificity of the sequencing protocol: Are the reads expected to originate from the forward strand, reverse strand, or unspecific? 2. Counting mode: - - when counting at the gene-level: when there are overlapping annotations, which features should the read be assigned to? Tools usually have a parameter that lets the user to select a counting mode. - - when counting at the transcript-level: when there are multiple isoforms of a gene, which isoform should the read be assigned to? This consideration is usually an algorithmic consideration that is not modifiable by the end-user. + - when counting at the gene-level: When there are overlapping annotations, which features should the read be assigned to? Tools usually have a parameter that lets the user select a counting mode. + - when counting at the transcript-level: When there are multiple isoforms of a gene, which isoform should the read be assigned to? This consideration is usually an algorithmic consideration that is not modifiable by the end-user. Some tools can couple alignment to quantification (e.g. STAR), while some assume the alignments are already calculated and require BAM files as input. On the other hand, in the presence of good transcriptome annotations, alignment-free methods (Salmon [@patro_salmon:_2017], Kallisto [@bray_near-optimal_2016], Sailfish [@patro_sailfish_2014]) can also be used to estimate the expression levels of transcripts/genes. There are also reference-free quantification methods that can first de novo assemble the transcriptome and estimate the expression levels based on this assembly. Such a strategy can be useful in discovering novel transcripts or may be required in cases when a good reference does not exist. If a reference transcriptome exists but of low quality, a reference-based transcriptome assembler such as Cufflinks [@trapnell_transcript_2010] can be used to improve the transcriptome. In case there is no available transcriptome annotation, a de novo assembler such as Trinity [@haas_novo_2013] or Trans-ABySS [@robertson_novo_2010] can be used to assemble the transcriptome from scratch. -Within R, quantification can be done using - +Within R, quantification can be done using: - `Rsubread::featureCounts` - `QuasR::qCount` - `GenomicAlignments::summarizeOverlaps` ### Within sample normalization of the read counts -The most common application after a gene's expression is quantified (as the number of reads aligned to the gene), is to compare the gene's expression in different conditions, for instance, in a case-control setting (e.g. disease versus normal) or in a time-series (e.g. along different developmental stages). Making such comparisons help identify the genes that might be responsible for a disease or an impaired developmental trajectory. However, there are multiple caveats that needs to be addressed before making a comparison between the read counts of a gene in different conditions [@maza_comparison_2013] +The most common application after a gene's expression is quantified (as the number of reads aligned to the gene), is to compare the gene's expression in different conditions, for instance, in a case-control setting (e.g. disease versus normal) or in a time-series (e.g. along different developmental stages). Making such comparisons helps identify the genes that might be responsible for a disease or an impaired developmental trajectory. 
However, there are multiple caveats that need to be addressed before making a comparison between the read counts of a gene in different conditions [@maza_comparison_2013]. - - Library size (i.e. sequencing depth) varies between samples coming from different lanes of the flow cell of the sequencing machine. - - Longer genes will have higher number of reads. - - Library composition (i.e. relative size of the studied transcriptome) can be different in two different biological conditions. + - Library size (i.e. sequencing depth) varies between samples coming from different lanes of the flow cell of the sequencing machine. + - Longer genes will have a higher number of reads. + - Library composition (i.e. relative size of the studied transcriptome) can be different in two different biological conditions. - GC content biases across different samples may lead to a biased sampling of genes [@risso_gc-content_2011]. - Read coverage of a transcript can be biased and non-uniformly distributed along the transcript [@mortazavi_mapping_2008]. @@ -89,13 +89,13 @@ The most basic normalization\index{normalization} approaches address the sequenc - Upper Quartile Normalization (divide counts by the **upper quartile** value of the counts) - Median Normalization (divide counts by the **median** of all counts) -Popular metrics that improve upon CPM are RPKM/FPKM (reads/fragments per kilobase of million reads) and TPM (transcripts per million). RPKM is obtained by dividing the CPM value by another factor, which is the length of the gene per kilobases. FPKM is the same as RPKM, but is used for paired-end reads. Thus, RPKM/FPKM methods account for, firstly the **library size**, and secondly the **gene lengths**. +Popular metrics that improve upon CPM are RPKM/FPKM (reads/fragments per kilobase per million reads) and TPM (transcripts per million). RPKM is obtained by dividing the CPM value by another factor, which is the length of the gene per kilobase. FPKM is the same as RPKM, but is used for paired-end reads. Thus, RPKM/FPKM methods account for, firstly, the **library size**, and secondly, the **gene lengths**. TPM also controls for both the library size and the gene lengths; however, with the TPM method, the read counts are first normalized by the gene length (per kilobase), and then gene-length normalized values are divided by the sum of the gene-length normalized values and multiplied by 10^6. Thus, the sum of normalized values for TPM will always be equal to 10^6 for each library, while RPKM/FPKM values do not sum to 10^6. Therefore, it is easier to interpret TPM values than RPKM/FPKM values.
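To make the verbal definitions above concrete, the three metrics can be summarized compactly as below. This is only a restatement of the descriptions just given, where $q_{ij}$ denotes the raw read count of gene $i$ in sample $j$ and $l_{i}$ denotes the length of gene $i$ in kilobases:

$$CPM_{ij} = \frac{q_{ij}}{\sum_{k} q_{kj}} \times 10^6, \qquad RPKM_{ij} = \frac{CPM_{ij}}{l_{i}}, \qquad TPM_{ij} = \frac{q_{ij}/l_{i}}{\sum_{k} q_{kj}/l_{k}} \times 10^6$$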
### Computing different normalization schemes in R -Here we will assume that there is an RNA-seq count table comprising of raw-counts, meaning the number of reads counted for each gene has not been exposed to any kind of normalization and consists of integers. The rows of the count table correspond to the genes and the columns represent different samples. Here we will use a subset of the RNA-seq count table from a colorectal cancer study. We have filtered the original count table for only protein-coding genes (to improve the speed of calculation) and also selected only five metastasized colorectal cancer samples along with five normal colon samples. There is an additional column `width` that contains the length of the corresponding gene in the unit of base pairs. The length of the genes are important to compute RPKM and TPM values. The original count tables can be found from the recount2 database [REF] using the SRA project code `SRP029880` and the experimental setup along with other accessory information can be found from the NCBI Trace archive using the SRA project code `SRP029880` [here](https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP029880). +Here we will assume that there is an RNA-seq count table comprising raw counts, meaning the number of reads counted for each gene has not been exposed to any kind of normalization and consists of integers. The rows of the count table correspond to the genes and the columns represent different samples. Here we will use a subset of the RNA-seq count table from a colorectal cancer study. We have filtered the original count table for only protein-coding genes (to improve the speed of calculation) and also selected only five metastasized colorectal cancer samples along with five normal colon samples. There is an additional column `width` that contains the length of the corresponding gene in the unit of base pairs. The lengths of the genes are important to compute RPKM and TPM values. The original count tables can be found from the recount2 database (https://jhubiostatistics.shinyapps.io/recount/) using the SRA project code _SRP029880_, and the experimental setup along with other accessory information can be found from the NCBI Trace archive using the SRA project code [SRP029880](https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP029880). ```{r crc_counts} #colorectal cancer @@ -109,7 +109,7 @@ counts <- as.matrix(read.table(counts_file, header = T, sep = '\t')) #### Computing CPM -Let's do a summary of the counts table. Due to space limitations, the summary for only the first three columns are displayed. +Let's do a summary of the counts table. Due to space limitations, the summary for only the first three columns is displayed. ```{r crc_counts_summary} summary(counts[,1:3]) @@ -142,7 +142,7 @@ rpkm <- apply(X = subset(counts, select = c(-width)), }) ``` -Check the sample sizes of RPKM. Notice that the sums of samples are all different +Check the sample sizes of RPKM. Notice that the sums of samples are all different. ```{r rpkm_2} colSums(rpkm) @@ -164,20 +164,20 @@ Check the sample sizes of `tpm`. Notice that the sums of samples are all equal t colSums(tpm) ``` -None of these metrics (CPM, RPKM/FPKM, TPM) account for the other important confounding factor when comparing expression levels of genes across samples: the **library composition**, which may also be referred to as the **relative size of the compared transcriptomes**. This factor is not dependent on the sequencing technology, it is rather biological. For instance, when comparing transcriptomes of different tissues, there can be sets of genes in one tissue that consume a big chunk of the reads, while in the other tissue not expressed at all. This kind of imbalances in the composition of compared transcriptomes can lead to wrong conclusions about which genes are actually differentially expressed. This consideration is addressed in two popular R packages: `DESeq2`\index{R Packages!\texttt{DESeq2}} [@love_moderated_2014] and edgeR [@robinson_edger:_2010] each with a different algorithm. `edgeR`\index{R Packages!\texttt{edgeR}} uses a normalization procedure called Trimmed Mean of M-values (TMM).
`DESeq2` implements a normalization procedure using Median of Ratios, which is obtained by finding the ratio of log-transformed count of a gene divided by the average of log-transformed values of the gene in all samples (geometric mean), and then taking the median of these values for all genes. The raw read count of the gene is finally divided by this value (median of ratios) to obtain the normalized counts of genes. +None of these metrics (CPM, RPKM/FPKM, TPM) account for the other important confounding factor when comparing expression levels of genes across samples: the **library composition**, which may also be referred to as the **relative size of the compared transcriptomes**. This factor is not dependent on the sequencing technology, it is rather biological. For instance, when comparing transcriptomes of different tissues, there can be sets of genes in one tissue that consume a big chunk of the reads, while in the other tissues they are not expressed at all. This kind of imbalance in the composition of compared transcriptomes can lead to wrong conclusions about which genes are actually differentially expressed. This consideration is addressed in two popular R packages: `DESeq2`\index{R Packages!\texttt{DESeq2}} [@love_moderated_2014] and edgeR [@robinson_edger:_2010] each with a different algorithm. `edgeR`\index{R Packages!\texttt{edgeR}} uses a normalization procedure called Trimmed Mean of M-values (TMM). `DESeq2` implements a normalization procedure using median of Ratios, which is obtained by finding the ratio of the log-transformed count of a gene divided by the average of log-transformed values of the gene in all samples (geometric mean), and then taking the median of these values for all genes. The raw read count of the gene is finally divided by this value (median of ratios) to obtain the normalized counts. ### Exploratory analysis of the read count table -A typical quality control, in this case interrogating the RNA-seq experiment design, is to measure the similarity of the samples with each other in terms of the quantified expression level profiles across a set of genes. One important observation to make is to see, whether the most similar samples to any given sample are the biological replicates of that sample. This can be computed using unsupervised clustering techniques such as hierarchical clustering and visualized as a heatmap with dendrograms. Another most commonly applied technique is a dimensionality reduction technique called Principal Component Analysis (PCA) and visualized as a two-dimensional (or in some cases three-dimensional) scatter plot. In order to find out more about the clustering methods and PCA, please refer to the Chapter \@ref(unsupervisedLearning). +A typical quality control, in this case interrogating the RNA-seq experiment design, is to measure the similarity of the samples with each other in terms of the quantified expression level profiles across a set of genes. One important observation to make is to see whether the most similar samples to any given sample are the biological replicates of that sample. This can be computed using unsupervised clustering techniques such as hierarchical clustering and visualized as a heatmap with dendrograms. Another most commonly applied technique is a dimensionality reduction technique called Principal Component Analysis (PCA) and visualized as a two-dimensional (or in some cases three-dimensional) scatter plot. 
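As a brief aside, the heatmap and PCA functions used in the following subsections wrap a handful of base R computations. The sketch below is an illustration only, assuming the gene-by-sample `tpm` matrix computed above; `dist()`, `hclust()` and `prcomp()` are the base R workhorses for distances, hierarchical clustering and PCA, respectively.

```{r clusteringBaseSketch, eval = FALSE}
# log-scale the expression values to dampen the effect of extreme counts
logExpr <- log2(tpm + 1)
# pairwise distances between samples (samples must be rows, hence t())
sampleDists <- dist(t(logExpr))
# hierarchical clustering of the samples and a simple dendrogram
sampleClust <- hclust(sampleDists)
plot(sampleClust)
# principal component analysis of the samples; plot the first two components
pcaRes <- prcomp(t(logExpr))
plot(pcaRes$x[, 1:2])
```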
In order to find out more about the clustering methods and PCA, please refer to Chapter \@ref(unsupervisedLearning). #### Clustering -We can combine clustering and visualization of the clustering results by using heatmap functions that are available in a variety of R libraries. The basic R installation comes with the `stats::heatmap` function. However, there are other libraries available in CRAN (e.g. `pheatmap` [@pheatmap])\index{R Packages!\texttt{pheatmap}} or Bioconductor (e.g. `ComplexHeatmap` [@gu_complex_2016]) \index{R Packages!\texttt{ComplexHeatmap}}that come with more flexibility and more appealing visualisations. +We can combine clustering and visualization of the clustering results by using heatmap functions that are available in a variety of R libraries. The basic R installation comes with the `stats::heatmap` function. However, there are other libraries available in CRAN (e.g. `pheatmap` [@pheatmap])\index{R Packages!\texttt{pheatmap}} or Bioconductor (e.g. `ComplexHeatmap` [@gu_complex_2016]) \index{R Packages!\texttt{ComplexHeatmap}}that come with more flexibility and more appealing visualizations. -Here we demonstrate a heatmap using `pheatmap` package and the previously calculated `tpm` matrix. +Here we demonstrate a heatmap using the `pheatmap` package and the previously calculated `tpm` matrix. As these matrices can be quite large, both computing the clustering and rendering the heatmaps can take a lot of resources and time. Therefore, a quick and informative way to compare samples is to select a subset of genes that are, for instance, most variable across samples, and use that subset to do the clustering and visualization. -Let's select top 100 most variable genes among the samples. +Let's select the top 100 most variable genes among the samples. ```{r select_genes_for_clustering} #compute the variance of each gene across samples @@ -187,18 +187,18 @@ V <- apply(tpm, 1, var) selectedGenes <- names(V[order(V, decreasing = T)][1:100]) ``` -Now we can quickly produce a heatmap where samples and genes are clustered (See Figure \@ref(fig:tpmhierClust1) ). +Now we can quickly produce a heatmap where samples and genes are clustered (see Figure \@ref(fig:tpmhierClust1) ). -```{r tpmhierClust1, fig.cap="Clustering and visualisation of top most variable genes as a heatmap",out.width = "50%"} +```{r tpmhierClust1, fig.cap="Clustering and visualization of the topmost variable genes as a heatmap.",out.width = "50%"} library(pheatmap) pheatmap(tpm[selectedGenes,], scale = 'row', show_rownames = FALSE) ``` We can also overlay some annotation tracks to observe the clusters. -Here it is important to observe whether the replicates of the same sample cluster most closely with each other, or not. Overlaying the heatmap with such annotation and displaying sample groups with distinct colors helps quickly see if there are samples that don't cluster as expected (See Figure \@ref(fig:tpmhierclust2) ). +Here it is important to observe whether the replicates of the same sample cluster most closely with each other, or not. Overlaying the heatmap with such annotation and displaying sample groups with distinct colors helps quickly see if there are samples that don't cluster as expected (see Figure \@ref(fig:tpmhierclust2) ). 
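The annotation overlay used for this purpose is nothing more than a data frame whose row names match the column names of the matrix passed to `pheatmap()`; each column of that data frame becomes one annotation track. Below is a hypothetical minimal sketch (the sample names are made up for illustration); in the chunk that follows, the same kind of data frame is instead read from the `colData` file of the experiment.

```{r annotationColSketch, eval = FALSE}
library(pheatmap)
# hypothetical sample annotation: row names must equal the column names
# of the matrix that is plotted
annotationDf <- data.frame(group = c('CASE', 'CASE', 'CTRL', 'CTRL'),
                           row.names = c('CASE_1', 'CASE_2',
                                         'CTRL_1', 'CTRL_2'))
pheatmap(tpm[selectedGenes, rownames(annotationDf)], scale = 'row',
         annotation_col = annotationDf,
         show_rownames = FALSE)
```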
-```{r tpmhierclust2, fig.cap='Clustering samples as a heatmap with sample annotations',out.width = "60%"} +```{r tpmhierclust2, fig.cap='Clustering samples as a heatmap with sample annotations.',out.width = "60%"} colData <- read.table(coldata_file, header = T, sep = '\t', stringsAsFactors = TRUE) pheatmap(tpm[selectedGenes,], scale = 'row', @@ -210,7 +210,7 @@ pheatmap(tpm[selectedGenes,], scale = 'row', Let's make a PCA plot \index{principal component analysis (PCA)} to see the clustering of replicates as a scatter plot in two dimensions (Figure \@ref(fig:pca1)). -```{r pca1, fig.cap='PCA plot of samples using TPM counts'} +```{r pca1, fig.cap='PCA plot of samples using TPM counts.'} library(stats) library(ggplot2) #transpose the matrix @@ -255,10 +255,10 @@ kable(correlationMatrix[samples,samples], ``` We can also draw more visually appealing correlation plots using the `corrplot` package (Figure \@ref(fig:corrplot3)). -Using `addrect` argument, we can split clusters into groups and surround them with rectangles. +Using the `addrect` argument, we can split clusters into groups and surround them with rectangles. By setting the `addCoef.col` argument to 'white', we can display the correlation coefficients as numbers in white color. -```{r corrplot3, fig.cap='Correlation plot of samples ordered by hierarchical clustering'} +```{r corrplot3, fig.cap='Correlation plot of samples ordered by hierarchical clustering.'} library(corrplot) corrplot(correlationMatrix, order = 'hclust', addrect = 2, addCoef.col = 'white', @@ -269,10 +269,10 @@ corrplot(correlationMatrix, order = 'hclust', Here pairwise correlation levels are visualized as colored circles. `Blue` indicates positive correlation, while `Red` indicates negative correlation. We could also plot this correlation matrix as a heatmap (Figure \@ref(fig:corrplot4)). As all the samples have a high pairwise -correlation score, using a heatmap instead of a corrplot helps to see the differences between samples more easily. -`annotation_col` argument helps to display sample annotations and `cutree_cols` argument is set to 2 to split the clusters into two groups based on the hierarchical clustering results. +correlation score, using a heatmap instead of a corrplot helps to see the differences between samples more easily. The +`annotation_col` argument helps to display sample annotations and the `cutree_cols` argument is set to 2 to split the clusters into two groups based on the hierarchical clustering results. -```{r corrplot4,fig.width=8, fig.cap='Pairwise correlation of samples displayed as a heatmap'} +```{r corrplot4,fig.width=8, fig.cap='Pairwise correlation of samples displayed as a heatmap.'} library(pheatmap) # split the clusters into two based on the clustering similarity pheatmap(correlationMatrix, @@ -283,24 +283,24 @@ pheatmap(correlationMatrix, ### Differential expression analysis -Differential expression analysis allows to test tens of thousands of hypotheses (one test for each gene) against the null hypothesis that the activity of the gene stays the same in two different conditions. There are multiple limiting factors that influence the power of detecting genes that have real changes between two biological conditions. Among these are the limited number of biological replicates, non-normality of the distribution of the read counts, and higher uncertainty of measurements for lowly expressed genes than highly expressed genes [@love_moderated_2014]. 
Tools such as `edgeR` and `DESeq2` address these limitations using sophisticated statistical models in order to maximize the amount of knowledge that can be extracted from such noisy datasets. In essence, these models assume that for each gene the read counts are generated by a negative binomial distribution\index{negative binomial distribution}. This is a popular distribution that is used for modeling count data. This distribution can be specified with a mean parameter, $m$, and a dispersion parameter, $\alpha$.The dispersion parameter $\alpha$ is directly related to the variance as the variance of this distribution is formulated as: $m+\alpha m^{2}$. Therefore, estimating these parameters are crucial for differential expression tests. The methods used in `edgeR` and `DESeq2` uses dispersion estimates from other genes with similar counts to precisely estimate the per-gene dispersion values. With accurate dispersion parameter estimate, one can estimate the variance more precisely which in turn -improve the result of the differential expression test. Although statistical models are different, the process here is similar to moderated t-test \index{moderated t-test} and qualifies as empirical Bayes method \index{empirical Bayes methods}we introduced in Chapter \@ref(stats). There, we calculated gene-wise variability and shrunk each gene-wise variability towards the median variability of all genes. In the case of RNA-seq the dispersion coefficient $\alpha$ is shrunk towards the value of dispersion from other genes with similar read counts. +Differential expression analysis allows us to test tens of thousands of hypotheses (one test for each gene) against the null hypothesis that the activity of the gene stays the same in two different conditions. There are multiple limiting factors that influence the power of detecting genes that have real changes between two biological conditions. Among these are the limited number of biological replicates, non-normality of the distribution of the read counts, and higher uncertainty of measurements for lowly expressed genes than highly expressed genes [@love_moderated_2014]. Tools such as `edgeR` and `DESeq2` address these limitations using sophisticated statistical models in order to maximize the amount of knowledge that can be extracted from such noisy datasets. In essence, these models assume that for each gene, the read counts are generated by a negative binomial distribution\index{negative binomial distribution}. This is a popular distribution that is used for modeling count data. This distribution can be specified with a mean parameter, $m$, and a dispersion parameter, $\alpha$. The dispersion parameter $\alpha$ is directly related to the variance as the variance of this distribution is formulated as: $m+\alpha m^{2}$. Therefore, estimating these parameters is crucial for differential expression tests. The methods used in `edgeR` and `DESeq2` use dispersion estimates from other genes with similar counts to precisely estimate the per-gene dispersion values. With accurate dispersion parameter estimates, one can estimate the variance more precisely, which in turn +improves the result of the differential expression test. Although statistical models are different, the process here is similar to the moderated t-test \index{moderated t-test}and qualifies as an empirical Bayes method \index{empirical Bayes methods} which we introduced in Chapter \@ref(stats). 
There, we calculated gene-wise variability and shrunk each gene-wise variability towards the median variability of all genes. In the case of RNA-seq the dispersion coefficient $\alpha$ is shrunk towards the value of dispersion from other genes with similar read counts. -Now let us take a closer look at `DESeq2` \index{R Packages!\texttt{DESeq2}}workflow and how it calculates differential expression: +Now let us take a closer look at the `DESeq2` \index{R Packages!\texttt{DESeq2}}workflow and how it calculates differential expression: 1. The read counts are normalized by computing size factors, which addresses the differences not only in the library sizes, but also the library compositions. -2. For each gene, a dispersion estimate is calculated. The dispersion value computed by `DESeq2` is equal to the squared coefficient of variation (variation divided by the mean). -3. A line is fit across the dispersion estimates of all genes computed in 2) versus the mean normalized counts of the genes. -4. Dispersion values of each gene is shrunken towards the fitted line in 3). -5. A Generalized Linear Model\index{generalized linear model} is fitted which considers additional confounding variables related to the experimental design such as sequencing batches, treatment, temperature, patient's age, sequencing technology etc. and uses negative binomial distribution for fitting count data. +2. For each gene, a dispersion estimate is calculated. The dispersion value computed by `DESeq2` is equal to the squared coefficient of variation (variation divided by the mean). +3. A line is fit across the dispersion estimates of all genes computed in step 2 versus the mean normalized counts of the genes. +4. Dispersion values of each gene are shrunk towards the fitted line in step 3. +5. A Generalized Linear Model\index{generalized linear model} is fitted which considers additional confounding variables related to the experimental design such as sequencing batches, treatment, temperature, patient's age, sequencing technology, etc., and uses negative binomial distribution for fitting count data. 6. For a given contrast (e.g. treatment type: drug-A versus untreated), a test for differential expression is carried out against the null hypothesis that the log fold change of the normalized counts of the gene in the given pair of groups is exactly zero. -7. Adjusts p-values for multiple-testing. +7. It adjusts p-values for multiple-testing. In order to carry out a differential expression analysis using `DESeq2`, three kinds of inputs are necessary: -1. The **read count table**: must be raw read counts as integers that are not processed in any form by a normalization technique. The rows represent features (e.g. genes, transcripts, genomic intervals) and columns represent samples. -2. A **colData** table: this table describes the experimental design. -3. A **design formula**: this formula is needed to describe the variable of interest in the analysis (e.g. treatment status) along with (optionally) other covariates (e.g. batch, temperature, sequencing technology). +1. The **read count table**: This table must be raw read counts as integers that are not processed in any form by a normalization technique. The rows represent features (e.g. genes, transcripts, genomic intervals) and columns represent samples. +2. A **colData** table: This table describes the experimental design. +3. A **design formula**: This formula is needed to describe the variable of interest in the analysis (e.g. 
treatment status) along with (optionally) other covariates (e.g. batch, temperature, sequencing technology). Let's define these inputs: @@ -308,7 +308,8 @@ Let's define these inputs: #remove the 'width' column countData <- as.matrix(subset(counts, select = c(-width))) #define the experimental setup -colData <- read.table(coldata_file, header = T, sep = '\t', stringsAsFactors = TRUE) +colData <- read.table(coldata_file, header = T, sep = '\t', + stringsAsFactors = TRUE) #define the design formula designFormula <- "~ group" ``` @@ -326,7 +327,7 @@ dds <- DESeqDataSetFromMatrix(countData = countData, print(dds) ``` -The `DESeqDataSet` object contains all the information about the experimental setup, the read counts, and the design formulas. Certain functions can be used to access these information separately: `rownames(dds)` shows which features are used in the study (e.g. genes), `colnames(dds)` displays the studied samples, `counts(dds)` displays the count table, `colData(dds)` displays the experimental setup. +The `DESeqDataSet` object contains all the information about the experimental setup, the read counts, and the design formulas. Certain functions can be used to access this information separately: `rownames(dds)` shows which features are used in the study (e.g. genes), `colnames(dds)` displays the studied samples, `counts(dds)` displays the count table, and `colData(dds)` displays the experimental setup. Remove genes that have almost no information in any of the given samples. ```{r deseq_setup_3} @@ -352,45 +353,49 @@ DEresults <- DEresults[order(DEresults$pvalue),] Thus we have obtained a table containing the differential expression status of case samples compared to the control samples. -It is important to note that the sequence of the elements provided in the `contrast` argument determines which group of samples are to be used as `control`. This impacts the way the results are interpreted, for instance, if a gene is found up-regulated (has a positive log2 fold change), the up-regulation status is only relative to the factor that is provided as `control`. In this case, we used samples from the `CTRL` group as `control` and contrasted the samples from the `CASE` group with respect to the `CTRL` samples. Thus genes with a positive log2 fold change are called up-regulated in the case samples with respect to the control, while genes with a negative log2 fold change are down-regulated in the case samples. Whether the deregulation is significant or not, warrants assessment of the adjusted p-values. +It is important to note that the sequence of the elements provided in the `contrast` argument determines which group of samples are to be used as the control. This impacts the way the results are interpreted, for instance, if a gene is found up-regulated (has a positive log2 fold change), the up-regulation status is only relative to the factor that is provided as control. In this case, we used samples from the "CTRL" group as control and contrasted the samples from the "CASE" group with respect to the "CTRL" samples. Thus genes with a positive log2 fold change are called up-regulated in the case samples with respect to the control, while genes with a negative log2 fold change are down-regulated in the case samples. Whether the deregulation is significant or not, warrants assessment of the adjusted p-values. -Let's have a look into the contents of the `DEresults` table. +Let's have a look at the contents of the `DEresults` table. 
```{r deseq_post_2} #shows a summary of the results print(DEresults) ``` -The first three lines in this output shows the contrast and the statistical test that were used to compute these results, along with the dimensions of the resulting table (number of columns and rows). Below these lines is the actual table with 6 columns: `baseMean` represents the average normalized expression of the gene across all considered samples. `log2FoldChange` represents the base-2 logarithm of the fold change of the normalized expression of the gene in the given contrast. `lfcSE` represents the standard error of log2 fold change estimate, and `stat` is the statistic calculated in the contrast which is translated into a `pvalue` and adjusted for multiple testing in the `padj` column. To find out about the importance of adjusting for `multiple testing`, refer to section `insert section for multiple testing`. +The first three lines in this output show the contrast and the statistical test that were used to compute these results, along with the dimensions of the resulting table (number of columns and rows). Below these lines is the actual table with 6 columns: `baseMean` represents the average normalized expression of the gene across all considered samples. `log2FoldChange` represents the base-2 logarithm of the fold change of the normalized expression of the gene in the given contrast. `lfcSE` represents the standard error of log2 fold change estimate, and `stat` is the statistic calculated in the contrast which is translated into a `pvalue` and adjusted for multiple testing in the `padj` column. To find out about the importance of adjusting for multiple testing, see Chapter \@ref(stats). #### Diagnostic plots At this point, before proceeding to do any downstream analysis and jumping to conclusions about the biological insights that are reachable with the experimental data at hand, it is important to do some more diagnostic tests to improve our confidence about the quality of the data and the experimental setup. -An MA plot is useful to observe if the data normalization worked well (Figure \@ref(fig:DEmaplot)). MA plot is a scatterplot where x axis denotes the average of normalized counts across samples and the y axis denotes the log fold change in the given contrast. Most points are expected to be on the horizontal 0 line (most genes are expected to be not differentially expressed). + ##### MA plot +An MA plot is useful to observe if the data normalization worked well (Figure \@ref(fig:DEmaplot)). The MA plot is a scatter plot where the x-axis denotes the average of normalized counts across samples and the y-axis denotes the log fold change in the given contrast. Most points are expected to be on the horizontal 0 line (most genes are not expected to be differentially expressed). -```{r DEmaplot, fig.cap='MA plot of differential expression results'} +```{r DEmaplot, fig.cap='MA plot of differential expression results.'} library(DESeq2) DESeq2::plotMA(object = dds, ylim = c(-5, 5)) ``` -It is also important to observe the distribution of raw p-values (Figure \@ref(fig:DEpvaldist)). We expect to see a peak around low p-values and a uniform distribution at P-values above 0.1. Otherwise, adjustment for multiple testing does not work and the results are not meaningful. -##### p-value distribution -```{r DEpvaldist, fig.cap='p-value distribution genes before adjusting for multiple-testing'} +##### P-value distribution + +It is also important to observe the distribution of raw p-values (Figure \@ref(fig:DEpvaldist)). 
We expect to see a peak around low p-values and a uniform distribution at p-values above 0.1. Otherwise, adjustment for multiple testing does not work and the results are not meaningful. + +```{r DEpvaldist, fig.cap='P-value distribution of genes before adjusting for multiple testing.'} library(ggplot2) ggplot(data = as.data.frame(DEresults), aes(x = pvalue)) + geom_histogram(bins = 100) ``` -A final diagnosis is to check the biological reproducibility of the sample replicates in a PCA plot or a heatmap. To plot the PCA results, we need to extract the normalized counts from the DESeqDataSet object. It is possible to color the points in the scatterplot by the variable of interest, which helps to see if the replicates cluster well (Figure \@ref(fig:DEpca)). + ##### PCA plot +A final diagnosis is to check the biological reproducibility of the sample replicates in a PCA plot or a heatmap. To plot the PCA results, we need to extract the normalized counts from the DESeqDataSet object. It is possible to color the points in the scatter plot by the variable of interest, which helps to see if the replicates cluster well (Figure \@ref(fig:DEpca)). -```{r DEpca, fig.cap='Principle Component Analysis Plot based on top 500 most variable genes'} +```{r DEpca, fig.cap='Principal component analysis plot based on top 500 most variable genes.'} library(DESeq2) # extract normalized counts from the DESeqDataSet object countsNormalized <- DESeq2::counts(dds, normalized = TRUE) @@ -404,9 +409,9 @@ plotPCA(countsNormalized[selectedGenes,], xlim = c(-0.5, 0.5), ylim = c(-0.5, 0.6)) ``` -Alternatively, the normalized counts can be transformed using `DESeq2::rlog` function and `DESeq2::plotPCA()` can be readily used to plot the PCA results (Figure \@ref(fig:DErldnorm)). +Alternatively, the normalized counts can be transformed using the `DESeq2::rlog` function and `DESeq2::plotPCA()` can be readily used to plot the PCA results (Figure \@ref(fig:DErldnorm)). -```{r DErldnorm, fig.cap='PCA plot of top 500 most variable genes '} +```{r DErldnorm, fig.cap='PCA plot of top 500 most variable genes.'} rld <- rlog(dds) DESeq2::plotPCA(rld, ntop = 500, intgroup = 'group') + ylim(-50, 50) + theme_bw() @@ -414,11 +419,11 @@ DESeq2::plotPCA(rld, ntop = 500, intgroup = 'group') + ##### Relative Log Expression (RLE) plot -A similar plot to the MA plot is the RLE (Relative Log Expression) plot that is useful in finding out if the data at hand needs normalization [@gandolfo_rle_2018]. Sometimes, even the datasets normalized using the explained methods above may need further normalization due to unforeseen sources of variation that might stem from the library preparation, the person who carries out the experiment, the date of sequencing, the temperature changes in the laboratory at the time of library preparation, and so on and so fort. RLE plot is a quick diagnostic that can be applied on the raw or normalized count matrices to see if further processing is required. +A similar plot to the MA plot is the RLE (Relative Log Expression) plot that is useful in finding out if the data at hand needs normalization [@gandolfo_rle_2018]. Sometimes, even the datasets normalized using the explained methods above may need further normalization due to unforeseen sources of variation that might stem from the library preparation, the person who carries out the experiment, the date of sequencing, the temperature changes in the laboratory at the time of library preparation, and so on and so forth.
The RLE plot is a quick diagnostic that can be applied on the raw or normalized count matrices to see if further processing is required. -Let's do RLE plots on the raw counts and normalized counts using the `EDASeq` package [@risso_gc-content_2011]\index{R Packages!\texttt{EDASeq}} (See Figure \@ref(fig:DErleplot)). +Let's do RLE plots on the raw counts and normalized counts using the `EDASeq` package [@risso_gc-content_2011]\index{R Packages!\texttt{EDASeq}} (see Figure \@ref(fig:DErleplot)). -```{r DErleplot,fig.width=8, fig.cap='Relative Log Expression plots based on raw and normalized count matrices'} +```{r DErleplot,fig.width=8, fig.cap='Relative log expression plots based on raw and normalized count matrices'} library(EDASeq) par(mfrow = c(1, 2)) plotRLE(countData, outline=FALSE, ylim=c(-4, 4), @@ -430,20 +435,20 @@ plotRLE(DESeq2::counts(dds, normalized = TRUE), main = 'Normalized Counts') ``` -Here the RLE plot is comprised of box plots, where each box-plot represents the distribution of the relative log expression of the genes expressed in the corresponding sample. Each gene's expression is divided by the median expression value of that gene across all samples. Then this is transformed to log scale, which gives the relative log expression value for a single gene. The RLE values for all the genes from a sample is visualized as a boxplot. +Here the RLE plot is comprised of boxplots, where each box-plot represents the distribution of the relative log expression of the genes expressed in the corresponding sample. Each gene's expression is divided by the median expression value of that gene across all samples. Then this is transformed to log scale, which gives the relative log expression value for a single gene. The RLE values for all the genes from a sample are visualized as a boxplot. -Ideally the boxplots are centered around the horizontal zero line and are as tightly distributed as possible [@risso_normalization_2014]. From the plots that we have made for the raw and normalized count data, we can observe how the normalized dataset has improved upon the raw count data for all the samples. However, in some cases, it is important to visualize RLE plots in combination with other diagnostic plots such as PCA plots, heatmaps, and correlation plots to see if there is more unwanted variation in the data, which can be further accounted for using packages such as `RUVSeq` [@risso_normalization_2014]\index{R Packages!\texttt{RUVSeq}} and `sva` [@leek_sva_2012]\index{R Packages!\texttt{sva}} . We will cover details about `RUVSeq` package to account for unwanted sources of noise in RNA-seq datasets in the later sections. +Ideally the boxplots are centered around the horizontal zero line and are as tightly distributed as possible [@risso_normalization_2014]. From the plots that we have made for the raw and normalized count data, we can observe how the normalized dataset has improved upon the raw count data for all the samples. However, in some cases, it is important to visualize RLE plots in combination with other diagnostic plots such as PCA plots, heatmaps, and correlation plots to see if there is more unwanted variation in the data, which can be further accounted for using packages such as `RUVSeq` [@risso_normalization_2014]\index{R Packages!\texttt{RUVSeq}} and `sva` [@leek_sva_2012]\index{R Packages!\texttt{sva}}. We will cover details about the `RUVSeq` package to account for unwanted sources of noise in RNA-seq datasets in later sections. 
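Before moving on, the description above translates almost directly into code. The sketch below is an illustration only (the exact values may differ slightly from what `EDASeq::plotRLE` produces, and a pseudocount of 1 is added to avoid dividing by zero); it computes RLE values by hand from the raw count matrix defined earlier and draws one boxplot per sample.

```{r rleByHandSketch, eval = FALSE}
# relative log expression computed by hand:
# divide each gene's counts by that gene's median across samples,
# log-transform the ratios, then draw one boxplot per sample
geneMedians <- apply(countData + 1, 1, median)
rleValues <- log2((countData + 1) / geneMedians)
boxplot(rleValues, outline = FALSE, ylim = c(-4, 4), las = 2,
        main = 'RLE computed by hand')
abline(h = 0, lty = 2)
```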
-### Functional Enrichment Analysis +### Functional enrichment analysis #### GO term analysis -In a typical differential expression analysis, thousands of genes are found differentially expressed between two groups of samples. While prior knowledge of the functions of individual genes can give some clues about what kind of cellular processes have been affected, e.g. by a drug treatment, manually going through the whole list of thousands of genes would be very cumbersome and not be very informative in the end. Therefore a commonly used tool to address this problem is to do enrichment analyses of functional terms that appear associated to the given set of differentially expressed genes more often than expected by chance. The functional terms usually are associated to multiple genes. Thus, genes can be grouped into sets by shared functional terms. However, it is important to have an agreed upon controlled vocabulary on the list of terms used to describe the functions of genes. Otherwise, it would be impossible to exchange scientific results globally. That's why initiatives such as Gene Ontology Consortium have collated a list of Gene Ontology (GO) \index{Gene Ontology (GO)}terms for each gene. GO term analysis is probably the most common analysis applied after a differential expression analysis. GO term analysis helps quickly find out systematic changes that can describe differences between groups of samples. +In a typical differential expression analysis, thousands of genes are found differentially expressed between two groups of samples. While prior knowledge of the functions of individual genes can give some clues about what kind of cellular processes have been affected, e.g. by a drug treatment, manually going through the whole list of thousands of genes would be very cumbersome and not be very informative in the end. Therefore a commonly used tool to address this problem is to do enrichment analyses of functional terms that appear associated to the given set of differentially expressed genes more often than expected by chance. The functional terms are usually associated to multiple genes. Thus, genes can be grouped into sets by shared functional terms. However, it is important to have an agreed upon controlled vocabulary on the list of terms used to describe the functions of genes. Otherwise, it would be impossible to exchange scientific results globally. That's why initiatives such as the Gene Ontology Consortium have collated a list of Gene Ontology (GO) \index{Gene Ontology (GO)}terms for each gene. GO term analysis is probably the most common analysis applied after a differential expression analysis. GO term analysis helps quickly find out systematic changes that can describe differences between groups of samples. -In R, one of the simplest ways to do functional enrichment analysis for a set of genes is via the `gProfileR` package \index{R Packages!\texttt{gProfileR}}. +In R, one of the simplest ways to do functional enrichment analysis for a set of genes is via the `gProfileR` package. \index{R Packages!\texttt{gProfileR}} Let's select the genes that are significantly differentially expressed between the case and control samples. -Let's extract genes that have an adjusted p-value below 0.1 and that show a 2-fold change (either negative or positive) in the case compared to control. We will then feed this gene set into `gProfileR` function. 
+Let's extract genes that have an adjusted p-value below 0.1 and that show a 2-fold change (either negative or positive) in the case compared to control. We will then feed this gene set into the `gProfileR` function. The top 10 detected GO terms are displayed in Table \@ref(tab:GOanalysistable). ```{r GO_analysis} library(DESeq2) @@ -469,19 +474,19 @@ goResults <- gprofiler(query = genesOfInterest, hier_filtering = 'moderate') ``` -The top 10 detected GO terms are displayed in Table \@ref(tab:GOanalysistable): + ```{r GOanalysistable, echo = FALSE} # sort the enriched GO terms by pvalue and print the top 10 terms # for the selected columns from the go results kable(goResults[order(goResults$p.value), c(3:4, 7, 10, 12)][1:10,], - booktabs = TRUE, caption = 'Top GO terms sorted by p-value') + booktabs = TRUE, caption = 'Top GO terms sorted by p-value.') ``` #### Gene set enrichment analysis -A gene set is a collection of genes with some common property. This shared property among a set of genes could be a GO term, a common biological pathway, a shared interaction partner, or any biologically relevant commonality that is meaningful in the context of the pursued experiment. Gene set enrichment analysis (GSEA) is a valuable exploratory analysis tool that can associate systematic changes to a high-level function rather than individual genes. Analysis of coordinated changes of expression levels of gene sets can provide complementary benefits on top of per-gene based differential expression analyses. For instance, consider a gene set belonging to a biological pathway where each member of the pathway displays a slight deregulation in a disease sample compared to a normal sample. In such a case individual genes might not be picked up by the per-gene based differential expression analysis. Thus, the GO/Pathway enrichment on the differentially expressed list of genes would not show an enrichment of this pathway. However, the additive effect of slight changes of the genes could amount to a large effect at the level of the gene set, thus the pathway could be detected as a significant pathway that could explain the mechanistic problems in the disease sample. +A gene set is a collection of genes with some common property. This shared property among a set of genes could be a GO term, a common biological pathway, a shared interaction partner, or any biologically relevant commonality that is meaningful in the context of the pursued experiment. Gene set enrichment analysis (GSEA) is a valuable exploratory analysis tool that can associate systematic changes to a high-level function rather than individual genes. Analysis of coordinated changes of expression levels of gene sets can provide complementary benefits on top of per-gene-based differential expression analyses. For instance, consider a gene set belonging to a biological pathway where each member of the pathway displays a slight deregulation in a disease sample compared to a normal sample. In such a case, individual genes might not be picked up by the per-gene-based differential expression analysis. Thus, the GO/Pathway enrichment on the differentially expressed list of genes would not show an enrichment of this pathway. However, the additive effect of slight changes of the genes could amount to a large effect at the level of the gene set, thus the pathway could be detected as a significant pathway that could explain the mechanistic problems in the disease sample. 
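Before walking through the worked example below, it is useful to know how gene sets are usually represented in R: simply as a named list of character vectors of gene identifiers, which is also the structure that the `gage` function used below expects for its `gsets` argument. A toy sketch (the set names and genes are made up for illustration):

```{r geneSetListSketch, eval = FALSE}
# hypothetical gene sets as a named list of gene symbol vectors
exampleGeneSets <- list(
  'apoptosis'  = c('TP53', 'BAX', 'CASP3', 'CASP9'),
  'glycolysis' = c('HK1', 'PFKL', 'ALDOA', 'PKM')
)
```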
We use the Bioconductor package `gage` [@luo_gage:_2009] \index{R Packages!\texttt{gage}}to demonstrate how to do GSEA using normalized expression data of the samples as input. Here we are using only two gene sets: one from the top GO term discovered from the previous GO analysis, and one that we compile by randomly selecting a list of genes. However, annotated gene sets can be used from databases such as MSIGDB [@subramanian_gene_2005], which compile gene sets from a variety of resources such as KEGG [@kanehisa_kegg_2016] and REACTOME [@fabregat_reactome_2018]. @@ -517,23 +522,23 @@ gseaResults <- gage(exprs = log2(normalizedCounts+1), gsets = geneSets, compare = 'as.group') ``` -We can observe if there is a significant upregulation or downregulation of the gene set in the case group compared to the controls by accessing `gseaResults$greater` as in Table \@ref(tab:gseaPost1) or `gseaResults$less` as in Table \@ref(tab:gseaPost2). +We can observe if there is a significant up-regulation or down-regulation of the gene set in the case group compared to the controls by accessing `gseaResults$greater` as in Table \@ref(tab:gseaPost1) or `gseaResults$less` as in Table \@ref(tab:gseaPost2). ```{r gseaPost1, echo = FALSE} knitr::kable(gseaResults$greater, - booktabs = TRUE, + booktabs = TRUE, digits = 4, caption = "Up-regulation statistics") ``` ```{r gseaPost2, echo = FALSE} kable(gseaResults$less, - booktabs = TRUE, + booktabs = TRUE, digits = 4, caption = 'Down-regulation statistics') ``` -We can see that the random gene set shows no significant up or down-regulation (Tables \@ref(tab:gseaPost1) and (\@ref(tab:gseaPost2)), while the gene set we defined using the top GO term shows a significant up-regulation (adjusted p-value < 0.0007) (\@ref(tab:gseaPost1)). It is worthwhile to visualize these systematic changes in a heatmap as in Figure \@ref(fig:gseaPost3). +We can see that the random gene set shows no significant up- or down-regulation (Tables \@ref(tab:gseaPost1) and \@ref(tab:gseaPost2)), while the gene set we defined using the top GO term shows a significant up-regulation (adjusted p-value < 0.0007) (Table \@ref(tab:gseaPost1)). It is worthwhile to visualize these systematic changes in a heatmap as in Figure \@ref(fig:gseaPost3). -```{r gseaPost3, fig.width=8,fig.cap='Heatmap of expression value from the genes with the top GO term'} +```{r gseaPost3, fig.width=8,fig.cap='Heatmap of expression values from the genes with the top GO term.'} library(pheatmap) # get the expression data for the gene set of interest M <- normalizedCounts[rownames(normalizedCounts) %in% geneSet1, ] @@ -553,7 +558,7 @@ compared to the controls. ### Accounting for additional sources of variation -When doing a differential expression analysis in a case-control setting, the variable of interest, i.e. the variable that explains the separation of the case samples from the control is usually the treatment, genotypic differences, a certain phenotype so on and sofort. However, in reality, depending on how the experiment and the sequencing was designed, there may be additional factors that might contribute to the variation between the compared samples. Sometimes, such variables are known, for instance, the date of the sequencing for each sample (batch information), or the temperature under which samples were kept. Such variables are not necessarily biological but rather technical, however, they still impact the measurements obtained from an RNA-seq experiment.
Such variables can introduce systematic shifts in the obtained measurements. Here, we will demonstrate: firstly how to account for such variables using DESeq2, when the possibles sources of variation are actually known; secondly, how to account for such variables when all we have is just a count table but we observe that the variable of interest only explains a small proportion of the differences between case and control samples. +When doing a differential expression analysis in a case-control setting, the variable of interest, i.e. the variable that explains the separation of the case samples from the control, is usually the treatment, genotypic differences, a certain phenotype, etc. However, in reality, depending on how the experiment and the sequencing were designed, there may be additional factors that might contribute to the variation between the compared samples. Sometimes, such variables are known, for instance, the date of the sequencing for each sample (batch information), or the temperature under which samples were kept. Such variables are not necessarily biological but rather technical, however, they still impact the measurements obtained from an RNA-seq experiment. Such variables can introduce systematic shifts in the obtained measurements. Here, we will demonstrate: firstly how to account for such variables using DESeq2, when the possible sources of variation are actually known; secondly, how to account for such variables when all we have is just a count table but we observe that the variable of interest only explains a small proportion of the differences between case and control samples. #### Accounting for covariates using DESeq2 @@ -573,7 +578,7 @@ colData <- read.table(colData_file, header = T, sep = '\t', Let's take a look at how the samples cluster by calculating the TPM counts as displayed as a heatmap in Figure \@ref(fig:batcheffects2). -```{r batcheffects2,fig.width=8, fig.cap='Visualizing batch effects in an experiment'} +```{r batcheffects2,fig.width=8, fig.cap='Visualizing batch effects in an experiment.'} library(pheatmap) #find gene length normalized values geneLengths <- counts$width @@ -590,7 +595,7 @@ pheatmap(tpm[selectedGenes,], show_rownames = FALSE) ``` -Here we can see from the clusters that the dominating variable is the 'Library Selection' variable rather than the 'diagnosis' variable that determines the state of the organ from which the sample was taken. Case and control samples are all mixed in both two major clusters. However, ideally, we'd like to see a separation of the case and control samples regardless of the additional covariates. When testing for differential gene expression between conditions, such confounding variables can be accounted for using `DESeq2`. Below is a demonstration of how we instruct `DESeq2` to account for the 'library selection' variable: +Here we can see from the clusters that the dominating variable is the 'Library Selection' variable rather than the 'diagnosis' variable, which determines the state of the organ from which the sample was taken. Case and control samples are all mixed in both two major clusters. However, ideally, we'd like to see a separation of the case and control samples regardless of the additional covariates. When testing for differential gene expression between conditions, such confounding variables can be accounted for using `DESeq2`. 
Below is a demonstration of how we instruct `DESeq2` to account for the 'library selection' variable: ```{r batch_effects_3} library(DESeq2) @@ -616,7 +621,7 @@ DEresults <- results(dds, contrast = c('group', 'CASE', 'CTRL')) In cases when the sources of potential variation are not known, it is worthwhile to use tools such as `RUVSeq` or `sva` that can estimate potential sources of variation and clean up the counts table from those sources of variation. Later on, the estimated covariates can be integrated into DESeq2's design formula. -Let's see how to utilize `RUVseq` package to first diagnose the problem and then solve it. Here, for demonstration purposes, we'll use a count table from a lung carcinoma study in which a transcription factor (Ets homologous factor - EHF) is overexpressed and compared to the control samples with baseline EHF expression. Again, we only consider protein coding genes and use only five case and five control samples. The original data can be found on the `recount2` database with the accession 'SRP049988'. +Let's see how to utilize the `RUVseq` package to first diagnose the problem and then solve it. Here, for demonstration purposes, we'll use a count table from a lung carcinoma study in which a transcription factor (Ets homologous factor - EHF) is overexpressed and compared to the control samples with baseline EHF expression. Again, we only consider protein coding genes and use only five case and five control samples. The original data can be found on the `recount2` database with the accession 'SRP049988'. ```{r ruv_setup} counts_file <- system.file('extdata/rna-seq/SRP049988.raw_counts.tsv', @@ -632,8 +637,8 @@ colData$source_name <- ifelse(colData$group == 'CASE', 'EHF_overexpression', 'Empty_Vector') ``` -Let's start by making heatmaps of the samples using TPM counts (See Figure \@ref(fig:ruvdiagnose1)) -```{r ruvdiagnose1,fig.width=8, fig.cap='Diagnostic plot to observe' } +Let's start by making heatmaps of the samples using TPM counts (see Figure \@ref(fig:ruvdiagnose1)). +```{r ruvdiagnose1,fig.width=8, fig.cap='Diagnostic plot to observe.' } #find gene length normalized values geneLengths <- counts$width rpk <- apply( subset(counts, select = c(-width)), 2, @@ -662,9 +667,9 @@ set <- newSeqExpressionSet(counts = countData, phenoData = colData) ``` -Next, let's make a diagnostic RLE plot on raw count table. +Next, let's make a diagnostic RLE plot on the raw count table. 
-```{r ruvdiagnose2p1,fig.width=8, fig.cap='Diagnostic RLE and PCA plots based on raw count table'} +```{r ruvdiagnose2p1,fig.width=8, fig.cap='Diagnostic RLE and PCA plots based on raw count table.'} # make an RLE plot and a PCA plot on raw count data and color samples by group par(mfrow = c(1,2)) plotRLE(set, outline=FALSE, ylim=c(-4, 4), col=as.numeric(colData$group)) @@ -672,7 +677,7 @@ plotPCA(set, col = as.numeric(colData$group), adj = 0.5, ylim = c(-0.7, 0.5), xlim = c(-0.5, 0.5)) ``` -```{r ruvdiagnose2p2,fig.width=8, fig.cap='Diagnostic RLE and PCA plots based on TPM normalized count table'} +```{r ruvdiagnose2p2,fig.width=8, fig.cap='Diagnostic RLE and PCA plots based on TPM normalized count table.'} ## make RLE and PCA plots on TPM matrix par(mfrow = c(1,2)) plotRLE(tpm, outline=FALSE, ylim=c(-4, 4), col=as.numeric(colData$group)) @@ -688,7 +693,7 @@ Both RLE and PCA plots look better on normalized data (Figure \@ref(fig:ruvdiagn ##### Using RUVg -One way of removing unwanted variation is dependent on using a set of reference genes that are not expected to change by the sources of technical variation. One strategy along this line is to use spike-in genes, which are artifically introduced into the sequencing run [@jiang_synthetic_2011]. However, there are many sequencing datasets that don't have this spike-in data available. In such cases, an emprical set of genes can be collected from the expression data by doing a differential expression analysis and discovering genes that are unchanged in the given conditions. These unchanged genes are used to clean up the data from systematic shifts in expression due to the unwanted sources of variation. Another strategy could be to use a set of house-keeping genes as negative controls, and use them as a reference to correct the systematic biases in the data. Let's use a list of ~500 house-keeping genes compiled here: https://www.tau.ac.il/~elieis/HKG/HK_genes.txt. +One way of removing unwanted variation depends on using a set of reference genes that are not expected to change by the sources of technical variation. One strategy along this line is to use spike-in genes, which are artificially introduced into the sequencing run [@jiang_synthetic_2011]. However, there are many sequencing datasets that don't have this spike-in data available. In such cases, an empirical set of genes can be collected from the expression data by doing a differential expression analysis and discovering genes that are unchanged in the given conditions. These unchanged genes are used to clean up the data from systematic shifts in expression due to the unwanted sources of variation. Another strategy could be to use a set of house-keeping genes as negative controls, and use them as a reference to correct the systematic biases in the data. Let's use a list of ~500 house-keeping genes compiled here: https://www.tau.ac.il/~elieis/HKG/HK_genes.txt. ```{r ruv_g} library(RUVSeq) @@ -702,8 +707,8 @@ HK_genes <- read.table(file = system.file("extdata/rna-seq/HK_genes.txt", # in the count table house_keeping_genes <- intersect(rownames(set), HK_genes$V1) ``` -We will now run `RUVg()` with different number of factors of unwanted variation. We will plot the PCA after removing the unwanted variation. We should be able to see which `k` values, number of factors, produce better separation between sample groups. 
-```{r ruvgf1, fig.cap='PCA plots on RUVg normalized data with varying number of covariates (k)', fig.width=8,fig.height=8} +We will now run `RUVg()` with the different number of factors of unwanted variation. We will plot the PCA after removing the unwanted variation. We should be able to see which `k` values, number of factors, produce better separation between sample groups. +```{r ruvgf1, fig.cap='PCA plots on RUVg normalized data with varying number of covariates (k).', fig.width=8,fig.height=8} # now, we use these genes as the empirical set of genes as input to RUVg. # we try different values of k and see how the PCA plots look @@ -717,7 +722,7 @@ for(k in 1:4) { ``` Based on the separation of case and control samples in the PCA plots in Figure \@ref(fig:ruvgf1), -we choose k = 1 and re-run `RUVg()` function with the house keeping genes to do more diagnostic plots. +we choose k = 1 and re-run the `RUVg()` function with the house-keeping genes to do more diagnostic plots. ```{r ruv_g2} # choose k = 1 @@ -725,9 +730,9 @@ we choose k = 1 and re-run `RUVg()` function with the house keeping genes to do set_g <- RUVg(x = set, cIdx = house_keeping_genes, k = 1) ``` -Now let's do diagnostics: compare the count matrices with or without RUVg processing, comparing RLE plots (Figure \@ref(fig:ruvgf2)) and PCA plots (Figure \@ref(fig:ruvgf3)) to see the effect of RUVg on the normalisation and separation of case and control samples. +Now let's do diagnostics: compare the count matrices with or without RUVg processing, comparing RLE plots (Figure \@ref(fig:ruvgf2)) and PCA plots (Figure \@ref(fig:ruvgf3)) to see the effect of RUVg on the normalization and separation of case and control samples. -```{r ruvgf2,fig.width=8, fig.cap='RLE plots to observe the effect of RUVg'} +```{r ruvgf2,fig.width=8, fig.cap='RLE plots to observe the effect of RUVg.'} # RLE plots par(mfrow = c(1,2)) @@ -737,7 +742,7 @@ plotRLE(set_g, outline=FALSE, ylim=c(-4, 4), col=as.numeric(colData$group), main = 'with RUVg') ``` -```{r ruvgf3,fig.width=8, fig.cap='PCA plots to observe the effect of RUVg'} +```{r ruvgf3,fig.width=8, fig.cap='PCA plots to observe the effect of RUVg.'} # PCA plots par(mfrow = c(1,2)) @@ -753,9 +758,9 @@ We can observe that using `RUVg()` with house-keeping genes as reference has imp ##### Using RUVs -There is another strategy of `RUVSeq` that works better in the presence of replicates in the absence of a confounded experimental design, which is the `RUVs()` function. Let's see how that performs with this data. This time we don't use the house-keeping genes. We rather use all genes as input to `RUVs()`. This function estimates the correction factor by assuming that replicates should have constant biological variation, rather the variation in the replicates are the unwanted variation. +There is another strategy of `RUVSeq` that works better in the presence of replicates in the absence of a confounded experimental design, which is the `RUVs()` function. Let's see how that performs with this data. This time we don't use the house-keeping genes. We rather use all genes as input to `RUVs()`. This function estimates the correction factor by assuming that replicates should have constant biological variation, rather, the variation in the replicates are the unwanted variation. 
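Before running `RUVs()` on our data, it can help to see the replicate-group matrix it expects as its `scIdx` argument, which `makeGroups()` builds from the sample annotation. The following is only an illustrative sketch on a made-up label vector, separate from the analysis objects used in this chapter:

```{r makegroups-sketch, eval=FALSE}
library(RUVSeq)

# makeGroups() turns a vector of group labels into a matrix with one row per
# group; each row lists the column indices of the samples that are replicates
# of that group (rows are padded so that all rows have equal length)
makeGroups(c('CASE', 'CASE', 'CASE', 'CTRL', 'CTRL'))
```

In the chunk that follows, the labels come from `colData$group`, which is exactly what is passed to `makeGroups()`.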
-```{r ruvsf1, fig.cap='PCA plots on RUVs normalized data with varying number of covariates (k)', fig.height=8, fig.width=8}
+```{r ruvsf1, fig.cap='PCA plots on RUVs normalized data with varying number of covariates (k).', fig.height=8, fig.width=8}
 # make a table of sample groups from colData 
 differences <- makeGroups(colData$group)
@@ -773,8 +778,7 @@ for(k in 1:4) {
 ```
 
 Based on the separation of case and control samples in the PCA plots in Figure \@ref(fig:ruvsf1),
-we can see that the samples are better separated event at k = 2 when using `RUVs()`. Here, we re-run `RUVs()` function using k = 2,
-in order to do more diagnostic plots. We try to pick a value of k that is good enough to distinguish the samples by condition of interest. While setting the value of k to higher values could improve the percentage of explained variation by the first principle component to up to 61%, we try to avoid setting the value unnecessarily high to avoid removing factors that might also correlate with important biological differences between conditions.
+we can see that the samples are better separated even at k = 2 when using `RUVs()`. Here, we re-run the `RUVs()` function using k = 2, in order to do more diagnostic plots. We try to pick a value of k that is good enough to distinguish the samples by condition of interest. While setting the value of k to higher values could improve the percentage of explained variation by the first principal component to up to 61%, we try to avoid setting the value unnecessarily high, so that we do not remove factors that might also correlate with important biological differences between conditions.
 
 
 ```{r ruv_s2}
 
@@ -782,9 +786,9 @@ in order to do more diagnostic plots. We try to pick a value of k that is good e
 set_s <- RUVs(set, unique(rownames(set)), k=2, differences) #
 ```
 
-Now let's do diagnostics again: compare the count matrices with or without RUVs processing, comparing RLE plots (Figure \@ref(fig:ruvsf2)) and PCA plots (Figure \@ref(fig:ruvsf3)) to see the effect of RUVg on the normalisation and separation of case and control samples.
+Now let's do diagnostics again: compare the count matrices with or without RUVs processing, comparing RLE plots (Figure \@ref(fig:ruvsf2)) and PCA plots (Figure \@ref(fig:ruvsf3)) to see the effect of RUVs on the normalization and separation of case and control samples.
 
-```{r ruvsf2,fig.width=8, fig.cap='RLE plots to observe the effect of RUVs'}
+```{r ruvsf2,fig.width=8, fig.cap='RLE plots to observe the effect of RUVs.'}
 
 ## compare the initial and processed objects
 ## RLE plots
 
@@ -797,7 +801,7 @@ plotRLE(set_s, outline=FALSE, ylim=c(-4, 4),
         main = 'with RUVs')
 ```
 
-```{r ruvsf3,fig.width=8, fig.cap='PCA plots to observe the effect of RUVs'}
+```{r ruvsf3,fig.width=8, fig.cap='PCA plots to observe the effect of RUVs.'}
 
 ## PCA plots
 
 par(mfrow = c(1,2))
@@ -809,9 +813,9 @@ plotPCA(set_s, col=as.numeric(colData$group),
        ylim = c(-0.75, 0.75),
        xlim = c(-0.75, 0.75))
 ```
 
-Let's compare PCA results from RUVs and RUVg with the initial raw counts matrix. We will simply run `plotPCA()` function on different normalization schemes. The resulting plots are in Figure \@ref(fig:ruvcompare):
+Let's compare PCA results from RUVs and RUVg with the initial raw counts matrix. We will simply run the `plotPCA()` function on different normalization schemes. 
The resulting plots are in Figure \@ref(fig:ruvcompare): -```{r ruvcompare,fig.width=12,fig.height=5, fig.cap='PCA plots to observe the before/after effect of RUV functions',out.width="90%"} +```{r ruvcompare,fig.width=12,fig.height=5, fig.cap='PCA plots to observe the before/after effect of RUV functions.',out.width="90%"} par(mfrow = c(1,3)) plotPCA(countData, col=as.numeric(colData$group), main = 'without RUV - raw counts', adj = 0.5, @@ -826,7 +830,7 @@ plotPCA(set_s, col=as.numeric(colData$group), It looks like `RUVs()` has performed better than `RUVg()` in this case. So, let's use count data that is processed by `RUVs()` to re-do the initial heatmap. The resulting heatmap is in Figure \@ref(fig:ruvpost). -```{r ruvpost,fig.width=8, fig.cap='Clustering samples using top 500 most variable genes normalized using RUVs (k = 2)'} +```{r ruvpost,fig.width=8, fig.cap='Clustering samples using the top 500 most variable genes normalized using RUVs (k = 2).'} library(EDASeq) library(pheatmap) # extract normalized counts that are cleared from unwanted variation using RUVs @@ -873,7 +877,7 @@ res <- res[order(res$padj),] ## Other applications of RNA-seq -RNA-seq generates valuable data that contains information not only at the gene level but also at the level of exons and transcripts. Moreover, the kind of information that we can extract from RNA-seq is not limited to expression quantification. It is possible to detect alternative splicing events such as novel isoforms [@trapnell_transcript_2010], differential usage of exons [@anders_detecting_2012]. It is also possible to observe sequence variants (substitutions, insertions, deletions, RNA-editing) that may change the translated protein product [@mckenna_genome_2010]. In the context of cancer genomes, gene-fusion events can be detected with RNA-seq [@mcpherson_defuse:_2011]. Finally, for the purposes of gene prediction or improving existing gene predictions, RNA-seq is a valuable method [@stanke_augustus:_2005]. In order to learn more about how to implement these, it is recommended to go through the tutorials of the cited tools. +RNA-seq generates valuable data that contains information not only at the gene level but also at the level of exons and transcripts. Moreover, the kind of information that we can extract from RNA-seq is not limited to expression quantification. It is possible to detect alternative splicing events such as novel isoforms [@trapnell_transcript_2010], and differential usage of exons [@anders_detecting_2012]. It is also possible to observe sequence variants (substitutions, insertions, deletions, RNA-editing) that may change the translated protein product [@mckenna_genome_2010]. In the context of cancer genomes, gene-fusion events can be detected with RNA-seq [@mcpherson_defuse:_2011]. Finally, for the purposes of gene prediction or improving existing gene predictions, RNA-seq is a valuable method [@stanke_augustus:_2005]. In order to learn more about how to implement these, it is recommended that you go through the tutorials of the cited tools. ## Exercises @@ -889,20 +893,19 @@ coldata_file <- system.file("extdata/rna-seq/SRP029880.colData.tsv", ``` 1. Normalize the counts using the TPM approach. [Difficulty: **Beginner**] -2. Plot a heatmap of the top 500 most variable genes. Compare with the heatmap obtained using 100 most variable genes. [Difficulty: **Beginner**] -3. Re-do the heatmaps setting `scale` argument to `none`, and `column`. Compare the results with `scale = 'row'`. [Difficulty: **Beginner**] -4. 
Draw a correlation plot for the samples depicting the sample differences as 'ellipses', drawing only the upper end of the matrix, order samples by hierarchical clustering results based on `average` linkage clustering method. [Difficulty: **Beginner**] -5. How else could the count matrix be subsetted to obtain quick and accurate clusters? Try selecting top 100 genes that have the highest total expression in all samples and re-draw the cluster heatmaps and PCA plots. [Difficulty: **Intermediate**] -6. Add an additional column to the annotation data.frame object to annotate the samples and use the updated annotation data.frame to plot the heatmaps. (Hint: assign different batch values to CASE and CTRL samples). Make a PCA plot and color samples by the added variable (e.g. batch). [Difficulty: Intermediate] +2. Plot a heatmap of the top 500 most variable genes. Compare with the heatmap obtained using the 100 most variable genes. [Difficulty: **Beginner**] +3. Re-do the heatmaps setting the `scale` argument to `none`, and `column`. Compare the results with `scale = 'row'`. [Difficulty: **Beginner**] +4. Draw a correlation plot for the samples depicting the sample differences as 'ellipses', drawing only the upper end of the matrix, and order samples by hierarchical clustering results based on `average` linkage clustering method. [Difficulty: **Beginner**] +5. How else could the count matrix be subsetted to obtain quick and accurate clusters? Try selecting the top 100 genes that have the highest total expression in all samples and re-draw the cluster heatmaps and PCA plots. [Difficulty: **Intermediate**] +6. Add an additional column to the annotation data.frame object to annotate the samples and use the updated annotation data.frame to plot the heatmaps. (Hint: Assign different batch values to CASE and CTRL samples). Make a PCA plot and color samples by the added variable (e.g. batch). [Difficulty: Intermediate] 7. Try making the heatmaps using all the genes in the count table, rather than sub-selecting. [Difficulty: **Advanced**] -8. Use [Rtsne package](https://cran.r-project.org/web/packages/Rtsne/Rtsne.pdf) to draw a t-SNE plot of the expression values. Color the points by sample group. Compare the results with the PCA plots. [Difficulty: **Advanced**] +8. Use the [`Rtsne` package](https://cran.r-project.org/web/packages/Rtsne/Rtsne.pdf) to draw a t-SNE plot of the expression values. Color the points by sample group. Compare the results with the PCA plots. [Difficulty: **Advanced**] ### Differential expression analysis -Firstly, carry out a differential expression analysis starting from raw counts - -Use the below datasets: +Firstly, carry out a differential expression analysis starting from raw counts. +Use the following datasets: ``` counts_file <- system.file("extdata/rna-seq/SRP029880.raw_counts.tsv", @@ -911,30 +914,30 @@ coldata_file <- system.file("extdata/rna-seq/SRP029880.colData.tsv", package = "compGenomRData") ``` -- Import the read counts and colData tables -- Set up a DESeqDataSet object -- Filter out genes with low counts +- Import the read counts and colData tables. +- Set up a DESeqDataSet object. +- Filter out genes with low counts. - Run DESeq2 contrasting the `CASE` sample with `CONTROL` samples. Now, you are ready to do the following exercises: -1. Make a volcano plot using the differential expression analysis results. (Hint: x axis denotes the log2FoldChange and the y-axis represents the -log10(pvalue)).[Difficulty: **Beginner**] -2. 
Use DESeq2::plotDispEsts to make a dispersion plot and find out the meaning of this plot. (Hint: type ?DESeq2::plotDispEsts) [Difficulty: **Beginner**] -3. Find out about `lfcThreshold` argument of `DESeq2::results` function. What is its default value? What does it mean to change the default value to, for instance, `1`? [Difficulty: **Intermediate**] +1. Make a volcano plot using the differential expression analysis results. (Hint: x-axis denotes the log2FoldChange and the y-axis represents the -log10(pvalue)). [Difficulty: **Beginner**] +2. Use DESeq2::plotDispEsts to make a dispersion plot and find out the meaning of this plot. (Hint: Type ?DESeq2::plotDispEsts) [Difficulty: **Beginner**] +3. Explore `lfcThreshold` argument of the `DESeq2::results` function. What is its default value? What does it mean to change the default value to, for instance, `1`? [Difficulty: **Intermediate**] 4. What is independent filtering? What happens if we don't use it? Google `independent filtering statquest` and watch the online video about independent filtering. [Difficulty: **Intermediate**] -5. Re-do the differential expression analysis using `edgeR` package. Find out how much DESeq2 and edgeR agree on the list of differentially expressed genes.[Difficulty: **Advanced**] -6. Use `compcodeR` package to run the differential expression analysis using at least three different tools and compare and contrast the results following the compcodeR vignette. [Difficulty: **Advanced**] +5. Re-do the differential expression analysis using the `edgeR` package. Find out how much DESeq2 and edgeR agree on the list of differentially expressed genes. [Difficulty: **Advanced**] +6. Use the `compcodeR` package to run the differential expression analysis using at least three different tools and compare and contrast the results following the `compcodeR` vignette. [Difficulty: **Advanced**] ### Functional enrichment analysis 1. Re-run gProfileR, this time using pathway annotations such as KEGG, REACTOME, and protein complex databases such as CORUM, in addition to the GO terms. Sort the resulting tables by columns `precision` and/or `recall`. How do the top GO terms change when sorted for `precision`, `recall`, or `p.value`? [Difficulty: **Beginner**] -2. Repeat the gene set enrichment analysis by trying different options for the `compare` argument of `GAGE:gage` +2. Repeat the gene set enrichment analysis by trying different options for the `compare` argument of the `GAGE:gage` function. How do the results differ? [Difficulty: **Beginner**] -3. Make a scatter plot of GO term sizes and obtained p-values by setting the `gProfiler::gprofiler` argument `significant = FALSE`. Is there a correlation of term sizes and p-values? (Hint: take -log10 of p-values). If so, how can this bias be mitigated? [Difficulty: **Intermediate**] +3. Make a scatter plot of GO term sizes and obtained p-values by setting the `gProfiler::gprofiler` argument `significant = FALSE`. Is there a correlation of term sizes and p-values? (Hint: Take -log10 of p-values). If so, how can this bias be mitigated? [Difficulty: **Intermediate**] 4. Do a gene-set enrichment analysis using gene sets from top 10 GO terms. [Difficulty: **Intermediate**] 5. What are the other available R packages that can carry out gene set enrichment analysis for RNA-seq datasets? [Difficulty: **Intermediate**] -6. Use the topGO package (https://bioconductor.org/packages/release/bioc/html/topGO.html) to re-do the GO term analysis. 
Compare and contrast the results with what has been obtained using gProfileR package. Which tool is faster? gProfileR or topGO? Why? [Difficulty: **Advanced**] -7. Given a gene set annotated for human, how can it be utilized to work on C. elegans data? (Hint: see `biomaRt::getLDS`). [Difficulty: **Advanced**] +6. Use the topGO package (https://bioconductor.org/packages/release/bioc/html/topGO.html) to re-do the GO term analysis. Compare and contrast the results with what has been obtained using the `gProfileR` package. Which tool is faster, `gProfileR` or topGO? Why? [Difficulty: **Advanced**] +7. Given a gene set annotated for human, how can it be utilized to work on _C. elegans_ data? (Hint: See `biomaRt::getLDS`). [Difficulty: **Advanced**] 8. Import curated pathway gene sets with Entrez identifiers from the [MSIGDB database](http://software.broadinstitute.org/gsea/msigdb/collections.jsp) and re-do the GSEA for all curated gene sets. [Difficulty: **Advanced**] ### Removing unwanted variation from the expression data @@ -949,6 +952,6 @@ colData_file <- system.file('extdata/rna-seq/SRP049988.colData.tsv', 1. Run RUVSeq using multiple values of `k` from 1 to 10 and compare and contrast the PCA plots obtained from the normalized counts of each RUVSeq run. [Difficulty: **Beginner**] 2. Re-run RUVSeq using the `RUVr()` function. Compare PCA plots from `RUVs`, `RUVg` and `RUVr` using the same `k` values and find out which one performs the best. [Difficulty: **Intermediate**] -3. Do the necessary diagnostic plots using the differential expression results from the EHF count table. [difficulty: Intermediate] +3. Do the necessary diagnostic plots using the differential expression results from the EHF count table. [Difficulty: **Intermediate**] 4. Use the `sva` package to discover sources of unwanted variation and re-do the differential expression analysis using variables from the output of `sva` and compare the results with `DESeq2` results using `RUVSeq` corrected normalization counts. [Difficulty: **Advanced**] diff --git a/09-chip-seq-analysis.Rmd b/09-chip-seq-analysis.Rmd index 27fdb17..a6ac25d 100644 --- a/09-chip-seq-analysis.Rmd +++ b/09-chip-seq-analysis.Rmd @@ -15,78 +15,78 @@ knitr::opts_chunk$set(echo = TRUE, -Protein-DNA interactions are responsible for a large part of the gene expression regulation. Proteins such as transcription factors as well as histones are directly related to how much and in which contexts the genes are expressed. Some of these concepts are already introduced in Chapter \@ref(intro) if readers need a more in depth introduction. In this chapter, we will introduce how to process and analyze ChIP-seq data in order to identify genome-wide protein binding sites and to discover underlying sequence context via transcription factor binding-site motifs. +Protein-DNA interactions are responsible for a large part of the gene expression regulation. Proteins such as transcription factors as well as histones are directly related to how much and in which contexts the genes are expressed. Some of these concepts are already introduced in Chapter \@ref(intro) if readers need a more in-depth introduction. In this chapter, we will introduce how to process and analyze ChIP-seq data in order to identify genome-wide protein binding sites and to discover underlying sequence context via transcription factor binding-site motifs. 
## Regulatory protein-DNA interactions -One of the most fascinating biological phenomena is the fact that a myriad of different cell types, in a multicellular organisms, are encoded by one single genome. How exactly this is achieved is still a +One of the most fascinating biological phenomena is the fact that a myriad of different cell types, in a multicellular organism, are encoded by one single genome. How exactly this is achieved is still a major unanswered question in biology. Cell types differ based on a multitude of features: their size, shape, mobility, surface receptors, metabolic content. -However, the main predominant feature which influences all of the above is which +However, the main predominant feature, which influences all of the above, is which genes are expressed in each cell type. Therefore, if we can understand what controls which genes will be expressed, and where they will be expressed, we can start forming a picture of how a single genomic template, can give rise to a complex organism. -As explained in chapter \@ref(intro), gene expression is controlled by a special class of genes called +As explained in Chapter \@ref(intro), gene expression is controlled by a special class of genes called transcription factors - genes which control other genes. Transcription factor genes encode proteins which can bind to the DNA, and control whether a certain part of DNA will be transcribed (expressed), or stay silent (repressed). They program the expression patterns in each cell. -Transcription factors contain DNA binding domains - specifically folded protein sequences +Transcription factors contain DNA binding domains, which are specifically folded protein sequences which recognize specific DNA motifs (a short nucleotide sequence). -Such sequence binding imparts transcription factors with specificity - +Such sequence binding imparts transcription factors with specificity, transcription factors do not bind everywhere on the DNA, rather they are localized to short stretches which contain the corresponding DNA motif. DNA in the nucleus is wrapped around a protein complex called the histone complex. Histones form a chain of beads along the DNA. By changing their position, histones can make certain parts of the DNA more or less accessible to transcription -factors. Histone complexes can be chemically modified with different post-translational modifications (see chapter \@ref(intro)). Such modifications change histone +factors. Histone complexes can be chemically modified with different post-translational modifications (see Chapter \@ref(intro)). Such modifications change histone mobility, and their interactions with different proteins, thereby creating an additional regulatory layer on top of the DNA sequence. -In order to understand what are the target genes of a certain transcription factor, -and how it controls the gene expression we need to where on the DNA is the -transcription factor located. +In order to understand the target genes of a certain transcription factor, +and how they control the gene expression, we need to know where on the DNA the +transcription factor is located. ## Measuring protein-DNA interactions with ChIP-seq -ChIP-seq, stands for chromatin immunoprecipitation followed by sequencing, is an experimental method for finding locations on DNA which are bound by proteins. 
It has been extensively used to study +ChIP-seq stands for chromatin immunoprecipitation followed by sequencing, and is an experimental method for finding locations on DNA which are bound by proteins. It has been extensively used to study in-vivo binding preferences of transcription factors, and genomic distribution of modified histones. In the remainder of this chapter, you will learn how to assess quality control -of ChIP-seq data sets, perform peak calling, to find bound regions, and +of ChIP-seq data sets, perform peak calling to find bound regions, and assess the quality of the peak calling. Once you have obtained peaks, you will learn how to perform sequence analysis to construct motif models, and compare signals between experiments. -Biological experiment often contain multitude of consecutive steps. Each +Biological experiments often contain a multitude of consecutive steps. Each step can profoundly influence the quality of the data, and the subsequent analysis. -The computational biologist has to have an in depth knowledge of the experimental -design, and the underlying experimental steps, in order to choose the proper tools, -and the type of the analysis, which will give proper and correct results [@kharchenko_2008; @kidder_2011; @landt_2012; @chen_2012; @felsani_2015]. +The computational biologist has to have an in-depth knowledge of the experimental +design, and the underlying experimental steps, in order to choose the proper tools +and the type of analysis, which will give proper and correct results [@kharchenko_2008; @kidder_2011; @landt_2012; @chen_2012; @felsani_2015]. In this chapter we will go through the main experimental steps in the ChIP-seq analysis and address the most common experimental pitfalls. -The main principle of the method is use a specific antibody to enrich +The main principle of the method is to use a specific antibody to enrich DNA fragments which are bound by the protein of interest. The DNA fragments are then sequenced, mapped onto the corresponding -reference genome, and computationally analyzed to discriminate regions which +reference genome, and computationally analyzed to distinguish regions which were really bound by the protein, from the background regions. The experimental methodology is depicted in Figure \@ref(fig:ChIP-seq-Protocol-plot), and consists of the following steps: -1. Cross linking of cells with formaldehyde, to bind the proteins to the DNA. +1. Cross linking of cells with formaldehyde to bind the proteins to the DNA. This process covalently links the proteins to the DNA. -2. Fragmentation of DNA using sonication or enzymatic digestion - shearing -of DNA into small fragments (ranging from 50 - 500 bp) +2. Fragmentation of DNA using sonication or enzymatic digestion, shearing +of DNA into small fragments (ranging from 50 - 500 bp). 3. Immunoprecipitation using a specific antibody. An immunoprecipitation step which enriches fragments bound by the protein. @@ -101,7 +101,7 @@ protocol. Therefore the fragments need to be amplified using PCR. 7. 
DNA fragment sequencing -```{r ChIP-seq-Protocol-plot, echo=FALSE, include=TRUE, fig.cap="Main experimental steps in the ChIP-seq protocol"} +```{r ChIP-seq-Protocol-plot, echo=FALSE, include=TRUE, fig.cap="Main experimental steps in the ChIP-seq protocol."} knitr::include_graphics('./Figures/Chip-Seq_Protocol_Extended.png') ``` @@ -110,7 +110,7 @@ After sequencing, the role of the computational biologist is to assess the quality of the experiment, find the location of the protein of interest, and finally, to integrate with existing data sets. -Each step of the experimental protocol can affect on the quality +Each step of the experimental protocol can affect the quality of the data set, and the subsequent analysis steps. It is, therefore, of crucial importance to perform quality control for every sequenced experiment. @@ -119,7 +119,7 @@ importance to perform quality control for every sequenced experiment. ### Antibody specificity Antibody specificity is a term which refers to how strongly an antibody -binds to it's preferred target, with respect to everything else present in the cell. +binds to its preferred target, with respect to everything else present in the cell. It is the paramount measure influencing the successful execution of a ChIP experiment. Antibodies can bind multiple proteins with the same affinity. @@ -134,7 +134,7 @@ The exact recommendations are listed by the ENCODE consortium [@landt_2012]. Every time we are analyzing a new ChIP-seq experiment, we have to take our time to convince ourselves that all of the appropriate experimental controls were performed -to validate the antibody specificity[@Wardle_2015]. +to validate the antibody specificity [@Wardle_2015]. ### Sequencing depth @@ -143,10 +143,10 @@ Variation in sequencing depth is the first systematic technical bias we encounter in ChIP-seq experiments. Namely, different samples will contain different number of sequenced reads. Different sequencing depth influences our ability to detect enriched regions, and complicates comparisons between samples [@jung_2014]. -Statistical procedure of removing the influence of sequencing depth on the -quantification is called depth scaling - we calculate a scaling factor which +The statistical procedure of removing the influence of sequencing depth on the +quantification is called depth scaling; we calculate a scaling factor which is used to multiply the signal strength before the comparison. -There are multiple methods for normalization, and each method comes with it's assumptions. +There are multiple methods for normalization, and each method comes with its assumptions. **Scale normalization** is done by dividing the read counts (in certain genomic locations) by the total amount of sequenced reads. This method presumes that the ChIP efficiency worked equally well in all studied conditions. Because the ChIP efficiency @@ -157,17 +157,17 @@ different biological conditions (regions where the protein is constantly bound) and then uses the sum of the reads in those regions as the scaling factor. This method presumes that we can reliably identify regions which do not change [@shao_2012]. **Background normalization** presumes that the genome can be split into two categories: -background regions, and true signal regions. It then uses the number of reads in the +background regions and true signal regions. It then uses the number of reads in the background regions to define the scaling factor [@liang_2012]. 
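To make the idea of a scaling factor concrete, here is a minimal sketch (not part of the chapter's analysis) that assumes a hypothetical `counts` matrix with genomic windows in rows and samples in columns, and a hypothetical logical vector `is_background` marking the background windows:

```{r depth-scaling-sketch, eval=FALSE}
# scale normalization: each sample is divided by its total number of
# sequenced reads (expressed here as reads per million)
scale_factors = colSums(counts) / 10^6
counts_scaled = t(t(counts) / scale_factors)

# background normalization: the scaling factor is defined only from the reads
# that fall into background windows
bg_factors       = colSums(counts[is_background, ]) / 10^6
counts_bg_scaled = t(t(counts) / bg_factors)
```

The same reads-per-million logic reappears later in this chapter when we quantify the ChIP signal in tiling windows.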
-**External normalization** uses external reference for normalization - we
+**External normalization** uses an external reference for normalization; we
 add known amounts of chromatin from a distant species, or artificial spike-ins which are then
-used as a scaling reference. This is used when we thing there are global changes
-in the biding profiles between two biological conditions - very large changes in the
+used as a scaling reference. This is used when we think there are global changes
+in the binding profiles between two biological conditions -- very large changes in the
 signal profile [@bonhoure_2014].
 
-The choice of normalization method depends on the type of the analysis [@angelini_2015];
+The choice of normalization method depends on the type of analysis [@angelini_2015];
 if we want to quantitatively compare the abundance of different histone marks in
-different cell types, we will need different normalization procedure than if
+different cell types, we will need a different normalization procedure than if
 we want to compare TF binding in the same setting.
 
 
@@ -178,10 +178,10 @@ than the minimal amount which can be sequenced.
 Polymerase chain reaction (PCR) is a procedure used for amplification of DNA
 fragments. It is used to increase the amount of DNA in our sample prior to
 sequencing. PCR is a stochastic procedure, meaning that the results of each PCR
-reaction can not be predicted. Due to it's stochastic nature, PCR can
+reaction cannot be predicted. Due to its stochastic nature, PCR can
 be a significant source of variability in the ChIP-seq experiments
-[@aird_2011; @benjamini_2012; @teng_2016]. As a quality control is necessary to
-check whether all of our samples have the same sequence properties - i.e. the
+[@aird_2011; @benjamini_2012; @teng_2016]. Quality control is necessary to
+check whether all of our samples have the same sequence properties, i.e. the
 same enrichment of dinucleotides, such as CpG.
 If the samples differ in their sequence properties, that means we have to
 account for them during the analysis [@Teng_2017].
@@ -199,7 +199,7 @@ whether the observed changes are a result of the inherent biological
 variability (the source of which we do not understand), or they result from the
 change in the biological condition (different tissue or transcription factor
 used in the experiment).
-If we encounter and experimental setup which does not include biological
+If we encounter an experimental setup which does not include biological
 replicates, we should be extremely skeptical about all conclusions derived
 from such analysis.
 
@@ -208,11 +208,11 @@ from such analysis.
 There are three types of control experiments which can be performed to control
 for known and unknown experimental biases:
 
-1. **Input control** - sequencing of genomic DNA without the immunoprecipitation step
+1. **Input control**: Sequencing of genomic DNA without the immunoprecipitation step.
 
-2. **IgG control** - using a polyclonal mixture of non-specific IgG antibodies instead of a specific antibody
+2. **IgG control**: Using a polyclonal mixture of non-specific IgG antibodies instead of a specific antibody.
 
-3. **Knockout control** - performing the ChIP experiment in a biological system which
+3. **Knockout control**: Performing the ChIP experiment in a biological system which
does not contains our protein of interest (i.e. in a cell line where the transcription factor
was knocked out) [@krebs_2014]. 
@@ -224,16 +224,15 @@ Due to the hierarchical structure of chromatin, different genomic regions have sonication, and immunoprecipitation. This causes an uneven probability of observing DNA fragments originating from different genomic regions. Because different cell types (cell lines, and cancer cell lines), -have different chromatin structure, ChIP samples will show a cell type specific +have different chromatin structure, ChIP samples will show a cell-type-specific bias in observed enrichment profiles. An important note to consider is that the input control is basically a reduced whole genome sequencing experiment, while the ChIP enriches for only a subset of genomic regions. If both ChIP and Input samples are sequenced to the same depth (same number of reads), the background distribution in the input sample will -be under sampled. It is recommended to sequence the Input sample to at least -double the amount of reads of the ChIP sample (ref) +be under sampled. It is recommended to sequence the input sample deeper than the ChIP sample [@chen2012systematic]. -**IgG control** uses a soup of nonspecific antibodies for to control for background +**IgG control** uses a soup of nonspecific antibodies to control for background binding. In principle, the antibodies should be isolated from the same batch of serum which was used to create the specific antibody (used for ChIP). It should, in theory, give a background profile of non-specific binding. @@ -242,13 +241,13 @@ are unspecific, the amount of precipitated DNA will be low, and the samples will require additional rounds of PCR amplification. **KO control** is a ChIP experiment performed in the biological system where -the native protein is not present. Such experiment profiles the non-specific +the native protein is not present. Such an experiment profiles the non-specific binding of the antibody to other proteins, and directly to the DNA. -The primary, and only, concern is that the perturbation caused by the knock-out (knock-down), +The primary, and only, concern is that the perturbation caused by the knock-out (or knock-down), changes the cell so much, that the ChIP profile is not comparable to the original cell. This is the most accurate type of control experiment, however, it is frequently technically challenging -to perform - the cells are not viable after the knock out, or -it is knock out is impossible to perform. +to perform if the cells are not viable after the knock-out, or +if the knock-out is impossible to perform. ### Using tagged proteins @@ -265,41 +264,41 @@ of the protein, and therefore the experimental conclusions. ## Pre-processing ChIP data The focus of ChIP preprocessing is to check the quality of the sequencing -experiment, remove sequencing artifact, and find the genomic location of +experiment, remove sequencing artifacts, and find the genomic location of sequenced fragments using read mapping. -The quality control consists of the read quality control and adapter trimming. -These methods are described in depth in chapter \@ref(processingReads). +The quality control consists of read quality control and adapter trimming. +These methods are described in depth in Chapter \@ref(processingReads). ### Mapping of ChIP-seq data Mapping is a procedure of trying to locate the exact genomic location which -created each genomic fragment - each sequenced read. +created each genomic fragment, each sequenced read. 
Several tools are available for mapping ChIP-seq data sets: Bowtie, Bowtie2, BWA [@langmead_2009; @langmead_2012; @li_2009], and all of them have comparable sensitivity and specificity [@ruffalo_2011]. Read length is the variable with the biggest effect on the mapping procedure. -The longer the sequenced reads, the more uniquely can the read assigned +The longer the sequenced reads, the more uniquely can the read be assigned to a position on the genome. Reads which are assigned ambiguously to multiple locations in the genome are called multi-mapping reads. Such fragments are most often produced by repetitive genomic regions, such as retrotransposons, -pseudogenes or paraloguous genes [@li_2014]. -It is important to apriori decide whether such duplicated regions are of +pseudogenes or paralogous genes [@li_2014]. +It is important to, a priori, decide whether such duplicated regions are of interest for the current experimental setup (i.e. whether we want to study transcription factor binding in olfactory receptors). If they are, then the multi-mapping reads should be included in the analysis. If they are not, they should be omitted. This is done during the mapping step, by limiting the number of locations to which a read can map. The methodology of working with multi-mapping reads differs according to the -use case, and will not be considered in this chapter. For more information please +use case, and will not be considered in this chapter. For more information, please see the references [@chung_2011]. Current Illumina sequencing procedures enable sequencing of DNA fragments from just one, or both ends. Sequencing from both ends is called **paired-end** sequencing and greatly enhances -the sample **mappabillity** - percentage of genome which can be uniquely mapped. -Additionally, it provides out of the box estimate of the -average DNA fragment length - a parameter which is important for quality control +the sample **mappability**, the percentage of genome which can be uniquely mapped. +Additionally, it provides an out-of-the-box estimate of the +average DNA fragment length, a parameter which is important for quality control and peak calling. -Although it would be always preferable to do paired-end sequencing it +Although it would always be preferable to do paired-end sequencing it substantially increases the sequencing costs, which can be prohibitive. Different reads, which map to the same genomic location (same chromosome, position, and strand), @@ -308,7 +307,7 @@ DNA fragment was present multiple times during the library preparation. This can happen due to high enrichment with highly specific antibodies, or such fragments can be artificially produced during PCR amplification. Because we do not know the exact origin of the duplicated fragments, they are most often collapsed during -the peak calling procedure - i.e. when multiple reads map to the same chromosome, +the peak calling procedure, i.e. when multiple reads map to the same chromosome, position, and strand, only one read is used. If the transcription factor binds to a small number of regions in the genome, such data reduction might be too stringent, and we can increase the sensitivity by allowing up to __N__ different @@ -320,7 +319,7 @@ of reads, per position, which will be used in the analysis [@zhang_2008]. An important consideration to take into account is the genome which was used in the experiment. 
Cell lines, cancer samples, and personal genomes usually contain -structural genomic alteration which are not present in the reference genome +structural genomic alterations which are not present in the reference genome (duplications, insertions, and deletions). Such regions can cause false negatives, and false positives in the ChIP-seq experiment. If a region was present multiple times in the experimental system, and only a single @@ -332,13 +331,13 @@ Such regions are called **blacklisted** regions and should be removed from the downstream analysis. The [UCSC browser database](http://genome.ucsc.edu) contains tables with such regions for the most commonly used model organism species. -The following chapter presumes that the user is already familiar with +This chapter presumes that the user is already familiar with the following technical and conceptual knowledge in computational data processing. From Chapters \@ref(processingReads) and \@ref(genomicIntervals), you should be familiar with \index{read filtering} \index{read mapping} the concept of multi-mapping reads, and the following file formats BED\index{BED file}, GTF\index{GTF file}, WIG\index{WIG file}, bigWig\index{bigWig file}, BAM\index{BAM file}. -You should also know what PCR\index{PCR} is, what +You should also be familiar with PCR\index{PCR}, what are PCR duplicates, positive and negative DNA strands, and technical and biological replicates. @@ -346,22 +345,21 @@ replicates. ## ChIP quality control While the goal of the read quality assessment is to check whether the sequencing -produced high enough number of high quality reads; -the goal of the ChIP quality control to ascertaining whether the chromatin immunoprecipitation +produced a high enough number of high-quality reads +the goal of ChIP quality control is to ascertain whether the chromatin immunoprecipitation enrichment was successful. This is a crucial step in the ChIP-seq analysis because it can help us -identify low quality ChIP samples, and give information about which experimental +identify low-quality ChIP samples, and give information about which experimental steps went wrong. -There are four steps in the ChIP quality control: +There are four steps in ChIP quality control: -1. Sample correlation clustering - clustering of the pair-wise correlations between +1. Sample correlation clustering: Clustering of the pair-wise correlations between genome-wide signal profiles. 2. Data visualization in a genomic browser. -3. Average fragment length determination - determining whether the ChIP enriched -for fragments of a certain length. +3. Average fragment length determination: Determining whether the ChIP was enriched for fragments of a certain length. 4. Visualization of GC bias. Here we will plot the ChIP enrichment versus the average GC content in the corresponding genomic bin. @@ -373,8 +371,8 @@ Here we will familiarize ourselves with the datasets that will be used in the chapter. Experimental data was downloaded from the public ENCODE [@ENCODE_Project_Consortium2012-wf] database of ChIP-seq experiments. -The experiments were performed on a Lymphoblastoid cell line GM12878, and mapped -to the GRCh38 (hg38) version of the Human genome, using the standard ENCODE +The experiments were performed on a lymphoblastoid cell line, GM12878, and mapped +to the GRCh38 (hg38) version of the human genome, using the standard ENCODE ChIP-seq pipeline. 
In this chapter, due to compute time considerations, we have taken a subset of the data which corresponds to the human chromosome 21 (chr21). The data sets are located in the `compGenomRData`\index{R Packages!\texttt{compGenomRData}} package. @@ -385,7 +383,7 @@ in the following way: data_path = system.file('extdata/chip-seq',package='compGenomRData') ``` -The available data sets can be listed using the `list.files()` function: +The available datasets can be listed using the `list.files()` function: ```{r load.data, echo=TRUE, eval=TRUE, include=TRUE} chip_files = list.files(data_path, full.names=TRUE) @@ -395,7 +393,7 @@ chip_files = list.files(data_path, full.names=TRUE) head(chip_files) ``` -The data set consist of the following ChIP experiments: +The dataset consists of the following ChIP experiments: 1. **Transcription factors**: CTCF\index{CTCF protein}, SMC3, ZNF143, PolII (RNA polymerase 2) @@ -406,19 +404,19 @@ The data set consist of the following ChIP experiments: ### Sample clustering -Clustering is an ordering procedure which groups samples by similarity - +Clustering is an ordering procedure which groups samples by similarity; the more similar samples are grouped closer to one another. -The details of clustering methodologies are described in \@ref(unsupervisedLearning). +The details of clustering methodologies are described in Chapter \@ref(unsupervisedLearning). Clustering of ChIP signal profiles is used for two purposes: The first one is to ascertain whether there is concordance between -biological replicates - biological replicates should show greater similarity -than ChIP of different proteins. The second function is to see whether our experiments conform to known prior knowledge. For example, we would expect to see proteins greater similarity between proteins +biological replicates; biological replicates should show greater similarity +than ChIP of different proteins. The second function is to see whether our experiments conform to known prior knowledge. For example, we would expect to see greater similarity between proteins which belong to the same protein complex. -To quantify the ChIP signal we will firstly construct 1 kilobase wide tilling +To quantify the ChIP signal we will firstly construct 1-kilobase-wide tilling windows over the genome, and subsequently count the number of reads in each window, for each experiment. We will then normalize the counts, to -account for different total number of reads in each experiment, and finally +account for a different total number of reads in each experiment, and finally calculate the correlation between all pairs of samples. Although this procedure represents a crude way of data quantification, it provides sufficient information to ascertain the data quality. @@ -440,7 +438,7 @@ hg_chrs = subset(hg_chrs, grepl('chr21$',chrom)) ``` -`tileGenome()` function from the `GenomicRanges` package constructs equally sized +The `tileGenome()` function from the `GenomicRanges` package constructs equally sized windows over the genome of interest. The function takes two arguments: @@ -448,7 +446,7 @@ The function takes two arguments: 2. Window size -Firstly, we convert the chromosome lengths _data.frame_ into a _named vector_ +Firstly, we convert the chromosome lengths _data.frame_ into a _named vector_. ```{r sample-clustering.seqlen_vector} @@ -478,7 +476,7 @@ tilling_window We will use the `summarizeOverlaps()` function from the `GenomicAlignments` package to count the number of reads in each genomic window. 
The function will do the counting automatically for all our experiments. -`summarizeOverlaps()` function returns a `SummarizedExperiment` object. +The `summarizeOverlaps()` function returns a `SummarizedExperiment` object. The object contains the counts, genomic ranges which were used for the quantification, and the sample descriptions. @@ -501,12 +499,12 @@ counts = assays(so)[[1]] ``` -Different ChIP experiments were sequenced to different depth - each experiment -contains different number of reads. To remove the effect of the experimental +Different ChIP experiments were sequenced to different depths; each experiment +contains a different number of reads. To remove the effect of the experimental depth on the quantification, the samples need to be normalized. -Standard normalization procedure, for ChIP data, is to divide the +The standard normalization procedure, for ChIP data, is to divide the counts in each tilling window by the total number of sequenced reads, and -multiply it with a constant factor (to avoid extremely small numbers). +multiply it by a constant factor (to avoid extremely small numbers). This normalization procedure is called the **cpm**\index{cpm} - counts per million. \[ @@ -525,7 +523,7 @@ cpm = t(t(counts)*(1000000/colSums(counts))) We remove all tiles which do not have overlapping reads. Tiles with 0 counts do not provide any additional discriminatory power, rather, they introduce artificial similarity between the samples (i.e. samples with -a only a handful of bound regions will have a lot of tiles with 0 counts, while +only a handful of bound regions will have a lot of tiles with $0$ counts, while they do not have to have any overlapping enriched tiles). ```{r sample-clustering.filter_cpm} @@ -533,7 +531,7 @@ they do not have to have any overlapping enriched tiles). cpm = cpm[rowSums(cpm) > 0,] ``` -We use the sub function to shorten the column names of the cpm matrix. +We use the `sub()` function to shorten the column names of the cpm matrix. ```{r sample-clustering.change_colnames} # change the formatting of the column names @@ -549,9 +547,9 @@ colnames(cpm) ``` -Finally we calculate the pairwise pearson correlation coefficient using the +Finally, we calculate the pairwise Pearson correlation coefficient using the `cor()` function. -The function takes as input an region X sample count matrix, and returns +The function takes as input a region-by-sample count matrix, and returns a sample X sample matrix, where each field contains the correlation coefficient between two samples. @@ -567,7 +565,7 @@ samples which have the highest pairwise correlation. The diagonal represents the correlation of each sample with itself. -```{r sample-clustering-complex-heatmap, fig.cap='Heatmap showing ChIP-seq sample similarity using Pearson correlation coefficient'} +```{r sample-clustering-complex-heatmap, fig.cap='Heatmap showing ChIP-seq sample similarity using the Pearson correlation coefficient.'} # load ComplexHeatmap library(ComplexHeatmap) @@ -586,7 +584,7 @@ Heatmap( ) ``` -In figure \@ref(fig:sample-clustering-complex-heatmap) we can see a +In Figure \@ref(fig:sample-clustering-complex-heatmap) we can see a perfect example of why quality control is important. **CTCF** is a zinc finger protein which co-localizes with the Cohesin complex. 
**SMC3** is a sub unit of the Cohesin complex, and we would therefore expect to @@ -594,13 +592,13 @@ see that the **SMC3** signal profile has high correlation with the **CTCF** sign This is true for the second biological replicate of **SMC3**, while the first replicate (SMC3_r1) clusters with the input samples. This indicates that the sample likely has low enrichment. -We can see that the ChIP and Input samples form separate cluster. This implies +We can see that the ChIP and Input samples form separate clusters. This implies that the ChIP samples have an enrichment of fragments. Additionally, we see that the biological replicates of other experiments cluster together. -### Visualization in the Genome Browser +### Visualization in the genome browser One of the first steps in any ChIP-seq analysis should be looking at the @@ -612,27 +610,23 @@ regions of interest, or by loading data into a genome browser (such as IGV\index{IGV Browser}, or UCSC genome browsers\index{UCSC Genome Browser}). Genome browsers are standalone applications which represent the genome -as a one dimensional (1D) coordinate systems. The browsers enable +as a one-dimensional (1D) coordinate system. The browsers enable simultaneous visualization and comparison of multiple types of annotations and experimental data. Genome browsers can visualize most of the commonly used genomic data formats: -BAM\index{BAM file}, BED\index{BED file}, wig\index{wig file} and bigWig\index{bigWig file}. -The easiest way to access our data would be to load the .bam files into the browser.This will show us the sequence, and position of every mapped read. If we want to view multiple samples in parallel, loading every mapped read can be restrictive - -it takes up a lot of computational resources, and the amount of information +BAM\index{BAM file}, BED\index{BED file}, wig\index{wig file}, and bigWig\index{bigWig file}. +The easiest way to access our data would be to load the .bam files into the browser. This will show us the sequence and position of every mapped read. If we want to view multiple samples in parallel, loading every mapped read can be restrictive. It takes up a lot of computational resources, and the amount of information makes the visual comparison hard to do. We would like to convert our data so that we get a compressed visualization, -which would show us the main properties of our samples, namely, the quality, and +which would show us the main properties of our samples, namely, the quality and the location of the enrichment. This is achieved by summarizing the read enrichment into a signal profile - the whole experiment is converted into a numeric vector - a coverage vector. -The vector contains the information of how many reads overlap each position +The vector contains information on how many reads overlap each position in the genome. -We will proceed as follows: -Firstly we will import a **.bam** file into **R**. Then we will calculate -the signal profile (construct the coverage vector), and finally, we export the -vector as a **.bigWig** file. +We will proceed as follows: Firstly, we will import a **.bam** file into **R**. Then we will calculate the signal profile (construct the coverage vector), and finally, we export the vector as a **.bigWig** file. First we select one of the ChIP samples. @@ -667,7 +661,7 @@ reads = granges(reads) Because DNA fragments are being sequenced from their ends (both the 3' and 5' end), the read enrichment does not correspond to the exact location of the bound protein. 
Rather, reads end to form clusters of enrichment upstream and downstream of the true binding location. -To correct for this we use a small hack - before we create the signal profiles, +To correct for this, we use a small hack. Before we create the signal profiles, we will extend the reads towards their __3'__ end. The reads are extended to form fragments of 200 base pairs. This is an empiric measure, which corresponds to the average fragment size of the Illumina sample preparation kit. @@ -678,10 +672,10 @@ it will not affect the visual properties of our samples. Read extension is done using the `resize()` function. The function takes two arguments: -1. width - resulting fragment width +1. `width`: resulting fragment width -2. fix - which position of the fragment should not be changed (if fix is set to start, -the reads will be extended towards the __3'__ end, if fix is set to end, they will +2. `fix`: which position of the fragment should not be changed (if `fix` is set to start, +the reads will be extended towards the __3'__ end. If `fix` is set to end, they will be extended towards the __5'__ end) @@ -709,7 +703,7 @@ head(cov, 5) ``` The name of the output file is created by changing the file suffix from **.bam** -to **.bigWig** +to **.bigWig**. ```{r genome-browser.rename} # change the file extension from .bam to .bigWig @@ -730,10 +724,10 @@ export.bw(cov, 'output_file') #### Vizualization of track data using Gviz -We can create Genome browser like visualizations using the `Gviz` package, -which was introduced in chapter \@ref(genomicIntervals). -`Gviz` is a tool which enables exhaustive customized visualization of -genomics experiments.The basic usage principle is to define tracks, where each track can represent +We can create genome browserlike visualizations using the `Gviz` package, +which was introduced in Chapter \@ref(genomicIntervals). +The `Gviz` is a tool which enables exhaustive customized visualization of +genomics experiments. The basic usage principle is to define tracks, where each track can represent genomic annotation, or a signal profile; subsequently we define the order of the tracks and plot them. Here we will define two tracks, a genome axis, which will show the position @@ -754,13 +748,12 @@ dtrack = DataTrack(gcov, name = "CTCF", type='l') track_list = list(axis,dtrack) ``` -Tracks are plotted with the `plotTracks()` function. -`sizes` argument needs to be the same size as the track_list, and defines the +Tracks are plotted with the `plotTracks()` function. The `sizes` argument needs to be the same size as the track_list, and defines the relative size of each track. Figure \@ref(fig:genome-browser-gviz-show) shows the output of the `plotTracks()` function. -```{r genome-browser-gviz-show, fig.cap='ChIP-seq signal visualized as a browser track using Gviz', fig.width=8, fig.height = 3} +```{r genome-browser-gviz-show, fig.cap='ChIP-seq signal visualized as a browser track using Gviz.', fig.width=8, fig.height = 3} # plot the list of browser tracks # sizes argument defines the relative sizes of tracks # background title defines the color for the track labels @@ -780,20 +773,20 @@ a certain length. Similarity between the plus and minus strands defined as the correlation of the signal profiles for the reads that map to the **+** and the **-** strands. -The distribution of reads is shown on \@ref(fig:Figure-BrowserScreenshot) +The distribution of reads is shown in Figure \@ref(fig:Figure-BrowserScreenshot). 
-```{r Figure-BrowserScreenshot, echo=FALSE, include=TRUE, fig.cap='Browser screenshot of aligned reads for one ChIP, and control sample. ChIP samples have an assymetric distribution of reads - reads mapping to the + strand are located on the left side of the peak, while the reads mapping to the - strand are found on the right side of the peak'} +```{r Figure-BrowserScreenshot, echo=FALSE, include=TRUE, fig.cap='Browser screenshot of aligned reads for one ChIP, and control sample. ChIP samples have an asymetric distribution of reads; reads mapping to the + strand are located on the left side of the peak, while the reads mapping to the - strand are found on the right side of the peak.'} knitr::include_graphics('./Figures/BrowserScreenshot.png') ``` Due to the sequencing properties, reads which correspond to -the __5'__ fragment ends will map to the opposite strand then the reads +the __5'__ fragment ends will map to the opposite strand from the reads coming from the __3'__ ends. Most often (depending on the sequencing protocol) the reads from the __5'__ fragment ends map to the **+** strand, while the reads from the __3'__ ends map to the **-** strand. -We calculate the cross-correlation, by shifting the signal on the **+** strand, +We calculate the cross-correlation by shifting the signal on the **+** strand, by a pre-defined amount (i.e. shift by 1 - 400 nucleotides), and calculating, for each shift, the correlation between the **+**, and the **-** strands. Subsequently we plot the correlation versus shift, and locate the maximum value. @@ -804,24 +797,24 @@ fragments of certain length (i.e. whether the ChIP was successful). Due to the size of genomic data, it might be computationally prohibitive to calculate the Pearson correlation between whole genome (or even whole chromosome) signal profiles. -To get around this problem, we will resort to a trick - we will disregard the dynamic +To get around this problem, we will resort to a trick; we will disregard the dynamic range of the signal profiles, and only keep the information of which genomic bases contained the ends of the fragments. -This is done by calculating the coverage vector of read starting position (separately -for each strand), and converting the coverage vector into a boolean vector. -The boolean vector contains the information of which genomic positions +This is done by calculating the coverage vector of the read starting position (separately +for each strand), and converting the coverage vector into a Boolean vector. +The Boolean vector contains the information of which genomic positions contained the DNA fragment ends. -Similarity between two boolean vectors can be promptly computed using the Jaccard index. -Jaccard index is defined as an intersection between two boolean vectors, +Similarity between two Boolean vectors can be promptly computed using the Jaccard index. +The Jaccard index is defined as an intersection between two Boolean vectors, divided by their union as shown in Figure \@ref(fig:FigureJaccardSimilarity). ```{r, FigureJaccardSimilarity, echo=FALSE,fig.align = 'center', fig.cap="Jaccard similarity is defined as the ratio of the intersection and union of two sets.",out.width="30%"} knitr::include_graphics('./Figures/Jaccard.png') ``` -Firstly we load the reads for one of the CTCF ChIP experiments. -Then we create signal profiles, separately for reads on the **+**, and **-** +Firstly, we load the reads for one of the CTCF ChIP experiments. 
+Then we create signal profiles, separately for reads on the **+** and **-** strands. Unlike before, we do not extend the reads to the average expected fragment length (200 base pairs); we keep only the starting position of each read. @@ -837,8 +830,8 @@ reads = resize(reads, width=1, fix='start') reads = keepSeqlevels(reads, 'chr21', pruning.mode='coarse') ``` -Now we can calculate the coverage vector of read starting position. -The coverage vector is then automatically converted into a boolean vector by +Now we can calculate the coverage vector of the read starting position. +The coverage vector is then automatically converted into a Boolean vector by asking which genomic positions have $coverage > 0$. ```{r correlation.coverage} @@ -854,7 +847,7 @@ cov = lapply(reads, function(x){ cov = lapply(cov, as.vector) ``` -We will no shift the coverage vector from the plus strand by 1 - 400 base pairs, and for each pair shift we will calculate the Jaccard index between the vectors +We will now shift the coverage vector from the plus strand by $1$ to $400$ base pairs, and for each pair shift we will calculate the Jaccard index between the vectors on the plus and minus strand. ```{r correlation.jaccard, cache=TRUE} @@ -877,9 +870,9 @@ cc = shiftApply( cc = data.frame(fragment_size = wsize, cross_correlation = cc) ``` -We can finally plot the shift in basepairs versus the correlation coefficient: +We can finally plot the shift in base pairs versus the correlation coefficient: -```{r correlation-plot, fig.cap='The figure shows the correlation coefficient between the ChIP-seq signal on + and - strands. The peak of the distribution designates the fragment size'} +```{r correlation-plot, fig.cap='The figure shows the correlation coefficient between the ChIP-seq signal on + and $-$ strands. The peak of the distribution designates the fragment size'} library(ggplot2) ggplot(data = cc, aes(fragment_size, cross_correlation)) + geom_point() + @@ -908,7 +901,7 @@ The PCR amplification procedure can cause a significant bias in the ChIP experiments. The bias can be influenced by the DNA fragment size distribution, sequence composition, hexamer distribution of PCR primers, and the number of cycles used for the amplification. -One way how to determine whether some of the samples have significantly +One way to determine whether some of the samples have significantly different sequence composition is to look at whether regions with differing GC composition were equally enriched in all experiments. @@ -938,8 +931,8 @@ tilling_window = unlist(tileGenome( We will extract the sequence information from the `BSgenome.Hsapiens.UCSC.hg38` package. `BSgenome` are generic Bioconductor containers for genomic sequences. Sequences are extracted from the `BSgenome` container using the `getSeq()` function. -`getSeq()` function takes as input the genome object, and the ranges with the -regions of interest - in our case, the tilling windows. +The `getSeq()` function takes as input the genome object, and the ranges with the +regions of interest; in our case, the tilling windows. The function returns a `DNAString` object. @@ -957,9 +950,9 @@ To calculate the GC content, we will use the `oligonucleotideFrequency()` functi calculate the **dinucleotide** frequency. Each row in the resulting table will contain the number of all possible dinucleotides observed in each tilling window. 
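+To make the shape of that output concrete before running the calculation on the real tilling windows, here is a small, self-contained example on two toy sequences (constructed purely for illustration; they are not part of the chapter's dataset):
+
+```{r dinucleotide-frequency-toy}
+# count overlapping dinucleotides in two short toy sequences
+library(Biostrings)
+
+toy_seq = DNAStringSet(c('ACGCGT', 'GGGCCC'))
+oligonucleotideFrequency(toy_seq, width = 2)
+```
+
+Each row corresponds to one input sequence and each of the $4^2 = 16$ columns to one dinucleotide; in the calculation below, the `GC` column is the one carried forward as a measure of the GC content of each window.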
-Because we have tilling windows of same length we do not +Because we have tilling windows of the same length, we do not necessarily need to normalize the counts by the window length. -If all of the windows do not have the same length (i.e. when at the ChIP-seq peaks), then the normalization is a prerequisite. +If all of the windows have different lengths (e.g. when counting over ChIP-seq peaks), then normalization is a prerequisite. ```{r gc.oligo} @@ -1000,7 +993,7 @@ gc = cbind(data.frame(cpm_log), GC = nuc['GC']) and plot the results. -```{r gc-plot, fig.cap='GC content abundance in a ChIP-seq experiment'} +```{r gc-plot, fig.cap='GC content abundance in a ChIP-seq experiment.'} ggplot( data = gc, aes( @@ -1021,13 +1014,13 @@ ggplot( Figure \@ref(fig:gc-plot) visualizes the CPM versus GC content, and gives us two important pieces of information. Firstly, it shows whether there was a specific amplification of regions -with extremely high or extremely low GC content. This would be strong indication -that the either the PCR or the size selection procedure were not successfully +with extremely high or extremely low GC content. This would be a strong indication +that either the PCR or the size selection procedure was not successfully executed. The second piece of information comes by comparison of plots corresponding to multiple experiments. If different ChIP-samples have highly diverging enrichment of different ChIP regions, then -some of the samples were affected by unknown batch affects. Such effects +some of the samples were affected by unknown batch effects. Such effects need to be taken into account in downstream analysis. Firstly, we will reorder the columns of the `data.frame` using the `pivot_longer()` @@ -1053,9 +1046,9 @@ gcd = subset(gcd, grepl('CTCF', experiment)) gcd$experiment = sub('chr21.','',gcd$experiment) ``` -We can now visualize the relationship using a scatterplot. +We can now visualize the relationship using a scatter plot. Figure \@ref(fig:gc-tidy-plot) compares the GC content dependency on the CPM between -the first and the second CTCF replicate. In this case, the replicates looks similar. +the first and the second CTCF replicate. In this case, the replicates look similar. ```{r gc-tidy-plot, warning = FALSE, fig.cap='Comparison of GC content and signal abundance between two CTCF biological replicates'} ggplot(data = gcd, aes(GC, log10(cpm+1))) + @@ -1097,16 +1090,11 @@ To solve the problem of multiple assignments, we need to construct a set of anno A heuristic solution is to organize the genomic annotation into a hierarchy which will imply prioritization. We can then look, for each read, which functional categories it overlaps, and -if it within multiple categories, we assign the read to the topmost category. -As an example, let's say that we have 4 genomic categories: - -1) TSS (transcription start sites) -\index{transcription start site (TSS)} -2) exon 3) intron and 4) intergenic with the following hierarchy: **TSS -> exon -> intron -> intergenic**. This means that If a read overlaps a TSS \index{transcription start site (TSS)} and an intron, -it will be annotates as TSS. +if it is within multiple categories, we assign the read to the topmost category. +As an example, let's say that we have 4 genomic categories: 1) TSS (transcription start sites)\index{transcription start site (TSS)}, 2) exon, 3) intron, and 4) intergenic, with the following hierarchy: **TSS -> exon -> intron -> intergenic**.
This means that if a read overlaps a TSS\index{transcription start site (TSS)} and an intron, it will be annotated as TSS. This approach is shown in Figure \@ref(fig:Figure-Hierarchical-Annotation). -```{r Figure-Hierarchical-Annotation, echo=FALSE, include=TRUE, fig.cap='Principle of hierarchical annotation. The region of interest is annotated as the topmost ranked category that it overlaps. In this case, our region overlaps a TSS, an exon, and a intergenic regions. Because the TSS has the topmost rank, it is annotated as a TSS.'} +```{r Figure-Hierarchical-Annotation, echo=FALSE, include=TRUE, fig.cap='Principle of hierarchical annotation. The region of interest is annotated as the topmost ranked category that it overlaps. In this case, our region overlaps a TSS, an exon, and an intergenic region. Because the TSS has the topmost rank, it is annotated as a TSS.'} knitr::include_graphics('./Figures/Hierarchical_Annotation.png') ``` @@ -1120,12 +1108,9 @@ There are multiple sources of genomic annotation. **UCSC**\index{UCSC Genome Bro **Genbank**, and **Ensembl**\index{Ensembl Genome Browser} databases represent stable resources, from which the annotation can be easily obtained. -`AnnotationHub`\index{R Packages!\texttt{AnnotationHub}} is a Bioconductor -based online resource which contains a large amount of experiments from various -sources. We will use the AnnotationHub to download the location of -genes corresponding to the **hg38** genome. - -The hub is accessed in the following way: +`AnnotationHub`\index{R Packages!\texttt{AnnotationHub}} is a Bioconductor-based online resource which contains a large number of experiments from various +sources. We will use the `AnnotationHub` to download the location of +genes corresponding to the **hg38** genome. The hub is accessed in the following way: ```{r read-annot.hub} # load the AnnotationHub package @@ -1135,8 +1120,7 @@ library(AnnotationHub) hub = AnnotationHub() ``` -The `hub` variable contains the programming interface towards the online database. -We can use the `query()` function to find out the ID of the +The `hub` variable contains the programming interface towards the online database. We can use the `query()` function to find out the ID of the "ENSEMBL"\index{Ensembl Genome Browser} gene annotation. ```{r read-annot.query} @@ -1150,7 +1134,7 @@ AnnotationHub::query( We are interested in the version **GRCh38.92**, which is available under **AH61126**. To download the data from the hub, we use the `[[` operator on the hub API. -We will download the annotation in the **GTF**\index{GTF file} format, into a`GRanges` object. +We will download the annotation in the **GTF**\index{GTF file} format, into a `GRanges` object. ```{r read-annot.fetch} # retrieve the human gene annotation @@ -1188,14 +1172,7 @@ gtf = gtf[seqnames(gtf) == 'chr21'] #### Constructing genomic annotation Once we have downloaded the annotation we can define the functional hierarchy\index{annotation hierarchy}. -We will use the previously mentioned ordering: - -**TSS -> exon -> intron -> intergenic** - -With **TSS** -\index{transcription start site (TSS)} having the highest priority and the -intergenic regions having the lowest priority. - +We will use the previously mentioned ordering: **TSS -> exon -> intron -> intergenic**, with **TSS**\index{transcription start site (TSS)} having the highest priority and the intergenic regions having the lowest priority. 
```{r read-annot.hierarchy} # construct a GRangesList with human annotation @@ -1220,21 +1197,14 @@ reads in each genomic category. We will then loop over all of the **.bam**\index{BAM file} files to annotate each experiment. -`annotateReads()` function works in the following way: - -1. Load the **.bam** file - -2. Find overlaps between the reads and the annotation categories - -3. Arrange the annotated reads based on the hierarchy\index{annotation hierarchy}, and remove duplicated assignments +The `annotateReads()` function works in the following way: +1. Load the **.bam** file. +2. Find overlaps between the reads and the annotation categories. +3. Arrange the annotated reads based on the hierarchy\index{annotation hierarchy}, and remove duplicated assignments. 4. Count the number of reads in each category. - -The crucial step to understand here is using the `arrange()` and `filter()` functions -to keep only one annotated category per read. - - +The crucial step to understand here is using the `arrange()` and `filter()` functions to keep only one annotated category per read. ```{r read-annot.annotateReads, warnings=FALSE} annotateReads = function(bam_file, annotation_list){ @@ -1318,7 +1288,7 @@ annot_reads_df$experiment = experiment_name And plot the results. -```{r read-annotation-plot, eval=TRUE, warning = FALSE, fig.cap='Read distribution in genomice functional annotation categories'} +```{r read-annotation-plot, eval=TRUE, warning = FALSE, fig.cap='Read distribution in genomice functional annotation categories.'} ggplot(data = annot_reads_df, aes( x = experiment, @@ -1348,7 +1318,7 @@ transcription factor show increased read abundance around the TSS. After we are convinced that the data is of sufficient quality, we can proceed with the downstream analysis. -One of he first steps in the ChIP-seq analysis is peak calling. +One of the first steps in the ChIP-seq analysis is peak calling. Peak calling is a statistical procedure, which uses coverage properties of ChIP and Input samples to find regions which are enriched due to protein binding. @@ -1372,21 +1342,20 @@ separately, and the peaks need to be combined in post-processing. Based on the binding properties of ChIP-ped proteins, ChIP-seq signal profiles can be divided into three classes: -1. **Sharp** (point signal) - A signal profile which is localized to specific +1. **Sharp** (point signal): A signal profile which is localized to specific short genomic regions (up to couple of hundred base pairs) It is usually obtained from transcription factors, or highly localized posttranslational histone modifications (H3K4me3, which is found on gene promoters). -2. **Broad** (wide signal) - The signal covers broad genomic domains spanning up to several kilobases. +2. **Broad** (wide signal): The signal covers broad genomic domains spanning up to several kilobases. Usually produced by disperse histone modifications\index{histone modification} (H3K36me3, located on gene bodies, or H3K23me3, which is deposited by the Polycomb complex in large genomic regions). -3. **Mixed** - The signal consists of a mixture of sharp and broad regions. +3. **Mixed**: The signal consists of a mixture of sharp and broad regions. It is produced by proteins which have dynamic behavior. Most often these are ChIP experiments of RNA Polymerase 2. 
-Different types of ChIP experiment usually require specialized analysis tools - -some peak callers are developed to specifically detect narrow peaks[@zhang_2008; @xu_2010; @shao_2012], while others +Different types of ChIP experiments usually require specialized analysis tools. Some peak callers are developed to specifically detect narrow peaks [@zhang_2008; @xu_2010; @shao_2012], while others detect enrichment in diffuse broad regions [@zang_2009; @micsinai_2012; @beck_2012; @song_2011; @xing_2012], or mixed (Polymerase 2) signals [@han_2012]. Recent developments in peak calling methods (such as `normR`) can however accommodate @@ -1396,7 +1365,7 @@ results, and the peculiarities of the experimental design and execution [@laajal If you are not certain what kind of signal profile to expect from a ChIP-seq experiment, the best solution is to visualize the data. We will now use the data from **H3K4me3** (Sharp), **H3K36me3** (Broad), and **POL2** (Mixed) -ChIP experiments to show the differences in the signal profiles. We will use the the bigWig files to visualize the signal profiles around a +ChIP experiments to show the differences in the signal profiles. We will use the bigWig files to visualize the signal profiles around a highly expressed human gene from chromosome 21. This will give us an indication of how the profiles for different types of ChIP experiments differ. First we select the files of interest: @@ -1444,7 +1413,7 @@ ucsc_seqlevels = paste0('chr', ensembl_seqlevels) # replace ensembl with ucsc chromosome names seqlevels(gtf, pruning.mode='coarse') = ucsc_seqlevels ``` -To enable Gviz to work with genomic annotation we will convert the `GRanges` +To enable `Gviz` to work with genomic annotation we will convert the `GRanges` object into a transcript database using the following function: ```{r chip-type.txdb, warning=FALSE} @@ -1455,7 +1424,7 @@ library(GenomicFeatures) txdb = makeTxDbFromGRanges(gtf) ``` -And convert the transcript database into a Gviz track. +And convert the transcript database into a `Gviz` track. ```{r chip-type.gene} # define the gene track object @@ -1464,7 +1433,7 @@ gene_track = GeneRegionTrack(txdb, chr='chr21', genome='hg38') Once we have downloaded the annotation, and imported the signal profiles into **R** we are ready to visualize the data. -We will again use the `Gviz` library. We firstly define the coordinate system - the ideogram track which will show +We will again use the `Gviz` library. We firstly define the coordinate system. The ideogram track which will show the position of our current viewpoint on the chromosome, and a genome axis track, which will show the exact coordinates. ```{r chip-type.ideo} @@ -1516,7 +1485,7 @@ data_tracks = lapply(names(chip_profiles), function(exp_name){ We are finally ready to create the genome screenshot. We will focus on an extended region around the URB1 gene. -```{r chip-type-plot-gviz, fig.cap='ChIP-seq signal around the URB1 gene'} +```{r chip-type-plot-gviz, fig.cap='ChIP-seq signal around the URB1 gene.'} # select the start coordinate for the URB1 gene start = min(start(subset(gtf, gene_name == 'URB1'))) @@ -1543,17 +1512,14 @@ plotTracks( ) ``` -Figure \@ref(fig:chip-type-plot-gviz) shows the signal profile around the URB1 gene. -H3K4me3 signal profile contains a strong narrow peak on the transcription start site. -H3K36me3 shows strong enrichment in the gene body, wile the POL2 ChIP shows -a mixed profile, with a strong peak at the TSS and an enrichment over the gene body. 
+Figure \@ref(fig:chip-type-plot-gviz) shows the signal profile around the URB1 gene. H3K4me3 signal profile contains a strong narrow peak on the transcription start site. H3K36me3 shows strong enrichment in the gene body, while the POL2 ChIP shows a mixed profile, with a strong peak at the TSS and an enrichment over the gene body. -### Peak calling - sharp peaks +### Peak calling: Sharp peaks \index{Peak calling} -We will now use `normR` [@helmuth_2016] package for peak calling in sharp and broad peak experiments. +We will now use the `normR` [@helmuth_2016] package for peak calling in sharp and broad peak experiments. Select the input files. Since `normR` does not support the usage of biological replicates, we will showcase the peak calling on one of the CTCF samples. @@ -1566,10 +1532,10 @@ chip_file = file.path(data_path, 'GM12878_hg38_CTCF_r1.chr21.bam') control_file = file.path(data_path, 'GM12878_hg38_Input_r5.chr21.bam') ``` -To a feeling about the dynamic range of enrichment we will create a scatter plot +To understand the dynamic range of enrichment, we will create a scatter plot showing the strength of signal in the CTCF and Input. -Let us first count the reads in 1kb windows, and normalize them to counts per +Let us first count the reads in 1-kb windows, and normalize them to counts per million sequenced reads. ```{r peak-calling.sharp.count, warning=FALSE} @@ -1599,7 +1565,7 @@ cpm = t(t(counts)*(1000000/colSums(counts))) We can now plot the ChIP versus Input signal: -```{r peak-calling-sharp-plot, message = FALSE, erroe=FALSE, fig.cap='Comparison of CPM values between ChIP and Input experiments. Good ChIP experiments should always show enrichment'} +```{r peak-calling-sharp-plot, message = FALSE, erroe=FALSE, fig.cap='Comparison of CPM values between ChIP and Input experiments. Good ChIP experiments should always show enrichment.'} library(ggplot2) # convert the matrix into a data.frame for ggplot cpm = data.frame(cpm) @@ -1624,18 +1590,12 @@ ggplot( ggtitle('ChIP versus Input') ``` -Regions above the diagonal, in figure \@ref(fig:peak-calling-sharp-plot) show +Regions above the diagonal, in Figure \@ref(fig:peak-calling-sharp-plot), show higher enrichment in the ChIP samples, while the regions below the diagonal show higher enrichment in the Input samples. -Let us now perform for peak calling. -`normR` usage is deceivingly simple - we need to provide the location -ChIP and Control read files, and the genome version to the `enrichR()` function. -The function will automatically create tilling windows (250bp by default), -count the number of reads in each window, and fit a mixture of -binomial distributions. - +Let us now perform for peak calling. `normR` usage is deceivingly simple; we need to provide the location ChIP and Control read files, and the genome version to the `enrichR()` function. The function will automatically create tilling windows (250bp by default), count the number of reads in each window, and fit a mixture of binomial distributions. ```{r, peak-calling.sharp.peak-calling, message=FALSE, warning=FALSE} library(normr) @@ -1655,22 +1615,21 @@ ctcf_fit = enrichR( verbose = FALSE) ``` -With the summary function we can take look at the results: +With the summary function we can take a look at the results: ```{r peak-calling.summary} summary(ctcf_fit) ``` -The summary function shows that most of the regions of the chromosome 21 correspond -to the background - $97.72%$. -In total we have $1029=(627+120+195+87)$ significantly enriched regions. 
+The summary function shows that most of the regions of chromosome 21 correspond +to the background: $97.72%$. In total we have $1029=(627+120+195+87)$ significantly enriched regions. We will now extract the regions into a `GRanges` object. -`getRanges()` function extracts the regions from the model. Using the +The `getRanges()` function extracts the regions from the model. Using the `getQvalue()`, and `getEnrichment()` function we assign to our regions the statistical significance and calculated enrichment. In order to identify only highly significant regions, -we keep only ranges where the false discovery rate (q value) is below 0.01. +we keep only ranges where the false discovery rate (q value) is below $0.01$. ```{r peak-calling.sharp.peak-calling.ranges, cache=TRUE} # extracts the ranges @@ -1696,15 +1655,13 @@ ctcf_peaks = ctcf_peaks[order(ctcf_peaks$qvalue)] ```{r, peak-calling.sharp.peak-calling.show, include=TRUE, echo=FALSE, eval=TRUE, R.options=list(digits=3)} ctcf_peaks ``` -After stringent q value filtering we are left with 724 peaks. - -For the ease of downstream analysis, we will limit the sequence levels to +After stringent q value filtering we are left with $724$ peaks. For the ease of downstream analysis, we will limit the sequence levels to chromosome 21. ```{r peak-calling.sharp.peak-calling.seqlevels} seqlevels(ctcf_peaks, pruning.mode='coarse') = 'chr21' ``` -Let's export the peaks into a .txt file which we can use downstream in the analysis. +Let's export the peaks into a .txt file which we can use the downstream in the analysis. ```{r peak-calling.write_table, include=T, eval=T, echo=T} # write the peaks loacations into a txt table @@ -1713,7 +1670,7 @@ write.table(ctcf_peaks, file.path(data_path, 'CTCF_peaks.txt'), ``` -We can now repeat the CTCF versus Input plot, and label significantly marked peaks.Using the count overlaps we mark which of our 1kb regions contained significant peaks. +We can now repeat the CTCF versus Input plot, and label significantly marked peaks. Using the count overlaps we mark which of our 1-kb regions contained significant peaks. ```{r peak-calling.sharp.peak-calling.countOvlaps} # find enriched tilling windows @@ -1747,22 +1704,20 @@ ggplot( scale_color_manual(values=c('gray','red')) ``` -Figure \@ref(fig:peak-calling-sharp-peak-calling-plot) shows see that `normR` +Figure \@ref(fig:peak-calling-sharp-peak-calling-plot) shows that `normR` identified all of the regions above the diagonal as statistically significant. It has, however, labeled a significant number of regions below the diagonal. -Because the sophisticated statistical model, +Because of the sophisticated statistical model, `normR` has greater sensitivity, and these peaks might really be enriched regions, it is worth investigating the nature of these regions. This is left as an exercise to the reader. -We can now create a genome browser screenshot around a peak regions. +We can now create a genome browser screenshot around a peak region. This will show us what kind of signal properties have contributed to the peak calling. -We would expect to see a strong, bell shaped, enrichment in the ChIP sample, and a +We would expect to see a strong, bell-shaped, enrichment in the ChIP sample, and uniform noise in the Input sample. -Let us now visualize the signal around the most enriched peak: - - The following function takes as input a **.bam** file, and loads the bam into R. +Let us now visualize the signal around the most enriched peak. 
The following function takes as input a **.bam** file, and loads the bam into R. It extends the reads to a size of 200 bp, and creates the coverage vector. ```{r peak-calling.sharp.peak-calling.coverage,function} @@ -1806,8 +1761,8 @@ ctcf_cov = calculateCoverage(chip_file) cont_cov = calculateCoverage(control_file) ``` -Using Gviz, we will construct the layered tracks. -First the genome coordinates +Using `Gviz`, we will construct the layered tracks. +First, we layout the genome coordinates: ```{r peak-calling.sharp.peak-calling.gviz.axis} # load Gviz and get the chromosome coordinates @@ -1818,14 +1773,14 @@ axis = GenomeAxisTrack( ) ``` -Then the peak locations +Then, the peak locations: ```{r peak-calling.sharp.peak-calling.gviz.peaks} # peaks track peaks_track = AnnotationTrack(ctcf_peaks, name = "CTCF Peaks") ``` -And finally, the signal files +And finally, the signal files: ```{r peak-calling.sharp.peak-calling.gviz.signal} chip_track = DataTrack( @@ -1844,7 +1799,7 @@ cont_track = DataTrack( ``` -```{r peak-calling-signal-profile-plot, fig.cap='ChIP and Input signal profile in around the peak centers.'} +```{r peak-calling-signal-profile-plot, fig.cap='ChIP and Input signal profile around the peak centers.'} plotTracks( trackList = list(chr_track, axis, peaks_track, chip_track, cont_track), sizes = c(.2,.5,.5,1,1), @@ -1854,19 +1809,17 @@ plotTracks( ) ``` -Figure \@ref(fig:peak-calling-signal-profile-plot) ChIP sample looks as expected. +In Figure \@ref(fig:peak-calling-signal-profile-plot), the ChIP sample looks as expected. Although the Input sample shows an enrichment, it is important to compare the scales on both samples. The normalized ChIP signal goes up -to 2500, while the maximum value in the input sample is only 60. +to $2500$, while the maximum value in the input sample is only $60$. -### Peak calling - Broad regions +### Peak calling: Broad regions \index{Peak calling} We will now use `normR` to call peaks for the H3K36me3 histone modification, -which is associated with gene bodies of expressed genes. - -We define the ChIP and Input files: +which is associated with gene bodies of expressed genes. We define the ChIP and Input files: ```{r peak-calling.broad.files} # fetch the ChIP-file for H3K36me3 @@ -1876,9 +1829,9 @@ chip_file = file.path(data_path, 'GM12878_hg38_H3K36me3.chr21.bam') control_file = file.path(data_path, 'GM12878_hg38_Input_r5.chr21.bam') ``` -Because H3K36 regions span broad domains it is necessary to increase the +Because H3K36 regions span broad domains, it is necessary to increase the tilling window size which will be used for counting. -Using the `countConfiguration()` function we will set the tilling window size +Using the `countConfiguration()` function, we will set the tilling window size to 5000 base pairs. ```{r peak-calling.broad.config} @@ -1911,9 +1864,7 @@ h3k36_fit = enrichR( summary(h3k36_fit) ``` -The summary function shows that we get `r 1005+314+381+237` enriched regions. - -We will extract enriched regions, and plot them in the same way we did for the +The summary function shows that we get `r 1005+314+381+237` enriched regions. We will extract enriched regions, and plot them in the same way we did for the CTCF. ```{r peak-calling.broad.ranges} @@ -1963,17 +1914,17 @@ plotTracks( ) ``` -The figure \@ref(fig:peak-calling-broad-gviz) shows a highly enriched H3K36me3 +Figure \@ref(fig:peak-calling-broad-gviz) shows a highly enriched H3K36me3 region covering the gene body, as expected. 
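+Before moving on to formal quality control, it can be instructive to contrast the genomic extent of the sharp and the broad calls. The chunk below is a minimal sketch rather than part of the original workflow: it assumes that the sharp CTCF calls are still available as `ctcf_peaks` and that the enriched H3K36me3 regions extracted above are stored in a `GRanges` object, named `h3k36_peaks` here purely for illustration.
+
+```{r peak-width-comparison-sketch, eval=FALSE}
+# sketch only: h3k36_peaks is an assumed name for the broad calls extracted above
+# width distributions of the sharp and the broad calls
+summary(width(ctcf_peaks))
+summary(width(h3k36_peaks))
+
+# total number of bases covered by each set of calls
+sum(width(reduce(ctcf_peaks)))
+sum(width(reduce(h3k36_peaks)))
+```
+
+Because `normR` reports enrichment per counting window, the widths largely mirror the 250-bp and 5000-bp windows chosen above; the total number of covered bases makes the difference between point-source and domain-like signals tangible.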
### Peak quality control -Peak calling is not a mathematically defined procedure - it is impossible +Peak calling is not a mathematically defined procedure; it is impossible to unambiguously define what a "peak" is. Therefore all of the peak calling procedures use heuristics, and statistical models which have been -show to work well in specific use-cases. -After peak calling it is always necessary to check +shown to work well in specific use cases. +After peak calling, it is always necessary to check whether the defined peaks really are located in enriched regions, and in addition, use prior knowledge to ascertain whether the peaks correspond to known biology. @@ -1981,9 +1932,9 @@ Peak calling can falsely identify enriched regions if the input sample is not sequenced to the proper depth. Because the input samples correspond to __de facto__ whole genome sequencing, and the ChIP procedure enriches for a subset of the genome, it can often happen that many regions -in the the genome are not sufficiently covered by the Input sample. +in the genome are not sufficiently covered by the Input sample. Such variability in the signal profile of Input samples can cause a region -to be defined as a peak - enriched in the ChIP sample, while in reality it is depleted in the +to be defined as a peak, enriched in the ChIP sample, while in reality it is depleted in the Input, due to under-sampling. For example, the figure in the previous chapter, showing an enriched region H3K36me3 over a gene body, shows a large depletion in the Input sample over the same region. Such depletion should be a concern and merit @@ -1991,7 +1942,7 @@ further investigation. The quality of enrichment can be checked by calculating the percentage of reads within peaks for both ChIP and Input samples. ChIP samples should have a high percentage of reads in peaks, -while for the input samples the percentage of reads should correspond to the +while for the input samples, the percentage of reads should correspond to the percentage of genome covered by peaks. For transcription factor ChIP experiments, an important control is to determine whether @@ -2003,7 +1954,7 @@ of binding DNA sequences. Such sequence models can be downloaded from public databases and compared to see whether there is a positional enrichment around our peaks. -We will now calculate the percentage of reads within peaks for H3K36me3 experiment. +We will now calculate the percentage of reads within peaks for the H3K36me3 experiment. Subsequently, we will download the known CTCF sequence model, and compare it to our peak regions. @@ -2060,7 +2011,7 @@ h3k36_counts_df We can now plot the percentage of reads in peaks: -```{r peak-quality-counts-plot, fig.cap='Percentage of ChIP read in called peaks. Higher percentage indicates higher ChIP quality.'} +```{r peak-quality-counts-plot, fig.cap='Percentage of ChIP reads in called peaks. Higher percentage indicates higher ChIP quality.'} ggplot( data = h3k36_counts_df, aes( @@ -2080,9 +2031,9 @@ ggplot( scale_fill_manual(values=c('gray','red')) ``` -The figure \@ref(fig:peak-quality-counts-plot) shows that the ChIP sample is +Figure \@ref(fig:peak-quality-counts-plot) shows that the ChIP sample is clearly enriched in the peak regions. -The percentage of read in peaks will depend on the quality of the antibody (strength of +The percentage of reads in peaks will depend on the quality of the antibody (strength of enrichment), and the size of peaks which are bound by the protein of interest. 
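+To relate the two percentages, it also helps to know what fraction of the chromosome is covered by the peaks themselves. The following chunk is a small sketch under the assumption that the `ctcf_peaks` object from the sharp peak calling is still in the workspace and that the `hg_chrs` table created for the counting windows holds the chromosome length in a `size` column; the same calculation applies to the H3K36me3 regions.
+
+```{r peak-genome-fraction-sketch, eval=FALSE}
+# sketch only: assumes hg_chrs has 'chrom' and 'size' columns, as returned by getChromInfoFromUCSC()
+# percentage of chr21 covered by the CTCF peaks
+peak_bases = sum(width(reduce(ctcf_peaks)))
+chr21_size = hg_chrs$size[hg_chrs$chrom == 'chr21']
+round(100 * peak_bases / chr21_size, 2)
+```
+
+For an Input sample, the percentage of reads falling into the peaks should be close to this genomic percentage, while a successful ChIP sample should lie well above it.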
If the total size of peaks is small, relative to the genome size, we can expect that the percentage of reads in peaks will be small. @@ -2091,20 +2042,18 @@ the percentage of reads in peaks will be small. #### DNA motifs on peaks \index{DNA motif} -Well studied transcription factor have publicly available transcription +Well-studied transcription factors have publicly available transcription factor binding motifs. If such a model is available for our transcription factor of interest, we can use it to check the quality of our ChIP data. Two common measures are used for this purpose: 1. Percentage of peaks containing the motif of interest. - -2. Positional distribution of the motif - the distribution of motif locations should be -centered on the peak centers. +2. Positional distribution of the motif - the distribution of motif locations should be centered on the peak centers. ##### Representing motifs as matrices -In order to calculate the percentage of CTCF peaks which contain a know CTCF +In order to calculate the percentage of CTCF peaks which contain a known CTCF motif. We need to find the CTCF motif and have the computational tools to search for that motif. The DNA binding motifs can be extracted from the `MotifDB` Bioconductor database\index{R Packages!\texttt{MotifDB}}. The `MotifDB` is an agglomeration of multiple motif databases. @@ -2128,14 +2077,14 @@ We will extract the CTCF from the `MotifDB` [@khan_2018] database. ctcf_motif = motifs[[1]] ``` -The motifs are usually represented as matrices of 4-by-N dimensions. In the matrix, 4 rows corresponds to one nucleotide (A, C, G, T). +The motifs are usually represented as matrices of 4-by-N dimensions. In the matrix, each of 4 rows correspond to one nucleotide (A, C, G, T). The number of columns designates the width of the region bound by the transcription factor or the length of the motif that the protein recognizes. Each element of the matrix contains the probability of observing the corresponding nucleotide on this position. For example, for following the CTCF matrix in Table \@ref(tab:peakqualityshow), the probability of observing a thymine at the first position of the motif,$p_{i=1,k=4}$ , is 0.57 (1st column, 4th row). -Such a matrix, where each column is a probability distribution of over a sequence of nucleotides -is called a position frequency matrix - PFM\index{position frequency matrix (PFM)}. In some sources, this matrix is also called as "position probability matrix (PPM)". One way to construct such matrices is to get experimentally verified sequences that are bound by the protein of interest and then to use a motif finding algorithm. +Such a matrix, where each column is a probability distribution over a sequence of nucleotides, +is called a position frequency matrix (PFM)\index{position frequency matrix (PFM)}. In some sources, this matrix is also called "position probability matrix (PPM)". One way to construct such matrices is to get experimentally verified sequences that are bound by the protein of interest and then to use a motif-finding algorithm. ```{r peakqualityshow, echo=FALSE, R.options=list(digits=2)} the.table =knitr::kable(ctcf_motif,booktabs = TRUE, @@ -2145,10 +2094,10 @@ the.table ``` Such a matrix can be used to calculate the probability that the transcription -factor will bind to any given sequence. However, computationally, it is easier to work with summation rather than multiplication. 
In addition, the simple probabilistic model does not take the background probability of observing a certain base in a given position. We can correct for background base frequencies by the dividing the individual probability, $p_{i,k}$ in each cell of the matrix by the background base probability for a given base, $B_k$. We can then take the logarithm of that quantity to calculate a log-likelihood and bring everything to log-scale as follows $Score_{i,k}=log_2(p_{i,k}/B_k)$. We can now calculate a score for any given -sequence by summing up the the base position specific scores we obtain from this log scaled matrix. This matrix is formally called "position specific scoring matrix (PSSM) or position specific weight matrix (PWM). We can use this matrix to scan the genome in a sliding window manner and calculate a score for each window. Usually, a cutoff value is needed to call a motif hit. The higher the score you get from the PWM for a particular sequence the better it is. Traditional algorithms we will use in the following sections use 80% of the maximum rescaled score you can obtain from a PWM as the default cutoff for a hit. The rescaling is simple min-max rescaling where you rescale the score by subtracting the minimum score and dividing that by $max(PWMscore)-min(PWMscore)$. The motif scanning approach is illustrated in Figure \@ref(fig:FigurePWMScanning). In this example, ACACT is not considered a hit because its score only corresponds to only $15.6$ % of the rescaled maximum score. +factor will bind to any given sequence. However, computationally, it is easier to work with summation rather than multiplication. In addition, the simple probabilistic model does not take the background probability of observing a certain base in a given position. We can correct for background base frequencies by dividing the individual probability, $p_{i,k}$ in each cell of the matrix by the background base probability for a given base, $B_k$. We can then take the logarithm of that quantity to calculate a log-likelihood and bring everything to log-scale as follows $Score_{i,k}=log_2(p_{i,k}/B_k)$. We can now calculate a score for any given +sequence by summing up the base-position-specific scores we obtain from the log-scaled matrix. This matrix is formally called position-specific scoring matrix (PSSM) or position-specific weight matrix (PWM). We can use this matrix to scan the genome in a sliding window manner and calculate a score for each window. Usually, a cutoff value is needed to call a motif hit. The higher the score you get from the PWM for a particular sequence, the better it is. The traditional algorithms we will use in the following sections use 80% of the maximum rescaled score you can obtain from a PWM as the default cutoff for a hit. The rescaling is simple min-max rescaling where you rescale the score by subtracting the minimum score and dividing that by $max(PWMscore)-min(PWMscore)$. The motif scanning approach is illustrated in Figure \@ref(fig:FigurePWMScanning). In this example, ACACT is not considered a hit because its score only corresponds to only $15.6$ % of the rescaled maximum score. -(ref:FigurePWMScanning) PWM scanning principle. A genomic sequence is scanned by a PWM matrix. This matrix used to measure how likely is that the transcription factor will bind each nucelotide in each position. Here we are looking at how likely it is that our TF will bind to the sequence ACACT. The score for this sequence is -3.6. The maximal score obtainable by the PWM is 7.2 and minimun is -5.6. 
After min-max rescaling, -3.6 corresponds to 15% score and ACACT is not considered a hit +(ref:FigurePWMScanning) PWM scanning principle. A genomic sequence is scanned by a PWM matrix. This matrix is used to measure how likely it is that the transcription factor will bind each nucleotide in each position. Here we are looking at how likely it is that our TF will bind to the sequence ACACT. The score for this sequence is -3.6. The maximal score obtainable by the PWM is 7.2 and minimum is -5.6. After min-max rescaling, -3.6 corresponds to a 15% score and ACACT is not considered a hit. ```{r FigurePWMScanning, echo=FALSE, include=TRUE, fig.cap='(ref:FigurePWMScanning)'} knitr::include_graphics('./Figures/PWMScanning.png') @@ -2160,22 +2109,22 @@ knitr::include_graphics('./Figures/PWMScanning.png') Using the PFM, we can calculate the information content of each position in the matrix. The information content quantifies the contribution of each nucleotide to the -cumulative binding preference. This tells us how important is each nucleotide for the binding. It additionally allows us to visually represent the probability matrices as sequence logos. -The information content is quantified as relative entropy. It ranges from $0$ - no information, -to $2$ - maximal information. For a column in the PFM, the entropy is calculated as follows: +cumulative binding preference. This tells us how important each nucleotide is for the binding. It additionally allows us to visually represent the probability matrices as sequence logos. +The information content is quantified as relative entropy. It ranges from $0$, no information, +to $2$, maximal information. For a column in the PFM, the entropy is calculated as follows: $$ entropy = -\sum\limits_{k=1}^n p_{i,k}\log_2(p_{i,k}) $$ -$p_{i,k}$ is the probability of observing base $k$ in the column $i$ of the PFM. In other words, $p_{i,k}$ is simply the value of the cell in the PFM. The entropy value is high when the probabilities of each base is similar and low when one base is much more probable to occur in a given column. The relative portion comes from the fact that we compare the entropy we calculated for a column to the maximum entropy we can obtain. If the all bases are equally likely for a position in the PFM then we will have the maximum entropy and we compare our original entropy to that maximum entropy. The maximum entropy is simply $log_2{n}$ where $n$ is number of letters in the alphabet. In our case we have 4 letters A,C,G and T. The information content is then simply subtracting the observed entropy for a column from the maximum entropy, which translates to the following equation: +$p_{i,k}$ is the probability of observing base $k$ in the column $i$ of the PFM. In other words, $p_{i,k}$ is simply the value of the cell in the PFM. The entropy value is high when the probabilities of each base are similar and low when it is much more probable that only one base occur in a given column. The relative portion comes from the fact that we compare the entropy we calculated for a column to the maximum entropy we can obtain. If the all bases are equally likely for a position in the PFM, then we will have the maximum entropy and we compare our original entropy to that maximum entropy. The maximum entropy is simply $log_2{n}$ where $n$ is number of letters in the alphabet. In our case we have 4 letters A,C,G and T. 
The information content is then simply subtracting the observed entropy for a column from the maximum entropy, which translates to the following equation: $$ IC=log_2(n)+\sum\limits_{k=1}^n p_{i,k}\log_2(p_{i,k}) $$ -The information content, $IC$ in the preceding equation, will be high if a base have a high probability of occurrence and low if all bases are more or less equally likely to occur. +The information content, $IC$, in the preceding equation, will be high if a base has a high probability of occurrence and low if all bases are more or less equally likely to occur. -We can visualize the matrix by visualizing the letters weighted by their probabilities in the PFM. This approach has been shown on the left handside of the Figure \@ref(fig:peak-quality-seqLogo-plot). In addition, we can also the information content per column to weight the probabilities. This means that the columns that have very frequent letters will be higher.This approach is shown on the right handside of the Figure \@ref(fig:peak-quality-seqLogo-plot).We will use below the `seqLogo` package to visualize CTCF motif in two different ways we described above. +We can visualize the matrix by visualizing the letters weighted by their probabilities in the PFM. This approach is shown on the left-hand side of Figure \@ref(fig:peak-quality-seqLogo-plot). In addition, we can also calculate the information content per column to weight the probabilities. This means that the columns that have very frequent letters will be higher. This approach is shown on the right-hand side of Figure \@ref(fig:peak-quality-seqLogo-plot). We will use below the `seqLogo` package to visualize the CTCF motif in the two different ways we described above. ```{r peak-quality-seqLogo-command, eval = FALSE, echo = FALSE, include = TRUE} @@ -2184,7 +2133,7 @@ seqLogo::seqLogo(ctcf_motif) # probabilities seqLogo::seqLogo(ctcf_motif,ic.scale=TRUE) # scaled by IC ``` -```{r peak-quality-seqLogo-plot, echo=FALSE, include=TRUE, fig.cap="CTCF sequence motif visualized as a sequence logo. Y axis ranges from zero to two, and corresponds to the amount of information each base in the motif contributes to the overall motif. The larger the letter the greater the probability of observing just one defined base on the designated position.", out.width='70%'} +```{r peak-quality-seqLogo-plot, echo=FALSE, include=TRUE, fig.cap="CTCF sequence motif visualized as a sequence logo. Y-axis ranges from zero to two, and corresponds to the amount of information each base in the motif contributes to the overall motif. The larger the letter, the greater the probability of observing just one defined base on the designated position.", out.width='70%'} knitr::include_graphics('./Figures/CTCF_Motif.png') ``` @@ -2221,7 +2170,7 @@ head(seq) Once we have extracted the sequences, we can use the CTCF motif to -scan each sequences and determine the probability of CTCF binding. +scan each sequence and determine the probability of CTCF binding. For this we use the `TFBSTools`\index{R Packages!\texttt{TFBSTools}} [@TFBSTools] package. We first convert the raw probability matrix into a `PWMMatrix` object, @@ -2239,13 +2188,13 @@ ctcf_pwm = PWMatrix( ``` We can now use the `searchSeq()` function to scan each sequence for the motif occurrence. -Because the motif matrices are give a continuous binding score, we need to set a cutoff to +Because the motif matrices are given a continuous binding score, we need to set a cutoff to determine when a sequence contains the motif, and when it doesn't. 
The cutoff is set by determining the maximal possible score produced by the motif matrix; a percentage of that score is then taken as the threshold value. For example, if the best sequence would have a score of 1.4 of being bound, then we define a threshold of 80% of 1.4, which is 1.12; and any sequence which -scores less that 1.12 would not be marked as being bound by the protein. +scores less than 1.12 would not be marked as being bound by the protein. For the CTCF, we mark any peak containing a sequence with > 80% of the maximal rescaled score or "relative score" as a positive hit. @@ -2264,8 +2213,8 @@ head(hits)[,1:9] A common diagnostic plot is to graph a reverse cumulative distribution of peak occurrences. -On the x-axis we rank the peaks, with the most highly enriched peak on the -first position, and the least enriched peak on the last position. +On the x-axis we rank the peaks, with the most highly enriched peak in the +first position, and the least enriched peak in the last position. We then walk from the lowest to the highest ranking and measure the percentage of peaks containing the motif. @@ -2285,7 +2234,7 @@ motif_hits_df$perc_peaks = round(motif_hits_df$perc_peaks, 2) We can now visualize the percentage of peaks with matching CTCF motif. -```{r, peak-quality-scan-dist-plot, fig.cap="Percentage of peaks containing the motif'. Higher percentage indicates a better ChIP-experiment, and a better peak calling procedure."} +```{r, peak-quality-scan-dist-plot, fig.cap="Percentage of peaks containing the motif. Higher percentage indicates a better ChIP-experiment, and a better peak calling procedure."} # plot the cumulative distribution of motif hit percentages ggplot( motif_hits_df, @@ -2304,15 +2253,15 @@ ggplot( ggtitle('Percentage of CTCF peaks with the CTCF motif') ``` -The figure \@ref(fig:peak-quality-scan-dist-plot) -shows that, when we take all peaks into account ~45% of +Figure \@ref(fig:peak-quality-scan-dist-plot) +shows that, when we take all peaks into account, ~45% of the peaks contain a CTCF motif. -This is an excellent percentage and indicates a high quality ChIP experiment. +This is an excellent percentage and indicates a high-quality ChIP experiment. Our inability to locate the motif in ~50% of the sequences does not -necessarily need to be a consequence of a poor experiment - sometimes -it is a result of the molecular mechanism of by which the transcription factor +necessarily need to be a consequence of a poor experiment; sometimes +it is a result of the molecular mechanism by which the transcription factor binds. If a transcription factor has multiple binding modes, which are context -dependent - for example, if the transcription factor binds indirectly to +dependent, for example, if the transcription factor binds indirectly to a subset of regions, through an interacting partner, we do not have to observe a motif. @@ -2322,9 +2271,9 @@ an interacting partner, we do not have to observe a motif. If the ChIP experiment was performed properly, we would expect the motif to be localized just below the summit of each peak. By plotting the motif localization around ChIP peaks, we are quantifying -the uncertainty in peak location. +the uncertainty in the peak location. -We will firstly resize our peaks into regions around +/- 1kb around the peak +We will firstly resize our peaks into regions around +/−1-kb around the peak center. 
```{r chip-quality.motifloc.resize} @@ -2345,8 +2294,8 @@ hits = as.data.frame(hits) ``` We now construct a plot, where the -X axis represents the +/ 1000 nucleotides around the peak, while the -Y axis shows the motif enrichment at each position. +X-axis represents the +/- 1000 nucleotides around the peak, while the +Y-axis shows the motif enrichment at each position. ```{r chip-quality-motifloc-plot, fig.cap='Transcription factor sequence motif localization with respect to the defined binding sites.', warning = FALSE} # set the position relative to the start @@ -2365,8 +2314,8 @@ ggplot(data=hits, aes(position)) + plot.title = element_text(hjust = 0.5)) ``` -We can in figure \@ref(fig:chip-quality-motifloc-plot) see that the bulk of motif -hits are found in a region of +/- 250 bp around the peak centers. +In Figure \@ref(fig:chip-quality-motifloc-plot), we can see that the bulk of motif +hits are found in a region of $+/-$ 250 bp around the peak centers. This means that the peak calling procedure was quite precise. @@ -2375,7 +2324,7 @@ This means that the peak calling procedure was quite precise. As the final step of quality control we will visualize the distribution of peaks in different functional genomic regions. The purpose of the analysis is to check whether the location of the peaks -conforms with our prior knowledge. +conforms to our prior knowledge. This analysis is equivalent to constructing distributions for reads. Firstly we download the human gene models and construct the annotation hierarchy\index{annotation hierarchy}. @@ -2402,11 +2351,8 @@ and calculates the summary statistics. The function contains four major parts: 1. Creating a disjoint set of peak regions. - 2. Finding the overlapping annotation for each peak. - 3. Annotating each peak with the corresponding annotation class. - 4. Calculating summary statistics ```{r peak-annotation.function, warning=FALSE} @@ -2453,9 +2399,7 @@ annotatePeaks = function(peaks, annotation_list, name){ ``` Using the above defined `annotatePeaks()` function we will now annotate CTCF -and H3K36me3 peaks. - -Firstly we create a list which contains both CTCF and H3K36me3 peaks. +and H3K36me3 peaks. Firstly we create a list which contains both CTCF and H3K36me3 peaks. ```{r peak-annotation.list} peak_list = list( @@ -2465,7 +2409,7 @@ peak_list = list( ``` Using the `lapply()` function we apply the `annotatePeaks()` function -on each element of the least. +on each element of the list. ```{r peak-annotation.apply.function} # calculate the distribution of peaks in annotation for each experiment @@ -2482,9 +2426,11 @@ statistics into one data frame. annot_peaks_df = dplyr::bind_rows(annot_peaks_list) ``` -And visualize the results as bar plots. +And visualize the results as bar plots. The resulting plot is in Figure \@ref(fig:peak-annotation-plot), which shows that the H3K36me3 peaks are +located preferentially in gene bodies, as expected, while the CTCF peaks are +found preferentially in introns.
-```{r, peak-annotation-plot, fig.cap='Enrichment of transcription factor or histone modifications in functional genomic features'} +```{r, peak-annotation-plot, fig.cap='Enrichment of transcription factor or histone modifications in functional genomic features.'} # plot the distribution of peaks in genomic features ggplot(data = annot_peaks_df, aes( @@ -2504,9 +2450,7 @@ ggplot(data = annot_peaks_df, ylab('Frequency') ``` -The plot\@ref(fig:peak-annotation-plot) shows that the H3K36me3 peaks are -located preferentially in gene bodies, as expected, while the CTCF peaks are -found preferentially in introns. + @@ -2514,11 +2458,11 @@ found preferentially in introns. The first analysis step downstream of peak calling is motif discovery. Motif discovery is a procedure of finding enriched sets of similar short sequences -in a large sequence data set. In our case the large sequence data set are +in a large sequence dataset. In our case, the large sequence dataset consists of sequences around ChIP peaks, while the short sequence sets are the transcription factor binding sites. -There are two types of motif discovery tools: supervised, and unsupervised. +There are two types of motif discovery tools: supervised and unsupervised. Supervised tools require explicit positive (we are certain that the motif is enriched), and negative sequence sets (we are certain that the motif is not enriched), and then search for relative enrichment of short motifs in the foreground versus the background. @@ -2526,16 +2470,16 @@ Unsupervised models, on the other hand, require only a set of positive sequences and then compare motif abundance to a statistically constructed background set. Due to the combinatorial nature of the procedure, motif discovery is -computational expensive. It is therefore often performed on a subset of the -highest quality peaks. In this tutorial we will use `rGADEM`\index{R Packages!\texttt{rGADEM}} -pacakge for motif discovery. -`rGADEM` is a unsupervised, stochastic motif discovery tools, which uses +computationally expensive. It is therefore often performed on a subset of the +highest-quality peaks. In this tutorial we will use the `rGADEM`\index{R Packages!\texttt{rGADEM}} +package for motif discovery. +`rGADEM` is an unsupervised, stochastic motif discovery tool, which uses sampling with subsequent enrichment analysis to find over-represented sequence motifs. We will firstly load our CTCF peaks, and convert them to a GRanges object. We will then select the top 500 peaks, and extract the DNA sequence, which -will be used as input for the motif discovery. Nearby ChIP peaks can have overlapping coordinates. After selection, overlapping CTCF peaks have to be merged using the `reduce()` function from `GenomicRanges` package. If we do not execute this step, we will include the same sequence multiple times in the sequence set, and artificially enrich DNA patterns. +will be used as input for the motif discovery. Nearby ChIP peaks can have overlapping coordinates. After selection, overlapping CTCF peaks have to be merged using the `reduce()` function from the `GenomicRanges` package. If we do not execute this step, we will include the same sequence multiple times in the sequence set, and artificially enrich DNA patterns.
```{r motif-discovery.peak} # read the CTCF peaks created in the peak calling part of the tutorial @@ -2552,7 +2496,7 @@ ctcf_peaks = head(ctcf_peaks, n = 500) ctcf_peaks = reduce(ctcf_peaks) ``` -Create a region of +/- 50 bp around the center of the peaks, +Create a region of $+/-$ 50 bp around the center of the peaks, ```{r motif-discovery.resize} # expand the CTCF peaks @@ -2569,8 +2513,7 @@ library(BSgenome.Hsapiens.UCSC.hg38) ctcf_seq = getSeq(BSgenome.Hsapiens.UCSC.hg38, ctcf_peaks_resized) ``` -We are now ready to run the motif discovery. -Firstly we load the `rGADEM` package: +We are now ready to run the motif discovery. Firstly we load the `rGADEM` package: ```{r, rgadem.library, include=TRUE, eval=TRUE, echo=FALSE, warning = FALSE} # load the rGADEM package @@ -2583,7 +2526,7 @@ specify two parameters: 1. **seed** - the random number generator seed, which will make the analysis reproducible. -2. **nmotifs** - the number of motifs to look for +2. **nmotifs** - the number of motifs to look for. ```{r rgadem.run, include=TRUE, eval=TRUE, echo=FALSE} @@ -2595,15 +2538,15 @@ novel_motifs = GADEM( ) ``` -`rGADEM` package contains a convenient `plot()` function for +The `rGADEM` package contains a convenient `plot()` function for motif visualization. We will use the plot function to visualize the most enriched DNA motif: -```{r motif-discovery-logo, fig.cap='Motif with highest enrichment in top 500 CTCF peaks', width = 6, height = 3} +```{r motif-discovery-logo, fig.cap='Motif with highest enrichment in top 500 CTCF peaks.', width = 6, height = 3} # visualize the resulting motif plot(novel_motifs[1]) ``` -The motif show in Figure \@ref(fig:motif-discovery-logo) corresponds to the +The motif shown in Figure \@ref(fig:motif-discovery-logo) corresponds to the previously visualized CTCF motif. Nevertheless, we will computationally annotate our motif by querying the JASPAR [@khan_2018] database in the next section. @@ -2631,7 +2574,7 @@ unknown_pwm = PWMatrix( Using the `getMatrixSet()` function we extract all motifs which correspond to known human transcription factors. -`opts` parameter defines which `PWM` database to use for comparison. +The `opts` parameter defines which `PWM` database to use for comparison. ```{r motif-annotation.jaspar} # load the JASPAR motif database @@ -2692,10 +2635,10 @@ As expected, the topmost candidate is CTCF. ## What to do next? -One of the first next steps you have your peaks is to find out what kind of genes they might be associated with. This is very similar to gene set analysis \index{gene set analysis} we have introduced for RNA-seq in Chapter \@ref(rnaseqanalysis). The same tools such as `gProfileR` package\index{R Packages!\texttt{gProfileR}} can be used on the genes associated with the peaks. However, associating peaks to genes is not always trivial due to long range gene regulation. Many enhancers can regulate genes that are far away and their targets are not always the nearest gene. However, associating peaks to nearest genes is a generally practiced strategy in ChIP-seq analysis. We have introduced how to gene the nearest genes in Chapter \@ref(genomicIntervals). There are also other R packages that will do the association to genes and the gene set analysis in a single workflow. One such package is `rGREAT` from Bioconductor. This package relies on a web-based tool called [_GREAT_](http://great.stanford.edu/public/html/). +One of the first next steps after you have your peaks is to find out what kind of genes they might be associated with. 
This is very similar to the gene set analysis \index{gene set analysis} we introduced for RNA-seq in Chapter \@ref(rnaseqanalysis). The same tools, such as the `gProfileR` package\index{R Packages!\texttt{gProfileR}}, can be used on the genes associated with the peaks. However, associating peaks to genes is not always trivial due to long-range gene regulation. Many enhancers can regulate genes that are far away and their targets are not always the nearest gene. However, associating peaks to nearest genes is a generally practiced strategy in ChIP-seq analysis. We have introduced how to find the nearest genes in Chapter \@ref(genomicIntervals). There are also other R packages that will do the association to genes and the gene set analysis in a single workflow. One such package is `rGREAT` from Bioconductor. This package relies on a web-based tool called [_GREAT_](http://great.stanford.edu/public/html/). Knowing every location in the genome bound by a protein can provide a lot -of mechanistic information, however, quite often it is hard to make +of mechanistic information. However, quite often it is hard to make biologically relevant conclusions just from one ChIP-seq experiment (i.e. if we want to explain how our protein causes a disease, it is hard to guess which of the tens of thousands of binding places is relevant for the phenotype). @@ -2709,20 +2652,20 @@ It is possible to look at the pairwise differences between samples using differential peak calling [@zhang_2014; @lun_2014; @allhoff_2014; @allhoff_2016]. It is a procedure analogous to the differential expression analysis, except it results in sets of coordinates that are differentially bound in two biological -conditions. We can then search for specific DNA binding motif in such regions, +conditions. We can then search for a specific DNA binding motif in such regions, or correlate changes in the binding with changes in gene expression. -With an increase in number of ChIP experiment, pairwise comparisons becomes +With an increase in the number of ChIP experiments, pairwise comparisons become combinatorially complex. In this case we can segment the genome into multiple classes, where each class corresponds to a combination of bound transcription factors. Genome segmentation is usually done using probabilistic models (such as hidden Markov models [@ernst_2012; @hoffman_2012]), or machine learning algorithms [@mortazavi_2013]. -## Exercises: +## Exercises -### Quality control: +### Quality control -1. Apply the fragment size estimation procedure to all ChIP and Input available data sets [Difficulty: **Beginner**] +1. Apply the fragment size estimation procedure to all available ChIP and Input datasets. [Difficulty: **Beginner**] 2. Visualize the resulting distributions. [Difficulty: **Beginner**] @@ -2730,12 +2673,12 @@ Markov models [@ernst_2012; @hoffman_2012]), or machine learning algorithms [@mo 4. Write a function which converts the bam files into bigWig files. [Difficulty: **Beginner**] -5. Apply the function to all files, and visualize them in the Genome browser. -Observe the signal profiles, what can you notice, about the similarity of the samples? [Difficulty: **Beginner**] +5. Apply the function to all files, and visualize them in the genome browser. Observe the signal profiles. What can you notice about the similarity of the samples? [Difficulty: **Beginner**] -6. Use GViz to visualize the profiles for CTCF, SMC3 and ZNF143 [Difficulty: **Beginner/Intermediate**] +6.
Use `Gviz` to visualize the profiles for CTCF, SMC3 and ZNF143. [Difficulty: **Beginner/Intermediate**] -7. Calculate the cross correlation for the both CTCF replicates, and +7. Calculate the cross correlation for both CTCF replicates, and the input samples. How does the profile look for the control samples? [Difficulty: **Intermediate**] 8. Calculate the cross correlation coefficients for all samples and @@ -2754,15 +2697,15 @@ the percentage of CTCF peaks falling in such regions. [Difficulty: **Advanced**] How many peaks are specific to each biological replicate, and how many peaks overlap? [Difficulty: **Intermediate**] 5. Plot a scatter plot of signal strengths for biological replicates. Do intersecting -peaks have equal signal strength in both samples. [Difficulty: **Intermediate**] +peaks have equal signal strength in both samples? [Difficulty: **Intermediate**] 6. Quantify the combinatorial binding of all three proteins. Find the number of places which are bound by all three proteins, by a combination of two proteins, and exclusively by one protein. Annotate the different regions based on their genomic location. [Difficulty: **Advanced**] -7. Correlate the normR enrichment score for CTCF with peak presence/absence. -(create boxplots of enrichment for peaks which contain and do not contain CTCF motifs) [Difficulty: **Advanced**] +7. Correlate the normR enrichment score for CTCF with peak presence/absence (create boxplots of enrichment for peaks which contain and do not contain CTCF motifs). [Difficulty: **Advanced**] 8. Explore the co-localization of CTCF and ZNF143. Where are the co-bound regions located? Which sequence motifs do they contain? Download the ChIA-pet @@ -2771,13 +2714,13 @@ classes of binding sites. [Difficulty: **Advanced**] #### Motif discovery -1. Repeat the motif discovery analysis on peaks from ZNF143 transcription factor. -How many motifs do you observe? How do the motifs look like (visualize the motif logs)? [Difficulty: **Intermediate**] +1. Repeat the motif discovery analysis on peaks from the ZNF143 transcription factor. How many motifs do you observe? How do the motifs look (visualize the motif logos)? [Difficulty: **Intermediate**] 2. Scan the ZNF143 peaks with the top motifs found in the previous exercise. Where are the motifs located? [Difficulty: **Advanced**] -3. Scan the CTCF peaks with top motifs identified in **ZNF143** peaks. +3. Scan the CTCF peaks with the top motifs identified in the **ZNF143** peaks. Where are the motifs located? What can you conclude from the previous exercises? [Difficulty: **Advanced**] diff --git a/10-bs-seq-analysis.Rmd b/10-bs-seq-analysis.Rmd index da2ee06..f8b50f9 100644 --- a/10-bs-seq-analysis.Rmd +++ b/10-bs-seq-analysis.Rmd @@ -13,19 +13,19 @@ knitr::opts_chunk$set(echo = TRUE, fig.align = 'center') ``` -Epigenome consists of chemical modifications of DNA and histones. These modifications are shown to be associated with gene regulation in various settings (see Chapter \@ref(intro) for an intro). These modifications in turn have specific importance for cell type identification. There are many different ways of measuring such modifications. We have shown how histone modifications can be measured in a genome-wide manner in Chapter \@ref(chipseq) using ChIP-seq. In this chapter we will focus on the analysis of DNA methylation data using data from a technique called Bisulfite sequencing (BS-seq). We will introduce how to process data and data quality checks, as well as statistical analysis relevant for BS-seq data.
+The epigenome consists of chemical modifications of DNA and histones. These modifications are shown to be associated with gene regulation in various settings (see Chapter \@ref(intro) for an intro). These modifications in turn have specific importance for cell type identification. There are many different ways of measuring such modifications. We have shown how histone modifications can be measured in a genome-wide manner in Chapter \@ref(chipseq) using ChIP-seq. In this chapter we will focus on the analysis of DNA methylation data using data from a technique called bisulfite sequencing (BS-seq). We will introduce how to process data and data quality checks, as well as statistical analysis relevant for BS-seq data. -## What is DNA methylation ? -Cytosine methylation (5-methylcytosine, 5mC) is one of the main covalent base modifications in eukaryotic genomes, generally observed on CpG dinucleotides. Methylation can also rarely occur in non-CpG context, but this was mainly observed in human embryonic stem and neuronal cells [@Lister2009-sd; @Lister2013-vs]. DNA methylation is a part of the epigenetic regulation mechanism of gene expression. It is cell-type specific DNA modification. \index{DNA methylation}It is reversible but mostly remains stable through cell division. There are roughly 28 million CpGs in the human genome, 60–80% are generally methylated. Less than 10% of CpGs occur in CG-dense regions that are termed CpG islands in the human genome [@Smith2013-jh]. It has been demonstrated that DNA methylation is also not uniformly distributed over the genome, but rather is associated with CpG density. In vertebrate genomes, cytosine bases are usually unmethylated in CpG-rich regions such as CpG islands and tend to be methylated in CpG-deficient regions. Vertebrate genomes are largely CpG deficient except at CpG islands. Conversely, invertebrates such as Drosophila melanogaster and Caenorhabditis elegans do not exhibit cytosine methylation and consequently do not have CpG rich and poor regions but rather a steady CpG frequency over their genomes [@Deaton2011-pm]. +## What is DNA methylation? +Cytosine methylation (5-methylcytosine, 5mC) is one of the main covalent base modifications in eukaryotic genomes, generally observed on CpG dinucleotides. Methylation can also rarely occur in a non-CpG context, but this was mainly observed in human embryonic stem and neuronal cells [@Lister2009-sd; @Lister2013-vs]. DNA methylation is a part of the epigenetic regulation mechanism of gene expression. It is a cell-type-specific DNA modification. \index{DNA methylation}It is reversible but mostly remains stable through cell division. There are roughly 28 million CpGs in the human genome; 60–80% of them are generally methylated. Less than 10% of CpGs occur in CG-dense regions that are termed CpG islands in the human genome [@Smith2013-jh]. It has been demonstrated that DNA methylation is also not uniformly distributed over the genome, but rather is associated with CpG density. In vertebrate genomes, cytosine bases are usually unmethylated in CpG-rich regions such as CpG islands and tend to be methylated in CpG-deficient regions. Vertebrate genomes are largely CpG deficient except at CpG islands. Conversely, invertebrates such as _Drosophila melanogaster_ and _Caenorhabditis elegans_ do not exhibit cytosine methylation and consequently do not have CpG-rich and CpG-poor regions but rather a steady CpG frequency over their genomes [@Deaton2011-pm]. ### How DNA methylation is set?
-DNA methylation is established by DNA methyltransferases DNMT3A and DNMT3B in combination with DNMT3L and maintained through cell division by the methyltransferase DNMT1 and associated proteins. DNMT3a and DNMT3b are in charge of the de novo methylation during early development. Loss of 5mC can be achieved passively by dilution during replication or exclusion of DNMT1 from the nucleus. Recent discoveries of ten-eleven translocation (TET) family of proteins and their ability to convert 5-methylcytosine (5mC) into 5-hydroxymethylcytosine (5hmC) in vertebrates provide a path for catalysed active DNA demethylation [@Tahiliani2009-ar]. Iterative oxidations of 5hmC catalysed by TET result in 5-formylcytosine (5fC) and 5-carboxylcytosine (5caC). 5caC mark is excised from DNA by G/T mismatch-specific thymine-DNA glycosylase (TDG), which as a result returns cytosine residue back to its unmodified state [@He2011-pw]. Apart from these, mainly bacteria but possibly higher eukaryotes contain base modifications on bases other than cytosine, such as methylated adenine or guanine [@Clark2011-sc]. +DNA methylation is established by DNA methyltransferases DNMT3A and DNMT3B in combination with DNMT3L and maintained through cell division by the methyltransferase DNMT1 and associated proteins. DNMT3a and DNMT3b are in charge of the de novo methylation during early development. Loss of 5mC can be achieved passively by dilution during replication or exclusion of DNMT1 from the nucleus. Recent discoveries of the ten-eleven translocation (TET) family of proteins and their ability to convert 5-methylcytosine (5mC) into 5-hydroxymethylcytosine (5hmC) in vertebrates provide a path for catalyzed active DNA demethylation [@Tahiliani2009-ar]. Iterative oxidations of 5hmC catalyzed by TET result in 5-formylcytosine (5fC) and 5-carboxylcytosine (5caC). 5caC mark is excised from DNA by G/T mismatch-specific thymine-DNA glycosylase (TDG), which as a result reverts cytosine residue to its unmodified state [@He2011-pw]. Apart from these, mainly bacteria, but possibly higher eukaryotes, contain base modifications on bases other than cytosine, such as methylated adenine or guanine [@Clark2011-sc]. -### How to measure DNA methylation with bisulfite-sequencing +### How to measure DNA methylation with bisulfite sequencing One of the most reliable and popular ways to measure DNA methylation is high-throughput bisulfite sequencing. This method, and the related ones, allow measurement of DNA methylation at the single nucleotide resolution. The bisulfite conversion turns unmethylated Cs to Ts and methylated Cs remain intact. Then, the only thing to do is to align the reads with those C->T conversions and count C->T mutations to calculate fraction of methylated bases. In the end, we can get quantitative genome-wide measurements for DNA methylation. ## Analyzing DNA methylation data -For the remainder of this chapter, we will explain how to do DNA methylation analysis using R. The analysis process is somewhat similar to the analysis patterns observed in other sequencing data analyses. The process can be chunked to four main parts with further sub-chunks:\index{DNA methylation +For the remainder of this chapter, we will explain how to do DNA methylation analysis using R. The analysis process is somewhat similar to the analysis patterns observed in other sequencing data analyses. The process can be chunked to four main parts with further sub-chunks:\index{DNA methylation} 1. 
Processing raw data - Quality check @@ -44,14 +44,14 @@ For the remainder of this chapter, we will explain how to do DNA methylation ana - Integration with other quantitative genomics data ## Processing raw data and getting data into R -The rawest form of data that most users get is probably in the form of fastq files obtained from the sequencing experiments. We will describe necessary steps and the tools that can be used for raw data processing and if exists we will mention their R equivalents. However, the data processing is usually done outside of the R framework and for the following sections we will assume that the data processing is done and our analysis is starting from methylation call files. +The rawest form of data that most users get is probably in the form of fastq files obtained from the sequencing experiments. We will describe the necessary steps and the tools that can be used for raw data processing and if they exist, we will mention their R equivalents. However, the data processing is usually done outside of the R framework, and for the following sections we will assume that the data processing is done and our analysis is starting from methylation call files. -Typical data processing step starts with data quality check. The fastq files are first ran through a quality check software that shows the quality of the sequencing run. We would typically use [fastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) for this. However, there are several bioconductor packages that could be of use, such as [`Rqc`](https://bioconductor.org/packages/release/bioc/html/Rqc.html) and [`QuasR`](https://bioconductor.org/packages/release/bioc/html/QuasR.html). We have introduced how to use some of these tools for sequencing quality check in Chapter \@ref(processingReads). Following the quality check, provided everything is OK, the reads can be aligned to the genome. Before the alignment adapters or low quality ends of reads can be trimmed to increase number of alignments. Low quality ends mostly likely have poor basecalls which will lead to many mismatches. Reads with non-trimmed adapters will also not align to the genome. We would use adapter trimming tools such as [cutadapt](https://cutadapt.readthedocs.io/en/stable/) or [flexbar](https://github.com/seqan/flexbar) for this purpose, although there are a bunch of them to be chosen from. Following this, reads are aligned to the genome with a bisulfite treatment aware aligner. For our own purposes, we use Bismark[@Krueger2011-vv], however there are other equally accurate aligners, some are reviewed [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3378906/). In addition, Bioconductor package [`QuasR`](https://bioconductor.org/packages/release/bioc/html/QuasR.html) can align BS-seq reads within the R framework. +The typical data processing step starts with a data quality check. The fastq files are first run through quality check software that shows the quality of the sequencing run. We would typically use [fastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) for this. However, there are several bioconductor packages that could be of use, such as [`Rqc`](https://bioconductor.org/packages/release/bioc/html/Rqc.html) and [`QuasR`](https://bioconductor.org/packages/release/bioc/html/QuasR.html). We have introduced how to use some of these tools for sequencing quality check in Chapter \@ref(processingReads). Following the quality check, provided everything is OK, the reads can be aligned to the genome. 
Before the alignment, adapters or low-quality ends of the reads can be trimmed to increase the number of alignments. Low-quality ends most likely have poor basecalls, which will lead to many mismatches. Reads with non-trimmed adapters will also not align to the genome. We would use adapter trimming tools such as [cutadapt](https://cutadapt.readthedocs.io/en/stable/) or [flexbar](https://github.com/seqan/flexbar) for this purpose, although there are a bunch of them to choose from. Following this, reads are aligned to the genome with a bisulfite-treatment-aware aligner. For our own purposes, we use Bismark [@Krueger2011-vv]; however, there are other equally accurate aligners, and some are reviewed [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3378906/). In addition, the Bioconductor package [`QuasR`](https://bioconductor.org/packages/release/bioc/html/QuasR.html) can align BS-seq reads within the R framework. -After alignment, we need to call C->T conversions and calculate fraction/percentage of methylation. Most of the time aligners come with auxiliary tools that calculate per base methylation values. Normally, they output a tabular format containing the location of the Cs and methylation value and strand. Within R, `QuasR`\index{R Packages!\texttt{QuasR}} and `methylKit` \index{R Packages!\texttt{methylKit}}can call methylation values from BAM files albeit with some limitations. In essence, these methylation call files can be easily read into R and downstream analysis within R starts from that point. An important quality measure at this stage is to look at conversion rate. This simply means how many unmethylated Cs are converted to Ts. Since we expect, non-CpG methylation to be rare. We can simply count number of C->T conversions in non-CpG context and calculate conversion rate. The best way to do this would be via spike-in sequences where we expect no methylation at all. Since, non-CpG methylation is tissue specific, calculating conversion rate using non-CpG Cs might be misleading in some cases. +After alignment, we need to call C->T conversions and calculate the fraction/percentage of methylation. Most of the time, aligners come with auxiliary tools that calculate per-base methylation values. Normally, they output a tabular format containing the location of the Cs, the methylation value and the strand. Within R, `QuasR`\index{R Packages!\texttt{QuasR}} and `methylKit`\index{R Packages!\texttt{methylKit}} can call methylation values from BAM files albeit with some limitations. In essence, these methylation call files can be easily read into R and downstream analysis within R starts from that point. An important quality measure at this stage is to look at the conversion rate. This simply means how many unmethylated Cs are converted to Ts. Since we expect non-CpG methylation to be rare, we can simply count the number of C->T conversions in the non-CpG context and calculate the conversion rate. The best way to do this would be via spike-in sequences where we expect no methylation at all. Since non-CpG methylation is tissue-specific, calculating the conversion rate using non-CpG Cs might be misleading in some cases. ## Data filtering and exploratory analysis -We assume that we start the analysis in R with the methylation call files. We will read those files in and carry out exploratory analysis, we will show how to filter bases or regions from the data and in what circumstances we might need to do so.
We will use [methylKit](https://bioconductor.org/packages/release/bioc/html/methylKit.html)[@Akalin2012-af] package for the bulk of the analysis. \index{R Packages!\texttt{methylKit}} +We assume that we start the analysis in R with the methylation call files. We will read those files in and carry out exploratory analysis, and we will show how to filter bases or regions from the data and in what circumstances we might need to do so. We will use the [methylKit](https://bioconductor.org/packages/release/bioc/html/methylKit.html)[@Akalin2012-af] package for the bulk of the analysis. \index{R Packages!\texttt{methylKit}} ### Reading methylation call files A typical methylation call file looks like this: @@ -63,10 +63,10 @@ tab ``` -Most of the time bisulfite sequencing experiments have test and control samples. The test samples can be from a disease tissue while the control samples can be from a healthy tissue. You can read a set of methylation call files that have test/control conditions giving `treatment` vector option. The treatment vector defines the sample groups and it is very important for the differential methylation analysis. For sake of subsequent analysis, file.list, sample.id and treatment option should have the same order. In the following example, first two files have the sample ids "test1" and "test2" and as determined by treatment vector they belong to the same group. The third and fourth files have sample ids "ctrl1" and "ctrl2" and they belong to the same group as indicated by the treatment vector. We will first get a list of file paths and have a look at the content. +Most of the time, bisulfite sequencing experiments have test and control samples. The test samples can be from a disease tissue while the control samples can be from a healthy tissue. You can read a set of methylation call files that have test/control conditions by giving a `treatment` vector option. The treatment vector defines the sample groups and it is very important for the differential methylation analysis. For the sake of subsequent analysis, the `file.list`, `sample.id` and `treatment` options should have the same order. In the following example, the first two files have the sample IDs "test1" and "test2" and, as determined by the treatment vector, they belong to the same group. The third and fourth files have sample IDs "ctrl1" and "ctrl2" and they belong to the same group as indicated by the treatment vector. We will first get a list of file paths and have a look at the content. -```{r readMethFiles,message=FALSE} +```{r readMethFiles,message=FALSE,echo=FALSE} library(methylKit) file.list=list( system.file("extdata", "test1.myCpG.txt", package = "methylKit"), @@ -77,10 +77,10 @@ file.list=list( system.file("extdata", system.file("extdata", "control2.myCpG.txt", package = "methylKit") ) -file.list + ``` -As you can see `file.list` variable is a simple list of file paths. Each file contains methylation calls for a given sample. Now, we can read the files with `methRead()` function. +If you look at what is inside the `file.list` variable, you will see that it is a simple list of file paths. Each file contains methylation calls for a given sample. Now, we can read the files with the `methRead()` function. ```{r readfiles_Chp10} # read the files to a methylRawList object: myobj myobj=methRead(file.list, @@ -91,13 +91,13 @@ myobj=methRead(file.list, ) ``` -tab-separated bedgraph like formats from Bismark methylation caller can also be read in by methylkit.
In those cases, we have to provide either `pipeline="bismarkCoverage"` or `pipeline="bismarkCytosineReport"` to `methRead` function. In addition to the options we mentioned above, -any tab separated text file with a generic format can be read in using methylKit, +Tab-separated bedGraph-like formats from the Bismark methylation caller can also be read in by methylKit. In those cases, we have to provide either `pipeline="bismarkCoverage"` or `pipeline="bismarkCytosineReport"` to the `methRead()` function. In addition to the options we mentioned above, +any tab-separated text file with a generic format can be read in using methylKit, such as methylation ratio files from [BSMAP](http://code.google.com/p/bsmap/). See [here](http://zvfak.blogspot.com/2012/10/how-to-read-bsmap-methylation-ratio.html) for an example. Before we move on, let us have a look at what kind of information is stored in `myobj`. This is technically a `methylRawList` object, which is essentially a list of `methylRaw` objects. These objects hold -the information for location of Cs, and methylated Cs and unmethylated Cs. +the information on the genomic location of Cs, and the counts of methylated and unmethylated Cs. ```{r showMethObj} ## inside the methylRawList object length(myobj) @@ -105,22 +105,22 @@ head(myobj[[1]]) ``` ### Further quality check -It is always a good idea to check how the data looks like before proceeding further. For example, the methylation values should have bimodal distribution generally. This can be checked via -`getMethylationStats` function. Normally, we should see a bimodal -distributions. Strong deviations from the bimodality may be due poor experimental quality, such as problems with bisulfite treatment.Below we are showing how to get these plots using `getMethylationStats()` function. The result is shown in Figure \@ref(fig:methStats). As expected it has a bimodal distribution where most CpGs have either high methylation or low methylation. -```{r methStats,fig.cap="Histogram for methylation values for all CpGs in the data set"} +It is always a good idea to check how the data looks before proceeding further. For example, the methylation values should generally have a bimodal distribution. This can be checked via the +`getMethylationStats()` function. Normally, we should see bimodal +distributions. Strong deviations from the bimodality may be due to poor experimental quality, such as problems with bisulfite treatment. Below we show how to get these plots using the `getMethylationStats()` function. The result is shown in Figure \@ref(fig:methStats). As expected, it has a bimodal distribution where most CpGs have either high methylation or low methylation. +```{r methStats,fig.cap="Histogram for methylation values for all CpGs in the dataset."} getMethylationStats(myobj[[2]],plot=TRUE,both.strands=FALSE) ``` -In addition, we might want to see coverage values. By default, methylkit handles bases with at least 10X coverage by that can be changed. The bases with unusually high coverage is usually alarming. It might indicate a PCR bias issue in the experimental procedure. The general coverage statistics can be checked with -`getCoverageStats` function shown below. The resulting plot is shown in Figure \@ref(fig:coverageStats). +In addition, we might want to see coverage values. By default, methylKit handles bases with at least 10X coverage, but that can be changed. The bases with unusually high coverage are usually alarming. It might indicate a PCR bias issue in the experimental procedure.
The general coverage statistics can be checked with the +`getCoverageStats()` function shown below. The resulting plot is shown in Figure \@ref(fig:coverageStats). -```{r coverageStats,fig.cap="Histogram for log10 read counts per CpG"} +```{r coverageStats,fig.cap="Histogram for log10 read counts per CpG."} getCoverageStats(myobj[[2]],plot=TRUE,both.strands=FALSE) ``` -It might be useful to filter samples based on coverage. Particularly, if our samples are suffering from PCR bias it would be useful to discard bases with very high read coverage. Furthermore, we would also like to discard bases that have low read coverage, a high enough read coverage will increase the power of the statistical tests. The code below filters a `methylRawList` and discards bases that have coverage below 10X and also discards the bases that have more than 99.9th percentile of coverage in each sample. +It might be useful to filter samples based on coverage. Particularly, if our samples are suffering from PCR bias, it would be useful to discard bases with very high read coverage. Furthermore, we would also like to discard bases that have low read coverage; a high enough read coverage will increase the power of the statistical tests. The code below filters a `methylRawList`, discards bases that have coverage below 10X, and also discards the bases that have more than 99.9th percentile of coverage in each sample. ```{r filterCovMeth} filtered.myobj=filterByCoverage(myobj,lo.count=10,lo.perc=NULL, @@ -129,19 +129,19 @@ filtered.myobj=filterByCoverage(myobj,lo.count=10,lo.perc=NULL, ### Merging samples into a single table -When we first read the files, each file is stored as its own entity. If we want compare samples in any way, we need to make a unified data structure that contains CpGs that are covered in most samples. The `unite` function creates a new object using the CpGs covered in each sample. This means +When we first read the files, each file is stored as its own entity. If we want to compare samples in any way, we need to make a unified data structure that contains CpGs that are covered in most samples. The `unite()` function creates a new object using the CpGs covered in each sample. ```{r uniteMeth} ## we use :: notation to make sure unite() function from methylKit is called meth=methylKit::unite(myobj, destrand=FALSE) ``` -Let us take a look at the data content of methylBase object: +Let us take a look at the data content of the `methylBase` object: ```{r headMeth} head(meth) ``` -By default, `unite` function produces bases/regions covered in all samples. That requirement can be relaxed using "min.per.group" option in `unite` function. +By default, the `unite()` function produces bases/regions covered in all samples. That requirement can be relaxed using the `min.per.group` option in the `unite()` function. ```{r methUnite,eval=FALSE} # creates a methylBase object, @@ -153,9 +153,9 @@ meth.min=unite(myobj,min.per.group=1L) ``` ### Filtering CpGs -We might need to filter the CpGs further before exploratory analysis or even before the downstream analysis such as differential methylation . For exploratory analysis, it is of general interest to see how samples relate to each other and we might want to remove CpGs that are not variable before doing that. Or we might remove Cs that are potentially C->T mutations. First, we show how to -filter based on variation. Below, we extract percent methylation values from CpGs as a matrix. Calculate standard deviation for each CpG and filter based on standard deviation. 
We also plot the the distribution of per CpG standard deviations with `hist()` function. The resulting plot is shown in Figure \@ref(fig:methVar). -```{r methVar,fig.cap="Histogram of per CpG standard deviations"} +We might need to filter the CpGs further before exploratory analysis or even before the downstream analysis such as differential methylation. For exploratory analysis, it is of general interest to see how samples relate to each other and we might want to remove CpGs that are not variable before doing that. Or we might remove Cs that are potentially C->T mutations. First, we show how to +filter based on variation. Below, we extract percent methylation values from CpGs as a matrix. We then calculate the standard deviation for each CpG and filter based on it. We also plot the distribution of per-CpG standard deviations with the `hist()` function. The resulting plot is shown in Figure \@ref(fig:methVar). +```{r methVar,fig.cap="Histogram of per-CpG standard deviations."} pm=percMethylation(meth) # get percent methylation matrix mds=matrixStats::rowSds(pm) # calculate standard deviation of CpGs head(meth[mds>20,]) @@ -164,9 +164,9 @@ hist(mds,col="cornflowerblue",xlab="Std. dev. per CpG") ``` Now, let's assume we know the locations of C->T mutations. These locations should be removed from the analysis as they do not represent -bisulfite treatment associated conversions. Mutation locations are +bisulfite-treatment-associated conversions. Mutation locations are stored in a `GRanges` object, and we can use that to remove CpGs -overlapping with mutations. In order to do overlap operation, we will convert the methylKit object to a `GRanges` object and do the filtering with `%over%` function within `[ ]`. The returned object will still be a methylKit object. +overlapping with mutations. In order to do the overlap operation, we will convert the methylKit object to a `GRanges` object and do the filtering with the `%over%` function within `[ ]`. The returned object will still be a methylKit object. ```{r snps} library(GenomicRanges) # example SNP @@ -181,12 +181,12 @@ nrow(sub.meth) ``` ### Clustering samples -Clustering is used for grouping data points by their similarity. It is a general concept that can be achieved by many different algorithms and we introduced clustering and multiple prominent clustering algorithms in Chapter \@ref(unsupervisedLearning). In the context of DNA methylation we are trying to find samples that are similar to each other. For example, if we sequenced 3 heart samples and 4 liver samples, we would expect liver samples will be more similar to each other than heart samples on the DNA methylation space. +Clustering is used for grouping data points by their similarity. It is a general concept that can be achieved by many different algorithms and we introduced clustering and multiple prominent clustering algorithms in Chapter \@ref(unsupervisedLearning). In the context of DNA methylation, we are trying to find samples that are similar to each other. For example, if we sequenced 3 heart samples and 4 liver samples, we would expect the liver samples to be more similar to each other than to the heart samples in the DNA methylation space. The following function will cluster the samples and draw a dendrogram. -It will use correlation distance which is $1-\rho$ , where $\rho$ is the correlation coefficient between two pairs of samples. The cluster tree will be drawn using the "ward" method. \index{clustering!
+It will use correlation distance, which is $1-\rho$ , where $\rho$ is the correlation coefficient between two pairs of samples. The cluster tree will be drawn using the "ward" method. \index{clustering! hierarchical clustering}This specific variant uses a "bottom up" approach: each data point starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. In Ward's method, two clusters are merged if the variance is minimized compared to other possible merge operations. This bottom up approach helps build the dendrogram showing the relationship between clusters. The result of the clustering is shown in Figure \@ref(fig:clusterMethPlot). -```{r clusterMethPlot, fig.cap="Dendrogram for samples using correlation distance and Ward's method for hierarchical clustering"} +```{r clusterMethPlot, fig.cap="Dendrogram for samples using correlation distance and Ward's method for hierarchical clustering."} clusterSamples(meth, dist="correlation", method="ward", plot=TRUE) ``` @@ -197,42 +197,42 @@ hc = clusterSamples(meth, dist="correlation", method="ward", plot=FALSE) ``` ### Principal component analysis -Principal component analysis (PCA) \index{principal component analysis (PCA)}is a mathematical transformation of (possibly) correlated variables into a number of uncorrelated variables called principal components. The resulting components from this transformation is defined in such a way that the first principal component has the highest variance and accounts for as most of the variability in the data. We have introduced PCA and other similar methods in Chapter \@ref(unsupervisedLearning). The following function will plot a scree plot for importance of components and the result is shown in Figure \@ref(fig:pcaMethScree). +Principal component analysis (PCA) \index{principal component analysis (PCA)}is a mathematical transformation of (possibly) correlated variables into a number of uncorrelated variables called principal components. The resulting components from this transformation are defined in such a way that the first principal component has the highest variance and accounts for most of the variability in the data. We have introduced PCA and other similar methods in Chapter \@ref(unsupervisedLearning). The following function will plot a scree plot for importance of components and the result is shown in Figure \@ref(fig:pcaMethScree). -```{r pcaMethScree, fig.cap="Scree plot for explained variance for principal components"} +```{r pcaMethScree, fig.cap="Scree plot for explained variance for principal components."} PCASamples(meth, screeplot=TRUE) ``` -We can also plot PC1 and PC2 axis and a scatter plot of our samples on those axis which will reveal how they cluster within these new dimensions. Similar to clustering dendrogram, we would like to see samples that are similar to be close to each other on the scatter plot. If they are not, it might indicate problems with the experiment such as batch effects. The function below plots the samples in such a scatter plot on principal component axes. The resulting plot is shown in Figure \@ref(fig:pcaMethScatter). +We can also plot the PC1 and PC2 axes and a scatter plot of our samples on those axes which will reveal how they cluster within these new dimensions. Similar to the clustering dendrogram, we would like to see samples that are similar to be close to each other on the scatter plot. If they are not, it might indicate problems with the experiment such as batch effects. 
The function below plots the samples in such a scatter plot on principal component axes. The resulting plot is shown in Figure \@ref(fig:pcaMethScatter). -```{r pcaMethScatter, fig.cap="Samples plotted on principal components"} +```{r pcaMethScatter, fig.cap="Samples plotted on principal components."} pc=PCASamples(meth,obj.return = TRUE, adj.lim=c(1,1)) ``` -In this case, we also returned an object from the plotting function. this is the output of R `prcomp()` function, which includes loadings and eigen vectors which might be useful. You can also do your own PCA analysis using `percMethylation()` and `prcomp()`. In the case above, the methylation matrix is transponsed. This allows us to compare distances between samples on the PCA scatterplot. +In this case, we also returned an object from the plotting function. This is the output of the `prcomp()` function, which includes loadings and eigenvectors, which might be useful. You can also do your own PCA analysis using `percMethylation()` and `prcomp()`. In the case above, the methylation matrix is transposed. This allows us to compare distances between samples on the PCA scatter plot. -## Extracting interesting regions: segmentation and differential methylation -When analyzing DNA methylation data, we usually look for regions that are different than the rest of the methylome or different from a reference methylome. These regions are so called "interesting regions". They usually mark important genomic features that are related to gene regulation which in turn defines the cell type. Therefore, it is a general interest to find such regions and analyze them further to understand our biological sample or to answer specific research questions. Below we will describe two ways of defining "regions of interest". +## Extracting interesting regions: Differential methylation and segmentation +When analyzing DNA methylation data, we usually look for regions that are different than the rest of the methylome or different from a reference methylome. These regions are so-called "interesting regions". They usually mark important genomic features that are related to gene regulation, which in turn defines the cell type. Therefore, it is of general interest to find such regions and analyze them further to understand our biological sample or to answer specific research questions. Below we will describe two ways of defining "regions of interest". ### Differential methylation -Once methylation proportions per base are obtained, generally, the differences between methylation profiles are considered next. When there are multiple sample groups where each group defines a separate biological entity or treatment, it is usually of interest to locate bases or regions with different methylation proportions across the sample groups. The bases or regions with different methylation proportions across samples are called differentially methylated CpG sites (DMCs) and differentially methylated regions (DMRs). They have been shown to play a role in many different diseases due to their association with epigenetic control of gene regulation. In addition, DNA methylation profiles can be highly tissue-specific due to their role in gene regulation [@Schubeler2015-ai]. DNA methylation is highly informative when studying normal and diseased cells, because it can also act as a biomarker. For example, the presence of large-scale abnormally methylated genomic regions is a hallmark feature of many types of cancers [@Ehrlich2002-hv].
Because of aforementioned reasons, investigating differential methylation is usually one of the primary goals of doing bisulfite sequencing. +Once methylation proportions per base are obtained, generally, the differences between methylation profiles are considered next. When there are multiple sample groups where each group defines a separate biological entity or treatment, it is usually of interest to locate bases or regions with different methylation proportions across the sample groups. The bases or regions with different methylation proportions across samples are called differentially methylated CpG sites (DMCs) and differentially methylated regions (DMRs). They have been shown to play a role in many different diseases due to their association with epigenetic control of gene regulation. In addition, DNA methylation profiles can be highly tissue-specific due to their role in gene regulation [@Schubeler2015-ai]. DNA methylation is highly informative when studying normal and diseased cells, because it can also act as a biomarker. For example, the presence of large-scale abnormally methylated genomic regions is a hallmark feature of many types of cancers [@Ehrlich2002-hv]. Because of the aforementioned reasons, investigating differential methylation is usually one of the primary goals of doing bisulfite sequencing. #### Fisher's exact test -Differential DNA methylation is usually calculated by comparing the proportion of methylated Cs in a test sample relative to a control. In simple comparisons between such pairs of samples (i.e. test and control), methods such as Fisher’s Exact Test can be used. if there are replicates, replicates can be pooled within groups to a single sample per group. This strategy, however, does not take into account biological variability between replicates. We will now show how to compare pairs of samples via `calculateDiffMeth()` function in `methylKit`. When there are only one sample per sample group, `calculateDiffMeth()` automatically applies Fisher's exact test. We will not extract one sample from each group and run `calculateDiffMeth()`, which will automatically run Fisher's exact test. +Differential DNA methylation is usually calculated by comparing the proportion of methylated Cs in a test sample relative to a control. In simple comparisons between such pairs of samples (i.e. test and control), methods such as Fisher’s exact test can be used. If there are replicates, replicates can be pooled within groups to a single sample per group. This strategy, however, does not take into account biological variability between replicates. We will now show how to compare pairs of samples via the `calculateDiffMeth()` function in `methylKit`. When there is only one sample per sample group, `calculateDiffMeth()` automatically applies Fisher's exact test. We will now extract one sample from each group and run `calculateDiffMeth()`, which will automatically run Fisher's exact test. ```{r fishers,eval=FALSE} getSampleID(meth) new.meth=reorganize(meth,sample.ids=c("test1","ctrl1"),treatment=c(1,0)) dmf=calculateDiffMeth(new.meth) ``` -As mentioned, we can also pool the samples from the same group by adding up the number of Cs and Ts per group. This way even if we have replicated experiments we treat them as single experiments, and can apply Fisher's exact test. We will now pool the samples and apply +As mentioned, we can also pool the samples from the same group by adding up the number of Cs and Ts per group. 
This way even if we have replicated experiments we treat them as single experiments, and can apply Fisher's exact test. We will now pool the samples and apply the `calculateDiffMeth()` function. ```{r pool} pooled.meth=pool(meth,sample.ids=c("test","control")) dm.pooledf=calculateDiffMeth(pooled.meth) ``` -`calculateDiffMeth()` function returns the P-values for all bases or regions in the input methylBase object. We need to filter to get differentially methylated CpGs. This can be done via `getMethlyDiff()` function or simple filtering via `[ ]` notation. Below we show how to filter the `methylDiff` object output by `calculateDiffMeth()` function in order to get differentially methylated CpGs. The function arguments defines cutoff values for the methylation difference between groups and Q-value. In these cases, we require a methylation difference of 25% and Q-value of at least $0.01$. +The `calculateDiffMeth()` function returns the P-values for all bases or regions in the input methylBase object. We need to filter to get differentially methylated CpGs. This can be done via the `getMethlyDiff()` function or simple filtering via `[ ]` notation. Below we show how to filter the `methylDiff` object output by the `calculateDiffMeth()` function in order to get differentially methylated CpGs. The function arguments define cutoff values for the methylation difference between groups and q-value. In these cases, we require a methylation difference of 25% and a q-value of at least $0.01$. ```{r filter} # get differentially methylated bases/regions with specific cutoffs @@ -249,23 +249,23 @@ hyper2=dm.pooledf[dm.pooledf$qvalue < 0.01 & dm.pooledf$meth.diff > 25,] ``` #### Logistic regression based tests -Regression-based methods are generally used to model methylation levels in relation to the sample groups and variation between replicates. Differences between currently available regression methods stem from the choice of distribution to model the data and the variation associated with it. In the simplest case, linear regression\index{linear regression} can be used to model methylation per given CpG or loci across sample groups. The model fits regression coefficients to model the expected methylation proportion values for each CpG site across sample groups. Hence, the null hypothesis of the model coefficients being zero could be tested using t-statistics. However, linear regression based methods might produce fitted methylation levels outside the range $[0,1]$ unless the values are transformed before regression. An alternative is logistic regression\index{logistic regression} , which can deal with data strictly bounded between 0 and 1 and with non-constant variance, such as methylation proportion/fraction values. In the logistic regression, it is assumed that fitted values have variation $np(1-p)$, where $p$ is the fitted methylation proportion for a given sample and n is the read coverage. If the observed variance is larger or smaller than assumed by the model, one speaks of under- or over-dispersion. This over/under-dispersion can be corrected by calculating a scaling factor and using that factor to adjust the variance estimates as in $np(1-p)s$, where $s$ is the scaling factor. MethylKit can apply logistic regression to test the methylation difference with or without the over-dispersion correction. In this case, Chi-square or F-test can be used to compare the difference in the deviances of the null model and the alternative model. 
The null model assumes there is no relationship between sample groups and methylation, and the alternative model assumes that there is a relationship where sample groups are predictive of methylation values for a given CpG or region for which the model is constructed. Next, we are going to use the logistic regression based model with over-dispersion correction and Chi-square test. +Regression-based methods are generally used to model methylation levels in relation to the sample groups and variation between replicates. Differences between currently available regression methods stem from the choice of distribution to model the data and the variation associated with it. In the simplest case, linear regression\index{linear regression} can be used to model methylation per given CpG or loci across sample groups. The model fits regression coefficients to model the expected methylation proportion values for each CpG site across sample groups. Hence, the null hypothesis of the model coefficients being zero could be tested using t-statistics. However, linear-regression-based methods might produce fitted methylation levels outside the range $[0,1]$ unless the values are transformed before regression. An alternative is logistic regression\index{logistic regression}, which can deal with data strictly bounded between 0 and 1 and with non-constant variance, such as methylation proportion/fraction values. In the logistic regression, it is assumed that fitted values have variation $np(1-p)$, where $p$ is the fitted methylation proportion for a given sample and $n$ is the read coverage. If the observed variance is larger or smaller than assumed by the model, one speaks of under- or over-dispersion. This over/under-dispersion can be corrected by calculating a scaling factor and using that factor to adjust the variance estimates as in $np(1-p)s$, where $s$ is the scaling factor. MethylKit can apply logistic regression to test the methylation difference with or without the over-dispersion correction. In this case, Chi-square or F-test can be used to compare the difference in the deviances of the null model and the alternative model. The null model assumes there is no relationship between sample groups and methylation, and the alternative model assumes that there is a relationship where sample groups are predictive of methylation values for a given CpG or region for which the model is constructed. Next, we are going to use the logistic-regression-based model with over-dispersion correction and Chi-square test. ```{r logReg} dm.lr=calculateDiffMeth(meth,overdispersion = "MN",test ="Chisq") ``` -#### Betabinomial distribution based tests -More complex regression models use beta binomial distribution and are particularly useful for better modeling the variance. Similar to logistic regression, their observation follows binomial distribution (number of reads), but methylation proportion itself can vary across samples, according to a beta distribution.\index{betabinomial distribution} It can deal with fitting values in [0,1] range and performs better when there is greater variance than expected by the simple logistic model. In essence, these models have a different way of calculating a scaling factor when there is over-dispersion in the model. 
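+As a companion to the logistic regression call above, the F-test alternative mentioned in the text can be requested through the same interface. This is a minimal sketch, assuming the `test` argument of `calculateDiffMeth()` also accepts `"F"`; it is not evaluated here.
+```{r logRegF,eval=FALSE}
+# same over-dispersion correction, but compare model deviances with an F-test
+dm.lr.f=calculateDiffMeth(meth,overdispersion="MN",test="F")
+```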
Further enhancements are made to these models by using the Empirical Bayes methods that can better estimate hyper parameters of the beta distribution (variance-related parameters) by borrowing information between loci or regions within the genome to aid with inference about each individual loci or region. We are now going to use a beta-binomial based model called DSS[@Feng2014-pd] to calculate differential methylation.
+#### Betabinomial-distribution-based tests
+More complex regression models use the beta binomial distribution and are particularly useful for better modeling the variance. Similar to logistic regression, the observations follow a binomial distribution (number of reads), but the methylation proportion itself can vary across samples, according to a beta distribution.\index{betabinomial distribution} These models can deal with fitting values in the $[0,1]$ range and perform better when there is greater variance than expected by the simple logistic model. In essence, these models have a different way of calculating a scaling factor when there is over-dispersion in the model. Further enhancements are made to these models by using empirical Bayes methods that can better estimate hyperparameters of the beta distribution (variance-related parameters) by borrowing information between loci or regions within the genome to aid with inference about each individual locus or region. We are now going to use a beta-binomial-based model called DSS [@Feng2014-pd] to calculate differential methylation.
```{r dss}
dm.dss=calculateDiffMethDSS(meth)
```
#### Differential methylation for regions rather than base-pairs
-Until now, we worked on differentially methylated cytosines. However,
-working with base-pair resolution data has its problems. Not all the CpGs will be covered in all samples, if covered they may have low coverage which reduces the power of the tests. Instead of base-pairs, we can choose to work with regions. So, it might be desirable to summarize methylation information over pre-defined regions rather than doing base-pair resolution analysis. `methylKit` provides functionality to do such analysis. We can either tile the whole genome to tiles with predefined length, or we can use pre-defined regions such as promoters or CpG islands. This kind of regional analysis is carried out by adding up C and T counts from each covered cytosine and returning a total C and T count for each region.
+Until now, we have worked on differentially methylated cytosines. However,
+working with base-pair resolution data has its problems. Not all the CpGs will be covered in all samples. If covered, they may have low coverage, which reduces the power of the tests. Instead of base-pairs, we can choose to work with regions. So, it might be desirable to summarize methylation information over pre-defined regions rather than doing base-pair resolution analysis. `methylKit` provides functionality to do such analysis. We can either tile the whole genome into tiles of a predefined length, or we can use pre-defined regions such as promoters or CpG islands. This kind of regional analysis is carried out by adding up C and T counts from each covered cytosine and returning a total C and T count for each region.
-The function below tiles the genome with windows 1000bp length and 1000bp step-size and summarizes the methylation information on those tiles. In this case, it returns a `methylRawList` object which can be fed into `unite` and `calculateDiffMeth` functions consecutively to get differentially methylated regions.
+The function below tiles the genome with windows of $1000$ bp length and $1000$ bp step-size and summarizes the methylation information on those tiles. In this case, it returns a `methylRawList` object which can be fed into `unite()` and `calculateDiffMeth()` functions consecutively to get differentially methylated regions.
```{r tileMethylCounts,warning=FALSE}
tiles=tileMethylCounts(myobj,win.size=1000,step.size=1000)
head(tiles[[1]],3)
@@ -273,7 +273,7 @@ head(tiles[[1]],3)
In addition, if we are interested in particular regions, we can also get those regions as methylKit objects after summarizing the methylation information as described above. The code below summarizes the methylation information over a given set of promoter regions and outputs a `methylRaw` or `methylRawList` object depending on the input. We are using the output of `genomation` functions used above to provide the locations of promoters. For regional summary functions, we need to
-provide regions of interest as GRanges object\index{R Packages!\texttt{genomation}}.
+provide regions of interest as GRanges objects\index{R Packages!\texttt{genomation}}.
```{r methregionCounts, eval=TRUE}
library(genomation)
@@ -286,19 +286,19 @@ promoters=regionCounts(myobj,gene.obj$promoters)
head(promoters[[1]])
```
-In addition, it is possible to cluster DMCs based on their proximity and direction of differential methylation. This can be achieved by `methSeg()` function in methylKit. We will see more about `methSeg()` function in the following section.
-But it can take the output of `getMethylDiff()` function therefore can work on DMCs to get differentially methylated regions.
+In addition, it is possible to cluster DMCs based on their proximity and direction of differential methylation. This can be achieved by the `methSeg()` function in methylKit. We will see more about the `methSeg()` function in the following section.
+However, it can also take the output of the `getMethylDiff()` function and can therefore work on DMCs to get differentially methylated regions.
#### Adding covariates
Covariates can be included in the analysis as well in methylKit. The `calculateDiffMeth()` function will then try to separate the influence of the covariates from the treatment effect via the logistic regression model. In this case, we will test
-if full model (model with treatment and covariates) is better than the model with
+if the full model (model with treatment and covariates) is better than the model with
the covariates only. If there is no effect due to the treatment (sample groups), the full model will not explain the data better than the model with covariates only. In `calculateDiffMeth()`, this is achieved by supplying the `covariates` argument in the format of a `data.frame`.
-Below, we simulate methylation data and add make a `data.frame` for the age.
+Below, we simulate methylation data and create a `data.frame` for the age covariate.
The data frame can include more columns, and those columns can also be `factor` variables. The row order of the data.frame should match the order of samples in the `methylBase` object. Below we are showing an example
@@ -321,11 +321,11 @@ my.diffMeth3=calculateDiffMeth(sim.methylBase,
```
### Methylation segmentation
-The analysis of methylation dynamics is not exclusively restricted to differentially methylated regions across samples, apart from this there is also an interest in examining the methylation profiles within the same sample.
Usually, depressions in methylation profiles pinpoint regulatory regions like gene promoters that co-localize with CG-dense CpG islands. On the other hand, many gene-body regions are extensively methylated and CpG-poor [@Bock2012-oh]. These observations would describe a bimodal model of either hyper- or hypomethylated regions dependent on the local density of CpGs [@Lovkvist2016-ky]. However, given the detection of CpG-poor regions with locally reduced levels of methylation (on average 30%) in pluripotent embryonic stem cells and in neuronal progenitors in both mouse and human, a different model seems also reasonable [@Stadler2011-iu]. These low-methylated regions (LMRs) are located distal to promoters, have little overlap with CpG islands and associated with enhancer marks such as p300 binding sites and H3K27ac enrichment. +The analysis of methylation dynamics is not exclusively restricted to differentially methylated regions across samples. Apart from this there is also an interest in examining the methylation profiles within the same sample. Usually, depressions in methylation profiles pinpoint regulatory regions like gene promoters that co-localize with CG-dense CpG islands. On the other hand, many gene-body regions are extensively methylated and CpG-poor [@Bock2012-oh]. These observations would describe a bimodal model of either hyper- or hypomethylated regions depending on the local density of CpGs [@Lovkvist2016-ky]. However, given the detection of CpG-poor regions with locally reduced levels of methylation (on average 30%) in pluripotent embryonic stem cells and in neuronal progenitors in both mouse and human, a different model also seems reasonable [@Stadler2011-iu]. These low-methylated regions (LMRs) are located distal to promoters, have little overlap with CpG islands, and are associated with enhancer marks such as p300 binding sites and H3K27ac enrichment. -Now we are going to try to segment portion for the H1 human embryonic stem cell line. MethylKit \index{R Packages!\texttt{methylKit}}uses change-point analysis to segment the methylome. In change-point analysis, the change-points of a genome-wide methylation signal are recorded and the genome is partitioned into regions between consecutive change points. CpGs in each segment is similar to eachoter more than the following segment. -After segmentation, methylKit function `methSeg()` identifies segments that are further clustered into segment classes using a mixture modeling approach. This clustering is based on only the average methylation level of the segments and allows the detection of distinct methylome features comparable to unmethylated regions (UMRs), lowly methylated regions (LMRs) and fully methylated regions (FMRs) mentioned at [@Stadler2011-yv]. The code snippet below reads the methylation data from H1 cell line as a `GRanges` object, and runs the segmentation with potentially up to classes of segments. Mixture modelling determines the optimal number of segments using a statistic called bayesian information criterion (BIC). BIC is a statistic based on model likelihood and helps us select the model that fits the data better. We have set the number of segment classes to try using `G=1:4` argument.The `minSeg` arguments are related to minimum number of CpGs in the segments. The function `methSeg()` outputs a diagnostic plot for segmentation. This plot is shown in Figure \@ref(fig:segDiag). It shows methylation values and lengths of segments in each segment class, as well as BIC for different number of segments. 
-```{r segDiag, fig.width=14,fig.height=8,fig.cap="Segmentation characteristics shown in different plots. top left: Mean methylation values per segment in each segment class. Top middle: Length of each segment as boxplots for each segment class. Top right: Number of segments in each segment class. Bottom left: distribution of segment methylation values. Bottom right: BIC for different number of segment classes",warning=FALSE,out.width="90%"}
+Now we are going to try to segment a portion of the methylome for the H1 human embryonic stem cell line. MethylKit \index{R Packages!\texttt{methylKit}}uses change-point analysis to segment the methylome. In change-point analysis, the change-points of a genome-wide methylation signal are recorded and the genome is partitioned into regions between consecutive change points. CpGs in each segment are more similar to each other than to CpGs in neighboring segments.
+After segmentation, the methylKit function `methSeg()` identifies segments that are further clustered into segment classes using a mixture modeling approach. This clustering is based on only the average methylation level of the segments and allows the detection of distinct methylome features comparable to unmethylated regions (UMRs), lowly methylated regions (LMRs), and fully methylated regions (FMRs) mentioned in Stadler et al. [@Stadler2011-yv]. The code snippet below reads the methylation data from the H1 cell line as a `GRanges` object, and runs the segmentation with potentially up to 4 classes of segments. Mixture modeling determines the optimal number of segments using a statistic called the Bayesian information criterion (BIC). The BIC is a statistic based on model likelihood and helps us select the model that fits the data better. We have set the number of segment classes to try using the `G=1:4` argument. The `minSeg` argument controls the minimum number of CpGs in the segments. The function `methSeg()` outputs a diagnostic plot for segmentation. This plot is shown in Figure \@ref(fig:segDiag). It shows methylation values and lengths of segments in each segment class, as well as the BIC for different numbers of segments.
+```{r segDiag, fig.width=14,fig.height=8,fig.cap="Segmentation characteristics shown in different plots. Top left: Mean methylation values per segment in each segment class. Top middle: Length of each segment as boxplots for each segment class. Top right: Number of segments in each segment class. Bottom left: Distribution of segment methylation values. Bottom right: BIC for different numbers of segment classes",warning=FALSE,out.width="90%"}
# read methylation data
methFile=system.file("extdata","H1.chr21.chr22.rds",
@@ -338,8 +338,8 @@ res=methSeg(mbw,minSeg=10,G=1:4,
```
-In this case, we know that BIC does not improve much after 4 segment classes. Now, we will not have a look at the characteristics of the segment classes. We are going to plot mean methylation value and the length of the segment as a scatter plot, the result of this plot is shown in Figure \@ref(fig:segplot).
-```{r segplot, fig.cap="Scatter plot of segment mean methylation values versus segment length. Each dot is a segment identified by `methSeg()` function."}
+In this case, we know that BIC does not improve much after 4 segment classes. Now we will have a look at the characteristics of the segment classes. We are going to plot the mean methylation value and the length of the segment as a scatter plot; the result of this plot is shown in Figure \@ref(fig:segplot).
+```{r segplot, fig.cap="Scatter plot of segment mean methylation values versus segment length. Each dot is a segment identified by the methSeg() function."}
# plot
plot(res$seg.mean,
log10(width(res)),pch=20,
@@ -349,15 +349,15 @@ plot(res$seg.mean,
```
-The highly methylated segment classes that have more than 70% methylation are usually longer, median length is 17889 bp. The segment class that has the lowest methylation values have median length of 1376 bp and the shortest segment class has low to medium methylation level, has median length of 412 bp.
+The highly methylated segment classes that have more than 70% methylation are usually longer; the median length is 17889 bp. The segment class that has the lowest methylation values has a median length of 1376 bp, and the shortest segment class has a low to medium methylation level, with a median length of 412 bp.
### Working with large files
-We might want to perform differential methylation analysis in R using whole genome methylation data of multiple samples. The problem is that for genome-wide experiments, file sizes can easily range from hundreds of megabytes to gigabytes and processing multiple instances of those files in memory (RAM) might become unfeasible unless we have access to a high performance cluster (HPC) with extensive RAM. If we want to use a desktop computer or laptop with limited RAM, we either need to restrict our analysis to a subset of the data or use packages that can handle this situation.
+We might want to perform differential methylation analysis in R using whole genome methylation data of multiple samples. The problem is that for genome-wide experiments, file sizes can easily range from hundreds of megabytes to gigabytes and processing multiple instances of those files in memory (RAM) might become unfeasible unless we have access to a high-performance compute cluster (HPC) with extensive RAM. If we want to use a desktop computer or laptop with limited RAM, we either need to restrict our analysis to a subset of the data or use packages that can handle this situation.
-The methylKit package provides the capability of dealing large files and high number of samples by exploiting flat file databases to substitute in-memory objects. The internal data apart from meta information has a tabular structure storing chromosome, start/end position, strand information of the associated CpG base just like many other biological formats like BED, GFF or SAM. By exporting this tabular data into a TAB-delimited file and making sure it is accordingly position-sorted it can be indexed using the generic [tabix tool](http://www.htslib.org/doc/tabix.html). In general tabix indexing is a generalization of BAM\index{BAM file} indexing for generic TAB-delimited files. It inherits all the advantages of BAM indexing, including data compression and efficient random access in terms of few seek function calls per query [@Li2011-wc]. `MethylKit` relies on [`Rsamtools`](http://bioconductor.org/packages/release/bioc/html/Rsamtools.html) which implements tabix functionality for R and this way internal methylKit objects can be efficiently stored as compressed file on the disk and still \index{R Packages!\texttt{Rsamtools}}be fast accessed. Another advantage is that existing compressed files can be loaded in interactive sessions, allowing the backup and transfer of intermediate analysis results.
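+Before committing to disk-backed objects, it can help to check how much memory the in-memory objects actually occupy. The following is a minimal sketch using base R's `object.size()`; the chunk is our own addition and is not evaluated as part of the analysis.
+```{r objSizeCheck,eval=FALSE}
+# approximate memory footprint of an in-memory methylKit object
+format(object.size(meth), units = "Mb")
+```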
+The methylKit package provides the capability of dealing with large files and high numbers of samples by exploiting flat file databases to substitute in-memory objects. The internal data, apart from meta information, has a tabular structure storing chromosome, start/end position, and strand information of the associated CpG base just like many other biological formats like BED, GFF or SAM. By exporting this tabular data into a TAB-delimited file and making sure it is accordingly position-sorted, it can be indexed using the generic [tabix tool](http://www.htslib.org/doc/tabix.html). In general, tabix indexing is a generalization of BAM\index{BAM file} indexing for generic TAB-delimited files. It inherits all the advantages of BAM indexing, including data compression and efficient random access in terms of few seek function calls per query [@Li2011-wc]. `MethylKit` relies on [`Rsamtools`](http://bioconductor.org/packages/release/bioc/html/Rsamtools.html) which implements tabix functionality for R. This way internal methylKit objects can be efficiently stored as a compressed file on the disk and still \index{R Packages!\texttt{Rsamtools}}be quickly accessed. Another advantage is that existing compressed files can be loaded in interactive sessions, allowing the backup and transfer of intermediate analysis results. -`methylKit` provides the capability for storing objects in tabix format within various functions. Every methylKit object has their tabix-based flat-file database equivalent. For example, when reading a methylation call file the `dbtype` argument can be provided, this will create tabix based objects. +`methylKit` provides the capability for storing objects in tabix format within various functions. Every methylKit object has its tabix-based flat-file database equivalent. For example, when reading a methylation call file, the `dbtype` argument can be provided, which will create tabix-based objects. ```{r tabix,eval=FALSE} myobj=methRead( file.list, sample.id=list("test1","test2","ctrl1","ctrl2"), @@ -367,7 +367,7 @@ The methylKit package provides the capability of dealing large files and high nu The advantage of tabix-based objects is of course saving memory and more efficient parallelization for differential methylation calculation. However, since the data is written to a file and indexed whenever a new object is created, working with tabix-based objects will be slower at certain steps of the analysis compared to in-memory objects. ## Annotation of DMRs/DMCs and segments -The regions of interest obtained through differential methylation or segmentation analysis often need to be integrated with genome annotation datasets. Without this type of integration, differential methylation or segmentation results will be hard to interpret in biological terms. The most common annotation task is to see where regions of interest land in relation to genes and gene parts and regulatory regions: Do they mostly occupy promoter, intronic or exonic regions? Do they overlap with repeats? Do they overlap with other epigenomic markers or long-range regulatory regions? These questions are not specific to methylation −nearly all regions of interest obtained via genome-wide studies have to deal with such questions. Thus, there are already multiple software tools that can produce such annotations. One is the Bioconductor package [`genomation`](http://bioconductor.org/packages/release/bioc/html/genomation.html)[@Akalin2015-yk]. 
\index{R Packages!\texttt{genomation}}It can be used to annotate DMRs/DMCs and it can also be used to integrate methylation proportions over the genome with other quantitative information and produce meta-gene plots or heatmaps. Below, we are reading a BED file for transcripts and using that to annotate DMCs with promoter/intron/exon/intergenic annotation.`genomation::readTranscriptFeatures()` function reads a BED12 file, calculates the coordinates of promoters, exons and introns and the subsequent function uses that information for annotation.
+The regions of interest obtained through differential methylation or segmentation analysis often need to be integrated with genome annotation datasets. Without this type of integration, differential methylation or segmentation results will be hard to interpret in biological terms. The most common annotation task is to see where regions of interest land in relation to genes, gene parts, and regulatory regions: Do they mostly occupy promoter, intronic or exonic regions? Do they overlap with repeats? Do they overlap with other epigenomic markers or long-range regulatory regions? These questions are not specific to methylation; nearly all regions of interest obtained via genome-wide studies have to deal with such questions. Thus, there are already multiple software tools that can produce such annotations. One is the Bioconductor package [`genomation`](http://bioconductor.org/packages/release/bioc/html/genomation.html)[@Akalin2015-yk]. \index{R Packages!\texttt{genomation}}It can be used to annotate DMRs/DMCs and it can also be used to integrate methylation proportions over the genome with other quantitative information and produce meta-gene plots or heatmaps. Below, we are reading a BED file for transcripts and using that to annotate DMCs with promoter/intron/exon/intergenic annotation. The `genomation::readTranscriptFeatures()` function reads a BED12 file, calculates the coordinates of promoters, exons, and introns, and the subsequent function uses that information for annotation.
```{r annotMeth}
library(genomation)
@@ -399,22 +399,22 @@ diffCpGann=annotateWithFeatureFlank(as(all.diff,"GRanges"),
feature.name="CpGi",flank.name="shores")
```
-Besides these, DMRs/DMCs might be associated with changes in gene regulation. It might be desirable to overlap them with known transcription binding sites or motifs or histone modifications. These are simply overlap operations for these kinds of analysis you can use `genomation::annotateWithFeature()` function or any other approach shown in Chapter \@ref(genomicIntervals), you can also do motif discovery with methods shown in Chapter \@ref(chipseq).
+Besides these, DMRs/DMCs might be associated with changes in gene regulation. It might be desirable to overlap them with known transcription factor binding sites, motifs, or histone modifications. These are simply overlap operations. For these kinds of analyses, you can use the `genomation::annotateWithFeature()` function or any other approach shown in Chapter \@ref(genomicIntervals), and you can also do motif discovery with methods shown in Chapter \@ref(chipseq).
### Further annotation with genes or gene sets
-The next obvious steps for annotating your DMRs/DMCs are figuring out which genes they are associated with. Figuring out which genes are associated with your regions of interest can give a better idea on biological implications of the methylation changes.
Once you have gene set you can do gene set analysis as shown in Chapter \@ref(rnaseqanalysis) or in Chapter \@ref(multiomics). There are also packages such as [`rGREAT`](https://www.bioconductor.org/packages/release/bioc/html/rGREAT.html) that can simultaneosuly associate DMRs or any other region of interest to genes and do gene set analysis.
+The next obvious step for annotating your DMRs/DMCs is figuring out which genes they are associated with. Figuring out which genes are associated with your regions of interest can give a better idea of the biological implications of the methylation changes. Once you have your gene set, you can do gene set analysis as shown in Chapter \@ref(rnaseqanalysis) or in Chapter \@ref(multiomics). There are also packages such as [`rGREAT`](https://www.bioconductor.org/packages/release/bioc/html/rGREAT.html) that can simultaneously associate DMRs or any other region of interest to genes and do gene set analysis.
## Other R packages that can be used for methylation analysis
-- [DSS](http://bioconductor.org/packages/release/bioc/html/genomation.html) beta-binomial models with Empirical Bayes for moderating dispersion.
-- [BSseq](http://bioconductor.org/packages/release/bioc/html/BSseq.html) Regional differential methylation analysis using smoothing and linear regression based tests.
-- [BiSeq](http://bioconductor.org/packages/release/bioc/html/BiSeq.html) Regional differential methylation analysis using beta-binomial models
+- [DSS](http://bioconductor.org/packages/release/bioc/html/DSS.html) Beta-binomial models with empirical Bayes for moderating dispersion.
+- [BSseq](http://bioconductor.org/packages/release/bioc/html/BSseq.html) Regional differential methylation analysis using smoothing and linear-regression-based tests.
+- [BiSeq](http://bioconductor.org/packages/release/bioc/html/BiSeq.html) Regional differential methylation analysis using beta-binomial models.
- [MethylSeekR](http://bioconductor.org/packages/release/bioc/html/MethylSeekR.html): Methylome segmentation using HMM and cutoffs.
-- [QuasR](http://bioconductor.org/packages/release/bioc/html/QuasR.html): Methylation aware alignment and methylation calling. As well as fastQC-like fastq raw data quality check features.
+- [QuasR](http://bioconductor.org/packages/release/bioc/html/QuasR.html): Methylation aware alignment and methylation calling, as well as fastQC-like fastq raw data quality check features.
## Exercises
### Differential methylation
-The main objective of this exercise is getting differential methylated cytosines between two groups of samples: IDH-mut (AML patients with IDH mutations) vs NBM (normal bone marrow samples).
+The main objective of this exercise is getting differentially methylated cytosines between two groups of samples: IDH-mut (AML patients with IDH mutations) vs. NBM (normal bone marrow samples).
1. Download methylation call files from GEO. These files are readable by methylKit using default `methRead` arguments. [Difficulty: **Beginner**]
@@ -432,18 +432,18 @@ m=methRead("~/Downloads/GSM919982_NBM_1_myCpG.txt.gz",
sample.id = "idh",assembly="hg18")
```
-2. Find differentially methylated cytosines. Use chr1 and chr2 only if you need to save time. You can subset it after you download the files either in R or unix. The files are for hg18 assembly of human genome.[Difficulty: **Beginner**]
-3. Describe the general differential methylation trend, what is the main effect for most CpGs ? [Difficulty: **Intermediate**]
-4.
Annotate differentially methylated cytosines (DMCs) as promoter/intron/exon ? [Difficulty: **Beginner**]
-5. Which genes are the nearest to DMCs ? [Difficulty: **Intermediate**]
-6. Can you do gene set analysis either in R or via web-based tools ? [Difficulty: **Advanced**]
+2. Find differentially methylated cytosines. Use chr1 and chr2 only if you need to save time. You can subset the data after you download the files either in R or Unix. The files are for the hg18 assembly of the human genome. [Difficulty: **Beginner**]
+3. Describe the general differential methylation trend. What is the main effect for most CpGs? [Difficulty: **Intermediate**]
+4. Annotate differentially methylated cytosines (DMCs) as promoter/intron/exon. [Difficulty: **Beginner**]
+5. Which genes are the nearest to DMCs? [Difficulty: **Intermediate**]
+6. Can you do gene set analysis either in R or via web-based tools? [Difficulty: **Advanced**]
### Methylome segmentation
The main objective of this exercise is to learn how to do methylome segmentation and the downstream analysis for annotation and data integration.
-1. Download the human embryonic stem-cell (H1 Cell Line) methylation bigWig files from [Roadmap Epigenomics website](http://egg2.wustl.edu/roadmap/web_portal/processed_data.html#MethylData). It may take a while to understand how the website is structured and which bigWig file to use. That is part of the exercise. The files you will download are for hg19 assembly unless stated otherwise. [Difficulty: **Beginner**]
+1. Download the human embryonic stem-cell (H1 Cell Line) methylation bigWig files from the [Roadmap Epigenomics website](http://egg2.wustl.edu/roadmap/web_portal/processed_data.html#MethylData). It may take a while to understand how the website is structured and which bigWig file to use. That is part of the exercise. The files you will download are for hg19 assembly unless stated otherwise. [Difficulty: **Beginner**]
2. Do segmentation on hESC methylome. You can only use chr1 if using the whole genome takes too much time. [Difficulty: **Intermediate**]
-3. Annotate segments, what kind of gene-based features each segment class overlaps with (promoter/exon/intron) [Difficulty: **Beginner**]
-4. For each segment type, annotate the segments with chromHMM annotations from Roadmap Epigenome database available [here](https://egg2.wustl.edu/roadmap/web_portal/chr_state_learning.html#core_15state), the specific file you should use is [here](https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/E003_15_coreMarks_mnemonics.bed.gz). This is a bed file with chromHMM annotations. chromHMM annotations are parts of the genome identified by a hidden-markov-model based machine-learning algorithm. The segments correspond to active promoters, enhancers, active transcription, insulators. etc. The chromHMM model uses histone modification ChIP-seq and potentially other ChIP-seq data sets to annotate the genome.[Difficulty: **Advanced**]
+3. Annotate the segments and describe what kinds of gene-based features each segment class overlaps with (promoter/exon/intron). [Difficulty: **Beginner**]
+4. For each segment type, annotate the segments with chromHMM annotations from the Roadmap Epigenome database available [here](https://egg2.wustl.edu/roadmap/web_portal/chr_state_learning.html#core_15state). The specific file you should use is [here](https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/E003_15_coreMarks_mnemonics.bed.gz).
This is a BED file with chromHMM annotations. chromHMM annotations are parts of the genome identified by a hidden-Markov-model-based machine learning algorithm. The segments correspond to active promoters, enhancers, active transcription, insulators, etc. The chromHMM model uses histone modification ChIP-seq and potentially other ChIP-seq data sets to annotate the genome. [Difficulty: **Advanced**]
diff --git a/11-multiomics-analysis.Rmd b/11-multiomics-analysis.Rmd
index 965e142..ad396e6 100644
--- a/11-multiomics-analysis.Rmd
+++ b/11-multiomics-analysis.Rmd
@@ -14,11 +14,11 @@ knitr::opts_chunk$set(echo = TRUE,
-\index{multi-omics}Living cells are a symphony of complex processes. Modern sequencing technology has lead to many comprehensive assays being routinely available to experimenters, giving us different ways to peek at the internal doings of the cells, each experiment revealing a different part of some underlying processes. As an example, most cells have the same DNA, but sequencing the genome of a cell allows us to find mutations and structural alterations that drive tumerogenesis in cancer. If we treat the DNA with bisulfite prior to sequencing, cytosine residues are converted to uracil, but 5-methylcytosine residues are unaffected. This allows us to probe the methylation patterns of the genome, or its methylome. By sequencing the mRNA molecules in a cell, we can calculate the abundance, in different samples, of different mRNA transcripts, or uncover its transcriptome. Performing different experiments on the same samples, for instance RNA-seq, DNA-seq, and BS-seq, results in multi-dimensional omics datasets, which enable the study of relationships between different biological processes, e.g. DNA methylation, mutations, and gene expression, and the leveraging of multiple data types to draw inferences about biological systems. This chapter provides an overview of some of the available methods for such analyses, focusing on matrix factorization approaches. In the examples in this chapter we will demonstrate how these methods are applicable to cancer molecular subtyping, i.e. finding tumors which are driven by the same molecular processes.
+\index{multi-omics}Living cells are a symphony of complex processes. Modern sequencing technology has led to many comprehensive assays being routinely available to experimenters, giving us different ways to peek at the internal doings of the cells, each experiment revealing a different part of some underlying processes. As an example, most cells have the same DNA, but sequencing the genome of a cell allows us to find mutations and structural alterations that drive tumorigenesis in cancer. If we treat the DNA with bisulfite prior to sequencing, cytosine residues are converted to uracil, but 5-methylcytosine residues are unaffected. This allows us to probe the methylation patterns of the genome, or its methylome. By sequencing the mRNA molecules in a cell, we can calculate the abundance, in different samples, of different mRNA transcripts, or uncover its transcriptome. Performing different experiments on the same samples, for instance RNA-seq, DNA-seq, and BS-seq, results in multi-dimensional omics datasets, which enable the study of relationships between different biological processes, e.g. DNA methylation, mutations, and gene expression, and the leveraging of multiple data types to draw inferences about biological systems.
This chapter provides an overview of some of the available methods for such analyses, focusing on matrix factorization approaches. In the examples in this chapter we will demonstrate how these methods are applicable to cancer molecular subtyping, i.e. finding tumors which are driven by the same molecular processes.
-### Use case: Multi-omics data from colorectal cancer
+## Use case: Multi-omics data from colorectal cancer
-\index{multi-omics}\index{colorectal cancer}The examples in this chapter will use the following data: a set of 121 tumors from the TCGA [@tcga_pan_cancer] cohorts of Colon and Rectum adenocarcinoma. The tumors have been profiled for gene expression using RNA-seq, mutations using Exome-seq, and copy number variations using genotyping arrays. Projects such as TCGA have turbocharged efforts to sub-divide cancer into subtypes. Although two tumors arise in the colon, they may have distinct molecular profiles, which is important for treatment decisions. The subset of tumors used in this chapter belong to two distinct molecular subtypes defined by the Colorectal Cancer Subtyping Consortium [@cmscc], _CMS1_ and _CMS3_. The following code snippets load this multi-omics data from the companion package, starting with gene expression data from RNA-seq (see Chapter \@ref(rnaseqanalysis)). Below we are reading the RNA-seq data from the `compGenomRData` package.
+\index{multi-omics}\index{colorectal cancer}The examples in this chapter will use the following data: a set of 121 tumors from the TCGA [@tcga_pan_cancer] colorectal cancer cohort. The tumors have been profiled for gene expression using RNA-seq, mutations using Exome-seq, and copy number variations using genotyping arrays. Projects such as TCGA have turbocharged efforts to sub-divide cancer into subtypes. Even if two tumors arise in the colon, they may have distinct molecular profiles, which is important for treatment decisions. The subset of tumors used in this chapter belongs to two distinct molecular subtypes defined by the Colorectal Cancer Subtyping Consortium [@cmscc], _CMS1_ and _CMS3_. The following code snippets load this multi-omics data from the companion package, starting with gene expression data from RNA-seq (see Chapter \@ref(rnaseqanalysis)). Below we are reading the RNA-seq data from the `compGenomRData` package.
```{r,moloadMultiomicsGE, tidy=FALSE}
# read in the csv from the companion package as a data frame
csvfile <- system.file("extdata", "multi-omics", "COREAD_CMS13_gex.csv",
@@ -29,11 +29,11 @@ rownames(x1) <- sapply(strsplit(rownames(x1), "\\|"), function(x) x[1])
# Output a table
knitr::kable(head(t(head(x1))), caption="Example gene expression data (head)")
```
-Table \@ref(tab:moloadMultiomicsGE) shows the head of the gene expression matrix. The rows correspond to patients, referred to by their TCGA identifier as the first column of the table. Columns represent the genes, and values are RPKM expression values. The column names are the names or symbols of the genes.
+Table \@ref(tab:moloadMultiomicsGE) shows the head of the gene expression matrix. The rows correspond to patients, referred to by their TCGA identifier in the first column of the table. Columns represent the genes, and the values are RPKM expression values. The column names are the names or symbols of the genes.
The details about how these expression values are calculated are in Chapter \@ref(rnaseqanalysis).
-**read mutation data**:
+We first **read mutation data** with the following code snippet.
```{r,moloadMultiomicsMUT, tidy=FALSE}
# read in the csv from the companion package as a data frame
csvfile <- system.file("extdata", "multi-omics", "COREAD_CMS13_muts.csv",
@@ -45,9 +45,9 @@ x2[x2>0]=1
# output a table
knitr::kable(head(t(head(x2))), caption="Example mutation data (head)")
```
-Table \@ref(tab:moloadMultiomicsMUT) shows the mutations of these tumors (mutations were introduced in Chapter \@ref(intro)). In the mutation matrix, each cell is a binary 1/0, indicating whether or not a tumor has a non-synonymous mutation in the gene indicated by the column. These types of mutations change the aminoacid sequence therefore they are likely to change the function of the protein.
+Table \@ref(tab:moloadMultiomicsMUT) shows the mutations of these tumors (mutations were introduced in Chapter \@ref(intro)). In the mutation matrix, each cell is a binary 1/0, indicating whether or not a tumor has a non-synonymous mutation in the gene indicated by the column. These types of mutations change the amino acid sequence; therefore, they are likely to change the function of the protein.
-**read copy number data**:
+Next, we **read copy number data** with the following code snippet.
```{r,moloadMultiomicsCNV, tidy=FALSE}
# read in the csv from the companion package as a data frame
csvfile <- system.file("extdata", "multi-omics", "COREAD_CMS13_cnv.csv",
@@ -59,7 +59,7 @@ knitr::kable(head(t(head(x3))),
```
Finally, table \@ref(tab:moloadMultiomicsCNV) shows GISTIC scores [@mermel2011gistic2] for copy number alterations in these tumors. During transformation from healthy cells to cancer cells, the genome sometimes undergoes large-scale instability; large segments of the genome might be replicated or lost. This will be reflected in each segment's "copy number". In this matrix, each column corresponds to a chromosome segment, and the value of the cell is a real-valued score indicating if this segment has been amplified (copied more) or lost, relative to a non-cancer control from the same patient.
-Each of the data types (gene expression, mutations, copy number variation) on its own, provides some signal which allows to somewhat separate the samples into the two different subtypes. In order to explore these relations, we must first obtain the subtypes of these tumors.
+Each of the data types (gene expression, mutations, copy number variation) on its own provides some signal that allows us to somewhat separate the samples into the two different subtypes. In order to explore these relations, we must first obtain the subtypes of these tumors.
The following code snippet reads these, also from the companion package: ```{r,moloadCOREADSubsypes} # read in the csv from the companion package as a data frame @@ -67,17 +67,18 @@ csvfile <- system.file("extdata", "multi-omics", "COREAD_CMS13_subtypes.csv", package="compGenomRData") covariates <- read.csv(csvfile, row.names=1) # Fix the TCGA identifiers so they match up with the omics data -rownames(covariates) <- gsub(pattern = '-', replacement = '\\.', rownames(covariates)) +rownames(covariates) <- gsub(pattern = '-', replacement = '\\.', + rownames(covariates)) covariates <- covariates[colnames(x1),] # create a dataframe which will be used to annotate later graphs anno_col <- data.frame(cms=as.factor(covariates$cms_label)) rownames(anno_col) <- rownames(covariates) ``` -Before proceding with any multi-omics integration analysis which might obscure the underlying data, it is important to take a look at each omic data type on its own, and in this case in particular, to examine their relation to the underlying condition, i.e. the cancer subtype. A great way to get an eagle-eye view of such large data is using heatmaps (see Chapter \@ref(unsupervisedLearning) for more details). +Before proceeding with any multi-omics integration analysis which might obscure the underlying data, it is important to take a look at each omic data type on its own, and in this case in particular, to examine their relation to the underlying condition, i.e. the cancer subtype. A great way to get an eagle-eye view of such large data is using heatmaps (see Chapter \@ref(unsupervisedLearning) for more details). We will first check the gene expression data in relation to the subtypes. One way of doing that is plotting a heatmap and clustering the tumors, while displaying a color annotation atop the heatmap, indicating which subtype each tumor belongs to. This is shown in Figure \@ref(fig:mogeneExpressionHeatmap), which is generated by the following code snippet: -```{r,mogeneExpressionHeatmap, out.width='60%', fig.cap="Heatmap of gene expression data for colorectal cancers"} +```{r,mogeneExpressionHeatmap, out.width='60%', fig.cap="Heatmap of gene expression data for colorectal cancers."} pheatmap::pheatmap(x1, annotation_col = anno_col, show_colnames = FALSE, @@ -88,7 +89,7 @@ pheatmap::pheatmap(x1, In Figure \@ref(fig:mogeneExpressionHeatmap), each column is a tumor, and each row is a gene. The values in the cells are FPKM values. There is another band above the heatmap annotating each column (tumor) with its corresponding subtype. The tumors are clustered using hierarchical clustering denoted by the dendrogram above the heatmap, according to which the columns (tumors) are ordered. While this ordering corresponds somewhat to the subtypes, it would not be possible to cut this dendrogram in a way which achieves perfect separation between the subtypes. Next we repeat the same exercise using the mutation data. The following snippet generates Figure \@ref(fig:momutationsHeatmap): -```{r,momutationsHeatmap,fig.cap="Heatmap of mutation data for colorectal cancers"} +```{r,momutationsHeatmap,fig.cap="Heatmap of mutation data for colorectal cancers."} pheatmap::pheatmap(x2, annotation_col = anno_col, show_colnames = FALSE, @@ -99,7 +100,7 @@ pheatmap::pheatmap(x2, An examination of Figure \@ref(fig:momutationsHeatmap) shows that tumors clustered and ordered by mutation data correspond very closely to their CMS subtypes. However, one should be careful in drawing conclusions about this result. 
Upon closer examination, you might notice that the separating factor seems to be that CMS1 tumors have significantly more mutations than do CMS3 tumors. This, rather than mutations in specific genes, seems to be driving this clustering result. Nevertheless, this hyper-mutated status is an important indicator for this subtype. Finally, we look into copy number variation data and try to see if clustered samples are in concordance with subtypes. The following code snippet generates Figure \@ref(fig:moCNVHeatmap):
-```{r,moCNVHeatmap,fig.cap="Heatmap of copy number variation data, colorectal cancers"}
+```{r,moCNVHeatmap,fig.cap="Heatmap of copy number variation data, colorectal cancers."}
pheatmap::pheatmap(x3,
annotation_col = anno_col,
show_colnames = FALSE,
@@ -109,13 +110,13 @@ pheatmap::pheatmap(x3,
The interpretation of Figure \@ref(fig:moCNVHeatmap) is left as an exercise for the reader.
-It is clear that while there is some "signal" in each of these omics types, as is evident from the heatmaps above, it is equally clear that none of these omics types completely and on its own explains the subtypes. Each omics type provides but a glimpse into what makes each of these tumors different from a healthy cell. Through the rest of this chapter, we will demonstrate how analyzing the gene expression, mutations, and copy number variations, in tandem, we will be able to get a better picture of what separates these cancer subtypes.
+It is clear that while there is some "signal" in each of these omics types, as is evident from these heatmaps, it is equally clear that none of these omics types completely and on its own explains the subtypes. Each omics type provides but a glimpse into what makes each of these tumors different from a healthy cell. Through the rest of this chapter, we will demonstrate how, by analyzing the gene expression, mutations, and copy number variations in tandem, we can get a better picture of what separates these cancer subtypes.
The next section will describe latent variable models for multi-omics integrations. Latent variable models are a form of dimensionality reduction (see Chapter \@ref(unsupervisedLearning)). Each omics data type is "big data" in its own right; a typical RNA-seq experiment profiles upwards of 50 thousand different transcripts. The difficulties in handling large data matrices are only exacerbated by the introduction of more omics types into the analysis, as we are suggesting here. In order to overcome these challenges, latent variable models are a powerful way to reduce the dimensionality of the data down to a manageable size.
## Latent variable models for multi-omics integration
-\index{unsupervised learning}Unsupervised multi-omics integration methods are methods that look for patterns within and across data types, in a label-agnostic fashion, i.e. without knowledge of the identity or label of the analyzed samples (e.g. cell type, tumor/normal). This chapter focuses on latent variable models, a form of dimensionality reduction technique (see Chapter \@ref(unsupervisedLearning)). Latent variable models make an assumption that the high dimensional data we observe (e.g. counts of tens of thousands of mRNA molecules) arise from a lower dimension description. The variables in that lower dimensional description are termed _Latent Variables_, as they are believed to be latent in the data, but not directly observable through experimentation. Therefore, there is a need for methods to infer the latent variables from the data.
For instance, (see Chapter \@ref(rnaseqanalysis) for details of RNA-seq analysis) the relative abundance of different mRNA molecules in a cell is largely determined by the cell type. There are other experiments which may be used to discern the cell type of cells (e.g. looking at them under a microscope), but an RNA-seq experiment does not, directly, reveal whether the analyzed sample was taken from one organ or another. A latent variable model would set the cell type as a latent variable, and the observable abundance of mRNA molecules to be dependent on the value of the latent variable (e.g. if the latent variable is "Regulatory T-cell", we would expect to find high expression of CD4, FOXP3, and CD25). +\index{unsupervised learning}Unsupervised multi-omics integration methods are methods that look for patterns within and across data types, in a label-agnostic fashion, i.e. without knowledge of the identity or label of the analyzed samples (e.g. cell type, tumor/normal). This chapter focuses on latent variable models, a form of dimensionality reduction technique (see Chapter \@ref(unsupervisedLearning)). Latent variable models make an assumption that the high-dimensional data we observe (e.g. counts of tens of thousands of mRNA molecules) arise from a lower dimension description. The variables in that lower dimensional description are termed _latent variables_, as they are believed to be latent in the data, but not directly observable through experimentation. Therefore, there is a need for methods to infer the latent variables from the data. For instance, (see Chapter \@ref(rnaseqanalysis) for details of RNA-seq analysis) the relative abundance of different mRNA molecules in a cell is largely determined by the cell type. There are other experiments which may be used to discern the cell type of cells (e.g. looking at them under a microscope), but an RNA-seq experiment does not, directly, reveal whether the analyzed sample was taken from one organ or another. A latent variable model would set the cell type as a latent variable, and the observable abundance of mRNA molecules to be dependent on the value of the latent variable (e.g. if the latent variable is "Regulatory T-cell", we would expect to find high expression of CD4, FOXP3, and CD25). ## Matrix factorization methods for unsupervised multi-omics data integration @@ -134,13 +135,13 @@ As we normally seek a latent variable model with a considerably lower dimensiona The loss function we choose to minimize may be further subject to some constraints or regularization terms\index{regularization}\index{loss function}\index{optimization}. Regularization has been introduced in Chapter \@ref(supervisedLearning). In the current context of latent factor models, a regularization term might be added to the loss function, i.e. we might choose to minimize $min~\|X-WH\| + \lambda \|W\|^2$ (this is called $L_2$-regularization) instead of merely the reconstruction error. Adding such a term to our loss function here will push the $W$ matrix entries towards 0, in effect balancing between better reconstruction of the data and a more parsimonious model. A more parsimonious latent factor model is one with more sparsity in the latent factors. This sparsity is desirable for model interpretation, as will become evident in later sections. -```{r,momatrixFactorization,fig.cap="General matrix factorization framework. The data matrix on the left hand side is decomposed into factors on the right hand side. 
The equality may be an approximation as some matrix factorization methods are lossless (exact), while others are an approximation.",fig.align = 'center',out.width='75%',echo=FALSE} +```{r,momatrixFactorization,fig.cap="General matrix factorization framework. The data matrix on the left-hand side is decomposed into factors on the right-hand side. The equality may be an approximation as some matrix factorization methods are lossless (exact), while others are an approximation.",fig.align = 'center',out.width='75%',echo=FALSE} knitr::include_graphics("images/matrix_factorization.png" ) ``` In Figure \@ref(fig:momatrixFactorization), the $5 \times 4$ data matrix $X$ is decomposed to a 2-dimensional latent variable model. -### Multiple Factor Analysis +### Multiple factor analysis \index{multiple factor analysis}Multiple factor analysis is a natural starting point for a discussion about matrix factorization methods for integrating multiple data types. It is a straightforward extension of PCA into the domain of multiple data types [^mfamca]. @@ -161,7 +162,7 @@ X = \begin{bmatrix} X_{L} \end{bmatrix} = WH, $$ -a joint decomposition of the different data matrices ($X_i$) into the factor matrix $W$ and the latent variable matrix $H$. This way, we can leverage the ability of PCA to find the highest variance decomposition of the data, when the data consists of different omics types. As a reminder, PCA finds the linear combinations of the features which, when the data is projected onto them, preserve the most variance of any $K$ dimensional space. But because measurements from different experiments have different scales, they will also have variance (and co-variance) at different scales. +a joint decomposition of the different data matrices ($X_i$) into the factor matrix $W$ and the latent variable matrix $H$. This way, we can leverage the ability of PCA to find the highest variance decomposition of the data, when the data consists of different omics types. As a reminder, PCA finds the linear combinations of the features which, when the data is projected onto them, preserve the most variance of any $K$-dimensional space. But because measurements from different experiments have different scales, they will also have variance (and co-variance) at different scales. Multiple Factor Analysis addresses this issue and achieves balance among the data types by normalizing each of the data types, before stacking them and passing them on to PCA. Formally, MFA is given by @@ -175,7 +176,7 @@ X_n = \begin{bmatrix} $$ where $\lambda^{(i)}_1$ is the first eigenvalue of the principal component decomposition of $X_i$. -Following this normalization step, we apply PCA to $X_n$. From there on, MFA analysis is the same as PCA analysis, and we refer the reader to chapter \@ref(unsupervisedLearning) for more details. +Following this normalization step, we apply PCA to $X_n$. From there on, MFA analysis is the same as PCA analysis, and we refer the reader to Chapter \@ref(unsupervisedLearning) for more details. #### MFA in R @@ -188,7 +189,7 @@ r.mfa <- FactoMineR::MFA( graph=FALSE) ``` -Since this generates a two-dimensional factorization of the multi-omcis data, we can now plot each tumor as a dot in a 2D scatter plot to see how well the MFA factors separate the cancer subtypes. 
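+Before plotting, it can be useful to check how much variance the leading MFA dimensions capture. This is a minimal sketch, assuming the result object exposes the standard FactoMineR eigenvalue table in `r.mfa$eig`; it is not evaluated here.
+```{r mfaEig,eval=FALSE}
+# eigenvalues and percentage of variance for the MFA dimensions
+head(r.mfa$eig)
+```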
The following code snippet generates Figure \@ref(fig:momfascatterplot): +Since this generates a two-dimensional factorization of the multi-omics data, we can now plot each tumor as a dot in a 2D scatter plot to see how well the MFA factors separate the cancer subtypes. The following code snippet generates Figure \@ref(fig:momfascatterplot): ```{r,momfascatterplot,fig.cap="Scatter plot of 2-dimensional MFA for multi-omics data shows separation between the subtypes."} # first, extract the H and W matrices from the MFA run result mfa.h <- r.mfa$global.pca$ind$coord @@ -218,14 +219,14 @@ Figure \@ref(fig:momfaheatmap) shows that indeed, when tumors are clustered and __Want to know more ?__ -- Learn more FactoMineR on the website: http://factominer.free.fr/ +- Learn more about FactoMineR on the website: http://factominer.free.fr/ - Learn more about MFA on the Wikipedia page https://en.wikipedia.org/wiki/Multiple_factor_analysis ``` -### Joint Non-negative Matrix Factorization +### Joint non-negative matrix factorization \index{non-negative matrix factorization (NMF)}As introduced in Chapter \@ref(unsupervisedLearning), NMF (Non-negative Matrix Factorization) is an algorithm from 2000 that seeks to find a non-negative additive decomposition for a non-negative data matrix. It takes the familiar form $X \approx WH$, with $X \ge 0$, $W \ge 0$, and $H \ge 0$. The non-negative constraints make a lossless decomposition (i.e. $X=WH$) generally impossible. Hence, NMF attempts to find a solution which minimizes the Frobenius norm of the reconstruction: @@ -250,7 +251,7 @@ This is typically solved for $W$ and $H$ using random initializations followed b \end{align} -Since this algorithm is guaranteed only to converge to a local minima, it is typically run several times with random initializations, and the best result is kept. +Since this algorithm is guaranteed only to converge to a local minimum, it is typically run several times with random initializations, and the best result is kept. In the multi-omics context, we will, as in the MFA case, wish to find a decomposition for an integrated data matrix of the form @@ -295,7 +296,7 @@ knitr::kable(cnvs_with_neg, caption="Example copy number data. Data can be both ``` ```{r,mocnvsplitcolshow2,echo=FALSE} -knitr::kable(cnvs_split_pos, caption="Example copy number data after splitting each column to a column representing copy number gains (+) and a column representing deletions (-). This data matrix is non-negative, and thus suitable for NMF algorithms.") +knitr::kable(cnvs_split_pos, caption="Example copy number data after splitting each column into a column representing copy number gains (+) and a column representing deletions (-). 
This data matrix is non-negative, and thus suitable for NMF algorithms.") ``` @@ -387,7 +388,7 @@ nmfw <- t(nmf.w) ``` As with MFA, we can examine how well 2-factor NMF splits tumors into subtypes by looking at the scatter plot in Figure \@ref(fig:monmfscatterplot), generated by the following code chunk: -```{r,monmfscatterplot,fig.cap="NMF creates a disentangled representation of the data using two components which allow for separation between tumor sub-types CMS1 and CMS3 based on NMF factors learned form multi-omics data."} +```{r,monmfscatterplot,fig.cap="NMF creates a disentangled representation of the data using two components which allow for separation between tumor sub-types CMS1 and CMS3 based on NMF factors learned from multi-omics data."} # create a dataframe with the H matrix and the CMS label (subtype) nmf_df <- as.data.frame(nmf.h) colnames(nmf_df) <- c("dim1", "dim2") @@ -414,8 +415,8 @@ pheatmap::pheatmap(t(nmf_df[,1:2]), __Want to know more ?__ -- Joint NMF to uncover gene regulatory networks: Zhang S., Li Q., Liu J., Zhou X. J. (2011). A novel computational framework for simultaneous integration of multiple types of genomic data to identify microRNA-gene regulatory modules. Bioinformatics 27, i401–i409. 10.1093/bioinformatics/btr206 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3117336/ -- Joint NMF for cancer research: Zhang S., Liu C.-C., Li W., Shen H., Laird P. W., Zhou X. J. (2012). Discovery of multi-dimensional modules by integrative analysis of cancer genomic data. Nucleic Acids Res. 40, 9379–9391. 10.1093/nar/gks725 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3479191/ +- Joint NMF to uncover gene regulatory networks: Zhang S., Li Q., Liu J., Zhou X. J. (2011). A novel computational framework for simultaneous integration of multiple types of genomic data to identify microRNA-gene regulatory modules. _Bioinformatics_ 27, i401–i409. 10.1093/bioinformatics/btr206 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3117336/ +- Joint NMF for cancer research: Zhang S., Liu C.-C., Li W., Shen H., Laird P. W., Zhou X. J. (2012). Discovery of multi-dimensional modules by integrative analysis of cancer genomic data. _Nucleic Acids Res._ 40, 9379–9391. 10.1093/nar/gks725 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3479191/ ``` @@ -427,7 +428,7 @@ $$ X_{(i)} = W_{(i)}Z + \epsilon_i, $$ -where $X_{(i)}$ is a data matrix from a single omics platform, $W_{(i)}$ are model parameters, $Z$ is a latent variable matrix, and is shared between the different omics platforms, and $\epsilon_i$ is a "noise" random variable, $\epsilon \sim N(0,\Psi)$, with $\Psi = diag(\psi_1,\dots \psi_M)$ is a diagonal covariance matrix. +where $X_{(i)}$ is a data matrix from a single omics platform, $W_{(i)}$ are model parameters, $Z$ is a latent variable matrix, which is shared among the different omics platforms, and $\epsilon_i$ is a "noise" random variable, $\epsilon \sim N(0,\Psi)$, where $\Psi = diag(\psi_1,\dots,\psi_M)$ is a diagonal covariance matrix. ```{r,moiCluster,fig.cap="Sketch of iCluster model. Each omics datatype is decomposed to a coefficient matrix and a shared latent variable matrix, plus noise.",fig.align = 'center',out.width='75%',echo=FALSE} knitr::include_graphics("images/icluster.png" ) ``` @@ -455,18 +456,16 @@ $$ The parameter $\lambda$ acts as a dial to weigh the trade-off between better model fits (higher log-likelihood) and a sparser model, with more $w_{ij}$s set to $0$, which gives models which generalize better and are more interpretable.
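To build intuition for how this $\lambda$ dial induces sparsity, consider the one-dimensional lasso problem $\min_w \frac{1}{2}(w-a)^2 + \lambda|w|$, whose solution is the soft-thresholding operator. The following toy R sketch (an illustration of the effect of an $L_1$ penalty only, not the actual iCluster optimizer; the coefficient values are made up) shows that increasing $\lambda$ sets more coefficients to exactly zero:

```r
# soft-thresholding: the closed-form solution of the 1-D lasso problem;
# it shrinks each coefficient towards zero and zeroes out the small ones
soft_threshold <- function(a, lambda) {
  sign(a) * pmax(abs(a) - lambda, 0)
}

w <- c(-1.2, -0.3, 0.05, 0.4, 2.1)  # hypothetical coefficient values
soft_threshold(w, lambda = 0.1)     # mild penalty: most entries stay non-zero
soft_threshold(w, lambda = 0.5)     # stronger penalty: more exact zeros
```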
-In order to solve this problem, iCluster employs the Expectation Maximization (EM) algorithm. The full details are beyond the scope of this textbook. We will introduce a short sketch instead. The intuition behind the EM algorithm is a more general case of the k-means clustering algorithm (Chapter 4). +In order to solve this problem, iCluster employs the Expectation Maximization (EM) algorithm. The full details are beyond the scope of this textbook. We will introduce a short sketch instead. Intuitively, the EM algorithm is a more general version of the k-means clustering algorithm (Chapter \@ref(unsupervisedLearning)). The basic **EM algorithm** is as follows. -##### EM algorithm sketch - -* Initialize $W$ and $\Psi$ +* Initialize $W$ and $\Psi$. * **Until convergence of $W$, $\Psi$** - - E-step: calculate the expected value of $Z$ given the current estimates of $W$ and $\Psi$ and the data $X$ - - M-step: calculate maximum likelihood estimates for the parameters $W$ and $\Psi$ based on the current estimate of $Z$ and the data $X$. + - E-step: Calculate the expected value of $Z$ given the current estimates of $W$ and $\Psi$ and the data $X$. + - M-step: Calculate maximum likelihood estimates for the parameters $W$ and $\Psi$ based on the current estimate of $Z$ and the data $X$. -#### iCluster+ +#### iCluster+: Extending iCluster -iCluster+ is an extension of the iCluster framework, which allows for omics types to arise from other distributions than a Gaussian. While normal distributions are a good assumption for log-transformed, centered gene expression data, it is a poor model for binary mutations data, or for copy number variation data, which can typically take the values $(-2, 1, 0, 1, 2)$ for for heterozygous / monozygous deletions or amplifications. iCluster+ allows the different $X$s to have different distributions: +iCluster+ is an extension of the iCluster framework, which allows for omics types to arise from distributions other than a Gaussian. While normal distributions are a good assumption for log-transformed, centered gene expression data, they are a poor model for binary mutation data, or for copy number variation data, which can typically take the values $(-2, -1, 0, 1, 2)$ for homozygous/heterozygous deletions or amplifications. iCluster+ allows the different $X$s to have different distributions: * for binary mutations, $X$ is drawn from a multivariate binomial * for normal, continuous data, $X$ is drawn from a multivariate Gaussian @@ -475,11 +474,11 @@ iCluster+ is an extension of the iCluster framework, which allows for omics type In that way, iCluster+ allows us to explicitly model our assumptions about the distributions of our different omics data types, and leverage the strengths of Bayesian inference. -Both iCluster and iCluster+ make use of sophisticated Bayesian inference algorithms (EM for iCluster, Metropolis-Hastings MCMC for iCluster+), which means they do not scale up trivially. Therefore, it is recommended to filter down the features to a manageable size before inputing data to the algorithm. The exact size of "manageable" data depends on your hardware, but a rule of thumb is that dimensions in the thousands are ok, but in the tens of thousands might be too slow. +Both iCluster and iCluster+ make use of sophisticated Bayesian inference algorithms (EM for iCluster, Metropolis-Hastings MCMC for iCluster+), which means they do not scale up trivially. Therefore, it is recommended to filter down the features to a manageable size before inputting data to the algorithm.
The exact size of "manageable" data depends on your hardware, but a rule of thumb is that dimensions in the thousands are ok, but in the tens of thousands might be too slow. -#### running iCluster +#### Running iCluster+ -iCluster+ is available through the BioConductor package `iClusterPlus`. The following code snippet demonstrates how it can be run with two components: +iCluster+ is available through the Bioconductor package `iClusterPlus`. The following code snippet demonstrates how it can be run with two components: ```{r,momultiOmicsiclusterplus} # run the iClusterPlus function r.icluster <- iClusterPlus::iClusterPlus( @@ -525,26 +524,26 @@ pheatmap::pheatmap(t(icp_df[,1:2]), __Want to know more ?__ -- Read the original iCluster paper: Shen R., Olshen A. B., Ladanyi M. (2009). Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics 25, 2906–2912. 10.1093/bioinformatics/btp543 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2800366/ -- Read the original iClusterPlus paper: an extension of iCluster: Shen R., Mo Q., Schultz N., Seshan V. E., Olshen A. B., Huse J., et al. . (2012). Integrative subtype discovery in glioblastoma using iCluster. PLoS ONE 7:e35236. 10.1371/journal.pone.0035236 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3335101/ -- Learn more about the LASSO for model regularization: Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Royal. Statist. Soc B., Vol. 58, No. 1, pages 267-288: http://www-stat.stanford.edu/%7Etibs/lasso/lasso.pdf -- Learn more about the EM algorithm: Dempster, A. P., et al. “Maximum Likelihood from Incomplete Data via the EM Algorithm.” Journal of the Royal Statistical Society. Series B (Methodological), vol. 39, no. 1, 1977, pp. 1–38. JSTOR, JSTOR: http://www.jstor.org/stable/2984875 -- Read about MCMC algorithms: Hastings, W.K. (1970). "Monte Carlo Sampling Methods Using Markov Chains and Their Applications". Biometrika. 57 (1): 97–109. doi:10.1093/biomet/57.1.97: https://www.jstor.org/stable/2334940 +- Read the original iCluster paper: Shen R., Olshen A. B., Ladanyi M. (2009). Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. _Bioinformatics_ 25, 2906–2912. 10.1093/bioinformatics/btp543 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2800366/ +- Read the original iClusterPlus paper: an extension of iCluster: Shen R., Mo Q., Schultz N., Seshan V. E., Olshen A. B., Huse J., et al. (2012). Integrative subtype discovery in glioblastoma using iCluster. _PLoS ONE_ 7:e35236. 10.1371/journal.pone.0035236 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3335101/ +- Learn more about the LASSO for model regularization: Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. _J. Royal. Statist. Soc B._, Vol. 58, No. 1, pages 267-288: http://www-stat.stanford.edu/%7Etibs/lasso/lasso.pdf +- Learn more about the EM algorithm: Dempster, A. P., et al. Maximum likelihood from incomplete data via the EM algorithm. _Journal of the Royal Statistical Society. Series B (Methodological)_, vol. 39, no. 1, 1977, pp. 1–38. JSTOR, JSTOR: http://www.jstor.org/stable/2984875 +- Read about MCMC algorithms: Hastings, W.K. (1970). Monte Carlo sampling methods using Markov chains and their applications. _Biometrika._ 57 (1): 97–109. 
doi:10.1093/biomet/57.1.97: https://www.jstor.org/stable/2334940 ``` ## Clustering using latent factors -\index{clustering}\index{unsupervised learning}A common analysis in biological investigations is clustering. This is often interesting in cancer studies as one hopes to find groups of tumors (clusters) which behave similarly, i.e. have similar risks and/or respond to the same drugs. PCA is a common step in clustering analyses, and so it is easy to see how the latent variable models above may all be a useful pre-processing step before clustering. In the examples below, we will use the latent variables inferred by the algorithms in the previous section on the set of Colorectal cancer tumors from the TCGA. For a more complete introduction to clustering, see Chapter \@ref(unsupervisedLearning). +\index{clustering}\index{unsupervised learning}A common analysis in biological investigations is clustering. This is often interesting in cancer studies as one hopes to find groups of tumors (clusters) which behave similarly, i.e. have similar risks and/or respond to the same drugs. PCA is a common step in clustering analyses, and so it is easy to see how the latent variable models above may all be a useful pre-processing step before clustering. In the examples below, we will use the latent variables inferred by the algorithms in the previous section on the set of colorectal cancer tumors from the TCGA. For a more complete introduction to clustering, see Chapter \@ref(unsupervisedLearning). ### One-hot clustering -A specific clustering method for NMF data is to assume each sample is driven by one component, i.e. that the number of clusters $K$ is the same as the number of latent variables in the model and that each sample may be associated to one of those components. We assign each sample a cluster label based on the latent variable which affects it the most. The figure above (Heatmap of 2-component NMF) shows the latent variable values for the two latent variables, for the 72 tumors, obtained by Joint NMF. +A specific clustering method for NMF data is to assume each sample is driven by one component, i.e. that the number of clusters $K$ is the same as the number of latent variables in the model and that each sample may be associated to one of those components. We assign each sample a cluster label based on the latent variable which affects it the most. Figure \@ref(fig:monmfheatmap) above (heatmap of 2-component NMF) shows the latent variable values for the two latent variables, for the 72 tumors, obtained by Joint NMF. The two rows are the two latent variables, and the columns are the 72 tumors. We can observe that most tumors are indeed driven mainly by one of the factors, and not a combination of the two. We can use this to assign each tumor a cluster label based on its dominant factor, shown in the following code snippet, which also produces the heatmap in Figure \@ref(fig:moNMFClustering). -```{r,moNMFClustering,fig.cap="Joint NMF factors with clusters, and molecular sub-types. One-hot clustering assigns one cluser per dimension, where each sample is assigned a cluster based on its dominant component. The clusters largely recapitulate the CMS sub-types.",fig.height=3} +```{r,moNMFClustering,fig.cap="Joint NMF factors with clusters, and molecular sub-types. One-hot clustering assigns one cluster per dimension, where each sample is assigned a cluster based on its dominant component. 
The clusters largely recapitulate the CMS sub-types.",fig.height=3} # one-hot clustering in one line of code: # assign each sample the cluster according to its dominant NMF factor # easily accessible using the max.col function @@ -629,7 +628,7 @@ Inspection of the factor coefficients in the heatmap above reveals that Joint NM #### Disentangled representations -\index{disentangled representations}The property displayed above, where each feature is predominantly associated with only a single factor, is termed _disentangledness_, i.e., it leads to _disentangled_ latent variable representations, as changing one input feature only affects a single latent variable. This property is very desirable as it greatly simplifies the biological interpretation of modules. Here, we have two modules with a set of co-occurring molecular signatures which merit deeper investigation into the mechanisms by which these different omics features are related. For this reason, NMF is widely used in computational biology today. +\index{disentangled representations}The property displayed above, where each feature is predominantly associated with only a single factor, is termed _disentangledness_, i.e. it leads to _disentangled_ latent variable representations, as changing one input feature only affects a single latent variable. This property is very desirable as it greatly simplifies the biological interpretation of modules. Here, we have two modules with a set of co-occurring molecular signatures which merit deeper investigation into the mechanisms by which these different omics features are related. For this reason, NMF is widely used in computational biology today. ### Making sense of factors using enrichment analysis @@ -637,15 +636,15 @@ Inspection of the factor coefficients in the heatmap above reveals that Joint NM #### Enrichment analysis -The recent decades of genomics have uncovered many of the ways in which genes cooperate to perform biological functions in concert. This work has resulted in rich annotations of genes, groups of genes, and the different functions they carry out. Examples of such annotations include the Gene Ontology Consortium's _GO terms_ [@go_first_paper, @go_latest_paper], the _Reactome pathways database_ [@reactome_latent_paper], and the _Kyoto Encyclopaedia of Genes and Genomes_ [@kegg_latest_paper]. These resources, as well as others, publish lists of so-called _gene sets_, or _pathways_, which are a set of genes which are known to operate together in some biological function, e.g. protein synthesis, DNA mismatch repair, cellular adhesion, and many other functions. Gene set enrichment analysis is a method which looks for overlaps between genes which we have found to be of interest, e.g. by them being implicated in a certain tumor type, and the a-priori gene sets discussed above. +The recent decades of genomics have uncovered many of the ways in which genes cooperate to perform biological functions in concert. This work has resulted in rich annotations of genes, groups of genes, and the different functions they carry out. Examples of such annotations include the Gene Ontology Consortium's _GO terms_ [@go_first_paper; @go_latest_paper], the _Reactome pathways database_ [@reactome_latent_paper], and the _Kyoto Encyclopedia of Genes and Genomes_ [@kegg_latest_paper]. These resources, as well as others, publish lists of so-called _gene sets_, or _pathways_, which are sets of genes which are known to operate together in some biological function, e.g.
protein synthesis, DNA mismatch repair, cellular adhesion, and many other functions. Gene set enrichment analysis is a method which looks for overlaps between genes which we have found to be of interest, e.g. by them being implicated in a certain tumor type, and the a-priori gene sets discussed above. -In the context of making sense of latent factors, the question we will be asking is whether the genes which drive the value of a latent factor (the genes with the highest factor coefficients) also belong to any interesting annotated gene sets, and whether the overlap greater than we would expect by chance. If there are $N$ genes in total, $K$ of which belong to a gene set, the probability that $k$ out of the $n$ genes associated with a latent factor are also associated with a gene set is given by the hypergeometric distribution: +In the context of making sense of latent factors, the question we will be asking is whether the genes which drive the value of a latent factor (the genes with the highest factor coefficients) also belong to any interesting annotated gene sets, and whether the overlap is greater than we would expect by chance. If there are $N$ genes in total, $K$ of which belong to a gene set, the probability that $k$ out of the $n$ genes associated with a latent factor are also associated with a gene set is given by the hypergeometric distribution: $$ P(k) = \frac{\binom{K}{k} \binom{N-K}{n-k}}{\binom{N}{n}}. $$ -the **hypergeometric test** \index{statistical test} uses the hypergeometric distribution to assess the statistical significance of the presence of genes belonging to a gene set in the latent factor. The null hypothesis is that there is no relationship between genes in a gene set, and genes in a latent factor. When testing for over-representation of gene set genes in a latent factor, the P value from the hypergeometric test is the probability of getting $k$ or more genes from a gene set in a latent factor +The **hypergeometric test** \index{statistical test} uses the hypergeometric distribution to assess the statistical significance of the presence of genes belonging to a gene set in the latent factor. The null hypothesis is that there is no relationship between genes in a gene set, and genes in a latent factor. When testing for over-representation of gene set genes in a latent factor, the P value from the hypergeometric test is the probability of getting $k$ or more genes from a gene set in a latent factor: $$ p = \sum_{i=k}^K P(k=i). $$ @@ -695,7 +694,7 @@ the.table ### Interpretation using additional covariates -Another way to ascribe biological significance to the latent variables is by correlating them with additional covariates we might have about the samples. In our example, the Colorectal cancer tumors have also been characterized for Microsattelite Instability status, using an external test (typically PCR-based). +Another way to ascribe biological significance to the latent variables is by correlating them with additional covariates we might have about the samples. In our example, the colorectal cancer tumors have also been characterized for microsatellite instability (MSI) status, using an external test (typically PCR-based).
By examining the latent variable values as they relate to a tumor's MSI status, we might discover that we've learned latent factors that are related to it. The following code snippet demonstrates how this might be looked into, by generating Figures \@ref(fig:moNMFClinicalCovariates) and \@ref(fig:moNMFClinicalCovariates2): ```{r,moNMFClinicalCovariates,fig.cap="Box plot showing MSI/MSS status distribution and NMF factor 1 values."} # create a data frame holding covariates (age, gender, MSI status) @@ -719,7 +718,7 @@ ggplot2::ggplot(cov_factor, ggplot2::aes(x=msi, y=factor2, group=msi)) + ggplot2::ggtitle("NMF factor 2 and microsatellite instability") ``` -Figures \@ref(fig:moNMFClinicalCovariates) and \@ref(fig:moNMFClinicalCovariates2) show that NMF factor 1 and NMF factor two are are separated by the MSI/MSS status of the tumors. +Figures \@ref(fig:moNMFClinicalCovariates) and \@ref(fig:moNMFClinicalCovariates2) show that NMF factor 1 and NMF factor 2 are separated by the MSI or MSS (microsatellite stability) status of the tumors. ## Exercises @@ -727,7 +726,7 @@ Figures \@ref(fig:moNMFClinicalCovariates) and \@ref(fig:moNMFClinicalCovariates 1. Find features associated with iCluster and MFA factors, and visualize the feature weights. [Difficulty: **Beginner**] -2. Normalizing the data matrices by their $\lambda_1$'s as in MFA supposes we wish to assign each data type the same importance in the down-stream analysis. This leads to a natural generalization whereby the different data types may be differently weighed. Provide an implementation of weighed-MFA where the different data types may be assigned individual weights. [Difficulty: **Intermediate**] +2. Normalizing the data matrices by their $\lambda_1$'s as in MFA supposes we wish to assign each data type the same importance in the down-stream analysis. This leads to a natural generalization whereby the different data types may be differently weighted. Provide an implementation of weighed-MFA where the different data types may be assigned individual weights. [Difficulty: **Intermediate**] 3. In order to use NMF algorithms on data which can be negative, we need to split each feature into two new features, one positive and one negative. Implement the following function, and see that the included test does not fail: [Difficulty: **Intermediate/Advanced**] @@ -748,7 +747,7 @@ test_split_neg_columns <- function() { test_split_neg_columns() ``` -4. The iCluster+ algorithm has some parameters which may be tuned for maximum performance. The `iClusterPlus` package has a method, `iClusterPlus::tune.iClusterPlus`, which does this automatically based on the Bayesian Information Criterion (BIC). Run this method on the data from the examples above. and find the optimal lambda and alpha values. [Difficulty: **Beginner/Intermediate**] +4. The iCluster+ algorithm has some parameters which may be tuned for maximum performance. The `iClusterPlus` package has a method, `iClusterPlus::tune.iClusterPlus`, which does this automatically based on the Bayesian Information Criterion (BIC). Run this method on the data from the examples above and find the optimal lambda and alpha values. [Difficulty: **Beginner/Intermediate**] ### Clustering using latent factors @@ -776,4 +775,5 @@ ggplot2::ggplot(cov_factor, ggplot2::aes(x=cimp, y=factor2, group=cimp)) + ggplo 4. Microsatellite instability (MSI) is associated with hyper-mutated tumors. 
As seen in Figure \@ref(fig:momutationsHeatmap), one of the subtypes has tumors with significantly more mutations than the other. Which subtype is that? Which NMF factor is associated with that subtype? And which NMF factor is associated with MSI? [Difficulty: **Advanced**] -[^mfamca]: When dealing with categorical variables, MFA uses MCA (Multiple Correspondence Analysis). This is less relevant to biological data analysis and will not be discussed here \ No newline at end of file +[^mfamca]: When dealing with categorical variables, MFA uses MCA (Multiple Correspondence Analysis). This is less relevant to biological data analysis and will not be discussed here. + diff --git a/_build.sh b/_build.sh index dde5925..1cdff0a 100644 --- a/_build.sh +++ b/_build.sh @@ -2,8 +2,14 @@ set -ev -Rscript -e "bookdown::render_book('index.Rmd', 'bookdown::gitbook')" +# render gitbook +Rscript -e "bookdown::render_book('index.Rmd', 'bookdown::gitbook',new_session=TRUE)" + +# render pdf #Rscript -e "bookdown::render_book('index.Rmd', 'bookdown::pdf_book',output_dir='book_pdf')" -#Rscript -e "bookdown::render_book('index.Rmd', 'bookdown::pdf_book',new_session=TRUE)" -Rscript -e "bookdown::preview_chapter('01-intro2Genomics.Rmd', 'bookdown::pdf_book',new_session=TRUE)" +# compile each chapter separately and merge +Rscript -e "bookdown::render_book('index.Rmd', 'bookdown::pdf_book',new_session=TRUE)" + +# compile one chapter +#Rscript -e "bookdown::preview_chapter('01-intro2Genomics.Rmd', 'bookdown::pdf_book',new_session=TRUE)" diff --git a/_manual_deploy.sh b/_manual_deploy.sh index 00e330c..0cbbd04 100644 --- a/_manual_deploy.sh +++ b/_manual_deploy.sh @@ -5,5 +5,6 @@ git clone -b gh-pages https://github.com/compgenomr/book.git book-output cd book-output cp -r ../_book/* ./ git add * -git commit -m "Update the book manually 2" +git commit -m "Update the book manually 3" git push origin gh-pages + diff --git a/_output.yml b/_output.yml index 4cacc4e..fd507b6 100644 --- a/_output.yml +++ b/_output.yml @@ -10,7 +10,7 @@ bookdown::gitbook:
    Computational Genomics with R
  after: |
    Published with bookdown
  • - download: [pdf] + download: no edit: https://github.com/compgenomr/book/edit/master/%s sharing: github: yes diff --git a/apalike.bst b/apalike.bst new file mode 100644 index 0000000..e42af73 --- /dev/null +++ b/apalike.bst @@ -0,0 +1,1114 @@ +% BibTeX `apalike' bibliography style (version 0.99a, 8-Dec-10), adapted from +% the `alpha' style, version 0.99a; for BibTeX version 0.99a. +% +% Copyright (C) 1988, 2010 Oren Patashnik. +% Unlimited copying and redistribution of this file are permitted as long as +% it is unmodified. Modifications (and redistribution of modified versions) +% are also permitted, but only if the resulting file is renamed. +% +% Differences between this style and `alpha' are generally heralded by a `%'. +% The file btxbst.doc has the documentation for alpha.bst. +% +% This style should be used with the `apalike' LaTeX style (apalike.sty). +% \cite's come out like "(Jones, 1986)" in the text but there are no labels +% in the bibliography, and something like "(1986)" comes out immediately +% after the author. Author (and editor) names appear as last name, comma, +% initials. A `year' field is required for every entry, and so is either +% an author (or in some cases, an editor) field or a key field. +% +% Editorial note: +% Many journals require a style like `apalike', but I strongly, strongly, +% strongly recommend that you not use it if you have a choice---use something +% like `plain' instead. Mary-Claire van Leunen (A Handbook for Scholars, +% Knopf, 1979) argues convincingly that a style like `plain' encourages better +% writing than one like `apalike'. Furthermore the strongest arguments for +% using an author-date style like `apalike'---that it's "the most practical" +% (The Chicago Manual of Style, University of Chicago Press, thirteenth +% edition, 1982, pages 400--401)---fall flat on their face with the new +% computer-typesetting technology. For instance page 401 anachronistically +% states "The chief disadvantage of [a style like `plain'] is that additions +% or deletions cannot be made after the manuscript is typed without changing +% numbers in both text references and list." LaTeX sidesteps the disadvantage. +% +% History: +% 15-sep-86 (OP) Original version by Oren Patashnik, ideas from Susan King. +% 10-nov-86 (OP) Truncated the sort.key$ string to the correct length +% in bib.sort.order to eliminate error message. +% 24-jan-88 (OP) Updated for BibTeX version 0.99a, from alpha.bst 0.99a; +% apalike now sorts by author, then year, then title; +% THIS `apalike' VERSION DOES NOT WORK WITH BIBTEX 0.98i. +% 8-dec-10 (OP) Still version 0.99a, as the code itself was unchanged; +% this release clarified the license. 
+ +ENTRY + { address + author + booktitle + chapter + edition + editor + howpublished + institution + journal + key +% month not used in apalike + note + number + organization + pages + publisher + school + series + title + type + volume + year + } + {} + { label extra.label sort.label } + +INTEGERS { output.state before.all mid.sentence after.sentence after.block } + +FUNCTION {init.state.consts} +{ #0 'before.all := + #1 'mid.sentence := + #2 'after.sentence := + #3 'after.block := +} + +STRINGS { s t } + +FUNCTION {output.nonnull} +{ 's := + output.state mid.sentence = + { ", " * write$ } + { output.state after.block = + { add.period$ write$ + newline$ + "\newblock " write$ + } + { output.state before.all = + 'write$ + { add.period$ " " * write$ } + if$ + } + if$ + mid.sentence 'output.state := + } + if$ + s +} + +FUNCTION {output} +{ duplicate$ empty$ + 'pop$ + 'output.nonnull + if$ +} + +FUNCTION {output.check} +{ 't := + duplicate$ empty$ + { pop$ "empty " t * " in " * cite$ * warning$ } + 'output.nonnull + if$ +} + +% apalike needs this function because +% the year has special punctuation; +% apalike ignores the month +FUNCTION {output.year.check} +{ year empty$ + { "empty year in " cite$ * warning$ } + { write$ + " (" year * extra.label * ")" * + mid.sentence 'output.state := + } + if$ +} + +FUNCTION {output.bibitem} +{ newline$ + "\bibitem[" write$ + label write$ + "]{" write$ + cite$ write$ + "}" write$ + newline$ + "" + before.all 'output.state := +} + +FUNCTION {fin.entry} +{ add.period$ + write$ + newline$ +} + +FUNCTION {new.block} +{ output.state before.all = + 'skip$ + { after.block 'output.state := } + if$ +} + +FUNCTION {new.sentence} +{ output.state after.block = + 'skip$ + { output.state before.all = + 'skip$ + { after.sentence 'output.state := } + if$ + } + if$ +} + +FUNCTION {not} +{ { #0 } + { #1 } + if$ +} + +FUNCTION {and} +{ 'skip$ + { pop$ #0 } + if$ +} + +FUNCTION {or} +{ { pop$ #1 } + 'skip$ + if$ +} + +FUNCTION {new.block.checkb} +{ empty$ + swap$ empty$ + and + 'skip$ + 'new.block + if$ +} + +FUNCTION {field.or.null} +{ duplicate$ empty$ + { pop$ "" } + 'skip$ + if$ +} + +FUNCTION {emphasize} +{ duplicate$ empty$ + { pop$ "" } + { "{\em " swap$ * "}" * } + if$ +} + +INTEGERS { nameptr namesleft numnames } + +FUNCTION {format.names} +{ 's := + #1 'nameptr := + s num.names$ 'numnames := + numnames 'namesleft := + { namesleft #0 > } + { s nameptr "{vv~}{ll}{, jj}{, f.}" format.name$ 't := % last name first + nameptr #1 > + { + nameptr #3 + #1 + = + numnames #5 + > and + { "others" 't := + #1 'namesleft := } + 'skip$ + if$ + namesleft #1 > + { ", " * t * } + { numnames #2 > + { "," * } + 'skip$ + if$ + t "others" = + { " et~al." 
* } + { " and " * t * } + if$ + } + if$ + } + 't + if$ + nameptr #1 + 'nameptr := + namesleft #1 - 'namesleft := + } + while$ +} + +FUNCTION {format.authors} +{ author empty$ + { "" } + { author format.names } + if$ +} + +FUNCTION {format.key} % this function is just for apalike +{ empty$ + { key field.or.null } + { "" } + if$ +} + +FUNCTION {format.editors} +{ editor empty$ + { "" } + { editor format.names + editor num.names$ #1 > + { ", editors" * } + { ", editor" * } + if$ + } + if$ +} + +FUNCTION {format.title} +{ title empty$ + { "" } + { title "t" change.case$ } + if$ +} + +FUNCTION {n.dashify} +{ 't := + "" + { t empty$ not } + { t #1 #1 substring$ "-" = + { t #1 #2 substring$ "--" = not + { "--" * + t #2 global.max$ substring$ 't := + } + { { t #1 #1 substring$ "-" = } + { "-" * + t #2 global.max$ substring$ 't := + } + while$ + } + if$ + } + { t #1 #1 substring$ * + t #2 global.max$ substring$ 't := + } + if$ + } + while$ +} + +FUNCTION {format.btitle} +{ title emphasize +} + +FUNCTION {tie.or.space.connect} +{ duplicate$ text.length$ #3 < + { "~" } + { " " } + if$ + swap$ * * +} + +FUNCTION {either.or.check} +{ empty$ + 'pop$ + { "can't use both " swap$ * " fields in " * cite$ * warning$ } + if$ +} + +FUNCTION {format.bvolume} +{ volume empty$ + { "" } + { "volume" volume tie.or.space.connect + series empty$ + 'skip$ + { " of " * series emphasize * } + if$ + "volume and number" number either.or.check + } + if$ +} + +FUNCTION {format.number.series} +{ volume empty$ + { number empty$ + { series field.or.null } + { output.state mid.sentence = + { "number" } + { "Number" } + if$ + number tie.or.space.connect + series empty$ + { "there's a number but no series in " cite$ * warning$ } + { " in " * series * } + if$ + } + if$ + } + { "" } + if$ +} + +FUNCTION {format.edition} +{ edition empty$ + { "" } + { output.state mid.sentence = + { edition "l" change.case$ " edition" * } + { edition "t" change.case$ " edition" * } + if$ + } + if$ +} + +INTEGERS { multiresult } + +FUNCTION {multi.page.check} +{ 't := + #0 'multiresult := + { multiresult not + t empty$ not + and + } + { t #1 #1 substring$ + duplicate$ "-" = + swap$ duplicate$ "," = + swap$ "+" = + or or + { #1 'multiresult := } + { t #2 global.max$ substring$ 't := } + if$ + } + while$ + multiresult +} + +FUNCTION {format.pages} +{ pages empty$ + { "" } + { pages multi.page.check + { "pages" pages n.dashify tie.or.space.connect } + { "page" pages tie.or.space.connect } + if$ + } + if$ +} + +FUNCTION {format.vol.num.pages} +{ volume field.or.null + number empty$ + 'skip$ + { "(" number * ")" * * + volume empty$ + { "there's a number but no volume in " cite$ * warning$ } + 'skip$ + if$ + } + if$ + pages empty$ + 'skip$ + { duplicate$ empty$ + { pop$ format.pages } + { ":" * pages n.dashify * } + if$ + } + if$ +} + +FUNCTION {format.chapter.pages} +{ chapter empty$ + 'format.pages + { type empty$ + { "chapter" } + { type "l" change.case$ } + if$ + chapter tie.or.space.connect + pages empty$ + 'skip$ + { ", " * format.pages * } + if$ + } + if$ +} + +FUNCTION {format.in.ed.booktitle} +{ booktitle empty$ + { "" } + { editor empty$ + { "In " booktitle emphasize * } + { "In " format.editors * ", " * booktitle emphasize * } + if$ + } + if$ +} + +FUNCTION {format.thesis.type} +{ type empty$ + 'skip$ + { pop$ + type "t" change.case$ + } + if$ +} + +FUNCTION {format.tr.number} +{ type empty$ + { "Technical Report" } + 'type + if$ + number empty$ + { "t" change.case$ } + { number tie.or.space.connect } + if$ +} + +FUNCTION 
{format.article.crossref} +{ "In" % this is for apalike + " \cite{" * crossref * "}" * +} + +FUNCTION {format.book.crossref} +{ volume empty$ + { "empty volume in " cite$ * "'s crossref of " * crossref * warning$ + "In " + } + { "Volume" volume tie.or.space.connect + " of " * + } + if$ + "\cite{" * crossref * "}" * % this is for apalike +} + +FUNCTION {format.incoll.inproc.crossref} +{ "In" % this is for apalike + " \cite{" * crossref * "}" * +} + +FUNCTION {article} +{ output.bibitem + format.authors "author" output.check + author format.key output % special for + output.year.check % apalike + new.block + format.title "title" output.check + new.block + crossref missing$ + { journal emphasize "journal" output.check + format.vol.num.pages output + } + { format.article.crossref output.nonnull + format.pages output + } + if$ + new.block + note output + fin.entry +} + +FUNCTION {book} +{ output.bibitem + author empty$ + { format.editors "author and editor" output.check + editor format.key output + } + { format.authors output.nonnull + crossref missing$ + { "author and editor" editor either.or.check } + 'skip$ + if$ + } + if$ + output.year.check % special for apalike + new.block + format.btitle "title" output.check + crossref missing$ + { format.bvolume output + new.block + format.number.series output + new.sentence + publisher "publisher" output.check + address output + } + { new.block + format.book.crossref output.nonnull + } + if$ + format.edition output + new.block + note output + fin.entry +} + +FUNCTION {booklet} +{ output.bibitem + format.authors output + author format.key output % special for + output.year.check % apalike + new.block + format.title "title" output.check + new.block + howpublished output + address output + new.block + note output + fin.entry +} + +FUNCTION {inbook} +{ output.bibitem + author empty$ + { format.editors "author and editor" output.check + editor format.key output + } + { format.authors output.nonnull + crossref missing$ + { "author and editor" editor either.or.check } + 'skip$ + if$ + } + if$ + output.year.check % special for apalike + new.block + format.btitle "title" output.check + crossref missing$ + { format.bvolume output + format.chapter.pages "chapter and pages" output.check + new.block + format.number.series output + new.sentence + publisher "publisher" output.check + address output + } + { format.chapter.pages "chapter and pages" output.check + new.block + format.book.crossref output.nonnull + } + if$ + format.edition output + new.block + note output + fin.entry +} + +FUNCTION {incollection} +{ output.bibitem + format.authors "author" output.check + author format.key output % special for + output.year.check % apalike + new.block + format.title "title" output.check + new.block + crossref missing$ + { format.in.ed.booktitle "booktitle" output.check + format.bvolume output + format.number.series output + format.chapter.pages output + new.sentence + publisher "publisher" output.check + address output + format.edition output + } + { format.incoll.inproc.crossref output.nonnull + format.chapter.pages output + } + if$ + new.block + note output + fin.entry +} + +FUNCTION {inproceedings} +{ output.bibitem + format.authors "author" output.check + author format.key output % special for + output.year.check % apalike + new.block + format.title "title" output.check + new.block + crossref missing$ + { format.in.ed.booktitle "booktitle" output.check + format.bvolume output + format.number.series output + format.pages output + address output % for apalike + 
new.sentence % there's no year + organization output % here so things + publisher output % are simpler + } + { format.incoll.inproc.crossref output.nonnull + format.pages output + } + if$ + new.block + note output + fin.entry +} + +FUNCTION {conference} { inproceedings } + +FUNCTION {manual} +{ output.bibitem + format.authors output + author format.key output % special for + output.year.check % apalike + new.block + format.btitle "title" output.check + organization address new.block.checkb + organization output + address output + format.edition output + new.block + note output + fin.entry +} + +FUNCTION {mastersthesis} +{ output.bibitem + format.authors "author" output.check + author format.key output % special for + output.year.check % apalike + new.block + format.title "title" output.check + new.block + "Master's thesis" format.thesis.type output.nonnull + school "school" output.check + address output + new.block + note output + fin.entry +} + +FUNCTION {misc} +{ output.bibitem + format.authors output + author format.key output % special for + output.year.check % apalike + new.block + format.title output + new.block + howpublished output + new.block + note output + fin.entry +} + +FUNCTION {phdthesis} +{ output.bibitem + format.authors "author" output.check + author format.key output % special for + output.year.check % apalike + new.block + format.btitle "title" output.check + new.block + "PhD thesis" format.thesis.type output.nonnull + school "school" output.check + address output + new.block + note output + fin.entry +} + +FUNCTION {proceedings} +{ output.bibitem + format.editors output + editor format.key output % special for + output.year.check % apalike + new.block + format.btitle "title" output.check + format.bvolume output + format.number.series output + address output % for apalike + new.sentence % we always output + organization output % a nonempty organization + publisher output % here + new.block + note output + fin.entry +} + +FUNCTION {techreport} +{ output.bibitem + format.authors "author" output.check + author format.key output % special for + output.year.check % apalike + new.block + format.title "title" output.check + new.block + format.tr.number output.nonnull + institution "institution" output.check + address output + new.block + note output + fin.entry +} + +FUNCTION {unpublished} +{ output.bibitem + format.authors "author" output.check + author format.key output % special for + output.year.check % apalike + new.block + format.title "title" output.check + new.block + note "note" output.check + fin.entry +} + +FUNCTION {default.type} { misc } + +MACRO {jan} {"January"} + +MACRO {feb} {"February"} + +MACRO {mar} {"March"} + +MACRO {apr} {"April"} + +MACRO {may} {"May"} + +MACRO {jun} {"June"} + +MACRO {jul} {"July"} + +MACRO {aug} {"August"} + +MACRO {sep} {"September"} + +MACRO {oct} {"October"} + +MACRO {nov} {"November"} + +MACRO {dec} {"December"} + +MACRO {acmcs} {"ACM Computing Surveys"} + +MACRO {acta} {"Acta Informatica"} + +MACRO {cacm} {"Communications of the ACM"} + +MACRO {ibmjrd} {"IBM Journal of Research and Development"} + +MACRO {ibmsj} {"IBM Systems Journal"} + +MACRO {ieeese} {"IEEE Transactions on Software Engineering"} + +MACRO {ieeetc} {"IEEE Transactions on Computers"} + +MACRO {ieeetcad} + {"IEEE Transactions on Computer-Aided Design of Integrated Circuits"} + +MACRO {ipl} {"Information Processing Letters"} + +MACRO {jacm} {"Journal of the ACM"} + +MACRO {jcss} {"Journal of Computer and System Sciences"} + +MACRO {scp} {"Science of Computer 
Programming"} + +MACRO {sicomp} {"SIAM Journal on Computing"} + +MACRO {tocs} {"ACM Transactions on Computer Systems"} + +MACRO {tods} {"ACM Transactions on Database Systems"} + +MACRO {tog} {"ACM Transactions on Graphics"} + +MACRO {toms} {"ACM Transactions on Mathematical Software"} + +MACRO {toois} {"ACM Transactions on Office Information Systems"} + +MACRO {toplas} {"ACM Transactions on Programming Languages and Systems"} + +MACRO {tcs} {"Theoretical Computer Science"} + +READ + +FUNCTION {sortify} +{ purify$ + "l" change.case$ +} + +INTEGERS { len } + +FUNCTION {chop.word} +{ 's := + 'len := + s #1 len substring$ = + { s len #1 + global.max$ substring$ } + 's + if$ +} + +% There are three apalike cases: one person (Jones), +% two (Jones and de~Bruijn), and more (Jones et~al.). +% This function is much like format.crossref.editors. +% +FUNCTION {format.lab.names} +{ 's := + s #1 "{vv~}{ll}" format.name$ + s num.names$ duplicate$ + #2 > + { pop$ " et~al." * } + { #2 < + 'skip$ + { s #2 "{ff }{vv }{ll}{ jj}" format.name$ "others" = + { " et~al." * } + { " and " * s #2 "{vv~}{ll}" format.name$ * } + if$ + } + if$ + } + if$ +} + +FUNCTION {author.key.label} +{ author empty$ + { key empty$ + { cite$ #1 #3 substring$ } + 'key % apalike uses the whole key + if$ + } + { author format.lab.names } + if$ +} + +FUNCTION {author.editor.key.label} +{ author empty$ + { editor empty$ + { key empty$ + { cite$ #1 #3 substring$ } + 'key % apalike uses the whole key + if$ + } + { editor format.lab.names } + if$ + } + { author format.lab.names } + if$ +} + +FUNCTION {editor.key.label} +{ editor empty$ + { key empty$ + { cite$ #1 #3 substring$ } + 'key % apalike uses the whole key, no organization + if$ + } + { editor format.lab.names } + if$ +} + +FUNCTION {calc.label} +{ type$ "book" = + type$ "inbook" = + or + 'author.editor.key.label + { type$ "proceedings" = + 'editor.key.label % apalike ignores organization + 'author.key.label % for labeling and sorting + if$ + } + if$ + ", " % these three lines are + * % for apalike, which + year field.or.null purify$ #-1 #4 substring$ % uses all four digits + * + 'label := +} + +FUNCTION {sort.format.names} +{ 's := + #1 'nameptr := + "" + s num.names$ 'numnames := + numnames 'namesleft := + { namesleft #0 > } + { nameptr #1 > + { " " * } + 'skip$ + if$ % apalike uses initials + s nameptr "{vv{ } }{ll{ }}{ f{ }}{ jj{ }}" format.name$ 't := % <= here + nameptr numnames = t "others" = and + { "et al" * } + { t sortify * } + if$ + nameptr #1 + 'nameptr := + namesleft #1 - 'namesleft := + } + while$ +} + +FUNCTION {sort.format.title} +{ 't := + "A " #2 + "An " #3 + "The " #4 t chop.word + chop.word + chop.word + sortify + #1 global.max$ substring$ +} + +FUNCTION {author.sort} +{ author empty$ + { key empty$ + { "to sort, need author or key in " cite$ * warning$ + "" + } + { key sortify } + if$ + } + { author sort.format.names } + if$ +} + +FUNCTION {author.editor.sort} +{ author empty$ + { editor empty$ + { key empty$ + { "to sort, need author, editor, or key in " cite$ * warning$ + "" + } + { key sortify } + if$ + } + { editor sort.format.names } + if$ + } + { author sort.format.names } + if$ +} + +FUNCTION {editor.sort} +{ editor empty$ + { key empty$ + { "to sort, need editor or key in " cite$ * warning$ + "" + } + { key sortify } + if$ + } + { editor sort.format.names } + if$ +} + +% apalike uses two sorting passes; the first one sets the +% labels so that the `a's, `b's, etc. can be computed; +% the second pass puts the references in "correct" order. 
+% The presort function is for the first pass. It computes +% label, sort.label, and title, and then concatenates. +FUNCTION {presort} +{ calc.label + label sortify + " " + * + type$ "book" = + type$ "inbook" = + or + 'author.editor.sort + { type$ "proceedings" = + 'editor.sort + 'author.sort + if$ + } + if$ + #1 entry.max$ substring$ % for + 'sort.label := % apalike + sort.label % style + * + " " + * + title field.or.null + sort.format.title + * + #1 entry.max$ substring$ + 'sort.key$ := +} + +ITERATE {presort} + +SORT % by label, sort.label, title---for final label calculation + +STRINGS { last.label next.extra } % apalike labels are only for the text; + +INTEGERS { last.extra.num } % there are none in the bibliography + +FUNCTION {initialize.extra.label.stuff} % and hence there is no `longest.label' +{ #0 int.to.chr$ 'last.label := + "" 'next.extra := + #0 'last.extra.num := +} + +FUNCTION {forward.pass} +{ last.label label = + { last.extra.num #1 + 'last.extra.num := + last.extra.num int.to.chr$ 'extra.label := + } + { "a" chr.to.int$ 'last.extra.num := + "" 'extra.label := + label 'last.label := + } + if$ +} + +FUNCTION {reverse.pass} +{ next.extra "b" = + { "a" 'extra.label := } + 'skip$ + if$ + label extra.label * 'label := + extra.label 'next.extra := +} + +EXECUTE {initialize.extra.label.stuff} + +ITERATE {forward.pass} + +REVERSE {reverse.pass} + +% Now that the label is right we sort for real, +% on sort.label then year then title. This is +% for the second sorting pass. +FUNCTION {bib.sort.order} +{ sort.label + " " + * + year field.or.null sortify + * + " " + * + title field.or.null + sort.format.title + * + #1 entry.max$ substring$ + 'sort.key$ := +} + +ITERATE {bib.sort.order} + +SORT % by sort.label, year, title---giving final bibliography order + +FUNCTION {begin.bib} +{ preamble$ empty$ % no \etalchar in apalike + 'skip$ + { preamble$ write$ newline$ } + if$ + "\begin{thebibliography}{}" write$ newline$ % no labels in apalike +} + +EXECUTE {begin.bib} + +EXECUTE {init.state.consts} + +ITERATE {call.type$} + +FUNCTION {end.bib} +{ newline$ + "\end{thebibliography}" write$ newline$ +} + +EXECUTE {end.bib} diff --git a/book.bib b/book.bib index 049e319..c1371dc 100755 --- a/book.bib +++ b/book.bib @@ -1,3 +1,14 @@ +@article{chen2012systematic, + title={Systematic evaluation of factors influencing ChIP-seq fidelity}, + author={Chen, Yiwen and Negre, Nicolas and Li, Qunhua and Mieczkowska, Joanna O and Slattery, Matthew and Liu, Tao and Zhang, Yong and Kim, Tae-Kyung and He, Housheng Hansen and Zieba, Jennifer and others}, + journal={Nature methods}, + volume={9}, + number={6}, + pages={609--614}, + year={2012}, + publisher={Nature Publishing Group} +} + @article{mermel2011gistic2, title={GISTIC2. 
0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers}, author={Mermel, Craig H and Schumacher, Steven E and Hill, Barbara and Meyerson, Matthew L and Beroukhim, Rameen and Getz, Gad}, diff --git a/chicago-manual.csl b/chicago-manual.csl new file mode 100644 index 0000000..bcf4f78 --- /dev/null +++ b/chicago-manual.csl @@ -0,0 +1,656 @@ + + diff --git a/images/CompGen2019_A3_v2_final.png b/images/CompGen2019_A3_v2_final.png deleted file mode 100644 index 2ef1ddc..0000000 Binary files a/images/CompGen2019_A3_v2_final.png and /dev/null differ diff --git a/images/dedication.pdf b/images/dedication.pdf index e5b2e0a..7a9c697 100755 Binary files a/images/dedication.pdf and b/images/dedication.pdf differ diff --git a/images/dedicationOld.pdf b/images/dedicationOld.pdf new file mode 100644 index 0000000..3c3c9b1 Binary files /dev/null and b/images/dedicationOld.pdf differ diff --git a/index.Rmd b/index.Rmd index b8b3433..c52c0c3 100644 --- a/index.Rmd +++ b/index.Rmd @@ -6,7 +6,8 @@ knit: "bookdown::render_book" documentclass: krantz classoption: numberinsequence,krantz2 bibliography: [book.bib] -biblio-style: apalike +biblio-style: [apalike.bst] +csl: chicago-manual.csl link-citations: yes colorlinks: yes fontsize: 12pt @@ -46,66 +47,65 @@ lapply(c('citr', 'formatR', 'svglite'), function(pkg) { knitr::include_graphics('images/cover.jpg', dpi = NA) ``` -The aim of this book is to provide the fundamentals for data analysis for genomics. We developed this book based on the computational genomics courses we are giving every year. We have had invariably an interdisciplinary audience with backgrounds from physics, biology, medicine, math, computer science or other quantitative fields. We want this book to be a starting point for computational genomics students and a guide for further data analysis in more specific topics in genomics. This is why we tried to cover a large variety of topics from programming to basic genome biology. As the field is interdisciplinary, it requires different starting points for people with different backgrounds. A biologist might skip sections on basic genome biology and start with R programming whereas a computer scientist might want to start with genome biology. In the same manner, a more experienced person might want to refer to this book when s/he needs to do a certain type of analysis where s/he does not have prior experience. +The aim of this book is to provide the fundamentals for data analysis for genomics. We developed this book based on the computational genomics courses we are giving every year. We have had invariably an interdisciplinary audience with backgrounds from physics, biology, medicine, math, computer science or other quantitative fields. We want this book to be a starting point for computational genomics students and a guide for further data analysis in more specific topics in genomics. This is why we tried to cover a large variety of topics from programming to basic genome biology. As the field is interdisciplinary, it requires different starting points for people with different backgrounds. A biologist might skip sections on basic genome biology and start with R programming, whereas a computer scientist might want to start with genome biology. In the same manner, a more experienced person might want to refer to this book when needing to do a certain type of analysis, but having no prior experience. 
![Creative Commons License](images/by-nc-sa.png) The online version of this book is licensed under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-nc-sa/4.0/). ## Who is this book for? {-} -The book contains practical and theoretical aspects for computational genomics. Biology and +The book contains practical and theoretical aspects of computational genomics. Biology and medicine generate more data than ever before. Therefore, we need to educate more people with data analysis skills and understanding of computational genomics. Since computational genomics is interdisciplinary, this book aims to be accessible for biologists, medical scientists, computer scientists and people from other quantitative backgrounds. We wrote this book for the following audiences: - Biologists and medical scientists who generate the data and are keen on analyzing it themselves. - - Students and researchers who are formally starting to do research on or using computational genomics but do not have extensive domain specific knowledge but have at least a beginner level understanding in a quantitative field: math, stats - - Experienced researchers looking for recipes or quick how-tos to get started in specific data analysis tasks related to computational genomics. + - Students and researchers who are formally starting to do research on or using computational genomics and do not have extensive domain-specific knowledge, but have at least a beginner-level understanding of a quantitative field, for example, math or stats. + - Experienced researchers looking for recipes or quick how-to's to get started in specific data analysis tasks related to computational genomics. ### What will you get out of this? {-} -This resource describes the skills and provides how-tos that will help readers +This resource describes the skills and provides how-to's that will help readers analyze their own genomics data. After reading: - If you are not familiar with R, you will get the basics of R and dive right in to specialized uses of R for computational genomics. -- You will understand genomic intervals and operations on them, such as overlap -- You will be able to use R and its vast package library to do sequence analysis: Such as calculating GC content for given segments of a genome or find transcription factor binding sites -- You will be familiar with visualization techniques used in genomics, such as heatmaps, meta-gene plots and genomic track visualization -- You will be familiar with supervised and unsupervised learning techniques which are important in data modelling and exploratory analysis of high-dimensional data -- You will be familiar with analysis of different high-throughput sequencing data -sets mostly using R based tools. +- You will understand genomic intervals and operations on them, such as overlap. +- You will be able to use R and its vast package library to do sequence analysis, such as calculating GC content for given segments of a genome or finding transcription factor binding sites. +- You will be familiar with visualization techniques used in genomics, such as heatmaps, meta-gene plots, and genomic track visualization. +- You will be familiar with supervised and unsupervised learning techniques which are important in data modeling and exploratory analysis of high-dimensional data. +- You will be familiar with analysis of different high-throughput sequencing datasets (RNA-seq, ChIP-seq, BS-seq and multi-omics integration) mostly using R-based tools.
 ## Structure of the book {-}

 The book is designed with the idea that practical and conceptual
-understanding of data analysis methods is as important, if not more important, than the theoretical understanding, such as detailed derivation of equations in statistics or machine learning. That is why we first try to give a conceptual explanation of the concepts then we try to give essential parts of the mathematical formulas for more detailed understanding. In this spirit, we always show the code and
-explain the code for a particular data analysis task. In addition, we give additional references such as books, websites , video lectures and scientific papers for readers who desire to gain deeper theoretical understanding of data analysis related methods or concepts.
+understanding of data analysis methods is as important, if not more important, than the theoretical understanding, such as detailed derivation of equations in statistics or machine learning. That is why we first try to give a conceptual explanation of the concepts, and then the essential parts of the mathematical formulas for a more detailed understanding. In this spirit, we always show and
+explain the code for a particular data analysis task. We also give additional references such as books, websites, video lectures and scientific papers for readers who desire to gain a deeper theoretical understanding of data analysis-related methods or concepts.

-Chapter \@ref(intro): "Introduction to Genomics" introduces the basic concepts in genome biology and genomics. Understanding these concepts are important for computational genomics.
+Chapter \@ref(intro): "Introduction to Genomics" introduces the basic concepts in genome biology and genomics. Understanding these concepts is important for computational genomics.

-Chapter \@ref(Rintro): "Introduction to R for Genomic Data Analysis" provides basic R skills necessary to follow the book in addition to common data analysis paradigms we observe in genomic data analysis. Chapter \@ref(stats): "Statistics for Genomics", chapter \@ref(unsupervisedLearning): "Exploratory Data Analysis with Unsupervised Machine Learning" and Chapter \@ref(supervisedLearning): "Predictive Modeling with Supervised Machine Learning" introduce the necessary quantitative skills that one might need when analyzing high-dimensional genomics data.
+Chapter \@ref(Rintro): "Introduction to R for Genomic Data Analysis" provides the basic R skills necessary to follow the book, in addition to common data analysis paradigms we observe in genomic data analysis. Chapter \@ref(stats): "Statistics for Genomics", Chapter \@ref(unsupervisedLearning): "Exploratory Data Analysis with Unsupervised Machine Learning" and Chapter \@ref(supervisedLearning): "Predictive Modeling with Supervised Machine Learning" introduce the necessary quantitative skills that one will need when analyzing high-dimensional genomics data.

-Chapter \@ref(genomicIntervals): "Operations on Genomic Intervals and Genome Arithmetic" introduces the fundamental tools for dealing with genomic intervals and their relationship to each other over the genome. In addition, the chapter introduces a variety of ways used in genomic data visualization. The skills introduced in this chapter are key skills that are needed to work with processed genomic data which are available through public databases such as Ensembl and UCSC browser.
+Chapter \@ref(genomicIntervals): "Operations on Genomic Intervals and Genome Arithmetic" introduces the fundamental tools for dealing with genomic intervals and their relationship to each other over the genome. In addition, the chapter introduces a variety of genomic data visualization methods. The skills introduced in this chapter are key skills that are needed to work with processed genomic data, which are available through public databases such as Ensembl and the UCSC browser.

-The next chapters deals with specific analysis of high-throughput sequencing data and integrating different kinds of data sets. Chapter \@ref(processingReads): "Quality Check, Processing and Alignment of High-throughput Sequencing Reads" introduces quality checks that need to be done on sequencing reads and different ways to process them further. The chapters \@ref(rnaseqanalysis), \@ref(chipseq) and \@ref(bsseq) deals with RNA-seq analysis, ChIP-seq analysis and BS-seq analysis. The last chapter, Chapter \@ref(multiomics):"Multi-omics Analysis" deals with methods for integrating multiple omics data sets.
+The next chapters deal with specific analysis of high-throughput sequencing data and integrating different kinds of datasets. Chapter \@ref(processingReads): "Quality Check, Processing and Alignment of High-throughput Sequencing Reads" introduces quality checks that need to be done on sequencing reads and different ways to process them further. Chapters \@ref(rnaseqanalysis), \@ref(chipseq) and \@ref(bsseq) deal with RNA-seq analysis, ChIP-seq analysis and BS-seq analysis. The last chapter, Chapter \@ref(multiomics): "Multi-omics Analysis", deals with methods for integrating multiple omics datasets.

-Most chapters have exercises that reinforces some of the important points introduced in the chapters. The exercises are classified into "Beginner", "Intermediate" and "Advanced" categories. If you are well versed in a certain subject you might want to skip "Beginner" level exercises.
+Most chapters have exercises that reinforce some of the important points introduced in the chapters. The exercises are classified into beginner, intermediate and advanced categories. If you are well versed in a certain subject, you might want to skip beginner-level exercises.

 To sum it up, this book is a comprehensive guide for computational genomics. Some sections are there for the sake of the wide interdisciplinary audience and completeness, and not all sections will be equally useful to all readers of this broad audience.

 ## Software information and conventions {-}

-Package names and inline code and file names are formatted in a typewriter font (e.g., `methylKit`). Function names are followed by parentheses (e.g., `genomation::ScoreMatrix()`). The double-colon operator `::` means accessing an object from a package.
+Package names, inline code and file names are formatted in a typewriter font (e.g. `methylKit`). Function names are followed by parentheses (e.g. `genomation::ScoreMatrix()`). The double-colon operator `::` means accessing an object from a package.

 ### Assignment operator convention {-}

 Traditionally, `<-` is the preferred assignment operator. However, throughout the book we use `=` and `<-` as the assignment operator interchangeably.
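+
+To make these conventions concrete, here is a small illustrative snippet; the numbers and object names are made up and do not come from any chapter.
+
+```{r, conventionSketch, eval=FALSE}
+# The two assignment operators are used interchangeably in the book:
+gc.content = 0.41    # assignment with "="
+gc.content <- 0.41   # assignment with "<-"
+
+# The double-colon operator accesses a function from a specific package
+# without attaching it; here with a base R package purely for illustration:
+stats::median(c(1, 5, 9))
+```
+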
 ### Packages needed to run the book code {-}

-This book is primarily about using R packages to analyze genomics data, therefore if you want to reproduce the analysis in this book you need to install the relevant packages in each chapter using `install.packages` or `BiocManager::install` functions. In each chapter, we load the necessary packages with the `library()` or `require()` function when we use the needed functions from respective packages. By looking at that calls, you can see which packages are needed for that code chunk or chapter. If you need to install all the package dependencies for the book, you can run the following command and have a cup of tea while waiting.
+This book is primarily about using R packages to analyze genomics data; therefore, if you want to reproduce the analysis in this book, you need to install the relevant packages in each chapter using the `install.packages()` or `BiocManager::install()` functions. In each chapter, we load the necessary packages with the `library()` or `require()` function when we use the needed functions from the respective packages. By looking at these calls, you can see which packages are needed for a given code chunk or chapter. If you need to install all the package dependencies for the book, you can run the following command and have a cup of tea while waiting.

 ```{r,installAllPackages,eval=FALSE}
 if (!requireNamespace("BiocManager", quietly = TRUE))
     install.packages("BiocManager")
@@ -130,31 +130,65 @@ BiocManager::install(c('qvalue','plot3D','ggplot2','pheatmap','cowplot',
 ## Data for the book {-}

-We rely on data from different R and Bioconductor packages through out the book. For the datasets that do not ship with those packages, we created our own package [**compGenomRData**](https://github.com/compgenomr/compGenomRData). You can install this package via `devtools::install_github("compgenomr/compGenomRData")`. We use `system.file()` function to get the path to the files. We noticed many inexperienced users are confused about this function. This function just outputs full path to the file that is installed with the data package.
+We rely on data from different R and Bioconductor packages throughout the book. For the datasets that do not ship with those packages, we created our own package [**compGenomRData**](https://github.com/compgenomr/compGenomRData). You can install this package via `devtools::install_github("compgenomr/compGenomRData")`. We use the `system.file()` function to get the path to the files. We noticed that many inexperienced users are confused by this function. This function simply outputs the full path to a file that is installed with the data package (see the short sketch below).

 ## Exercises in the book {-}

-There are a set of exercises at the end of each chapter. The exercises are
+There is a set of exercises at the end of each chapter. The exercises are
 separated in thematic sections that follow the major sections in the chapter.
 In addition, each exercise is classified based on its difficulty as "Beginner",
-"Intermediate" and "Advanced". Beginner level exercises can be usually done
-by refactoring the code in the chapter. Advanced level exercises usually requires
-combination of code from different sections or chapters. The intermediate level
+"Intermediate" and "Advanced". Beginner-level exercises can usually be done
+by refactoring the code in the chapter. Advanced-level exercises usually require
+a combination of code from different sections or chapters. The intermediate level
 is somewhere in between.
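+
+Coming back to the `system.file()` function from the "Data for the book" section above, here is a minimal, hypothetical sketch of how it is typically used. The file name below is a placeholder, not necessarily a file that ships with **compGenomRData**.
+
+```{r, dataPathSketch, eval=FALSE}
+# Install the data package once (needs the devtools package):
+devtools::install_github("compgenomr/compGenomRData")
+
+# system.file() only returns the full path of a file inside an installed
+# package; it returns "" if no such file exists. The file name is made up.
+fpath <- system.file("extdata", "my_example_file.txt",
+                     package = "compGenomRData")
+fpath
+```
+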
 The solutions to the exercises are available at https://github.com/compgenomr/exercises.
+
+## Reproducibility statement {-}
+
+This book is compiled with R `r getRversion()` and the following packages. We list only the main packages and their versions, not their dependencies.
+
+```{r reproStatement, echo=FALSE, eval=TRUE, collapse = TRUE}
+# Main packages used throughout the book; dependencies are not listed.
+packagesUsed=c('qvalue','plot3D','ggplot2','pheatmap','cowplot',
+               'cluster', 'NbClust', 'fastICA', 'NMF','matrixStats',
+               'Rtsne', 'mosaic', 'knitr', 'genomation',
+               'ggbio', 'Gviz', 'DESeq2', 'RUVSeq',
+               'gProfileR', 'ggfortify', 'corrplot',
+               'gage', 'EDASeq', 'citr', 'formatR',
+               'svglite', 'Rqc', 'ShortRead', 'QuasR',
+               'methylKit','FactoMineR', 'iClusterPlus',
+               'enrichR','caret','xgboost','glmnet',
+               'DALEX','kernlab','pROC','nnet','RANN',
+               'ranger','GenomeInfoDb', 'GenomicRanges',
+               'GenomicAlignments', 'ComplexHeatmap', 'circlize',
+               'rtracklayer','tidyr',
+               'AnnotationHub', 'GenomicFeatures', 'normr',
+               'MotifDb', 'TFBSTools', 'rGADEM', 'JASPAR2018',
+               'BSgenome.Hsapiens.UCSC.hg38',
+               'BSgenome.Hsapiens.UCSC.hg19')
+# Record each installed package as "name_version".
+pVer=sapply(packagesUsed, function(x) paste(x, packageVersion(x), sep="_"))
+names(pVer)=c()
+# Print four package versions per line, separated by " | ".
+for(i in seq(1, length(pVer), 4)){
+  my.end=i+3
+  if( (i+3) > length(pVer) ){
+    my.end=length(pVer)
+  }
+  cat(pVer[i:my.end], sep=" | ")
+  cat("\n")
+}
+```

 ## Acknowledgements {-}

-I wish to thank R and Bioconductor community for developing and maintaining libraries for genomic data analysis. Without their constant work and dedication, writing such a book will not be possible.
+I wish to thank the R and Bioconductor communities for developing and maintaining libraries for genomic data analysis. Without their constant work and dedication, writing such a book would not have been possible.

-I also wish to thank all the past and present mentors, colleagues and employers.
-The interaction with them provided the motivation to write such as book, and organize and teach hands-on courses on computational genomics.
+I also wish to thank all my past and present mentors, colleagues and employers.
+The interaction with them provided the motivation to write such a book, and to organize and teach hands-on courses on computational genomics.

 I wish to thank John Kimmel, the editor from Chapman & Hall/CRC, who helped me publish this book. It was a pleasure to work with him. He generously agreed to let me keep the online version of this book, so I can continue updating it after it is printed.

-This has been a long journey for me. I started writing parts of this book as early as 2013. If it wasn't for Vedran Franke, Bora Uyar and Jonathan Ronen it would have taken even longer. They kindly agreed to contribute the missing chapters and they did a great job. I am thankful for their contributions.
+This has been a long journey for me. I started writing parts of this book as early as 2013. If it wasn't for Vedran Franke, Bora Uyar and Jonathan Ronen, it would have taken even longer. They kindly agreed to contribute the missing chapters and they did a great job. I am thankful for their contributions.

-The following people kindly contributed fixes for typos and code, and various suggestions: Thomas Schalch, Alex Gosdschan, Rodrigo Ogava, Fei Zhao, Janathan Kitt, Janani Ravi, Christian Schudoma, Samuel Sledzieski and Dania Hamo, Sarvesh Nikumbh.
+The following people kindly contributed fixes for typos and code, and various suggestions: Thomas Schalch, Alex Gosdschan, Rodrigo Ogava, Fei Zhao, Jonathan Kitt, Janani Ravi, Christian Schudoma, Samuel Sledzieski, Dania Hamo and Sarvesh Nikumbh.

 ```{block2, type='flushright', html.tag='p'}
 Altuna Akalin
diff --git a/latex/before_body.tex b/latex/before_body.tex
index 094ad3d..a33bf22 100755
--- a/latex/before_body.tex
+++ b/latex/before_body.tex
@@ -3,7 +3,7 @@
 %\cleardoublepage\newpage
 \thispagestyle{empty}
 \begin{center}
-\includegraphics{images/dedication.pdf}
+\includegraphics{images/dedicationOld.pdf}
 \end{center}
 \setlength{\abovedisplayskip}{-5pt}
diff --git a/taylor-and-francis.csl b/taylor-and-francis.csl
new file mode 100644
index 0000000..16f8b6e
--- /dev/null
+++ b/taylor-and-francis.csl
@@ -0,0 +1,534 @@
+