-
Notifications
You must be signed in to change notification settings - Fork 287
/
01-intro2Genomics.Rmd
832 lines (676 loc) · 47.3 KB
/
01-intro2Genomics.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
\mainmatter
```{r setupintroGenomics, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
message = FALSE,
error = FALSE,
cache = TRUE,
out.width = "55%",
fig.width = 5,
fig.align = 'center')
```
# Introduction to Genomics {#intro}
The aim of this chapter is to provide the reader with some of the fundamentals
required for
understanding genome biology. By no means, is this a complete overview of the
subject, but just a summary that will help the non-biologist reader understand
the recurring biological concepts in computational genomics. Readers that are
well-versed in genome biology and modern genome-wide quantitative assays should
feel
free to skip this chapter or skim it through.
## Genes, DNA and central dogma
A central concept that will come up again and again is "the gene".
Before we can explain that, we need to
introduce a few other concepts that are important to understand the gene concept.
The human body is made up of billions of cells. These cells specialize in different
tasks. For example, in the liver there are cells that help produce enzymes
to break toxins. In the heart, there are specialized muscle cells that make
the heart beat. Yet, all these different kinds of cells come from a single-celled
embryo. All the instructions to make different kinds of cells are contained
within that single cell and with every division of that cell,
those instructions
are transmitted to new cells. These instructions can be coded into a string -- a
molecule of DNA, a polymer made of recurring units called nucleotides. The four
nucleotides in DNA molecules, Adenine, Guanine, Cytosine and Thymine (coded as
four letters: A, C, G, and T) in a specific sequence, store the information for
life. DNA is organized in a double-helix form where two complementary polymers
interlace with each other and twist into the familiar helical shape.
### What is a genome?
The full DNA sequence of an organism, which contains all the
hereditary information, is called a genome. The genome contains all the information
to build and maintain an organism. Genomes come in different sizes and structures.
Our genome is not only a naked stretch of DNA.
In eukaryotic cells, DNA is wrapped around proteins (histones) \index{histone} forming higher-order structures like
nucleosomes which make up chromatins \index{chromatin} and chromosomes (see Figure \@ref(fig:chromatinChr)).
```{r,chromatinChr,fig.cap="Chromosome structure in animals.",fig.align = 'center',out.width='60%',echo=FALSE}
knitr::include_graphics("images/chromatinChr.png" )
```
There might be several chromosomes \index{chromosome}
depending on the organism. However, in some species (such as most prokaryotes)
DNA is stored in a circular form. The size of the genome between species differs too.
The human genome has 46 chromosomes and over 3 billion base-pairs, whereas the wheat genome
has 42 chromosomes and 17 billion base-pairs; both genome size and chromosome numbers
are variable between different organisms. Genome sequences of organisms are
obtained using sequencing technology. With this technology, fragments of the
DNA sequence from the genome, called reads, are obtained.
Larger chunks of the genome sequence
are later obtained by stitching the
initial fragments to larger ones by using the overlapping reads. The latest
sequencing technologies made genome sequencing cheaper and faster. These
technologies output more reads, longer reads and more accurate reads.
The estimated cost
of the first human genome was $300 million in 1999--2000; today a high-quality human genome
can be obtained for $1500. Since the costs are going down, researchers and clinicians
can generate more data. This drives up the costs for data storage and also drives
up the demand for qualified people to analyze genomic data. This was one of the
motivations behind writing this book.
### What is a gene?
In the genome, there are specific regions containing the precise information that \index{gene}
encodes for physical products of genetic information. A region in the genome
with this information is traditionally called a “gene”. However, the
precise definition of the gene is still developing. According to the classical
textbooks in molecular biology, a gene is a segment of a DNA sequence
corresponding to a single protein or to a single catalytic and structural RNA
molecule [@Albe_2002_book]. A modern definition is: "A region (or regions) that includes all
of the sequence elements necessary to encode a functional transcript" [@eilbeck2005sequence]. No
matter how variable the definitions are, all agree on the fact that genes are
basic units of heredity in all living organisms.
All cells use their hereditary
information in the same way most of the time; the DNA is replicated to
transfer the information to new cells. If activated, the genes are transcribed into
messenger RNAs (mRNAs) \index{mRNA} in the nucleus (in eukaryotes), followed by mRNAs (if the
gene is protein coding) getting translated into proteins in the cytoplasm. This is
essentially a process of information transfer between information-carrying
polymers; DNA, RNA and proteins, known as the “central dogma” \index{central dogma} of molecular
biology (see Figure \@ref(fig:CentDog) for a summary).
Proteins are essential elements for life.
The growth and repair, functioning and structure of all living cells depend on them.
This is why the gene is a central
concept in genome biology, because a gene can encode information for proteins and other
functional molecules. How genes are controlled and activated dictates
everything about an organism. From the identity of a cell to response to
an infection, how cells develop and behave against certain stimuli is governed
by the activity of the genes and the functional molecules they encode. The liver cell becomes a liver cell because certain
genes are activated and their functional products are produced to help the liver
cell achieve its tasks.
```{r,CentDog,fig.cap="Central Dogma: replication, transcription, translation",fig.align = 'center',out.width='100%',echo=FALSE}
knitr::include_graphics("images/centDogma.png" )
```
### How are genes controlled? Transcriptional and post-transcriptional regulation
In order to answer this question, we have to dig a little deeper into the
transcription concept we introduced via the central dogma.
The first step in a process of information transfer - the production of an RNA \index{gene regulation}
copy of a part of the DNA sequence - is called transcription. This task is
carried out by the RNA polymerase enzyme. RNA polymerase-dependent initiation of
transcription is enabled by the existence of a specific region in the sequence
of DNA - a core promoter. Core promoters are regions of DNA that promote
transcription and are found upstream from the start site of transcription. In
eukaryotes, several proteins, called general transcription factors, recognize and
bind to core promoters and form a pre-initiation complex. RNA polymerases
recognize these complexes and initiate synthesis of RNAs, the
polymerase travels along the template DNA and makes an RNA
copy [@hager2009transcription]. After mRNA is
produced it is often spliced by spliceosome. The sections, called 'introns', are removed and
sections called 'exons' left in. Then, the remaining mRNA is translated into proteins. Which exons
will be part of the final mature transcript can also be regulated and creates
diversity in protein structure and function (See Figure \@ref(fig:TransSplice)).
```{r,TransSplice,fig.cap="Transcription can be followed by splicing, which creates different transcript isoforms. This will in return create different protein isoforms since the information required to produce the protein is encoded in the transcripts. Differences in transcripts of the same gene can give rise to different protein isoforms.",fig.align = 'center',out.width='70%',ref.label='TransSplice',echo=FALSE}
knitr::include_graphics("images/TransSplice.png" )
```
Contrary to protein coding genes, non-coding RNA (ncRNAs)
genes are processed and assume their functional structures after transcription
and without going into translation, hence the name: non-coding RNAs. Certain
ncRNAs can also be spliced but still not translated. ncRNAs and
other RNAs in general can form complementary base-pairs within the RNA molecule
which gives them additional complexity. This self-complementarity-based
structure, termed the RNA secondary structure, is often necessary for functions of many
ncRNA species.
In summary, the set of processes, from transcription initiation to production
of the functional product, is referred to as gene expression.
Gene expression quantification and regulation is a fundamental topic in
genome biology.
### What does a gene look like?
Before we move forward, it will be good to discuss how we can visualize genes. \index{gene}
As someone interested in computational genomics, you will frequently encounter
a gene on a computer screen, and how it is represented on the computer will be
equivalent to what you imagine when you hear the word "gene".
In the online databases, the genes will appear as a sequence of letters
or as a series of connected boxes showing exon-intron structure, which may
include the direction of transcription as well (see Figure \@ref(fig:RealGene)). You will encounter more with the latter, so this is likely what will
pop into your mind when you think of genes.
As we have mentioned, DNA has two strands. A gene can be located
on either of them, and the direction of transcription will depend on that. In the
Figure \@ref(fig:RealGene), you can see arrows on introns (lines connecting boxes) indicating the
direction of the gene.
```{r,RealGene,fig.cap="A) Representation of a gene in the UCSC browser. Boxes indicate exons, and lines indicate introns. B) Partial sequence of FATE1 gene as shown in the NCBI GenBank database.",fig.align = 'center',out.width='80%',ref.label='RealGene',echo=FALSE}
knitr::include_graphics("images/RealGene.png" )
```
## Elements of gene regulation
The mechanisms regulating gene expression \index{gene regulation} are essential for
all living organisms as they dictate where and how much of a gene product (it may
be protein or ncRNA) should be manufactured. This regulation could occur
at the pre- and co-transcriptional level by controlling how many transcripts should be produced and/or which version of the transcript should be produced by regulating
splicing. The same gene could encode for different versions of the same protein via
splicing regulation.This process defines which parts of the gene will go into the final
mRNA that will code for the protein variant.
In addition, gene products can be regulated post-transcriptionally where certain
molecules bind to RNA and mark them for degradation even before they can be used
in protein production.
Gene regulation drives cellular differentiation; a
process during which different tissues and cell types are produced. It also
helps cells maintain differentiated states of cells/tissues. As a result of
this process, at the final stage of differentiation, different kinds of cells
maintain different expression profiles, although they contain the same genetic
material. As mentioned above, there are two main types of regulation and next we
will provide information on those.
### Transcriptional regulation
The rate of transcription initiation is the primary regulatory element in gene
expression regulation. The rate is controlled by core promoter elements as well as
distant-acting regulatory elements such as enhancers. On top of that, processes
like histone modifications and/or DNA methylation have a crucial regulatory
impact on transcription. If a region is not accessible for the transcriptional
machinery, e.g. in the case where the chromatin structure is compacted due to the
presence of specific histone modifications, or if the promoter DNA is
methylated, transcription may not start at all. Last but not least, gene
activity is also controlled post-transcriptionally by ncRNAs such as microRNAs
(miRNAs), as well as by cell signaling, resulting in protein modification or
altered protein-protein interactions.
#### Regulation by transcription factors through regulatory regions
Transcription factors are proteins that \index{transcription factors (TFs)}
recognize a specific DNA motif to bind on a regulatory region and regulate the transcription rate of \index{DNA motif}
the gene associated with that regulatory region (see Figure \@ref(fig:regSummary)
for an illustration). These factors bind to a variety of regulatory regions summarized in
Figure \@ref(fig:regSummary), and their concerted action
controls the transcription rate. Apart from their binding preference, their
concentration, and the availability of synergistic or competing transcription
factors will also affect the transcription rate.
```{r,regSummary,fig.cap="Representation of regulatory regions in animal genomes",fig.align = 'center',out.width='70%',ref.label='regSummary',echo=FALSE}
knitr::include_graphics("images/regulationSummary.png" )
```
##### Core and proximal promoters
Core promoters are the immediate neighboring regions around \index{promoter}
the transcription start site (TSS) \index{transcription start site (TSS)} that serve as a docking site for the
transcriptional machinery and pre-initiation complex (PIC) assembly. The
textbook model for transcription initiation is as follows: The core promoter has
a TATA motif (referred as TATA-box) 30 bp upstream of an initiator sequence
(Inr), which also contains TSS. Firstly, transcription factor TFIID binds to the
TATA-box. Next, general transcription factors are recruited and transcription is
initiated on the initiator sequence. Apart from the
TATA-box and Inr, there are a number of sequence elements on the animal core
promoters that are associated with transcription initiation and PIC assembly,
such as downstream promoter elements (DPEs), the BRE elements and CpG islands.
DPEs are found 28-32 bp downstream of the TSS in TATA-less promoters of
_Drosophila melanogaster_. They generally co-occur with the Inr element, and are
thought to have a similar function to the TATA-box. The BRE element is
recognized by the TFIIB protein and lie upstream of the TATA-box. CpG islands
are CG dinucleotide-enriched segments of vertebrate genomes, despite the general
depletion of CG dinucleotides in those genomes. 50 to 70% of promoters in the
human genome are associated with CpG islands.
Proximal promoter elements are typically right upstream
of the core promoters, usually contain binding sites for activator
transcription factors, and
provide additional control over gene expression.
##### Enhancers
Proximal regulation is not the only\index{enhancer} or the most important mode
of gene regulation. Most of the transcription factor binding sites in
the human genome are found in intergenic regions or in introns.
This indicates the widespread usage of distal regulatory elements in animal
genomes. On a molecular
function level, enhancers are similar to proximal promoters; they contain binding
sites for the same transcriptional activators and they basically enhance the
gene expression. However, they are often highly modular and several of them
can affect the same promoter at the same time or in different time-points or
tissues. In addition, their activity is independent of their
orientation and their distance to the promoter they interact with. A
number of studies showed that enhancers can act upon their target genes over
several kilobases away. According to a popular
model, enhancers achieve this by looping the DNA and coming into contact with
their target genes.
##### Silencers
Silencers are similar to enhancers; however their effect is
opposite of enhancers on the transcription of the target gene, and results in
decreasing their level of transcription. They contain binding sites for
repressive transcription factors. Repressor transcription factors can either
block the binding of an activator , directly compete for the same binding site,
or induce a repressive chromatin state in which no activator binding is
possible. Silencer effects, similar to those of enhancers, are independent of
orientation and distance to target genes. In contradiction to this general view,
in _Drosophila_ there are two types of silencers, long-range and short-range.
Short-range silencers are close to promoters and long-range silencers can
silence multiple promoters or enhancers over kilobases away. Like
enhancers, silencers bound by repressors may also induce changes in DNA
structure by looping and creating higher-order structures. One class of
such repressor proteins, which is thought to initiate higher-order structures by
looping, is Polycomb group proteins (PcGs).
##### Insulators
Insulator regions limit the effect of other regulatory elements
to certain chromosomal boundaries; in other words, they create regulatory
domains untainted by the regulatory elements in regions outside that domain.
Insulators can block enhancer-promoter communication and/or prevent spreading of
repressive chromatin domains. In vertebrates and insects, some of the
well-studied insulators are bound by CTCF (CCCTC-binding factor).
Genome-wide studies from different mammalian tissues confirm that CTCF binding
is largely invariant of cell type, and CTCF \index{CTCF protein} motif locations are conserved in
vertebrates. At present, there are two models that explain the insulator
function; the most prevalent model claims insulators create physically separate
domains by modifying chromosome structure. This is thought to be achieved by
CTCF-driven chromatin looping and recent evidence shows that CTCF can induce a
higher-order chromosome structure through creating loops of chromatins.
According to the second model, an insulator-bound activator cannot bind an
enhancer; thus enhancer-blocking activity is achieved and insulators can also
recruit an active histone domain, creating an active domain for enhancers to
function.
##### Locus control regions
Locus control regions (LCRs) are clusters of
different regulatory elements that control an entire set of genes on a locus. LCRs
help genes achieve their temporal and/or tissue-specific expression programs.
LCRs may be composed of multiple cis-regulatory elements, such as insulators and
enhancers, and they act upon their targets even from long distances. However,
LCRs function in an orientation-dependent manner, for example the activity of
beta-globin LCR is lost if inverted. The mechanism of LCR function otherwise
seems similar to other long-range regulators described above. The evidence is
mounting in the direction of a model where DNA-looping creates a chromosomal
structure in which target genes are clustered together, which seems to be
essential for maintaining an open chromatin domain.
#### Epigenetic regulation
Epigenetics in biology usually refers to \index{gene regulation}
constructions (chromatin structure, DNA methylation, etc.) other than DNA \index{epigenetics}
sequence that influence gene regulation. In essence, epigenetic regulation is
the regulation of DNA packing and structure, the consequence of which is gene
expression regulation. A typical example is that DNA packing inside the nucleus can
directly influence gene expression by creating accessible regions for transcription
factors to bind.
There are two main mechanisms in
epigenetic regulation: i) DNA modifications and ii) histone modifications. Below,
we will introduce these two mechanisms.
##### DNA modifications such as methylation
DNA methylation is usually associated with gene silencing. \index{DNA methylation}
DNA methyltransferase enzyme catalyzes the addition of a methyl group to cytosine of
CpG dinucleotides (while in mammals the addition of methyl group is largely
restricted to CpG dinucleotides, methylation can occur in other bases as well). This covalent modification either interferes with transcription factor
binding on the region, or methyl-CpG binding proteins induce the spread of
repressive chromatin domains, thus the gene is silenced if its promoter has
methylated CG dinucleotides. DNA methylation usually occurs in repeat
sequences to repress transposable elements. These elements, when active, can
jump around and insert them to random parts of the genome, potentially disrupting
the genomic functions.\index{CpG island}
DNA methylation is also related to a
key core and proximal promoter element: CpG islands. CpG islands are usually
unmethylated, however, for some genes, CpG island methylation accompanies their
silenced expression. For example, during X-chromosome inactivation, many CpG
islands are heavily methylated and the associated genes are silenced. In
addition, in embryonic stem cell differentiation, pluripotency-associated genes
are silenced due to DNA methylation. Apart from methylation, there are other
kinds of DNA modifications present in mammalian genomes, such as hydroxy-methylation and
formylcytosine. These are other modifications under current research that are either
intermediate or stable modifications with distinct functional associations. There
are at least a dozen distinct DNA modifications observed when we look across
all studied species [@sood2019dnamod].
##### Histone modifications
Histones are proteins that constitute a \index{histone} nucleosome. In
eukaryotes, eight histone proteins are wrapped by DNA and make up the
nucleosome. They help super-coiling of DNA and inducing high-order structure
called chromatin. In chromatin, DNA is either densely packed (called
heterochromatin or closed chromatin), or it is loosely packed (called
euchromatin or open chromatin). Heterochromatin is thought to harbor
inactive genes since DNA is densely packed and transcriptional machinery cannot
access it. On the other hand, euchromatin is more accessible for transcriptional
machinery and might therefore harbor active genes. Histones have long and
unstructured N-terminal tails which can be covalently modified. The most studied
modifications include acetylation, methylation and phosphorylation [@strahl2000language]. Using
their tails, histones interact with neighboring nucleosomes and the
modifications on the tail affect the nucleosomes' affinity to bind DNA and
therefore influence DNA packaging around nucleosomes. Different modifications on
histones are used in different combinations to program the activity of the genes
during differentiation. Histone modifications have a distinct nomenclature, for \index{histone modification}
example: H3K4me3 means the lysine (K) on the 4th position of histone H3 is
tri-methylated.
Modifications Effect
------------ ------
H3K9ac Active promoters and enhancers
H3K14ac Active transcription
H3K4me3/me2/me1 Active promoters and enhancers,
H3K4me1 and H3K27ac is enhancer-specific
H3K27ac H3K27ac is enhancer-specific
H3K36me3 Active transcribed regions
H3K27me3/me2/me1 Silent promoters
H3K9me3/me2/me1 Silent promoters
Table: (\#tab:histoneMod) Histone modifications and their effects. If more than one histone modification has the same effect, they are separated by commas.
Histone modifications are associated with a number of different
transcription-related conditions; some of them are summarized in Table \@ref(tab:histoneMod).
Histone modifications can indicate where the regulatory regions are and they can
also indicate activity of the genes. From a gene regulatory perspective, maybe
the most important modifications are the ones associated with enhancers and
promoters.
Furthermore, certain proteins can influence chromatin structure by interacting
with histones. Some of these proteins, like those of the Polycomb Group (PcG)
and CTCF, are discussed above in the insulators and silencer sections. In
vertebrates and insects, PcGs are responsible for maintaining the silent state
of developmental genes, and trithorax group proteins (trxG) for maintaining
their active state [@henikoff2008nucleosome ; @schwartz2007polycomb]. PcGs and trxGs induce repressed or active states by
catalyzing histone modifications or DNA methylation. Both the proteins bind PREs
that can be on promoters or several kilobases away. Another protein
that induces histone modifications is CTCF. CTCF is associated with boundaries between active and repressive histone marks \index{CTCF protein} [@phillips2009ctcf]. This is due to the role of CTCF in regulating the 3D genome structure. Two CTCF binding sites that are far away from each other in linear distance can bind together in 3D space thus forming chromatin loops.
```{block2, transReg, type='rmdtip'}
__Want to know more?__
- Transcriptional regulatory elements in the human genome: http://www.ncbi.nlm.nih.gov/pubmed/16719718
- On metazoan promoters: Types and transcriptional properties:
http://www.ncbi.nlm.nih.gov/pubmed/22392219
- General principles of regulatory sequence function:
http://www.nature.com/nrg/journal/v15/n7/abs/nrg3684.html
- DNA methylation: Roles in mammalian development:
http://www.nature.com/doifinder/10.1038/nrg3354
- Histone modifications and organization of the genome:
http://www.nature.com/nrg/journal/v12/n1/full/nrg2905.html
- DNA methylation and histone modifications are linked:
http://www.nature.com/nrg/journal/v10/n5/abs/nrg2540.html
```
### Post-transcriptional regulation
#### Regulation by non-coding RNAs
Recent years have witnessed an explosion in non-coding \index{gene regulation}
RNA (ncRNA)-related research\index{non-coding RNA (ncRNA)}. Many publications implicated ncRNAs as important
regulatory elements. Plants and animals produce many
different types of ncRNAs such as long non-coding RNAs (lncRNAs),
small interferring RNAs (siRNAs), microRNAs (miRNAs), promoter-associated RNAs
(PARs) and small nucleolar RNAs (snoRNAs) [@morris2014rise]. lncRNAs are typically >200-bp long,
they are involved in epigenetic regulation by interacting with chromatin
remodeling factors and they function in gene regulation. siRNAs are short
double-stranded RNAs which are involved in gene regulation and transposon
control; they silence their target genes by cooperating with Argonaute proteins. miRNAs are short single-stranded RNA molecules that interact with their
target genes by using their complementary sequence and mark them for quicker
degradation. PARs may regulate gene expression as well: they are approximately
18-to -200-bp-long ncRNAs originating from promoters of coding genes [@morris2014rise].
snoRNAs are also shown
to play roles in gene regulation, although they are mostly believed to guide
ribosomal RNA modifications [@morris2014rise].
#### Splicing regulation
Splicing is regulated by regulatory elements on the pre-mRNA and proteins \index{gene splicing}
binding to those elements. Regulatory elements are categorized
as splicing
enhancers and repressors. They can be located either in exons
or introns. Depending on their activity and their locations there are four types of regulatory elements for splicing:
- exonic splicing enhancers (ESEs)
- exonic splicing silencers (ESSs)
- intronic splicing enhancers (ISEs)
- intronic splicing silencers (ISSs).
The majority of splicing repressors are heterogeneous nuclear ribonucleoproteins (hnRNPs). If splicing repressor protein bind
silencer elements, they reduce the chance of a nearby site being
used as a splice junction. On the contrary, splicing enhancers are sites to which splicing activator proteins bind and binding
on that region increases the probability that a nearby site will be used as a splice junction [@wang2008splicing]. Most of the activator proteins that bind to splicing enhancers are members of the SR protein family. Such proteins can recognize specific RNA recognition motifs. By regulating splicing exons can be skipped or included,
which creates protein diversity [@wang2008splicing].
```{block2, postTranscriptreg, type='rmdtip'}
__Want to know more?__
- On miRNAs, their genesis, and modes of regulation [@bartel2004micrornas]:
http://www.sciencedirect.com/science/article/pii/S0092867404000455
- Functions of non-coding RNAs [@morris2014rise]:
http://www.nature.com/nrg/journal/v15/n6/abs/nrg3722.html
- On splicing and its regulation [@wang2008splicing]: https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/18369186/
```
## Shaping the genome: DNA mutation
Human and chimpanzee genomes are 98.8% similar. The 1.2% difference is what separates \index{mutation}
us from chimpanzees. The further you move away from human species in terms of evolutionary
distance, the higher the difference gets. However, even between the members of
the same species, differences in genome sequences exist. These differences are
due to a process called mutation which drives differences between individuals
but also provides the fuel for evolution as the source of the genetic
variation. Individuals with beneficial mutations can adapt to their surroundings
better than others and in time, these mutations, which are beneficial for survival,
spread in the population due to a process called "natural selection". Selection
acts upon individuals with beneficial features, which gives them an edge for
survival in a given environment. Genetic variation created by the mutations
in individuals provides the material on which selection can act. If the
selection process goes for a long time in a relatively isolated environment
that requires adaptation, this population can evolve into a different species
given enough time. This is the basic idea behind evolution in a nutshell,
and without mutations providing the genetic variation, there would be no evolution.
Mutations in the genome occur due to multiple reasons. First, DNA replication
is not an error-free process. Before a cell division, the DNA is replicated with
1 mistake per 10^8 to 10^10 base-pairs. Second, mutagens such as UV light
can induce mutations on the genome. The third factor that contributes to mutation
is imperfect DNA repair. Every day, any human cell suffers multiple instances of DNA damage.
DNA repair enzymes are there to cope with this damage but they are also not
error-free, depending on which DNA repair mechanism is used (there are multiple),
mistakes will be made at varying rates.
Mutations are classified by how many bases they affect, their effect on
DNA structure and gene function. By their effect on DNA structure
the mutations \index{mutation} are classified as follows:
- __Base substitution__: A base is changed with another.
- __Deletion__: One or more bases is deleted.
- __Insertion__: New base or bases inserted into the genome.
- __Microsatellite mutation__: Small insertions or deletions of small tandemly
repeating DNA segments.
- __Inversion__: A DNA fragment changes its orientation 180 degrees.
- __Translocation__: A DNA fragment moves to another location in the genome.
Mutations can also be classified by their size as follows:
- __Point mutations__: Mutations that involve one base. Substitutions, deletions and
insertions are point mutations. They are also termed as single nucleotide
polymorphisms (__SNPs__).\index{SNP}
- __Small-scale mutations__: Mutations that involve several bases.
- __Large-scale mutations__: Mutations which involve larger chromosomal regions.
Transposable element insertions (where a segment of the genome
jumps to another region in the genome) and segmental duplications (a large
region is copied multiple times in tandem) are typical large scale mutations.
- __Aneuploidies__: Insertions or deletions of whole chromosomes.
- __Whole-genome polyploidies__: Duplications involving whole genome.
Mutations can be classified by their effect on gene function as follows:
- __Gain-of-function mutations__: A type of mutation in which the altered gene
product possesses a new molecular function or a new pattern of gene
expression.
- __Loss-of-function mutations__: A mutation that results in reduced or abolished
protein function. This is the more common type of mutation.
```{block2, mutations, type='rmdtip'}
__Want to know more?__
- Interactive tutorial on mutations: http://www.dnaftb.org/27/
- Tutorial on mutation and health maintained by NIH:https://ghr.nlm.nih.gov/primer#mutationsanddisorders
- Visualizations for different types of mutations: https://www.yourgenome.org/facts/what-types-of-mutation-are-there
- Review article on mutations regarding human disease: https://www.nature.com/articles/nrg3241
```
## High-throughput experimental methods in genomics
Most of the biological phenomena described above relating to transcription, gene regulation or DNA mutation can be measured over the entire genome using
high-throughput experimental techniques, which are quickly becoming
the standard for studying genome biology. In addition, their
applications in the clinic are also gaining momentum as there are already diagnostic
tests that are based on these techniques.
Some of the things that can be measured by high-throughput assays are as follows:
- _Which genes are expressed and how much?_
- _Where does a transcription factor bind?_
- _Which bases are methylated in the genome?_
- _Which transcripts are translated?_
- _Where does RNA-binding proteins bind?_
- _Which microRNAs are expressed?_
- _Which parts of the genome are in contact with each other?_
- _Where are the mutations in the genome located?_
- _Which parts of the genome are nucleosome-free?_
There are many more questions one can answer using modern genome-wide techniques
and every other day a new variant of the existing techniques comes along
to answer a new question.
However, one has to keep in mind that these methods are at varying degrees
of maturity and they all come with technical limitations
and are not noise-free. Despite this, they are extremely useful for research and clinical
purposes. And, thanks to these methods, we are able to sequence and annotate
genomes on a massive scale.
### The general idea behind high-throughput techniques
High-throughput methods aim to quantify or locate all or most of the genome that harbors
the biological feature (expressed genes, binding sites, etc.) of interest.
Most of the methods rely on some sort of enrichment of the targeted
biological feature. For example, if you want to measure expression of protein coding
genes you need to be able to extract mRNA molecules with special post-transcriptional
alterations that protein-coding genes acquire, as done in many RNA sequencing (RNA-seq) experiments\index{RNA-seq}. If you are looking for transcription factor
binding, you need to enrich
for the DNA fragments that are bound by the protein of interest, as it is done in ChIP-seq experiments. \index{ChIP-seq}This part depends on available molecular biology and chemistry techniques, and the final
product of this part is RNA or DNA fragments.
Next, you need to be able to
tell where these fragments are coming from in the genome and how many of them
there are. Microarrays \index{microarray} were the standard tool for the quantification step
until the spread of sequencing techniques. In microarrays, one had to design complementary bases, called "oligos" or "probes", to the genetic material enriched via the experimental protocol.
If the enriched material is complementary to the oligos,
a light signal will
be produced and the intensity of the signal will be proportional to the amount of the
genetic material pairing with that oligo. There will be more probes available for
hybridization (process of complementary bases forming bonds), so the more fragments
available, stronger the signal. For this to be able to work, you need to know
at least part of your genome sequence, and design probes. If you want to measure
gene expression, your probes should overlap with genes and should be unique
enough to not to bind sequences from other genes. This technology is now being
replaced with sequencing technology, where you directly sequence your genetic \index{high-throughput sequencing}
material. If you have the sequence of your fragments, you can align them back
to the genome, see where they are coming from, and count them. This is a better
technology where the quantification is based on the real identity of fragments
rather than based on hybridization to designed probes.
In summary, HT techniques have the following steps, and this also summarized in
Figure \@ref(fig:HTassays):
- Extraction: This is the step where you extract the genetic
material of interest, RNA or DNA.
- Enrichment: In this step, you enrich for the event you are interested
in. For example, protein binding sites. In some cases such as whole-genome DNA
sequencing, there is no need for enrichment step. You just get fragments of
genomic DNA and sequence them.
- Quantification: This is where you quantify your enriched material. Depending
on the protocol you may need to quantify a control set as well, where you
should see no enrichment or only background enrichment.
```{r,HTassays,fig.cap="Common steps of high-throughput assays in genome biology.",fig.align = 'center',out.width='70%',ref.label='HTassays',echo=FALSE}
knitr::include_graphics("images/HTassays.png" )
```
### High-throughput sequencing
High-throughput sequencing, or massively parallel sequencing, is a
collection of methods and technologies that can sequence DNA thousands/millions \index{high-throughput sequencing}
of fragments at a time. This is in contrast to older technologies that can
produce a limited
number of fragments at a time. Here, throughput refers to the number
of sequenced bases per hour. The older low-throughput sequencing methods have
~100 times less throughput compared to modern high-throughput methods. The
increased throughput gives the ability to measure biological features on a
genome-wide scale in a shorter time frame.
Similar to other high-throughput methods, sequencing-based methods also require
an enrichment step. This step enriches for the features we are interested in.
The main difference of the sequencing-based methods is the quantification step.
In high-throughput sequencing,
enriched fragments are put through the sequencer which outputs the sequences
for the fragments. Due to limitations in current leading technologies, only
a limited number of bases can be sequenced from the input fragments. However,
the length is usually enough to uniquely map the reads to the genome and quantify
the input fragments.
#### High-throughput sequencing data
If there is a genome available, the reads are aligned to the genome and based
on the library preparation protocol, different strategies are applied for analysis. A sequencing library is composed of fragments of RNA or DNA ready to be sequenced. The library preparation primarily depends on the experiment of interest. There are a number of library preparation protocols aimed at quantifying different signals from the genome. Some of the potential analysis strategies for different library-prep protocols and processed output of read alignments
are depicted in Figure \@ref(fig:HTseq).\index{high-throughput sequencing}
For example, we may be interested in quantifying the gene expression.
The experimental protocol, called RNA sequencing,
RNA-seq, enriches for fragments of RNA that are coming from protein coding genes.\index{RNA-seq}
Upon alignment, we can calculate the coverage profile which gives us a read count
per base along the genome. This information can be stored in a text file or
specialized file formats to be used in subsequent analysis or visualization.
We can also just count how many reads overlap with exons of each gene and record
read counts per gene for further analysis. This essentially produces a table
with gene names and read counts for different samples. As we will see in later
chapters, this is an essential information for statistical models for
RNA-seq data. Furthermore, we can stack up the reads and count how many times a base position in a read mismatches the base in the genome. Read aligners
allow for mismatches, and for this reason we can see reads with mismatches.
This information can be used to identify SNPs, and can be stored again in
a tabular format with the information of position and mismatch type and
number of reads supporting the mismatch. The original algorithms are a bit
more complicated than just counting mismatches but the general idea is the
same; what they are doing differently is trying to minimize false positive
rates by using filters, so that not every mismatch is recorded as a SNP.\index{SNP}
```{r,HTseq,fig.cap="High-throughput sequencing summary",fig.align = 'center',out.width='80%',ref.label='HTseq',echo=FALSE}
knitr::include_graphics("images/HTseq.png" )
```
#### Future of high-throughput sequencing
The sequencing technology is still evolving. Obtaining
longer single-molecule reads, and preferably, being able to call base modifications \index{high-throughput sequencing}
on the fly is the next frontier.
With longer reads, the genome assembly will be easier for the regions
that have high repeat content. With single-molecule sequencing, we will be able to
tell how many transcripts are present in a given cell population without
relying on fragment amplification methods which can introduce biases.
Another recent development is single-cell sequencing. Current technologies usually
work on genetic material from thousands to millions of cells. This means that the
results you receive represent the population of cells that were used in the
experiment. However, there is a lot of variation between the same type of cells, but
this variation is not observed at all. Newer sequencing techniques can work
on single cells and give quantitative information on each cell.
```{block2, htseqtip, type='rmdtip'}
__Want to know more?__
- Current and the future high-throughput sequencing technologies: http://www.sciencedirect.com/science/article/pii/S1097276515003408
- Illumina repository for different library preparation protocols for sequencing: http://www.illumina.com/techniques/sequencing/ngs-library-prep/library-prep-methods.html
```
## Visualization and data repositories for genomics
There are ~100 animal genomes sequenced as of 2016. On top these, there are many
research projects from either individual labs or consortia that
produce petabytes of auxiliary genomics data, such as ChIP-seq, RNA-seq, etc. \index{ChIP-seq} \index{RNA-seq}
There are two requirements to be able to visualize genomes and their associated
data: 1) you need to be able to work with a species
that has a sequenced genome and 2) you want to have annotation on that genome,
meaning, at the very least, you want to know where the genes are. Most
genomes after sequencing are quickly annotated with gene-predictions or
known gene sequences are mapped on to them, and you can also
have conservation to other species to filter functional elements. If you
are working with a model organism or human, you will also have a lot of
auxiliary information to help demarcate the functional regions such
as regulatory regions, ncRNAs, and SNPs that are common in the population.
Or you might have disease- or tissue-specific data available.
The more the organism is worked on, the more auxiliary data you will have.
#### Accessing genome sequences and annotations via genome browsers
As someone who intends to work with genomics, you will need to visualize a
large amount of data to make biological inferences or simply check regions of
interest in the genome visually. Looking at the genome case by case with all
the additional datasets is a necessary step to develop a hypothesis and understand
the data.
Many genomes and their associated data are
available through genome browsers. A genome browser
is a website or an application
that helps you visualize the genome and all the available data associated
with it. Via genome browsers\index{genome browser}, you will be able to see where genes are in
relation to each other and other functional elements. You will be
able to see gene structure. You will be able to see auxiliary data such as
conservation, repeat content and SNPs. Here we review some of the popular
browsers.
__UCSC genome browser:__ This is an online browser hosted by University of
California, Santa Cruz at http://genome.ucsc.edu/.
This is an interactive website that contains genomes \index{UCSC Genome Browser}
and annotations for many species. You can search for genes or genome coordinates
for the species of your interest. It is usually very responsive and allows
you to visualize large amounts of data. In addition, it has multiple other tools
that can be used in connection with the browser. One of the most useful tools
is the _UCSC Table Browser_, which lets you download all the data you see on the
browser, including sequence data, in multiple formats.
Users can upload data or provide links to the data
to visualize user-specific data.
__Ensembl:__ This is another online browser maintained by the European Bioinformatics Institute and the Wellcome Trust Sanger Institute in
the UK, http://www.ensembl.org.
Similar to the UCSC browser, users can visualize genes or genomic coordinates
from multiple species and it also comes with auxiliary data. Ensembl is
associated with the _Biomart_ \index{Biomart} tool which is similar to UCSC Table browser, and can
download genome data including all the auxiliary data set in multiple formats.\index{Ensembl Genome Browser}
__IGV:__ Integrated genomics viewer (IGV) is a desktop application developed
by Broad institute (https://www.broadinstitute.org/igv/). It is
developed to deal with large amounts of high-throughput sequencing data, which
is harder to view in online browsers. IGV can integrate your local sequencing
results with online annotation on your desktop machine. This is useful when
viewing sequencing data, especially alignments. Other browsers mentioned above
have similar features, however you will need to make your large
sequencing data available online somewhere before it can be viewed by browsers. \index{IGV Browser}
#### Data repositories for high-throughput assays
Genome browsers contain lots of auxiliary high-throughput data. However, there are
many more public high-throughput data sets available and they are certainly
not available through genome browsers. Normally, every high-throughput dataset
associated with a publication should be deposited in public archives. There
are two major public archives we use to deposit data. One of them is
_Gene Expression Omnibus (GEO)_ hosted at http://www.ncbi.nlm.nih.gov/geo/,
and the other one is _European Nucleotide Archive (ENA)_ hosted at
http://www.ebi.ac.uk/ena. These repositories accept high-throughput datasets
and users can freely download and use these public data sets for their
own research. Many data sets in these repositories are in their raw format,
for example, the format the sequencer provides mostly. Some data sets will also
have processed data but that is not a norm.
Apart from these repositories, there are multiple multi-national consortia dedicated to certain genome biology or disease-related problems and
they maintain their own databases and provide access to processed and raw data.
Some of these consortia are mentioned below.
Consortium | What is it for?
------------------- | -------------------------------------
[ENCODE](https://www.encodeproject.org/) | Transcription factor binding sites, gene expression and epigenomics data for cell lines | https://www.encodeproject.org/
[Epigenomics Roadmap](http://www.roadmapepigenomics.org/) | Epigenomics data for multiple cell types |
[The Cancer Genome Atlas (TCGA)](http://cancergenome.nih.gov/) | Expression, mutation and epigenomics data for multiple cancer types
[1000 genomes project](http://www.1000genomes.org/) | Human genetic variation data obtained by sequencing 1000s of individuals
\index{ENCODE}
\index{Epigenomics Roadmap}
\index{The Cancer Genome Atlas (TCGA)}