forked from bioinformatics-core-shared-training/RNAseq-R
-
Notifications
You must be signed in to change notification settings - Fork 2
/
rna-seq-preprocessing.Rmd
537 lines (342 loc) · 24.6 KB
/
rna-seq-preprocessing.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
---
title: "RNA-seq analysis in R"
author: "Mark Dunning"
date: "February 2020"
output:
html_notebook:
toc: yes
toc_float: yes
html_document:
toc: yes
toc_float: yes
minutes: 300
layout: page
subtitle: Pre-processsing RNA-seq data
---
**Original Authors: Belinda Phipson, Anna Trigos, Matt Ritchie, Maria Doyle, Harriet Dashnow, Charity Law**, **Stephane Ballereau, Oscar Rueda, Ashley Sawle**
Based on the course [RNAseq analysis in R](http://combine-australia.github.io/2016-05-11-RNAseq/) delivered on May 11/12th 2016 and modified by Cancer Research Uk Cambridge Centre for the [Functional Genomics Autumn School 2017](https://bioinformatics-core-shared-training.github.io/cruk-autumn-school-2017/)
```{r knitrOpts, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Resources and data files
This material has been created using the following resources:
- http://www.statsci.org/smyth/pubs/QLedgeRPreprint.pdf [@Lun2016]
- http://monashbioinformaticsplatform.github.io/RNAseq-DE-analysis-with-R/99-RNAseq_DE_analysis_with_R.html
- http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html
- https://bioconductor.github.io/BiocWorkshops/rna-seq-data-analysis-with-deseq2.html
## Overview
* Reading in table of counts
* Filtering lowly expressed genes
* Quality control
* Normalisation for composition bias
## Introduction
Measuring gene expression on a genome-wide scale has become common practice over the last two decades or so, with microarrays predominantly used pre-2008. With the advent of next generation sequencing technology in 2008, an increasing number of scientists use this technology to measure and understand changes in gene expression in often complex systems. As sequencing costs have decreased, using RNA-Seq to simultaneously measure the expression of tens of thousands of genes for multiple samples has never been easier. The cost of these experiments has now moved from generating the data to storing and analysing it.
There are many steps involved in analysing an RNA-Seq experiment. Analysing an RNAseq experiment begins with sequencing reads. Traditionally, these are aligned to a reference genome, then the number of reads mapped to each gene can be counted. More modern approaches such as `salmon` quantify transcripts directly and do not require genome aligment to have taken place. Either approach results in a table of counts, which is what we perform statistical analyses on in R. While mapping and counting are important and necessary tasks, this session will be starting from the count data and getting stuck into analysis.
We will be following a workflow that uses the `DESeq2` package. An alternative and well-respected workflow is based on the [edgeR and limma packages](https://bioconductor.github.io/BiocWorkshops/rna-seq-analysis-is-easy-as-1-2-3-with-limma-glimma-and-edger.html).
### Mouse mammary gland dataset
The data for this tutorial comes from a Nature Cell Biology paper, [*EGF-mediated induction of Mcl-1 at the switch to lactation is essential for alveolar cell survival*](http://www.ncbi.nlm.nih.gov/pubmed/25730472) [@Fu2015].
This study examines the expression profiles of basal stem-cell enriched cells (B) and committed luminal cells (L) in the mammary gland of virgin, pregnant and lactating mice. Six groups are present, with one for each combination of cell type and mouse status. Each group contains two biological replicates.
The sequencing reads for this experiment were uploaded to the [Sequencing Read Archive (SRA)](https://www.ncbi.nlm.nih.gov/sra?term=SRP045534) and details of how to obtain the raw reads and process are given in the session "Preparing fastq files for analysis in R".
## Obtaining the metadata
The `sampleInfo.txt` in the `meta_data` folder contains basic information about the samples that we will need for the analysis today. This includes the ID for the sample from SRA, an ID assigned by the researcher, and the cell type and developmental stage for each sample.
```{r loadSampleInfo}
# Read the sample information into R
sampleinfo <- read.delim("meta_data/sampleInfo.txt")
View(sampleinfo)
sampleinfo
rownames(sampleinfo) <- sampleinfo$run
```
## Reading in the count data
### Overview
We are going to use the [`tximport`](http://dx.doi.org/10.12688/f1000research.7563.1) package to import the count data into R and collapse the data to the *gene level*. This requires us to run a function in the following form:-
```{r eval=FALSE}
txi <- tximport(files=..., type="salmon", tx2gene=...)
```
So we will need to define the files that we want to import and a transcript mapping data frame. The transcript mapping takes the form
```
| TXNAME | GENEID
1| ENST00000456328.2 | ENSG00000223972.5
2| ENST00000450305.2 | ENSG00000223972.5
3| ENST00000473358.1 | ENSG00000243485.5
4| ENST00000469289.1 | ENSG00000243485.5
5| ENST00000607096.1 | ENSG00000284332.1
6| ENST00000606857.1 | ENSG00000268020.3
```
`tximport` is able to import counts produced by different software, and different workflows are described for each in the [tximport vignette](https://bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html).
## Identifying the files
The samples from this study have been *quantified* using `salmon`. For details on how this is done, please see the previous session on preparing fastq files for analysis.
The script we used to run salmon (`run_salmon.sh`) created a separate folder for each sample (named according to the SRA ID), and inside each of these folders we find the salmon quantification file. Note that the salmon analysis produced many other files (e.g. log files), but we will only need the `quant.sf.gz` files for analysis.
The function we are going to use to import the salmon files requires a `vector` comprising the paths to the files that are to be imported. To construct such a vector we can use the following code chunk. Furthermore, we can name each item in the vector according to the directory name. These names will be used eventually to name the columns of our count matrices.
```{r}
dirs <- list.files("salmon_quant/")
quant_files <- list.files("salmon_quant/",pattern="quant.sf.gz",recursive = TRUE,full.names = TRUE)
names(quant_files) <- dirs
quant_files
```
### Inspecting the salmon output
The quant files are simple tab-delimited files that tabulate the counting results for each transcript in our chosen organism. Although we will use a specialised Bioconductor package (`tximport`) to import the counts for entire dataset into R, we can inspect the first of the files using the standard `read_tsv` function from the `readr` package.
```{r}
library(readr)
quants <- read_tsv(quant_files[1])
head(quants)
```
Various methods have been proposed to account for the two main known biases in RNA-seq; library composition and gene length. A nice summary is presented on [this blog](https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/). The current favourite is *TPM* (transcripts per million) which is similar in concept to *RPKM*, but the different order in which operations are applied makes it easier to compare across samples.
The *TPM* are giving in the table, but we can demonstrate the calculation in a few steps
- divide the number of reads for each transcript by it's length (*reads per kilobase* - RPK)
- sum the RPK values and divide by 1 million to get a scaling factor
- divide the RPK values by the scaling factor to get the *TPM*
```{r}
rpk <- quants$NumReads / quants$EffectiveLength
scale_factor <- sum(rpk) / 1e6
tpm <- rpk / scale_factor
```
### Defining the transcript mapping
In order for `tximport` to give *gene-level* counts, we need to supply a data frame that can be used to associate each transcript name with a gene identifier. **It is important to use a transcript file that corresponds to the name genome build as the file used to count the transcripts**.
We can check if the `gtf` file exists in the directory we expect by running the `file.exists` function; returning `TRUE` or `FALSE`
```{r}
gtf_file <- "Mus_musculus.GRCm38.91.chr.gtf.gz"
file.exists(gtf_file)
```
If required, we can download from the Ensembl FTP site.
```{r eval=FALSE}
download.file("ftp://ftp.ensembl.org/pub/release-91/gtf/mus_musculus/Mus_musculus.GRCm38.91.chr.gtf.gz",destfile = gtf_file)
```
### Note on analysing your own data
![images/download_gtf.png]
If analysing your own data, you will have to locate the gtf file on the Ensembl FTP site. If you enter `ftp://ftp.ensembl.org/pub/release-91/gtf` into a web browser you will be able to navigate the site and find your organism of interest. By right-clicking on the name of the gtf you will be able to copy the URL and then paste into RStudio.
```{r eval=FALSE}
gtf_file <- "ensembl_ref/my_ref.gtf"
download.file(PASTE_LINK_FROM_ENSEMBL_HERE,destfile = gtf_file)
```
### Creating a transcript database
The Bioconducor website provides many pre-built transcript databases for some organisms (Human, Mouse,Rat etc) which provide transcript definitions and allow users to query the locations of particular genes, exons and other genomic features. You may find a pre-built package that already has the transcript locations required to create the transcript mapping file. Check out the annotation section of the Bioconductor website - http://bioconductor.org/packages/release/BiocViews.html#___AnnotationData and look for packages starting `TxDb...`
However, it is quite easy to build such a database if we have a `gtf` file using the `GenomicFeatures` infrastructure.
```{r message=FALSE}
## Could take a few minutes to run the makeTxDbFromGFF command
library(GenomicFeatures)
txdb <- makeTxDbFromGFF(gtf_file)
```
The database has a number of predefined "keys" and "columns" that have to be specified when creating a query
```{r}
keytypes(txdb)
```
```{r}
columns(txdb)
```
Sometimes we would want to query the positions for a limited set of selected genes (perhaps the results of a differential-expression analysis), but in this case we want the gene names that correspond to every transcript in the database. To get the names of all transcripts we can use the `keys` function. We then compose the query using the `select` function to return a data frame
```{r}
k <- keys(txdb, keytype="TXNAME")
tx_map <- select(txdb, keys = k, columns="GENEID", keytype = "TXNAME")
head(tx_map)
```
Such a data frame *should* be sufficient to allow us to use the `tximport` package. There is a bit of a problem though..
```{r eval=FALSE}
library(tximport)
tx2gene <- tx_map
write.csv(tx2gene,file="tx2gene.csv",row.names = FALSE,quote=FALSE)
txi <- tximport(quant_files,type="salmon",tx2gene = tx2gene)
```
In this case R is reporting a useful error message; the IDs we have supplied in the `tx2gene` data frame do not correspond to the transcript names in our files.
```{r}
table(tx_map$TXNAME %in% quants$Name)
```
Fortunately the authors of `tximport` have recognised this as a common problem and added an argument to the `tximport` function that can be used to overcome the error.
```{r}
library(tximport)
tx2gene <- tx_map
txi <- tximport(quant_files,type="salmon",tx2gene = tx2gene,ignoreTxVersion = TRUE)
```
The resulting object is a simple `list` structure in R which contains a number of components that we can access using a `$` operator
```{r}
names(txi)
```
```{r}
head(txi$counts)
```
```{r}
all(rownames(sampleinfo) == colnames(txi$counts))
```
This was a fairly simple fix, but we could have also used some functionality from the `tidyverse` collection of packages to make the names compatible.
### Fixing the problem with transcript names - the hard way!
None of the transcript names are matching up. If we look at the transcript names in the `quant` files they have a `.` symbol followed by a number after the transcript name. The `tidyr` package contains several useful functions for *tidying* a data frame. One function we will use in particular is to split a column of value that contain multiple pieces of information separated by a common character. In this case the character is "." but other common separating characters include `-`, `_`, `:`. The `separate` function is able to detect which character is being used.
Here we create two new columns, `TXNAME` and `Number`. The `TXNAME` being the column that should match entries in the `tx_map` data frame.
```{r}
library(tidyr)
quants <- separate(quants, Name, c("TXNAME","Number"),remove = FALSE)
head(quants)
```
Another useful piece of `tidyverse` functionality is to join two tables together based on a common column. We want to create a data frame that has the transcript name as it appears in the quant files and the gene name. This can be achieved by a `left_join` to ensure that all the transcript IDs in the quant files have an entry in the combined table.
N.B. those already familiar with R might have encountered the `merge` function for achieving a similar result.
```{r}
library(dplyr)
quants <- left_join(quants, tx_map, by="TXNAME")
head(quants)
```
To create the `tx2gene` we can pull out the columns that we require from this merged data frame. There is another `dplyr` function for doing this that has simpler syntax than the *"base R"* method of subsetting a data frame.
```{r}
tx2gene <- dplyr:::select(quants, Name, GENEID)
head(tx2gene)
```
We might want to further check that there are no "missing values" in the Gene ID column and remove them from the data frame
```{r}
any(is.na(tx2gene$GENEID))
tx2gene <- filter(tx2gene, !is.na(GENEID))
```
```{r}
library(tximport)
txi <- tximport(quant_files,type="salmon",tx2gene = tx2gene)
```
## Quality control of the imported counts
We will be using the `DESeq2` library to analyse this dataset. As part of the [DESeq2 vignette](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) you will see examples of importing count data from different sources. In all cases, [raw count data are expected](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-un-normalized-counts).
For this workflow we are going to import data from `tximport` with the `DESeqDataSetFromTximport` function along with the sample information that we created earlier.
A *design* for the experiment also needs to be specified. This will define how the differential expression analysis is carried out, but can be changed at a later stage so we will use `CellType` for now as our factor of interest.
```{r message=FALSE}
library(DESeq2)
dds <- DESeqDataSetFromTximport(txi,
colData = sampleinfo,
design <- ~CellType)
```
The object contains all the metadata for the experiment, along with the counts.
```{r}
colData(dds)
```
We will be using these raw counts throughout the workshop and transforming them using methods in the `DESeq2` package. If TPM values are desired for some other application, we can extract them from the `tximport` object. These are the transcript-level TPM values that we saw earlier in the quant files that have been summarised to the gene-level.
```{r}
tpm <- txi$abundance
write.csv(tpm, file="tpm_values.csv",quote=FALSE)
```
At the time of writing, *TPM* is the recommended way of transforming RNA-seq counts to account for gene length and library composition biases. DESeq2 also provides methods for extracting normalised counts as *FPKM* (fragments per kilobase per million mapped fragments) and *FPM* (fragments per million mapped fragments) which might be required for some other analysis or visualisation outside of Bioconductor or `DESeq2`.
```{r}
fpm <- fpm(dds)
write.csv(fpm, file="fpm_values.csv",quote=FALSE)
fpkm <- fpkm(dds)
write.csv(fpkm, file="fpkm_values.csv",quote=FALSE)
```
### Visualising library sizes
We can look at a few different plots to check that the data is good quality, and that the samples are as we would expect. First, we can check how many reads we have for each sample in the `DESeqDataSet`. The counts themselves are accessed using the `assay` function; giving a matrix of counts. The sum of a particular column is therefore the total number of reads for that sample.
```{r}
sum(assay(dds)[,1])
```
A convenience function `colSums` exists for calculating the sum of each column in a matrix, returning a `vector` as a result.
```{r dgeLibrarySizes}
colSums(assay(dds))
```
> ## Challenge 1 {.challenge}
>
> 1. Produce a bar plot to show the Millions of reads for each sample
> 2. Change the names under the plot so they are the "Name" of the sample rather than Run Name
> 3. Add a horizontal line at 20 million reads
> HINT: look at the help for the base graphics functions `barplot` and `abline` (or try and use `ggplot2` if you prefer)
![](images/lib_size.png)
### Filtering non-expressed genes
The dataset is usually filtered at this stage to remove any genes that are not expressed. Although not strictly required for the DESeq2 differential expression algorithm, it can reduce the time and memory required to perform some of the analysis. Let's say that for a gene to be "expressed" in a particular sample we need to see 5 or more counts
```{r}
is_expressed <- assay(dds) >= 5
head(is_expressed)
```
R is happy to think of logical values (`TRUE` or `FALSE`) as the integers `0` or `1`. Therefore if we calculate the sum across a particular row it will give the number of samples that gene is expressed in.
```{r}
sum(is_expressed[1,])
sum(is_expressed[2,])
```
In a similar manner to `colSums`, `rowSums` will give the sum of each row in a matrix and return the result as a `vector`.
```{r}
hist(rowSums(is_expressed),main="Number of samples a gene is expressed in",xlab="Sample Count")
```
It seems that genes are either expressed in all samples, or not expressed at all. We will decide to keep genes that are expressed in at least 2 samples.
```{r}
keep <- rowSums(assay(dds) >= 5) >= 2
table(keep)
dds <- dds[keep,]
```
### Visualising count distributions
We typically use a `boxplot` to visualise difference the distributions of the columns of a numeric data frame. Applying the `boxplot` function to the raw counts from our dataset reveals something about the nature of the data; the distributions are dominated by a few genes with very large counts.
```{r}
boxplot(assay(dds))
```
```{r message=FALSE}
boxplot(log10(assay(dds)))
```
We can use the `vst` or `rlog` function from `DESeq2`to compensate for the effect of different library sizes and put the data on the log$_2$ scale. The effect is to remove the dependence of the variance on the mean, particularly the high variance of the logarithm of count data when the mean is low. For more details see the [DESeq2 vignette](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#count-data-transformations)
```{r}
# Get log2 counts
vsd <- vst(dds,blind=TRUE)
# Check distributions of samples using boxplots
boxplot(assay(vsd), xlab="", ylab="Log2 counts per million",las=2,main="Normalised Distributions")
# Let's add a blue horizontal line that corresponds to the median logCPM
abline(h=median(assay(vsd)), col="blue")
```
### Heatmap of the sample-to-sample distances
Another use of the transformed data is sample clustering. Here, we apply the `dist` function to the transpose of the transformed count matrix to get sample-to-sample distances.
```{r}
sampleDists <- dist(t(assay(vsd)))
```
A heatmap of this distance matrix gives us an overview over similarities and dissimilarities between samples. By re-naming the rows and columns of the distance matrix we can make the plot easier to interpret.
```{r}
library(RColorBrewer)
library(pheatmap)
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(colData(dds)$CellType, colData(dds)$Status, sep="-")
colnames(sampleDistMatrix) <- colData(dds)$Name
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
col=colors)
```
### Principal component (PCA)
Related to the distance matrix heatmap is the [(Principal Components Analysis) PCA](http://setosa.io/ev/principal-component-analysis/) plot, which shows the samples in the 2D plane spanned by their first two principal components. A principle components analysis is an example of an unsupervised analysis, where we don’t need to specify the groups. If your experiment is well controlled and has worked well, what we hope to see is that the greatest sources of variation in the data are the treatments/groups we are interested in. It is also an incredibly useful tool for quality control and checking for outliers
`DESeq2` has a convenient `plotPCA` function for making the PCA plot, which makes use of the `ggplot2` graphics package.
```{r}
plotPCA(vsd,intgroup="CellType")
```
> ## Challenge 2 {.challenge}
>
> 1. Is the `plotPCA` plot based on all genes in the dataset? How can we change how many genes are used for the PCA analysis? Does this significantly change the plot? (HINT: check the documentation for the `plotPCA` function.)
> 2. Change the `intgroup` parameter so that both CellType and Status are used for grouping. (See the documentation again)
> 3. Is there something strange going on with the samples?
> 4. Identify the two samples that don't appear to be in the right place.
> 5. What other problems can you see with the metadata?
### Note about batch effects
In our unsupervised analysis we should see that the main source of variation is due to biological effects, and not technical variation such as when the libraries were sequenced. If we do observe high technical variation in our data, it is not a complete disaster provided that we have designed our experiment propery. In particular the [sva Bioconductor package](https://bioconductor.org/packages/release/bioc/vignettes/sva/inst/doc/sva.pdf) can correct for batch effects provided that representatives of the groups of interest appear in each batch. Alternatively, the batch or confounding factor may be incorporated into the differential expression analysis.
### Correcting the sample information
Hopefully we have spotted a potential sample swap in the dataset. The mislabelled samples are MCL1.DH, which is labelled as luminal but should be basal, and MCL1.LA, which is labelled as basal but should be luminal. Such errors are not uncommon when handling large numbers of samples and sometimes we need to go back to the lab books and verify that a swap has been made. *If there is no supporting evidence for a swap then it can be safer to exclude the samples*.
Furthermore, the person creating the sample sheet has been inconsistent about the way that values of `CellType` and `Status` have been entered into the metadata. Such errors can be annoying when labelling plots, but have more serious consequences when attempting to fit statistical models to the data.
```{r correctSampleSheet}
library(stringr)
sampleinfo_corrected <- sampleinfo
sampleinfo_corrected <- mutate(sampleinfo_corrected, CellType = str_to_lower(CellType))
sampleinfo_corrected <- mutate(sampleinfo_corrected, Status = str_trim(Status))
sampleinfo_corrected <- mutate(sampleinfo_corrected, CellType = ifelse(Name == "MCL1-DH","basal",CellType))
sampleinfo_corrected <- mutate(sampleinfo_corrected, CellType= ifelse(Name == "MCL1-LA","luminal",CellType))
write.table(sampleinfo_corrected, file="meta_data/sampleInfo_corrected.txt",sep="\t",row.names = FALSE)
```
> ## Challenge 3 {.challenge}
>
> 1. Re-create the DESeqDataset object to include the corrected sample information
> 2. Re-run the plotPCA function on the new data and verify that the sample groups now look correct
>
### A note about "pipes" and ggplot2
It is common practice when using a series of `dplyr` data frame manipulations (such as `mutate` above) to create a workflow with the `%>%` operation from the `magrittr` package. The result is a more-readable chunk of code, although the result is exactly the same.
```{r}
sampleinfo_corrected <- mutate(sampleinfo, CellType = str_to_lower(CellType)) %>%
mutate(Status = str_trim(Status)) %>%
mutate(CellType = ifelse(Name == "MCL1-DH","basal",CellType)) %>%
mutate(CellType= ifelse(Name == "MCL1-LA","luminal",CellType))
sampleinfo_corrected
```
The `plotPCA` function produces a `ggplot2` plot (rather than "base" graphics) and we can also get the function to return the data used to create the plot by changing the `returnData` argument.
```{r}
plot_data <- plotPCA(vsd,intgroup=c("CellType","Status"),returnData=TRUE)
plot_data <- bind_cols(plot_data,sampleinfo_corrected)
```
With `ggplot2` plot can be created by mapping various *aesthetics* of the plot (colour, shape, x- and y-coordinates) to columns in the data frame.
```{r}
library(ggplot2)
ggplot(plot_data, aes(x = PC1,y=PC2, col=CellType)) + geom_point()
```
Having the data in this form allows us to customise the plot in lots of ways
```{r}
ggplot(plot_data, aes(x = PC1,y=PC2, col=CellType,pch=Status)) + geom_point(size=5)
```
```{r}
ggplot(plot_data, aes(x = PC1,y=PC2, col=CellType,pch=Status,label=Name)) + geom_point() + geom_text(alpha=0.4)
```