-
Notifications
You must be signed in to change notification settings - Fork 6
/
README.Rmd
1096 lines (795 loc) · 37.9 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Introduction to crisprDesign"
output:
github_document:
toc: true
bibliography: vignettes/references.bib
---
```{r, echo=FALSE, results="hide"}
options("knitr.graphics.auto_pdf"=TRUE)
```
Authors: Jean-Philippe Fortin, Aaron Lun, Luke Hoberecht
Date: July 1, 2022
# Introduction
`crisprDesign` is the core package of the
[crisprVerse](https://github.com/crisprVerse) ecosystem,
and plays the role of a
one-stop shop for designing and annotating
CRISPR guide RNA (gRNA) sequences. This includes the characterization of
on-targets and off-targets using different aligners, on- and off-target
scoring, gene context annotation, SNP annotation, sequence feature
characterization, repeat annotation, and many more.
The software was developed to be as applicable and generalizable as
possible.
It currently support five types of
CRISPR modalities (modes of perturbations): CRISPR knockout (CRISPRko), CRISPR
activation (CRISPRa), CRISPR interference (CRISPRi), CRISPR base editing
(CRISPRbe), and CRISPR knockdown (CRISPRkd) (see @crispracrisprireview for a review of CRISPR modalities).
It utilizes the `crisprBase` package to enable gRNA design for any
CRISPR nuclease and base editor via the `CrisprNuclease` and `BaseEditor`
classes, respectively. Nucleases that are commonly used in the field are
provided, including DNA-targeting nucleases (e.g. SpCas9, AsCas12a) and
RNA-targeting nucleases (e.g. CasRx (RfxCas13d)).
`crisprDesign` is fully developed to work with the genome of any organism, and
can also be used to design gRNAs targeting custom DNA sequences.
Finally, more specialized gRNA design functionalities are also available,
including design for optical pooled screening (OPS), paired gRNA design,
and gRNA filtering and ranking functionalities.
This vignette is meant to be an overview of the main features included in
the package, using toy examples for the sake of time (the vignette has to
compile within a few minutes, as required by Bioconductor). For detailed
and comprehensive tutorials, please visit our [crisprVerse tutorials page](https://github.com/crisprVerse/Tutorials).
# Installation
`crisprDesign` can be installed from from the Bioconductor devel branch
using the following commands in a fresh R session:
```{r, eval=FALSE}
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(version="devel")
BiocManager::install("crisprDesign")
```
Users interested in contributing to `crisprDesign` might want to look at the
following CRISPR-related package dependencies:
- [crisprBase](https://github.com/crisprVerse/crisprBase): core CRISPR functions and S4 objects
- [crisprBowtie](https://github.com/crisprVerse/crisprBowtie): aligns gRNA spacers to genomes using the ungapped
aligner `bowtie`
- [crisprBwa](https://github.com/crisprVerse/crisprBWa): aligns gRNA spacers to genomes using the ungapped
aligner `BWA`
- [crisprScore](https://github.com/crisprVerse/crisprScore): implements state-of-the-art on- and off-target scoring
algorithms
- [crisprViz](https://github.com/crisprVerse/crisprViz): gRNA visualization using genomic tracks
You can contribute to the package by submitting pull requests to our [GitHub repo](https://github.com/crisprVerse/crisprDesign).
The complete documentation for the package can be found [here](https://bioconductor.org/packages/devel/bioc/manuals/crisprDesign/man/crisprDesign.pdf).
# Terminology
CRISPR nucleases are examples of RNA-guided endonucleases. They require two
binding components for cleavage. First, the nuclease needs to recognize a
constant nucleotide motif in the target DNA called the protospacer adjacent
motif (PAM) sequence. Second, the gRNA, which guides the nuclease to the target
sequence, needs to bind to a complementary sequence adjacent to the PAM
sequence, called the **protospacer** sequence. The latter can be thought of as a
variable binding motif that can be specified by designing corresponding gRNA
sequences.
The **spacer** sequence is used in the gRNA construct to guide
the CRISPR nuclease to the target **protospacer** sequence in the host genome.
For DNA-targeting nucleases, the nucleotide sequence of the spacer and protospacer are identical. For RNA-targeting nucleases, they are the reverse complement of each other.
While a gRNA spacer sequence may not always uniquely target the host genome
(i.e. it may map to multiple protospacers in the host genome),
we can, for a given reference genome, uniquely identify a protospacer
sequence with a combination of 3 attributes:
- `chr`: chromosome name
- `strand`: forward (+) or reverse (-)
- `pam_site`: genomic coordinate of the first nucleotide of the
nuclease-specific PAM sequence (e.g. for SpCas9, the "N" in the NGG PAM
sequence; for AsCas12a, the first "T" of the TTTV PAM sequence)
For CRISPRko, we use an additional genomic coordinate, called `cut_site`,
to represent where the double-stranded break (DSB) occurs. For SpCas9, the cut
site (blunt-ended dsDNA break) is located 4nt upstream of the pam_site
(PAM-proximal editing). For AsCas12a, the 5nt 5' overhang dsDNA break will
cause a cut 19nt after the PAM sequence on the targeted strand, and 23nt after
the PAM sequence on the opposite strand (PAM-distal editing).
# CRISPRko design
We will illustrate the main functionalities of `crisprDesign` by
performing a common task: designing gRNAs to knock out a coding gene. In our
example, we will design gRNAs for the wildtype SpCas9 nuclease, with spacers
having a length of 20nt.
```{r, message=FALSE, warning=FALSE,results='hide' }
library(crisprDesign)
```
## Nuclease specification
The `crisprBase` package provides functionalities to create objects that store
information about CRISPR nucleases, and functions to interact with those
objects (see the `crisprBase` vignette). It also provides commonly-used CRISPR
nucleases. Let's look at the `SpCas9` nuclease object:
```{r}
library(crisprBase)
data(SpCas9, package="crisprBase")
SpCas9
```
The three motifs (NGG, NAG and NGA) represent the recognized PAM sequences by
SpCas9, and the weights indicate a recognition score. The canonical PAM
sequence NGG is fully recognized (weight of 1), while the two non-canonical
PAM sequences NAG and NGA are much less tolerated.
The spacer sequence is located on the 5-prime end with respect to the PAM
sequence, and the default spacer sequence length is 20 nucleotides.
If necessary, we can change the spacer length using the function
`crisprBase::spacerLength`. Let's see what the protospacer
construct looks like by using `prototypeSequence`:
```{r}
prototypeSequence(SpCas9)
```
## Target DNA specification
As an example, we will design gRNAs that knockout the human gene IQSEC3 by
finding all protospacer sequences located in the coding region (CDS)
of IQSEC3.
To do so, we need to create a `GRanges` object that defines the genomic
coordinates of the CDS of IQSEC3 in a reference genome.
The toy dataset `grListExample` object in `crisprDesign` contains gene
coordinates in hg38 for exons of all human IQSEC3 isoforms, and was
obtained by converting an Ensembl `TxDb` object into a `GRangesList`
object using the `TxDb2GRangesList` convenience function in `crisprDesign`.
```{r}
data(grListExample, package="crisprDesign")
```
The `queryTxObject` function allows us to query such objects for a specific
gene and feature. Here, we obtain a `GRanges` object containing the CDS
coordinates of IQSEC3:
```{r echo=TRUE, results='hide', warning=FALSE, message=FALSE}
gr <- queryTxObject(txObject=grListExample,
featureType="cds",
queryColumn="gene_symbol",
queryValue="IQSEC3")
```
We will only consider the first exon to speed up design:
```{r}
gr <- gr[1]
```
## Designing spacer sequences
`findSpacers` is the main function to obtain a list of all
possible spacer sequences targeting protospacers located in the target
DNA sequence(s). If a `GRanges` object is provided as input, a `BSgenome`
object (object containing sequences of a reference genome) will need to be
provided as well:
```{r, warning=FALSE, message=FALSE}
library(BSgenome.Hsapiens.UCSC.hg38)
bsgenome <- BSgenome.Hsapiens.UCSC.hg38
guideSet <- findSpacers(gr,
bsgenome=bsgenome,
crisprNuclease=SpCas9)
guideSet
```
This returns a `GuideSet` object that stores genomic coordinates for all spacer
sequences found in the regions provided by `gr`. The `GuideSet` object is an
extension of a `GenomicRanges` object that stores additional information about
gRNAs.
For the subsequent sections, we will only work with a random subset of 20
spacer sequences:
```{r}
set.seed(10)
guideSet <- guideSet[sample(seq_along((guideSet)),20)]
```
Several accessor functions are provided to extract information about the
spacer sequences:
```{r}
spacers(guideSet)
protospacers(guideSet)
pams(guideSet)
head(pamSites(guideSet))
head(cutSites(guideSet))
```
The genomic locations stored in the IRanges represent the PAM site locations in the reference genome.
## Sequence features characterization
There are specific spacer sequence features, independent of the genomic
context of the protospacer sequence, that can reduce or even eliminate gRNA
activity:
- **Poly-T stretches**: four or more consecutive T nucleotides in the
spacer sequence may act as a transcriptional termination signal for
the U6 promoter.
- **Self-complementarity**: complementary sites with the gRNA backbone
can compete with the targeted genomic sequence.
- **Percent GC**: gRNAs with GC content between 20% and 80% are preferred.
Use the function `addSequenceFeatures` to adds these spacer sequence
characteristics to the `GuideSet` object:
```{r, eval=TRUE, warning=FALSE, message=FALSE}
guideSet <- addSequenceFeatures(guideSet)
head(guideSet)
```
## Off-target search
In order to select gRNAs that are most specific to our target
of interest, it is important to avoid gRNAs that target additional
loci in the genome with either perfect sequence complementarity
(multiple on-targets), or imperfect complementarity through
tolerated mismatches (off-targets).
For instance, both the SpCas9 and AsCas12a nucleases can be tolerant
to mismatches between the gRNA spacer sequence (RNA) and the protospacer
sequence (DNA), thereby making it critical to characterize off-targets to
minimize the introduction of double-stranded breaks (DSBs) beyond
our intended target.
The `addSpacerAlignments` function appends a list of putative on-
and off-targets to a `GuideSet` object using one of three methods. The first
method uses the fast aligner
[bowtie](http://bowtie-bio.sourceforge.net/index.shtml)
[@langmead2009bowtie] via the `crisprBowtie` package to map spacer sequences
to a specified reference genome. This can be done by specifying
`aligner="bowtie"` in `addSpacerAlignments`.
The second method uses the fast aligner
[BWA](https://github.com/lh3/bwa) via the `crisprBwa` package to map
spacer sequences to a specified reference genome.
This can be done by specifying
`aligner="bwa"` in `addSpacerAlignments`. Note that this is not available
for Windows machines.
The third method uses the package `Biostrings` to search for similar sequences
in a set of DNA coordinates sequences, usually provided through a `BSGenome`
object. This can be done by specifying
`aligner="biostrings"` in `addSpacerAlignments`. This is extremely slow,
but can be useful when searching for off-targets in custom short DNA
sequences.
We can control the alignment parameters and output using several
function arguments. `n_mismatches` sets the maximum number of permitted
gRNA:DNA mismatches (up to 3 mismatches). `n_max_alignments` specifies the
maximum number of alignments for a given gRNA spacer sequence
(1000 by default). The `n_max_alignments` parameter may be overruled by
setting `all_Possible_alignments=TRUE`, which returns all possible
alignments. `canonical=TRUE` filters out protospacer sequences
that do not have a canonical PAM sequence.
Finally, the `txObject` argument in `addSpacerAlignmentsused`
allows users to provide a `TxDb` object, or a `TxDb` object
converted in a `GRangesList` using the `TxDb2GRangesList` function, to
annotate genomic alignments with a gene model annotation. This is useful
to understand whether or not off-targets are located in the CDS of
another gene, for instance.
For the sake of time, we will search here for on- and off-targets located
in the beginning of the human chr12 where the gene IQSEC3 is located.
We will the bowtie method, with a maximum of 1 mismatch.
First, we need to build a bowtie index sequence using the fasta file provided
in `crisprDesign`. We use the `RBowtie` package to build the index:
```{r}
library(Rbowtie)
fasta <- system.file(package="crisprDesign", "fasta/chr12.fa")
outdir <- tempdir()
Rbowtie::bowtie_build(fasta,
outdir=outdir,
force=TRUE,
prefix="chr12")
bowtie_index <- file.path(outdir, "chr12")
```
For genome-wide off-target search, users will need to create a bowtie
index on the whole genome. This is explained
in [this tutorial](https://github.com/crisprVerse/Tutorials/tree/master/Building_Genome_Indices).
Finally, we also need to specify a `BSgenome` object storing DNA sequences
of the human reference genome:
```{r, results='hide', warning=FALSE}
library(BSgenome.Hsapiens.UCSC.hg38)
bsgenome <- BSgenome.Hsapiens.UCSC.hg38
```
We are now ready to search for on- and off-targets:
```{r, results='hide', warning=FALSE}
guideSet <- addSpacerAlignments(guideSet,
txObject=grListExample,
aligner_index=bowtie_index,
bsgenome=bsgenome,
n_mismatches=1)
```
Let's look at what was added to the `GuideSet`:
```{r}
guideSet
```
A few columns were added to the `GuideSet` object to summarize the number of
on- and off-targets for each spacer sequence, taking into account genomic
context:
- **n0, n1, n2, n3**: specify number of alignments with 0, 1, 2 and 3
mismatches, respectively.
- **n0_c, n1_c, n2_c, n3_c**: specify number of alignments in a coding region,
with 0, 1, 2 and 3 mismatches, respectively.
- **n0_p, n1_p, n2_p, n3_p**: specify number of alignments in a promoter region
of a coding gene, with 0, 1, 2 and 3 mismatches, respectively.
To look at the individual on- and off-targets and their context, use the
`alignments` function to retrieve a table of all genomic alignments stored in
the `GuideSet` object:
```{r}
alignments(guideSet)
```
The functions `onTargets` and `offTargets` will return on-target alignments
(no mismatch) and off-target alignment (with at least one mismatch),
respectively. See `?addSpacerAlignments` for more details about the
different options.
### Iterative spacer alignments
gRNAs that align to hundreds of different locations are highly unspecific
and undesirable. This can also cause `addSpacerAlignments` to be slow.
To mitigate this, we provide `addSpacerAlignmentsIterative`, an iterative
version of `addSpacerAlignments` that curtails alignment searches
for gRNAs having more hits than the user-defined
threshold (see `?addSpacerAlignmentsIterative`).
### Faster alignment by removing repeat elements
To remove protospacer sequences located in repeats or low-complexity
DNA sequences (regions identified by RepeatMasker), which are usually
not of interest due to their low specificity, we provide the convenience
function `removeRepeats`:
```{r, eval=TRUE}
data(grRepeatsExample, package="crisprDesign")
guideSet <- removeRepeats(guideSet,
gr.repeats=grRepeatsExample)
```
## Off-target scoring
After retrieving a list of putative off-targets and on-targets for
a given spacer sequence, we can use `addOffTargetScores` to
predict the likelihood of the nuclease to cut at the off-targets based
on mismatch tolerance. Currently, only off-target scoring for the SpCas9
nuclease are available (MIT and CFD algorithms):
```{r, eval=TRUE, warning=FALSE, message=FALSE}
guideSet <- addOffTargetScores(guideSet)
guideSet
```
Note that this will only work after calling `addSpacerAlignments`,
as it requires a list of off-targets for each gRNA entry. The returned
`GuideSet` object has now the additional columns `score_mit` and `score_cfd`
representing the gRNA-level aggregated off-target specificity scores. The
off-target table also contains a cutting likelihood score for each gRNA
and off-target pair:
```{r}
head(alignments(guideSet))
```
## On-target scoring
`addOnTargetScores` adds scores from all on-target efficiency
algorithms available in the R package `crisprScore` and
appends them to the `GuideSet`. By default, scores for all available methods
for a given nuclease will be computed. Here, for the sake of time,
let's add only the CRISPRater score:
```{r, eval=TRUE, warning=FALSE, message=FALSE}
guideSet <- addOnTargetScores(guideSet, methods="crisprater")
head(guideSet)
```
See the `crisprScore` vignette for a full description of the different scores.
## Restriction enzymes
Restriction enzymes are usually involved in the gRNA library synthesis process.
Removing gRNAs that contain specific restriction sites is often necessary.
We provide the function `addRestrictionEnzymes` to indicate whether or not
gRNAs contain restriction sites for a user-defined set of enzymes:
```{r, eval=TRUE, warning=FALSE, message=FALSE, results='hide'}
guideSet <- addRestrictionEnzymes(guideSet)
```
When no enzymes are specified, the function adds annotation for the following
default enzymes: EcoRI, KpnI, BsmBI, BsaI, BbsI, PacI, ISceI and MluI. The
function also has two additional arguments, `flanking5` and `flanking3`, to
specify nucleotide sequences flanking the spacer sequence (5' and 3',
respectively) in the lentiviral cassette that will be used for gRNA delivery.
The function will effectively search for restriction sites in the full sequence
`[flanking5][spacer][flanking3]`.
The `enzymeAnnotation` function can be used to retrieve the added annotation:
```{r}
head(enzymeAnnotation(guideSet))
```
## Gene annotation
The function `addGeneAnnotation` adds transcript- and gene-level
contextual information to gRNAs from a `TxDb`-like object:
```{r, eval=TRUE,warning=FALSE, message=FALSE, results='hide'}
guideSet <- addGeneAnnotation(guideSet,
txObject=grListExample)
```
The gene annotation can be retrieved using the function `geneAnnotation`:
```{r}
geneAnnotation(guideSet)
```
It contains a lot of information that contextualizes
the genomic location of the protospacer sequences.
The ID columns (`tx_id`, `gene_id`, `protein_id`, `exon_id`) give Ensembl IDs.
The `exon_rank` gives the order of the exon for the transcript, for example "2"
indicates it is the second exon (from the 5' end) in the mature transcript.
The columns `cut_cds`, `cut_fiveUTRs`, `cut_threeUTRs` and `cut_introns`
indicate whether the guide sequence overlaps with CDS, 5' UTR, 3' UTR,
or an intron, respectively.
`percentCDS` gives the location of the `cut_site` within the transcript as a
percent from the 5' end to the 3' end. `aminoAcidIndex` gives the number of the
specific amino acid in the protein where the cut is predicted to occur.
`downstreamATG` shows how many in-frame ATGs are downstream of the `cut_site`
(and upstream from the defined percent transcript cutoff, `met_cutoff`),
indicating a potential alternative translation initiation site that may
preserve protein function.
For more information about the other columns, type `?addGeneAnnotation`.
## TSS annotation
Similarly, one might want to know which protospacer sequences are located
within promoter regions of known genes:
```{r}
data(tssObjectExample, package="crisprDesign")
guideSet <- addTssAnnotation(guideSet,
tssObject=tssObjectExample)
tssAnnotation(guideSet)
```
For more information, type `?addTssAnnotation`.
## SNP information
Common single-nucleotide polymorphisms (SNPs) can change the on-target and
off-target properties of gRNAs by altering the binding.
The function `addSNPAnnotation` annotates gRNAs with respect to a
reference database of SNPs (stored in a VCF file), specified by the `vcf`
argument.
VCF files for common SNPs (dbSNPs) can be downloaded from NCBI on the [dbSNP website](https://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/).
We include in this package an example VCF file for common SNPs located in the
proximity of human gene IQSEC3. This was obtained using the dbSNP151 RefSNP
database obtained by subsetting around IQSEC.
```{r, eval=TRUE,warning=FALSE, message=FALSE}
vcf <- system.file("extdata",
file="common_snps_dbsnp151_example.vcf.gz",
package="crisprDesign")
guideSet <- addSNPAnnotation(guideSet, vcf=vcf)
snps(guideSet)
```
The `rs_site_rel` gives the relative position of the SNP with respect
to the `pam_site`. `allele_ref` and `allele_minor` report the nucleotide of
the reference and minor alleles, respectively. `MAF_1000G` and `MAF_TOPMED`
report the minor allele frequency (MAF) in the 1000Genomes and TOPMED
populations.
## Filtering and ranking gRNAs
Once gRNAs are fully annotated, it is easy to filter out any unwanted gRNAs
since `GuideSet` objects can be subsetted like regular vectors in R.
As an example, suppose that we only want to keep gRNAs that have percent
GC between 20% and 80% and that do not contain a polyT stretch.
This can be achieved using the following lines:
```{r, eval=FALSE}
guideSet <- guideSet[guideSet$percentGC>=20]
guideSet <- guideSet[guideSet$percentGC<=80]
guideSet <- guideSet[!guideSet$polyT]
```
Similarly, it is easy to rank gRNAs based on a set of criteria
using the regular `order` function.
For instance, let's sort gRNAs by the CRISPRater on-target score:
```{r, eval=TRUE}
# Creating an ordering index based on the CRISPRater score:
# Using the negative values to make sure higher scores are ranked first:
o <- order(-guideSet$score_crisprater)
# Ordering the GuideSet:
guideSet <- guideSet[o]
head(guideSet)
```
One can also sort gRNAs using several annotation columns.
For instance, let's sort gRNAs using the CRISPRrater score, but also by
prioritizing first gRNAs that have no 1-mismatch off-targets:
```{r, eval=TRUE}
o <- order(guideSet$n1, -guideSet$score_crisprater)
# Ordering the GuideSet:
guideSet <- guideSet[o]
head(guideSet)
```
The `rankSpacers` function is a convenience function that implements
our recommended rankings for the SpCas9, enAsCas12a and CasRx nucleases.
For a detailed description of our recommended rankings, see the
documentation of `rankSpacers` by typing
`?rankSpacers`.
If an Ensembl transcript ID is provided, the ranking function will also
take into account the position of the gRNA within the target CDS of
the transcript ID in the ranking procedure. Our recommendation is to specify
the Ensembl canonical transcript as the representative
transcript for the gene. In our example, ENST00000538872 is the canonical
transcript for IQSEC3:
```{r, eval=FALSE}
tx_id <- "ENST00000538872"
guideSet <- rankSpacers(guideSet,
tx_id=tx_id)
```
# CRISPRa/CRISPRi design
For CRISPRa and CRISPRi applications, the CRISPR nuclease is engineered to
lose its endonuclease activity, therefore should not introduce double-stranded
breaks (DSBs). We will use the dead SpCas9 (dSpCas9) nuclease as an example
here. Note that users don't have to distinguish between dSpCas9 and SpCas9
when specifying the nuclease in `crisprDesign` and `crisprBase` as they do
not differ in terms of the characteristics stored in the `CrisprNuclease`
object.
*CRISPRi*: Fusing dSpCas9 with a Krüppel-associated box (KRAB) domain has been
shown to be effective at repressing transcription in mammalian cells
[@crispri]. The dSpCas9-KRAB fused protein is a commonly-used construct to
conduct CRISPR inhibition (CRISPRi) experiments. To achieve optimal inhibition,
gRNAs are usually designed targeting the region directly downstream of the gene
transcription starting site (TSS).
*CRISPRa*: dSpCas9 can also be used to activate gene expression
by coupling the dead nuclease with activation factors.
The technology is termed CRISPR activation (CRISPRa), and
several CRISPRa systems have been developed
(see @crispracrisprireview for a review). For optimal activation, gRNAs are
usually designed to target the region
directly upstream of the gene TSS.
`crisprDesign` provides functionalities to be able to take into account
design rules that are specific to CRISPRa and CRISPRi applications. The
`queryTss` function allows to specify genomic coordinates of promoter
regions. The `addTssAnnotation` annotates gRNAs for known TSSs, and includes
a column named `dist_to_tss` that indicates the distance between the TSS
position and the PAM site of the gRNA. For CRISPRi, we recommend targeting
the 25-75bp region downstream of the TSS for optimal inhibition.
For CRISPRa, we recommend targeting the region 75-150bp upstream of the
TSS for optimal activation; see [@sanson2018optimized] for more information.
For more information, please see the following two tutorials:
- [CRISPR activation (CRISPRa) design](https://github.com/crisprVerse/Tutorials/tree/master/Design_CRISPRa)
- [CRISPR interference (CRISPRi) design](https://github.com/crisprVerse/Tutorials/tree/master/Design_CRISPRi)
# CRISPR base editing with BE4max
We illustrate the CRISPR base editing (CRISPRbe) functionalities
of `crisprDesign` by designing and characterizing gRNAs targeting
IQSEC3 using the cytidine base editor BE4max [@koblan2018improving].
We first load the BE4max `BaseEditor` object available in `crisprBase`:
```{r}
data(BE4max, package="crisprBase")
BE4max
```
The editing probabilities of the base editor BE4max are stored in a matrix
where rows correspond to the different nucleotide substitutions, and columns
correspond to the genomic coordinate relative to the PAM site.
The `editingWeights` function from `crisprBase` allows to retrieve
those probabilities. One can see that C to T editing is optimal
around 15 nucleotides upstream of the PAM site for the BE4max base editor:
```{r}
crisprBase::editingWeights(BE4max)["C2T",]
```
We obtain a `GuideSet` object using the first exon of the IQSEC3
gene and retain only the first 2 gRNAs for the sake of time:
```{r}
gr <- queryTxObject(txObject=grListExample,
featureType="cds",
queryColumn="gene_symbol",
queryValue="IQSEC3")
gs <- findSpacers(gr[1],
bsgenome=bsgenome,
crisprNuclease=BE4max)
gs <- gs[1:2]
```
The function `addEditedAlleles` finds, characterizes, and scores predicted
edited alleles for each gRNA, for a chosen transcript. It requires a
transcript-specific annotation that can be obtained using the
function `getTxInfoDataFrame`. Here, we will perform the
analysis using the main isoform of IQSEC3 (transcript id ENST00000538872).
We first get the transcript table for ENST00000538872,
```{r}
txid <- "ENST00000538872"
txTable <- getTxInfoDataFrame(tx_id=txid,
txObject=grListExample,
bsgenome=bsgenome)
head(txTable)
```
and then add the edited alleles annotation to the `GuideSet`:
```{r}
editingWindow <- c(-20,-8)
gs <- addEditedAlleles(gs,
baseEditor=BE4max,
txTable=txTable,
editingWindow=editingWindow)
```
The `editingWindow` argument specifies the window of editing that
we are interested in. When not provided, it uses the default window
provided in the `BaseEditor` object. Note that providing large windows
can exponentially increase computing time as the number of possible
alleles grows exponentially.Let's retrieve the edited alleles for the
first gRNA:
```{r}
alleles <- editedAlleles(gs)[[1]]
```
It is a `DataFrame` object that contains useful metadata information:
```{r}
metadata(alleles)
```
The `wildtypeAllele` reports the unedited nucleotide sequence of the
region specified by the editing window (with respect to the gRNA PAM site).
It is always reported from the 5' to 3' direction on the strand corresponding
to the gRNA strand. The `start` and `end` specify the corresponding
coordinates on the transcript.
Let's look at the edited alleles:
```{r}
head(alleles)
```
The `DataFrame` is ordered so that the top predicted alleles
(based on the `score` column) are shown first. The `score`
represents the likelihood of the edited allele to occur relative
to all possible edited alleles, and is calculated using the editing
weights stored in the `BE4max` object. The `seq` column represents
the edited nucleotide sequences. Similar to the `wildtypeAllele` above,
they are always reported from the 5' to 3' direction on the strand
corresponding to the gRNA strand. The `variant` column indicates the
functional consequence of the editing event (silent, nonsense or
missense mutation). In case an edited allele leads to multiple
editing events, the most detrimental mutation (nonsense over missense,
missense over silent) is reported. The `aa` column reports the result
edited amino acid sequence.
Note that several gRNA-level aggregate scores have also been added
to the `GuideSet` object when calling `addEditedAlleles`:
```{r}
head(gs)
```
The `score_missense`, `score_nonsense` and `score_silent` columns
represent aggregated scores for each of the mutation type. They were
obtained by summing adding up all scores for a given mutation type
across the set of edited alleles for a given gRNA. The `maxVariant`
column indicates the most likely to occur mutation type for a given
gRNA, and is based on the maximum aggregated score, which is stored
in `maxVariantScore`. For instance, for spacer_1, the higher score
is the `score_missense`, and therefore `maxVariant` is set to missense.
For more information, please see the following tutorial:
- [CRISPR base editing (CRISPRbe) design](https://github.com/crisprVerse/Tutorials/tree/master/Design_CRISPRbe)
# CRISPR knockdown with Cas13d
It is also possible to design gRNAs for RNA-targeting nucleases using
`crisprDesign`. In contrast to DNA-targeting nucleases, the target spacer
is composed of mRNA sequences instead of DNA genomic sequences.
We illustrate the functionalities of `crisprDesign` for RNA-targeting
nucleases by designing gRNAs targeting IQSEC3 using the CasRx (RfxCas13d) nuclease [@cas13d].
We first load the CasRx `CrisprNuclease` object from `crisprBase`:
```{r}
data(CasRx, package="crisprBase")
CasRx
```
The PFS sequence (the equivalent of a PAM sequence for RNA-targeting
nucleases) for CasRx is `N`, meaning that there is no specific PFS sequences preferred by CasRx.
We will now design CasRx gRNAs for the transcript ENST00000538872 of IQSEC3.
Let's first extract all mRNA sequences for IQSEC3:
```{r}
txid <- c("ENST00000538872","ENST00000382841")
mrnas <- getMrnaSequences(txid=txid,
bsgenome=bsgenome,
txObject=grListExample)
mrnas
```
We can use the usual function `findSpacers` to design gRNAs, and we
only consider a random subset of 100 gRNAs for the sake of time:
```{r}
gs <- findSpacers(mrnas[["ENST00000538872"]],
crisprNuclease=CasRx)
gs <- gs[1000:1100]
head(gs)
```
Note that all protospacer sequences are located on the original strand
of the mRNA sequence. For RNA-targeting nucleases, the spacer and
protospacer sequences are the reverse complement of each other:
```{r}
head(spacers(gs))
head(protospacers(gs))
```
The `addSpacerAlignments` can be used to perform an off-target search
across all mRNA sequences using the argument `custom_seq`. Here, for
the sake of time, we only perform an off-target search to the 2
isoforms of IQSEC3 specified by the `mRNAs` object:
```{r}
gs <- addSpacerAlignments(gs,
aligner="biostrings",
txObject=grListExample,
n_mismatches=1,
custom_seq=mrnas)
tail(gs)
```
The columns `n0_gene` and `n0_tx` report the number of on-targets at
the gene- and transcript-level, respectively. For instance, `spacer_1095`
maps to the two isoforms of IQSEC3 has `n0_tx` is equal to 2:
```{r}
onTargets(gs["spacer_1095"])
```
Note that one can also use the `bowtie` aligner to perform an off-target
search to a set of mRNA sequences. This requires building a transcriptome
bowtie index first instead of building a genome index.
See the `crisprBowtie` vignette for more detail.
For more information, please see the following tutorial:
- [CRISPR knockdown (CRISPRkd) design with CasRxdesign](https://github.com/crisprVerse/Tutorials/tree/master/Design_CRISPRkd_CasRx)
# Design for optical pooled screening (OPS)
Optical pooled screening (OPS) combines image-based sequencing
(in situ sequencing) of gRNAs and optical phenotyping on the
same physical wells [@ops]. In such experiments, gRNA spacer
sequences are partially sequenced from the 5 prime end. From a
gRNA design perspective, additional gRNA design constraints are
needed to ensure sufficient dissimilarity of the truncated spacer
sequences. The length of the truncated sequences, which corresponds
to the number of sequencing cycles, is fixed and chosen by the experimentalist.
To illustrate the functionalities of `crisprDesign` for designing OPS
libraries, we use the `guideSetExample`.
We will design an OPS library with 8 cycles.
```{r}
n_cycles=8
```
We add the 8nt OPS barcodes to the GuideSet using the `addOpsBarcodes` function:
```{r}
data(guideSetExample, package="crisprDesign")
guideSetExample <- addOpsBarcodes(guideSetExample,
n_cycles=n_cycles)
head(guideSetExample$opsBarcode)
```
The function `getBarcodeDistanceMatrix` calculates the nucleotide distance
between a set of query barcodes and a set of target barcodes. The type of
distance (hamming or levenshtein) can be specified using the `dist_method`
argument. The Hamming distance (default) only considers substitutions when
calculating distances, while the Levenshtein distance allows insertions and
deletions.
When the argument `binnarize` is set to `FALSE`, the return object is a
matrix of pairwise distances between query and target barcodes:
```{r}
barcodes <- guideSetExample$opsBarcode
dist <- getBarcodeDistanceMatrix(barcodes[1:5],
barcodes[6:10],
binnarize=FALSE)
print(dist)
```
When `binnarize` is set to `TRUE` (default), the matrix of distances is
binnarized so that 1 indicates similar barcodes, and 0 indicates
dissimilar barcodes. The `min_dist_edit` argument specifies the minimal
distance between two barcodes to be considered dissimilar:
```{r}
dist <- getBarcodeDistanceMatrix(barcodes[1:5],
barcodes[6:10],
binnarize=TRUE,
min_dist_edit=4)
print(dist)
```
The `designOpsLibrary` allows users to perform a complete end-to-end
library design; see `?designOpsLibrary` for documentation.
For more information, please see the following tutorial:
- [Design for OPS](https://github.com/crisprVerse/Tutorials/tree/master/Design_OPS)
# Design of gRNA pairs with the \code{PairedGuideSet} object
The `findSpacerPairs` function in `crisprDesign` enables the design of
pairs of gRNAs and works similar to `findSpacers`. As an example, we
will design candidate pairs of gRNAs that target a small locus located
on chr12 in the human genome:
```{r}
library(GenomicRanges)
library(BSgenome.Hsapiens.UCSC.hg38)
library(crisprBase)
bsgenome <- BSgenome.Hsapiens.UCSC.hg38
```
We first specify the genomic locus:
```{r}
gr <- GRanges(c("chr12"),
IRanges(start=22224014, end=22225007))
```
and find all pairs using the function `findSpacerPairs`:
```{r}