---
title: "Build-rima-reference"
author: "altreuter"
date: "8/4/2021"
output: html_document
---
# Customize your own reference
RIMA provides a pre-built set of references based on the GDC hg38 genome and the GENCODE v22 annotation. This set of references can be downloaded as described in chapter 2.2.
We have also pre-built a set of reference files based on the GENCODE v27 annotation. It can be downloaded from http://cistrome.org/~lyang/ref_v27.tar.gz using the same instructions provided in chapter 2.
To build a different set of references, follow the instructions below.
## Reference fasta
Download the human GDC hg38 fasta file from the [GDC website](https://gdc.cancer.gov/about-data/gdc-data-processing/gdc-reference-files).
## Gene annotation file (gtf)
Download the human gtf annotation file from the [GENCODE website](https://www.gencodegenes.org/human/).
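
Before building any index, it can help to confirm that the sequence names in the gtf file match the headers of the fasta file, since a naming mismatch (for example `chr1` versus `1`) leaves the annotation unusable. The following is a minimal sketch of such a check; the function name `check_contigs` is hypothetical and not part of RIMA, and it assumes a non-empty fasta file.

```bash
# Hypothetical helper (not part of RIMA): list gtf sequence names that are
# absent from the fasta headers. Returns 0 when every gtf contig is present.
check_contigs() {
    local fasta="$1" gtf="$2" missing
    missing=$(awk '
        # First file (fasta): remember the first word of each header line
        NR == FNR { if (/^>/) { sub(/^>/, "", $1); seen[$1] = 1 } next }
        # Second file (gtf): skip comment lines, report each unknown contig once
        /^#/ { next }
        !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$fasta" "$gtf")
    if [ -n "$missing" ]; then
        printf 'gtf contigs missing from fasta:\n%s\n' "$missing" >&2
        return 1
    fi
    return 0
}
```

For the files used in this chapter, the call would be `check_contigs GRCh38.d1.vd1.CIDC.fa gencode.v27.annotation.gtf`.
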
## Build STAR index
Build the STAR index using the following code. Make sure to change the file names to indicate which gencode version you are using.
```bash
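conda activate rna
## STAR Version: STAR_2.6.1d
## Create the output directory beforehand (older STAR releases do not create --genomeDir)
mkdir -p ./ref_files/v27_index
STAR --runThreadN 16 --runMode genomeGenerate --genomeDir ./ref_files/v27_index --genomeFastaFiles GRCh38.d1.vd1.CIDC.fa --sjdbGTFfile gencode.v27.annotation.gtf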
...
00:04:54 ..... started STAR run
00:04:54 ... starting to generate Genome files
00:05:57 ... starting to sort Suffix Array. This may take a long time...
00:06:11 ... sorting Suffix Array chunks and saving them to disk...
00:17:43 ... loading chunks from disk, packing SA...
00:19:31 ... finished generating suffix array
00:19:31 ... generating Suffix Array index
00:23:20 ... completed Suffix Array index
00:23:20 ..... processing annotations GTF
00:23:35 ..... inserting junctions into the genome indices
00:26:49 ... writing Genome to disk ...
00:27:06 ... writing Suffix Array to disk ...
00:28:53 ... writing SAindex to disk
00:29:05 ..... finished successfully
```
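
Once STAR reports `finished successfully`, a quick sanity check is to confirm that the core index files are present and non-empty. This is a minimal sketch: the function name `check_star_index` is hypothetical, and the file list is an assumption that covers only a few of the files STAR writes to the genome directory.

```bash
# Hypothetical helper (not part of RIMA): confirm that a STAR genome directory
# holds a few of the core index files. The file list is an assumption.
check_star_index() {
    local dir="$1" f status=0
    for f in Genome SA SAindex genomeParameters.txt chrName.txt; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            status=1
        fi
    done
    return "$status"
}
```

For the index built above, the call would be `check_star_index ./ref_files/v27_index`.
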
## RSeQC reference files
Download the human annotation BED files, namely the whole-genome BED file and the housekeeping-gene BED file, from the RSeQC page on the [SourceForge website](https://sourceforge.net/projects/rseqc/files/BED/Human_Homo_sapiens/). We store them as:
```bash
./ref_files/refseqGenes.bed
./ref_files/housekeeping_refseqGenes.bed
```
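
RSeQC works with gene models in 12-column BED format, so it is worth checking a downloaded or custom BED file before using it. The sketch below is one way to do that; the function name `check_bed12` is hypothetical, and it flags only lines with fewer than 12 tab-separated columns.

```bash
# Hypothetical helper (not part of RIMA): flag BED lines with fewer than
# 12 tab-separated columns. Exits non-zero if any such line is found.
check_bed12() {
    awk -F'\t' '
        /^(#|track|browser)/ { next }   # skip comment and header lines
        NF < 12 {
            bad++
            if (bad <= 5) printf "line %d has %d columns\n", NR, NF
        }
        END { exit (bad > 0) }
    ' "$1"
}
```

For example, `check_bed12 ./ref_files/refseqGenes.bed` prints up to the first five offending lines and returns a non-zero status if any exist.
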
## Build salmon index
```bash
conda activate rna
## salmon Version: salmon 1.1.0
salmon index -t GRCh38.d1.vd1.CIDC.fa -i salmon_index
...
index ["salmon_index"] did not previously exist . . . creating it
[jLog] [info] building index
[jointLog] [info] [Step 1 of 4] : counting k-mers
[jointLog] [info] Replaced 164,553,847 non-ATCG nucleotides
[jointLog] [info] Clipped poly-A tails from 0 transcripts
[jointLog] [info] Building rank-select dictionary and saving to disk
[jointLog] [info] done
Elapsed time: 0.191866s
[jointLog] [info] Writing sequence data to file . . .
[jointLog] [info] done
Elapsed time: 1.91244s
[jointLog] [info] Building 64-bit suffix array (length of generalized text is 3,088,286,426)
[jointLog] [info] Building suffix array . . .
success
saving to disk . . . done
Elapsed time: 18.3072s
done
Elapsed time: 703.843s
```
## GMT file for gene set analysis
The GMT file is downloaded from the [Broad Institute release page](https://data.broadinstitute.org/gsea-msigdb/msigdb/release/6.1/). The GMT file we currently use is `c2.cp.kegg.v6.1.symbols.gmt`.
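
If you substitute a different GMT file, a quick look at its contents can confirm that it parsed as expected. In the GMT format, each line holds a gene set name, a description, and then the member genes, all tab-separated. The sketch below prints each gene set name with its gene count; the function name `gmt_summary` is hypothetical.

```bash
# Hypothetical helper (not part of RIMA): print "<gene set name><TAB><gene count>"
# for every line of a GMT file. Columns 1 and 2 are the name and description,
# so the gene count is the number of remaining columns.
gmt_summary() {
    awk -F'\t' 'NF >= 3 { printf "%s\t%d\n", $1, NF - 2 }' "$1"
}
```

For example, `gmt_summary c2.cp.kegg.v6.1.symbols.gmt | head` shows the first few gene sets and their sizes.
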
## STAR-Fusion genome resource lib
The genome resource lib is downloaded from the [Broad Institute release page](https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/). The lib we currently use is GRCh38_v22_CTAT_lib.
Alternatively, you can prep a genome resource lib yourself for use with STAR-Fusion.
For more details, read:
* https://github.com/STAR-Fusion/STAR-Fusion/wiki/installing-star-fusion
## Centrifuge index
The human Centrifuge index is downloaded from the [Centrifuge website](http://www.ccb.jhu.edu/software/centrifuge/). The index we currently use is p_compressed+h+v, which includes the human genome, prokaryotic genomes, and viral genomes.
You can also build your own custom Centrifuge index.
For more details, read:
* https://github.com/DaehwanKimLab/centrifuge
## TRUST4 reference files
The TRUST4 reference files include:
1. a TCR and BCR genomic sequence fasta file; and
2. a reference database sequence file containing annotation information.
```
hg38_bcrtcr.fa
human_IMGT+C.fa
```
These reference files can be downloaded directly from the [TRUST4 GitHub repository](https://github.com/liulab-dfci/TRUST4).
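
After downloading, a simple way to confirm that a fasta file arrived intact is to count its records. The following is a minimal sketch; the function name `check_fasta` is hypothetical.

```bash
# Hypothetical helper (not part of RIMA or TRUST4): fail if a fasta file is
# missing, empty, or contains no sequence records.
check_fasta() {
    local n
    [ -s "$1" ] || { echo "missing or empty: $1" >&2; return 1; }
    # grep -c exits non-zero when the count is 0, so guard it with || true
    n=$(grep -c '^>' "$1" || true)
    echo "$1: $n sequences"
    [ "$n" -gt 0 ]
}
```

For the two files listed above, the calls would be `check_fasta hg38_bcrtcr.fa` and `check_fasta human_IMGT+C.fa`.
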