Merge pull request #42 from fanc-WU/master
Add scripts for SNV2
Showing 21 changed files with 61,188 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,112 @@ | ||
SNV2 track | ||
========== | ||
|
||
Functionality of SNV2 tracks | ||
---------------------------- | ||
|
||
SNV2 tracks extend SNV tracks with the ability to show amino-acid-level mutations. They also support everything that SNV tracks support: color-coded mutations, zoomed-in and zoomed-out views, detailed information on mouse click, etc. The SNV2 format is deliberately flexible: beyond the first four columns, a track can carry any number of text columns. | ||
|
||
Essentially, the SNV2 format combines categorical tracks (color coding) with bed tracks (text display); you can think of it as "color-coded bed tracks". This lets it be reused for many other purposes. We provide scripts that generate SNV2 tracks reporting potential amino acid mutations, but we encourage you to build your own tracks with your own scripts. | ||
|
||
.. image:: _static/snv2_showcase.png | ||
|
||
Defining the format of SNV2 | ||
--------------------------- | ||
|
||
SNV2 tracks can have as many columns as necessary. The first 3 columns encode the genomic position. The 4th column encodes the category, which determines the color. All columns from the 4th onward are shown as text upon mouse click. The mapping between categories and colors is specified in JSON. | ||
|
||
.. csv-table:: | ||
:header: "column", "detail" | ||
|
||
"1", "chromosome name for the epigenome browser, or reference name for the virus genome browser (such as NC_045512.2 for SARS-CoV-2)" | ||
"2", "start position on the reference genome, 0-based, inclusive" | ||
"3", "stop position on the reference genome, 0-based, exclusive" | ||
"4", "category. Controls the color code; also shown as text in the popup window upon clicking" | ||
"5, 6, 7...", "text columns. Shown as text in the popup window upon clicking" | ||
|
||
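The column layout above can be sketched in Python. This is a minimal illustration, not part of the provided scripts: the helper names are hypothetical, and it assumes the columns are tab-separated, as in bed files.

```python
def one_based_to_snv2_interval(position):
    """Convert a 1-based nucleotide position to a 0-based (start, end) pair.

    The start is inclusive and the end is exclusive, as described in the table.
    """
    return position - 1, position


def format_snv2_record(chrom, start, end, category, *text_columns):
    """Build one SNV2 line (hypothetical helper; assumes tab-separated columns)."""
    if start < 0 or end <= start:
        raise ValueError("expected 0-based coordinates with start < end")
    fields = [chrom, str(start), str(end), category, *text_columns]
    if any("\t" in field or "\n" in field for field in fields):
        raise ValueError("fields must not contain tabs or newlines")
    return "\t".join(fields)


# 1-based position 23403 becomes the interval [23402, 23403).
start, end = one_based_to_snv2_interval(23403)
line = format_snv2_record("NC_045512.2", start, end, "missense", "mismatch: G")
print(line)
```

The category goes in the 4th column, and any number of extra text columns can follow it.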
The 4th column is mapped to colors by specifying "segmentColors" in the "options" part of the datahub JSON. A detailed tutorial on datahub JSON is here: https://epigenomegateway.readthedocs.io/en/latest/datahub.html?highlight=json | ||
|
||
However, if you are using the SNV2 track to show AA mutations, you don't need to upload it through a JSON datahub, because we provide a default color mapping: | ||
|
||
.. code-block:: JSON | ||
|
||
    "options": { | ||
        "segmentColors": { | ||
            "un_sequenced": "Linen", | ||
            "noncoding_insertion": "LightGrey", | ||
            "noncoding_deletion": "LightGrey", | ||
            "noncoding_mismatch": "LightGrey", | ||
            "silent": "DimGrey", | ||
            "frameshift": "FireBrick", | ||
            "missense": "CornflowerBlue", | ||
            "AA_deletion": "CornflowerBlue", | ||
            "AA_insertion": "CornflowerBlue", | ||
            "N_mask": "Linen", | ||
            "deletion_mask": "Linen" | ||
        } | ||
    } | ||
|
||
For a quick demo: | ||
|
||
.. code-block:: bash | ||
NC_045512.2 10000 10001 duck cyberduck cyberduck quit unexpectedly | ||
Compress the file with bgzip and index it with tabix (https://epigenomegateway.readthedocs.io/en/latest/tracks.html?highlight=tabix#prepare-track-files). Then reference it in a JSON datahub like this: | ||
|
||
.. code-block:: JSON | ||
|
||
    [{ | ||
        "name": "duck", | ||
        "type": "snv2", | ||
        "url": "http://your.url.to.duck.file/duck.snv2.gz", | ||
        "options": { | ||
            "segmentColors": { | ||
                "duck": "red" | ||
            } | ||
        } | ||
    }] | ||
|
||
Upload the datahub through :code:`Tracks -> Remote Tracks -> Add Remote Data Hub`. You will see: | ||
|
||
.. image:: _static/snv2_duck.png | ||
|
||
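If you generate many tracks, the datahub JSON can be produced by a script. The following is a minimal Python sketch under stated assumptions: the helper names are hypothetical, and only the keys shown in the duck example ("name", "type", "url", "options", "segmentColors") are used.

```python
import json


def snv2_hub_entry(name, url, segment_colors):
    """Return one datahub entry for an SNV2 track (hypothetical helper)."""
    return {
        "name": name,
        "type": "snv2",
        "url": url,
        "options": {"segmentColors": dict(segment_colors)},
    }


def missing_colors(categories, segment_colors):
    """List the categories (4th-column values) that have no color assigned."""
    return sorted(set(categories) - set(segment_colors))


colors = {"duck": "red"}
hub = [snv2_hub_entry("duck", "http://your.url.to.duck.file/duck.snv2.gz", colors)]
hub_json = json.dumps(hub, indent=2)
print(hub_json)

# Check that every category used in the track has a color before uploading.
print(missing_colors(["duck", "goose"], colors))
```

Checking for missing colors before uploading is a convenience of this sketch, not a step required by the browser.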
One of our snv2 tracks for SARS-CoV-2 is coded like this: | ||
|
||
.. code-block:: bash | ||
NC_045512.2 0 16 un_sequenced un_sequenced | ||
NC_045512.2 240 241 noncoding_mismatch mismatch: T NC_045512.2:240-241 | ORF:noncoding | C > T | noncoding_mismatch | ||
NC_045512.2 3036 3037 silent mismatch: T NC_045512.2:3034-3037 | ORF1ab:F924 | TTC > TTT | F > F | silent ; NC_045512.2:3034-3037 | ORF1a:F924 | TTC > TTT | F > F | silent | ||
NC_045512.2 14407 14408 missense mismatch: T NC_045512.2:14406-14409 | ORF1ab:P4715 | CCT > CTT | P > L | missense | ||
NC_045512.2 23402 23403 missense mismatch: G NC_045512.2:23401-23404 | S:D614 | GAT > GGT | D > G | missense | ||
NC_045512.2 29872 29903 un_sequenced un_sequenced | ||
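The annotation text in the rows above can be split back into fields for downstream analysis. This Python sketch assumes the layout seen in the example rows: annotations for different ORFs are separated by ";" and fields within one annotation by "|", with the region first, the ORF second, and the mutation type last. The function name and the dictionary keys are hypothetical.

```python
def parse_annotation(text):
    """Split an annotation string into one record per ORF.

    Assumed layout (inferred from the example rows):
    region | ORF[:residue] | change [| change ...] | category
    """
    records = []
    for part in text.split(";"):
        fields = [field.strip() for field in part.split("|")]
        if len(fields) < 3:
            raise ValueError("unexpected annotation: " + part.strip())
        records.append({
            "region": fields[0],
            "orf": fields[1],
            "changes": fields[2:-1],  # codon change, then amino acid change if coding
            "category": fields[-1],
        })
    return records


text = ("NC_045512.2:3034-3037 | ORF1ab:F924 | TTC > TTT | F > F | silent ; "
        "NC_045512.2:3034-3037 | ORF1a:F924 | TTC > TTT | F > F | silent")
for record in parse_annotation(text):
    print(record["orf"], record["changes"], record["category"])
```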
Scripts for generating snv2 tracks | ||
================================== | ||
|
||
All of our premade SNV2 tracks that you can see in :code:`Tracks -> Public Data Hubs` are generated through a set of scripts that can be found at https://github.com/debugpoint136/WashU-Virus-Genome-Browser/tree/master/scripts/snv2/ | ||
|
||
The main function :code:`tsv2snv2.2()`, which we used to generate all snv2 files in the public data hubs, is in :code:`snv2_public_7_2_20.R`, while the helper functions are in :code:`snv2_helper_7_2_20.R`. We used :code:`snv2_orf_7_2_20.R` to generate the tsv file with ORF information required in the main function. | ||
|
||
The arguments of :code:`tsv2snv2.2()`: | ||
|
||
.. csv-table:: | ||
:header: "argument", "detail" | ||
|
||
"tsv.vec", "a vector of tsv files generated by `publicAlignment.py | ||
<https://github.com/debugpoint136/WashU-Virus-Genome-Browser/tree/master/scripts/snv/>`_ in tempt_dir (an argument for publicAlign.py). " | ||
"ref.fa", "name of the fasta file for the reference. The one for SARS-CoV-2 is `here | ||
<https://github.com/debugpoint136/WashU-Virus-Genome-Browser/tree/master/scripts/snv2/refs/ncov_ref.fa>`_" | ||
"ref.orf.table", "the ORF information for the reference, given as a dataframe or as the name of a tsv file containing that dataframe. `The one for SARS-CoV-2 | ||
<https://github.com/debugpoint136/WashU-Virus-Genome-Browser/tree/master/scripts/snv2/refs/ncov.aa.df.tsv>`_ is already generated. Refer to it for formatting." | ||
"min.contig.head.tail", "integer. Used to mask the beginning and the end of the sequenced region, so that unsequenced regions are not treated as deletions. The default is 15, which means that the first run of 15 consecutive non-deleted positions marks the beginning of the sequenced region. The same applies to the end of the sequenced region" | ||
"out.snv2.vec", "a vector of output snv2 file names" | ||
"hier.df", "a dataframe containing the 'hierarchy' of different types of mutations, or the name of a tsv file containing this dataframe. Sometimes one nucleotide is used by more than one ORF, and a mutation at this nucleotide can cause different types of amino acid mutations in different ORFs. For example, a mutation can be silent for ORF A but missense for ORF B. In this case, 'missense' overrides 'silent' because of the settings in hier.df. Refer to `hier.df.6.18.tsv | ||
<https://github.com/debugpoint136/WashU-Virus-Genome-Browser/tree/master/scripts/snv2/>`_ for formatting" | ||
"thread.master", "the number of snv2 files to generate in parallel" | ||
"thread.sub", "the number of ORFs to process in parallel" | ||
"bedtools.path", "deprecated. Pass any value" | ||
"return.df", "bool. if the entire snv2 track should be returned as a dataframe." |
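The override behaviour described for hier.df can be illustrated with a short Python sketch. It is not the R implementation: the ranks are copied from the hierarchy tsv added in this commit, and the rule that the lowest rank wins is an assumption inferred from the 'missense' over 'silent' example. The function name and the fallback rank for unknown types are also assumptions.

```python
# Ranks copied from the hierarchy tsv in this commit.
HIER = {
    "un_sequenced": 0,
    "frameshift": 1,
    "AA_deletion": 2,
    "AA_insertion": 3,
    "missense": 4,
    "silent": 5,
}
# The tsv ends with a row that has rank 100 and no type name; using it as the
# rank for any unlisted type is an assumption of this sketch.
DEFAULT_RANK = 100


def pick_category(categories, hier=HIER, default_rank=DEFAULT_RANK):
    """Return the category that overrides the others (assumed: lowest rank wins)."""
    return min(categories, key=lambda category: hier.get(category, default_rank))


# A nucleotide shared by two ORFs: silent in one, missense in the other.
print(pick_category(["silent", "missense"]))  # missense
```

Calling the function with an empty list raises a ValueError, so callers should handle positions that have no mutation separately.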
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
type hier | ||
un_sequenced 0 | ||
frameshift 1 | ||
AA_deletion 2 | ||
AA_insertion 3 | ||
missense 4 | ||
silent 5 | ||
100 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,266 @@ | ||
gpos ORF1ab ORF1a S ORF3a E M ORF6 ORF7a ORF7b ORF8 ORF9 ORF10 | ||
1 | ||
261 | ||
262 | ||
263 | ||
264 | ||
265 | ||
266 M1 M1 | ||
267 M1 M1 | ||
268 M1 M1 | ||
269 E2 E2 | ||
270 E2 E2 | ||
271 E2 E2 | ||
13463 L4400 L4400 | ||
13464 L4400 L4400 | ||
13465 L4400 L4400 | ||
13466 N4401 N4401 | ||
13467 N4401 N4401 | ||
13468 R4402 N4401 | ||
13469 R4402 G4402 | ||
13470 R4402 G4402 | ||
13471 V4403 G4402 | ||
13472 V4403 F4403 | ||
13473 V4403 F4403 | ||
21550 N7096 | ||
21551 N7096 | ||
21552 N7096 | ||
21553 *7097 | ||
21554 *7097 | ||
21555 *7097 | ||
21556 | ||
21557 | ||
21558 | ||
21559 | ||
21560 | ||
13478 G4405 V4405 | ||
13479 G4405 V4405 | ||
13480 V4406 V4405 | ||
13481 V4406 *4406 | ||
13482 V4406 *4406 | ||
13483 S4407 *4406 | ||
13484 S4407 | ||
13485 S4407 | ||
13486 A4408 | ||
13487 A4408 | ||
13488 A4408 | ||
21558 | ||
21559 | ||
21560 | ||
21561 | ||
21562 | ||
21563 M1 | ||
21564 M1 | ||
21565 M1 | ||
21566 F2 | ||
21567 F2 | ||
21568 F2 | ||
25379 T1273 | ||
25380 T1273 | ||
25381 T1273 | ||
25382 *1274 | ||
25383 *1274 | ||
25384 *1274 | ||
25385 | ||
25386 | ||
25387 | ||
25388 | ||
25389 | ||
25388 | ||
25389 | ||
25390 | ||
25391 | ||
25392 | ||
25393 M1 | ||
25394 M1 | ||
25395 M1 | ||
25396 D2 | ||
25397 D2 | ||
25398 D2 | ||
26215 L275 | ||
26216 L275 | ||
26217 L275 | ||
26218 *276 | ||
26219 *276 | ||
26220 *276 | ||
26221 | ||
26222 | ||
26223 | ||
26224 | ||
26225 | ||
26240 | ||
26241 | ||
26242 | ||
26243 | ||
26244 | ||
26245 M1 | ||
26246 M1 | ||
26247 M1 | ||
26248 Y2 | ||
26249 Y2 | ||
26250 Y2 | ||
26467 V75 | ||
26468 V75 | ||
26469 V75 | ||
26470 *76 | ||
26471 *76 | ||
26472 *76 | ||
26473 | ||
26474 | ||
26475 | ||
26476 | ||
26477 | ||
26518 | ||
26519 | ||
26520 | ||
26521 | ||
26522 | ||
26523 M1 | ||
26524 M1 | ||
26525 M1 | ||
26526 A2 | ||
26527 A2 | ||
26528 A2 | ||
27186 Q222 | ||
27187 Q222 | ||
27188 Q222 | ||
27189 *223 | ||
27190 *223 | ||
27191 *223 | ||
27192 | ||
27193 | ||
27194 | ||
27195 | ||
27196 | ||
27197 | ||
27198 | ||
27199 | ||
27200 | ||
27201 | ||
27202 M1 | ||
27203 M1 | ||
27204 M1 | ||
27205 F2 | ||
27206 F2 | ||
27207 F2 | ||
27382 D61 | ||
27383 D61 | ||
27384 D61 | ||
27385 *62 | ||
27386 *62 | ||
27387 *62 | ||
27388 | ||
27389 | ||
27390 | ||
27391 | ||
27392 | ||
27389 | ||
27390 | ||
27391 | ||
27392 | ||
27393 | ||
27394 M1 | ||
27395 M1 | ||
27396 M1 | ||
27397 K2 | ||
27398 K2 | ||
27399 K2 | ||
27754 E121 | ||
27755 E121 | ||
27756 E121 M1 | ||
27757 *122 M1 | ||
27758 *122 M1 | ||
27759 *122 I2 | ||
27760 I2 | ||
27761 I2 | ||
27762 E3 | ||
27763 E3 | ||
27764 E3 | ||
27751 T120 | ||
27752 T120 | ||
27753 T120 | ||
27754 E121 | ||
27755 E121 | ||
27756 E121 M1 | ||
27757 *122 M1 | ||
27758 *122 M1 | ||
27759 *122 I2 | ||
27760 I2 | ||
27761 I2 | ||
27882 A43 | ||
27883 A43 | ||
27884 A43 | ||
27885 *44 | ||
27886 *44 | ||
27887 *44 | ||
27888 | ||
27889 | ||
27890 | ||
27891 | ||
27892 | ||
27889 | ||
27890 | ||
27891 | ||
27892 | ||
27893 | ||
27894 M1 | ||
27895 M1 | ||
27896 M1 | ||
27897 K2 | ||
27898 K2 | ||
27899 K2 | ||
28254 I121 | ||
28255 I121 | ||
28256 I121 | ||
28257 *122 | ||
28258 *122 | ||
28259 *122 | ||
28260 | ||
28261 | ||
28262 | ||
28263 | ||
28264 | ||
28269 | ||
28270 | ||
28271 | ||
28272 | ||
28273 | ||
28274 M1 | ||
28275 M1 | ||
28276 M1 | ||
28277 S2 | ||
28278 S2 | ||
28279 S2 | ||
29528 A419 | ||
29529 A419 | ||
29530 A419 | ||
29531 *420 | ||
29532 *420 | ||
29533 *420 | ||
29534 | ||
29535 | ||
29536 | ||
29537 | ||
29538 | ||
29553 | ||
29554 | ||
29555 | ||
29556 | ||
29557 | ||
29558 M1 | ||
29559 M1 | ||
29560 M1 | ||
29561 G2 | ||
29562 G2 | ||
29563 G2 | ||
29669 T38 | ||
29670 T38 | ||
29671 T38 | ||
29672 *39 | ||
29673 *39 | ||
29674 *39 | ||
29675 | ||
29676 | ||
29677 | ||
29678 | ||
29679 |
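The residue numbers in the table above follow simple codon arithmetic. The sketch below assumes that gpos is a 1-based genomic position and that each ORF segment is read in consecutive triplets from its first nucleotide. Both assumptions are inferred from rows such as 266 M1, 269 E2 and 25379 T1273. The function name and the first_residue parameter are hypothetical.

```python
def residue_number(gpos, segment_start, first_residue=1):
    """Return the residue number encoded at genomic position gpos.

    segment_start is the first nucleotide of the ORF (or of an ORF segment),
    and first_residue is the residue number of the codon that starts there.
    """
    if gpos < segment_start:
        raise ValueError("position lies upstream of the segment start")
    return (gpos - segment_start) // 3 + first_residue


print(residue_number(269, 266))      # 2, matching the row "269 E2"
print(residue_number(25379, 21563))  # 1273, matching the row "25379 T1273"

# Where two ORFs diverge (rows 13468 onward), the second reading frame can be
# modelled as a new segment: here codon 4402 is taken to start at 13468.
print(residue_number(13471, 13468, first_residue=4402))  # 4403
```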