Merge pull request #42 from fanc-WU/master
Add scripts for SNV2
debugpoint136 authored Jul 6, 2020
2 parents 2b526a8 + cf3941d commit 7bd2fbd
Showing 21 changed files with 61,188 additions and 0 deletions.
Binary file added scripts/_static/snv2_duck.png
Binary file added scripts/_static/snv2_showcase.png
9 files renamed without changes.
112 changes: 112 additions & 0 deletions scripts/snv2.rst
@@ -0,0 +1,112 @@
SNV2 track
==========

Functionality of SNV2 tracks
----------------------------

SNV2 tracks are an extension of SNV tracks that can also show amino acid level mutations. They support everything that SNV tracks support: color-coded mutations, zoomed-in and zoomed-out views, detailed information upon mouse click, etc. The SNV2 format is deliberately flexible: only the first four columns have a fixed meaning.

Essentially, the SNV2 format is a combination of categorical tracks (color coding) and bed tracks (text display). You can think of it as a “color-coded bed track”, which means it can be reused for many other purposes. We provide scripts that generate SNV2 tracks reporting potential amino acid mutations, but we encourage you to generate customized tracks with your own scripts.

.. image:: _static/snv2_showcase.png

Defining the format of SNV2
---------------------------

SNV2 tracks can have as many columns as necessary. The first 3 columns encode the genomic position. The 4th column encodes the category. All columns from the 4th onward are shown as text upon mouse click. The mapping between categories and colors is specified in JSON.
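
The column layout can be illustrated with a small Python sketch. It is only an illustration: the helper name ``snv2_row`` is hypothetical, and the use of tabs as the column separator is an assumption based on the file being indexed with tabix like a bed file.

```python
def snv2_row(chrom, start, end, category, *text_columns):
    """Format one SNV2 record as a single line (hypothetical helper).

    start is 0-based and inclusive, end is 0-based and exclusive.
    Columns are joined with tabs, which is an assumption of this sketch.
    """
    if not 0 <= start < end:
        raise ValueError("expected 0 <= start < end")
    fields = [chrom, str(start), str(end), category, *text_columns]
    return "\t".join(fields)


# A single-nucleotide change at 1-based position 23403 covers [23402, 23403).
pos_1based = 23403
row = snv2_row("NC_045512.2", pos_1based - 1, pos_1based, "missense", "mismatch: G")
print(row)
```

The 4th field doubles as the color key and as popup text, so any extra columns are purely descriptive.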

.. csv-table::
   :header: "column", "detail"

   "1", "chromosome name for the epigenome browser, reference name for the virus genome browser (such as NC_045512.2 for SARS-CoV-2)"
   "2", "start position on the reference genome, 0-based, inclusive"
   "3", "stop position on the reference genome, 0-based, not inclusive"
   "4", "category. Controls the color code, and is also shown as text in the popup window upon clicking"
   "5, 6, 7...", "text columns. Shown as text in the popup window upon clicking"

The 4th column is mapped to colors by specifying "segmentColors" in the "options" part of the datahub JSON. A detailed tutorial on datahub JSON is here: https://epigenomegateway.readthedocs.io/en/latest/datahub.html?highlight=json

However, if you are using the SNV2 track to show AA mutations, you don't need to specify the colors in JSON, because we provide a default color mapping:

.. code-block:: JSON

   "options": {
       "segmentColors": {
           "un_sequenced": "Linen",
           "noncoding_insertion": "LightGrey",
           "noncoding_deletion": "LightGrey",
           "noncoding_mismatch": "LightGrey",
           "silent": "DimGrey",
           "frameshift": "FireBrick",
           "missense": "CornflowerBlue",
           "AA_deletion": "CornflowerBlue",
           "AA_insertion": "CornflowerBlue",
           "N_mask": "Linen",
           "deletion_mask": "Linen"
       }
   }

For a quick demo:

.. code-block:: bash

   NC_045512.2 10000 10001 duck cyberduck cyberduck quit unexpectedly

Compress it with bgzip and index it with tabix (https://epigenomegateway.readthedocs.io/en/latest/tracks.html?highlight=tabix#prepare-track-files). Then put it into a JSON file like this:

.. code-block:: JSON

   [{
       "name": "duck",
       "type": "snv2",
       "url": "http://your.url.to.duck.file/duck.snv2.gz",
       "options": {
           "segmentColors": {
               "duck": "red"
           }
       }
   }]

Upload the track through Tracks -> Remote Tracks -> Add Remote Data Hub. You will see:

.. image:: _static/snv2_duck.png
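
If you generate many tracks, the datahub JSON can be produced by a script instead of written by hand. The following Python sketch rebuilds the duck datahub; the function name ``snv2_datahub`` is hypothetical and the URL is the same placeholder used in the demo.

```python
import json


def snv2_datahub(name, url, segment_colors):
    """Build a one-track datahub as a JSON string (hypothetical helper)."""
    track = {
        "name": name,
        "type": "snv2",
        "url": url,
        # Maps each category in column 4 to a color name.
        "options": {"segmentColors": dict(segment_colors)},
    }
    return json.dumps([track], indent=2)


hub = snv2_datahub(
    "duck",
    "http://your.url.to.duck.file/duck.snv2.gz",  # placeholder URL from the demo
    {"duck": "red"},
)
print(hub)
```

Every category that appears in column 4 of a custom track should have an entry in the color dictionary.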

One of our snv2 tracks for SARS-CoV-2 is coded like this:

.. code-block:: bash

   NC_045512.2 0 16 un_sequenced un_sequenced
   NC_045512.2 240 241 noncoding_mismatch mismatch: T NC_045512.2:240-241 | ORF:noncoding | C > T | noncoding_mismatch
   NC_045512.2 3036 3037 silent mismatch: T NC_045512.2:3034-3037 | ORF1ab:F924 | TTC > TTT | F > F | silent ; NC_045512.2:3034-3037 | ORF1a:F924 | TTC > TTT | F > F | silent
   NC_045512.2 14407 14408 missense mismatch: T NC_045512.2:14406-14409 | ORF1ab:P4715 | CCT > CTT | P > L | missense
   NC_045512.2 23402 23403 missense mismatch: G NC_045512.2:23401-23404 | S:D614 | GAT > GGT | D > G | missense
   NC_045512.2 29872 29903 un_sequenced un_sequenced

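
The annotation text in these records appears to follow a regular pattern: entries for different ORFs are separated by `` ; `` and the fields within an entry by `` | ``. The Python sketch below splits such a string under that assumption. The pattern is inferred from the example lines only, and the function name and dictionary keys are hypothetical.

```python
def parse_annotation(text):
    """Split an annotation string into one dict per ORF entry.

    Assumes ' ; ' separates entries and ' | ' separates fields,
    as inferred from the example records (not a documented format).
    """
    entries = []
    for entry in text.split(" ; "):
        fields = [f.strip() for f in entry.split(" | ")]
        record = {
            "region": fields[0],
            "orf": fields[1],
            "nucleotide_change": fields[2],
            "category": fields[-1],
        }
        # Coding entries carry an extra amino acid field before the category.
        if len(fields) == 5:
            record["aa_change"] = fields[3]
        entries.append(record)
    return entries


annotation = (
    "NC_045512.2:3034-3037 | ORF1ab:F924 | TTC > TTT | F > F | silent ; "
    "NC_045512.2:3034-3037 | ORF1a:F924 | TTC > TTT | F > F | silent"
)
entries = parse_annotation(annotation)
for e in entries:
    print(e["orf"], e.get("aa_change"), e["category"])
```
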
Scripts for generating snv2 tracks
==================================

All of our premade SNV2 tracks that you can see in :code:`Tracks -> Public Data Hubs` are generated through a set of scripts that can be found at https://github.com/debugpoint136/WashU-Virus-Genome-Browser/tree/master/scripts/snv2/

The main function :code:`tsv2snv2.2()`, which we used to generate all snv2 files in the public data hubs, is in :code:`snv2_public_7_2_20.R`, while the helper functions are in :code:`snv2_helper_7_2_20.R`. We used :code:`snv2_orf_7_2_20.R` to generate the tsv file with ORF information required in the main function.

The arguments of :code:`tsv2snv2.2()`:

.. csv-table::
   :header: "argument", "detail"

   "tsv.vec", "A vector of tsv files generated by `publicAlignment.py <https://github.com/debugpoint136/WashU-Virus-Genome-Browser/tree/master/scripts/snv/>`_ in tempt_dir (an argument for publicAlign.py)."
   "ref.fa", "Name of the fasta file for the reference. The one for SARS-CoV-2 is `here <https://github.com/debugpoint136/WashU-Virus-Genome-Browser/tree/master/scripts/snv2/refs/ncov_ref.fa>`_."
   "ref.orf.table", "A dataframe, or the name of a tsv file containing this dataframe, with the ORF information for the reference. `The one for SARS-CoV-2 <https://github.com/debugpoint136/WashU-Virus-Genome-Browser/tree/master/scripts/snv2/refs/ncov.aa.df.tsv>`_ is already generated. Refer to it for formatting."
   "min.contig.head.tail", "Integer. Used to mask the beginning and the end of the sequenced region, so that unsequenced regions won't be treated as deletions. The default is 15, which means the first 15 continuous non-deletions mark the beginning of the sequenced region. The same applies to the end of the sequenced region."
   "out.snv2.vec", "A vector of output snv2 file names."
   "hier.df", "A dataframe containing the 'hierarchy' of different types of mutations. It can also be the name of a tsv file containing this dataframe. Sometimes one nucleotide is used by more than one ORF, so a mutation at this nucleotide might cause different types of amino acid mutations in different ORFs. For example, a mutation can be silent for ORF A but missense for ORF B. In this case, 'missense' will override 'silent' because of the settings in hier.df. Refer to `hier.df.6.18.tsv <https://github.com/debugpoint136/WashU-Virus-Genome-Browser/tree/master/scripts/snv2/>`_ for formatting."
   "thread.master", "The number of snv2 files to generate in parallel."
   "thread.sub", "The number of ORFs to process in parallel."
   "bedtools.path", "Deprecated. Pass any value."
   "return.df", "Boolean. Whether the entire snv2 track should be returned as a dataframe."
8 changes: 8 additions & 0 deletions scripts/snv2/hier.df.6.18.tsv
@@ -0,0 +1,8 @@
type hier
un_sequenced 0
frameshift 1
AA_deletion 2
AA_insertion 3
missense 4
silent 5
100
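
The values in hier.df.6.18.tsv can be read as a priority list. The documentation states that 'missense' overrides 'silent', and missense has the smaller value (4 versus 5), so the Python sketch below assumes that the smallest value wins. This is an inference, not the actual R implementation; the function name is hypothetical, and using 100 as the rank for types not listed in the table is also an assumption.

```python
# Ranks copied from hier.df.6.18.tsv.
HIER = {
    "un_sequenced": 0,
    "frameshift": 1,
    "AA_deletion": 2,
    "AA_insertion": 3,
    "missense": 4,
    "silent": 5,
}


def resolve_category(categories, hier=HIER, default_rank=100):
    """Pick the category with the smallest rank (assumed to be the winner).

    Categories missing from hier get default_rank, an assumption of this sketch.
    """
    return min(categories, key=lambda c: hier.get(c, default_rank))


# One nucleotide shared by two ORFs: silent in one, missense in the other.
print(resolve_category(["silent", "missense"]))
```
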
266 changes: 266 additions & 0 deletions scripts/snv2/refs/ncov.aa.df.inspect.tsv
@@ -0,0 +1,266 @@
gpos ORF1ab ORF1a S ORF3a E M ORF6 ORF7a ORF7b ORF8 ORF9 ORF10
1
261
262
263
264
265
266 M1 M1
267 M1 M1
268 M1 M1
269 E2 E2
270 E2 E2
271 E2 E2
13463 L4400 L4400
13464 L4400 L4400
13465 L4400 L4400
13466 N4401 N4401
13467 N4401 N4401
13468 R4402 N4401
13469 R4402 G4402
13470 R4402 G4402
13471 V4403 G4402
13472 V4403 F4403
13473 V4403 F4403
21550 N7096
21551 N7096
21552 N7096
21553 *7097
21554 *7097
21555 *7097
21556
21557
21558
21559
21560
13478 G4405 V4405
13479 G4405 V4405
13480 V4406 V4405
13481 V4406 *4406
13482 V4406 *4406
13483 S4407 *4406
13484 S4407
13485 S4407
13486 A4408
13487 A4408
13488 A4408
21558
21559
21560
21561
21562
21563 M1
21564 M1
21565 M1
21566 F2
21567 F2
21568 F2
25379 T1273
25380 T1273
25381 T1273
25382 *1274
25383 *1274
25384 *1274
25385
25386
25387
25388
25389
25388
25389
25390
25391
25392
25393 M1
25394 M1
25395 M1
25396 D2
25397 D2
25398 D2
26215 L275
26216 L275
26217 L275
26218 *276
26219 *276
26220 *276
26221
26222
26223
26224
26225
26240
26241
26242
26243
26244
26245 M1
26246 M1
26247 M1
26248 Y2
26249 Y2
26250 Y2
26467 V75
26468 V75
26469 V75
26470 *76
26471 *76
26472 *76
26473
26474
26475
26476
26477
26518
26519
26520
26521
26522
26523 M1
26524 M1
26525 M1
26526 A2
26527 A2
26528 A2
27186 Q222
27187 Q222
27188 Q222
27189 *223
27190 *223
27191 *223
27192
27193
27194
27195
27196
27197
27198
27199
27200
27201
27202 M1
27203 M1
27204 M1
27205 F2
27206 F2
27207 F2
27382 D61
27383 D61
27384 D61
27385 *62
27386 *62
27387 *62
27388
27389
27390
27391
27392
27389
27390
27391
27392
27393
27394 M1
27395 M1
27396 M1
27397 K2
27398 K2
27399 K2
27754 E121
27755 E121
27756 E121 M1
27757 *122 M1
27758 *122 M1
27759 *122 I2
27760 I2
27761 I2
27762 E3
27763 E3
27764 E3
27751 T120
27752 T120
27753 T120
27754 E121
27755 E121
27756 E121 M1
27757 *122 M1
27758 *122 M1
27759 *122 I2
27760 I2
27761 I2
27882 A43
27883 A43
27884 A43
27885 *44
27886 *44
27887 *44
27888
27889
27890
27891
27892
27889
27890
27891
27892
27893
27894 M1
27895 M1
27896 M1
27897 K2
27898 K2
27899 K2
28254 I121
28255 I121
28256 I121
28257 *122
28258 *122
28259 *122
28260
28261
28262
28263
28264
28269
28270
28271
28272
28273
28274 M1
28275 M1
28276 M1
28277 S2
28278 S2
28279 S2
29528 A419
29529 A419
29530 A419
29531 *420
29532 *420
29533 *420
29534
29535
29536
29537
29538
29553
29554
29555
29556
29557
29558 M1
29559 M1
29560 M1
29561 G2
29562 G2
29563 G2
29669 T38
29670 T38
29671 T38
29672 *39
29673 *39
29674 *39
29675
29676
29677
29678
29679