This document provides instructions on how to generate the data required for Tiberius in de novo mode. The data is generated with ClaMSA, a tool that uses multiple sequence alignments to generate evolutionary information for a given species. See instrutions on how to install ClaMSA and how to train ClaMSA (here) and to create splice site predictions (here). Afterwards, you should have 6 files in UCSCs WIG format for the 6 possible reading frames.
Tiberius de novo training and predicition uses ClaMSA data as numpy array as input. Following describes the steps to generate the data for one species for Tiberius in de novo mode:
- Get sequence names from the genomic sequence of the FASTA file:
grep ">" $genome | sed 's/>//g' > seq_names.txt
- Convert ClaMSA wig files into numpy arrays. Note that the ClaMSA files have to be located in a directory
$inp_dir
and the output will be saved in$out_dir
. You can provide the script with a prefix for the input and output files with--prefix
. For example, if the files are namedDesmodus_rotundus_plus_0.wig
,Desmodus_rotundus_minus_0.wig
, etc. you can use--prefix Desmodus_rotundus_
and it will produce numpy arrays namedDesmodus_rotundus_chr1.npy
,Desmodus_rotundus_chr2.npy
, etc.:python bin/wig2npy.py --inp_dir $inp_dir --out_dir $out_dir --prefix $prefix --seq_names seq_names.txt