BioJulia/BioMarkovChains.jl

Representing biological sequences as Markov chains


BioMarkovChains

A Julia package to represent biological sequences as Markov chains

Installation

BioMarkovChains is a Julia Language package. To install it, open Julia's interactive session (the REPL), press the ] key to enter package mode, and type the following command:

pkg> add BioMarkovChains

Creating a Markov chain out of DNA sequences

An important step before developing gene-finding algorithms is to have a Markov chain representation of the DNA. To do so, we implemented the BioMarkovChain method, which captures the initial and transition probabilities of a DNA sequence (LongSequence) and creates a dedicated object storing the relevant information of a DNA Markov chain. Here is an example:

Let's find one ORF in a random LongDNA:

using BioSequences, GeneFinder, BioMarkovChains

sequence = randdnaseq(10^3)
orfdna = getorfdna(sequence, min_len=75)[1]

If we translate it, we get a 69aa sequence:

translate(orfdna)
69aa Amino Acid Sequence:
MSCGETTVSPILSRRTAFIRTLLGYRFRSNLPTKAERSRFGFSLPQFISTPNDRQNGNGGCGCGLENR*

Now, supposing I want to see how transitions occur in this ORF sequence, I can use the BioMarkovChain method and tune it to a 2nd-order Markov chain:

BioMarkovChain(orfdna, 2)
BioMarkovChain with DNA Alphabet:
  - Transition Probability Matrix -> Matrix{Float64}(4 × 4):
   0.2123  0.2731  0.278   0.2366
   0.2017  0.3072  0.2687  0.2224
   0.1978  0.2651  0.2893  0.2478
   0.2013  0.3436  0.2431  0.212
  - Initial Probabilities -> Vector{Float64}(4 × 1):
   0.2027
   0.2973
   0.2703
   0.2297
  - Markov Chain Order -> Int64:
   2
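
To make the two estimated quantities concrete, here is a minimal sketch in plain Julia (no packages) of how first-order transition and initial probabilities can be estimated from a sequence. The function names transition_matrix and initial_probabilities are hypothetical, the A, C, G, T ordering is an assumption, and taking the overall nucleotide frequencies as initial probabilities is one simple choice. This is not the BioMarkovChains implementation, which may differ in its details.

```julia
# Hypothetical helpers for illustration only; not part of BioMarkovChains.

# Count transitions between consecutive nucleotides and normalise each row
# so that it sums to 1 (a first-order transition probability matrix).
function transition_matrix(seq::AbstractString; alphabet::AbstractString="ACGT")
    index = Dict(nt => i for (i, nt) in enumerate(alphabet))
    n = length(alphabet)
    counts = zeros(Float64, n, n)
    chars = collect(seq)
    for k in 1:length(chars)-1
        counts[index[chars[k]], index[chars[k+1]]] += 1
    end
    rowsums = sum(counts, dims=2)
    # Rows with no observations stay all zeros instead of dividing by zero.
    return counts ./ max.(rowsums, 1.0)
end

# Assumption: initial probabilities are the overall nucleotide frequencies.
function initial_probabilities(seq::AbstractString; alphabet::AbstractString="ACGT")
    freqs = [count(==(nt), seq) for nt in alphabet]
    return freqs ./ length(seq)
end

seq = "ATGGCGTACGTTAGC"
P = transition_matrix(seq)
p0 = initial_probabilities(seq)
println(round.(P, digits=3))
println(round.(p0, digits=3))
```

Each row of P corresponds to the current nucleotide and each column to the next one, so every row with at least one observation sums to 1.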

A BioMarkovChain object is useful for later creating HMMs and for calculating the probability of a sequence under a given model. For instance, the E. coli CDS and No-CDS transition models are already implemented as Markov chains:

ECOLICDS
BioMarkovChain with DNA Alphabet:
  - Transition Probability Matrix -> Matrix{Float64}(4 × 4):
   0.31    0.224   0.199   0.268
   0.251   0.215   0.313   0.221
   0.236   0.308   0.249   0.207
   0.178   0.217   0.338   0.267
  - Initial Probabilities -> Vector{Float64}(4 × 1):
   0.245
   0.243
   0.273
   0.239
  - Markov Chain Order -> Int64:
   1

What is then the probability of the previous random ORF sequence given this model?

dnaseqprobability(orfdna, ECOLICDS)
7.466531836596359e-45

This is of course not very informative on its own, but we can later use different criteria to classify new ORFs. For a more detailed explanation, see the docs.
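
As a rough illustration of what such a probability means, the sketch below computes the probability of a sequence under a first-order Markov chain: the initial probability of the first nucleotide multiplied by the transition probability of every consecutive pair. The matrix and vector are copied from the ECOLICDS output above. The A, C, G, T ordering of rows and columns, the function names sequence_probability and sequence_logprobability, and the short example ORF are all assumptions made for this example, and dnaseqprobability itself may be implemented differently.

```julia
# Values copied from the ECOLICDS output above.
# Assumption: rows and columns are ordered A, C, G, T.
tpm = [0.31  0.224 0.199 0.268;
       0.251 0.215 0.313 0.221;
       0.236 0.308 0.249 0.207;
       0.178 0.217 0.338 0.267]
inits = [0.245, 0.243, 0.273, 0.239]

# Hypothetical helper: P(seq) = inits[s1] * prod(tpm[s(k-1), s(k)]).
function sequence_probability(seq::AbstractString, tpm, inits; alphabet::AbstractString="ACGT")
    isempty(seq) && throw(ArgumentError("sequence must not be empty"))
    index = Dict(nt => i for (i, nt) in enumerate(alphabet))
    states = [index[c] for c in seq]
    p = inits[states[1]]
    for k in 2:length(states)
        p *= tpm[states[k-1], states[k]]
    end
    return p
end

# Same quantity in log space, which avoids very small numbers for long sequences.
function sequence_logprobability(seq::AbstractString, tpm, inits; alphabet::AbstractString="ACGT")
    isempty(seq) && throw(ArgumentError("sequence must not be empty"))
    index = Dict(nt => i for (i, nt) in enumerate(alphabet))
    states = [index[c] for c in seq]
    logp = log(inits[states[1]])
    for k in 2:length(states)
        logp += log(tpm[states[k-1], states[k]])
    end
    return logp
end

orf = "ATGGCGTAA"  # made-up example sequence
println(sequence_probability(orf, tpm, inits))
println(sequence_logprobability(orf, tpm, inits))
```

Because every factor is below 1, the product shrinks quickly with sequence length, which is why the raw value says little by itself and comparisons between models are usually more useful than a single probability.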

License

MIT License