AncestralSequenceReconstruction

This package implements the algorithm described in the following article:

Reconstruction of ancestral protein sequences using autoregressive generative models
Matteo De Leonardis, Andrea Pagnani, Pierre Barrat-Charlaix
bioRxiv 2024

It relies heavily on the ArDCA method for generative models, which is described in

Efficient generative modeling of protein sequences using simple autoregressive models
Jeanne Trinquier, Guido Uguzzoni, Andrea Pagnani, Francesco Zamponi, Martin Weigt
Nature Communications 2021

and whose code can be found at https://github.com/pagnani/ArDCA.jl. Please cite these works if you use this code.

Installation

For now, there is no command-line version of this software. The only way to use it is through a Julia script, notebook or REPL session. I recommend having a look at the example notebook to learn how to use the software. The installation steps are as follows.

1. Install Julia: https://julialang.org/. If you have never used the language, it can be useful to have a look at https://docs.julialang.org/en/v1/manual/getting-started/
2. Open a REPL session, and install the package by running

using Pkg
Pkg.add(url="https://github.com/PierreBarrat/AncestralSequenceReconstruction.jl")

You can now use it from inside the Julia session: using AncestralSequenceReconstruction
3. To see the example notebook, you need to install Pluto: Pkg.add("Pluto"). Launch it by running using Pluto; Pluto.run(). Then, open the notebook example/PF00014/reconstruction/reconstruction_tutorial.jl from inside Pluto.
4. To use the package in practice, you might need to install other dependencies, including

ArDCA to infer and manipulate autoregressive protein models
JLD2 to save (or load) ArDCA models to (from) files
Other useful packages are loaded in the example notebook.
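
The two dependencies named in step 4 can be installed from the same REPL session. This is a minimal sketch, assuming that JLD2 is installed by name from the General registry and that ArDCA is installed from the repository URL given at the top of this README; adapt it if you manage your packages differently.

```julia
using Pkg

# ArDCA: installed here from the repository URL cited in this README.
Pkg.add(url="https://github.com/pagnani/ArDCA.jl")

# JLD2: installed by name from the General registry.
Pkg.add("JLD2")
```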

The scripts used to generate the results in the article can be found at https://github.com/PierreBarrat/AutoRegressiveASR. They provide various examples of how to use the package.

Current issues/limitations

Genetic code

It is possible to incorporate the effects of the genetic code into the dynamics of the autoregressive model. This is done by passing with_code=true when constructing the model: AutoRegressiveModel(arnet; with_code=true) (see example notebook).
An important limitation of this option is that gaps are not handled correctly. In ArDCA, as in most energy-based protein generative models, gaps are treated as an extra amino acid. However, it is not obvious how to treat them in a way that is consistent with the genetic code. This is not a fundamental limitation and will be fixed at some point, but for now you should only use this option if your input sequences have no gaps.
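
Before enabling with_code=true, it can help to check the input alignment for gaps. The following is a minimal sketch in plain Julia; the function name sequences_with_gaps and the toy alignment are hypothetical and not part of this package, and the gap character is assumed to be '-'.

```julia
# Hypothetical helper (not part of the package): return the names of the
# sequences that contain at least one gap character, in sorted order.
function sequences_with_gaps(seqs::AbstractDict; gap::Char = '-')
    return sort([name for (name, seq) in seqs if gap in seq])
end

# Toy alignment used only for illustration.
alignment = Dict(
    "seq1" => "ACDEF",
    "seq2" => "AC-EF",
    "seq3" => "GHIKL",
)

gapped = sequences_with_gaps(alignment)
if isempty(gapped)
    println("No gaps found: with_code=true can be used.")
else
    println("Gaps found in: ", join(gapped, ", "), ". Avoid with_code=true.")
end
```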

Mapping from amino-acids to integers

Both in this package and in ArDCA, amino-acids are mapped to integers for easier use. By default, this package uses the mapping implied by the string "-ACDEFGHIKLMNPQRSTVWY", i.e. '-' => 1, 'A' => 2, etc. However, the ArDCA package uses a different one: ArNet models will by default use "ACDEFGHIKLMNPQRSTVWY-". This leads to the following issue: if (i) an ArNet model is inferred using the default settings of ArDCA and (ii) it is then used for ancestral reconstruction in this package, also with default settings, the two mappings will be inconsistent and the results of the reconstruction will likely be nonsense.
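
The effect of the mismatch can be illustrated without either package. This sketch uses only the two alphabet strings quoted above; the helper names mapping, encode and decode are hypothetical and chosen for illustration.

```julia
# Build a Char => Int lookup from an alphabet string (1-based indices).
mapping(alphabet::AbstractString) = Dict(c => i for (i, c) in enumerate(alphabet))

asr_alphabet   = "-ACDEFGHIKLMNPQRSTVWY"  # default of this package (see text)
ardca_alphabet = "ACDEFGHIKLMNPQRSTVWY-"  # default of ArDCA (see text)

# Hypothetical helpers: sequence -> integers and integers -> sequence.
encode(seq, alphabet) = [mapping(alphabet)[c] for c in seq]
decode(ints, alphabet) = join(collect(alphabet)[i] for i in ints)

seq = "AC-W"
ints = encode(seq, ardca_alphabet)    # integers under the ArDCA convention
wrong = decode(ints, asr_alphabet)    # read back with the other convention
right = decode(ints, ardca_alphabet)  # read back with the matching convention

println(seq, " -> ", ints, " -> ", wrong, " (mismatched) / ", right, " (matched)")
```

Every symbol is shifted by one position and the gap is read as an amino acid, which is why a reconstruction run with mismatched mappings cannot be trusted.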

There are two ways to work around this.

1. Infer the autoregressive model using the default mapping of AncestralSequenceReconstruction.jl. This is done by converting the training alignment used in ArDCA to an integer matrix within Julia, using the mapping "-ACDEFGHIKLMNPQRSTVWY", and then using this matrix to infer the model (see the docstring of the ArDCA.ardca function). This is the method I currently use, and I have some tools that facilitate it, although they are not well documented yet.
2. Do not use the default mapping when building the evolutionary model from the ArNet. In the example notebook, the line ar_model = AutoRegressiveModel(arnet) should then be modified to ar_model = AutoRegressiveModel(arnet; alphabet=:ardca_aa).
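
The conversion step of the first workaround could look as follows. This is a sketch in plain Julia, not the tooling mentioned above: the function name alignment_to_matrix is hypothetical, and the matrix orientation (positions along rows, sequences along columns) is an assumption that should be checked against the docstring of ArDCA.ardca before the matrix is used for inference.

```julia
# Hypothetical helper: convert aligned sequences to an integer matrix using
# the default mapping of this package ('-' => 1, 'A' => 2, ...).
# Assumed layout: one column per sequence, one row per alignment position.
function alignment_to_matrix(seqs::AbstractVector{<:AbstractString};
                             alphabet::AbstractString = "-ACDEFGHIKLMNPQRSTVWY")
    lookup = Dict(c => i for (i, c) in enumerate(alphabet))
    L = length(first(seqs))
    all(s -> length(s) == L, seqs) || error("sequences must all have the same length")
    Z = Matrix{Int}(undef, L, length(seqs))
    for (m, s) in enumerate(seqs), (i, c) in enumerate(s)
        Z[i, m] = lookup[c]  # throws a KeyError for symbols outside the alphabet
    end
    return Z
end

# Toy alignment of two sequences of length 3, for illustration only.
Z = alignment_to_matrix(["AC-", "-YW"])
println(Z)
```

Symbols that are not in the alphabet (for example 'X' or lowercase letters) raise an error here rather than being mapped silently, so they have to be handled before the conversion.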

In any case, it is important to keep track of the mapping with which the autoregressive model was initially inferred. Currently, there is no "natural" way to do this, apart from conventions such as file naming. This is obviously not ideal, and I want to make it easier in the future.
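
Until then, one possible convention is to write the alphabet to a small text file next to the saved model. This is a sketch using only the Julia standard library; the helper names, the ".alphabet" suffix and the model file name are all hypothetical, and nothing here is provided by the package.

```julia
# Hypothetical convention: the alphabet lives in "<model file>.alphabet".
alphabet_file(model_path::AbstractString) = model_path * ".alphabet"

# Write and read back the alphabet string associated with a model file.
save_alphabet(model_path, alphabet) = write(alphabet_file(model_path), alphabet)
load_alphabet(model_path) = read(alphabet_file(model_path), String)

# Illustrative path in a temporary directory; the model itself is not written here.
model_path = joinpath(mktempdir(), "arnet_PF00014.jld2")
save_alphabet(model_path, "-ACDEFGHIKLMNPQRSTVWY")

recovered = load_alphabet(model_path)
println("Model was inferred with alphabet: ", recovered)
```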
