Given an HLA sequence, get the allele with highest identity and frequency in population.

annadiarov/seq2HLAallele

seq2HLAallele

This package addresses allele typing for HLA genes in crystallography, even when the available sequences are incomplete. It uses BLAST to identify the closest matching allele in the HLA database. It then calculates the mean frequency (the alleles_over_2n value) across diverse populations to determine the most probable allele. Finally, it outputs the matched allele with the highest frequency in standard HLA nomenclature.
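
The selection logic described above can be sketched roughly as follows. This is an illustrative sketch, not the package's actual implementation: the function name `pick_most_frequent_allele`, the tuple layouts, and the choice to treat alleles without frequency records as 0.0 are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def pick_most_frequent_allele(blast_hits, freq_records):
    """Hypothetical helper: choose the best-identity allele(s) by mean frequency.

    blast_hits:   list of (allele, percent_identity) tuples.
    freq_records: list of (allele, population, alleles_over_2n) tuples.
    Returns (winners, mean_freqs) for the alleles tied at the best identity.
    """
    if not blast_hits:
        return [], {}

    # Keep only the hits that tie for the highest identity.
    best_identity = max(identity for _, identity in blast_hits)
    candidates = [a for a, identity in blast_hits if identity == best_identity]

    # Collect the per-population frequencies of every allele.
    per_allele = defaultdict(list)
    for allele, _population, freq in freq_records:
        per_allele[allele].append(freq)

    # Mean frequency per candidate; assumption: no records means 0.0.
    mean_freqs = {
        allele: mean(per_allele[allele]) if allele in per_allele else 0.0
        for allele in candidates
    }

    # Several alleles may share the top mean frequency, so return a list.
    top = max(mean_freqs.values())
    winners = [a for a in candidates if mean_freqs[a] == top]
    return winners, mean_freqs
```

Returning a list of winners plus a dictionary of mean frequencies mirrors the shape of the values shown in the Usage section, but the real function may differ in its details.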

The output allele follows the HLA naming convention:

HLA<gene>*<allele_group>:<specific_HLA_protein>
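
To make the three fields of this name concrete, here is a minimal parsing sketch. The helper `parse_hla_name` and the `HLAName` type are hypothetical and not part of the package. The pattern accepts an optional hyphen after `HLA`, on the assumption that the output may be written either as shown above or in the hyphenated form (for example `HLA-A*02:01`).

```python
import re
from typing import NamedTuple, Optional


class HLAName(NamedTuple):
    gene: str
    allele_group: str
    protein: str


# "HLA", an optional hyphen, the gene, "*", the allele group, ":", the protein.
_HLA_NAME = re.compile(
    r"^HLA-?(?P<gene>[A-Z0-9]+)\*(?P<group>\d+):(?P<protein>\d+)$"
)


def parse_hla_name(name: str) -> Optional[HLAName]:
    """Split an allele name into its fields, or return None if it does not match."""
    match = _HLA_NAME.match(name.strip())
    if match is None:
        return None
    return HLAName(match["gene"], match["group"], match["protein"])
```

For instance, `parse_hla_name("HLA-DRB1*15:01")` yields the gene `DRB1`, allele group `15`, and protein `01`.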

Installation

Environment

Clone this repo, create a Python 3.11 environment, and install the requirements:

conda create -n seq2hla python=3.11
conda activate seq2hla
# Make sure you are in this repo folder
pip install -r requirements.txt
python setup.py install

Make sure the BLAST+ tools are installed on your computer. If they are not, you can install them on Debian/Ubuntu with:

sudo apt install ncbi-blast+

Download databases

Before using the package, you have to download:

  • HLA sequence database
  • HLA frequencies database

The database files should be under the databases/ folder with the following structure:

- databases/
   |-> hla_freq/
   |     - afnd.tsv
   |-> hla_seqs/
        - A_prot.fasta
        - all_hla_seq.fasta
        - B_prot.fasta
        - C_prot.fasta
        - DPA1_prot.fasta
        - DPB1_prot.fasta
        - DQA1_prot.fasta
        - DQB1_prot.fasta
        - DRB1_prot.fasta
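
As a quick sanity check of the layout above, a short script along these lines could report which files are still missing. It is only a sketch: `missing_database_files` is a hypothetical helper and not part of the package, and the file list simply mirrors the tree shown above.

```python
from pathlib import Path

# Paths relative to the databases/ folder, taken from the tree above.
GENES = ("A", "B", "C", "DPA1", "DPB1", "DQA1", "DQB1", "DRB1")
EXPECTED_FILES = [
    "hla_freq/afnd.tsv",
    "hla_seqs/all_hla_seq.fasta",
    *(f"hla_seqs/{gene}_prot.fasta" for gene in GENES),
]


def missing_database_files(root="databases"):
    """Return the expected files that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_database_files()
    if missing:
        print("Missing database files:")
        for rel in missing:
            print(f"  databases/{rel}")
    else:
        print("All expected database files are present.")
```
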

HLA sequence database

You can download the latest version of the database and create the BLAST DB automatically with:

python seq2hla/download_imgthla_database.py

Alternatively, you can download the HLA amino acid sequences manually from IMGT/HLA. The files are found under fasta/*_prot.fasta. Merge them into a single file called all_hla_seq.fasta, place it under the databases/hla_seqs/ folder, and then run:

makeblastdb -in databases/hla_seqs/all_hla_seq.fasta -dbtype prot -parse_seqids
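
For the manual route, the merge step that precedes makeblastdb could be scripted roughly as below. This is a sketch under assumptions: `unify_fasta` is a hypothetical helper, and it assumes the downloaded `*_prot.fasta` files sit together in one directory.

```python
from pathlib import Path


def unify_fasta(src_dir, out_path):
    """Concatenate every *_prot.fasta in `src_dir` into `out_path`.

    Returns the number of source files that were merged.
    """
    src_dir, out_path = Path(src_dir), Path(out_path)
    sources = sorted(src_dir.glob("*_prot.fasta"))
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w") as out:
        for src in sources:
            text = src.read_text()
            if not text:
                continue  # nothing to copy from an empty file
            # Ensure each file ends with a newline so records do not fuse.
            out.write(text if text.endswith("\n") else text + "\n")
    return len(sources)


if __name__ == "__main__":
    merged = unify_fasta("databases/hla_seqs", "databases/hla_seqs/all_hla_seq.fasta")
    print(f"Merged {merged} FASTA files")
```

Adding the trailing newline matters because a file that ends without one would otherwise glue its last sequence line to the next file's header.
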

HLA frequencies database

You can manually download the afnd.tsv file of the HLA frequencies database from this GitHub Repo and place it under the databases/hla_freq/ folder.

You can also download the latest version of the database automatically, using the code provided by this GitHub Repo:

python seq2hla/download_allele_freq_database.py

Warning: This download might take ~1-2 hours.

As an alternative, you can directly download the file afnd.tsv from the repo using the flag --fast (or -f):

python seq2hla/download_allele_freq_database.py --fast

Usage

Command line interface:

python -m seq2hla.main sequence.fasta

Using python:

from seq2hla import get_most_freq_allele_from_seq

highest_frequency_alleles, mean_frequencies = \
    get_most_freq_allele_from_seq("sequence.fasta")

for allele in highest_frequency_alleles:
    print(f"\t{allele}\tMean Frequency: {mean_frequencies[allele]}")
