-
Notifications
You must be signed in to change notification settings - Fork 0
Abbie-Williams/Protein_Consensus_Plotter.py
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
------------------------------------------------------------------------------------------------- | PROTEIN CONSENSUS PLOTTER | | | | AUTHOR: ABBIE WILLIAMS | | | | NOTE: THIS SCRIPT WAS DEVELOPED FOR ASSESSMENT WITHIN THE DATA SCIENCE AND MACHINE | | LEARNING FOR THE BIOSCIENCES 2020 MODULE, AS PART OF THE SWBIO DTP PROGRAM. | | | ------------------------------------------------------------------------------------------------- I. PROTEIN CONSENSUS PLOTTER II. KEY CONCEPTS AND ASSUMPTIONS III. EXPECTED WORKFLOW IV. DEPENDENCIES AND RUNNING THE SCRIPT V. FILE INPUT VI. FILE OUTPUTS VII. EXAMPLE FILES VIII. ACKNOWLEDGEMENTS / REFERENCES IX. CONTACT INFORMATION I. PROTEIN CONSENSUS PLOTTER ------------------------------------------------------------------------------------------------- The protein consensus plotter is a basic tool developed to calculate and visualise the conservation of amino acid residues across a multiple protein sequence alignment. The theory behind this is that as protein function is conserved amongst a group of related proteins, specific amino acid(s) determining the protein structure will also be conserved, such as domains, motifs, or catalytic sites. The protein consensus plotter can be used to identify well conserved residue(s) for several purposes: preliminary analysis of proteins of an unknown function, analysing the conservation of residues of interest i.e. frequently mutated residues, or comprehensive analysis of the conservation of entire protein sequence i.e. amongst protein families. II. KEY CONCEPTS AND ASSUMPTIONS ------------------------------------------------------------------------------------------------ Prior to running, a basic understanding of the protein consensus plotter is required. The protein consensus plotter will analyse a protein alignment file to calculate consensus. Consensus is defined as 'agreement per residue': the level of conservation of the most represented residue for each position across the alignment sequence. Hence, consensus is effectively a measure of amino acid conservation and is converted to a percentage by: (Frequency of most represented character* / Number of sequences in alignment ) x 100 *calculated for each residue of alignment, across all sequences. In the instance of gap characters, "-', these are treated as a value of 0 and are plotted. Graph plotting uses sliding windows (n=user defined) to create rolling averages which smooth the plot line. When considering particular residues and their consensus scores, their position and score can be inferred from the graph and cross referenced with the output file 'consensus_scores.csv' for more accurate analysis. The consensus plot should be used as a visual aid and the consensus_score.csv file should always be checked for more information. Importantly, a key assumption of the protein consensus plotter is that the first sequence of the alignment file is the protein query sequence. This assumption is based on the intended workflow (see III. EXPECTED WORKFLOW for explanation). If this is not the case, the alignment file must be reformatted and/or realigned to meet this assumption. III. EXPECTED WORKFLOW ------------------------------------------------------------------------------------------------- I. Retrieve the protein sequence of interest (FASTA) II. Run BLASTP for protein query*. If filtering of hits are required, this should be done at this stage. A manual check of whether the first hit matches the protein sequence should also be carried out. If this condition is not met as the query protein is not present in the selected database, the protein sequence (FASTA) can be added as the first entry of the multifasta file downloaded from BLAST. All desired hits to be used in the consensus analysis should be downloaded as the complete sequences in a multifasta file. III. Align the protein sequences downloaded from BLAST using a multiple sequence aligner (MSA). The MSA can be selected based on user preference, and ran locally or online**. Importantly, the alignment file must be CLUSTALW format (.clw) for the script to run. IV. Run the Protein Consensus Plotter script, using the alignment file as an input when prompted. V. Check consensus plot for runs of conserved residues (with a high consensus score). Refer back to output files listing consensus scores to identify the exact consensus score, residue type, and residue position in the sequence and in the alignment. *available here: https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins **available here: https://www.ebi.ac.uk/Tools/msa/ IV. DEPENDENCIES AND RUNNING THE SCRIPT ------------------------------------------------------------------------------------------------- The Protein Consensus Plotter was developed and tested with Python ver.3.7 using Pycharm ver.2020.2.3 and requires the following modules to run: os math numpy pandas Biopython Matlotib The Protein Consensus Plotter is largely guided by user input. Simply, the script can be run by entering the following in a terminal: "python path_to_script_location/Protein_Consensus_Plotter.py" Input files are not specified in the command, but are then prompted by the user. There are three user input prompts: the absolute file path of the input file, the size of the sliding window to be used, and the threshold value for filtering residues based on consensus score. All user prompts are signaled with obvious, clear messages. V. FILE INPUT ------------------------------------------------------------------------------------------------- A single MSA alignment file in CLUSTALW format (.clw) only. See folder 'example files'. VI. FILE OUTPUTS ------------------------------------------------------------------------------------------------- 'Consensus_sequence.txt': The overall consensus sequence generated from the alignment (contains the most common residues across entire alignment). 'Protein_consensus_plot.png': Graph of sequence consensus (%) across the alignment length. 'Residues_and_scores.csv': Dataframe containing the residue number (within query sequence), Residue (amino acid), Consensus score (%) and Position in alignment (residue number in alignment sequence). 'Filtered_residues.csv': Dataframe containing the residue number (within query sequence), Residue (amino acid), Consensus score (%) and Position in alignment (residue number in alignment sequence) for all residues exceeding the specified threshold. VII. EXAMPLE FILES ------------------------------------------------------------------------------------------------- "alignment_input_file.clw": MSA file used for input. This is based on a cathepsin (protease) protein in Fasciola hepatica* which was processed in the expected workflow (see III. EXPECTED WORKFLOW).Cathespin protein sequence was used in a BLASTP search with default settings. 100 sequences were then aligned with using MUSCLE with default parameters, generating the example input files. The following are all examples of the output file generated from using the input file: "Protein_Consensus_Plot.png","Consensus_sequence.txt","filtered_residues.csv", and "residues_and_scores.csv". *available here: https://www.ncbi.nlm.nih.gov/protein/AAA29137.1?from=1&to=326&report=fasta VIII. ACKNOWLEDGEMENTS / REFERENCES ------------------------------------------------------------------------------------------------- Pycharm: JetBrains, 2017. Pycharm. JetBrains. Available at: https://www.jetbrains.com/pycharm/ Python:Python Software Foundation. Python Language Reference, version 3.7. Available at http://www.python.org os: os - miscellaneous operating system interfaces. Manual available here: https://docs.python.org/3/library/os.html Math: math - Mathematical functions. Manual available here: https://docs.python.org/3/library/math.html numpy: Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke & Travis E. Oliphant. Array programming with NumPy, Nature, 585, 357–362 (2020) pandas: Wes McKinney. Data Structures for Statistical Computing in Python, Proceedings of the 9th Python in Science Conference, 51-56 (2010) Biopython: Cock PA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B and de Hoon MJL (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25, 1422-1423 Matplotlib: J. D. Hunter, "Matplotlib: A 2D Graphics Environment", Computing in Science & Engineering, vol. 9, no. 3, pp. 90-95, 2007. IX. CONTACT INFORMATION ------------------------------------------------------------------------------------------------- For queries or problems, please contact: [email protected]
About
Calculates the consensus (conservation per residue) of a protein sequence alignment.
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published