GitHub - RyanCairepo/strain_identify: repo for strain identification project

This program contains code for identifying within-host diversity.

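The required Python 3 packages are collections, copy, pandas, numpy, scipy.sparse, typing, argparse, and statistics. Of these, only pandas, numpy, and scipy need to be installed; the others are part of the Python standard library.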
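As a quick pre-flight check, the sketch below reports which of the third-party packages are not importable in the current environment. The helper name `missing_packages` is hypothetical and not part of this repository; it uses only the standard `importlib` module.

```python
import importlib.util

# Third-party packages from the list above. The remaining entries
# (collections, copy, typing, argparse, statistics) ship with Python 3.
THIRD_PARTY = ["pandas", "numpy", "scipy"]


def missing_packages(names):
    """Return the top-level package names that cannot be found (hypothetical helper)."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(THIRD_PARTY)
    if missing:
        print("missing packages:", ", ".join(missing))
    else:
        print("all third-party packages found")
```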

Steps to run the program:

1. Obtain the read files, preferably in FASTQ format.
2. Obtain the reference sequence in FASTA format.

2.1. (Optional) Correct the FASTQ reads; see "readme.md" inside the ErrorCorrection folder.

3. Run the strain identification process (run find_sub.sh -h for more information):

/path/to/find_sub.sh -r reference.fa -1 read_1.fastq -2 read_2.fastq -m tog -p protein_pos.txt

If the read files are in FASTA format:

/path/to/find_sub.sh -r reference.fa -1 read_1.fasta -2 read_2.fasta -m tog -f

If the reads are single-ended:

find_sub.sh -r reference.fa -0 read_file.fastq
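When the pipeline is driven from Python, the invocations above can be assembled programmatically. This is a minimal sketch: `build_find_sub_command` is a hypothetical helper, and it uses only the flags shown above (-r, -1, -2, -0, -m, -p, -f). Whether combinations beyond those shown are accepted is an assumption to confirm with find_sub.sh -h.

```python
def build_find_sub_command(script, reference, reads, mode=None,
                           protein_pos=None, fasta=False):
    """Assemble a find_sub.sh argument list (hypothetical helper).

    reads: a list with one path (single-end) or two paths (paired-end).
    """
    cmd = [script, "-r", reference]
    if len(reads) == 1:
        cmd += ["-0", reads[0]]                    # single-end reads
    elif len(reads) == 2:
        cmd += ["-1", reads[0], "-2", reads[1]]    # paired-end reads
    else:
        raise ValueError("expected one (single-end) or two (paired-end) read files")
    if mode is not None:
        cmd += ["-m", mode]
    if protein_pos is not None:
        cmd += ["-p", protein_pos]
    if fasta:
        cmd.append("-f")                           # reads are in FASTA format
    return cmd


# Example: the paired-end FASTQ invocation shown above.
cmd = build_find_sub_command("/path/to/find_sub.sh", "reference.fa",
                             ["read_1.fastq", "read_2.fastq"],
                             mode="tog", protein_pos="protein_pos.txt")
# To execute it: subprocess.run(cmd, check=True)
```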

The output consists of two kinds of files: "final_strain_x_reference.fa", the nucleotide sequence of a detected strain, where x is the strain's numerical label, and "subbed_read_x.fa", the set of reads that belong to that strain and differ from the reference sequence.
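To gather these outputs for downstream steps, the sketch below pairs each strain sequence with its read file by label. It assumes that x is a plain integer and that all outputs sit in one directory; neither is stated above. The function name `collect_outputs` is hypothetical.

```python
import re
from pathlib import Path

# File-name patterns taken from the naming scheme described above.
# Assumption: the label x is a sequence of digits.
STRAIN_RE = re.compile(r"final_strain_(\d+)_reference\.fa")
READS_RE = re.compile(r"subbed_read_(\d+)\.fa")


def collect_outputs(directory):
    """Map each strain label to its sequence file and its read file."""
    outputs = {}
    for path in sorted(Path(directory).iterdir()):
        for key, pattern in (("strain", STRAIN_RE), ("reads", READS_RE)):
            match = pattern.fullmatch(path.name)
            if match:
                label = int(match.group(1))
                outputs.setdefault(label, {})[key] = path
    return outputs
```

A label that appears with only one of the two keys indicates that the matching file was not found in the directory.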

3.1. (Optional) Verification. First obtain relevant samples, then run:

verify.py N original_reference.fa -p sample1_r1.fastq sample1_r2.fastq sample2_r1.fastq sample2_r2.fastq

Here N is the numerical label of a strain from step 3. Use -p for paired-end reads and -s for single-end reads.
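For scripted verification, the argument list can be built and sanity-checked before launching. In this sketch, `build_verify_command` is a hypothetical helper. It follows the argument order shown above and assumes verify.py is run through the current Python interpreter and that paired-end samples are always listed as r1/r2 pairs.

```python
import sys


def build_verify_command(strain_label, reference, sample_files,
                         paired=True, script="verify.py"):
    """Assemble a verify.py argument list (hypothetical helper)."""
    if not sample_files:
        raise ValueError("at least one sample file is required")
    if paired and len(sample_files) % 2 != 0:
        # Assumption: paired-end samples are given as r1/r2 pairs.
        raise ValueError("paired-end mode expects an even number of files")
    flag = "-p" if paired else "-s"
    # Assumption: the script is launched with the current interpreter.
    return [sys.executable, script, str(strain_label), reference, flag,
            *sample_files]
```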

4. (Optional) Inferring synonymous status. This step determines whether the changes in the nucleotide sequences are synonymous or non-synonymous. The command is as follows:

python synonymous_stat.py original_reference_sequence.fa subbed_read_N.sam translation_code.txt protein_pos.txt

translation_code.txt is the translation table from nucleotide codons to amino acids. protein_pos.txt gives the position of each protein in the form protein_name:start..end.
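To illustrate the two inputs, the sketch below parses one protein_name:start..end entry and classifies a single codon change. It is not the logic of synonymous_stat.py. The function names are hypothetical, the small `CODON_TABLE` excerpt of the standard genetic code stands in for the translation table (whose on-disk layout is not described here), and the entry "geneA:10..99" is made up for the example.

```python
# Excerpt of the standard genetic code, standing in for the translation
# table (assumption: the real file's layout is not described above).
CODON_TABLE = {
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",  # alanine
    "GAT": "D", "GAC": "D",                          # aspartate
    "GAA": "E", "GAG": "E",                          # glutamate
}


def parse_protein_pos(line):
    """Parse 'protein_name:start..end' into (name, start, end)."""
    name, span = line.strip().rsplit(":", 1)
    start, end = span.split("..")
    return name, int(start), int(end)


def classify_codon_change(ref_codon, alt_codon, table=CODON_TABLE):
    """Label a codon change by whether the encoded amino acid changes."""
    if table[ref_codon] == table[alt_codon]:
        return "synonymous"
    return "non-synonymous"


entry = parse_protein_pos("geneA:10..99")       # made-up entry
same = classify_codon_change("GCT", "GCC")      # alanine -> alanine
changed = classify_codon_change("GAT", "GAA")   # aspartate -> glutamate
```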

Folders and files:

ErrorCorrection
bowtie2-2.4.4-linux-x86_64
README.md
build_matrix.py
combine_align.sh
find_sub.sh
get_ori_half.py
identify_strain.py
protein_pos.txt
strain_finder.py
synonymous_stat.py
translation_table.txt
verify.py
