Skip to content

Commit

Permalink
merge changes for v1.19.0 from dev
Browse files Browse the repository at this point in the history
  • Loading branch information
Alan-Collins committed Aug 16, 2024
2 parents e83965e + ec94d5e commit e3cfc5c
Show file tree
Hide file tree
Showing 6 changed files with 130 additions and 38 deletions.
22 changes: 12 additions & 10 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -59,7 +59,7 @@ python3 -m pip install .
# Usage

## Quickstart Guide
Here is an example of a basic run using paired-end reads or assemblies as input.
Here is an example of a basic run using paired-end reads or assemblies as input. We recommend using reads whenever available as read-based sequence typing is more reliable.
```
# Paired-end:
el_gato.py --read1 read1.fastq.gz --read2 read2.fastq.gz --out output_folder/
Expand Down Expand Up @@ -136,6 +136,8 @@ Optional arguments:

## Input files

If available, we recommend using raw or trimmed reads instead of assemblies as the extra data contained in reads is valuable for the process used by el_gato to identify sample ST. When run with reads, el_gato is able to use read quality and coverage information to apply quality control rules. When run using assemblies, el_gato is unable to identify errors that have been incorporated into the assembly and may therefore report incorrect results. For example, while many isolates encode two copies of *mompS*, in some cases only one copy of the locus is assembled. If only the secondary *mompS* locus is included in the assembly then el_gato will report that allele.

#### Pair-end reads
When running on a directory of reads, files are associated as pairs using the pattern `R{1,2}.fastq`. i.e., filenames should be identical except for containing either "R1" or "R2" and can be .fastq or .fastq.gz format. Any files for which a pair can not be identified using this pattern will not be processed.

Expand Down Expand Up @@ -238,19 +240,11 @@ Assembly: BLAST hit length and sequence identity thresholds and locus location i

# Approach

At its core, el_gato uses BLAST to identify the closest match to each allele in your input data. For the loci *flaA*, *pilE*, *asd*, *mip*, and *proA*, this process is straight forward. Whereas loci *mompS* and *neuA/neuAh* require more involved processing, with neuA/neuAh being an issue only when processing reads—the specifics of these loci are discussed in the corresponding sections below.
At its core, el_gato uses BLAST to identify the closest match to each allele in your input data. For the loci *flaA*, *pilE*, *asd*, *mip*, and *proA*, this process is straight forward. Whereas loci *mompS* and *neuA/neuAh* require more involved processing, with neuA/neuAh being an issue only when processing reads—the specifics of these loci are discussed in the corresponding sections below.


First for the simple loci (*flaA*, *pilE*, *asd*, *mip*, and *proA*), the following processes are used:

## Assembly

Six of the seven loci (*flaA*, *pilE*, *asd*, *mip*,*proA*, and *neuA/neuAh*) are identified using BLAST. For each, the best BLAST result is returned as the allele. The closest match is returned with an \* if loci have no exact match. [add in updated code changes if necessary] When processing an assembly, only *mompS* requires extra processing.

### *mompS*

[*mompS* is sometimes present in multiple copies in *Legionella pneumophila*, though typically two copies.](https://doi.org/10.1016/j.cmi.2017.01.002) When typing *L. pneumophila* using Sanger sequencing, primers amplify only the correct *mompS* locus. We, therefore, use *in silico* PCR to extract the correct *mompS* locus sequence from the assembly. The primers used for *in silico* PCR are *mompS*-450F (TTGACCATGAGTGGGATTGG) and *mompS*-1116R (TGGATAAATTATCCAGCCGGACTTC) [as described in this protocol](https://doi.org/10.1007/978-1-62703-161-5_6). The *mompS* allele is then identified using BLAST.

## Reads

When processing reads, identification of both *mompS* and *neuA*/*neuAh* requires additional analyses (described below). The other five loci are processed by mapping the provided reads to reference loci from *L. pneumophila* strain Paris and identifying the consensus sequence. Then, all alleles are determined using BLAST against the SBT allele database.
Expand Down Expand Up @@ -297,6 +291,14 @@ If the above process cannot identify the correct sequence, a ? will be returned
## *momps* Read Mapping Schematic
![mompS read mapping schematic](https://github.com/appliedbinf/el_gato/blob/images/images/mompS_allele_assignment.png)

## Assembly

Six of the seven loci (*flaA*, *pilE*, *asd*, *mip*,*proA*, and *neuA/neuAh*) are identified using BLAST. For each, the best BLAST result is returned as the allele. The closest match is returned with an \* if loci have no exact match. [add in updated code changes if necessary] When processing an assembly, only *mompS* requires extra processing.

### *mompS*

[*mompS* is sometimes present in multiple copies in *Legionella pneumophila*, though typically two copies.](https://doi.org/10.1016/j.cmi.2017.01.002) When typing *L. pneumophila* using Sanger sequencing, primers amplify only the correct *mompS* locus. We, therefore, use *in silico* PCR to extract the correct *mompS* locus sequence from the assembly. The primers used for *in silico* PCR are *mompS*-450F (TTGACCATGAGTGGGATTGG) and *mompS*-1116R (TGGATAAATTATCCAGCCGGACTTC) [as described in this protocol](https://doi.org/10.1007/978-1-62703-161-5_6). The *mompS* allele is then identified using BLAST.

# Using NextFlow
## Using NextFlow with Singularity Container

Expand Down
2 changes: 1 addition & 1 deletion conda-recipe/meta.yaml
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
{% set name = "elgato" %}
{% set version = "1.18.2" %}
{% set version = "1.19.0" %}

package:
name: {{ name|lower }}
Expand Down
2 changes: 2 additions & 0 deletions el_gato/db/ref_gene_regions.fna
Original file line number Diff line number Diff line change
Expand Up @@ -20,3 +20,5 @@ CAGGCGATCAGCGCTGTGATAAAGAATGAAAATGTTTACGGTTGCCAGTTCCACCCTGAAAAAAGCGGGGAAGTAGGGTT
GGTCAGGCTATTAGCGCAGTAATAAAAGATGAAAATATGTACGGATGCCAATTTCATCCTGAAAAAAGCGGGGAAGTAGGGTTGGAAATTATCAAGCAGTTTTTACAAATTTAGGATGCAGGAATTGAGAGTTTTAGCAACAATACCGGCAAGAGCAGGTTCAAAACGACTCCCGGGAAAGAATATAAAATTACTTGCTAGAAAACCTCTGATTGCCCATGCTATTAATGCTGTTATTCAATCCAATTGCTATAATCTTGTAGTTTCTACTGATAGTCAGGAAAGAGCAGATATTGTAGTTCAACACGGGTGCTCTGTTCCTTGGTTAAAACCTGCAATCTTGTCAGTCGATTCGACCATTGTAGAAGATACTTTGAATAGCTTTTTAAATAATTATAAAAAAATTAATGTGTTTTTTTGCTGTATCTTATTGATACAGAATACTTCCCACTTTAGAAAGCCGGAAATTATAAGAAAGGTCGCTTTAATGTATAAAGAGGTAGGTGAAGAAGTTGTATCTATCAATAAAGTAAGCTTTAAGCCTTCCCGGTGCAGAACTATAGATAATCAAGGCAATTTGTATTACCCCAATGTTTTTAAGATAGCTGATGTTTCTGAGGAAAGTGAATCAATTTATAAATTGAAAGGTGCGATTTATATAGTGGCAACTGAGCAGATAATTTCTAAAAAATCATTTTATAGTGATCTAACAAAAGCCTTCCTTATTGATAGCCCGATTGAATCGATCGACATTGATACGCCAATCGATAGGGCTTTAACAGAAAAATTAATGGAATTAAACCAAGAGGCTCTAGTATGACTTGTTTTATTATTGCTGAAGCAGGAGTAAACCATAATGGCAATCTTCAACTGGCTAAAGAATTAGTTTATGCAGCCAAAGAGTCAGGAGCTGATGCAGTAAAGTTTCAGACTTTCAAAGCGGACACCTTGGTTAATAAAACAGTAGAAAAAGCGGAATACCAAAAAAATAATGCCCCAGAATCCGCTACCCAGTATGAGATGCTCAAGGCACTGGAGCTTTCAGAAGATGAC
>neuA_212
CAGGCGATCAGCGCTGTGATAAAGAATGAAAATGTTTACGGTTGCCAGTTCCACCCTGAAAAAAGCGGGGAAGTAGGGTTGGAAATTATCAAGCAGTTTTTACAAATTTAGGATGCAAGAATTGAGAATTTTAGCAATAATACCGGCAAGAGCAGGTTCAAAACGACTCCCGGGAAAGAATATTAAATTGCTTGCAGGAAAGCCGCTGATTGCTCACACCATCGATGCTGCTCTACAATCCAATTGTTGTGAGCAAGTTGTTGTTTCTACGGATAGTCAGGAAATAGCAGATATCGCAATTCAGCATGGTGCTTCAGTTCCTTGGTTAAGACCAGCAAGTCTGTCAGATGATTCCTCTAATGTAGAGGATGCAGTGATTGATTTGTTAAATAATTATAAAAAGATTAATGTATATTTTGACAGTGTTTTATTATTACAGCCAACCTCCCCTTTTAGAAAACCTGAAACTATCAGAAAGGCTGTTTTAATGCATAAAGATATTGGTTATAGCGTCGTATCTATCAATAAAGTATATTTTAAACCTTCTTGGTACAGAACAGTTGATGATCAAGGTAACTTGCGTTCTCCCAGTATTTTTAAAACGACTGATATCTCTGAAAGCGAACCAATCTATAAATTGAATGGTGCGATTTATATAGCTACTACCAAACAAATAATGTTAAACAAATCATTTTATAGTGATCCAACAAAAGCTCTGCTTATGGATAGTCCAATTGAATCCATCGACATCGATACGCCAATCGATTGGGCTCTAACAGAAAAATTAATGGAATTAAACCAAGAGGCTCTAGTATGACTTGTTTTATTATTGCTGAAGCAGGAGTAAACCATAATGGTGATCTTCAACTGGCTAAAGAATTAGTTTATGCAGCCAAAGAGTCAGGAGCTGATGCAGTAAAGTTTCAGACTTTCAAAGCGGACACCTTGGTTAATAAAACAGTAGAAAAAGCGGAATACCAAAAAAATAATGCCCCGGAATCCGCTACCCAGTATGAAATGCTCAAGGCACTGGAGCTTTCAGAAGATGAC
>neuA_215 DAOWPC010000010.1:60344-61396 GCA_030724125.1
CAGGCTATTAGTGCCGTTATAAAGGATGGCAATATTTATGGGTGCCAGTTCCATCCAGAAAAAAGTGGAGAAGTAGGATTAAAAATTATCCAGCGGTTTTTACAAATTTAGGGCGCAAGAAGTGAGAGTTTTAGCGATAATACCAGCGAGAGCTGGTTCCAAAAGGCTACCAGGAAAAAATATAAGATTGCTTGCTGGCAAACCACTTATCGCTCATAGTATTAATGTGGCTCTACAATCGAATTGCTGCGAACAAATAGTTGTTTCAACTGATAGCCGAGAAATAGCCGATATTGCAATTCAGTATGGAGCTTCTGTACCATGGCTAAGGCCCCAATCCTTAGCTGATGATTCGTCGGATGTCGTATACGCCGTCATTGATTTGTTGAATAATTATAAGAAAATTAACGTGTTTTTTGATAGTGTTCTATTACTACAACCTACTTCGCCTTTTAGAAAACCTGAAACAATTAGAAAGGCTGTTTTAATGCACCGAGATGTTGGTAATAGCGTTGTATCTATTAATAAAGTGAGCTTTAAACCTTCTTGGTACAGGACAGTAGATAATCAAGGTAACTTATGTTCCCCTAATATTTTTAGAAATTCTGATGTTTCTGAAGAAGGTGAACCAATTTATAAACTAAATGGTGCGATTTATATAGCTACTACTGAACAGCTAATGTCTAATAAATCATTTTATAGCAATCCAACAAGGGCTCTGCTTATGGATAATCCAAATGAATCTATAGATATAGATACTCCAATGGATTGGGCTTTAGTAGAAAAATTAATTGAACTGAAAGAAGAGATACTATTATGAATTGTTTTATAATTGCTGAAGCAGGAGTAAACCATAATGGTAATGTCCAACTGGCTAAAGAATTAGTTTACGCAGCCAAAGAGTCTGGCGCTGATGCAGTAAAATTTCAGACTTTCAAAGCCGACACCTTGGTCAATAAAACAGTAGAAAAAGCGGAATACCAAAAAAATAATACCCCGGAATCCTCTACCCAGTATGAGATGCTCAAGGCACTGGAACTTTCAGAAGAGG
32 changes: 31 additions & 1 deletion el_gato/el_gato.py
Original file line number Diff line number Diff line change
Expand Up @@ -82,6 +82,10 @@ class Ref:
'start_pos' : 350,
'end_pos' : 700,
},
"neuA_215": {
'start_pos' : 350,
'end_pos' : 703,
},
}

class SAM_data(object):
Expand Down Expand Up @@ -385,6 +389,7 @@ def set_inputs(
inputs["length"] = args.length
inputs["sequence"] = args.sequence
inputs["json_out"]['operation_mode'] = "Assembly"
inputs["json_out"]["version"] = version
inputs["threads"] = args.threads
inputs["out_prefix"] = args.out
inputs["log"] = os.path.join(args.out, "run.log")
Expand Down Expand Up @@ -1180,7 +1185,7 @@ def process_reads(contig_dict: dict, read_info_dict: dict, ref: Ref, outdir: str
run_command(process_sam_command, tool='samtools', shell=True)

logging.info("Checking coverage of reference loci by mapped reads")
coverage_command = f"samtools coverage -r flaA:351-531 {outdir}/reads_vs_all_ref_filt_sorted.bam | cut -f 1,4,5,6,7,8,9; samtools coverage -r pilE:351-682 {outdir}/reads_vs_all_ref_filt_sorted.bam | tail -1 | cut -f 1,4,5,6,7,8,9; samtools coverage -r asd:351-822 {outdir}/reads_vs_all_ref_filt_sorted.bam | tail -1 | cut -f 1,4,5,6,7,8,9; samtools coverage -r mip:350-750 {outdir}/reads_vs_all_ref_filt_sorted.bam | tail -1 | cut -f 1,4,5,6,7,8,9; samtools coverage -r mompS:367-717 {outdir}/reads_vs_all_ref_filt_sorted.bam | tail -1 | cut -f 1,4,5,6,7,8,9; samtools coverage -r proA:350-754 {outdir}/reads_vs_all_ref_filt_sorted.bam | tail -1 | cut -f 1,4,5,6,7,8,9; samtools coverage -r neuA:350-702 {outdir}/reads_vs_all_ref_filt_sorted.bam | tail -1 | cut -f 1,4,5,6,7,8,9; samtools coverage -r neuAh:350-702 {outdir}/reads_vs_all_ref_filt_sorted.bam | tail -1 | cut -f 1,4,5,6,7,8,9; samtools coverage -r neuA_207:350-702 {outdir}/reads_vs_all_ref_filt_sorted.bam | tail -1 | cut -f 1,4,5,6,7,8,9; samtools coverage -r neuA_211:350-702 {outdir}/reads_vs_all_ref_filt_sorted.bam | tail -1 | cut -f 1,4,5,6,7,8,9; samtools coverage -r neuA_212:350-702 {outdir}/reads_vs_all_ref_filt_sorted.bam | tail -1 | cut -f 1,4,5,6,7,8,9"
coverage_command = f"samtools coverage -r flaA:351-531 {outdir}/reads_vs_all_ref_filt_sorted.bam | cut -f 1,4,5,6,7,8,9; samtools coverage -r pilE:351-682 {outdir}/reads_vs_all_ref_filt_sorted.bam | tail -1 | cut -f 1,4,5,6,7,8,9; samtools coverage -r asd:351-822 {outdir}/reads_vs_all_ref_filt_sorted.bam | tail -1 | cut -f 1,4,5,6,7,8,9; samtools coverage -r mip:350-750 {outdir}/reads_vs_all_ref_filt_sorted.bam | tail -1 | cut -f 1,4,5,6,7,8,9; samtools coverage -r mompS:367-717 {outdir}/reads_vs_all_ref_filt_sorted.bam | tail -1 | cut -f 1,4,5,6,7,8,9; samtools coverage -r proA:350-754 {outdir}/reads_vs_all_ref_filt_sorted.bam | tail -1 | cut -f 1,4,5,6,7,8,9; samtools coverage -r neuA:350-702 {outdir}/reads_vs_all_ref_filt_sorted.bam | tail -1 | cut -f 1,4,5,6,7,8,9; samtools coverage -r neuAh:350-702 {outdir}/reads_vs_all_ref_filt_sorted.bam | tail -1 | cut -f 1,4,5,6,7,8,9; samtools coverage -r neuA_207:350-702 {outdir}/reads_vs_all_ref_filt_sorted.bam | tail -1 | cut -f 1,4,5,6,7,8,9; samtools coverage -r neuA_211:350-702 {outdir}/reads_vs_all_ref_filt_sorted.bam | tail -1 | cut -f 1,4,5,6,7,8,9; samtools coverage -r neuA_212:350-702 {outdir}/reads_vs_all_ref_filt_sorted.bam | tail -1 | cut -f 1,4,5,6,7,8,9; samtools coverage -r neuA_215:350-702 {outdir}/reads_vs_all_ref_filt_sorted.bam | tail -1 | cut -f 1,4,5,6,7,8,9"
desc_header = "Assessing coverage of MLST loci by provided sequencing reads."

result = run_command(coverage_command, tool='samtools coverage', shell=True, desc_file=f"{outdir}/intermediate_outputs.txt", desc_header=desc_header)
Expand Down Expand Up @@ -1511,6 +1516,25 @@ def write_alleles_to_file(alleles: list, outdir: str):
fout.write(identified_allele_fasta_string)


def check_reads_are_mapped(inputs: dict, samfile: str):
with open(samfile) as f:
line_count = sum([1 for line in f if line[0] != "@"])
if line_count > 0:
return
# 0 reads mapped. Abort.
logging.error("Critical error. The analysis could not be completed since the sample contains zero reads that could align to all 7 loci and thus likely indicates this sample is not L. pneumophila.")
logging.error("Analysis Aborted")

# set alleles to missing data
outlines = []
if inputs['header']:
header = "Sample\tST\t" + "\t".join(Ref.locus_order)
outlines.append(header)
outlines.append("\t".join(["MD-"] + ["-"] * 7))
print("\n".join(outlines))
sys.exit()


def run_stats(samfile: str, outdir: str):
# Check if insert size is long enough and log read length and insert size
stats_command = f"samtools stats {samfile} | awk '$1==\"IS\" {{is+=$3*$2; is_count+=$3}} $1==\"RL\" {{rl+=$2*$3; rl_count+=$3}} END {{print \"Average insertion size:\", is/is_count, \"Average read length:\", rl/rl_count}}'"
Expand Down Expand Up @@ -1559,6 +1583,8 @@ def map_alleles(inputs: dict, ref: Ref):
mapping_command = f"minimap2 -ax sr -t {threads} {db}/ref_gene_regions.fna {r1} {r2} | samtools view -h -F 0x4 -@ {threads} -o {outdir}/reads_vs_all_ref_filt.sam"
run_command(mapping_command, tool='minimap2 -ax sr', shell=True)

# Check for issues with read mapping
check_reads_are_mapped(inputs, f"{outdir}/reads_vs_all_ref_filt.sam")
run_stats(f"{outdir}/reads_vs_all_ref_filt.sam", outdir)

contig_dict, read_info_dict = read_sam_file(f"{outdir}/reads_vs_all_ref_filt.sam")
Expand Down Expand Up @@ -1835,6 +1861,9 @@ def print_table(inputs: dict, Ref: Ref, alleles: dict) -> str:

out_string = "\n".join(outlines)
for line in out_string.splitlines():
if inputs["header"] and line == out_string.splitlines()[0]:
# skip header line if included in output
continue
line = line.split("\t")
while i < len(loci):
allele_dict[loci[i]] = line[i+1]
Expand Down Expand Up @@ -1918,6 +1947,7 @@ def main():
inputs = set_inputs(args, inputs)
make_output_directory(inputs)
configure_logger(inputs)
logging.info(f"Running el_gato version {version}")
logging.info("Starting preprocessing")
for line in inputs["logging_buffer_message"].rstrip().split("\n"):
logging.info(line)
Expand Down
Loading

0 comments on commit e3cfc5c

Please sign in to comment.