Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cgMLST database creation documentation #106

Open
lskatz opened this issue Sep 10, 2021 · 2 comments
Open

cgMLST database creation documentation #106

lskatz opened this issue Sep 10, 2021 · 2 comments

Comments

@lskatz
Copy link

lskatz commented Sep 10, 2021

I am trying to make a guide for myself to create a cgMLST or even a wgMLST scheme. I hope this helps others. I suggest adding this in some way to the documentation, although I might still go with @tseemann's suggestion and just go with Chewie. Anyway, if you're like me and just have to try it out first to convince yourself...

Step 1: download the whole scheme from https://chewbbaca.online into a folder Listeria_monocytogenes.chewbbaca

Step 2: add git to the scheme so that you can check it back out in case you make a mistake.

cd Listeria_monocytogenes.chewbbaca
git init
git add *
git add .*
git commit -m init
git tag v1

Step 2b, undocumented: make sure there are no deflines with * indicating custom alleles not in the universal set of alleles.

Step 3a, not documented here: install mlst

Step 3b: mlst db creation

# load dependencies
module load mlst ncbi-blast+ perl
cd Listeria_monocytogenes.chewbbaca
# Check if any long IDs are present.
# If so, makeblastdb in mlst-build will not run
grep ">" -h *.fasta | perl -lane 'print if(length($_) > 50);' | head
# If long IDs, checkout v1 which I already versioned
# when I downloaded from chewbbaca.online
git checkout v1 -- '*'
pushd ~/bin/mlst/db/pubmlst
mkdir lmonocgmlst
cd lmonocgmlst
cp Listeria_monocytogenes.chewbbaca/*.fasta .
# rename to tfa.  Remove unnecessary prefix to filenames (although they are necessary elsewhere especially for Chewie)
for i in *.fasta; do mv $i $(basename $i .fasta).tfa; done;
for i in *.tfa; do target=$(sed 's/Pasteur_//' <<< $i); mv $i $target; done;
for i in *.tfa; do target=$(sed 's/-//' <<< $i); mv $i $target; done;
# Remove Pasteur_ prefix to avoid locus/allele confusion in mlst
perl -MFile::Copy=mv -MBio::SeqIO -e 'for $f(glob("*.tfa")){print "$f\n"; $in=Bio::SeqIO->new(-file=>$f); $out=Bio::SeqIO->new(-file=>">$f.tmp.fasta"); while($seq=$in->next_seq){$id=$seq->id; $id=~s/Pasteur_//; $seq->id($id); $out->write_seq($seq);} mv("$f.tmp.fasta", $f);}'
# Remove the dash too
perl -MFile::Copy=mv -MBio::SeqIO -e 'for $f(glob("*.tfa")){print "$f\n"; $in=Bio::SeqIO->new(-file=>$f); $out=Bio::SeqIO->new(-file=>">$f.tmp.fasta"); while($seq=$in->next_seq){$id=$seq->id; $id=~s/-//; $seq->id($id); $out->write_seq($seq);} mv("$f.tmp.fasta", $f);}'
# Make the scheme file
(echo -ne "ST\t"; \ls *.tfa | sed 's/.tfa//' | tr '\n' '\t'; echo clonal_complex
# fake genotype to avoid undefined genotype error
echo -ne 1
for i in *.tfa; do echo -ne "\t1"; done;
echo;
) > lmonocgmlst.txt
# Run automated makeblastdb to add the new scheme
mlst-make_blast_db
# ensure it worked
mlst --longlist | grep lmonocgmlst | perl -lane 'print "@F[0..7] ..."'
popd
cd my/Lmono/assemblies/
mlst --scheme lmonocgmlst *.fasta --threads 4 > lmonocgmlst.tsv
@lskatz
Copy link
Author

lskatz commented Sep 10, 2021

To cap this off, my results were indeed not promising for my test set. I would expect about 1748 locus results because these are the core loci. However, the results only approximated 1k per genome. I have indeed shown myself to go with Chewie instead of mlst for cgMLST analysis.

\ls *_spades.fasta | xargs -P 12 -n 1 bash -c 'mlst --scheme lmonocgmlst --threads 2 --novel $0.novel.fasta.tmp $0 > $0.tsv.tmp; touch $0.novel.fasta.tmp;'
for tsv in *.tsv.tmp; do name=$(basename $tsv .fasta.tsv.tmp); touch $name.fasta.novel.fasta.tmp; novel=$(( $(wc -l < $name.fasta.novel.fasta.tmp) / 2)); echo -ne "$name\t$novel\t";  head -n1 $tsv | perl -lane 'for(@F){ if(/(\(\d+\))/){ $num++} } print "\t$num";'; done

...to produce a table of novel and exact allele matches per assembly in my benchmarking dataset.

assembly                                numNovel        numExact
Listeria_shovill_SRR10323923_spades     4               987
Listeria_shovill_SRR10483479_spades     994             139
Listeria_shovill_SRR10505985_spades     1               396
Listeria_shovill_SRR10696096_spades     101             265
Listeria_shovill_SRR13296922_spades     328             75
Listeria_shovill_SRR14044344_spades     5               355
Listeria_shovill_SRR14404488_spades     4               439
Listeria_shovill_SRR14669035_spades     4               397
Listeria_shovill_SRR15356214_spades     0               271
Listeria_shovill_SRR9973979_spades      4               902

@lskatz
Copy link
Author

lskatz commented Sep 10, 2021

Not shown: Chewie results gave me on average 1740 loci per assembly

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant