ampQC or Amplicon QC is utility which checks for elevated per-amplicon coverage in a plate negative bam file. It then scans the rest of the bam files from the plate or sequencing run to see if they have enough coverage to clear contamination (at a desired level) in the corresponding sample amplicons that were high in the negative.
Threshold values are user definable, a default of mean depth of 100, and uplift multiplier of 10 are set. This means should an amplicon be high in the negative at a depth of 200 reads, all samples on the plate would have to have over 2,000 reads of mean depth in the corresponding amplicon to clear the QC check. Such a scenario could mean that potential contamination would be present at a 10% allelic frequency. The assumption here being a negative control has similar levels of contamination to other samples on the same plate of sequencing prep ("Here be dragons" this may not always be the case). The end user is encouraged to play around with and adapt these setting for their own usage case and desired levels. The aim being that some samples from a run with sufficient depth can be salvaged, rather than a whole plate of sequencing discarded, depending on levels of contamination and amplicon performance in samples. ampQC does not currently consider specific allelic frequencies and is based on per amplicon mean death alone.
This utility was designed and tested with detecting high negative control amplicons in SARS-CoV-2 sequencing with the artic protocol, it has been tested against illumina runs from the output of ncov2019-artic-nf. It should have utility to other amplicon based sequencing endeavours where a library prep negative control is used. The one proviso being ampQC assumes the 4th name
column of the amplicon bed file is unique for all amplicons.
Clone this repo:
git clone https://github.com/MattBashton/ampQC
Change to download directory:
cd ampQC
Install the supplied conda environment:
conda env create -f environment.yml
Activate the conda environment:
conda activate ampQC
Run the install script:
bash setup.sh
If you are unable to use conda then dependencies are pretty straightforward, GNU grep, awk and parallel, as well as samtools and mosdepth are required. The conda env will take care of installing all of these which is important for macOS which ships with inferior BSD versions of awk and grep which lack functionality needed for this utility.
Arguments -b -d -n and one of (-l -a -e) are required
Takes and input bed file of amplicons, a plate negative control bam file, and
checks sample bam have enough depth to clear contamination level in negative.
Usage ampqc [ options ] --bam-dir <bam-dir> --bed-file <amplicons.bed> --neg-file <neg.bam> --plate-list <file with sample names on plate>
Required options:
-b --bed-file bed file defining amplicons to use
-d --bam-dir directory of bam files
-n --neg-file file name of negative
-l --plate-list text file of samples/file names (.bam extension optional) for the plate to investigate (not needed if -a is used)
Other options:
-h --help print this message and exit
-v --version print version and exit
-p --prefix prefix to use for analysis [defaults to ampQC]
-s --bam-suffix suffix appened to sample ID [defaults to .sorted.bam]
-a --all don't use the file above and process all bam files in a given directory
-e --exclude prefixes to exclude in GNU grep Perl regex format, e.g. 'NEG|POS|BLANK' where | is or.
-t --amplicon-threshold mean depth threshold for detecting elevated amplicons in negative, [default 100]
-u --amplicon-uplift multiplier value required for letting potential contaminated amplicons pass [default 10]
-o --output-list list of sample IDs which failed amplicon QC, [defaults to prefix.failed.txt]
-c --numb-threads number of CPU threads for parallel tasks [default 4]
ampQC.sh -b ~/artic-ncov2019/primer_schemes/nCoV-2019/V4/nCoV-2019.insert.bed \
-d ncovIllumina_sequenceAnalysis_readMapping/ \
-n ncovIllumina_sequenceAnalysis_readMapping/NEG-9.sorted.bam \
-l 9_list.txt -t 100 -u 10 -c 10
The above would set off analysis using the -b
specified bed file for amplicons, with the negative control bam file specified in -n
the directory in which other sample bad can be found is specified in -d
and a text file of sample IDs (which must uniquely either in part or fully match the bam filenames) specified in -l
, (assuming you only want to check part of a sequencing run, other options are available see above). The other arguments -t 100
and -u 10
, set a high amplicon threshold of 100 and uplift multiplier of 10 [both default values if left unspecified], finally -c 10 parallelises some stages on 10 threads [default is 4, if not specified].
A list of sample file names (excluding suffix specified in -s
, [default is .sorted.bam
]) is written to ampQC.failed.txt where ampQC is the definable prefix with -p
. Other output includes a summary of runtime parameters and a list of pass and failing samples with emoji ✔ and ❌ in terminal, along with final pass and fail counts, which enables easy threshold exploration in the terminal. (Note these symbols may not display correctly if your system is not set to UTF-8.)
Implement ASCII character digital VU meter for amplicon depth.
parallel --will-cite
, will pretty up terminal output, of course please cite GNU parallel!