-
Notifications
You must be signed in to change notification settings - Fork 10
Home
fqgrep
is a grep (and agrep) like tool optimized for matching sequence patterns in FASTA and FASTQ files. With it, one could:
- perform rudimentary exploratory data analysis on the read data
- identify & collect subset of reads within a larger grouping having common sequence characteristics
- design & construct custom adaptive trimmers, and/or demultiplexers
- enable other custom pre-processing of the read data before its presentation to a more intensive analysis pipeline
The basic features of fqgrep
are:
- search for a sequence pattern using a POSIX regular expression
- allow for specified mismatches in sequence search pattern
- fine tune both the mismatch weights and the total number of allowable respective insertions, deletions, and substitutions in the sequence search pattern
- highlight/color the matching substring within the overall read
- streaming ability (can pipe the output of one
fqgrep
command into anotherfqgrep
instance, or other command), for identifying particular combinations of sequence patterns within a FASTQ/FASTA input file(s) - generation of parsable detail match statistics reports on each read for custom post-processing
When simply searching for exact sequence patterns, or in other
words, when zero mismatches on a pattern are specified (the default
case), the Boyer-Moore approach (as used similarly by grep
) is
taken.
fqgrep
does not use an alignment strategy for approximate sequence
searching. For approximate pattern matching, the TRE regular expression library is being
used.
The TRE library uses a variant of Bitap algorithm for approximate string matches. Below is an excerpt from the README from the TRE software distribution describing the notions behind the bitap algorithm:
Approximate pattern matching allows matches to be approximate, that is, allows the matches to be close to the searched pattern under some measure of closeness. TRE uses the edit-distance measure (also known as the Levenshtein distance ) where characters can be inserted, deleted, or substituted in the searched text in order to get an exact match. Each insertion, deletion, or substitution adds the distance, or cost, of the match. TRE can report the matches which have a cost lower than some given threshold value. TRE can also be used to search for matches with the lowest cost.
These above mentioned insertion, deletion, and substitution cost
parameters can be adjusted by the -m
, -s
, -i
, -d
, -S
, -I
, -D
options
to fqgrep
. The exact costs for each match can be observed via the
detailed stats report (the -r
option to fqgrep
).
The TRE regular expression library is the default regular expression
engine used in R; due to this, the results of fqgrep
maybe
similar to the results produced by software from the Bioconductor
Project.
The -r
option to fqgrep
produces a detailed match report of each record analyzed in the input FASTQ/FASTA file. Each field of the output format is delimited with tabs by default. The delimiter string can be altered by the -b
option.
Below are the columns displayed in the statistics report:
0 read name
1 read comment
2 total mismatches
3 # insertions
4 # deletions
5 # substitutions
6 start position
7 end position
8 match string
9 sequence
10 quality
By default, the output of fqgrep only displays FASTQ/FASTA records that have matched a pattern. Using the -v
option, will invert the output, in other words, show records that DO NOT match a given pattern.
The -a
option will show all the records in a FASTQ/FASTA file (both that MATCH and DO NOT MATCH a given input pattern). This is useful in conjunction with the -r
option to perform custom post-processing on read data. See the 'A simple adaptive trimmer' section below for an example.
When a sequence match is not present for a given record (via the -v
or -a
options). The "match string" column just contains the string '*'.
- Only POSIX styled regular expression can be used with fqgrep.
- When searching for an exact pattern (0 mismatches) via a regular expression, please use the '-e' option. This forces the use of the TRE regular expression engine.
If a dash ('-') is used as the input file, it will read the FASTQ/FASTA
input from standard input (stdin
). This is useful when piping the output
of one fqgrep
command (or any other program that produces FASTQ/FASTA formatted output) into another fqgrep
command. Supplying anything other than a FASTQ/FASTA formatted input in this manner may give unexpected behavior to fqgrep
. See the examples below for demonstrative usage.
Search for the sequence pattern 'GATTACA' anywhere in the input reads file, input.fq
, a FASTQ file.
Note that an input can be a FASTA file as well. GZipped input FASTA (input.fa.gz) and FASTQ (input.fq.gz) files are acceptable as well.
fqgrep -p 'GATTACA' input.fq
fqgrep -f -p 'GATTACA' input.fq
fqgrep -c -f -p 'GATTACA' input.fq
Allow up one mismatch (insertion, deletion, or substitutions) in the pattern.
fqgrep -c -f -p 'GATTACA' -m 1 input.fq
At the beginning of each read, allow for a variable 0 to 3 bases (A, C, G, T, or N) followed by the pattern 'GATTACA'. Plus, within the 'GATTACA' string, allow for at most a couple mismatches.
fqgrep -c -f -p '^[AGCTN]{0,3}GATTACA' -m 2 input.fq
View a detailed statistics report of just the matches in the above command:
fqgrep -c -r -p '^[AGCTN]{0,3}GATTACA' -m 2 input.fq
Usage of a single dash, '-', as a file name allows for reading a FASTQ or FASTA file from STDIN.
cat input.fq | fqgrep -p 'GATTACA' -
Please note that when reading from standard input, fqgrep
is expecting an input of a FASTQ or FASTA format. Not a detailed statistics report format. See the 'A simple adaptive trimmer' section below for an example observation.
Show all the reads (in FASTQ format) that DO NOT exactly start with 'GATTACA':
fqgrep -v -e -p '^GATTACA' input.fq
Show all the reads (in FASTQ format) that DO NOT start with 'GATTACA' AND end with a polyA sequence (8 or more A's) with a couple of mismatches:
fqgrep -v -e -p '^GATTACA' input.fq | fqgrep -m 2 -p 'A{8,}$' -
The following command produces a cheap 5' and 3' end trimmer (assuming all the reads in the FASTQ file are oriented in a 5' to 3' direction).
fqgrep -a -m 1 -r -p '^[AGCTN]{0,3}GATTACA' input.fq | trimmer --5-prime | fqgrep -a -e -r -p 'AAAAAA$' - | trimmer --3-prime
The first command:
fqgrep -a -m 1 -r -p '^[AGCTN]{0,3}GATTACA' input.fq
says to identify all the reads in the input file that may have 0-3 random bases (A, G, C, T or N) followed by the sequence GATTACA
allowing for a single mismatch (-m
option) and output all the FASTQ records (irregardless if there there is a match or not, -a
option) in the tab-delimited statistics report format (-r
option) . This output is taken as input into the next piped command:
trimmer --5-prime
trimmer
is an example perl script that produces FASTQ formatted output from its supplied fqgrep
statistics report input. For each input record, trimmer
will trim out the identified match string from the 5' end, if there was a match. If there was no match, it does nothing and simply re-prints out the original fastq record. trimmer
's FASTQ output serves as the input to the next piped fqgrep
command:
fqgrep -a -e -r -p 'AAAAAA$' -
The above statement says to find all the reads supplied from standard input that have the exact (0 mismatches) polyA sequence AAAAAA
on its 3' end. As a regular expression is being supplied for the -p
search pattern, the -e
option of fqgrep is used to ensure the usage of the regular expression engine in the pattern search. The output of the above is inputted into the following subsequent trimming command:
trimmer --3-prime
The above statement performs a similar function as the prior trimmer
command, but trims from the 3' end instead.
The above trimmer
script and another example script fqgrep-trim.pl
are supplied in the fqgrep package. They are both simplistic trimmers. The fqgrep-trim.pl
script performs similiarly to the trimmer
script, but provides a differing output format and user interface. See the documentation inside the fqgrep-trim.pl
script for more details.
One could imagine creating more complex trimmers based upon the quality score of the read, the experimental design generating the reads, the sequencing technology being used and other notions. This is left for the user to further customize for the particular scientific problem at hand.