Short introduction to using NCBI blast tools from the command line
Sometimes you need to run blast on your own computer, for example to query thousands of sequences against a custom database of hundreds of thousands of sequences. To do that, you need to install blast+ on your computer, format the database, and then blast the sequences.
Here is a short tutorial on how to do this.
Get the compiled executables from this URL:
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
Decompress the archive. For example:
tar xvfz ncbi-blast-2.9.0+-x64-linux.tar.gz
Add the bin folder from the extracted archive to your path. For example, add the following line to your ~/.bashrc file:
export PATH="/PATH/TO/ncbi-blast-2.9.0+/bin":$PATH
And change the /PATH/TO part to the path where you have put the extracted archive.
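To confirm that the PATH change took effect, a quick check along these lines can help. This is a minimal sketch: it assumes you have opened a new terminal (or re-sourced ~/.bashrc), and the variable name blast_status and the messages are only illustrative.

```shell
# Check whether the blast+ executables are visible on the PATH.
if command -v blastn >/dev/null 2>&1; then
    blast_status="found"
    # Print the version of the blastn that will be used
    blastn -version
else
    blast_status="missing"
    echo "blastn not found: check the PATH line in ~/.bashrc" >&2
fi
echo "blastn status: $blast_status"
```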
In order to test blast, you need test fasta files. Use the following files that come with the tutorial:
sequences.fasta
reference.fasta
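Before going further, it can be useful to check that both files are present and to see how many sequences each one contains. The sketch below relies only on the fact that every fasta record starts with a ">" header line; the helper name count_fasta is made up for this example.

```shell
# Count the records in a fasta file (one ">" header line per record).
count_fasta() {
    # $1 = path to a fasta file
    grep -c '^>' "$1"
}

# Report on the two tutorial files if they are in the current folder.
for f in sequences.fasta reference.fasta; do
    if [ -f "$f" ]; then
        echo "$f: $(count_fasta "$f") sequences"
    else
        echo "$f: not found in the current folder" >&2
    fi
done
```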
The different blast tools require a formatted database to search against. In order to create the database, we use the makeblastdb tool:
makeblastdb -in reference.fasta -title reference -dbtype nucl -out databases/reference
This will create a set of files in the databases folder. These are all part of the blast database.
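To check that the database was written, you can look for its core files. The sketch below assumes a small, single-volume nucleotide database, which is normally stored as .nhr, .nin and .nsq files next to each other (larger or newer-format databases can contain additional files). The helper name count_missing_db_files is made up for this example.

```shell
# Count how many of the core nucleotide database files are missing.
count_missing_db_files() {
    # $1 = database prefix, for example databases/reference
    missing=0
    for ext in nhr nin nsq; do
        if [ ! -f "$1.$ext" ]; then
            echo "missing: $1.$ext" >&2
            missing=$((missing + 1))
        fi
    done
    # Print the number of missing files (0 means all three were found)
    echo "$missing"
}

count_missing_db_files databases/reference
```

If blastdbcmd is on your path, blastdbcmd -db databases/reference -info prints a short summary of the database, including the number of sequences it holds.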
We can now blast our sequences against the database. In this case, both our query sequences and database sequences are DNA sequences, so we use the blastn tool:
blastn -db databases/reference -query sequences.fasta -evalue 1e-3 -word_size 11 -outfmt 0 > sequences.reference
You can use different output formats with the outfmt option:
-outfmt <String>
alignment view options:
0 = pairwise,
1 = query-anchored showing identities,
2 = query-anchored no identities,
3 = flat query-anchored, show identities,
4 = flat query-anchored, no identities,
5 = XML Blast output,
6 = tabular,
7 = tabular with comment lines,
8 = Text ASN.1,
9 = Binary ASN.1,
10 = Comma-separated values,
11 = BLAST archive format (ASN.1)
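The tabular format (6) is often the easiest one to process with other command line tools. By default its twelve tab-separated columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue and bitscore. As a minimal sketch, the function below keeps only the hits above a minimum percent identity and alignment length. The function name filter_hits, the output file name and the thresholds are assumptions chosen for illustration.

```shell
# Keep hits with at least a given percent identity and alignment length.
# Assumes the default -outfmt 6 column order, where column 3 is pident
# and column 4 is the alignment length.
filter_hits() {
    # $1 = tabular blast output, $2 = minimum percent identity,
    # $3 = minimum alignment length
    awk -F'\t' -v min_id="$2" -v min_len="$3" \
        '$3 >= min_id && $4 >= min_len' "$1"
}

# Hypothetical usage, once blastn has been run with -outfmt 6:
# blastn -db databases/reference -query sequences.fasta -evalue 1e-3 -word_size 11 -outfmt 6 > sequences.reference.tsv
# filter_hits sequences.reference.tsv 90 100
```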
If you need to run your blasts faster (and who doesn't?), you can maximise CPU usage with GNU Parallel. You will find it on the GNU Parallel website. Download the archive, extract it (with tar xvfj parallel-latest.tar.bz2), move into the extracted folder and install it with the following commands:
./configure
make
sudo make install
We can now use parallel to speed up blast:
time cat sequences.fasta | parallel -k --block 1k --recstart '>' --pipe 'blastn -db databases/reference -query - -evalue 1e-3 -word_size 11 -outfmt 0' > sequences.reference
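Because parallel cuts the query file into blocks and runs one blastn per block, it is worth checking that no query was lost along the way. In pairwise output (-outfmt 0), each query report starts with a "Query= " line, so the two counts can be compared. This is only a sketch, and the function name check_all_queries_reported is made up for this example.

```shell
# Compare the number of query records with the number of query reports.
check_all_queries_reported() {
    # $1 = query fasta file, $2 = pairwise (-outfmt 0) blast output
    # grep -c exits with status 1 when the count is 0, hence the "|| true"
    n_in=$(grep -c '^>' "$1" || true)
    n_out=$(grep -c '^Query= ' "$2" || true)
    echo "queries: $n_in, reports: $n_out"
    # Succeeds only when both counts are equal
    [ "$n_in" -eq "$n_out" ]
}

# Hypothetical usage after the parallel run above:
# check_all_queries_reported sequences.fasta sequences.reference
```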
If you need help with the options and parameters you can pass to blastn and the other blast+ utilities, use the -help option and pipe the output into less, for example:
blastn -help | less
NCBI blast tools cover more cases than DNA against DNA searches. For example, you can search a protein database with either DNA or protein sequences. Here is an exhaustive list of the programs that come with the blast+ distribution:
blastdb_aliastool
blastdbcheck
blastdbcmd
blast_formatter
blastn
blastp
blastx
convert2blastmask
deltablast
dustmasker
legacy_blast.pl
makeblastdb
makembindex
makeprofiledb
psiblast
rpsblast
rpstblastn
segmasker
tblastn
tblastx
update_blastdb.pl
windowmasker
O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.
NCBI blast tutorial by Eric Normandeau is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://github.com/enormandeau/ncbi_blast_tutorial.