
Using rextraccnt to extract entries from nt-like databases

Jose Manuel Martí edited this page Jan 16, 2023 · 1 revision

Overview

Rextraccnt extracts entries from metagenomic databases that use the FASTA format of the NCBI nt database, in which each entry is identified by its accession number. The name of the script is intentionally close to rextract because the two scripts are analogous in several ways. Rextraccnt is useful when you have a very large database (e.g., a decontaminated version of NCBI BLAST nt) and need a subset selected by taxonomic criteria, such as the entries belonging to organisms under a given clade or, conversely, all entries except those under some branch of the taxonomic tree. This page first describes the expected input format and how to obtain an accession-to-taxid mapping file, then details the command layout, and ends with an example.

Input format: NCBI nt fasta file

The input is expected to be in the format used by the NCBI BLAST nt FASTA files, which have the accession number as the sequence id. For example:

>X51700.1 Bos taurus mRNA for bone Gla protein
GTCCACGCAGCCGCTGACAGACACACCATGAGAACCCCCATGCTGCTCGC...
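
To make the header convention concrete, here is a minimal Python sketch of how the accession could be read from such a header line. The function name accession_from_header is hypothetical and used only for illustration; it is not part of rextraccnt, and the sketch assumes the sequence id is everything between '>' and the first whitespace.

```python
def accession_from_header(header: str) -> str:
    """Return the sequence id (accession) from an nt-style FASTA header line.

    Hypothetical helper for illustration; not part of rextraccnt.
    """
    if not header.startswith(">"):
        raise ValueError("not a FASTA header line")
    # Assumption: the id is the text after '>' up to the first whitespace
    fields = header[1:].split(maxsplit=1)
    if not fields:
        raise ValueError("FASTA header has no sequence id")
    return fields[0]


print(accession_from_header(">X51700.1 Bos taurus mRNA for bone Gla protein"))
# prints: X51700.1
```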

Accession to taxonomic id mapping

To run the rextraccnt command, you will need a file that maps each accession in the input file to its taxonomic identifier (NCBI Taxonomy), one accession per line, in no particular order. For example:

X51700.1 9913
...
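
As an illustration of this layout, the following Python sketch loads such a mapping into a dictionary. The function name load_mapping is hypothetical, and the sketch assumes exactly two whitespace-separated fields per line (accession, then integer taxid); it is not how rextraccnt itself reads the file.

```python
import io
from typing import Dict, TextIO


def load_mapping(handle: TextIO) -> Dict[str, int]:
    """Read 'accession taxid' lines into a dict (hypothetical helper)."""
    mapping: Dict[str, int] = {}
    for lineno, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(
                f"line {lineno}: expected 'accession taxid', got {line!r}")
        accession, taxid = fields
        mapping[accession] = int(taxid)
    return mapping


# In-memory stand-in for a mapping file, using the example line above
example = io.StringIO("X51700.1 9913\n")
print(load_mapping(example))  # prints: {'X51700.1': 9913}
```

For a mapping that covers the whole nt database, a dictionary like this would hold a very large number of entries, so memory use is worth keeping in mind.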

You pass the name of the mapping file to rextraccnt via the --mapfile argument. For the NCBI nt database, for example, you can generate this file with blastdbcmd from the NCBI BLAST+ suite:

blastdbcmd -db nt -entry all -outfmt "%a %T" > nt.fa.taxidmapping

Command layout

The layout of the Rextraccnt (rextraccnt) command (ver. 1.12.0) is:

usage: rextraccnt [-h] [-d] [-l NUMBER] [-e NUMBER] [-n PATH] [-i TAXID]
                  [-x TAXID] -m FILE [-f FILE] [-c] [-V]

General arguments

  -h, --help            show this help message and exit
  -d, --debug           increase output verbosity and perform additional checks
  -l NUMBER, --limit NUMBER
                        limit of nt DB entries to extract; default: no limit
  -e NUMBER, --entrymax NUMBER
                        maximum number of nt DB entries to search for the taxa; default: no maximum
  -n PATH, --nodespath PATH
                        path for the nodes information files (nodes.dmp and names.dmp from NCBI)
  -m FILE, --mapfile FILE
                        Mapping (accession to taxid) file
  -c, --compress        Output FASTA file will be gzipped
  -V, --version         show program's version number and exit

Selection of reads based on the taxonomy

  -i TAXID, --include TAXID
                        NCBI taxid code to include a taxon and all underneath
                        (multiple -i is available to include several taxid);
                        by default all the taxa is considered for inclusion
  -x TAXID, --exclude TAXID
                        NCBI taxid code to exclude a taxon and all underneath
                        (multiple -x is available to exclude several taxid)

Input

  -f FILE, --ntfastafile FILE
                        NCBI nt formatted FASTA file
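
The -i and -x options select a taxon together with everything underneath it. The Python sketch below shows one way such a selection could work, using a child-to-parent map built from lines in the nodes.dmp layout. It is an assumption-laden illustration, not the rextraccnt implementation: the names parse_nodes and is_selected are hypothetical, the toy tree omits intermediate ranks and trailing nodes.dmp columns, and giving exclusion precedence over inclusion is a design choice of this sketch.

```python
from typing import Dict, Iterable, Set


def parse_nodes(lines: Iterable[str]) -> Dict[int, int]:
    """Build a child -> parent taxid map from nodes.dmp-style lines."""
    parents: Dict[int, int] = {}
    for line in lines:
        # nodes.dmp separates fields with '|'; the first two fields are
        # the taxid and its parent taxid
        fields = [field.strip() for field in line.split("|")]
        parents[int(fields[0])] = int(fields[1])
    return parents


def is_selected(taxid: int, parents: Dict[int, int],
                include: Set[int], exclude: Set[int]) -> bool:
    """Decide whether a taxid passes the include/exclude filters."""
    lineage: Set[int] = set()
    node = taxid
    # Walk up to the root (which is its own parent in nodes.dmp)
    while node not in lineage:
        lineage.add(node)
        parent = parents.get(node)
        if parent is None:
            break
        node = parent
    if lineage & exclude:  # assumption: exclusion wins over inclusion
        return False
    # With no include list, every taxon is considered for inclusion
    return not include or bool(lineage & include)


# Toy tree (intermediate ranks and trailing columns omitted on purpose):
# 1 root, 2759 Eukaryota, 4751 Fungi, 4932 S. cerevisiae, 9913 Bos taurus
toy_nodes = [
    "1\t|\t1\t|\tno rank\t|",
    "2759\t|\t1\t|\tsuperkingdom\t|",
    "4751\t|\t2759\t|\tkingdom\t|",
    "4932\t|\t4751\t|\tspecies\t|",
    "9913\t|\t2759\t|\tspecies\t|",
]
parents = parse_nodes(toy_nodes)
print(is_selected(4932, parents, include={4751}, exclude=set()))  # True
print(is_selected(9913, parents, include={4751}, exclude=set()))  # False
```

Walking up from each taxid keeps the sketch short; precomputing the full set of descendants of the included and excluded taxa once would avoid repeating the walk for every entry of a large database.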

Example

For example, suppose that you:

  • want to extract all the fungal (taxid: 4751) entries of a decontaminated nt database nt_decon.fa,
  • have cloned the repo in ~/recentrifuge,
  • have taxonomy files downloaded and expanded to /my/tax/dir (or just use retaxdump for that),
  • have generated a mapping file named nt.fa.taxidmapping,
  • want to get some extra information about the taxonomy.

Then you may run:

~/recentrifuge/rextraccnt -d -n /my/tax/dir -i 4751 -m nt.fa.taxidmapping -f nt_decon.fa

Since the NCBI nt database is about 1 TB in size, the process may take more than an hour to complete. When it finishes, you will get the file nt_decon_rxnt_incl4751.fa as a result.
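
Once the run finishes, a quick sanity check is to count how many entries the resulting FASTA file contains. The Python sketch below does this by counting header lines; the function name count_fasta_entries and the toy accessions are hypothetical, and the in-memory sample only stands in for a real result file.

```python
import io
from typing import Iterable


def count_fasta_entries(lines: Iterable[str]) -> int:
    """Count FASTA entries by counting lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


# Toy stand-in for a result file; with a real file, an open text handle
# (e.g. open("nt_decon_rxnt_incl4751.fa")) can be passed in the same way
sample = io.StringIO(
    ">ACC1.1 first toy entry\nACGT\nACGT\n>ACC2.1 second toy entry\nGGCC\n")
print(count_fasta_entries(sample))  # prints: 2
```

Because the file is read line by line, the check does not need to load the whole result into memory.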