The Filtered NT dataset is generated by removing sequences from the complete nt file provided by NCBI when they are assigned to an unwanted taxonomy name or to any child of an unwanted taxonomy name. The unwanted taxonomy names make up the black list, which is generated in two steps:
- Get all taxonomy names that contain any of the strings listed below (Step 3);
- Get all child taxonomy names of each taxonomy name found in the first step. For example, "other sequences" (taxId: 28384) is excluded together with all of its child taxonomy names, including "artificial sequence", "vector", "synthetic", and so on.
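The second step amounts to a traversal over the parent/child links of the taxonomy. The sketch below is only an illustration of that idea, not the repository's implementation: the function names are hypothetical, every taxId other than 28384 and the root is made up, and it assumes the (taxId, parent taxId) pairs have already been read from the taxonomy dump.

```python
from collections import defaultdict, deque


def build_children_index(node_rows):
    """Map each parent taxId to the list of its direct children.

    node_rows: iterable of (tax_id, parent_tax_id) pairs.
    """
    children = defaultdict(list)
    for tax_id, parent_id in node_rows:
        # The root node lists itself as its own parent; skip that self-link.
        if tax_id != parent_id:
            children[parent_id].append(tax_id)
    return children


def collect_descendants(seed_ids, children):
    """Return the seeds plus every taxId below them (breadth-first)."""
    seen = set(seed_ids)
    queue = deque(seen)
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Toy tree: 28384 ("other sequences") with made-up children 900001..900003.
toy_rows = [
    (1, 1),
    (28384, 1),
    (900001, 28384),
    (900002, 28384),
    (900003, 900002),
    (900010, 1),  # unrelated branch, must not be collected
]
index = build_children_index(toy_rows)
print(sorted(collect_descendants([28384], index)))
```

Building the parent-to-children index once keeps the traversal linear in the number of nodes, which matters when the black list ends up with more than a million entries.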
We have chosen to apply the Creative Commons Attribution 3.0 Unported License to this version of the software.
Version | Downloadable Files | File Size | Release Notes | NCBI Download Date |
---|---|---|---|---|
Version 7.0 | Filtered NT v7.0 | 278 G | Release Notes v7.0 | 2023-05-16 |
Version 6.0 | Filtered NT v6.0 | 168 G | Release Notes v6 | July 2018 |
Version 5.0 | Filtered_NT v5.0 | 131 G | Release Notes v5.0 | May 2017 |
Version 4.0 | Filtered NT v4.0 | 110 G | Release Notes v4.0 | July 2016 |
Clone the repo and add data directories:
git clone https://github.com/GW-HIVE/filtered_nt.git
cd filtered_nt
mkdir raw_data
mkdir output_data
mkdir logfiles
Create and activate a virtual environment, then install the dependencies:
python -m venv env
. env/bin/activate
python -m pip install -r requirements.txt
The nt file is very large, so downloading and decompressing it will take a long time.
downloaded from: ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/
commands:
cd raw_data
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz
gunzip nt.gz
downloaded from: ftp://ftp.ncbi.nih.gov/pub/taxonomy/
accession2taxid version: 2023-06-19
commands:
mkdir accession2taxid
cd accession2taxid
curl -O -L 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/dead_nucl.accession2taxid.gz'
curl -O -L 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/dead_prot.accession2taxid.gz'
curl -O -L 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/dead_wgs.accession2taxid.gz'
curl -O -L 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz'
curl -O -L 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_wgs.accession2taxid.EXTRA.gz'
curl -O -L 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_wgs.accession2taxid.gz'
curl -O -L 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/pdb.accession2taxid.gz'
curl -O -L 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz'
Hat tip to https://github.com/acorg/ncbi-taxonomy-database
taxdump version: 2023-06-20
commands:
mkdir new_taxdump
cd new_taxdump
curl -O -L 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz'
tar xfz new_taxdump.tar.gz
There is a Makefile in the repo root for constructing the taxonomy-accession DBs that we will use later on. Each DB takes a significant amount of time to build, so be patient.
a) To create the taxonomy.db file, run:
make nucleotide
b) To create the dead_taxonomy.db file, run:
make dead
c) To create the protein_taxonomy.db file, run:
make protein
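For readers who want a feel for what an accession-to-taxId DB looks like, here is a rough sketch using Python's built-in sqlite3 module. It is not what the Makefile does: the table name, column names, and both function names are assumptions made for illustration. The only thing taken as given is the tab-separated layout of the accession2taxid files (accession, accession.version, taxid, gi, with a header row).

```python
import sqlite3


def load_accession2taxid(conn, lines):
    """Load accession.version -> taxid pairs into a (hypothetical) table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS accession_taxid ("
        "accession_version TEXT PRIMARY KEY, taxid INTEGER NOT NULL)"
    )

    def rows():
        for line in lines:
            fields = line.rstrip("\n").split("\t")
            # Skip the header row and anything malformed.
            if len(fields) < 3 or fields[0] == "accession":
                continue
            yield fields[1], int(fields[2])

    with conn:  # commit once at the end of the bulk insert
        conn.executemany(
            "INSERT OR REPLACE INTO accession_taxid VALUES (?, ?)", rows()
        )
    return conn.execute("SELECT COUNT(*) FROM accession_taxid").fetchone()[0]


def lookup_taxid(conn, accession_version):
    """Return the taxid for an accession.version, or None if it is absent."""
    row = conn.execute(
        "SELECT taxid FROM accession_taxid WHERE accession_version = ?",
        (accession_version,),
    ).fetchone()
    return row[0] if row else None


# Made-up rows in the accession2taxid column layout.
demo = [
    "accession\taccession.version\ttaxid\tgi\n",
    "ZZ000001\tZZ000001.1\t9606\t0\n",
    "ZZ000002\tZZ000002.2\t28384\t0\n",
]
conn = sqlite3.connect(":memory:")
print(load_accession2taxid(conn, demo), lookup_taxid(conn, "ZZ000002.2"))
```

With the real files you would pass an open gzip text stream (for example from gzip.open(path, "rt")) instead of the in-memory list, and write to a file on disk rather than ":memory:".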
There are two scripts for generating the black list. The first gets all taxonomy names that contain any of the strings listed below. The second gets all child taxonomy names of the names found by the first. Unwanted taxonomy names (scientific names) from names.dmp include:
['unclassified','unidentified','uncultured', 'unspecified','unknown',
'phage','vector', 'environmental sample','artificial sequence',
'other sequence']
- script 1: parent_taxid_blacklist.py
  default output: ./output_data/blacklist-taxId.1.csv
- script 2: child_taxid_blacklist.py
  default output: ./output_data/blacklist_children.csv
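The name-matching idea behind the first script can be sketched as follows. This is an illustration under stated assumptions rather than the contents of parent_taxid_blacklist.py: the function names are hypothetical, the match is assumed to be a case-insensitive substring test, and the sample rows (apart from 28384 and 9606) are invented. It relies on the names.dmp layout, in which fields are separated by a tab, a pipe, and a tab.

```python
# The unwanted strings listed above.
UNWANTED_TERMS = [
    "unclassified", "unidentified", "uncultured", "unspecified", "unknown",
    "phage", "vector", "environmental sample", "artificial sequence",
    "other sequence",
]


def parse_dmp_line(line):
    """Split one taxdump .dmp line into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):  # every row ends with TAB + pipe
        line = line[:-2]
    return line.split("\t|\t")


def find_seed_taxids(lines, terms=UNWANTED_TERMS):
    """Return {taxId: name} for scientific names containing an unwanted term."""
    hits = {}
    for line in lines:
        fields = parse_dmp_line(line)
        if len(fields) < 4:
            continue
        tax_id, name, _unique_name, name_class = fields[:4]
        if name_class != "scientific name":
            continue
        # Assumption: case-insensitive substring match.
        lowered = name.lower()
        if any(term in lowered for term in terms):
            hits[int(tax_id)] = name
    return hits


sample = [
    "28384\t|\tother sequences\t|\t\t|\tscientific name\t|\n",
    "900001\t|\tuncultured example bacterium\t|\t\t|\tscientific name\t|\n",
    "900002\t|\tunknown example\t|\t\t|\tsynonym\t|\n",
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n",
]
print(find_seed_taxids(sample))
```

The taxIds returned here would be the starting points for the second script, which expands each of them to its full set of children.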
After generating blacklist_children.csv, use "sort -u" on the command line to remove duplicate records, and store the result in a separate file:
sort -u blacklist_children.csv > blacklist_children_unique.csv
QC step: compare the line count of the deduplicated file with that of the original file. The deduplicated file should never have more lines than the original.
wc -l blacklist_children_unique.csv
1452016 blacklist_children_unique.csv
wc -l blacklist_children.csv
1457194 blacklist_children.csv
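The same deduplication and count comparison can be sketched in Python, which may be convenient if the QC is scripted. The function name is hypothetical and this is only an approximate equivalent of "sort -u": the ordering of sort depends on the locale, although the counts do not.

```python
def dedupe_report(lines):
    """Return (unique_records, total_count, unique_count, removed_count)."""
    records = [line.rstrip("\n") for line in lines]
    unique = sorted(set(records))
    return unique, len(records), len(unique), len(records) - len(unique)


# Made-up records standing in for rows of blacklist_children.csv.
demo = ["3,a\n", "1,b\n", "3,a\n", "2,c\n", "1,b\n"]
unique, total, kept, removed = dedupe_report(demo)
print(total, kept, removed)

# For the counts shown above, the number of duplicates removed is:
print(1457194 - 1452016)  # 5178
```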
We need to check that every accession in the nt file has a taxId associated with it in our DBs. If any accession has no taxId, you will need to troubleshoot it.
- script: ac2taxid_check.py
  default output: ./logfiles/accession2taxid_log.txt
The output file accession2taxid_log.txt should be empty. If it is not, you will have to troubleshoot the entries it reports.
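The check can be pictured roughly as below. This is a sketch, not ac2taxid_check.py itself: the function names are hypothetical, a plain dict stands in for the taxonomy DBs, and it assumes that a header may bundle several entries separated by the Ctrl-A character and that each entry starts with its accession.version.

```python
def header_accessions(header):
    """Yield the accession.version of each entry in one FASTA header line."""
    # Assumption: entries for identical sequences are joined with Ctrl-A.
    for entry in header.lstrip(">").split("\x01"):
        entry = entry.strip()
        if entry:
            yield entry.split(None, 1)[0]


def missing_accessions(fasta_lines, acc2taxid):
    """Return accessions found in headers that have no taxId mapping."""
    missing = []
    for line in fasta_lines:
        if line.startswith(">"):
            for acc in header_accessions(line):
                if acc not in acc2taxid:
                    missing.append(acc)
    return missing


# Made-up records and a dict standing in for the DB lookup.
demo_fasta = [
    ">ZZ000001.1 first entry\x01ZZ000002.1 second entry\n",
    "ACGT\n",
    ">ZZ999999.1 no mapping for this one\n",
    "GG\n",
]
demo_map = {"ZZ000001.1": 9606, "ZZ000002.1": 28384}
print(missing_accessions(demo_fasta, demo_map))
```

An empty result corresponds to an empty log file; anything returned is an accession to investigate before filtering.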
protocol:
- script: filter-nt.py
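The filtering itself can be sketched as a single streaming pass over the FASTA file. The code below is an illustration and not filter-nt.py: the function name is hypothetical, dicts and sets stand in for the DBs and the black list file, and two policies are assumed rather than known, namely that the first accession in a header decides the record's fate and that records without a taxId mapping are kept.

```python
def filter_fasta(lines, acc2taxid, blacklist):
    """Yield the lines of every record whose taxId is not blacklisted."""
    keep = False  # lines before the first header are dropped
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split(None, 1)
            acc = parts[0] if parts else ""
            # Assumption: an unmapped accession (taxId None) is kept.
            keep = acc2taxid.get(acc) not in blacklist
        if keep:
            yield line


# Made-up records; 900002 plays the role of a blacklisted child taxId.
demo_fasta = [
    ">ZZ000001.1 wanted record\n", "ACGT\n",
    ">ZZ000003.1 blacklisted record\n", "TTTT\n", "CCCC\n",
    ">ZZ000004.1 another wanted record\n", "GG\n",
]
demo_map = {"ZZ000001.1": 9606, "ZZ000003.1": 900002, "ZZ000004.1": 9606}
demo_blacklist = {28384, 900002}
print("".join(filter_fasta(demo_fasta, demo_map, demo_blacklist)), end="")
```

Streaming line by line avoids holding the nt file in memory, which matters given the file sizes listed in the table above.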