This directory contains the following NCBI Taxonomy database dump files:
taxdmp.zip
taxdump.tar.Z
taxdump.tar.gz
All of these files contain exactly the same information; the different
formats are provided for the convenience of unpacking them in various
operating environments.
In addition there are files:
taxdmp.zip.md5
taxdump.tar.Z.md5
taxdump.tar.gz.md5
which contain MD5 sums for the corresponding archive files. These files
can be used to verify that the corresponding archive file was downloaded
correctly.
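The check described above can be sketched in Python as follows. This is a
minimal illustration, not part of the NCBI distribution. It assumes the .md5
file follows the common md5sum layout (hex digest first, then the file name);
the function names md5_of_file and matches_md5_file are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(archive_path, md5_path):
    """Compare the archive's digest with the first token of its .md5 file.

    Assumption: the .md5 file starts with the hex digest (md5sum layout).
    """
    with open(md5_path, "r", encoding="ascii") as handle:
        expected = handle.read().split()[0].lower()
    return md5_of_file(archive_path) == expected
```

For example, matches_md5_file("taxdump.tar.gz", "taxdump.tar.gz.md5") should
return True for an intact download, provided the layout assumption holds.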
taxdmp.zip
----------
This file is intended for zip-capable utilities such as pkunzip, unzip, and
WinZip. These utilities are widely available in almost all operating
environments. To unpack it with command-line pkunzip or unzip:
pkunzip taxdmp.zip
or
unzip taxdmp.zip
Note: the pkunzip and/or unzip executables must be in the executable search
path, and taxdmp.zip must be in the current directory. Files will be unzipped
into the current directory. To place the dump files elsewhere, or for other
options, please refer to the manual and/or option descriptions of the pkunzip
and unzip utilities.
taxdump.tar.Z
-------------
This file is to be unpacked with the uncompress utility and then the tar
archiver. These utilities are usually available in UNIX-like environments.
Unpacking instructions follow:
uncompress -c taxdump.tar.Z | tar xf -
taxdump.tar.gz
--------------
This file is to be unpacked with the GNU gunzip utility and then the tar
archiver. These utilities are usually available in UNIX-like environments.
Unpacking instructions follow:
gunzip -c taxdump.tar.gz | tar xf -
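Where the command-line utilities are not available, the zip and tar.gz
archives can also be unpacked from Python. The sketch below uses only the
standard zipfile and tarfile modules; the function name unpack_taxdump is
hypothetical. Python's standard library has no decompressor for the .Z
(compress) format, so taxdump.tar.Z is not handled here.

```python
import tarfile
import zipfile


def unpack_taxdump(archive_path, dest_dir="."):
    """Extract a zip or (optionally compressed) tar archive into dest_dir.

    Returns the list of member names found in the archive.
    """
    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(dest_dir)
            return archive.namelist()
    if tarfile.is_tarfile(archive_path):
        # "r:*" lets tarfile detect gzip compression on its own.
        with tarfile.open(archive_path, "r:*") as archive:
            names = archive.getnames()
            archive.extractall(dest_dir)
            return names
    raise ValueError("not a zip or tar archive: %s" % archive_path)
```

Only unpack archives obtained from a trusted source, since extractall writes
whatever paths the archive contains.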
The content of the archive
--------------------------
After unpacking, the archive content may look like this:
citations.dmp
delnodes.dmp
division.dmp
gencode.dmp
merged.dmp
names.dmp
nodes.dmp
readme.txt
The readme.txt file gives a brief description of the *.dmp files. These files
contain taxonomic information and are briefly described below. Each file
stores one record per line, and each record is terminated by "\t|\n"
(tab, vertical bar, and newline) characters. Each record consists of one
or more fields delimited by "\t|\t" (tab, vertical bar, and tab) characters.
A brief description of the field positions and meanings for each file follows.
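The record layout described above can be read with a few lines of Python.
This is a minimal sketch: the function names parse_dmp_line and read_dmp are
hypothetical, and UTF-8 is an assumed file encoding that this readme does not
state.

```python
def parse_dmp_line(line):
    """Split one *.dmp record into a list of field strings."""
    line = line.rstrip("\n")
    # Drop the record terminator (tab + vertical bar) before splitting.
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def read_dmp(path, encoding="utf-8"):
    """Yield the fields of each non-empty record in a *.dmp file.

    Assumption: the file is UTF-8 text; pass another encoding if needed.
    """
    with open(path, encoding=encoding) as handle:
        for line in handle:
            if line.strip():
                yield parse_dmp_line(line)
```

Empty fields are kept as empty strings, so the position of each field stays
aligned with the per-file descriptions.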
nodes.dmp
---------
This file represents taxonomy nodes. The description for each node includes
the following fields:
tax_id -- node id in GenBank taxonomy database
parent tax_id -- parent node id in GenBank taxonomy database
rank -- rank of this node (superkingdom, kingdom, ...)
embl code -- locus-name prefix; not unique
division id -- see division.dmp file
inherited div flag (1 or 0) -- 1 if node inherits division from parent
genetic code id -- see gencode.dmp file
inherited GC flag (1 or 0) -- 1 if node inherits genetic code from parent
mitochondrial genetic code id -- see gencode.dmp file
inherited MGC flag (1 or 0) -- 1 if node inherits mitochondrial gencode from parent
GenBank hidden flag (1 or 0) -- 1 if name is suppressed in GenBank entry lineage
hidden subtree root flag (1 or 0) -- 1 if this subtree has no sequence data yet
comments -- free-text comments and citations
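Because every node stores the id of its parent, a lineage can be recovered by
walking parent links. The sketch below is illustrative only: the function
names are hypothetical, and it assumes the root node lists itself as its own
parent, which this readme does not state.

```python
def load_parents(lines):
    """Map tax_id -> (parent tax_id, rank) from an iterable of nodes.dmp lines."""
    parents = {}
    for line in lines:
        text = line.rstrip("\n")
        if text.endswith("\t|"):
            text = text[:-2]
        fields = text.split("\t|\t")
        # Fields 0, 1, 2 are tax_id, parent tax_id, and rank.
        parents[int(fields[0])] = (int(fields[1]), fields[2])
    return parents


def lineage(tax_id, parents):
    """Return the tax_ids from `tax_id` up to the root, inclusive.

    Assumption: the root is the node whose parent tax_id equals its own id.
    """
    path = [tax_id]
    while parents[tax_id][0] != tax_id:
        tax_id = parents[tax_id][0]
        path.append(tax_id)
    return path
```

A typical call would be lineage(some_id, load_parents(open("nodes.dmp"))). An
id missing from nodes.dmp raises KeyError rather than returning a partial
path.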
names.dmp
---------
Taxonomy names file has these fields:
tax_id -- the id of the node associated with this name
name_txt -- the name itself
unique name -- the unique variant of this name, if the name is not unique
name class -- (synonym, common name, ...)
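A node can have several names, one per record, distinguished by name class.
The sketch below collects the names of one class per tax_id. The function
name is hypothetical, and the default class "scientific name" is an assumed
value that this readme does not list explicitly.

```python
from collections import defaultdict


def names_by_class(lines, name_class="scientific name"):
    """Map tax_id -> list of name_txt values whose name class matches.

    Assumption: "scientific name" is one of the name class values in use.
    """
    result = defaultdict(list)
    for line in lines:
        text = line.rstrip("\n")
        if text.endswith("\t|"):
            text = text[:-2]
        tax_id, name_txt, unique_name, cls = text.split("\t|\t")
        if cls == name_class:
            result[int(tax_id)].append(name_txt)
    return dict(result)
```

Passing name_class="synonym" or name_class="common name" selects the classes
mentioned in the field list above.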
division.dmp
------------
Divisions file has these fields:
division id -- taxonomy database division id
division cde -- GenBank division code (three characters), e.g. BCT, PLN, VRT, MAM, PRI...
division name -- full name of the division
comments
gencode.dmp
-----------
Genetic codes file fields:
genetic code id -- GenBank genetic code id
abbreviation -- genetic code name abbreviation
name -- genetic code name
cde -- translation table for this genetic code
starts -- start codons for this genetic code
delnodes.dmp
------------
Deleted nodes (nodes that existed but were deleted) file field:
tax_id -- deleted node id
merged.dmp
----------
Merged nodes file fields:
old_tax_id -- id of the node that has been merged
new_tax_id -- id of the node that is the result of the merge
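Together, delnodes.dmp and merged.dmp allow an outdated tax_id to be brought
up to date. The sketch below is one possible approach with hypothetical
function names. It follows merge links repeatedly in case a merge target was
itself merged later; whether such chains actually occur in the data is an
assumption.

```python
def load_merged(lines):
    """Map old_tax_id -> new_tax_id from an iterable of merged.dmp lines."""
    merged = {}
    for line in lines:
        text = line.rstrip("\n")
        if text.endswith("\t|"):
            text = text[:-2]
        old_id, new_id = text.split("\t|\t")
        merged[int(old_id)] = int(new_id)
    return merged


def resolve_tax_id(tax_id, merged, deleted=frozenset()):
    """Return the current id for `tax_id`, or None if the node was deleted."""
    seen = set()
    # Follow merge links; `seen` guards against a malformed cycle.
    while tax_id in merged and tax_id not in seen:
        seen.add(tax_id)
        tax_id = merged[tax_id]
    return None if tax_id in deleted else tax_id
```

Here `deleted` would be the set of ids read from delnodes.dmp, one per record.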
citations.dmp
-------------
Citations file fields:
cit_id -- the unique id of citation
cit_key -- citation key
medline_id -- unique id in MedLine database (0 if not in MedLine)
pubmed_id -- unique id in PubMed database (0 if not in PubMed)
url -- URL associated with citation
text -- any text (usually article name and authors)
-- The following characters are escaped in this text by a backslash:
-- newline (appear as "\n"),
-- tab character ("\t"),
-- double quotes ('\"'),
-- backslash character ("\\").
taxid_list -- list of node ids separated by a single space
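The escaping rules for the text field and the layout of taxid_list can be
undone as sketched below. The function names are hypothetical. An escape
sequence not listed above is left unchanged, which is a design choice of this
sketch rather than documented behaviour.

```python
import re

# Character after the backslash -> the character it stands for.
_ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}


def unescape_citation_text(text):
    """Undo the backslash escapes listed for the citations.dmp text field."""
    return re.sub(
        r"\\(.)",
        lambda match: _ESCAPES.get(match.group(1), match.group(0)),
        text,
    )


def parse_taxid_list(field):
    """Turn the space-separated taxid_list field into a list of ints."""
    return [int(token) for token in field.split()]
```

The substitution scans left to right, so an escaped backslash followed by the
letter n is decoded as a backslash and an "n", not as a newline.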