To train a NER tagger, you must supply the word embeddings, their vocabulary, and a training corpus annotated in tab-separated format, one token per line. The last field must be the NE tag in IOB notation, and sentences must be separated by an empty line. Word embeddings are accepted in three formats:
- SENNA, two separate files: lowercased vocabulary and embeddings
- polyglot (word2vectors), two separate files: vocabulary and embeddings
- word2vec, single file, containing initial line with counts and size, and then one word per line followed by its weights
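For instance, the single-file word2vec text format looks like this (a tiny illustrative sample with 3 words and 4 dimensions; the weights are made up):

```
3 4
the 0.12 -0.05 0.33 0.07
of 0.08 0.21 -0.14 0.02
cat -0.30 0.11 0.09 -0.25
```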
Optionally, you can supply a gazetteer containing a list of entities for each category. The gazetteer format is one entity per line, preceded by its class name and a tab, like this:
```
LOC	abita springs
LOC	acqui terme
PER	abraham lincoln
PER	abraham
```
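An annotated training file, as described above, contains one token per line with the IOB tag in its last tab-separated field and an empty line between sentences; for example (illustrative data):

```
John	B-PER
Smith	I-PER
visited	O
New	B-LOC
York	I-LOC
.	O

He	O
liked	O
it	O
.	O
```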
You can also optionally use word suffixes as features. Training is invoked like this:
```
bin/dl-ner.py ner.dnn -t train+dev \
    --vocab vocab.txt --vectors vectors.txt \
    --caps --suffix --suffixes --gazetteer eng.list \
    -e 40 -l 0.01 -w 5 -n 300 -v
```
You can invoke the same script for tagging a file:

```
dl-ner.py ner.dnn < input
```

where `ner.dnn` is a model produced by training and `input` is a file containing one token per line, with an empty line separating sentences.
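The tagger's input must already be tokenized into this one-token-per-line layout. A minimal sketch for producing such a file from plain text, using naive whitespace/punctuation tokenization (a real pipeline would use a proper tokenizer; the function name is hypothetical):

```python
import re

def to_tagger_input(text):
    """Return dl-ner.py input: one token per line, blank line after each sentence."""
    lines = []
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r'(?<=[.!?])\s+', text.strip()):
        # Split each sentence into words and punctuation tokens.
        tokens = re.findall(r"\w+|[^\w\s]", sentence)
        lines.extend(tokens)
        lines.append("")  # empty line terminates the sentence
    return "\n".join(lines)

print(to_tagger_input("John Smith visited New York. He liked it."))
```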
The full invocation options are:
```
usage: dl-ner.py [-h] [-c FILE] [-w WINDOW] [-s EMBEDDINGS_SIZE]
                 [-e ITERATIONS] [-l LEARNING_RATE] [-n HIDDEN]
                 [--threads THREADS] [-t TRAIN] [-o OUTPUT] [--caps [CAPS]]
                 [--suffix [SUFFIX]] [--suffixes SUFFIXES] [--prefix [PREFIX]]
                 [--prefixes PREFIXES] [--gazetteer GAZETTEER] [--gsize GSIZE]
                 [--vocab VOCAB] [--vectors VECTORS] [--min-occurr MINOCCURR]
                 [--load LOAD] [--variant VARIANT] [-v]
                 model

positional arguments:
  model                 Model file to train/use.

optional arguments:
  -h, --help            show this help message and exit
  -c FILE, --config FILE
                        Specify config file
  -w WINDOW, --window WINDOW
                        Size of the word window (default 5)
  -s EMBEDDINGS_SIZE, --embeddings-size EMBEDDINGS_SIZE
                        Number of features per word (default 50)
  -e ITERATIONS, --epochs ITERATIONS
                        Number of training epochs (default 100)
  -l LEARNING_RATE, --learning_rate LEARNING_RATE
                        Learning rate for network weights (default 0.001)
  -n HIDDEN, --hidden HIDDEN
                        Number of hidden neurons (default 200)
  --threads THREADS     Number of threads (default 1)
  -t TRAIN, --train TRAIN
                        File with annotated data for training.
  -o OUTPUT, --output OUTPUT
                        File where to save embeddings
  --caps [CAPS]         Include capitalization features. Optionally, supply
                        the number of features (default 5)
  --suffix [SUFFIX]     Include suffix features. Optionally, supply the number
                        of features (default 5)
  --suffixes SUFFIXES   Load suffixes from this file
  --prefix [PREFIX]     Include prefix features. Optionally, supply the number
                        of features (default 0)
  --prefixes PREFIXES   Load prefixes from this file
  --gazetteer GAZETTEER
                        Load gazetteer from this file
  --gsize GSIZE         Size of gazetteer features (default 5)
  --vocab VOCAB         Vocabulary file, either read or created
  --vectors VECTORS     Embeddings file, either read or created
  --min-occurr MINOCCURR
                        Minimum occurrences for inclusion in vocabulary
  --load LOAD           Load previously saved model
  --variant VARIANT     Either "senna" (default), "polyglot", "word2vec" or
                        "gensim".
  -v, --verbose         Verbose mode
```