Giuseppe Attardi edited this page May 12, 2015 · 3 revisions

Training

To train a NER tagger, you must supply the word embeddings, their vocabulary, and a training corpus annotated in tab-separated format, one token per line. The last field of each line must be the NE tag in IOB notation, and sentences must be separated by an empty line. Word embeddings are accepted in three formats:

  1. SENNA, two separate files: lowercased vocabulary and embeddings
  2. polyglot (word2vectors), two separate files: vocabulary and embeddings
  3. word2vec, single file, containing initial line with counts and size, and then one word per line followed by its weights
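As an illustration of format 3, a word2vec-style text file could be parsed with a short sketch like this (the function name is hypothetical, not part of DeepNL):

```python
def load_word2vec_text(path):
    """Parse embeddings in word2vec text format: a header line with
    '<count> <size>', then one word per line followed by its weights."""
    with open(path) as f:
        count, size = map(int, f.readline().split())
        vocab, vectors = [], []
        for line in f:
            parts = line.rstrip().split(' ')
            vocab.append(parts[0])
            vectors.append([float(x) for x in parts[1:]])
    # Sanity-check the header against the actual contents.
    assert len(vocab) == count and all(len(v) == size for v in vectors)
    return vocab, vectors
```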

Optionally, you can supply a gazetteer containing a list of entities for each category. The format is one entity per line, with the class name at the beginning, separated by a tab, like this:

LOC     abita springs
LOC     acqui terme
PER     abraham lincoln
PER     abraham
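Such a file can be loaded into a per-class lookup table with a few lines of Python (a sketch; DeepNL has its own internal loader):

```python
from collections import defaultdict

def load_gazetteer(path):
    """Read a gazetteer file: one entity per line, with the class name
    first, separated from the entity by a tab."""
    entries = defaultdict(set)
    with open(path) as f:
        for line in f:
            tag, entity = line.rstrip('\n').split('\t', 1)
            entries[tag].add(entity)
    return entries
```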

You can also optionally use word suffixes as features. Training is invoked like this:

bin/dl-ner.py ner.dnn -t train+dev \
  --vocab vocab.txt --vectors vectors.txt \
  --caps --suffix --gazetteer eng.list \
  -e 40 -l 0.01 -w 5 -n 300 -v
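The training corpus supplied with -t is expected to look like this (a made-up example in the style of CoNLL-2003; the NE tag is the last tab-separated field, in IOB notation):

```
Wolff	B-PER
,	O
currently	O
a	O
journalist	O
in	O
Argentina	B-LOC
,	O
```

An empty line marks the end of each sentence.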

Tagging

You can invoke the same script for tagging a file:

dl-ner.py ner.dnn < input

where ner.dnn is a model produced by training and input is a file containing one token per line with an empty line to separate sentences.
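For example, a whitespace-separated sentence can be turned into the expected one-token-per-line input with a naive shell pipeline (real text needs a proper tokenizer to split off punctuation):

```shell
# Naive whitespace tokenization: one token per line.
echo "John lives in New York ." | tr ' ' '\n' > input
# then tag it with:  dl-ner.py ner.dnn < input
```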

Usage

The full invocation options are:

usage: dl-ner.py [-h] [-c FILE] [-w WINDOW] [-s EMBEDDINGS_SIZE]
                 [-e ITERATIONS] [-l LEARNING_RATE] [-n HIDDEN]
                 [--threads THREADS] [-t TRAIN] [-o OUTPUT] [--caps [CAPS]]
                 [--suffix [SUFFIX]] [--suffixes SUFFIXES] [--prefix [PREFIX]]
                 [--prefixes PREFIXES] [--gazetteer GAZETTEER] [--gsize GSIZE]
                 [--vocab VOCAB] [--vectors VECTORS] [--min-occurr MINOCCURR]
                 [--load LOAD] [--variant VARIANT] [-v]
                 model

positional arguments:
  model                 Model file to train/use.

optional arguments:
  -h, --help            show this help message and exit
  -c FILE, --config FILE
                        Specify config file
  -w WINDOW, --window WINDOW
                        Size of the word window (default 5)
  -s EMBEDDINGS_SIZE, --embeddings-size EMBEDDINGS_SIZE
                        Number of features per word (default 50)
  -e ITERATIONS, --epochs ITERATIONS
                        Number of training epochs (default 100)
  -l LEARNING_RATE, --learning_rate LEARNING_RATE
                        Learning rate for network weights (default 0.001)
  -n HIDDEN, --hidden HIDDEN
                        Number of hidden neurons (default 200)
  --threads THREADS     Number of threads (default 1)
  -t TRAIN, --train TRAIN
                        File with annotated data for training.
  -o OUTPUT, --output OUTPUT
                        File where to save embeddings
  --caps [CAPS]         Include capitalization features. Optionally, supply
                        the number of features (default 5)
  --suffix [SUFFIX]     Include suffix features. Optionally, supply the number
                        of features (default 5)
  --suffixes SUFFIXES   Load suffixes from this file
  --prefix [PREFIX]     Include prefix features. Optionally, supply the number
                        of features (default 0)
  --prefixes PREFIXES   Load prefixes from this file
  --gazetteer GAZETTEER
                        Load gazetteer from this file
  --gsize GSIZE         Size of gazetteer features (default 5)
  --vocab VOCAB         Vocabulary file, either read or created
  --vectors VECTORS     Embeddings file, either read or created
  --min-occurr MINOCCURR
                        Minimum occurrences for inclusion in vocabulary
  --load LOAD           Load previously saved model
  --variant VARIANT     Either "senna" (default), "polyglot", "word2vec" or
                        "gensym".
  -v, --verbose         Verbose mode