Some Entity Recognition models for 2019 Datagrand Cup: Text Information Extraction Challenge.
- python 3.6
- keras 2.2.4 (tensorflow backend)
- keras-contrib 2.0.8 for CRF inference.
- gensim for training word2vec.
- bilm-tf for ELMo.
Each model is a combination of three components:
- Word representation
  - Static Word Embedding: word2vec, GloVe
  - Contextualized Word Representation: ELMo (`_elmo`), see the ELMo training notes below
- Encoder
  - BiLSTM
  - DGCNN
- Decoder
  - sequence labeling (`sequence_labeling.py`)
    - CRF
    - softmax
  - predict start/end index of entities (`_pointer`)
According to the three components described above, there are 12 possible models in all. However, this repo only implements the following 6 models:
- Static Word Embedding × (BiLSTM, DGCNN) × (CRF, softmax): `sequence_labeling.py`
- (Static Word Embedding, ELMo) × BiLSTM × pointer: `bilstm_pointer.py` and `bilstm_pointer_elmo.py`
Other models can be implemented by adding or modifying a small amount of code; the sketch below shows the general layout of one implemented combination.
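A minimal sketch, not the repo's exact code, of the static word embedding × BiLSTM × CRF combination with the keras / keras-contrib versions listed above; all layer sizes and the tag count are illustrative placeholders.

```python
from keras.models import Model
from keras.layers import Input, Embedding, Bidirectional, LSTM, TimeDistributed, Dense
from keras_contrib.layers import CRF

# Hypothetical sizes, not taken from config.py.
max_len, vocab_size, embed_dim, num_tags = 100, 20000, 256, 17

words = Input(shape=(max_len,), dtype='int32')
# Static word embedding; the weight matrix can be initialized from word2vec/GloVe vectors.
x = Embedding(vocab_size, embed_dim)(words)
# BiLSTM encoder over the whole sentence.
x = Bidirectional(LSTM(128, return_sequences=True))(x)
x = TimeDistributed(Dense(64, activation='relu'))(x)
# CRF decoding layer; swap in TimeDistributed(Dense(num_tags, activation='softmax'))
# for the softmax variant.
crf = CRF(num_tags)
out = crf(x)

model = Model(words, out)
model.compile(optimizer='adam', loss=crf.loss_function, metrics=[crf.accuracy])
model.summary()
```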
1. Prepare data:
   - download the official competition data to the `data` folder
   - get sequence tagging train/dev/test data: `bin/trans_data.py`
   - prepare `vocab` and `tag`:
     - `vocab`: word vocabulary, one word per line, in `word word_count` format
     - `tag`: BIOES NER tag list, one tag per line (`O` on the first line)
     - a small loading sketch for both files is shown after this step
   - then follow step 2 or 3 below:
     - step 2 is for models using static word embeddings
     - step 3 is for the model using ELMo
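   A minimal loading sketch for the two files, assuming the layout described above (paths and example contents are assumptions, not the repo's code):

   ```python
   # data/vocab : one "<word> <word_count>" pair per line, e.g. "17281 9354"
   # data/tag   : one BIOES tag per line, "O" first, e.g. O, B-a, I-a, E-a, S-a, ...

   def load_word_vocab(path):
       word2id = {}
       with open(path, encoding='utf-8') as f:
           for line in f:
               word, _count = line.split()   # word and its corpus frequency
               word2id[word] = len(word2id)
       return word2id

   def load_tag_vocab(path):
       with open(path, encoding='utf-8') as f:
           tags = [line.strip() for line in f if line.strip()]
       return {tag: i for i, tag in enumerate(tags)}

   word2id = load_word_vocab('data/vocab')
   tag2id = load_tag_vocab('data/tag')       # tag2id['O'] == 0
   ```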
2. Run a model with static word embeddings, taking word2vec as an example (a gensim training sketch is shown after this step):
   - train word2vec: `bin/train_w2v.py`
   - modify `config.py`
   - run `python sequence_labeling.py [bilstm/dgcnn] [softmax/crf]` or `python bilstm_pointer.py` (remember to modify `config.model_name` before a new run, or the old model will be overwritten)
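   A minimal sketch of what `bin/train_w2v.py` might do with gensim 3.x; the corpus path and hyper-parameters are placeholders, not the script's actual settings:

   ```python
   from gensim.models import Word2Vec

   # Assume one whitespace-separated, already-anonymized sentence per line.
   sentences = [line.split() for line in open('data/corpus.txt', encoding='utf-8')]

   # gensim 3.x argument names; `size` / `iter` became `vector_size` / `epochs` in gensim 4.
   w2v = Word2Vec(sentences, size=256, window=5, min_count=2, sg=1, workers=4, iter=10)
   w2v.wv.save_word2vec_format('data/w2v_256.txt', binary=False)
   ```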
3. Or run the model with ELMo embeddings (dump the contextualized representation of every train/dev/test sentence to file first, then load those files during training and evaluation instead of running ELMo on the fly; a loading sketch is shown after this step):
   - follow the instructions described here to get contextualized sentence representations for the `train_full/dev/test` data from the pre-trained ELMo weights
   - modify `config.py`
   - run `python bilstm_pointer_elmo.py`
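   A minimal sketch of loading the dumped representations before training, assuming an HDF5 file with one dataset per sentence keyed by its line index (the layout written by the bilm-tf dump helpers); the file name is a placeholder:

   ```python
   import h5py
   import numpy as np

   def load_elmo_dump(path):
       """Return a list of per-sentence ELMo activations, in corpus order."""
       sentences = []
       with h5py.File(path, 'r') as f:
           for i in range(len(f.keys())):
               # Typically shaped (n_layers, seq_len, dim); average or weight the
               # layers before feeding them to the BiLSTM-pointer model.
               sentences.append(np.asarray(f[str(i)]))
       return sentences

   train_elmo = load_elmo_dump('data/elmo_train_full.hdf5')
   ```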
To train ELMo on the competition corpus:
- Just follow the official bilm-tf instructions described here.
- Some notes:
  - to train a token-level language model, modify `bin/train_elmo.py`: change `vocab = load_vocab(args.vocab_file, 50)` to `vocab = load_vocab(args.vocab_file, None)`
  - modify `n_train_tokens`
  - remove `char_cnn` in `options`
  - modify `lstm.dim` / `lstm.projection_dim` as you wish. This run used `n_gpus=2`, `n_train_tokens=94114921`, `lstm['dim']=2048`, `projection_dim=256`, `n_epochs=10`; training took about 17 hours on 2 GTX 1080 Ti. A sketch of the resulting `options` dict is shown below.
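  A sketch of what the token-level `options` dict in bilm-tf's `bin/train_elmo.py` could look like after the edits above; only the quoted values come from this README, the remaining fields keep bilm-tf's defaults and may differ from the actual run:

  ```python
  options = {
      'bidirectional': True,
      # 'char_cnn' removed entirely: this trains a token-level (word-id) language model.
      'dropout': 0.1,
      'lstm': {
          'cell_clip': 3,
          'dim': 2048,               # quoted above
          'n_layers': 2,
          'proj_clip': 3,
          'projection_dim': 256,     # quoted above
          'use_skip_connections': True,
      },
      'all_clip_norm_val': 10.0,
      'n_epochs': 10,                # quoted above
      'n_train_tokens': 94114921,    # quoted above
      'batch_size': 128,
      'n_tokens_vocab': vocab.size,  # vocab from load_vocab(args.vocab_file, None)
      'unroll_steps': 20,
      'n_negative_samples_batch': 8192,
  }
  ```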
- After finishing the last step of the instructions, you can refer to the script `dump_token_level_bilm_embeddings.py` to dump the dynamic sentence representations of your own dataset; a minimal sketch is shown below.
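  A minimal sketch, in the spirit of `dump_token_level_bilm_embeddings.py`, of dumping per-sentence representations with the bilm package; all paths are placeholders and the actual script may differ:

  ```python
  from bilm import dump_bilm_embeddings

  vocab_file   = 'data/vocab.txt'           # token vocabulary used to train ELMo
  options_file = 'checkpoint/options.json'  # written when the checkpoint is converted
  weight_file  = 'checkpoint/weights.hdf5'
  dataset_file = 'data/train_full.txt'      # one whitespace-separated sentence per line

  # Writes one HDF5 dataset per sentence (keyed by its line index) containing
  # the bi-LM layer activations; repeat for the dev and test files.
  dump_bilm_embeddings(vocab_file, dataset_file, options_file, weight_file,
                       'data/elmo_train_full.hdf5')
  ```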
- Blog: 《基于CNN的阅读理解式问答模型:DGCNN》 (a CNN-based reading-comprehension-style QA model: DGCNN)
- Blog: 《基于DGCNN和概率图的轻量级信息抽取模型》 (a lightweight information extraction model based on DGCNN and probabilistic graphs)
- Named entity recognition tutorial: Named entity recognition series
- Some reference code
- Sequence Evaluation tools: seqeval
- Neural Sequence Labeling Toolkit: NCRF++
- Contextualized Word Representation: ELMo