These are implementations of various LSTM-based language models using TensorFlow. The code is based on the TensorFlow tutorial for building a PTB LSTM model. Extensions are added to handle input at the subword level, e.g. characters, character ngrams, and morpheme segments (from BPE/Morfessor).
- TensorFlow (tested on v0.10.0)
- Python 3
Use the train.py script to train a model. Below is an example of training a character bi-LSTM model for English.
python3 train.py --train_file=data/multi/en/train.txt \
--dev_file=data/multi/en/dev.txt \
--save_dir=model \
--unit=char \
--composition=bi-lstm \
--rnn_size=200 \
--batch_size=32 \
--num_steps=20 \
--learning_rate=1.0 \
--decay_rate=0.5 \
--keep_prob=0.5 \
--lowercase \
--SOS=true
Options for units are: char, char-ngram, morpheme (BPE/Morfessor), oracle, and word.
Options for composition are: none (word only), bi-lstm, and addition.
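For illustration, here is a minimal numpy sketch of the idea behind the bi-lstm composition: the character (or segment) embeddings of a word are run through a forward and a backward LSTM, and the two final states are combined into a single word embedding. This is a sketch only; the actual model uses TensorFlow LSTM cells, and the way the two final states are combined may differ (e.g. a learned linear layer). All names and shapes below are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_final_state(inputs, W, U, b):
    """Run a plain LSTM over `inputs` (seq_len x emb_dim) and return the
    final hidden state. W: (emb_dim, 4*hidden), U: (hidden, 4*hidden),
    b: (4*hidden,). Gate order assumed: input, forget, candidate, output."""
    hidden = U.shape[0]
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in inputs:
        z = x @ W + h @ U + b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

def bilstm_compose(char_embs, fw_params, bw_params):
    """Word embedding = concatenation of the final state of a forward LSTM
    over the characters and a backward LSTM over the reversed characters."""
    h_fw = lstm_final_state(char_embs, *fw_params)
    h_bw = lstm_final_state(char_embs[::-1], *bw_params)
    return np.concatenate([h_fw, h_bw])
```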
The morpheme unit uses a BPE-like representation: each word is replaced by its segments, for example imperfect is written as im@@perfect, where @@ denotes the segment boundary. You can use the segmentation tool provided here to preprocess your dataset.
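As a small illustration of the format (the helper below is hypothetical, not part of this repo), the segments of a marked token can be recovered by splitting on the boundary symbol:

```python
def bpe_segments(token, marker="@@"):
    """Split a BPE-marked token into its segments:
    bpe_segments('im@@perfect') -> ['im', 'perfect']"""
    return token.split(marker)
```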
In the oracle setting, you need to replace each word in the data with its morphological analysis. For example, the Czech word Dodavatel is replaced by the following (note that the actual word form is not used in the experiments):
word:Dodavatel+lemma:dodavatel+pos:NOUN+Animacy:Anim+Case:Nom+Gender:Masc+Negative:Pos+Number:Sing
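A minimal sketch of how such an analysis string could be parsed into attribute/value pairs; the helper name is hypothetical and the format is inferred from the example above:

```python
def parse_oracle(token):
    """Turn 'word:Dodavatel+lemma:dodavatel+pos:NOUN+...' into a dict,
    e.g. {'word': 'Dodavatel', 'lemma': 'dodavatel', 'pos': 'NOUN', ...}."""
    return dict(field.split(":", 1) for field in token.split("+"))
```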
Please look at train.py for more hyperparameter options.
To test a model, run test.py.
python3 test.py --test_file=data/multi/en/test.txt \
--save_dir=model
Character-based bi-LSTM model:
"Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation".
http://www.cs.cmu.edu/~lingwang/papers/emnlp2015.pdf
Word segments (BPE) model:
"Neural Machine Translation of Rare Words with Subword Units"
http://www.aclweb.org/anthology/P16-1162
Character ngrams:
The model first segments a word into its character ngrams, e.g. cat = ('c', 'a', 't', '^c', 'ca', 'at', 't$'), where ^ and $ mark the word boundary. The embedding of the word is computed by summing the embeddings of all its ngrams.
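The following is a minimal numpy sketch of this additive composition, reproducing the example above (unigrams plus boundary-marked bigrams; the actual model may use higher-order ngrams as well). The `ngram_embs` lookup and the helper names are illustrative assumptions, not the repo's actual API:

```python
import numpy as np

def char_ngrams(word):
    """Unigrams of the bare word plus bigrams over the boundary-padded word:
    'cat' -> ['c', 'a', 't', '^c', 'ca', 'at', 't$']."""
    padded = "^" + word + "$"
    return list(word) + [padded[i:i + 2] for i in range(len(padded) - 1)]

def additive_embedding(word, ngram_embs):
    """Addition composition: sum the embeddings of all ngrams of the word.
    `ngram_embs` is assumed to map ngram strings to vectors."""
    return np.sum([ngram_embs[ng] for ng in char_ngrams(word)], axis=0)
```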