Skip to content

LSTM Language Model with Subword Units Input Representations

Notifications You must be signed in to change notification settings

Nokeli/subword-lstm-lm

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

LSTM Language Model with Subword Units Input Representations

This are implementations of various LSTM-based language models using Tensorflow. Codes are based on tensorflow tutorial on building a PTB LSTM model. Some extensions are made to handle input from subword units level, i.e. characters, character ngrams, morpheme segments (i.e. from BPE/Morfessor).

Dependencies

  1. Tensorflow (tested on v0.10.0)
  2. Python 3

Training

Use the script train.py to train a model. Below is an example to train a character bi-LSTM model for English.

python3 train.py --train_file=data/multi/en/train.txt \
					--dev_file=data/multi/en/dev.txt \
					--save_dir=model \
					--unit=char \
					--composition=bi-lstm \
					--rnn_size=200 \
					--batch_size=32 \
					--num_steps=20 \
					--learning_rate=1.0 \
					--decay_rate=0.5 \
					--keep_prob=0.5 \
					--lowercase \
					--SOS=true

Options for units are: char, char-ngram, morpheme (BPE/Morfessor), oracle, and word.
Options for compositions are: none (word only), bi-lstm, and addition.

The morpheme representation uses BPE-like representation. Each word is replaced by its word segments, for example imperfect is written as im@@perfect, where @@ denotes the segment boundary. You can use the segmentation tool provided in here to preprocess your dataset.

In the oracle setting, you need to replace each word in the data with its morphological analysis. For example, in Czech the word Dodavatel is replaced by the following (note that the actual word form is not used for experiment):

word:Dodavatel+lemma:dodavatel+pos:NOUN+Animacy:Anim+Case:Nom+Gender:Masc+Negative:Pos+Number:Sing

Please look at train.py for more hyperparameter options.

Testing

To test a model, run test.py.

python3 test.py --test_file=data/multi/en/test.txt \
					--save_dir=model

Notes

Character-based bi-LSTM model:
"Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation".
http://www.cs.cmu.edu/~lingwang/papers/emnlp2015.pdf

Word segments (BPE) model:
"Neural Machine Translation of Rare Words with Subword Units"
http://www.aclweb.org/anthology/P16-1162

Character ngrams:
The model first segments word into its character ngrams, e.g. cat = ('c', 'a', 't', '^c', 'ca', 'at', 't$). The embedding of the word is computed by summing up all the ngrams embeddings of the word.

About

LSTM Language Model with Subword Units Input Representations

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%