Language modeling is the task of predicting the next word or character in a document.
* indicates models using dynamic evaluation, where, at test time, a model may adapt to seen tokens in order to improve performance on following tokens (Mikolov et al., 2010; Krause et al., 2017).
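As a toy illustration of adapting to seen tokens at test time, the sketch below scores a sequence with a smoothed unigram model that optionally updates its counts after each token is scored. This is not the gradient-based procedure of Krause et al. (2017); the count-based model, the function name `unigram_nll`, and the add-one smoothing are all assumptions chosen to keep the example small.

```python
import math
from collections import Counter


def unigram_nll(tokens, vocab_size, adapt):
    """Average negative log-likelihood (in nats) under an add-one-smoothed
    unigram model. If `adapt` is True, the counts are updated with each
    token only after that token has been scored."""
    counts = Counter()
    total = 0
    nll = 0.0
    for tok in tokens:
        # Score the token using only what has been seen so far.
        prob = (counts[tok] + 1) / (total + vocab_size)
        nll -= math.log(prob)
        if adapt:
            # Adapt to the token that was just scored.
            counts[tok] += 1
            total += 1
    return nll / len(tokens)
```

On a repetitive sequence, the adapting variant assigns higher probability to later tokens than the static one, which is the effect the asterisked models rely on.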
A common evaluation dataset for language modeling is the Penn Treebank,
as pre-processed by Mikolov et al. (2011).
The dataset consists of 929k training words, 73k validation words, and
82k test words. As part of the pre-processing, words were lower-cased, numbers
were replaced with `N`, newlines were replaced with `<eos>`,
and all other punctuation was removed. The vocabulary is
the most frequent 10k words, with the rest of the tokens replaced by an `<unk>`
token.
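
The pre-processing steps described above can be approximated with the following minimal sketch. It is not the original script: the helper names (`normalize_line`, `build_vocab`, `replace_rare`), the number pattern, and the choice to strip only leading and trailing punctuation are assumptions made for illustration.

```python
import re
import string
from collections import Counter

# Assumed pattern for a number: digits with optional . or , groups.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def normalize_line(line):
    """Lower-case a line, map numbers to N, drop punctuation, append <eos>."""
    tokens = []
    for raw in line.lower().split():
        word = raw.strip(string.punctuation)  # simplification: edges only
        if not word:
            continue  # the token was pure punctuation
        tokens.append("N" if NUMBER.fullmatch(word) else word)
    tokens.append("<eos>")  # stands in for the newline
    return tokens


def build_vocab(tokens, size=10_000):
    """Keep the `size` most frequent tokens."""
    return {tok for tok, _ in Counter(tokens).most_common(size)}


def replace_rare(tokens, vocab):
    """Map every out-of-vocabulary token to <unk>."""
    return [tok if tok in vocab else "<unk>" for tok in tokens]
```

In practice the vocabulary would be built from the training split only and then applied unchanged to the validation and test splits.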
Models are evaluated based on perplexity, which is the exponentiated average
negative per-word log-probability (lower is better).
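
As a minimal sketch, perplexity can be computed from the natural-log probabilities a model assigns to each target word by exponentiating the average negative log-probability. The function name `perplexity` and the list-of-floats input are assumptions for illustration.

```python
import math


def perplexity(log_probs):
    """Perplexity from per-word natural-log probabilities.

    `log_probs[i]` is log p(w_i | w_1 .. w_{i-1}) under the model.
    """
    if not log_probs:
        raise ValueError("need at least one log-probability")
    avg_nll = -sum(log_probs) / len(log_probs)  # average negative log-likelihood
    return math.exp(avg_nll)
```

A model that assigns probability 0.25 to every word has perplexity 4, which matches the intuition of choosing uniformly among four words at each step.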
Model | Validation perplexity | Test perplexity | Number of params | Paper / Source | Code |
---|---|---|---|---|---|
FRAGE + AWD-LSTM-MoS + dynamic eval (Gong et al., 2018) | 47.38 | 46.54 | 22M | FRAGE: Frequency-Agnostic Word Representation | Official |
AWD-LSTM-DOC x5 (Takase et al., 2018) | 48.63 | 47.17 | 185M | Direct Output Connection for a High-Rank Language Model | Official |
AWD-LSTM-MoS + dynamic eval (Yang et al., 2018)* | 48.33 | 47.69 | 22M | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | Official |
AWD-LSTM + dynamic eval (Krause et al., 2017)* | 51.6 | 51.1 | 24M | Dynamic Evaluation of Neural Sequence Models | Official |
AWD-LSTM + continuous cache pointer (Merity et al., 2017)* | 53.9 | 52.8 | 24M | Regularizing and Optimizing LSTM Language Models | Official |
AWD-LSTM-DOC (Takase et al., 2018) | 54.12 | 52.38 | 23M | Direct Output Connection for a High-Rank Language Model | Official |
Trellis Network (Bai et al., 2019) | - | 54.19 | 34M | Trellis Networks for Sequence Modeling | Official |
AWD-LSTM-MoS + finetune (Yang et al., 2018) | 56.54 | 54.44 | 22M | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | Official |
Transformer-XL (Dai et al., 2018) under review | 56.72 | 54.52 | 24M | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
AWD-LSTM-MoS (Yang et al., 2018) | 58.08 | 55.97 | 22M | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | Official |
AWD-LSTM 3-layer with Fraternal dropout (Zołna et al., 2018) | 58.9 | 56.8 | 24M | Fraternal dropout | Official |
AWD-LSTM (Merity et al., 2017) | 60.0 | 57.3 | 24M | Regularizing and Optimizing LSTM Language Models | Official |
WikiText-2 has been proposed as a more realistic benchmark for language modeling than the pre-processed Penn Treebank. WikiText-2 consists of around 2 million words extracted from Wikipedia articles.
Model | Validation perplexity | Test perplexity | Number of params | Paper / Source | Code |
---|---|---|---|---|---|
FRAGE + AWD-LSTM-MoS + dynamic eval (Gong et al., 2018) | 40.85 | 39.14 | 35M | FRAGE: Frequency-Agnostic Word Representation | Official |
AWD-LSTM-MoS + dynamic eval (Yang et al., 2018)* | 42.41 | 40.68 | 35M | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | Official |
AWD-LSTM + dynamic eval (Krause et al., 2017)* | 46.4 | 44.3 | 33M | Dynamic Evaluation of Neural Sequence Models | Official |
AWD-LSTM + continuous cache pointer (Merity et al., 2017)* | 53.8 | 52.0 | 33M | Regularizing and Optimizing LSTM Language Models | Official |
AWD-LSTM-DOC x5 (Takase et al., 2018) | 54.19 | 53.09 | 185M | Direct Output Connection for a High-Rank Language Model | Official |
AWD-LSTM-DOC (Takase et al., 2018) | 60.29 | 58.03 | 37M | Direct Output Connection for a High-Rank Language Model | Official |
AWD-LSTM-MoS (Yang et al., 2018) | 63.88 | 61.45 | 35M | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | Official |
AWD-LSTM 3-layer with Fraternal dropout (Zołna et al., 2018) | 66.8 | 64.1 | 34M | Fraternal dropout | Official |
AWD-LSTM (Merity et al., 2017) | 68.6 | 65.8 | 33M | Regularizing and Optimizing LSTM Language Models | Official |
The WikiText-103 corpus contains 267,735 unique words, and each word occurs at least three times in the training set.
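
A frequency threshold of this kind can be sketched as follows. The function names, the `min_count` parameter, and the use of an `<unk>` placeholder for rarer words are assumptions for illustration, not a description of the official tooling.

```python
from collections import Counter


def min_count_vocab(train_tokens, min_count=3):
    """Keep only tokens that occur at least `min_count` times in training."""
    counts = Counter(train_tokens)
    return {tok for tok, n in counts.items() if n >= min_count}


def apply_vocab(tokens, vocab, unk="<unk>"):
    """Replace tokens outside the vocabulary with a placeholder (assumed)."""
    return [tok if tok in vocab else unk for tok in tokens]
```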
Model | Validation perplexity | Test perplexity | Number of params | Paper / Source | Code |
---|---|---|---|---|---|
Transformer-XL Large (Dai et al., 2018) under review | 17.7 | 18.3 | 257M | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
Transformer with tied adaptive embeddings (Baevski and Auli, 2018) | 19.8 | 20.5 | 247M | Adaptive Input Representations for Neural Language Modeling | Link |
Transformer-XL Standard (Dai et al., 2018) under review | 23.1 | 24.0 | 151M | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
LSTM + Hebbian + Cache + MbPA (Rae et al., 2018) | 29.0 | 29.2 | | Fast Parametric Learning with Activation Memorization | |
Trellis Network (Bai et al., 2019) | - | 30.35 | 180M | Trellis Networks for Sequence Modeling | Official |
LSTM + Hebbian (Rae et al., 2018) | 34.1 | 34.3 | | Fast Parametric Learning with Activation Memorization | |
LSTM (Rae et al., 2018) | 36.0 | 36.4 | | Fast Parametric Learning with Activation Memorization | |
Gated CNN (Dauphin et al., 2016) | - | 37.2 | | Language Modeling with Gated Convolutional Networks | |
Neural cache model (size = 2,000) (Grave et al., 2017) | - | 40.8 | | Improving Neural Language Models with a Continuous Cache | Link |
Temporal CNN (Bai et al., 2018) | - | 45.2 | | Convolutional sequence modeling revisited | |
LSTM (Grave et al., 2017) | - | 48.7 | | Improving Neural Language Models with a Continuous Cache | Link |
The One-Billion Word benchmark is a large dataset derived from a news-commentary site. The dataset consists of 829,250,940 tokens over a vocabulary of 793,471 words. Importantly, sentences in this dataset are shuffled, and hence context is limited.
Model | Test perplexity | Number of params | Paper / Source | Code |
---|---|---|---|---|
Transformer-XL Large (Dai et al., 2018) under review | 21.8 | 0.8B | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
Transformer-XL Base (Dai et al., 2018) under review | 23.5 | 0.46B | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
Transformer with shared adaptive embeddings - Very large (Baevski and Auli, 2018) | 23.7 | 0.8B | Adaptive Input Representations for Neural Language Modeling | Link |
10 LSTM+CNN inputs + SNM10-SKIP (Jozefowicz et al., 2016) ensemble | 23.7 | 43B? | Exploring the Limits of Language Modeling | Official |
Transformer with shared adaptive embeddings (Baevski and Auli, 2018) | 24.1 | 0.46B | Adaptive Input Representations for Neural Language Modeling | Link |
Big LSTM+CNN inputs (Jozefowicz et al., 2016) | 30.0 | 1.04B | Exploring the Limits of Language Modeling | |
Gated CNN-14Bottleneck (Dauphin et al., 2017) | 31.9 | ? | Language Modeling with Gated Convolutional Networks | |
BIGLSTM baseline (Kuchaiev and Ginsburg, 2018) | 35.1 | 0.151B | Factorization tricks for LSTM networks | Official |
BIG F-LSTM F512 (Kuchaiev and Ginsburg, 2018) | 36.3 | 0.052B | Factorization tricks for LSTM networks | Official |
BIG G-LSTM G-8 (Kuchaiev and Ginsburg, 2018) | 39.4 | 0.035B | Factorization tricks for LSTM networks | Official |
The Hutter Prize Wikipedia dataset, also known as enwik8, is a byte-level dataset consisting of the first 100 million bytes of a Wikipedia XML dump. For simplicity we shall refer to it as a character-level dataset. Within these 100 million bytes are 205 unique tokens.
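
Results on this dataset are reported in bits per character (BPC). As a minimal sketch, BPC is the average negative log-probability per symbol expressed in base 2; since the data is byte-level, each "character" here is one byte. The function names below are assumptions for illustration.

```python
import math


def bits_per_character(log_probs):
    """BPC from per-symbol natural-log probabilities assigned by a model."""
    if not log_probs:
        raise ValueError("need at least one log-probability")
    # Dividing by ln(2) converts nats to bits.
    return -sum(log_probs) / (len(log_probs) * math.log(2))


def byte_tokens(text):
    """Byte-level tokenization: one token per UTF-8 byte (values 0..255)."""
    return list(text.encode("utf-8"))
```

A model that assigns probability 0.5 to every byte scores exactly 1 BPC, and a non-ASCII character such as "é" contributes two byte tokens rather than one.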
Model | Bits per character (BPC) | Number of params | Paper / Source | Code |
---|---|---|---|---|
24-layer Transformer-XL (Dai et al., 2018) under review | 0.99 | 277M | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
18-layer Transformer-XL (Dai et al., 2018) under review | 1.03 | 88M | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
12-layer Transformer-XL (Dai et al., 2018) under review | 1.06 | 41M | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
64-layer Character Transformer Model (Al-Rfou et al., 2018) | 1.06 | 235M | Character-Level Language Modeling with Deeper Self-Attention | |
mLSTM + dynamic eval (Krause et al., 2017)* | 1.08 | 46M | Dynamic Evaluation of Neural Sequence Models | Official |
12-layer Character Transformer Model (Al-Rfou et al., 2018) | 1.11 | 44M | Character-Level Language Modeling with Deeper Self-Attention | |
3-layer AWD-LSTM (Merity et al., 2018) | 1.232 | 47M | An Analysis of Neural Language Modeling at Multiple Scales | Official |
Large mLSTM +emb +WN +VD (Krause et al., 2017) | 1.24 | 46M | Multiplicative LSTM for sequence modelling | Official |
Large FS-LSTM-4 (Mujika et al., 2017) | 1.245 | 47M | Fast-Slow Recurrent Neural Networks | Official |
Large RHN (Zilly et al., 2016) | 1.27 | 46M | Recurrent Highway Networks | Official |
FS-LSTM-4 (Mujika et al., 2017) | 1.277 | 27M | Fast-Slow Recurrent Neural Networks | Official |
The text8 dataset is also derived from Wikipedia text, but has all XML removed and is lowercased so that it contains only the 26 letters of the English alphabet plus spaces.
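
The restricted alphabet can be illustrated with a simplified sketch. It covers only lowercasing and character filtering; XML removal is not shown, and the function name `to_text8_alphabet` is hypothetical rather than part of the original pre-processing script.

```python
import re


def to_text8_alphabet(text):
    """Lowercase and reduce text to the letters a-z plus single spaces."""
    lowered = text.lower()
    # Replace every run of characters outside a-z with one space.
    letters_only = re.sub(r"[^a-z]+", " ", lowered)
    return letters_only.strip()
```

With this alphabet, a character-level model predicts over only 27 symbols.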
Model | Bits per character (BPC) | Number of params | Paper / Source | Code |
---|---|---|---|---|
Transformer-XL Large (Dai et al., 2018) under review | 1.08 | 277M | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
64-layer Character Transformer Model (Al-Rfou et al., 2018) | 1.13 | 235M | Character-Level Language Modeling with Deeper Self-Attention | |
12-layer Character Transformer Model (Al-Rfou et al., 2018) | 1.18 | 44M | Character-Level Language Modeling with Deeper Self-Attention | |
mLSTM + dynamic eval (Krause et al., 2017)* | 1.19 | 45M | Dynamic Evaluation of Neural Sequence Models | Official |
Large mLSTM +emb +WN +VD (Krause et al., 2016) | 1.27 | 45M | Multiplicative LSTM for sequence modelling | Official |
Large RHN (Zilly et al., 2016) | 1.27 | 46M | Recurrent Highway Networks | Official |
LayerNorm HM-LSTM (Chung et al., 2017) | 1.29 | 35M | Hierarchical Multiscale Recurrent Neural Networks | |
BN LSTM (Cooijmans et al., 2016) | 1.36 | 16M | Recurrent Batch Normalization | Official |
Unregularised mLSTM (Krause et al., 2016) | 1.40 | 45M | Multiplicative LSTM for sequence modelling | Official |
In the character-level Penn Treebank dataset, the vocabulary of words is limited to 10,000, the same vocabulary as used in the word-level dataset. This vastly simplifies the task of character-level language modeling, as character transitions are limited to those found within the limited word-level vocabulary.
Model | Bits per character (BPC) | Number of params | Paper / Source | Code |
---|---|---|---|---|
Trellis Network (Bai et al., 2019) | 1.159 | 13.4M | Trellis Networks for Sequence Modeling | Official |
3-layer AWD-LSTM (Merity et al., 2018) | 1.175 | 13.8M | An Analysis of Neural Language Modeling at Multiple Scales | Official |
6-layer QRNN (Merity et al., 2018) | 1.187 | 13.8M | An Analysis of Neural Language Modeling at Multiple Scales | Official |
FS-LSTM-4 (Mujika et al., 2017) | 1.190 | 27M | Fast-Slow Recurrent Neural Networks | Official |
FS-LSTM-2 (Mujika et al., 2017) | 1.193 | 27M | Fast-Slow Recurrent Neural Networks | Official |
NASCell (Zoph & Le, 2016) | 1.214 | 16.3M | Neural Architecture Search with Reinforcement Learning | |
2-layer Norm HyperLSTM (Ha et al., 2016) | 1.219 | 14.4M | HyperNetworks | |