Perplexity is the standard way to intrinsically evaluate whether a prediction system is working. Perplexity measures how surprised the system is by what it actually sees, given what it expected to see based on what it knows.
What the prediction system expects is determined by the training data.
At any new token in the data, the language model outputs a probability for every possible type as a continuation, given the previously observed history and the transition probability matrix it learned from the training data.
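As a toy illustration of what that distribution over continuations looks like, here is a minimal bigram sketch (the corpus, counts and function name are all made up for the example):

```python
from collections import Counter, defaultdict

# Tiny made-up training corpus; real transition probabilities are
# estimated from a much larger corpus.
corpus = "the cat sat on the mat the cat ate".split()

# Count bigram transitions: how often each type follows each history word.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_token_distribution(history):
    """Probability of every observed type as a continuation of `history`."""
    counts = bigram_counts[history]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(next_token_distribution("the"))  # {'cat': 0.666..., 'mat': 0.333...}
```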
The higher the probability a model assigns to new, valid sentences, the better the language model and the lower the perplexity.
Perplexity arises from the interaction of the model and the test set. But really it is driven by the test set, since the model is just a fixed probability distribution (the transition matrix) that gets evaluated on that test set. This means you can say that perplexity is the inverse probability of the test set under the language model, normalized by the number of tokens (the more tokens there are, the lower the raw probability of the sequence). In maths this is:
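For a test set $W = w_1 w_2 \ldots w_N$:

$$
\mathrm{PP}(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}
$$

(the last step uses the chain rule to expand the joint probability into per-token conditional probabilities).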
So basically you have to normalize by the number of tokens (take the Nth root) to make perplexities comparable. And you can only compare the perplexity of models that use the same test set.
Since we’re taking the inverse probability, a lower perplexity indicates a better model.
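As a rough illustration, here is a minimal sketch of that computation in plain Python, assuming we already have the probability the model assigned to each token of the test sequence (log space is used to avoid numerical underflow on long sequences):

```python
import math

def perplexity(token_probs):
    """Perplexity from the model's probability for each test token, in order."""
    n = len(token_probs)
    log_prob = sum(math.log(p) for p in token_probs)
    # exp(-1/N * log P(w_1 ... w_N)) == P(w_1 ... w_N) ** (-1/N)
    return math.exp(-log_prob / n)

# A model that assigns higher probability to the test tokens
# gets a lower (better) perplexity.
print(perplexity([0.5, 0.5, 0.5]))  # 2.0
print(perplexity([0.1, 0.1, 0.1]))  # 10.0
```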
- Estimate the language model (states and transition matrix) on a training corpus (training data).
- Tune the model on a separate corpus (validation / development data).
- Test the model to check how it fits data it has never seen (test data); see the sketch after this list.
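A minimal sketch of that split, assuming a plain text file `corpus.txt` and hypothetical 80/10/10 proportions:

```python
# Hypothetical 80/10/10 split of a tokenized corpus into
# training, validation and test portions.
with open("corpus.txt") as f:
    tokens = f.read().split()

n = len(tokens)
train = tokens[: int(0.8 * n)]              # estimate the transition matrix here
dev = tokens[int(0.8 * n): int(0.9 * n)]    # tune smoothing / hyperparameters here
test = tokens[int(0.9 * n):]                # report perplexity here, exactly once
```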
NEVER test on the same data you trained on and NEVER validate on the test data! Learning is not remembering. If it were, it wouldn't be useful.
Again, you can only compare the perplexity of different models if you use the same test set, because otherwise the probability distributions are not over the same sequences. So the set of states (the vocabulary) has to be the same.
Perplexity is a per-token average, and it also depends on the size of the vocabulary: the more types the model has to spread probability over, the higher the perplexity tends to be.