Perplexity is the standard way to intrinsically evaluate whether a prediction system is working. Perplexity measures how surprised the system is by what it actually sees, given what it expected to see based on what it knows.
What the prediction system expects is determined by the training data.
At any new token in the data, the language model outputs a probability for every possible type as a continuation, given the previously observed history and the transition probability matrix it learned from the training data.
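As a toy illustration of what that distribution over continuations looks like, here is a minimal bigram sketch (the corpus, counts and function name are all made up for the example):

```python
from collections import Counter, defaultdict

# Tiny made-up training corpus; real transition probabilities are
# estimated from a much larger corpus.
corpus = "the cat sat on the mat the cat ate".split()

# Count bigram transitions: how often each type follows each history word.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_token_distribution(history):
    """Probability of every observed type as a continuation of `history`."""
    counts = bigram_counts[history]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(next_token_distribution("the"))  # {'cat': 0.666..., 'mat': 0.333...}
```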
The higher the probability a model assigns to new, valid sentences, the better the language model and the lower the perplexity.
Perplexity arises from the interaction of the model and the test set. But really it is driven by the test set, since the model is just a fixed probability distribution (the transition matrix) that gets evaluated on that test set. This means you can say that perplexity is the inverse probability of the test set under the language model, normalized by the number of tokens (the more tokens there are, the lower the raw probability of the sequence). In maths this is:
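For a test set $W = w_1 w_2 \ldots w_N$:

$$
\mathrm{PP}(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}
$$

(the last step uses the chain rule to expand the joint probability into per-token conditional probabilities).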
So basically you have to normalize by the number of tokens (take the Nth root) to make perplexities comparable. And you can only compare the perplexity of models that use the same test set.
Since we’re taking the inverse probability, a lower perplexity indicates a better model.
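As a rough illustration, here is a minimal sketch of that computation in plain Python, assuming we already have the probability the model assigned to each token of the test sequence (log space is used to avoid numerical underflow on long sequences):

```python
import math

def perplexity(token_probs):
    """Perplexity from the model's probability for each test token, in order."""
    n = len(token_probs)
    log_prob = sum(math.log(p) for p in token_probs)
    # exp(-1/N * log P(w_1 ... w_N)) == P(w_1 ... w_N) ** (-1/N)
    return math.exp(-log_prob / n)

# A model that assigns higher probability to the test tokens
# gets a lower (better) perplexity.
print(perplexity([0.5, 0.5, 0.5]))  # 2.0
print(perplexity([0.1, 0.1, 0.1]))  # 10.0
```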
- Estimate the language model (states and transition matrix) on a training corpus (training data).
- Tune the model on a separate corpus (validation / development data).
- Test the model to check how it fits data it has never seen (test data); see the sketch after this list.
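A minimal sketch of that split, assuming a plain text file `corpus.txt` and hypothetical 80/10/10 proportions:

```python
# Hypothetical 80/10/10 split of a tokenized corpus into
# training, validation and test portions.
with open("corpus.txt") as f:
    tokens = f.read().split()

n = len(tokens)
train = tokens[: int(0.8 * n)]              # estimate the transition matrix here
dev = tokens[int(0.8 * n): int(0.9 * n)]    # tune smoothing / hyperparameters here
test = tokens[int(0.9 * n):]                # report perplexity here, exactly once
```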
NEVER test on the same data you trained on and NEVER validate on the test data! Learning is not remembering. If it were, it wouldn't be useful.
Again, you can only compare the perplexity of different models if you use the same test set, because otherwise the probability distributions are not over the same sequences. So the set of states (the vocabulary) has to be the same.
Perplexity is a per-token average, and it also depends on the size of the vocabulary: the more types the model has to spread probability over, the higher the perplexity tends to be.