Lemmatizer FAQ #11685
spaCy has a number of different lemmatizer implementations, and which one is best for a given application can depend on many different requirements. This document will help you identify the source of lemmatizer errors, figure out which lemmatizer you're using, how it works, and how to change it or modify its output.
Why is the Lemmatizer Output Wrong?
There are a few typical reasons the lemmatizer may give incorrect output. Here's an overview of the most common ones.
- The part of speech or morphological features are wrong. Many lemmatizer modes depend on part of speech information (`token.pos_`), and if the POS is wrong, the lemma will also be wrong. A common case is when a common noun is recognized as a proper noun and not lemmatized. This is especially common in documents where upper and lower case are not used as in formal writing. If your lemmas are wrong, this is one of the first things to check (see the snippet after this list).
- The word isn't in the table for a lookup lemmatizer. While the tables used in lookup lemmatizers are extensive, they don't have everything, and they also may not capture newer words or slang. See the "Modifying the Output" section below for details on how to fix this.
- The word is in the lemma tables, but it's wrong. While it's rare, the lemma tables have many entries, and inevitably there are a few errors. In some cases this is due to including rare variant forms that get prioritized over more common forms. If you find a mistake like this, feel free to let us know in an issue, or you can fix it immediately by following the instructions in "Modifying the Output" below.
- The word doesn't fit the existing rules. For rule-based lemmatizers, it's possible for a word to be processed incorrectly by the rules. This can also be fixed by modifying the tables, though it works a bit differently from modifying the lookup table. See "Modifying the Output" for a fix.
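A quick way to check the POS is to print it and the morphological features next to the lemma. Here's a minimal sketch, assuming the `en_core_web_sm` pipeline is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# All-lowercase text is a typical case where tagging (and therefore
# lemmatization) can go wrong.
doc = nlp("i saw some geese by the river")
for token in doc:
    print(token.text, token.pos_, token.morph, token.lemma_)
```

If the POS or morphology is wrong, the fix is usually on the tagging side (or overriding the lemma with the `AttributeRuler`, described below) rather than in the lemmatizer itself.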
The Different Lemmatizers
spaCy has two lemmatizer components: the `Lemmatizer` is a rule-based lemmatizer with several modes, while the `EditTreeLemmatizer` is a trainable component that uses machine learning to predict lemmas.

The default Lemmatizer has two built-in modes:

- `lookup` uses lookup tables to find lemmas. By default it uses just one table, but for several languages, the default lemmatizer has an additional mode `pos_lookup` that looks up words in different tables based on part of speech.
- `rule` uses manually written transformation rules to create lemmas, but may fall back to lookup tables.

It is also possible to add custom modes; see "Custom Lemmatizer Modes" below for details.
Check Which Lemmatizer You're Using
If you're using a pretrained pipeline, check the model page for a mention of the "trainable lemmatizer", which corresponds to the `EditTreeLemmatizer`. If it's not mentioned, the pipeline contains the rule-based lemmatizer.
Given a pipeline with a `lemmatizer` component, you can check details like this:
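A minimal sketch, assuming a pipeline like `en_core_web_sm` is installed (any pipeline with a component named "lemmatizer" works the same way):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer")

# The class tells you which implementation you have
print(type(lemmatizer).__name__)  # "Lemmatizer" or "EditTreeLemmatizer"

# The rule-based Lemmatizer also exposes its mode
if hasattr(lemmatizer, "mode"):
    print(lemmatizer.mode)  # e.g. "rule", "lookup", or "pos_lookup"
```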
Modifying the Output
Sometimes you want to alter the output for a few words that the lemmatizer gets wrong. For modifying individual words, you can modify the tables. If you're using a lookup lemmatizer, modifying the output is simple, and can be done like this:
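A minimal sketch, assuming a blank English pipeline with a lookup-mode lemmatizer and the `spacy-lookups-data` package installed; the word "catz" is just a made-up example:

```python
import spacy

# Requires spacy-lookups-data so the lemma tables can be loaded
nlp = spacy.blank("en")
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
nlp.initialize()

# The lookup table is dict-like, so you can add or correct entries directly
lookup = lemmatizer.lookups.get_table("lemma_lookup")
lookup["catz"] = "cat"  # hypothetical slang form

doc = nlp("catz")
print(doc[0].lemma_)  # cat
```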
Note that the lemmatizer has a caching mechanism for lemmas, so if you need to modify the tables, you should reload your pipeline and then modify the tables before processing any text. To reload the pipeline, you can simply save it locally with `nlp.to_disk(path)` and then use `spacy.load(path)` to load it back.

For a rule-based lemmatizer, you can always update the exceptions table to explicitly specify a correct form. Depending on the exact kind of fix, it may be more appropriate to modify the index or rule tables (see "Lemmatizer Tables" further on in this post for details), but the exceptions table will always work.
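Here's a minimal sketch of adding an exception, assuming `en_core_web_sm` and using the made-up word "wugs" for illustration:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # uses the rule-based lemmatizer in "rule" mode
lemmatizer = nlp.get_pipe("lemmatizer")

# Exceptions are keyed by coarse POS; each entry maps a form to a list of lemmas
exc = lemmatizer.lookups.get_table("lemma_exc")
exc["noun"]["wugs"] = ["wug"]  # hypothetical word used for illustration

doc = nlp("I saw two wugs.")
print(doc[3].lemma_)  # wug (assuming "wugs" is tagged as a noun)
```

As noted above, make changes like this right after loading the pipeline, before processing any text, so the lemma cache doesn't serve stale results.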
If you want to modify lemmas for a broad class of words, it's easier to use the `AttributeRuler` to match words based on conditions and manually specify a lemma. An example of something you might want to do is to make all punctuation have the lemma `.` (a period). Here's how you'd set that up:
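A minimal sketch, assuming `en_core_web_sm`; the component name `punct_lemma_ruler` is arbitrary:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# Add the ruler after the lemmatizer so its lemma takes precedence
ruler = nlp.add_pipe("attribute_ruler", name="punct_lemma_ruler", after="lemmatizer")
# Match any punctuation token and overwrite its lemma with "."
ruler.add(patterns=[[{"IS_PUNCT": True}]], attrs={"LEMMA": "."})

doc = nlp("Hello, world!")
print([(t.text, t.lemma_) for t in doc])
# roughly: [('Hello', 'hello'), (',', '.'), ('world', 'world'), ('!', '.')]
```

Placing the ruler after the lemmatizer means its lemma overrides whatever the lemmatizer assigned.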
Custom Lemmatizer Modes
It's possible to add your own custom processing mode to the rule-based lemmatizer. Suppose, for example, that you want the same behavior as the lookup lemmatizer, but you want to always use the lower case form of words. Here's a simple subclass of the Lemmatizer to do that.
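Here's a minimal sketch of such a subclass, assuming `spacy-lookups-data` is installed so the `lemma_lookup` table can be loaded; the names `LowercaseLookupLemmatizer`, `lower_lookup`, and `lower_lookup_lemmatizer` are illustrative, not part of spaCy:

```python
from typing import List, Tuple

import spacy
from spacy.language import Language
from spacy.pipeline import Lemmatizer
from spacy.tokens import Token


class LowercaseLookupLemmatizer(Lemmatizer):
    @classmethod
    def get_lookups_config(cls, mode: str) -> Tuple[List[str], List[str]]:
        if mode == "lower_lookup":
            # Reuse the same table as the built-in "lookup" mode
            return (["lemma_lookup"], [])
        return super().get_lookups_config(mode)

    def lower_lookup_lemmatize(self, token: Token) -> List[str]:
        # Like lookup mode, but always look up the lowercased form
        lookup_table = self.lookups.get_table("lemma_lookup")
        string = token.text.lower()
        return [lookup_table.get(string, string)]


@Language.factory(
    "lower_lookup_lemmatizer",
    default_config={"mode": "lower_lookup", "overwrite": False},
)
def make_lower_lookup_lemmatizer(nlp: Language, name: str, mode: str, overwrite: bool):
    return LowercaseLookupLemmatizer(
        nlp.vocab, model=None, name=name, mode=mode, overwrite=overwrite
    )


nlp = spacy.blank("en")
nlp.add_pipe("lower_lookup_lemmatizer")
nlp.initialize()  # loads lemma_lookup via spacy-lookups-data
print([t.lemma_ for t in nlp("Dogs barked")])  # e.g. ['dog', 'bark']
```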
Lemmatizer Tables
This is information about the internal implementation of the lemmatizer.
There are four different tables the lemmatizer can use. By default `lemma_lookup` is used in `lookup` mode, and the other three tables are used in `rule` mode. The three tables used for `rule` mode are subdivided by POS, and their individual functions are:

- `lemma_exc`: Exceptions; these take precedence over other forms.
- `lemma_rules`: A list of simple suffix rewrite rules. If the first part of a rule is found as a suffix, it is replaced by the second part of the rule.
- `lemma_index`: A list of real words. When a word rewritten by a rule is found in this list, it's known to be a good form and is used as a lemma (unless an exception takes precedence).

Depending on the specific lemmatizer implementation, the way these tables are used can differ, or there can even be different tables altogether. To understand the exact way the tables are used, the best reference is the code of the lemmatizer you're using.
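If you want to see which tables your lemmatizer actually has loaded, you can inspect them through its lookups. A minimal sketch, assuming `en_core_web_sm`:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer")

# Which tables are loaded depends on the language and mode
print(lemmatizer.lookups.tables)  # e.g. ['lemma_rules', 'lemma_exc', 'lemma_index']

# Each table is dict-like; rule tables are keyed by coarse POS
rules = lemmatizer.lookups.get_table("lemma_rules")
print(rules["noun"][:3])  # a few suffix rewrite rules, e.g. [['s', ''], ...]
```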
If your issue isn't resolved by this FAQ, feel free to open a new Discussion and mention you read it - we're always looking to improve our docs!