Lemmatizer FAQ #11685
spaCy has a number of different lemmatizer implementations, and which one is best for a given application can depend on many different requirements. This document will help you identify the source of lemmatizer errors, figure out which lemmatizer you're using, how it works, and how to change it or modify its output.
Why is the Lemmatizer Output Wrong?
There are a few typical reasons the lemmatizer may give incorrect output. Here's an overview of the most common ones.
- The part of speech or morphological features are wrong. Many lemmatizer modes depend on part of speech information (`token.pos_`), and if the POS is wrong, the lemma will also be wrong. A common case is when a common noun is recognized as a proper noun and not lemmatized. This is especially common in documents where upper and lower case are not used as in formal writing. If your lemmas are wrong, this is one of the first things to check (see the snippet after this list).
- The word isn't in the table for a lookup lemmatizer. While the tables used in lookup lemmatizers are extensive, they don't have everything, and they also may not capture newer words or slang. See the "Modifying the Output" section below for details on how to fix this.
- The word is in the lemma tables, but it's wrong. While it's rare, the lemma tables have many entries, and inevitably there are a few errors. In some cases this is due to including rare variant forms that get prioritized over more common forms. If you find a mistake like this, feel free to let us know in an issue, or you can fix it immediately by following the instructions in "Modifying the Output" below.
- The word doesn't fit the existing rules. For rule-based lemmatizers, it's possible for a word to be processed incorrectly by the rules. This can also be fixed by modifying the tables, though it works a bit differently from modifying the lookup table. See "Modifying the Output" for a fix.
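A quick way to check the POS is to print it and the morphological features next to the lemma. Here's a minimal sketch, assuming the `en_core_web_sm` pipeline is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# All-lowercase text is a typical case where tagging (and therefore
# lemmatization) can go wrong.
doc = nlp("i saw some geese by the river")
for token in doc:
    print(token.text, token.pos_, token.morph, token.lemma_)
```

If the POS or morphology is wrong, the fix is usually on the tagging side (or overriding the lemma with the `AttributeRuler`, described below) rather than in the lemmatizer itself.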
The Different Lemmatizers
spaCy has two lemmatizer components: the `Lemmatizer` is a rule-based lemmatizer with several modes, while the `EditTreeLemmatizer` is a trainable component that uses machine learning to predict lemmas.

The default Lemmatizer has two built-in modes:

- `lookup` uses lookup tables to find lemmas. By default it uses just one table, but for several languages, the default lemmatizer has an additional mode `pos_lookup` that looks up words in different tables based on part of speech.
- `rule` uses manually written transformation rules to create lemmas, but may fall back to lookup tables.

It is also possible to add custom modes; see "Custom Lemmatizer Modes" below for details.
Check Which Lemmatizer You're Using
If you're using a pretrained pipeline, check the model page for a mention of the "trainable lemmatizer", which corresponds to the `EditTreeLemmatizer`. If it's not mentioned, the pipeline contains the rule-based lemmatizer.
Given a pipeline with a `lemmatizer` component, you can check details like this:
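A minimal sketch, assuming a pipeline like `en_core_web_sm` is installed (any pipeline with a component named "lemmatizer" works the same way):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer")

# The class tells you which implementation you have
print(type(lemmatizer).__name__)  # "Lemmatizer" or "EditTreeLemmatizer"

# The rule-based Lemmatizer also exposes its mode
if hasattr(lemmatizer, "mode"):
    print(lemmatizer.mode)  # e.g. "rule", "lookup", or "pos_lookup"
```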
Modifying the Output
Sometimes you want to alter the output for a few words that the lemmatizer gets wrong. For modifying individual words, you can modify the tables. If you're using a lookup lemmatizer, modifying the output is simple, and can be done like this:
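A minimal sketch, assuming a blank English pipeline with a lookup-mode lemmatizer and the `spacy-lookups-data` package installed; the word "catz" is just a made-up example:

```python
import spacy

# Requires spacy-lookups-data so the lemma tables can be loaded
nlp = spacy.blank("en")
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
nlp.initialize()

# The lookup table is dict-like, so you can add or correct entries directly
lookup = lemmatizer.lookups.get_table("lemma_lookup")
lookup["catz"] = "cat"  # hypothetical slang form

doc = nlp("catz")
print(doc[0].lemma_)  # cat
```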
Note that the lemmatizer has a caching mechanism for lemmas, so if you need to modify the tables, you should reload your pipeline and then modify the tables before processing any text. To reload the pipeline, you can simply save it locally with `nlp.to_disk(path)` and then use `spacy.load(path)` to load it back.

For a rule-based lemmatizer, you can always update the exceptions table to explicitly specify a correct form. Depending on the exact kind of fix, it may be more appropriate to modify the index or rule tables (see "Lemmatizer Tables" further on in this post for details), but the exceptions table will always work.
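Here's a minimal sketch of adding an exception, assuming `en_core_web_sm` and using the made-up word "wugs" for illustration:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # uses the rule-based lemmatizer in "rule" mode
lemmatizer = nlp.get_pipe("lemmatizer")

# Exceptions are keyed by coarse POS; each entry maps a form to a list of lemmas
exc = lemmatizer.lookups.get_table("lemma_exc")
exc["noun"]["wugs"] = ["wug"]  # hypothetical word used for illustration

doc = nlp("I saw two wugs.")
print(doc[3].lemma_)  # wug (assuming "wugs" is tagged as a noun)
```

As noted above, make changes like this right after loading the pipeline, before processing any text, so the lemma cache doesn't serve stale results.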
If you want to modify lemmas for a broad class of words, it's easier to use the `AttributeRuler` to match words based on conditions and manually specify a lemma. An example of something you might want to do is to make all punctuation have the lemma `.` (a period). Here's how you'd set that up:
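A minimal sketch, assuming `en_core_web_sm`; the component name `punct_lemma_ruler` is arbitrary:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# Add the ruler after the lemmatizer so its lemma takes precedence
ruler = nlp.add_pipe("attribute_ruler", name="punct_lemma_ruler", after="lemmatizer")
# Match any punctuation token and overwrite its lemma with "."
ruler.add(patterns=[[{"IS_PUNCT": True}]], attrs={"LEMMA": "."})

doc = nlp("Hello, world!")
print([(t.text, t.lemma_) for t in doc])
# roughly: [('Hello', 'hello'), (',', '.'), ('world', 'world'), ('!', '.')]
```

Placing the ruler after the lemmatizer means its lemma overrides whatever the lemmatizer assigned.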
Custom Lemmatizer Modes
It's possible to add your own custom processing mode to the rule-based lemmatizer. Suppose, for example, that you want the same behavior as the lookup lemmatizer, but you want to always use the lower case form of words. Here's a simple subclass of the Lemmatizer to do that.
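Here's a minimal sketch of such a subclass, assuming `spacy-lookups-data` is installed so the `lemma_lookup` table can be loaded; the names `LowercaseLookupLemmatizer`, `lower_lookup`, and `lower_lookup_lemmatizer` are illustrative, not part of spaCy:

```python
from typing import List, Tuple

import spacy
from spacy.language import Language
from spacy.pipeline import Lemmatizer
from spacy.tokens import Token


class LowercaseLookupLemmatizer(Lemmatizer):
    @classmethod
    def get_lookups_config(cls, mode: str) -> Tuple[List[str], List[str]]:
        if mode == "lower_lookup":
            # Reuse the same table as the built-in "lookup" mode
            return (["lemma_lookup"], [])
        return super().get_lookups_config(mode)

    def lower_lookup_lemmatize(self, token: Token) -> List[str]:
        # Like lookup mode, but always look up the lowercased form
        lookup_table = self.lookups.get_table("lemma_lookup")
        string = token.text.lower()
        return [lookup_table.get(string, string)]


@Language.factory(
    "lower_lookup_lemmatizer",
    default_config={"mode": "lower_lookup", "overwrite": False},
)
def make_lower_lookup_lemmatizer(nlp: Language, name: str, mode: str, overwrite: bool):
    return LowercaseLookupLemmatizer(
        nlp.vocab, model=None, name=name, mode=mode, overwrite=overwrite
    )


nlp = spacy.blank("en")
nlp.add_pipe("lower_lookup_lemmatizer")
nlp.initialize()  # loads lemma_lookup via spacy-lookups-data
print([t.lemma_ for t in nlp("Dogs barked")])  # e.g. ['dog', 'bark']
```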
Lemmatizer Tables
This is information about the internal implementation of the lemmatizer.
There are four different tables the lemmatizer can use. By default `lemma_lookup` is used in `lookup` mode, and the other three tables are used in `rule` mode. The three tables used for `rule` mode are subdivided by POS, and their individual functions are:

- `lemma_exc`: Exceptions; these take precedence over other forms.
- `lemma_rules`: A list of simple suffix rewrite rules. If the first part of a rule is found as a suffix, it is replaced by the second part of the rule.
- `lemma_index`: A list of real words. When a word rewritten by a rule is found in this list, it's known to be a good form and is used as a lemma (unless an exception takes precedence).

Depending on the specific lemmatizer implementation, the way these tables are used can differ, or there can even be different tables altogether. To understand the exact way the tables are used, the best reference is the code of the lemmatizer you're using.
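If you want to see which tables your lemmatizer actually has loaded, you can inspect them through its lookups. A minimal sketch, assuming `en_core_web_sm`:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer")

# Which tables are loaded depends on the language and mode
print(lemmatizer.lookups.tables)  # e.g. ['lemma_rules', 'lemma_exc', 'lemma_index']

# Each table is dict-like; rule tables are keyed by coarse POS
rules = lemmatizer.lookups.get_table("lemma_rules")
print(rules["noun"][:3])  # a few suffix rewrite rules, e.g. [['s', ''], ...]
```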
If your issue isn't resolved by this FAQ, feel free to open a new Discussion and mention you read it - we're always looking to improve our docs!