The GENRE (Generative ENtity REtrieval) system, as presented in Autoregressive Entity Retrieval, implemented in PyTorch:
@inproceedings{decao2021autoregressive,
  author    = {Nicola {De Cao} and
               Gautier Izacard and
               Sebastian Riedel and
               Fabio Petroni},
  title     = {Autoregressive Entity Retrieval},
  booktitle = {9th International Conference on Learning Representations, {ICLR} 2021,
               Virtual Event, Austria, May 3-7, 2021},
  publisher = {OpenReview.net},
  year      = {2021},
  url       = {https://openreview.net/forum?id=5k8F6UU39V},
}
The mGENRE system, as presented in Multilingual Autoregressive Entity Linking:
@article{de-cao-etal-2022-multilingual,
  title     = "Multilingual Autoregressive Entity Linking",
  author    = "De Cao, Nicola and
               Wu, Ledell and
               Popat, Kashyap and
               Artetxe, Mikel and
               Goyal, Naman and
               Plekhanov, Mikhail and
               Zettlemoyer, Luke and
               Cancedda, Nicola and
               Riedel, Sebastian and
               Petroni, Fabio",
  journal   = "Transactions of the Association for Computational Linguistics",
  volume    = "10",
  year      = "2022",
  address   = "Cambridge, MA",
  publisher = "MIT Press",
  url       = "https://aclanthology.org/2022.tacl-1.16",
  doi       = "10.1162/tacl_a_00460",
  pages     = "274--290",
}
Please consider citing our works if you use code from this repository.
In a nutshell, (m)GENRE uses a sequence-to-sequence approach to entity retrieval (e.g., linking), based on the fine-tuned BART architecture (or mBART for the multilingual version). (m)GENRE performs retrieval by generating the unique entity name conditioned on the input text, using constrained beam search to generate only valid identifiers. For open-domain question answering, for instance, GENRE retrieves a Wikipedia page by generating its title conditioned on the question.
For end-to-end entity linking, GENRE re-generates the input text annotated with a markup that marks mention spans and the entities they refer to.
GENRE achieves state-of-the-art results on multiple datasets.
mGENRE performs multilingual entity linking in 100+ languages, treating languages as latent variables and marginalizing over them.
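To make the constrained beam search concrete, below is a minimal sketch of the idea (illustrative only, not the repository's implementation; (m)GENRE ships its own prefix tree in genre.trie): the token sequences of all valid entity names are stored in a prefix tree, and at each decoding step the beam may only continue with tokens that extend a valid prefix.

# Minimal sketch of prefix-constrained decoding. Token IDs below are made up
# for illustration; in (m)GENRE they come from the model's subword vocabulary.
class ToyTrie:
    def __init__(self, sequences):
        self.root = {}
        for seq in sequences:
            node = self.root
            for token_id in seq:
                node = node.setdefault(token_id, {})

    def get(self, prefix):
        # return the token IDs allowed after `prefix` (empty list if invalid)
        node = self.root
        for token_id in prefix:
            if token_id not in node:
                return []
            node = node[token_id]
        return list(node.keys())

toy = ToyTrie([[2, 5, 9], [2, 5, 7], [2, 8]])
print(toy.get([2, 5]))  # [9, 7] -- only continuations of valid names are allowed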
Main dependencies:
- python>=3.7
- pytorch>=1.6
- fairseq>=0.10 (optional, for training GENRE). NOTE: fairseq is undergoing changes without backward compatibility. Install fairseq from source and use this commit for reproducibility; see here for the current PR that should fix fairseq/master.
- transformers>=4.2 (optional, for inference of GENRE)
For a full review of the (m)GENRE API, see:
- examples for GENRE, showing how to use GENRE with both pytorch fairseq and huggingface transformers;
- examples for mGENRE, showing how to use mGENRE.
After importing and loading the model and a prefix tree (trie), you can generate predictions (in this example, for Entity Disambiguation) with a simple call:
import pickle
from genre.fairseq_model import GENRE
from genre.trie import Trie
# load the prefix tree (trie)
with open("../data/kilt_titles_trie_dict.pkl", "rb") as f:
trie = Trie.load_from_dict(pickle.load(f))
# load the model
model = GENRE.from_pretrained("models/fairseq_entity_disambiguation_aidayago").eval()
# generate Wikipedia titles
model.sample(
sentences=["Einstein was a [START_ENT] German [END_ENT] physicist."],
prefix_allowed_tokens_fn=lambda batch_id, sent: trie.get(sent.tolist()),
)
[[{'text': 'Germany', 'score': tensor(-0.1856)},
  {'text': 'Germans', 'score': tensor(-0.5461)},
  {'text': 'German Empire', 'score': tensor(-2.1858)}]]
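The same mechanism can restrict generation to a custom candidate set. A rough sketch, reusing model and Trie from the snippet above (the prepended 2 is fairseq's decoder start token, and the exact tokenization of candidate titles is an assumption to verify against the repository's examples):

# sketch: constrain decoding to a hand-picked candidate set
candidates = ["Germany", "Germans", "German Empire"]
candidate_trie = Trie([
    [2] + model.encode(c).tolist()[1:]  # assumed tokenization; check the repo examples
    for c in candidates
])
model.sample(
    sentences=["Einstein was a [START_ENT] German [END_ENT] physicist."],
    prefix_allowed_tokens_fn=lambda batch_id, sent: candidate_trie.get(sent.tolist()),
)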
Making predictions with mGENRE is very similar, but we additionally need to map (title, language_ID)
pairs to Wikidata IDs and (optionally) marginalize over predictions that refer to the same entity:
import pickle
from genre.fairseq_model import mGENRE
from genre.trie import MarisaTrie, Trie
with open("../data/lang_title2wikidataID-normalized_with_redirect.pkl", "rb") as f:
lang_title2wikidataID = pickle.load(f)
# memory efficient prefix tree (trie) implemented with `marisa_trie`
with open("../data/titles_lang_all105_marisa_trie_with_redirect.pkl", "rb") as f:
trie = pickle.load(f)
# load the model
model = mGENRE.from_pretrained("../models/fairseq_multilingual_entity_disambiguation").eval()
# generate Wikipedia titles and language IDs
model.sample(
sentences=["[START] Einstein [END] era un fisico tedesco."],
# Italian for "[START] Einstein [END] was a German physicist."
prefix_allowed_tokens_fn=lambda batch_id, sent: [
e for e in trie.get(sent.tolist()) if e < len(model.task.target_dictionary)
],
text_to_id=lambda x: max(lang_title2wikidataID[
tuple(reversed(x.split(" >> ")))
], key=lambda y: int(y[1:])),
marginalize=True,
)
[[{'id': 'Q937',
'texts': ['Albert Einstein >> it',
'Alberto Einstein >> it',
'Einstein >> it'],
'scores': tensor([-0.0808, -1.4619, -1.5765]),
'score': tensor(-0.0884)},
{'id': 'Q60197',
'texts': ['Alfred Einstein >> it'],
'scores': tensor([-1.4337]),
'score': tensor(-3.2058)},
{'id': 'Q15990626',
'texts': ['Albert Einstein (disambiguation) >> en'],
'scores': tensor([-1.0998]),
'score': tensor(-3.6478)}]]
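Conceptually, the per-entity score above comes from grouping beam hypotheses that map to the same Wikidata ID and aggregating their scores. A simplified sketch of that grouping step (the repository's marginalization additionally applies a length penalty, which is why the numbers above differ from a plain log-sum-exp):

import torch
from collections import defaultdict

def marginalize_sketch(hypotheses, text_to_id):
    # hypotheses: [{"text": "Albert Einstein >> it", "score": tensor(...)}, ...]
    grouped = defaultdict(list)
    for h in hypotheses:
        grouped[text_to_id(h["text"])].append(h)
    results = [
        {
            "id": entity_id,
            "texts": [h["text"] for h in hs],
            # aggregate log-probabilities of hypotheses naming the same entity
            "score": torch.logsumexp(torch.stack([h["score"] for h in hs]), dim=0),
        }
        for entity_id, hs in grouped.items()
    ]
    return sorted(results, key=lambda r: -r["score"].item())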
For GENRE, use this script to download all models and this one to download all datasets. See here for the list of all individual models for each task, for both pytorch fairseq and huggingface transformers. See the example for how to download additional optional files, such as the prefix tree (trie) for KILT Wikipedia.
For mGENRE, we provide a single model, available here. See the example for how to download additional optional files, such as the prefix tree (trie) for Wikipedia in all languages and the mapping between titles and Wikidata IDs.
A pre-trained mBART model covering 125 languages is available here.
If the genre module cannot be found, preface the python command with PYTHONPATH=.
GENRE is licensed under the CC-BY-NC 4.0 license. The text of the license can be found here.