This is the repository for the paper MOSAICo: a Multilingual Open-text Semantically Annotated Interlinked Corpus, presented at NAACL 2024 by Simone Conia, Edoardo Barba, Abelardo Carlos Martinez Lorenzo, Pere-Lluís Huguet Cabot, Riccardo Orlando, Luigi Procopio and Roberto Navigli.
Several Natural Language Understanding (NLU) tasks focus on linking text to explicit knowledge, including Word Sense Disambiguation, Semantic Role Labeling, Semantic Parsing, and Relation Extraction. In addition to the importance of connecting raw text with explicit knowledge bases, the integration of such carefully curated knowledge into deep learning models has been shown to be beneficial across a diverse range of applications, including Language Modeling and Machine Translation. Nevertheless, the scarcity of semantically-annotated corpora across various tasks and languages limits the potential advantages significantly. To address this issue, we put forward MOSAICo, the first endeavor aimed at equipping the research community with the key ingredients to model explicit semantic knowledge at a large scale, providing hundreds of millions of silver yet high-quality annotations for four NLU tasks across five languages. We describe the creation process of MOSAICo, demonstrate its quality and variety, and analyze the interplay between different types of semantic information. MOSAICo, available at https://github.com/SapienzaNLP/mosaico, aims to drop the requirement of closed, licensed datasets and represents a step towards a level playing field across languages and tasks.
- Full paper: NAACL 2024!
MOSAICo provides high-quality silver annotations for four semantic tasks:
- Word Sense Disambiguation: we use ESCHER, a state-of-the-art WSD system adapted for multilingual settings.
- Semantic Role Labeling: we use Multi-SRL, a state-of-the-art multilingual system for dependency- and span-based SRL.
- Semantic Parsing: we use SPRING, a state-of-the-art semantic parser adapted for multilingual settings.
- Relation Extraction: we use mREBEL, a state-of-the-art system for multilingual RE.
MOSAICo data are released as JSON files produced by `mongoexport`, which can be loaded into a local instance of MongoDB.
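`mongoexport` files are newline-delimited JSON: one document per line. Before importing a downloaded collection, you can peek at its first documents to check what you got. This is a small sketch (the path and field names below are illustrative; the actual schema is defined by the pydantic models in `src/mosaico/schema/`):

```python
import json


def peek(path, n=3):
    """Return the first `n` documents of a mongoexport JSON file."""
    docs = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= n:
                break
            docs.append(json.loads(line))  # one JSON document per line
    return docs
```

Because each line is a standalone JSON object, these files can also be streamed without loading the whole collection into memory.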
First, we need to start a local MongoDB instance (we suggest using Docker):
```shell
docker run \
    -e MONGO_INITDB_ROOT_USERNAME=admin \
    -e MONGO_INITDB_ROOT_PASSWORD=password \
    -p 27017:27017 \
    --name local-mosaico-db \
    --detach \
    mongo:6.0.11
```
Then, we need to `mongoimport` the data for the three collections:
| Collection | Sample | MOSAICo Core |
| --- | --- | --- |
| interlanguage-links | - | link |
| pages | link | link |
| annotations | link | link |
The Sample column refers to an English-only sample of 835 annotated documents.
Once downloaded, you can import the data into the local MongoDB instance.
```shell
# import interlanguage links
docker exec -i local-mosaico-db \
    mongoimport \
    --authenticationDatabase admin -u admin -p password \
    --db mosaico --collection interlanguage-links < <path-to-interlanguage-links.collection.json>

# import pages
docker exec -i local-mosaico-db \
    mongoimport \
    --authenticationDatabase admin -u admin -p password \
    --db mosaico --collection pages < <path-to-pages.collection.json>

# import annotations
docker exec -i local-mosaico-db \
    mongoimport \
    --authenticationDatabase admin -u admin -p password \
    --db mosaico --collection annotations < <path-to-annotations.collection.json>
```
To query the data with the accompanying Python library, install it with:

```shell
pip install git+https://github.com/SapienzaNLP/mosaico
```
The library heavily uses async programming. If you cannot integrate that within your code (e.g., inside a torch.Dataset), we suggest using a separate script to download the data locally. Moreover, we built this project on top of beanie, an ODM for MongoDB. Before proceeding, we strongly recommend checking out its tutorial, as WikiPage is a beanie.Document.
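The "download first, read synchronously later" pattern can be sketched as follows: a standalone script drains an async source into a local JSONL file, which synchronous code (e.g., a torch.Dataset) can then read without touching asyncio. Here `fetch_pages` is a hypothetical stand-in for a real MOSAICo query; an actual script would call `await init(...)` and iterate over `WikiPage.find(...)` instead.

```python
import asyncio
import json


async def fetch_pages():
    # Hypothetical stand-in for a real MOSAICo query; replace the body
    # with `await init(...)` followed by `async for page in WikiPage.find(...)`.
    for doc in ({"document_id": i, "text": f"page {i}"} for i in range(3)):
        yield doc


async def dump_pages(path):
    # Drain the async iterator into a local JSONL file that synchronous
    # code can read later, one JSON document per line.
    with open(path, "w", encoding="utf-8") as f:
        async for page in fetch_pages():
            f.write(json.dumps(page) + "\n")


if __name__ == "__main__":
    asyncio.run(dump_pages("pages.jsonl"))
```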
```python
import asyncio

from mosaico.schema import init, WikiPage


async def main():
    await init(
        mongo_uri="mongodb://admin:password@localhost:27017/",
        db="mosaico",
    )

    page = await WikiPage.find_one(WikiPage.title == "Barack Obama")

    print(f"# document id: {page.document_id}")
    print(
        f"# wikidata id: {page.wikidata_id if page.wikidata_id is not None else '<not available>'}"
    )
    print(f"# language: {page.language.value}")
    print(f"# text: {page.text[:100]} [...]")

    print("# available annotations:")
    async for annotation in page.list_annotations():
        print(f"  * {annotation.name}")

    print("# available translated pages:")
    async for translated_page in page.list_translations():
        print(f"  * {translated_page.language.value} => {translated_page.document_id}")


if __name__ == "__main__":
    asyncio.run(main())
```
For more information, check out the examples/ folder. If interested in the fields available for each annotation, check out the pydantic models defined in src/mosaico/schema/annotations/.
This repository includes a script that runs a Streamlit demo for easy data visualization:

```shell
PYTHONPATH=$(pwd) pdm run demo
```
This repository uses PDM as its dependency manager.
```shell
# install the pdm package manager
curl -sSL https://pdm-project.org/install-pdm.py | python3 -
# add its binary folder to PATH, then install the dependencies
pdm install
```
We use an alignment algorithm to link the Cirrus text (which does not contain metadata such as sections and links) to the standard Wikipedia source text (which does).
In this process, we compute a cleaned, more easily alignable version of the source text by applying wikiextractor. For best results, we recommend correcting (i.e., patching) the installed version of wikiextractor by updating the following lines in `wikiextractor.extract:clean`:
```python
for tag in discardElements:
    text = dropNested(text, r'<\s*%s\b[^>/]*>' % tag, r'<\s*/\s*%s>' % tag)
```
to:
```python
for tag in discardElements:
    text = dropNested(text, r'<\s*%s\b[^>]*[^/]*>' % tag, r'<\s*/\s*%s>' % tag)
```
The reason behind this change is that the original regex fails on some edge cases. For example, this is the cleaned text returned by the original, unpatched function:

```text
Inspired by the first person ever to be cured of HIV, <a href="The%20Berlin%20Patient">The Berlin Patient</a>, StemCyte began collaborations with <a href="Cord%20blood%20bank">Cord blood bank</a>s worldwide to systematically screen <a href="Umbilical%20cord%20blood">Umbilical cord blood</a> samples for the CCR5 mutation beginning in 2011.<ref name="CCR5Δ32/Δ32 HIV-resistant cord blood"></ref>
```

The trailing `<ref>` element hasn't been deleted because the original regex excludes `/` anywhere inside the tag, while it should only be excluded when it is the second-to-last character (i.e., in self-closing tags).
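The difference between the two patterns can be checked directly with Python's `re` module on the `<ref>` tag from the example above:

```python
import re

tag = "ref"
original = r'<\s*%s\b[^>/]*>' % tag        # forbids '/' anywhere in the tag
patched = r'<\s*%s\b[^>]*[^/]*>' % tag     # allows '/' inside attribute values

# The attribute value contains a '/', which the original pattern forbids
# anywhere inside the tag, so the opening <ref ...> is never matched
# and the element is not dropped.
edge_case = '<ref name="CCR5Δ32/Δ32 HIV-resistant cord blood"></ref>'

print(re.search(original, edge_case))  # None: the tag is not matched
print(re.search(patched, edge_case))   # matches the opening <ref ...>
```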
More details on the linking process can be found in src/scripts/annotations/source_text_linking/link.py.
If you use any part of this work, please consider citing the paper as follows:
```bibtex
@inproceedings{conia-etal-2024-mosaico,
    title = "{MOSAICo}: a Multilingual Open-text Semantically Annotated Interlinked Corpus",
    author = "Conia, Simone and
      Barba, Edoardo and
      Martinez Lorenzo, Abelardo Carlos and
      Huguet Cabot, Pere-Llu{\'\i}s and
      Orlando, Riccardo and
      Procopio, Luigi and
      Navigli, Roberto",
    editor = "Duh, Kevin and
      Gomez, Helena and
      Bethard, Steven",
    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.naacl-long.442",
    pages = "7983--7997",
    abstract = "Several Natural Language Understanding (NLU) tasks focus on linking text to explicit knowledge, including Word Sense Disambiguation, Semantic Role Labeling, Semantic Parsing, and Relation Extraction. In addition to the importance of connecting raw text with explicit knowledge bases, the integration of such carefully curated knowledge into deep learning models has been shown to be beneficial across a diverse range of applications, including Language Modeling and Machine Translation. Nevertheless, the scarcity of semantically-annotated corpora across various tasks and languages limits the potential advantages significantly. To address this issue, we put forward MOSAICo, the first endeavor aimed at equipping the research community with the key ingredients to model explicit semantic knowledge at a large scale, providing hundreds of millions of silver yet high-quality annotations for four NLU tasks across five languages. We describe the creation process of MOSAICo, demonstrate its quality and variety, and analyze the interplay between different types of semantic information. MOSAICo, available at https://github.com/SapienzaNLP/mosaico, aims to drop the requirement of closed, licensed datasets and represents a step towards a level playing field across languages and tasks in NLU.",
}
```
The data is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0).