This repository hosts the dataset developed as part of the work accepted in LaTeCH-CLfL 2024: "Enriching the Metadata of Community-Generated Digital Content through Entity Linking: An Evaluative Comparison of State-of-the-Art Models"

OurHeritageOurStories/cgdc_annotations

Overview

This repo contains named entity (NE) and entity linking (EL) annotations of the textual metadata of items drawn from community-generated digital content (CGDC): digital-born historical/cultural archive collections that have been developed by communities. The dataset consists of annotation files (in .ann format) produced by the Brat annotation tool. The annotation files correspond to 100 CGDC items, 50 of which were collected from the Morrab Library and the other 50 from The People's Collection Wales (PCW).
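For readers unfamiliar with the format, the following is a minimal sketch of reading the entity spans from a Brat standoff file. It assumes only the standard Brat layout for text-bound annotations (an ID starting with "T", then a tab, the type and character offsets, another tab, and the covered text). The helper name `parse_text_bound` and the entity type names in the usage example are illustrative assumptions, not part of this dataset; all other line types, including whichever lines carry the entity links, are skipped here because their exact layout in these files is not described above.

```python
from typing import List, NamedTuple


class TextBound(NamedTuple):
    """One text-bound ("T") annotation from a Brat .ann file."""
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_text_bound(ann_content: str) -> List[TextBound]:
    """Collect text-bound annotations; every other line type is ignored."""
    spans = []
    for line in ann_content.splitlines():
        if not line.startswith("T"):
            continue  # notes, normalizations, relations, etc. are skipped
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # malformed line, skipped rather than guessed at
        ann_id, type_and_offsets, text = parts[0], parts[1], parts[2]
        label, _, offsets = type_and_offsets.partition(" ")
        # Brat writes discontinuous spans as "0 5;10 15"; keep the outer bounds.
        fragments = [fragment.split() for fragment in offsets.split(";")]
        start = int(fragments[0][0])
        end = int(fragments[-1][1])
        spans.append(TextBound(ann_id, label, start, end, text))
    return spans
```

A short usage example on made-up content (the type names "Location" and "Person" are placeholders):

```python
sample = "T1\tLocation 0 8\tPenzance\nT2\tPerson 12 22\tJohn Smith\n"
for span in parse_text_bound(sample):
    print(span.ann_id, span.label, span.start, span.end, span.text)
```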

Usage

Researchers and practitioners can use this dataset to test and evaluate entity linking models on CGDC metadata. It provides a set of annotations that can serve as a benchmark for assessing EL model performance. As we are still seeking permission to share and distribute the raw textual metadata that was annotated, we can currently provide only the annotations. Users of the dataset will therefore have to reconstruct the text themselves, following the instructions below.

Dataset Description

  • Annotations: Each annotation file corresponds to a CGDC item and has a filename that starts with the collection name abbreviation ("morrab" or "pcw") followed by an underscore and the item ID within the collection. For example, the annotation file morrab_10286.ann contains the annotations for the item with ID 10286 in the Morrab Library.
  • Text reconstruction: The original text for each item can be reconstructed by creating a plain text file with the same base name as the corresponding annotation file, but with ".txt" as the file extension. For instance, the plain text file for morrab_10286.ann should be called morrab_10286.txt. The content of the text file should be the item's title followed by its description, separated by a newline. The title and description of an item can be retrieved from the following URLs, where <item_ID> should be replaced with the ID of the item of interest:
    • for Morrab Library items: https://photoarchive.morrablibrary.org.uk/items/show/<item_ID>
    • for PCW items: https://www.peoplescollection.wales/items/<item_ID>
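
The naming and URL conventions above can be sketched in Python as follows. This is an illustration under stated assumptions rather than an official script: the helper names `item_url`, `text_filename` and `build_text` are hypothetical, and retrieving the title and description from each page is left to the user, since the page structure is not described here.

```python
from pathlib import Path

# URL patterns copied from the list above, keyed by collection abbreviation.
URL_TEMPLATES = {
    "morrab": "https://photoarchive.morrablibrary.org.uk/items/show/{item_id}",
    "pcw": "https://www.peoplescollection.wales/items/{item_id}",
}


def item_url(ann_filename: str) -> str:
    """Map an annotation filename such as 'morrab_10286.ann' to its item URL."""
    stem = Path(ann_filename).stem
    collection, _, item_id = stem.partition("_")
    if collection not in URL_TEMPLATES or not item_id:
        raise ValueError(f"Unexpected annotation filename: {ann_filename}")
    return URL_TEMPLATES[collection].format(item_id=item_id)


def text_filename(ann_filename: str) -> str:
    """Return the name of the plain text file that pairs with an annotation file."""
    return str(Path(ann_filename).with_suffix(".txt"))


def build_text(title: str, description: str) -> str:
    """Join title and description with a single newline, as described above.

    Assumption: no extra stripping or trailing newline is applied.
    """
    return f"{title}\n{description}"
```

For example, `item_url("morrab_10286.ann")` gives the Morrab Library page for item 10286, and `text_filename("morrab_10286.ann")` gives `morrab_10286.txt`. Because Brat annotations refer to character offsets in the text, it is worth checking that each annotated span matches the reconstructed text at its offsets; a mismatch usually points to a whitespace difference in the title or description.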

Citation

If you use this dataset in your research or work, please cite the following paper: Youcef Benkhedda, Adrians Skapars, Viktor Schlegel, Goran Nenadic, and Riza Batista-Navarro. 2024. Enriching the Metadata of Community-Generated Digital Content through Entity Linking: An Evaluative Comparison of State-of-the-Art Models. In Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024), pages 213–220, St. Julians, Malta. Association for Computational Linguistics.
