T-Res is an end-to-end pipeline for toponym detection, linking, and resolution in digitised historical newspapers. Given an input text, T-Res identifies the places mentioned in it, links them to their corresponding Wikidata IDs, and provides their geographic coordinates. T-Res has been developed to help researchers explore large collections of digitised historical newspapers, and is designed to tackle problems commonly encountered when working with this type of data.
The pipeline has three main components:
- The Recogniser performs named entity recognition.
- The Ranker performs candidate selection and ranking.
- The Linker performs entity linking and resolution.
The three components are used in combination in the Pipeline class.
We also provide the code to deploy T-Res as an API, and show how to use it. Each of these elements is described in the documentation.
The repository contains the code for the experiments described in our CHR 2023 paper.
The T-Res documentation can be found at https://living-with-machines.github.io/T-Res/.
T-Res relies on several resources in the following directory structure:
```
T-Res/
├── t-res/
│   ├── geoparser/
│   └── utils/
├── app/
├── evaluation/
├── examples/
├── experiments/
│   └── outputs/
│       └── data/
│           └── lwm/
│               ├── linking_df_split.tsv [*?]
│               ├── ner_fine_dev.json [*+?]
│               └── ner_fine_train.json [*+?]
├── resources/
│   ├── deezymatch/
│   │   └── data/
│   │       └── w2v_ocr_pairs.txt [?]
│   ├── models/
│   ├── news_datasets/
│   ├── rel_db/
│   │   └── embeddings_database.db [*+?]
│   └── wikidata/
│       ├── entity2class.txt [*]
│       ├── mentions_to_wikidata_normalized.json [*]
│       ├── mentions_to_wikidata.json [*]
│       ├── wikidata_gazetteer.csv [*]
│       └── wikidata_to_mentions_normalized.json [*]
└── tests/
```
These resources are described in detail in the documentation. A question mark (?) indicates resources that are only required for some approaches (for example, the rel_db/embeddings_database.db file is only required by the REL-based disambiguation approaches). An asterisk (*) next to a resource means that its path can be changed when instantiating the T-Res objects, and a plus sign (+) means that the name of the file can be changed in the instantiation.
By default, T-Res expects to be run from the experiments/ folder, or from a directory at the same level (for example, the examples/ folder).
This is an example of how to use the default T-Res pipeline:

```python
from geoparser import pipeline

geoparser = pipeline.Pipeline(resources_path="./resources")
output = geoparser.run_text("She was on a visit at Chippenham.")
```
This returns:

```python
[{'mention': 'Chippenham',
  'ner_score': 1.0,
  'pos': 22,
  'sent_idx': 0,
  'end_pos': 32,
  'tag': 'LOC',
  'sentence': 'She was on a visit at Chippenham.',
  'prediction': 'Q775299',
  'ed_score': 0.651,
  'string_match_score': {'Chippenham': (1.0,
                         ['Q775299',
                          'Q3138621',
                          'Q2178517',
                          'Q1913461',
                          'Q7592323',
                          'Q5101644',
                          'Q67149348'])},
  'prior_cand_score': {},
  'cross_cand_score': {'Q775299': 0.651,
                       'Q3138621': 0.274,
                       'Q2178517': 0.035,
                       'Q1913461': 0.033,
                       'Q5101644': 0.003,
                       'Q7592323': 0.002,
                       'Q67149348': 0.002},
  'latlon': [51.4585, -2.1158],
  'wkdt_class': 'Q3957'}]
```
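The output is a plain list of dictionaries, one per detected mention, so it can be post-processed with ordinary Python. As a minimal sketch (hard-coding a trimmed copy of the example output above, so the snippet runs without the T-Res resources; the 0.5 threshold is an arbitrary illustration, not a recommended value), here is how one might keep confidently resolved toponyms with their Wikidata IDs and coordinates:

```python
# A trimmed copy of the run_text() output shown above, hard-coded so this
# snippet is self-contained (no T-Res installation or resources needed).
output = [{'mention': 'Chippenham',
           'ner_score': 1.0,
           'prediction': 'Q775299',
           'ed_score': 0.651,
           'latlon': [51.4585, -2.1158],
           'wkdt_class': 'Q3957'}]

# Keep mentions that were linked to a Wikidata ID with an entity
# disambiguation score above an (illustrative) threshold.
resolved = [
    (r['mention'], r['prediction'], tuple(r['latlon']))
    for r in output
    if r.get('prediction') and r['ed_score'] >= 0.5
]

print(resolved)  # [('Chippenham', 'Q775299', (51.4585, -2.1158))]
```

The same pattern extends naturally to writing the results out as a CSV or plotting the latlon pairs on a map.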
Note that T-Res allows the user to use their own knowledge base, and to choose among different approaches for performing each of the steps in the pipeline. Please refer to the documentation to learn how.
This work was supported by Living with Machines (AHRC grant AH/S01179X/1) and The Alan Turing Institute (EPSRC grant EP/N510129/1).
Living with Machines, funded by the UK Research and Innovation (UKRI) Strategic Priority Fund, is a multidisciplinary collaboration delivered by the Arts and Humanities Research Council (AHRC), with The Alan Turing Institute, the British Library, and the Universities of Cambridge, King's College London, East Anglia, Exeter, and Queen Mary University of London.
This work has been inspired by many previous projects, but particularly the Radboud Entity Linker (REL).
We adapt some code from:
- Huggingface tutorials: Apache License 2.0
- DeezyMatch tutorials: MIT License
- Radboud Entity Linker: MIT License
- Wikimapper: Apache License 2.0
Classes, methods, and functions that have been taken or adapted from the above are credited in the docstrings.
In our experiments, we used resources built from Wikidata and Wikipedia for linking. To assess the performance of T-Res, we used the topRes19th and HIPE-2020 datasets, and the HIPE-scorer for evaluation.
Coll-Ardanuy, Mariona, Federico Nanni, Kaspar Beelen, and Luke Hare. 'The Past is a Foreign Place: Improving Toponym Linking for Historical Newspapers.' In Proceedings of the Computational Humanities Research Conference 2023, edited by Artjoms Šeļa, Fotis Jannidis, and Iza Romanowska. Paris, France, 6-8 December 2023. https://ceur-ws.org/Vol-3558/. ISSN 1613-0073.