diff --git a/LICENSE b/LICENSE index 995f5e2e..3d4113b8 100644 --- a/LICENSE +++ b/LICENSE @@ -1,6 +1,9 @@ MIT License -Copyright (c) 2021 Living with Machines +Copyright (c) 2023 The Alan Turing Institute, British Library Board, Queen Mary +University of London, King's College London, University of East Anglia, The +University of Exeter and the Chancellor, Masters and Scholars of the University +of Cambridge Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal diff --git a/README.md b/README.md index 2d835b3c..ef5ffe0a 100644 --- a/README.md +++ b/README.md @@ -1,47 +1,27 @@ - - -## Table of contents -
-A cartoon of a funny T-Rex reading a map with a lense -
- - - +
+

T-Res: A Toponym Resolution Pipeline for Digitised Historical Newspapers

+
## Overview - - -T-Res is an end-to-end pipeline for toponym resolution for digitised historical newspapers. Given an input text (a sentence or a text), T-Res identifies the places that are mentioned in it, links them to their corresponding Wikidata IDs, and provides their geographic coordinates. T-Res has been designed to tackle common problems of working with digitised historical newspapers. +T-Res is an end-to-end pipeline for toponym resolution for digitised historical newspapers. Given an input text, T-Res identifies the places that are mentioned in it, links them to their corresponding Wikidata IDs, and provides their geographic coordinates. T-Res has been designed to tackle common problems of working with digitised historical newspapers. The pipeline has three main components: + * **The Recogniser** performs named entity recognition. * **The Ranker** performs candidate selection and ranking. * **The Linker** performs entity linking and resolution. The three components are used in combination in the **Pipeline** class. -We also provide the code to deploy T-Res as an API, and show how to use it. We will describe each of these elements below. +We also provide the code to deploy T-Res as an API, and show how to use it. Each of these elements are described in this documentation. ## Directory structure ``` toponym-resolution/ ├── app/ + ├── docs/ ├── evaluation/ ├── examples/ ├── experiments/ @@ -57,496 +37,30 @@ toponym-resolution/ └── utils/ ``` -## The T-Res API - -**[TODO]** +## Documentation -[[^Go back to the Table of contents]](#table-of-contents) +The T-Res documentation can be found at **[TODO]**. -## The complete tour +## Acknowledgements -The T-Res has three main classes: the Recogniser class (which performs named entity recognition---NER), the Ranker class (which performs candidate selection and ranking for the named entities identified by the Recogniser), and the Linker class (which selectes the most likely candidate from those provided by the Ranker). An additional class, the Pipeline, wraps these three components into one, therefore making end-to-end T-Res easier to use. +This work was supported by Living with Machines (AHRC grant AH/S01179X/1) and The Alan Turing Institute (EPSRC grant EP/N510129/1). -### The Recogniser +Living with Machines, funded by the UK Research and Innovation (UKRI) Strategic Priority Fund, is a multidisciplinary collaboration delivered by the Arts and Humanities Research Council (AHRC), with The Alan Turing Institute, the British Library and Cambridge, King's College London, East Anglia, Exeter, and Queen Mary University of London. -The Recogniser allows (1) loading an existing model (either directly downloading a model from the HuggingFace hub or loading a locally stored NER model) and (2) training a new model and loading it if it is already trained. +## Credits -The following notebooks show examples using the Recogniser: -* `./examples/load_use_ner_model.ipynb` -* `./examples/train_use_ner_model.ipynb` +This work has been inspired by many previous projects, but particularly the [Radboud Entity Linker (REL)](https://github.com/informagi/REL). -#### 1. 
Instantiate the Recogniser +We adapt some code from: +* Huggingface tutorials: [Apache License 2.0](https://github.com/huggingface/notebooks/blob/main/LICENSE) +* DeezyMatch tutorials: [MIT License](https://github.com/Living-with-machines/DeezyMatch/blob/master/LICENSE) +* Radboud Entity Linker: [MIT License](https://github.com/informagi/REL/blob/main/LICENSE) +* Wikimapper: [Apache License 2.0](https://github.com/jcklie/wikimapper/blob/master/LICENSE) -To load an already trained model (both from HuggingFace or a local model), you can just instantiate the recogniser as follows: -```python= -import recogniser - -myner = recogniser.Recogniser( - model="path-to-model", - load_from_hub=True, -) -``` - -For example, to load the [dslim/bert-base-NER](https://huggingface.co/dslim/bert-base-NER) NER model from the HuggingFace hub: -```python= -import recogniser - -myner = recogniser.Recogniser( - model="dslim/bert-base-NER", - load_from_hub=True, -) -``` - -To load a NER model that is stored locally (for example, let's suppose we have a NER model in this relative location `../resources/models/blb_lwm-ner-fine`), you can also load it in the same way (notice that `load_from_hub` should still be True, probably a better name would be `load_from_path`): - -```python= -import recogniser - -myner = recogniser.Recogniser( - model="resources/models/blb_lwm-ner-fine", - load_from_hub=True, -) -``` - -Alternatively, you can use the Recogniser to train a new model (and load it, once it's trained). To instantiate the Recogniser for training a new model and loading it once it's trained, you can do it as in the example (see the description of each parameter below): -```python= -import recogniser - -myner = recogniser.Recogniser( - model="blb_lwm-ner-fine", - train_dataset="experiments/outputs/data/lwm/ner_fine_train.json", - test_dataset="experiments/outputs/data/lwm/ner_fine_dev.json", - pipe=None, - base_model="khosseini/bert_1760_1900", - model_path="resources/models/", - training_args={ - "learning_rate": 5e-5, - "batch_size": 16, - "num_train_epochs": 4, - "weight_decay": 0.01, - }, - overwrite_training=False, - do_test=False, - load_from_hub=False, -) -``` -Description of the arguments: -* **`load_from_hub`** set to False indicates we're not using an off-the-shelf model. It will prepare the Recogniser to train a new model, unless the model already exists or if **`overwrite_training`** is set to True. If `overwrite_training` is set to False and `load_from_hub` is set to False, the Recogniser will be prepared to first try to load the model and---if it does not exist---will train it. If `overwrite_training` is set to True and `load_from_hub` is set to False, the Recogniser will be ready to directly try to train a model. -* **`base_model`** is the path to the model that will be used as base to train our NER model. This can be the path to a HuggingFace model (we are using [khosseini/bert_1760_1900](https://huggingface.co/khosseini/bert_1760_1900), a BERT model trained on 19th Century texts) or the path to a model stored locally. -* **`train_dataset`** and **`test_dataset`** contain the path to the train and test data sets necessary for training the NER model. The paths point to a json file (one for training, one for testing), in which each line is a dictionary corresponding to a sentence. 
Each sentence-dictionary has three key-value pairs: `id` is an ID of the sentence (a string), `tokens` is the list of tokens into which the sentence has been split, and `ner_tags` is the list of annotations per token (in BIO format). The length of `tokens` and `ner_tags` should always be the same. This is an example of two lines from either the training or test json files: - ```json - {"id":"3896239_29","ner_tags":["O","B-STREET","I-STREET","O","O","O","B-BUILDING","I-BUILDING","O","O","O","O","O","O","O","O","O","O"],"tokens":[",","Old","Millgate",",","to","the","Collegiate","Church",",","where","they","arrived","a","little","after","ten","oclock","."]} - {"id":"8262498_11","ner_tags":["O","O","O","O","O","O","O","O","O","O","O","B-LOC","O","B-LOC","O","O","O","O","O","O"],"tokens":["On","the","\u2018","JSth","November","the","ship","Santo","Christo",",","from","Monteveido","to","Cadiz",",","with","hides","and","copper","."]} - ``` -* **`model_path`** is the path where the Recogniser should store the model, and **`model`** is the name of the model. The **`pipe`** argument can be left empty: that's where we will store the NER pipeline, once the model is trained and loaded. -* The training arguments can be modified in **`training_args`**: you can change the learning rate, batch size, number of training epochs, and weight decay. -* Finally, **`do_test`** allows you to train a mock model and then load it (the suffix `_test` will be added to the model name). As mentioned above, **`overwrite_training`** forces retraining a model, even if a model with the same name and characteristics already exists. +Classes, methods and functions that have been taken or adapted from above are credited in the docstrings. -This instantiation prepares a new model (`resources/models/blb_lwm-ner-fine.model`) to be trained, unless the model already exists (`overwrite_training` is False), in which case it will just load it. - -#### 2. Train the NER model - -After having instantiated the Recogniser, to train the model, run: -```python= -myner.train() -``` - -Note that if `load_to_hub` is set to True or the model already exists (and `overwrite_training` is set to False), the training will be skipped, even if you call the `train()` method. - -#### 3. Create a NER pipeline - -In order to create a NER pipeline, run: -```python= -myner.pipe = myner.create_pipeline() -``` - -This loads the NER model into a [Transformers pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines), to use it for inference. - -#### 4. Use the NER pipeline - -In order to run the NER pipeline on a sentence, use the `ner_predict()` method of the Recogniser as follows: -```python= -sentence = "I ought to be at Dewsbury Moor." 
-predictions = myner.ner_predict(sentence) -print(predictions) -``` - -This returns all words in the sentence, with their detected entity type, confidence score, and start and end characters in the sentence, as follows: -``` -[ - {'entity': 'O', 'score': 0.9997773766517639, 'word': 'I', 'start': 0, 'end': 1}, - {'entity': 'O', 'score': 0.9997766613960266, 'word': 'ought', 'start': 2, 'end': 7}, - {'entity': 'O', 'score': 0.9997838139533997, 'word': 'to', 'start': 8, 'end': 10}, - {'entity': 'O', 'score': 0.9997853636741638, 'word': 'be', 'start': 11, 'end': 13}, - {'entity': 'O', 'score': 0.9997740387916565, 'word': 'at', 'start': 14, 'end': 16}, - {'entity': 'B-LOC', 'score': 0.9603037536144257, 'word': 'Dewsbury', 'start': 17, 'end': 25}, - {'entity': 'I-LOC', 'score': 0.9753544330596924, 'word': 'Moor', 'start': 26, 'end': 30}, - {'entity': 'O', 'score': 0.9997835755348206, 'word': '.', 'start': 30, 'end': 31} -] -``` - -To return the named entities in a user-friendlier format, run: -```python= -from utils import ner - -# Process predictions: -procpreds = [ - [x["word"], x["entity"], "O", x["start"], x["end"], x["score"]] - for x in predictions -] -# Aggretate mentions: -mentions = ner.aggregate_mentions(procpreds, "pred") -``` - -This returns only the named entities, aggregating multiple tokens together: -``` -[{'mention': 'Dewsbury Moor', - 'start_offset': 5, - 'end_offset': 6, - 'start_char': 17, - 'end_char': 30, - 'ner_score': 0.968, - 'ner_label': 'LOC', - 'entity_link': 'O'}] -``` -[[^Go back to the Table of contents]](#table-of-contents) - -### The Ranker - -The Ranker takes the named entities detected by the Recogniser as input. Given a knowledge base, it ranks the entities according to their string similarity to the named entity, and selects a subset of candidates that will be passed on to the next component, the Linker, to do the disambiguation and select the most likely entity. - -In order to use the Ranker and the Linker, we need a knowledge base, a gazetteer. T-Res uses a gazetteer which combines data from Wikipedia and Wikidata. The steps to create it are described in the [wiki2gaz](https://github.com/Living-with-machines/wiki2gaz) GitHub repository. - -The following files are needed to run the Ranker: -* `wikidata_to_mentions_normalized.json`: dictionary of Wikidata entities (by their QID) mapped to the mentions used in Wikipedia to refer to them (obtained through Wikipedia anchor texts) and the normalised score. For example, the value of entity [Q23183](https://www.wikidata.org/wiki/Q23183) is the following: - ``` - {'Wiltshire, England': 0.005478851632697786, - 'Wilton': 0.00021915406530791147, - 'Wiltshire': 0.9767696690773614, - 'College': 0.00021915406530791147, - 'Wiltshire Council': 0.0015340784571553803, - 'West Wiltshire': 0.00021915406530791147, - 'North Wiltshire': 0.00021915406530791147, - 'Wilts': 0.0015340784571553803, - 'County of Wilts': 0.0026298487836949377, - 'County of Wiltshire': 0.010081087004163929, - 'Wilts.': 0.00021915406530791147, - 'Wiltshire county': 0.00021915406530791147, - 'Wiltshire, United Kingdom': 0.00021915406530791147, - 'Wiltshire plains': 0.00021915406530791147, - 'Wiltshire England': 0.00021915406530791147} - ``` -* `mentions_to_wikidata_normalized.json`: the reverse dictionary to the one above, it maps a mention to all the Wikidata entities that are referred to by this mention in Wikipedia. 
For example, the value of `"Wiltshire"` is: - ``` - {'Q23183': 0.9767696690773614, 'Q55448990': 1.0, 'Q8023421': 0.03125} - ``` - These scores don't add up to one, as they are normalised per entity, therefore indicating how often an entity is referred to by this mention. For example, `Q55448990` is always referred to as `Wiltshire`. - -We provide four different strategies for selecting candidates: -* **`perfectmatch`** retrieves candidates from the knowledge base if one of their alternate names is identical to the detected named entity. For example, given the mention "Wiltshire", the following Wikidata entities will be retrieved: [Q23183](https://www.wikidata.org/wiki/Q23183), [Q55448990](https://www.wikidata.org/wiki/Q55448990), and [Q8023421](https://www.wikidata.org/wiki/Q8023421), because all these entities are referred to as "Wiltshire" in Wikipedia anchor texts. -* **`partialmatch`** retrieves candidates from the knowledge base if there is a (partial) match between the query and the candidate names, based on string overlap. Therefore, the mention "Ashton-under" returns candidates for "Ashton-under-Lyne". -* **`levenshtein`** retrieves candidates from the knowledge base if there is a fuzzy match between the query and the candidate names, based on levenshtein distance. Therefore, if the mention "Wiltshrre" would still return the candidates for "Wiltshire". This method is often quite accurate when it comes to OCR variations, but it is very slow. -* **`deezymatch`** retrieves candidates from the knowledge base if there is a fuzzy match between the query and the candidate names, based on [DeezyMatch](https://github.com/Living-with-machines/DeezyMatch) embeddings. Significantly more complex than the other methods to set up from scratch, but the fastest approach. - -#### 1. Instantiate the Ranker - -To use the Ranker for exact matching (`perfectmatch`) or fuzzy string matching based either on overlap or Levenshtein distance (`partialmatch` and `levenshtein` respectively), instantiate it as follows, changing the **`method`** argument accordingly: - -```python= -from geoparser import ranking - -myranker = ranking.Ranker( - method="perfectmatch", # or "partialmatch" or "levenshtein" - resources_path="resources/wikidata/", - mentions_to_wikidata=dict(), - wikidata_to_mentions=dict(), -) -``` -Note that **`resources_path`** should contain the path to the directory where the resources are stored, namely `wikidata_to_mentions_normalized.json` and `mentions_to_wikidata.json`. The **`mentions_to_wikidata`** and **`wikidata_to_mentions`** dictionaries should be left empty, as they will be populated when the Ranker loads the resources. - -DeezyMatch instantiation is trickier, as it requires training a model that, ideally, should capture the types of string variations that can be found in your data (such as OCR errrors). Using the Ranker, you can: -1. Train a DeezyMatch model from scratch, including generating a string pairs dataset. -2. Train a DeezyMatch model, given an existing string pairs dataset. -3. Use an existing DeezyMatch model. - -See below each of them in detail. - -##### 1. Use an existing DeezyMatch model - -To use an existing DeezyMatch model, you wil need to have the following `resources` file structure (where `wkdtalts` is the name given to the set of all Wikidata alternate names and `w2v_ocr` is the name given to the DeezyMatch model). -``` -toponym-resolution/ - ├── ... 
- ├── resources/ - │ ├── deezymatch/ - │ │ ├── combined/ - │ │ │ └── wkdtalts_w2v_ocr/ - │ │ │ ├── bwd.pt - │ │ │ ├── bwd_id.pt - │ │ │ ├── bwd_items.npy - │ │ │ ├── fwd.pt - │ │ │ ├── fwd_id.pt - │ │ │ ├── fwd_items.npy - │ │ │ └── input_dfm.yaml - │ │ └── models/ - │ │ └── w2v_ocr/ - │ │ ├── input_dfm.yaml - │ │ ├── w2v_ocr.model - │ │ ├── w2v_ocr.model_state_dict - │ │ └── w2v_ocr.vocab - │ ├── models/ - │ ├── news_datasets/ - │ ├── wikidata/ - │ │ ├── mentions_to_wikidata.json - │ │ └── wikidata_to_mentions.json - │ └── wikipedia/ - └── ... -``` - -The Ranker can then be instantiated as follows: -```python= -from pathlib import Path -from geoparser import ranking - -myranker = ranking.Ranker( - # Generic Ranker parameters: - method="deezymatch", - resources_path="resources/wikidata/", - mentions_to_wikidata=dict(), - wikidata_to_mentions=dict(), - # Parameters to create the string pair dataset: - strvar_parameters={ - "overwrite_dataset": False, - }, - # Parameters to train, load and use a DeezyMatch model: - deezy_parameters={ - # Paths and filenames of DeezyMatch models and data: - "dm_path": str(Path("resources/deezymatch/").resolve()), - "dm_cands": "wkdtalts", - "dm_model": "w2v_ocr", - "dm_output": "deezymatch_on_the_fly", - # Ranking measures: - "ranking_metric": "faiss", - "selection_threshold": 25, - "num_candidates": 3, - "search_size": 3, - "verbose": False, - # DeezyMatch training: - "overwrite_training": True, - "do_test": True, - }, -) -``` - -Description of the arguments (to learn more, refer to the [DeezyMatch readme](https://github.com/Living-with-machines/DeezyMatch/blob/master/README.md)): -* **`strvar_parameters`** contains the parameters needed to generate the DeezyMatch training set. In this scenario, the DeezyMatch model is already trained and there is therefore no need to generate the training set. -* **`deezy_parameters`** contains the set of parameters to train or load a DeezyMatch model: - * **`dm_path`**: The path to the folder where the DeezyMatch model and data will be stored. - * **`dm_cands`**: The name given to the set of alternate names from which DeezyMatch will try to find a match for a given mention. - * **`dm_model`**: Name of the DeezyMatch model to train or load. - * **`ranking_metric`** Metric used to - -You can download these resources from: -* `resources/deezymatch/combined/wkdtalts_w2v_ocr/`: **[TODO]** -* `resources/deezymatch/models/w2v_ocr/`: **[TODO]** -* `wikidata/mentions_to_wikidata.json`: **[TODO]** -* `wikidata/wikidata_to_mentions.json`: **[TODO]** - -##### 1. Train a DeezyMatch model from scratch, including generating a string pairs dataset - - - -##### 2. Train a DeezyMatch model, given an existing string pairs dataset - - - - -```python= -myranker = ranking.Ranker( - method="perfectmatch", - resources_path="../resources/wikidata/", - mentions_to_wikidata=dict(), - wikidata_to_mentions=dict(), - # Parameters to create the string pair dataset: - strvar_parameters={ - "overwrite_dataset": False, - }, - deezy_parameters={ - "dm_path": str(Path("../resources/deezymatch/").resolve()), - "dm_cands": "wkdtalts", - "dm_model": "w2v_ocr", - "dm_output": "deezymatch_on_the_fly", - # Ranking measures: - "ranking_metric": "faiss", - "selection_threshold": 25, - "num_candidates": 3, - "search_size": 3, - "verbose": False, - # DeezyMatch training: - "overwrite_training": False, - "do_test": False, - }, -) -``` - -#### 2. Load the resources - -The following line loads the resources (i.e. 
the `mentions-to-wikidata` and `wikidata_to_mentions` dictionaries) required to perform candidate selection and ranking, regardless of the Ranker method. - -```python= -myranker.mentions_to_wikidata = myranker.load_resources() -``` - -#### 3. Train a DeezyMatch model - -The following line will train a DeezyMatch model, given the arguments specified when instantiating the Ranker. - -```python= -myranker.train() -``` - -Note that if the model already exists and overwrite_training is set to False, the training will be skipped, even if you call the train() method. The training will also be skipped if the Ranker is not instantiated for DeezyMatch. - -#### 4. Retrieve candidates for a given mention - -```python= -toponym = "Manchefter" -print(myranker.find_candidates([{"mention": toponym}])[0][toponym]) -``` +Finally, this work has used the [HIPE-scorer](https://github.com/hipe-eval/HIPE-scorer/blob/master/LICENSE) for assessing the performance of T-Res. -[[^Go back to the Table of contents]](#table-of-contents) - -### The Linker - -[[^Go back to the Table of contents]](#table-of-contents) - -### The Pipeline - -[[^Go back to the Table of contents]](#table-of-contents) - -## Installation - -If you want to work directly on the codebase, we suggest to install T-Res following these instructions (which have been tested Linux (ubuntu 20.04)). - -### First, update the system - -First, you need to make sure the system is up to date and all essential libraries are installed. - -``` -sudo apt update -sudo apt install build-essential curl libbz2-dev libffi-dev liblzma-dev libncursesw5-dev libreadline-dev libsqlite3-dev libssl-dev libxml2-dev libxmlsec1-dev llvm make tk-dev wget xz-utils zlib1g-dev -``` - -### Install pyenv - -Then you need to install pyenv, which we use to manage virtual environments: - -``` -curl https://pyenv.run | bash -``` -And also to make sure paths are properly exported: - -``` -echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc -echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc -echo -e 'if command -v pyenv 1>/dev/null 2>&1; then\n eval "$(pyenv init --path)"\nfi' >> ~/.bashrc -``` -Then you can restart your bash session, to make sure all changes are updated: - -``` -source ~/.bashrc -``` -And then you run the following commands to update `pyenv` and create the needed environemnt. - -``` -pyenv update - -pyenv install 3.9.7 -pyenv global 3.9.7 -``` - -### Install poetry - -To manage dipendencies across libraries, we use Poetry. To install it, do the following: - -``` -curl -sSL https://install.python-poetry.org | python3 - -echo 'export PATH=$PATH:$HOME/.poetry/bin' >> ~/.bashrc -``` - -### Project Installation - -You can now clone the repo and `cd` into it: - -``` -git clone https://github.com/Living-with-machines/toponym-resolution.git -cd toponym-resolution -``` - -Explicitly tell poetry to use the python version defined above: - -``` -poetry env use python -``` - -Install all dependencies using `poetry`: - -``` -poetry update -poetry install -``` - -Create a kernel: -``` -poetry run ipython kernel install --user --name= -``` - -### How to use poetry - -To activate the environment: - -``` -poetry shell -``` - -Now you can run a script as usual, for instance : - -``` -python experiments/toponym_resolution.py -``` - -To add a package: - -``` -poetry add [package name] -``` - -To run the Python tests: - -``` -poetry run pytest -``` - -If you want to use Jupyter notebook, run it as usual, and then select the created kernel in "Kernel" > "Change kernel". 
-
-```
-jupyter notebook
-```
-
-### Pre-commit hoooks
-
-In order to guarantee style consistency across our codebase we use a few basic pre-commit hooks.
-
-
-To use them, first run:
-
-```
-poetry run pre-commit install --install-hooks
-```
-
-To run the hooks on all files, you can do:
-
-```
-poetry run pre-commit run --all-files
-```
+## Cite
+
+**[TODO]**
diff --git a/docs/source/conf.py b/docs/source/conf.py
index 08f59861..075d94ce 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -13,15 +13,13 @@
 import os
 import sys

-sys.path.insert(0, os.path.abspath("."))
-sys.path.insert(0, os.path.abspath("../.."))
-sys.path.insert(0, os.path.abspath("../geoparser"))
+sys.path.insert(0, os.path.abspath("../../"))

 # -- Project information -----------------------------------------------------

 project = "T-Res"
-copyright = "2023 Living with Machines"
-author = "Federico Nanni"
+copyright = "2023 The Alan Turing Institute, British Library Board, Queen Mary University of London, King's College London, University of East Anglia, The University of Exeter and the Chancellor, Masters and Scholars of the University of Cambridge"
+author = "Living with Machines"

 # The full version, including alpha/beta/rc tags
 release = "0.1.0"
diff --git a/docs/source/experiments/index.rst b/docs/source/experiments/index.rst
new file mode 100644
index 00000000..b08d24f4
--- /dev/null
+++ b/docs/source/experiments/index.rst
@@ -0,0 +1,51 @@
+Experiments and evaluation
+==========================
+
+Follow these steps to reproduce the experiments in our paper.
+
+1. Obtain the external resources
+--------------------------------
+
+Follow the instructions in the ":doc:`resources`" page in the documentation
+to obtain the resources required for running the experiments.
+
+2. Prepare the data
+-------------------
+
+To create the datasets that we use in the experiments presented in the paper,
+run the following command from the ``./experiments/`` folder:
+
+.. code-block:: bash
+
+   $ python ./prepare_data.py
+
+This script takes care of downloading the LwM and HIPE datasets and formats
+them as needed for the experiments.
+
+3. Run the experiments
+----------------------
+
+To run the experiments, run the following script from the ``./experiments/``
+folder:
+
+.. code-block:: bash
+
+   $ python ./toponym_resolution.py
+
+This script runs all the different scenarios reported in the experiments in
+the paper.
+
+4. Evaluate
+-----------
+
+To evaluate the different approaches and obtain a table with results such as the
+one provided in the paper, go to the ``./evaluation/`` directory. There, you
+should clone the `HIPE scorer <https://github.com/hipe-eval/HIPE-scorer>`_. We
+are using the code version at commit ``50dff4e``, and have added the line
+``return eval_stats`` at the end of the ``get_results()`` function. From
+``./evaluation/``, run the following script to obtain the results in LaTeX
+format:
+
+.. code-block:: bash
+
+   $ python display_results.py
diff --git a/docs/source/getting-started/complete-tour.rst b/docs/source/getting-started/complete-tour.rst
index cef09fc1..f27dee79 100644
--- a/docs/source/getting-started/complete-tour.rst
+++ b/docs/source/getting-started/complete-tour.rst
@@ -1,33 +1,435 @@
-.. _top:
+.. _top-tour:

=================
The complete tour
=================

-The T-Res has three main classes: the Recogniser class (which performs named
-entity recognition---NER), the Ranker class (which performs candidate
-selection and ranking for the named entities identified by the Recogniser),
-and the Linker class (which selectes the most likely candidate from those
-provided by the Ranker). An additional class, the Pipeline, wraps these three
-components into one, therefore making end-to-end T-Res easier to use.
+T-Res has three main classes: the **Recogniser** class (which performs
+toponym recognition, a named entity recognition task), the **Ranker**
+class (which performs candidate selection and ranking for the named entities
+identified by the Recogniser), and the **Linker** class (which selects the
+most likely candidate from those provided by the Ranker).
+
+An additional class, the **Pipeline**, wraps these three components into one,
+therefore making it easier for the user to perform end-to-end entity linking.
+
+In the following sections, we provide a complete tour, including an in-depth
+description of each of the four classes. We recommend that you start with the
+Pipeline, which wraps the three other classes, and refer to the description of
+each of the other classes to learn more about them. We also recommend that
+you first try to run T-Res using the default pipeline, and then adapt it
+to your needs.
+
+.. warning::
+
+   Note that, before being able to run the pipeline, you will need to make sure
+   you have all the required resources. Refer to the ":doc:`resources`" page
+   in the documentation.
+
+The Pipeline
+------------
+
+The Pipeline wraps the Recogniser, the Ranker and the Linker into one object,
+to make it easier to use T-Res for end-to-end entity linking.
+
+1. Instantiate the Pipeline
+###########################
+
+By default, the Pipeline instantiates:
+
+* a Recogniser (from a HuggingFace model),
+* a Ranker (using the ``perfectmatch`` approach), and
+* a Linker (using the ``mostpopular`` approach).
+
+To instantiate the default T-Res pipeline, do:
+
+.. code-block:: python
+
+   from geoparser import pipeline
+
+   geoparser = pipeline.Pipeline()
+
+You can also instantiate a pipeline using a customised Recogniser, Ranker and
+Linker. To see the different options, refer to the sections on instantiating
+each of them: :ref:`Recogniser <The Recogniser>`, :ref:`Ranker <The Ranker>`
+and :ref:`Linker <The Linker>`.
+
+In order to instantiate a pipeline using a customised Recogniser, Ranker and
+Linker, just instantiate them beforehand, and then pass them as arguments to
+the Pipeline, as follows:
+
+.. code-block:: python
+
+   from geoparser import pipeline, recogniser, ranking, linking
+
+   myner = recogniser.Recogniser(...)
+   myranker = ranking.Ranker(...)
+   mylinker = linking.Linker(...)
+
+   geoparser = pipeline.Pipeline(myner=myner, myranker=myranker, mylinker=mylinker)
+
+.. warning::
+
+   Note that the default Pipeline expects to be run from the ``experiments/``
+   or the ``examples/`` folder (or any other folder at the same level). The
+   Pipeline will look for the resources at ``../resources/``. Make sure all
+   the required resources are in the right locations.
+
+.. note::
+
+   If a model needs to be trained, the Pipeline itself will take care of it.
+   Therefore, you should expect that the first time the Pipeline is used (or
+   when you change certain input parameters), T-Res will take a long time to
+   be ready for prediction, as it will first train any models the chosen
+   approaches require.
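+For instance, the following sketch instantiates a pipeline from customised
+components, reusing argument values that appear elsewhere in this tour (the
+NER model name and the Ranker arguments). The ``method`` and ``resources_path``
+arguments passed to the Linker are assumptions, modelled on the default
+``mostpopular`` approach; see the Linker section for the actual options:
+
+.. code-block:: python
+
+   from geoparser import pipeline, recogniser, ranking, linking
+
+   # Load a pre-trained toponym recognition model from the HuggingFace hub:
+   myner = recogniser.Recogniser(
+       model="Livingwithmachines/toponym-19thC-en",
+       load_from_hub=True,
+   )
+
+   # Select candidates by exact string matching:
+   myranker = ranking.Ranker(
+       method="perfectmatch",
+       resources_path="resources/wikidata/",
+   )
+
+   # Disambiguate by picking the most common sense of each toponym
+   # (the arguments below are assumed, mirroring the Ranker's):
+   mylinker = linking.Linker(
+       method="mostpopular",
+       resources_path="resources/",
+   )
+
+   geoparser = pipeline.Pipeline(myner=myner, myranker=myranker, mylinker=mylinker)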
+ +2. Use the Pipeline +################### + +Once instantiated (and once all the models have been trained or loaded, if needed), +the Pipeline can be used to perform end-to-end toponym recognition and linking +(given an input text) or to perform each of the three steps individually: (1) +toponym recognition given an input text, (2) candidate selection given a toponym +or list of toponyms, and (3) toponym disambiguation given the output from the +first two steps. + +End-to-end pipeline +^^^^^^^^^^^^^^^^^^^ + +The Pipeline can be used to perform end-to-end toponym recognition and linking +given an input text, using the ``run_sentence()`` method (which applies the +T-Res pipeline to the input text) or the ``run_text()`` method (which takes +care of splitting a text into sentences, before running ``run_sentence()`` +on each sentence). + +See this with examples: + +.. code-block:: python + + output = geoparser.run_text("Inspector Liddle said: I am an inspector of police, living in the city of Durham.") + +.. code-block:: python + + output = geoparser.run_sentence("Inspector Liddle said: I am an inspector of police, living in the city of Durham.") + +In both cases, the following parameters are optional **[TODO: link to docstrings]**: + +* ``place``: The place of publication associated with the text document as a + human-legible string (e.g. ``"London"``). This defaults to ``""``. +* ``place_wqid``: The Wikidata ID of the place of publication provided in + ``place`` (e.g. ``"Q84"``). This defaults to ``""``. + +For example: + +.. code-block:: python + + output = geoparser.run_text("Inspector Liddle said: I am an inspector of police, living in the city of Durham.", + place="Alston, Cumbria, England", + place_wqid="Q2560190" + ) + +The output of this example is the following: + +.. code-block:: json + + [{"mention": "Durham", + "ner_score": 0.999, + "pos": 74, + "sent_idx": 0, + "end_pos": 80, + "tag": "LOC", + "sentence": "Inspector Liddle said: I am an inspector of police, living in the city of Durham.", + "prediction": "Q179815", + "ed_score": 0.039, + "cross_cand_score": { + "Q179815": 0.396, + "Q23082": 0.327, + "Q49229": 0.141, + "Q5316459": 0.049, + "Q458393": 0.045, + "Q17003433": 0.042, + "Q1075483": 0.0 + }, + "string_match_score": {"Durham": [1.0, ["Q1137286", "Q5316477", "Q752266", "..."]]}, + "prior_cand_score": { + "Q179815": 0.881, + "Q49229": 0.522, + "Q5316459": 0.457, + "Q17003433": 0.455, + "Q23082": 0.313, + "Q458393": 0.295, + "Q1075483": 0.293 + }, + "latlon": [54.783333, -1.566667], + "wkdt_class": "Q515"}] + +Step-by-step pipeline +^^^^^^^^^^^^^^^^^^^^^ + +See how to perform toponym recognition with the Pipeline, with an example: + +.. code-block:: python + + output = geoparser.run_text_recognition( + "Inspector Liddle said: I am an inspector of police, living in the city of Durham.", + place="Alston, Cumbria, England", + place_wqid="Q2560190" + ) + +This is the output for this example: + +.. code-block:: json + + [{"mention": "Durham", + "context": ["", ""], + "candidates": [], + "gold": ["NONE"], + "ner_score": 0.999, + "pos": 74, + "sent_idx": 0, + "end_pos": 80, + "ngram": "Durham", + "conf_md": 0.999, + "tag": "LOC", + "sentence": "Inspector Liddle said: I am an inspector of police, living in the city of Durham.", + "place": "Alston, Cumbria, England", + "place_wqid": "Q2560190" + }] + +See how to perform candidate selection given the output from the previous +step, with an example: + +.. 
code-block:: python + + ner_output = [ + { + 'mention': 'Durham', + 'context': ['', ''], + 'candidates': [], + 'gold': ['NONE'], + 'ner_score': 0.999, + 'pos': 74, + 'sent_idx': 0, + 'end_pos': 80, + 'ngram': 'Durham', + 'conf_md': 0.999, + 'tag': 'LOC', + 'sentence': 'Inspector Liddle said: I am an inspector of police, living in the city of Durham.', + 'place': 'Alston, Cumbria, England', + 'place_wqid': 'Q2560190' + } + ] + + cands = geoparser.run_candidate_selection(ner_output) + +This is the output for this example: + +.. code-block:: json + + {"Durham": + {"Durham": + { + "Score": 1.0, + "Candidates": + { + "Q1137286": 0.022222222222222223, + "Q5316477": 0.3157894736842105, + "Q752266": 0.013513513513513514, + "Q23082": 0.06484443152079093, + } + } + } + } + +Finally, see how to perform toponym disambiguation given the output from +the two previous steps, with an example: + +.. code-block:: python + + ner_output = [ + { + 'mention': 'Durham', + 'context': ['', ''], + 'candidates': [], + 'gold': ['NONE'], + 'ner_score': 0.999, + 'pos': 74, + 'sent_idx': 0, + 'end_pos': 80, + 'ngram': 'Durham', + 'conf_md': 0.999, + 'tag': 'LOC', + 'sentence': 'Inspector Liddle said: I am an inspector of police, living in the city of Durham.', + 'place': 'Alston, Cumbria, England', + 'place_wqid': 'Q2560190' + } + ] + + cands = {'Durham': {'Durham': {'Score': 1.0, + 'Candidates': { + 'Q1137286': 0.022222222222222223, + 'Q5316477': 0.3157894736842105, + 'Q752266': 0.013513513513513514, + 'Q23082': 0.06484443152079093}}}} + + disamb_output = geoparser.run_disambiguation(ner_output, cands) + +This will return the exact same output as running the pipeline end-to-end. + +Description of the output +^^^^^^^^^^^^^^^^^^^^^^^^^ + +The output of running the pipeline (both using the end-to-end method or +in a step-wise manner, regardless of the methods used for each of the +three components), will have the following format: + +.. code-block:: json + + [{"mention": "Durham", + "ner_score": 0.999, + "pos": 74, + "sent_idx": 0, + "end_pos": 80, + "tag": "LOC", + "sentence": "Inspector Liddle said: I am an inspector of police, living in the city of Durham.", + "prediction": "Q179815", + "ed_score": 0.039, + "cross_cand_score": { + "Q179815": 0.396, + "Q23082": 0.327, + "Q49229": 0.141, + "Q5316459": 0.049, + "Q458393": 0.045, + "Q17003433": 0.042, + "Q1075483": 0.0 + }, + "string_match_score": {"Durham": [1.0, ["Q1137286", "Q5316477", "Q752266", "..."]]}, + "prior_cand_score": { + "Q179815": 0.881, + "Q49229": 0.522, + "Q5316459": 0.457, + "Q17003433": 0.455, + "Q23082": 0.313, + "Q458393": 0.295, + "Q1075483": 0.293 + }, + "latlon": [54.783333, -1.566667], + "wkdt_class": "Q515"}] + +Description of the fields: + +* ``mention``: The mention text. +* ``ner_score``: The NER confidence score of the mention. +* ``pos``: The starting position of the mention in the sentence. +* ``sent_idx``: The index of the sentence. +* ``end_pos``: The ending position of the mention in the sentence. +* ``tag``: The NER label of the mention. +* ``sentence``: The input sentence. +* ``prediction``: The predicted entity linking result (a Wikidata QID or NIL). +* ``ed_score``: The entity disambiguation score. +* ``string_match_score``: A dictionary of candidate entities and their string + matching confidence scores. +* ``prior_cand_score``: A dictionary of candidate entities and their prior + confidence scores. +* ``cross_cand_score``: A dictionary of candidate entities and their + cross-candidate confidence scores. 
+* ``latlon``: The latitude and longitude coordinates of the predicted entity.
+* ``wkdt_class``: The Wikidata class of the predicted entity.
+
+Pipeline recommendations
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+* To get started with T-Res, we recommend starting with the default pipeline,
+  as it is significantly less complex than the better-performing approaches.
+* The default pipeline may not be a bad option if you are planning to perform
+  toponym recognition on modern, clean, global data. However, take into account
+  that it uses context-agnostic approaches, which often perform quantitatively
+  quite well simply because the most common sense of a toponym is also the
+  most likely to appear in texts.
+* Running T-Res with DeezyMatch for candidate selection and ``reldisamb`` for
+  entity disambiguation takes considerably longer than using the default
+  pipeline. If you want to run T-Res on a few sentences, you can use the
+  end-to-end ``run_text()`` or ``run_sentence()`` methods. If, however, you
+  have a large number of texts on which to run T-Res, then we recommend that
+  you use the step-wise approach. If done efficiently, this can save a lot
+  of time. Using this approach, you should:
+
+  #. Perform toponym recognition on all the texts,
+  #. Obtain the set of all unique toponyms identified in the full dataset,
+     and perform candidate selection on the unique set of toponyms,
+  #. Perform toponym disambiguation on a per-text basis, passing as an argument
+     the dictionary of candidates returned in the previous step.
+
+  See an example, assuming the dataset is in ``CSV`` format, with one text
+  per row:
+
+  .. code-block:: python
+
+     import pandas as pd
+     from tqdm.auto import tqdm
+     from geoparser import pipeline, ranking, linking, recogniser
+
+     # Register progress_apply on pandas:
+     tqdm.pandas()
+
+     # Load the data:
+     nlp_df = pd.read_csv("1880-1900-LwM-HMD-subsample.csv")
+     location = "London"
+     wikidata_id = "Q84"
+
+     # Instantiate the recogniser, ranker and linker:
+     myner = recogniser.Recogniser(...)
+     myranker = ranking.Ranker(...)
+     mylinker = linking.Linker(...)
+
+     # Instantiate the pipeline:
+     geoparser = pipeline.Pipeline(myner=myner, myranker=myranker, mylinker=mylinker)
+
+     # Find mentions for each text in the dataframe:
+     nlp_df["identified_toponyms"] = nlp_df.progress_apply(
+         lambda x: geoparser.run_text_recognition(
+             x["text"],
+             place_wqid=wikidata_id,
+             place=location,
+         ),
+         axis=1,
+     )
+
+     # Obtain the set of unique mentions in the whole dataset and find their candidates:
+     all_toponyms = [item for l in nlp_df["identified_toponyms"] for item in l]
+     all_cands = geoparser.run_candidate_selection(all_toponyms)
+
+     # Disambiguate the mentions for each text in the dataframe, taking as an input the
+     # recognised mentions and the mention-to-candidate dictionaries:
+     nlp_df["identified_toponyms"] = nlp_df.progress_apply(
+         lambda x: geoparser.run_disambiguation(
+             x["identified_toponyms"],
+             all_cands,
+             place_wqid=wikidata_id,
+             place=location,
+         ),
+         axis=1,
+     )
+
+`back to top <#top-tour>`_
+
+.. _The Recogniser:
+
The Recogniser
--------------

-The Recogniser allows (1) loading an existing model (either directly
-downloading a model from the HuggingFace hub or loading a locally stored NER
-model) and (2) training a new model and loading it if it is already trained.
+The Recogniser performs toponym recognition (i.e. geographic named entity
+recognition). Users can either:
+
+#. Load an existing model (either directly downloading a model from the HuggingFace hub or loading a locally stored NER model), or
+#. Fine-tune a new model on top of a base model and load it once trained (or load it directly if it has already been fine-tuned).
-The following notebooks show examples using the Recogniser: +The following notebooks provide examples of both training or loading a +NER model using the Recogniser, and using it for detecting entities: + +:: -* ``./examples/load_use_ner_model.ipynb`` -* ``./examples/train_use_ner_model.ipynb`` + ./examples/train_use_ner_model.ipynb + ./examples/load_use_ner_model.ipynb 1. Instantiate the Recogniser ############################# -To load an already trained model (both from HuggingFace or a local model), you -can just instantiate the recogniser as follows: +To load an already trained model (both from HuggingFace or a locally stored +pre-trained model), you can just instantiate the recogniser as follows: .. code-block:: python @@ -38,23 +440,24 @@ can just instantiate the recogniser as follows: load_from_hub=True, ) -For example, to load the `dslim/bert-base-NER `_ -NER model from the HuggingFace hub: +For example, in order to load the `Livingwithmachines/toponym-19thC-en +`_ NER model +from the HuggingFace hub, initialise the Recogniser as follows: .. code-block:: python import recogniser myner = recogniser.Recogniser( - model="dslim/bert-base-NER", + model="Livingwithmachines/toponym-19thC-en", load_from_hub=True, ) -To load a NER model that is stored locally (for example, let's suppose we have -a NER model in this relative location -``../resources/models/blb_lwm-ner-fine``), you can also load it in the same -way (notice that ``load_from_hub`` should still be ``True``, probably a better -name would be ``load_from_path``): +You can also load a model that is stored locally in the same way. For example, +let's suppose the user has a NER model stored in the relative location +``../resources/models/blb_lwm-ner-fine``. The user could load it as follows +(notice that ``load_from_hub`` should still be True, a better name for this +would probably be ``load_from_path``): .. code-block:: python @@ -66,9 +469,10 @@ name would be ``load_from_path``): ) Alternatively, you can use the Recogniser to train a new model (and load it, -once it's trained). To instantiate the Recogniser for training a new model and -loading it once it's trained, you can do it as in the example (see the -description of each parameter below): +once it's trained). The model will be trained using HuggingFace's +``transformers`` library. To instantiate the Recogniser for training a new +model and loading it once it's trained, you can do it as in the example +(see the description of each parameter below): .. code-block:: python @@ -78,160 +482,51 @@ description of each parameter below): model="blb_lwm-ner-fine", train_dataset="experiments/outputs/data/lwm/ner_fine_train.json", test_dataset="experiments/outputs/data/lwm/ner_fine_dev.json", - pipe=None, - base_model="khosseini/bert_1760_1900", + base_model="Livingwithmachines/bert_1760_1900", model_path="resources/models/", training_args={ - "learning_rate": 5e-5, - "batch_size": 16, - "num_train_epochs": 4, - "weight_decay": 0.01, + "batch_size": 8, + "num_train_epochs": 10, + "learning_rate": 0.00005, + "weight_decay": 0.0, }, overwrite_training=False, do_test=False, load_from_hub=False, ) -Description of the arguments: - -* ``load_from_hub`` set to ``False`` indicates we're not using an off-the-shelf - model. It will prepare the Recogniser to train a new model, unless the model - already exists or if ``overwrite_training`` is set to ``True``. 
If - ``overwrite_training`` is set to ``False`` and ``load_from_hub`` is set to - ``False``, the Recogniser will be prepared to first try to load the model - and—if it does not exist—will train it. If ``overwrite_training`` is set to - ``True`` and ``load_from_hub`` is set to ``False``, the Recogniser will be - ready to directly try to train a model. -* ``base_model`` is the path to the model that will be used as base to - train our NER model. This can be the path to a HuggingFace model (we are - using `khosseini/bert_1760_1900 `_, - a BERT model trained on 19th Century texts) or the path to a model stored - locally. -* ``train_dataset`` and ``test_dataset`` contain the path to the train - and test data sets necessary for training the NER model. The paths point to a - json file (one for training, one for testing), in which each line is a - dictionary corresponding to a sentence. Each sentence-dictionary has three - key-value pairs: ``id`` is an ID of the sentence (a string), ``tokens`` is - the list of tokens into which the sentence has been split, and ``ner_tags`` - is the list of annotations per token (in BIO format). The length of - ``tokens`` and ``ner_tags`` should always be the same. This is an example of - two lines from either the training or test json files: - - .. code-block:: json - - { - "id":"3896239_29", - "ner_tags": [ - "O", - "B-STREET", - "I-STREET", - "O", - "O", - "O", - "B-BUILDING", - "I-BUILDING", - "O", - "O", - "O", - "O", - "O", - "O", - "O", - "O", - "O", - "O" - ], - "tokens": [ - ",", - "Old", - "Millgate", - ",", - "to", - "the", - "Collegiate", - "Church", - ",", - "where", - "they", - "arrived", - "a", - "little", - "after", - "ten", - "oclock", - "." - ] - } - - { - "id":"8262498_11", - "ner_tags": [ - "O", - "O", - "O", - "O", - "O", - "O", - "O", - "O", - "O", - "O", - "O", - "B-LOC", - "O", - "B-LOC", - "O", - "O", - "O", - "O", - "O", - "O" - ], - "tokens": [ - "On", - "the", - "\u2018", - "JSth", - "November", - "the", - "ship", - "Santo", - "Christo", - ",", - "from", - "Monteveido", - "to", - "Cadiz", - ",", - "with", - "hides", - "and", - "copper", - "." - ] - } - -* ``model_path`` is the path where the Recogniser should store the model, - and ``model`` is the name of the model. The ``pipe`` argument can be - left empty: that's where we will store the NER pipeline, once the model is - trained and loaded. -* The training arguments can be modified in ``training_args``: you can - change the learning rate, batch size, number of training epochs, and weight - decay. -* Finally, ``do_test`` allows you to train a mock model and then load it - (the suffix `_test` will be added to the model name). As mentioned above, - ``overwrite_training`` forces retraining a model, even if a model with - the same name and characteristics already exists. - -This instantiation prepares a new model -(``resources/models/blb_lwm-ner-fine.model``) to be trained, unless the model -already exists (``overwrite_training`` is ``False``), in which case it will -just load it. - -1. Train the NER model +Description of the parameters: + +* ``load_from_hub``: it indicates whether to load a pre-trained NER model. If it is + set to ``False``, the Recogniser will be prepared to train a new model, unless + the model already exists. +* ``overwrite_training``: it indicates whether a model should be re-trained, even if + there already is a model with the same name in the pre-specified output folder. 
+ If ``load_from_hub`` is set to ``False`` and ``overwrite_training`` is also set + to ``False``, then the Recogniser will be prepared to first try to load the model + and---if it does not exist---to train it. If ``overwrite_training`` is set to + ``True``, it will prepare the Recogniser to train a model, even if a model with + the same name already exists. +* ``base_model``: the path to the model that will be used as base to train our NER + model. This can be the path to a HuggingFace model (for example, we are using + `Livingwithmachines/bert_1760_1900 `_, + a BERT model trained on nineteenth-century texts) or the path to a pre-trained + model from a local folder. +* ``train_dataset`` and ``test_dataset``: the path to the train and test data sets + necessary for training the NER model. You can find more information about the + format of this data in the ":doc:`resources`" page in the documentation. +* ``model_path``: the path folder where the Recogniser will store the model (and + try to load it from). +* ``model``: the name of the NER model. +* ``training_args``: the training arguments: the user can change the learning rate, + batch size, number of training epochs, and weight decay. +* ``do_test``: it allows the user to train a mock model and then load it (note that + the suffix ``_test`` will be added to the model name). + +2. Train the NER model ###################### -After having instantiated the Recogniser, to train the model, run: +Once the Recogniser has been initialised, you can train the model by running: .. code-block:: python @@ -241,220 +536,63 @@ Note that if ``load_to_hub`` is set to ``True`` or the model already exists (and ``overwrite_training`` is set to ``False``), the training will be skipped, even if you call the ``train()`` method. -3. Create a NER pipeline -######################## +.. note:: -In order to create a NER pipeline, run: + Note that this step is already taken care of if you use the T-Res ``Pipeline``. -.. code-block:: python - - myner.pipe = myner.create_pipeline() - -This loads the NER model into a -`Transformers pipeline `_, -to use it for inference. - -1. Use the NER pipeline -####################### - -In order to run the NER pipeline on a sentence, use the ``ner_predict()`` -method of the Recogniser as follows: - -.. code-block:: python - - sentence = "I ought to be at Dewsbury Moor." - predictions = myner.ner_predict(sentence) - print(predictions) - -This returns all words in the sentence, with their detected entity type, -confidence score, and start and end characters in the sentence, as follows: - -.. code-block:: json - - [ - { - "entity": "O", - "score": 0.9997773766517639, - "word": "I", - "start": 0, - "end": 1 - }, - { - "entity": "O", "score": 0.9997766613960266, - "word": "ought", - "start": 2, - "end": 7 - }, - { - "entity": "O", - "score": 0.9997838139533997, - "word": "to", - "start": 8, - "end": 10 - }, - { - "entity": "O", - "score": 0.9997853636741638, - "word": "be", - "start": 11, - "end": 13 - }, - { - "entity": "O", - "score": 0.9997740387916565, - "word": "at", - "start": 14, - "end": 16 - }, - { - "entity": "B-LOC", - "score": 0.9603037536144257, - "word": "Dewsbury", - "start": 17, - "end": 25 - }, - { - "entity": "I-LOC", - "score": 0.9753544330596924, - "word": "Moor", - "start": 26, - "end": 30 - }, - { - "entity": "O", - "score": 0.9997835755348206, - "word": ".", - "start": 30, - "end": 3 - 1} - ] - - -To return the named entities in a user-friendlier format, run: - -.. 
code-block:: python +`back to top <#top-tour>`_ - from utils import ner +.. _The Ranker: - # Process predictions: - processed_predictions = [ - [ - x["word"], x["entity"], "O", x["start"], x["end"], x["score"] - ] - for x in predictions - ] - - # Aggretate mentions: - mentions = ner.aggregate_mentions(processed_predictions, "pred") - -This returns only the named entities, aggregating multiple tokens together: - -.. code-block:: json +The Ranker +---------- - [ - { - "mention": "Dewsbury Moor", - "start_offset": 5, - "end_offset": 6, - "start_char": 17, - "end_char": 30, - "ner_score": 0.968, - "ner_label": "LOC", - "entity_link": "O" - } - ] +The Ranker takes the named entities detected by the Recogniser as input. +Given a knowledge base, it ranks the entities names according to their string +similarity to the target named entity, and selects a subset of candidates that +will be passed on to the next component, the Linker, to do the disambiguation +and select the most likely entity. -`back to top <#top>`_ +In order to use the Ranker and the Linker, we need a knowledge base, a gazetteer. +T-Res uses a gazetteer which combines data from Wikipedia and Wikidata. See how +to obtain the Wikidata-based resources in the ":doc:`resources`" page in the +documentation. -The Ranker ----------- +T-Res provides four different strategies for selecting candidates: -The Ranker takes the named entities detected by the Recogniser as input. Given -a knowledge base, it ranks the entities according to their string similarity to -the named entity, and selects a subset of candidates that will be passed on to -the next component, the Linker, to do the disambiguation and select the most -likely entity. - -In order to use the Ranker and the Linker, we need a knowledge base, a -gazetteer. T-Res uses a gazetteer which combines data from Wikipedia and -Wikidata. The steps to create it are described in the -`wiki2gaz `_ GitHub -repository. - -The following files are needed to run the Ranker: - -* ``wikidata_to_mentions_normalized.json``: dictionary of Wikidata entities - (by their QID) mapped to the mentions used in Wikipedia to refer to them - (obtained through Wikipedia anchor texts) and the normalised score. For - example, the value of entity `Q23183 `_ - is the following: - - .. code-block:: json - - { - "Wiltshire, England": 0.005478851632697786, - "Wilton": 0.00021915406530791147, - "Wiltshire": 0.9767696690773614, - "College": 0.00021915406530791147, - "Wiltshire Council": 0.0015340784571553803, - "West Wiltshire": 0.00021915406530791147, - "North Wiltshire": 0.00021915406530791147, - "Wilts": 0.0015340784571553803, - "County of Wilts": 0.0026298487836949377, - "County of Wiltshire": 0.010081087004163929, - "Wilts.": 0.00021915406530791147, - "Wiltshire county": 0.00021915406530791147, - "Wiltshire, United Kingdom": 0.00021915406530791147, - "Wiltshire plains": 0.00021915406530791147, - "Wiltshire England": 0.00021915406530791147 - } - -* ``mentions_to_wikidata_normalized.json``: the reverse dictionary to the one -* above, it maps a mention to all the Wikidata entities that are referred to -* by this mention in Wikipedia. For example, the value of `"Wiltshire"` is: - - .. code-block:: json - - { - "Q23183": 0.9767696690773614, - "Q55448990": 1.0, - "Q8023421": 0.03125 - } - - These scores don't add up to one, as they are normalised per entity, - therefore indicating how often an entity is referred to by this mention. For - example, ``Q55448990`` is always referred to as ``Wiltshire``. 
- -We provide four different strategies for selecting candidates: - -* ``perfectmatch`` retrieves candidates from the knowledge base if one of - their alternate names is identical to the detected named entity. For example, - given the mention "Wiltshire", the following Wikidata entities will be - retrieved: `Q23183 `_, +* ``perfectmatch`` retrieves candidates from the knowledge base if one of their + alternate names is identical to the detected named entity. For example, given + the mention "Wiltshire", the following Wikidata entities will be retrieved: + `Q23183 `_, `Q55448990 `_, and `Q8023421 `_, because all these entities are referred to as "Wiltshire" in Wikipedia anchor texts. -* ``partialmatch`` retrieves candidates from the knowledge base if there is - a (partial) match between the query and the candidate names, based on string +* ``partialmatch`` retrieves candidates from the knowledge base if there is a + (partial) match between the query and the candidate names, based on string overlap. Therefore, the mention "Ashton-under" returns candidates for "Ashton-under-Lyne". -* ``levenshtein`` retrieves candidates from the knowledge base if there is - a fuzzy match between the query and the candidate names, based on levenshtein - distance. Therefore, if the mention "Wiltshrre" would still return the - candidates for "Wiltshire". This method is often quite accurate when it comes - to OCR variations, but it is very slow. +* ``levenshtein`` retrieves candidates from the knowledge base if there is a + fuzzy match between the query and the candidate names, based on levenshtein + distance. Therefore, mention "Wiltshrre" would still return the candidates + for "Wiltshire". This method is often quite accurate when it comes to OCR + variations, but it is very slow. * ``deezymatch`` retrieves candidates from the knowledge base if there is a - fuzzy match between the query and the candidate names, based on - `DeezyMatch `_ embeddings. - Significantly more complex than the other methods to set up from scratch, but - the fastest approach. + fuzzy match between the query and the candidate names, based on similarity + between `DeezyMatch `_ + embeddings. It is significantly more complex than the other methods to set + up from scratch, and you will need to train a DeezyMatch model (which takes + about two hours), but once it is set up, it is the fastest approach (except + for ``perfectmatch``). 1. Instantiate the Ranker ######################### +1.1. Perfectmatch, partialmatch, and levenshtein +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + To use the Ranker for exact matching (``perfectmatch``) or fuzzy string -matching based either on overlap or Levenshtein distance (``partialmatch`` and -``levenshtein`` respectively), instantiate it as follows, changing the +matching based either on overlap or Levenshtein distance (``partialmatch`` +and ``levenshtein`` respectively), instantiate it as follows, changing the ``method`` argument accordingly: .. code-block:: python @@ -464,61 +602,57 @@ matching based either on overlap or Levenshtein distance (``partialmatch`` and myranker = ranking.Ranker( method="perfectmatch", # or "partialmatch" or "levenshtein" resources_path="resources/wikidata/", - mentions_to_wikidata=dict(), - wikidata_to_mentions=dict(), ) Note that ``resources_path`` should contain the path to the directory -where the resources are stored, namely ``wikidata_to_mentions_normalized.json`` -and ``mentions_to_wikidata.json``. 
The ``mentions_to_wikidata`` and
-``wikidata_to_mentions`` dictionaries should be left empty, as they will be
-populated when the Ranker loads the resources.
+where the Wikidata- and Wikipedia-based resources are stored, as described
+in the ":doc:`resources`" page in the documentation.
+
+1.2. DeezyMatch
+^^^^^^^^^^^^^^^
 
 DeezyMatch instantiation is trickier, as it requires training a model that,
 ideally, should capture the types of string variations that can be found in
 your data (such as OCR errors). Using the Ranker, you can:
 
-#. Train a DeezyMatch model from scratch, including generating a string pairs
-   dataset.
-#. Train a DeezyMatch model, given an existing string pairs dataset.
-#. Use an existing DeezyMatch model.
+* **Option 1:** Train a DeezyMatch model, given an existing string pairs
+  dataset.
+* **Option 2:** Train a DeezyMatch model from scratch, including generating
+  a string pairs dataset.
+
+Once a DeezyMatch model has been trained, you can load it and use it. The
+following notebooks provide examples of each case:
+
+::
+
+   ./examples/train_use_deezy_model_1.ipynb # Option 1
+   ./examples/train_use_deezy_model_2.ipynb # Option 2
+   ./examples/train_use_deezy_model_3.ipynb # Load an existing DeezyMatch model.
 
-See below each of them in detail.
+Each option is described in detail below.
 
-1. Use an existing DeezyMatch model
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Option 1. Train a DeezyMatch model from scratch, given an existing string pairs dataset
+"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
 
-To use an existing DeezyMatch model, you wil need to have the following
-``resources`` file structure (where ``wkdtalts`` is the name given to the set
-of all Wikidata alternate names and ``w2v_ocr`` is the name given to the
-DeezyMatch model).
+To train a DeezyMatch model from scratch, using an existing string pairs dataset,
+you will need to have the following ``resources`` file structure (as described in
+the ":doc:`resources`" page in the documentation):
 
 ::
 
-    toponym-resolution/
+    T-RES/
     ├── ...
     ├── resources/
     │   ├── deezymatch/
-    │   │   ├── combined/
-    │   │   │   └── wkdtalts_w2v_ocr/
-    │   │   │       ├── bwd.pt
-    │   │   │       ├── bwd_id.pt
-    │   │   │       ├── bwd_items.npy
-    │   │   │       ├── fwd.pt
-    │   │   │       ├── fwd_id.pt
-    │   │   │       ├── fwd_items.npy
-    │   │   │       └── input_dfm.yaml
-    │   │   └── models/
-    │   │       └── w2v_ocr/
-    │   │           ├── input_dfm.yaml
-    │   │           ├── w2v_ocr.model
-    │   │           ├── w2v_ocr.model_state_dict
-    │   │           └── w2v_ocr.vocab
+    │   │   ├── data/
+    │   │   │   └── w2v_ocr_pairs.txt
+    │   │   └── inputs/
+    │   │       ├── characters_v001.vocab
+    │   │       └── input_dfm.yaml
     │   ├── models/
     │   ├── news_datasets/
     │   ├── wikidata/
-    │   │   ├── mentions_to_wikidata.json
-    │   │   └── wikidata_to_mentions.json
+    │   │   ├── mentions_to_wikidata_normalized.json
+    │   │   └── wikidata_to_mentions_normalized.json
     │   └── wikipedia/
     └── ...
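+
+Before instantiating the Ranker for this option, it can be useful to check that
+the expected files are in place. The following is a minimal sketch (assuming the
+default locations shown in the tree above; the snippet is illustrative and not
+part of the T-Res API):
+
+.. code-block:: python
+
+    from pathlib import Path
+
+    # Default resource locations assumed in this documentation:
+    expected = [
+        "resources/deezymatch/data/w2v_ocr_pairs.txt",
+        "resources/deezymatch/inputs/characters_v001.vocab",
+        "resources/deezymatch/inputs/input_dfm.yaml",
+        "resources/wikidata/mentions_to_wikidata_normalized.json",
+        "resources/wikidata/wikidata_to_mentions_normalized.json",
+    ]
+
+    # Report any file that is missing before training a DeezyMatch model:
+    missing = [p for p in expected if not Path(p).exists()]
+    if missing:
+        print("Missing resources:", *missing, sep="\n  ")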
@@ -533,12 +667,8 @@ The Ranker can then be instantiated as follows: # Generic Ranker parameters: method="deezymatch", resources_path="resources/wikidata/", - mentions_to_wikidata=dict(), - wikidata_to_mentions=dict(), # Parameters to create the string pair dataset: - strvar_parameters={ - "overwrite_dataset": False, - }, + strvar_parameters=dict(), # Parameters to train, load and use a DeezyMatch model: deezy_parameters={ # Paths and filenames of DeezyMatch models and data: @@ -548,69 +678,105 @@ The Ranker can then be instantiated as follows: "dm_output": "deezymatch_on_the_fly", # Ranking measures: "ranking_metric": "faiss", - "selection_threshold": 25, - "num_candidates": 3, - "search_size": 3, + "selection_threshold": 50, + "num_candidates": 1, "verbose": False, # DeezyMatch training: - "overwrite_training": True, - "do_test": True, + "overwrite_training": False, + "do_test": False, }, ) -Description of the arguments (to learn more, refer to the -`DeezyMatch readme `_: +Description of the parameters (to learn more, refer to the `DeezyMatch readme +`_): * ``strvar_parameters`` contains the parameters needed to generate the - DeezyMatch training set. In this scenario, the DeezyMatch model is already - trained and there is therefore no need to generate the training set. -* ``deezy_parameters`` contains the set of parameters to train or load a + DeezyMatch training set. It can be left empty, since the training set + already exists. +* ``deezy_parameters``: contains the set of parameters to train or load a DeezyMatch model: - * ``dm_path``: The path to the folder where the DeezyMatch model and - data will be stored. - * ``dm_cands``: The name given to the set of alternate names from which - DeezyMatch will try to find a match for a given mention. - * ``dm_model``: Name of the DeezyMatch model to train or load. - * ``ranking_metric`` Metric used to TODO - -You can download these resources from: - -* ``resources/deezymatch/combined/wkdtalts_w2v_ocr/``: **[TODO]** -* ``resources/deezymatch/models/w2v_ocr/``: **[TODO]** -* ``wikidata/mentions_to_wikidata.json``: **[TODO]** -* ``wikidata/wikidata_to_mentions.json``: **[TODO]** + * ``dm_path``: The path to the folder where the DeezyMatch model and data will + be stored. + * ``dm_cands``: The name given to the set of alternate names from which DeezyMatch + will try to find a match for a given mention. + * ``dm_model``: Name of the DeezyMatch model to train (or load if the + model already exists). + * ``dm_output``: Name of the DeezyMatch output file (not really needed). + * ``ranking_metric``: DeezyMatch parameter: the metric used to rank the string + variations based on their vectors. + * ``selection_threshold``: DeezyMatch parameter: selection threshold based on + the ranking metric. + * ``num_candidates``: DeezyMatch parameter: maximum number of string variations + that will be retrieved. + * ``verbose``: DeezyMatch parameter: verbose output or not. + * ``overwrite_training``: Whether to overwrite the training of a DeezyMatch + model provided it already exists. + * ``do_test``: Whether to train a model in test mode. + +Option 2. Train a DeezyMatch model from scratch, including generating a string pairs dataset +"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" + +To train a DeezyMatch model from scratch, including generating a string pairs +dataset, you will need to have the following ``resources`` file structure (as +described in the ":doc:`resources`" page in the documentation): -1. 
Train a DeezyMatch model from scratch, including generating a string pairs dataset -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -TODO +:: -2. Train a DeezyMatch model, given an existing string pairs dataset -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + T-RES/ + ├── ... + ├── resources/ + │ ├── deezymatch/ + │ ├── models/ + │ │ └── w2v/ + │ │ ├── w2v_1800s_news + │ │ │ ├── w2v.model + │ │ │ ├── w2v.model.syn1neg.npy + │ │ │ └── w2v.model.wv.vectors.npy + │ │ ├── ... + │ │ └── w2v_1860s_news + │ │ ├── w2v.model + │ │ ├── w2v.model.syn1neg.npy + │ │ └── w2v.model.wv.vectors.npy + │ ├── news_datasets/ + │ ├── wikidata/ + │ │ ├── mentions_to_wikidata_normalized.json + │ │ └── wikidata_to_mentions_normalized.json + │ └── wikipedia/ + └── ... -TODO +The Ranker can then be instantiated as follows: .. code-block:: python + from pathlib import Path + from geoparser import ranking + myranker = ranking.Ranker( - method="perfectmatch", - resources_path="../resources/wikidata/", - mentions_to_wikidata=dict(), - wikidata_to_mentions=dict(), + # Generic Ranker parameters: + method="deezymatch", + resources_path="resources/wikidata/", # Parameters to create the string pair dataset: strvar_parameters={ + "ocr_threshold": 60, + "top_threshold": 85, + "min_len": 5, + "max_len": 15, + "w2v_ocr_path": str(Path("../resources/models/w2v/").resolve()), + "w2v_ocr_model": "w2v_*_news", "overwrite_dataset": False, }, + # Parameters to train, load and use a DeezyMatch model: deezy_parameters={ - "dm_path": str(Path("../resources/deezymatch/").resolve()), + # Paths and filenames of DeezyMatch models and data: + "dm_path": str(Path("resources/deezymatch/").resolve()), "dm_cands": "wkdtalts", "dm_model": "w2v_ocr", "dm_output": "deezymatch_on_the_fly", # Ranking measures: "ranking_metric": "faiss", - "selection_threshold": 25, - "num_candidates": 3, - "search_size": 3, + "selection_threshold": 50, + "num_candidates": 1, "verbose": False, # DeezyMatch training: "overwrite_training": False, @@ -618,17 +784,60 @@ TODO }, ) +Description of the parameters (to learn more, refer to the `DeezyMatch readme +`_): + +* ``strvar_parameters`` contains the parameters needed to generate the + DeezyMatch training set: + + * ``ocr_threshold``: Maximum `FuzzyWuzzy `_ + ratio for two strings to be considered negative variations of each other. + * ``top_threshold``: Minimum `FuzzyWuzzy `_ + ratio for two strings to be considered positive variations of each other. + * ``min_len``: Minimum length for a word to be included in the dataset. + * ``max_len``: Maximum length for a word to be included in the dataset. + * ``w2v_ocr_path``: The path to the word2vec embeddings folders. + * ``w2v_ocr_model``: The folder name of the word2vec embeddings (it can be a + regular expression). + * ``overwrite_dataset``: Whether to overwrite the dataset if it already exists. + +* ``deezy_parameters``: contains the set of parameters to train or load a + DeezyMatch model: + + * ``dm_path``: The path to the folder where the DeezyMatch model and data will + be stored. + * ``dm_cands``: The name given to the set of alternate names from which DeezyMatch + will try to find a match for a given mention. + * ``dm_model``: Name of the DeezyMatch model to train or load. + * ``dm_output``: Name of the DeezyMatch output file (not really needed). + * ``ranking_metric``: DeezyMatch parameter: the metric used to rank the string + variations based on their vectors. 
+ * ``selection_threshold``: DeezyMatch parameter: selection threshold based on
+   the ranking metric.
+ * ``num_candidates``: DeezyMatch parameter: maximum number of string variations
+   that will be retrieved.
+ * ``verbose``: DeezyMatch parameter: verbose output or not.
+ * ``overwrite_training``: Whether to overwrite the training of a DeezyMatch
+   model provided it already exists.
+ * ``do_test``: Whether to train a model in test mode.
+
 2. Load the resources
 #####################
 
-The following line loads the resources (i.e. the ``mentions-to-wikidata`` and
-``wikidata_to_mentions`` dictionaries) required to perform candidate selection
-and ranking, regardless of the Ranker method.
+The following line of code loads the resources (i.e. the
+``mentions_to_wikidata_normalized.json`` and
+``wikidata_to_mentions_normalized.json`` files) into dictionaries. They are
+required in order to perform candidate selection and ranking, regardless of
+the Ranker method.
 
 .. code-block:: python
 
     myranker.mentions_to_wikidata = myranker.load_resources()
 
+.. note::
+
+    Note that this step is already taken care of if you use the ``Pipeline``.
+
 3. Train a DeezyMatch model
 ###########################
 
@@ -639,31 +848,239 @@ when instantiating the Ranker.
 
     myranker.train()
 
-Note that if the model already exists and overwrite_training is set to
-``False``, the training will be skipped, even if you call the train() method.
-The training will also be skipped if the Ranker is not instantiated for
-DeezyMatch.
+Note that if the model already exists and ``overwrite_training`` is set to
+``False``, the training will be skipped, even if you call the ``train()``
+method. The training will also be skipped if the Ranker is instantiated for
+a method other than DeezyMatch.
+
+The resulting model will be stored in the specified path. In this case, the
+resulting DeezyMatch model that the Ranker will use is called ``w2v_ocr``:
+
+::
+
+    T-RES/
+    ├── ...
+    ├── resources/
+    │   ├── deezymatch/
+    │   │   └── models/
+    │   │       └── w2v_ocr/
+    │   │           ├── input_dfm.yaml
+    │   │           ├── w2v_ocr.model
+    │   │           ├── w2v_ocr.model_state_dict
+    │   │           └── w2v_ocr.vocab
+    │   ├── models/
+    │   ├── news_datasets/
+    │   ├── wikidata/
+    │   │   ├── mentions_to_wikidata_normalized.json
+    │   │   └── wikidata_to_mentions_normalized.json
+    │   └── wikipedia/
+    └── ...
+
+.. note::
+
+    Note that this step is already taken care of if you use the ``Pipeline``.
 
 4. Retrieve candidates for a given mention
 ##########################################
 
+In order to use the Ranker to retrieve candidates for a given mention, follow
+the example below. The ``find_candidates`` Ranker method requires that the input
+is a list of dictionaries, where the key is always ``"mention"`` and the value
+is the toponym in question.
+
 .. code-block:: python
 
     toponym = "Manchefter"
     print(myranker.find_candidates([{"mention": toponym}])[0][toponym])
 
-`back to top <#top>`_
+`back to top <#top-tour>`_
+
+.. _The Linker:
 
 The Linker
 ----------
 
-TODO
+The Linker takes as input the set of candidates selected by the Ranker and
+disambiguates them, selecting the best matching entity depending on the
+approach selected for disambiguation.
 
-`back to top <#top>`_
+We provide two different strategies for disambiguation:
 
-The Pipeline
-------------
+* ``mostpopular``: Unsupervised method, which, given a set of candidates
+  for a given mention, returns as a prediction the candidate that is most
+  popular in terms of inlink structure in Wikipedia.
+* ``reldisamb``: Given a set of candidates, this approach uses the
+  `REL re-implementation `_ of the
+  `ment-norm algorithm `_ proposed
+  by Le and Titov (2018) and partially based on Ganea and Hofmann (2017),
+  and adapts it. To learn more, see:
+
+  ::
+
+    Van Hulst, Johannes M., Faegheh Hasibi, Koen Dercksen, Krisztian Balog, and
+    Arjen P. de Vries. "Rel: An entity linker standing on the shoulders of giants."
+    In Proceedings of the 43rd International ACM SIGIR Conference on Research and
+    Development in Information Retrieval, pp. 2197-2200. 2020.
+
+    Le, Phong, and Ivan Titov. "Improving Entity Linking by Modeling Latent Relations
+    between Mentions." In Proceedings of the 56th Annual Meeting of the Association
+    for Computational Linguistics (Volume 1: Long Papers), pp. 1595-1604. 2018.
+
+    Ganea, Octavian-Eugen, and Thomas Hofmann. "Deep Joint Entity Disambiguation
+    with Local Neural Attention." In Proceedings of the 2017 Conference on
+    Empirical Methods in Natural Language Processing, pp. 2619-2629. 2017.
+
+1. Instantiate the Linker
+#########################
+
+1.1. ``mostpopular``
+^^^^^^^^^^^^^^^^^^^^
+
+To use the Linker with the ``mostpopular`` approach, instantiate it as follows:
+
+.. code-block:: python
+
+    from geoparser import linking
+
+    mylinker = linking.Linker(
+        method="mostpopular",
+        resources_path="resources/",
+    )
+
+Description of the parameters:
+
+* ``method``: name of the method, in this case ``mostpopular``.
+* ``resources_path``: path to the resources directory.
+
+Note that ``resources_path`` should contain the path to the directory where
+the resources are stored.
+
+When using the ``mostpopular`` linking approach, the resources folder should at
+least contain the following resources:
+
+::
+
+    T-Res/
+    └── resources/
+        └── wikidata/
+            ├── entity2class.txt
+            ├── mentions_to_wikidata.json
+            └── wikidata_gazetteer.csv
+
+1.2. ``reldisamb``
+^^^^^^^^^^^^^^^^^^
+
+To use the Linker with the ``reldisamb`` approach, instantiate it as follows:
+
+.. code-block:: python
+
+    import sqlite3
+
+    from geoparser import linking
+
+    with sqlite3.connect("resources/rel_db/embeddings_database.db") as conn:
+        cursor = conn.cursor()
+        mylinker = linking.Linker(
+            method="reldisamb",
+            resources_path="resources/",
+            rel_params={
+                "model_path": "resources/models/disambiguation/",
+                "data_path": "experiments/outputs/data/lwm/",
+                "training_split": "originalsplit",
+                "db_embeddings": cursor,
+                "with_publication": True,
+                "without_microtoponyms": True,
+                "do_test": False,
+                "default_publname": "London",
+                "default_publwqid": "Q84",
+            },
+            overwrite_training=False,
+        )
+
+Description of the parameters:
+
+* ``method``: name of the method, in this case ``reldisamb``.
+* ``resources_path``: path to the resources directory.
+* ``overwrite_training``: whether to overwrite the training of the entity
+  disambiguation model provided a model with the same path and name already
+  exists.
+* ``rel_params``: set of parameters specific to the ``reldisamb`` method:
+
+  * ``model_path``: Path to the entity disambiguation model.
+  * ``data_path``: Path to the dataset file ``linking_df_split.tsv`` used for
+    training a model (see information about the dataset in the ":doc:`resources`"
+    page in the documentation).
+  * ``training_split``: Column from the ``linking_df_split.tsv`` file that indicates
+    which documents are used for training, development, and testing (see more
+    information about this in the ":doc:`resources`" page in the documentation).
* ``db_embeddings``: cursor for the embeddings database (see more
+    information about this in the ":doc:`resources`" page in the documentation).
+  * ``with_publication``: whether place of publication should be used as a feature
+    when disambiguating (by adding it as an already disambiguated entity).
+  * ``without_microtoponyms``: whether to filter out microtoponyms or not (i.e.
+    filter out all entities that are not ``LOC``).
+  * ``do_test``: Whether to train an entity disambiguation model in test mode.
+  * ``default_publname``: The default value for the place of publication of
+    the texts. For example, "London". This will be the default publication place
+    name, but you will be able to override it when using the Linker to do predictions.
+    This will be ignored if ``with_publication`` is ``False``.
+  * ``default_publwqid``: The Wikidata ID of the place of publication. For example,
+    ``Q84`` for London. As in ``default_publname``, you will be able to override
+    it at inference time, and it will be ignored if ``with_publication`` is ``False``.
+
+In this way, an entity disambiguation model will be trained unless a model trained
+using the same characteristics already exists (i.e. same candidate ranker method,
+same ``training_split`` column name, and same values for ``with_publication`` and
+``without_microtoponyms``).
+
+When using the ``reldisamb`` linking approach, the resources folder should at
+least contain the following resources:
+
+::
+
+    T-Res/
+    └── resources/
+        ├── wikidata/
+        │   ├── entity2class.txt
+        │   ├── mentions_to_wikidata.json
+        │   └── wikidata_gazetteer.csv
+        └── rel_db/
+            └── embeddings_database.db
+
+
+2. Load the resources
+#####################
+
+The following line of code loads the resources required by the Linker, regardless
+of the Linker method.
+
+.. code-block:: python
+
+    mylinker.linking_resources = mylinker.load_resources()
+
+.. note::
+
+    Note that this step is already taken care of if you use the ``Pipeline``.
+
+3. Train an entity disambiguation model
+#######################################
+
+The following line will train an entity disambiguation model, given the arguments
+specified when instantiating the Linker.
+
+.. code-block:: python
+
+    mylinker.rel_params["ed_model"] = mylinker.train_load_model(myranker)
+
+Note that if the model already exists and ``overwrite_training`` is set to ``False``,
+the training will be skipped, even if you call the ``train_load_model()`` method.
+The training will also be skipped if the Linker is instantiated for ``mostpopular``.
+
+The resulting model will be stored in the location specified when instantiating the
+Linker (i.e. ``resources/models/disambiguation/`` in the example) in a new folder
+whose name combines information about the ranking and linking arguments used to
+train the model.
+
+.. note::
 
-TODO
+    Note that this step is already taken care of if you use the ``Pipeline``.
 
-`back to top <#top>`_
+`back to top <#top-tour>`_
diff --git a/docs/source/getting-started/index.rst b/docs/source/getting-started/index.rst
index d5041cb9..f0a59016 100644
--- a/docs/source/getting-started/index.rst
+++ b/docs/source/getting-started/index.rst
@@ -2,11 +2,10 @@
 Getting started
 ===============
 
-TODO: Text here.
-
 ..
toctree:: :maxdepth: 2 :caption: Table of contents: installation - complete-tour \ No newline at end of file + resources + complete-tour diff --git a/docs/source/getting-started/resources.rst b/docs/source/getting-started/resources.rst new file mode 100644 index 00000000..bec0f83c --- /dev/null +++ b/docs/source/getting-started/resources.rst @@ -0,0 +1,559 @@ +.. _top-resources: + +================================= +Resources and directory structure +================================= + +T-Res requires several resources to work. Some resources can be downloaded +and loaded directly from the web. Others will need to be generated, following +the instructions provided in this section. + +Toponym recognition and disambiguation training data +---------------------------------------------------- + +We provide the dataset we used to train T-Res for the tasks of toponym recognition +(i.e. a named entity recognition task) and toponym disambiguation (i.e. an entity +linking task focused on geographical entities). The dataset is based on the +`TopRes19th dataset `_. + +.. note:: + + You can download the data (in the format required by T-Res) from the `British + Library research repository `_. + +By default, T-Res assumes the files are stored in the following location: + +:: + + T-Res/ + └── experiments/ + └── outputs/ + └── data/ + └── lwm/ + ├── ner_fine_dev.json + ├── ner_fine_test.json + └── linking_df_split.tsv + +Continue reading the sections below to learn more about the datasets, and for a +description of the format expected by T-Res. + +1. Toponym recognition dataset +############################## + +T-Res allows directly loading a pre-trained BERT-based NER model, either locally +or from the HuggingFace models hub. If this is your option, you can skip this +section. Otherwise, if you want to train your own NER model using either our +dataset or a different dataset, you should continue reading. + +T-Res requires that the data for training a NER model is provided as two json files +(one for training, one for testing) in the JSON Lines format, where each line +corresponds to a sentence. Each sentence is a dictionary with three key-value +pairs: ``id`` (an identifier of the sentence, a string), ``tokens`` (the list of +tokens into which the sentence has been split), and ``ner_tags`` (the list of +annotations per token, in the BIO format). The length of ``tokens`` and ``ner_tags`` +is therefore always the same. See below an example of three lines from one of +the JSON files, corresponding to three annotated sentences: + +.. code-block:: json + + {"id":"3896239_29","ner_tags":["O","B-STREET","I-STREET","O","O","O","B-BUILDING","I-BUILDING","O","O","O","O","O","O","O","O","O","O"],"tokens":[",","Old","Millgate",",","to","the","Collegiate","Church",",","where","they","arrived","a","little","after","ten","oclock","."]} + {"id":"8262498_11","ner_tags":["O","O","O","O","O","O","O","O","O","O","O","B-LOC","O","B-LOC","O","O","O","O","O","O"],"tokens":["On","the","'","JSth","November","the","ship","Santo","Christo",",","from","Monteveido","to","Cadiz",",","with","hides","and","copper","."]} + {"id":"10715509_7","ner_tags":["O","O","O","B-LOC","O","O","O","O","O","O","O","O","O","O","O","O"],"tokens":["A","COACH","to","SOUTHAMPTON",",","every","morning","at","a","quarter","before","6",",","Sundays","excepted","."]} + +Note that the list of NER labels will be automatically detected from the training +data. + +2. 
Toponym disambiguation dataset +################################# + +Train and test data examples are required for training a new entity +disambiguation (ED) model. They should be provided in a single tsv file, named +``linking_df_split.tsv``, one document per row, with the following required +columns: + +* ``article_id``: article identifier, which consists of the number in the + document file in the original dataset (for example, the ``article_id`` of + ``1218_Poole1860.tsv`` is ``1218``). +* ``sentences``: list of dictionaries, each dictionary corresponding to a + sentence in the article, with two fields: ``sentence_pos`` (the position + of the sentence in the article) and ``sentence_text`` (the text of the + sentence). For example: + + .. code-block:: json + + [ + { + "sentence_pos": 1, + "sentence_text": "DUKINFIELD. " + }, + { + "sentence_pos": 2, + "sentence_text": "Knutsford Sessions." + }, + { + "sentence_pos": 3, + "sentence_text": "—The servant girl, Eliza Ann Byrom, who stole a quantity of clothes from the house where she lodged, in Dukiafield, was sentenced to two months’ imprisonment. " + } + ] + +* ``annotations``: list of dictionaries containing the annotated place names. + Each dictionary corresponds to a named entity mentioned in the text, with the + following fields (at least): ``mention_pos`` (order of the mention in the article), + ``mention`` (the actual mention), ``entity_type`` (the type of named entity), + ``wkdt_qid`` (the Wikidata ID of the resolved entity), ``mention_start`` + (the character start position of the mention in the sentence), ``mention_end`` + (the character end position of the mention in the sentence), ``sent_pos`` + (the sentence index in which the mention is found). + + For example: + + .. code-block:: json + + [ + { + "mention_pos": 0, + "mention": "DUKINFIELD", + "entity_type": "LOC", + "wkdt_qid": "Q1976179", + "mention_start": 0, + "mention_end": 10, + "sent_pos": 1 + }, + { + "mention_pos": 1, + "mention": "Knutsford", + "entity_type": "LOC", + "wkdt_qid": "Q1470791", + "mention_start": 0, + "mention_end": 9, + "sent_pos": 2 + }, + { + "mention_pos": 2, + "mention": "Dukiafield", + "entity_type": "LOC", + "wkdt_qid": "Q1976179", + "mention_start": 104, + "mention_end": 114, + "sent_pos": 3 + } + ] + +* ``place``: A string containing the place of publication of the newspaper to + which the article belongs. For example, "Manchester" or "Ashton-under-Lyne". + +* ``place_wqid``: A string with the Wikidata ID of the place of publication. + For example, if ``place`` is London UK, then ``place_wqid`` should be ``Q84``. + +Finally, the TSV contains a set of columns which can be used to indicate how +to split the dataset into training (``train``), development (``dev``), testing +(``test``), or documents to leave out (``left_out``). The Linker requires that +the user specifies which column should be used for training the ED model. +The code assumes the following columns: + +* ``originalsplit``: The articles maintain the ``test`` set of the original + dataset. Train is split into ``train`` (0.66) and ``dev`` (0.33). + +* ``apply``: The articles are divided into ``train`` and ``dev``, with no articles + left for testing. This split can be used to train the final entity disambiguation + model, after the experiments. + +* ``withouttest``: This split can be used for development. The articles in the + test set of the original dataset are left out. The training set is split into + ``train``, ``dev`` and ``test``. 
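+
+As an illustration, the following minimal sketch loads ``linking_df_split.tsv``
+with pandas and inspects the ``originalsplit`` column (it assumes the default
+file location described above, and that the list-valued columns are stored as
+serialised Python literals, hence ``ast.literal_eval``):
+
+.. code-block:: python
+
+    import ast
+
+    import pandas as pd
+
+    # One document (newspaper article) per row:
+    df = pd.read_csv(
+        "experiments/outputs/data/lwm/linking_df_split.tsv", sep="\t"
+    )
+
+    # Parse the list-valued columns into Python objects:
+    df["annotations"] = df["annotations"].apply(ast.literal_eval)
+
+    # Number of documents per split in the "originalsplit" column:
+    print(df["originalsplit"].value_counts())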
+ +`back to top <#top-resources>`_ + +Wikipedia- and Wikidata-based resources +--------------------------------------- + +T-Res requires a series of Wikipedia- and Wikidata-based resources: + +* ``mentions_to_wikidata.json`` +* ``mentions_to_wikidata_normalized.json`` +* ``wikidata_to_mentions_normalized.json`` +* ``wikidata_gazetteer.csv`` +* ``entity2class.txt`` + +.. note:: + + These files can be generated using the + `wiki2gaz `_ GitHub + repository (**[coming soon]**). For more information on how they are built, + refer to the ``wiki2gaz`` documentation. + +T-Res assumes these files in the following default location: + +:: + + T-Res/ + └── resources/ + └── wikidata/ + ├── entity2class.txt + ├── mentions_to_wikidata_normalized.json + ├── mentions_to_wikidata.json + ├── wikidata_gazetteer.csv + └── wikidata_to_mentions_normalized.json + +The sections below describe the contents of the files, as well as their +format, in case you prefer to provide your own resources (which should be +in the same format). + +``mentions_to_wikidata.json`` +############################# + +A JSON file consisting of a python dictionary in which the key is a mention +of a place in Wikipedia (by means of an anchor text) and the value is an inner +dictionary, where the inner keys are the QIDs of all Wikidata entities that +can be referred to by the mention in question, and the inner values are the +absolute counts (i.e. the number of times such mention is used in Wikipedia +to refer to this particular entity). + +You can load the dictionary, and access it, as follows: + +:: + + >>> import json + >>> with open('mentions_to_wikidata.json', 'r') as f: + ... mentions_to_wikidata = json.load(f) + ... + >>> mentions_to_wikidata["Wiltshire"] + + +In the example, the value assigned to the key "Wiltshire" is: + +.. code-block:: json + + { + "Q23183": 4457, + "Q55448990": 5, + "Q8023421": 1 + } + +In the example, we see that the mention "Wiltshire" is assigned a mapping +between key ``Q23183`` and value 4457. This means that, on Wikipedia, +"Wiltshire" appears 4457 times to refer to entity `Q23183 +`_ (through the mapping between +Wikidata entity ``Q23183`` and its `corresponding Wikipedia page +`_). + +``mentions_to_wikidata_normalized.json`` +######################################## + +A JSON file containing the normalised version of the ``mentions_to_wikidata.json`` +dictionary. For example, the value of the mention "Wiltshire" is now: + +.. code-block:: json + + { + "Q23183": 0.9767696690773614, + "Q55448990": 1.0, + "Q8023421": 0.03125 + } + +Note that these scores do not add up to one, as they are normalised by entity, +not by mention. They are a measure of how likely an entity is to be referred to +by a mention. In the example, we see that entity ``Q55448990`` is always referred +to as ``Wiltshire``. + +``wikidata_to_mentions_normalized.json`` +######################################## + +A JSON file consisting of a python dictionary in which the key is a Wikidata QID +and the value is an inner dictionary, in which the inner keys are the mentions +used in Wikipedia to refer to such Wikidata entity, and the values are their +relative frequencies. + +You can load the dictionary, and access it, as follows: + +:: + + >>> import json + >>> with open('wikidata_to_mentions_normalized.json', 'r') as f: + ... wikidata_to_mentions_normalized = json.load(f) + ... + >>> wikidata_to_mentions_normalized["Q23183"] + +In this example, the value of entity `Q23183 `_ is: + +.. 
code-block:: json
+
+    {
+        "Wiltshire, England": 0.005478851632697786,
+        "Wilton": 0.00021915406530791147,
+        "Wiltshire": 0.9767696690773614,
+        "College": 0.00021915406530791147,
+        "Wiltshire Council": 0.0015340784571553803,
+        "West Wiltshire": 0.00021915406530791147,
+        "North Wiltshire": 0.00021915406530791147,
+        "Wilts": 0.0015340784571553803,
+        "County of Wilts": 0.0026298487836949377,
+        "County of Wiltshire": 0.010081087004163929,
+        "Wilts.": 0.00021915406530791147,
+        "Wiltshire county": 0.00021915406530791147,
+        "Wiltshire, United Kingdom": 0.00021915406530791147,
+        "Wiltshire plains": 0.00021915406530791147,
+        "Wiltshire England": 0.00021915406530791147
+    }
+
+In this example, we can see that entity ``Q23183`` is referred to as "Wiltshire,
+England" in Wikipedia 0.5% of the time and as "Wiltshire" 97.7% of the time.
+These values add up to one.
+
+``wikidata_gazetteer.csv``
+##########################
+
+A csv file consisting of (at least) the following four columns:
+
+* a Wikidata ID (QID) of a location,
+* its English label,
+* its latitude, and
+* its longitude.
+
+You can load the csv, and show the first five rows, as follows:
+
+::
+
+    >>> import pandas as pd
+    >>> df = pd.read_csv("wikidata_gazetteer.csv")
+    >>> df.head()
+      wikidata_id                     english_label  latitude  longitude
+    0    Q5059107                        Centennial  40.01140  -87.24330
+    1    Q5059144                Centennial Grounds  39.99270  -75.19380
+    2    Q5059153            Centennial High School  40.06170  -83.05780
+    3    Q5059162            Centennial High School  38.30440 -104.63800
+    4    Q5059178  Centennial Memorial Samsung Hall  37.58949  127.03434
+
+Each row corresponds to a Wikidata geographic entity (i.e. a Wikidata entity
+with coordinates).
+
+``entity2class.txt``
+####################
+
+A python dictionary in which each entity in Wikidata is mapped to its most
+common Wikidata class.
+
+You can load the dictionary, and access it, as follows:
+
+::
+
+    >>> import json
+    >>> with open('entity2class.txt', 'r') as f:
+    ...     entity2class = json.load(f)
+    ...
+    >>> entity2class["Q23183"]
+    'Q180673'
+    >>> entity2class["Q84"]
+    'Q515'
+
+For example, Wiltshire (`Q23183 `_) is
+mapped to `Q180673 `_, i.e. "ceremonial
+county of England", whereas London (`Q84 `_)
+is mapped to `Q515 `_, i.e. "city".
+
+`back to top <#top-resources>`_
+
+Entity and word embeddings
+--------------------------
+
+In order to perform toponym linking and resolution using the REL-based approaches,
+T-Res requires a database of word2vec and wiki2vec embeddings. Note that you will
+not need this if you use the ``mostpopular`` disambiguation approach.
+
+By default, T-Res expects a database file called ``embeddings_database.db`` with
+at least one table (``entity_embeddings``) containing at least the following
+columns:
+
+* ``word``: Either a lower-cased token (i.e. a word on Wikipedia) or a Wikidata QID
+  preceded by ``ENTITY/``. The database should also contain the following two wildcard
+  tokens: ``#ENTITY/UNK#`` and ``#WORD/UNK#``.
+* ``emb``: The corresponding word or entity embedding.
+
+Generate the embeddings database
+################################
+
+In our experiments, we derived the embeddings database from REL's shared resources.
+
+.. note::
+
+   We are working towards improving this step in the pipeline. Meanwhile, to generate
+   the ``embeddings_database.db``, please follow these steps:
+
+   #. Make sure you have ``wikidata_gazetteer.csv`` in ``./resources/wikidata/`` (see
+      `above <#wikipedia-and-wikidata-based-resources>`_).
+   #. 
Generate a Wikipedia-to-Wikidata index, following `these instructions
+      `_, and save it as ``./resources/wikipedia/index_enwiki-latest.db``.
+   #. Run `this script `_
+      to create the embeddings database.
+
+You can load the file, and access a token embedding, as follows:
+
+::
+
+    >>> import sqlite3
+    >>> from array import array
+    >>> with sqlite3.connect("embeddings_database.db") as conn:
+    ...     cursor = conn.cursor()
+    ...     result = cursor.execute("SELECT emb FROM entity_embeddings WHERE word='lerwick'").fetchone()
+    ...     result = result if result is None else array("f", result[0]).tolist()
+    ...
+    >>> result
+    [-0.3257000148296356, -0.00989999994635582, -0.13420000672340393, ...]
+
+You can load the file, and access an entity embedding, as follows:
+
+::
+
+    >>> import sqlite3
+    >>> from array import array
+    >>> with sqlite3.connect("embeddings_database.db") as conn:
+    ...     cursor = conn.cursor()
+    ...     result = cursor.execute("SELECT emb FROM entity_embeddings WHERE word='ENTITY/Q84'").fetchone()
+    ...     result = result if result is None else array("f", result[0]).tolist()
+    ...
+    >>> result
+    [-0.014700000174343586, 0.007899999618530273, -0.1808999925851822, ...]
+
+T-Res expects the ``embeddings_database.db`` file to be stored as follows:
+
+::
+
+    T-Res/
+    └── resources/
+        └── rel_db/
+            └── embeddings_database.db
+
+`back to top <#top-resources>`_
+
+DeezyMatch training set
+---------------------------------------
+
+In order to train a DeezyMatch model, a training set consisting of positive and
+negative string pairs is required. We provide a dataset of positive and negative
+OCR variations that can be used to train a DeezyMatch model; the trained model
+can then be used to perform fuzzy string matching and find candidates for
+entity linking.
+
+.. note::
+
+   The DeezyMatch training set can be downloaded from the `British Library research
+   repository `_.
+
+By default, T-Res assumes the DeezyMatch training set to be named ``w2v_ocr_pairs.txt``
+and to be in the following location:
+
+::
+
+    T-Res/
+    └── resources/
+        └── deezymatch/
+            └── data/
+                └── w2v_ocr_pairs.txt
+
+T-Res also provides the option to generate a DeezyMatch training set from
+word2vec embeddings trained on digitised texts. Continue reading the sections
+below for more information about both types of resources.
+
+1. DeezyMatch training set
+##########################
+
+T-Res can directly load the string pairs dataset required to train a new DeezyMatch
+model. By default, the code assumes the dataset to be called ``w2v_ocr_pairs.txt``.
+The dataset consists of three columns: ``word1``, ``word2``, and a boolean describing
+whether ``word2`` is an OCR variation of ``word1``. For example:
+
+    .. code-block::
+
+        could   might   FALSE
+        could   wished  FALSE
+        could   hardly  FALSE
+        could   didnot  FALSE
+        could   never   FALSE
+        could   reusing FALSE
+        could   could   TRUE
+        could   coeld   TRUE
+        could   could   TRUE
+        could   conld   TRUE
+        could   could   TRUE
+        could   couid   TRUE
+
+This dataset has been automatically generated from word2vec embeddings trained on
+digitised historical news texts (i.e. with OCR noise), and has been expanded with
+toponym alternate names extracted from Wikipedia.
+
+The dataset we provide consists of 1,085,514 string pairs.
+
+2. Word2Vec embeddings trained on noisy data
+############################################
+
+The 19thC word2vec embeddings **are not needed** if you already have the DeezyMatch
+training set ``w2v_ocr_pairs.txt`` (described in the `section above
+<#deezymatch-training-set>`_).
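+
+If you already have the pairs file, a quick sanity check might look as follows
+(a sketch that assumes the default location and that the three columns are
+tab-separated; adjust the separator to match your file):
+
+.. code-block:: python
+
+    from collections import Counter
+    from pathlib import Path
+
+    pairs_file = Path("resources/deezymatch/data/w2v_ocr_pairs.txt")
+
+    # Count positive (TRUE) and negative (FALSE) string pairs:
+    labels = Counter()
+    with pairs_file.open(encoding="utf-8") as f:
+        for line in f:
+            word1, word2, label = line.rstrip("\n").split("\t")
+            labels[label] += 1
+
+    print(labels)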
+
+To create a new DeezyMatch training set using T-Res, you need to provide Word2Vec
+models that have been trained on digitised historical news texts. In our experiments,
+we used the embeddings trained on a 4.2-billion-word corpus of 19th-century British
+newspapers using Word2Vec (you can download them from `Zenodo
+`_), but you can also do this with your
+own word2vec embeddings. The embeddings are divided into periods of ten years each.
+By default, T-Res assumes that the word2vec models are stored in
+``./resources/models/w2v/``, in directories named ``w2v_xxxxs_news/``, where
+``xxxx`` corresponds to the decade (e.g. 1800 or 1810) of the models.
+
+See the expected directory structure below:
+
+::
+
+    T-Res/
+    └── resources/
+        └── models/
+            └── w2v/
+                ├── w2v_1800s_news/
+                │   ├── w2v.model
+                │   ├── w2v.model.syn1neg.npy
+                │   └── w2v.model.wv.vectors.npy
+                ├── w2v_1810s_news/
+                │   ├── w2v.model
+                │   ├── w2v.model.syn1neg.npy
+                │   └── w2v.model.wv.vectors.npy
+                └── .../
+
+Summary of resources and directory structure
+--------------------------------------------
+
+In the code and our tutorials, we assume the following directory structure
+for the resources that are required in order to run the pipeline.
+
+::
+
+    T-Res/
+    ├── app/
+    ├── evaluation/
+    ├── examples/
+    ├── experiments/
+    │   └── outputs/
+    │       └── data/
+    │           └── lwm/
+    │               ├── linking_df_split.tsv [*]
+    │               ├── ner_fine_dev.json [*+]
+    │               └── ner_fine_train.json [*+]
+    ├── geoparser/
+    ├── resources/
+    │   ├── deezymatch/
+    │   │   └── data/
+    │   │       └── w2v_ocr_pairs.txt
+    │   ├── models/
+    │   ├── news_datasets/
+    │   ├── rel_db/
+    │   │   └── embeddings_database.db [*+]
+    │   └── wikidata/
+    │       ├── entity2class.txt [*]
+    │       ├── mentions_to_wikidata_normalized.json [*]
+    │       ├── mentions_to_wikidata.json [*]
+    │       ├── wikidata_gazetteer.csv [*]
+    │       └── wikidata_to_mentions_normalized.json [*]
+    ├── tests/
+    └── utils/
+
+Note that an asterisk (``*``) next to a resource means that its path can be
+changed when instantiating the T-Res objects, and a plus sign (``+``) means
+that the name of the file can be changed in the instantiation.
+
+`back to top <#top-resources>`_
diff --git a/docs/source/index.rst b/docs/source/index.rst
index f7a840f1..63c6b1de 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -7,10 +7,10 @@ T-Res: A Toponym Resolution Pipeline for Digitised Historical Newspapers
    :alt: License
 
 T-Res is an end-to-end pipeline for toponym resolution for digitised historical
-newspapers. Given an input text (a sentence or a text), T-Res identifies the
-places that are mentioned in it, links them to their corresponding Wikidata
-IDs, and provides their geographic coordinates. T-Res has been designed to
-tackle common problems of working with digitised historical newspapers.
+newspapers. Given an input text, T-Res identifies the places that are mentioned
+in it, links them to their corresponding Wikidata IDs, and provides their
+geographic coordinates. T-Res has been designed to tackle common problems of
+working with digitised historical newspapers.
 
 The pipeline has three main components:
 
@@ -24,13 +24,13 @@ We also provide the code to deploy T-Res as an API, and show how to use it.
 Each of these elements are described in this documentation.
 
 ..
toctree:: - :maxdepth: 2 + :maxdepth: 3 :caption: Table of contents: getting-started/index reference/index t-res-api/index - other-files/index + experiments/index Indices and tables ================== diff --git a/docs/source/other-files/evaluation/display_results.rst b/docs/source/other-files/evaluation/display_results.rst deleted file mode 100644 index 15c1bf23..00000000 --- a/docs/source/other-files/evaluation/display_results.rst +++ /dev/null @@ -1,4 +0,0 @@ -``evaluation.display_results`` -============================== - -TODO \ No newline at end of file diff --git a/docs/source/other-files/evaluation/index.rst b/docs/source/other-files/evaluation/index.rst deleted file mode 100644 index a034f860..00000000 --- a/docs/source/other-files/evaluation/index.rst +++ /dev/null @@ -1,30 +0,0 @@ -Evaluation -========== - -First, clone the [CLEF-HIPE-2020-scorer](https://github.com/impresso/CLEF-HIPE-2020-scorer) to this folder and checkout [this commit](https://github.com/impresso/CLEF-HIPE-2020-scorer/tree/ac5c876eba58065195024cff550c2b5056986f7b) to have the exact same evaluation setting as in our experiments. - -.. code-block:: bash - - $ git clone https://github.com/impresso/CLEF-HIPE-2020-scorer.git - $ cd CLEF-HIPE-2020-scorer - $ git checkout ac5c876eba58065195024cff550c2b5056986f7b - -Then, to run the script: - -To assess the performance on toponym recognition: - -.. code-block:: bash - - $ python CLEF-HIPE-2020-scorer/clef_evaluation.py --ref ../experiments/outputs/results/lwm-true_bundle2_en_1.tsv --pred ../experiments/outputs/results/lwm-pred_bundle2_en_1.tsv --task nerc_coarse --outdir results/ - -To assess the performance on toponym resolution: - -.. code-block:: bash - - $ python CLEF-HIPE-2020-scorer/clef_evaluation.py --ref ../experiments/outputs/results/lwm-true_bundle2_en_1.tsv --pred ../experiments/outputs/results/lwm-pred_bundle2_en_1.tsv --task nel --outdir results/ - -.. toctree:: - :maxdepth: 2 - :caption: Table of contents: - - display_results \ No newline at end of file diff --git a/docs/source/other-files/experiments/experiment.rst b/docs/source/other-files/experiments/experiment.rst deleted file mode 100644 index a07125b0..00000000 --- a/docs/source/other-files/experiments/experiment.rst +++ /dev/null @@ -1,7 +0,0 @@ -``experiments.experiment`` -========================== - -.. autoclass:: experiments.experiment.Experiment - :members: - :undoc-members: - :show-inheritance: \ No newline at end of file diff --git a/docs/source/other-files/experiments/index.rst b/docs/source/other-files/experiments/index.rst deleted file mode 100644 index b813be55..00000000 --- a/docs/source/other-files/experiments/index.rst +++ /dev/null @@ -1,59 +0,0 @@ -Reproducing the Experiments: ``experiments`` module -=================================================== - -Follow these steps to reproduce the experiments in our paper. - -1. Obtain the external resources [DRAFT] ----------------------------------------- - -You will need the following resources, which are created using the code in the [wiki2gaz](https://github.com/Living-with-machines/wiki2gaz) or can be downloaded from [TODO: add link]: - -.. 
- - ../resources/wikidata/wikidata_gazetteer.csv - ../resources/wikidata/entity2class.txt - ../resources/wikidata/mentions_to_wikidata.json - ../resources/wikidata/mentions_to_wikidata_normalized.json - ../resources/wikidata/wikidata_to_mentions_normalized.json - ../resources/wikipedia/wikidata2wikipedia/index_enwiki-latest.db - -You will also need the [word2vec embeddings](TODO: add link) trained from 19th Century data. These embeddings have been created by Nilo Pedrazzini. For more information, check https://github.com/Living-with-machines/DiachronicEmb-BigHistData. - -2. Preparing the data -------------------------- - -To create the datasets that we use in the experiments presented in the paper, run the following command: - -.. code-block:: bash - - $ python prepare_data.py - -This script takes care of downloading the LwM and HIPE datasets and format them as needed in the experiments. - -3. Running the experiments --------------------------- - -To run the experiments, run the following script: - -.. code-block:: bash - - $ python toponym_resolution.py - -This script does runs for all different scenarios reported in the experiments in the paper. - -4. Evaluate ------------ - -To evaluate the different approaches and obtain a table with results such as the one provided in the paper, go to the `../evaluation/` directory. There, you should clone the [HIPE scorer](https://github.com/hipe-eval/HIPE-scorer). We are using the code version at commit 50dff4e, and have added the line `return eval_stats` at the end of the `get_results()` function. From `../evaluation/`, run the following script to obtain the results in latex format: - -.. code-block:: bash - - $ python display_results.py - -.. toctree:: - :maxdepth: 2 - :caption: Table of contents: - - experiment - prepare_data - toponym_resolution \ No newline at end of file diff --git a/docs/source/other-files/experiments/prepare_data.rst b/docs/source/other-files/experiments/prepare_data.rst deleted file mode 100644 index 5aecfc77..00000000 --- a/docs/source/other-files/experiments/prepare_data.rst +++ /dev/null @@ -1,4 +0,0 @@ -``experiments.prepare_data`` -============================ - -Script... TODO = description. \ No newline at end of file diff --git a/docs/source/other-files/experiments/toponym_resolution.rst b/docs/source/other-files/experiments/toponym_resolution.rst deleted file mode 100644 index dc837c70..00000000 --- a/docs/source/other-files/experiments/toponym_resolution.rst +++ /dev/null @@ -1,4 +0,0 @@ -``experiments.toponym_resolution`` -================================== - -Script... TODO = description. \ No newline at end of file diff --git a/docs/source/other-files/index.rst b/docs/source/other-files/index.rst deleted file mode 100644 index 138f9b68..00000000 --- a/docs/source/other-files/index.rst +++ /dev/null @@ -1,10 +0,0 @@ -Other files in repository -========================= - -.. toctree:: - :maxdepth: 2 - :caption: Table of contents: - - evaluation/index - experiments/index - resources/index \ No newline at end of file diff --git a/docs/source/reference/utils/process_data.rst b/docs/source/reference/utils/process_data.rst index c61e4ab2..1f5f2066 100644 --- a/docs/source/reference/utils/process_data.rst +++ b/docs/source/reference/utils/process_data.rst @@ -17,12 +17,4 @@ .. autofunction:: utils.process_data.prepare_storing_links -.. autofunction:: utils.process_data.load_processed_data - -.. autofunction:: utils.process_data.store_processed_data - -.. autofunction:: utils.process_data.create_mentions_df - .. 
autofunction:: utils.process_data.store_for_scorer - -.. autofunction:: utils.process_data.store_results \ No newline at end of file diff --git a/docs/source/reference/utils/rel/index.rst b/docs/source/reference/utils/rel/index.rst index 35fc92cf..a7391a53 100644 --- a/docs/source/reference/utils/rel/index.rst +++ b/docs/source/reference/utils/rel/index.rst @@ -1,6 +1,24 @@ ``utils.REL`` module ==================== +The scripts included in this module are taken and have been adapted +from the `REL: Radboud Entity Linker `_ +Github repository: Copyright (c) 2020 Johannes Michael van Hulst. +See the `permission notice `_. + +:: + + Reference: + + @inproceedings{vanHulst:2020:REL, + author = {van Hulst, Johannes M. and Hasibi, Faegheh and Dercksen, Koen and Balog, Krisztian and de Vries, Arjen P.}, + title = {REL: An Entity Linker Standing on the Shoulders of Giants}, + booktitle = {Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval}, + series = {SIGIR '20}, + year = {2020}, + publisher = {ACM} + } + .. toctree:: :maxdepth: 2 :caption: Table of contents: @@ -8,4 +26,4 @@ entity_disambiguation mulrel_ranker utils - vocabulary \ No newline at end of file + vocabulary diff --git a/experiments/prepare_data.py b/experiments/prepare_data.py index d3fb6184..c84cbe83 100644 --- a/experiments/prepare_data.py +++ b/experiments/prepare_data.py @@ -14,19 +14,6 @@ from utils import get_data, preprocess_data RANDOM_SEED = 42 -"""Constant representing the random seed used for generating pseudo-random -numbers. - -The `RANDOM_SEED` is a value that initializes the random number generator -algorithm, ensuring that the sequence of random numbers generated remains the -same across different runs of the program. This is useful for achieving -reproducibility in experiments or when consistent random behavior is -desired. - -.. - If this docstring is changed, also make sure to edit linking.py, - rel_utils.py, entity_disambiguation.py. -""" random.seed(RANDOM_SEED) resources = "../resources/" # path to resources diff --git a/geoparser/linking.py b/geoparser/linking.py index 13d8af16..93ab00e4 100644 --- a/geoparser/linking.py +++ b/geoparser/linking.py @@ -12,19 +12,6 @@ tqdm.pandas() RANDOM_SEED = 42 -"""Constant representing the random seed used for generating pseudo-random -numbers. - -The `RANDOM_SEED` is a value that initializes the random number generator -algorithm, ensuring that the sequence of random numbers generated remains the -same across different runs of the program. This is useful for achieving -reproducibility in experiments or when consistent random behavior is -desired. - -.. - If this docstring is changed, also make sure to edit prepare_data.py, - rel_utils.py, entity_disambiguation.py. -""" np.random.seed(RANDOM_SEED) # Add "../" to path to import utils @@ -55,13 +42,16 @@ class Linker: approach. For the default settings, see Notes below. Example: - >>> linker = Linker( - method="mostpopular", - resources_path="/path/to/linking/resources/", - linking_resources={}, - overwrite_training=True, - rel_params={"with_publication": True, "do_test": True} - ) + + .. code-block:: python + + linker = Linker( + method="mostpopular", + resources_path="/path/to/linking/resources/", + linking_resources={}, + overwrite_training=True, + rel_params={"with_publication": True, "do_test": True} + ) Note: @@ -70,45 +60,45 @@ class Linker: a connection to the entity embeddings database is established and a cursor is created: - .. 
code-block:: python - - with sqlite3.connect("../resources/rel_db/embeddings_database.db") as conn: - cursor = conn.cursor() - mylinker = linking.Linker( - method="reldisamb", - resources_path="../resources/", - linking_resources=dict(), - rel_params={ - "model_path": "../resources/models/disambiguation/", - "data_path": "../experiments/outputs/data/lwm/", - "training_split": "", - "db_embeddings": cursor, - "with_publication": wpubl, - "without_microtoponyms": wmtops, - "do_test": False, - "default_publname": "", - "default_publwqid": "", - }, - overwrite_training=False, - ) - - * See below the default settings for ``rel_params``. Note that + .. code-block:: python + + with sqlite3.connect("../resources/rel_db/embeddings_database.db") as conn: + cursor = conn.cursor() + mylinker = linking.Linker( + method="reldisamb", + resources_path="../resources/", + linking_resources=dict(), + rel_params={ + "model_path": "../resources/models/disambiguation/", + "data_path": "../experiments/outputs/data/lwm/", + "training_split": "", + "db_embeddings": cursor, + "with_publication": wpubl, + "without_microtoponyms": wmtops, + "do_test": False, + "default_publname": "", + "default_publwqid": "", + }, + overwrite_training=False, + ) + + See below the default settings for ``rel_params``. Note that `db_embeddings` defaults to None, but it should be assigned a cursor to the entity embeddings database, as described above: - .. code-block:: python - - rel_params: Optional[dict] = { - "model_path": "../resources/models/disambiguation/", - "data_path": "../experiments/outputs/data/lwm/", - "training_split": "originalsplit", - "db_embeddings": None, - "with_publication": True, - "without_microtoponyms": True, - "do_test": False, - "default_publname": "United Kingdom", - "default_publwqid": "Q145", - } + .. code-block:: python + + rel_params: Optional[dict] = { + "model_path": "../resources/models/disambiguation/", + "data_path": "../experiments/outputs/data/lwm/", + "training_split": "originalsplit", + "db_embeddings": None, + "with_publication": True, + "without_microtoponyms": True, + "do_test": False, + "default_publname": "United Kingdom", + "default_publwqid": "Q145", + } """ @@ -148,7 +138,7 @@ def __str__(self) -> str: """ s = ">>> Entity Linking:\n" s += f" * Method: {self.method}\n" - s += " * Overwrite training: {self.overwrite_training}\n" + s += f" * Overwrite training: {self.overwrite_training}\n" return s def load_resources(self) -> dict: diff --git a/utils/REL/entity_disambiguation.py b/utils/REL/entity_disambiguation.py index 785597d8..26a147de 100644 --- a/utils/REL/entity_disambiguation.py +++ b/utils/REL/entity_disambiguation.py @@ -21,19 +21,6 @@ from utils.REL.vocabulary import Vocabulary RANDOM_SEED = 42 -"""Constant representing the random seed used for generating pseudo-random -numbers. - -The `RANDOM_SEED` is a value that initializes the random number generator -algorithm, ensuring that the sequence of random numbers generated remains the -same across different runs of the program. This is useful for achieving -reproducibility in experiments or when consistent random behavior is -desired. - -.. - If this docstring is changed, also make sure to edit prepare_data.py, - linking.py, rel_utils.py. -""" random.seed(RANDOM_SEED) @@ -49,9 +36,14 @@ class EntityDisambiguation: :py:class:`~utils.REL.mulrel_ranker.MulRelRanker` model, for entity disambiguation. - Credit: + .. note:: + + **Credit:** + This class and its methods are adapted from the `REL: Radboud Entity - Linker `_ Github repository. 
+ Linker `_ Github repository: + Copyright (c) 2020 Johannes Michael van Hulst. See the `permission + notice `_. :: @@ -65,6 +57,7 @@ class EntityDisambiguation: year = {2020}, publisher = {ACM} } + """ def __init__(self, db_embs, user_config, reset_embeddings=False): diff --git a/utils/REL/mulrel_ranker.py b/utils/REL/mulrel_ranker.py index b09d0011..675e11bb 100644 --- a/utils/REL/mulrel_ranker.py +++ b/utils/REL/mulrel_ranker.py @@ -9,11 +9,15 @@ class PreRank(torch.nn.Module): PreRank class is used for preranking entities for a given mention by multiplying entity vectors with word vectors. - Credit: + .. note:: + + **Credit:** + This class and its methods are taken (minimally adapted when necessary) from the `REL: Radboud Entity - Linker `_ Github - repository. + Linker `_ Github repository: + Copyright (c) 2020 Johannes Michael van Hulst. See the `permission + notice `_. :: @@ -27,6 +31,7 @@ class PreRank(torch.nn.Module): year = {2020}, publisher = {ACM} } + """ def __init__(self, config, embeddings=None): @@ -64,15 +69,20 @@ class MulRelRanker(torch.nn.Module): """ The MulRelRanker class implements a neural network model for entity disambiguation. - Credit: + .. note:: + + **Credit:** + This class and its methods are taken (minimally adapted when necessary) from the `REL: Radboud Entity - Linker `_ Github - repository, which is based on the ``mulrel-nel`` - approach developed by Le and Titov (2018), whose original - code is available in the `mulrel-nel: Multi-relational Named - Entity Linking `_ Github - repository, and on Ganea and Hofmann (2017). + Linker `_ Github repository: + Copyright (c) 2020 Johannes Michael van Hulst. See the `permission + notice `_. + This is based on the ``mulrel-nel`` approach developed by Le and + Titov (2018), whose original code is available in the + `mulrel-nel: Multi-relational Named Entity Linking + `_ Github repository, and + on Ganea and Hofmann (2017). :: @@ -102,6 +112,7 @@ class MulRelRanker(torch.nn.Module): pages={1595--1604}, year={2018} } + """ def __init__(self, config, device): @@ -243,10 +254,11 @@ def forward( ): """ Responsible for the forward pass of the entity disambiguation model - and produces a ranking of candidates for a given set of mentions. - - ctx_layer refers to function f. See Figure 3 in Le and Titov (2018). - - ent_scores refers to function q. - - score_combine refers to function g. + and produces a ranking of candidates for a given set of mentions: + + * ctx_layer refers to function f. See Figure 3 in Le and Titov (2018). + * ent_scores refers to function q. + * score_combine refers to function g. Returns: Ranking of entities per mention. diff --git a/utils/REL/utils.py b/utils/REL/utils.py index 139b0bc9..39a4797d 100644 --- a/utils/REL/utils.py +++ b/utils/REL/utils.py @@ -20,9 +20,14 @@ def flatten_list_of_lists( >>> print(flatten_list_of_lists(list_of_lists)) ([1, 2, 3, 4, 5, 6], array([0, 3, 5])) - Credit: - This function is taken from the `REL: Radboud Entity Linker - `_ Github repository. + .. note:: + + **Credit:** + + This function is taken from the `REL: Radboud Entity + Linker `_ Github repository: + Copyright (c) 2020 Johannes Michael van Hulst. See the `permission + notice `_. :: @@ -64,9 +69,14 @@ def make_equal_len( >>> print(make_equal_len(lists)) ([[1, 2, 3, 0], [4, 5, 0, 0], [6, 7, 8, 9]], [[1.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]) - Credit: - This function is taken from the `REL: Radboud Entity Linker - `_ Github repository. + .. 
note:: + + **Credit:** + + This function is taken from the `REL: Radboud Entity + Linker `_ Github repository: + Copyright (c) 2020 Johannes Michael van Hulst. See the `permission + notice `_. :: @@ -108,9 +118,14 @@ def is_important_word(s: str) -> bool: >>> print(is_important_word("apple")) True - Credit: - This function is adapted from the `REL: Radboud Entity Linker - `_ Github repository. + .. note:: + + **Credit:** + + This function is adapted from the `REL: Radboud Entity + Linker `_ Github repository: + Copyright (c) 2020 Johannes Michael van Hulst. See the `permission + notice `_. :: diff --git a/utils/REL/vocabulary.py b/utils/REL/vocabulary.py index 35d11163..008eba2d 100644 --- a/utils/REL/vocabulary.py +++ b/utils/REL/vocabulary.py @@ -18,10 +18,15 @@ class Vocabulary: """ A class representing a vocabulary object used for storing references to embeddings. - Credit: - The code for this class and its methods is taken from the `REL: Radboud - Entity Linker `_ Github repository. See - `https://github.com/informagi/REL/blob/main/src/REL/vocabulary.py`_ for more + .. note:: + + **Credit:** + + The code for this class and its methods is taken from the `REL: Radboud Entity + Linker `_ Github repository: Copyright (c) + 2020 Johannes Michael van Hulst. See the `permission notice + `_. See `the original script + `_ for more information. ::