Update docs
mcollardanuy committed Jul 31, 2023
1 parent 830dfe8 commit d5b8c9b
Showing 4 changed files with 188 additions and 54 deletions.
README.md (88 additions, 22 deletions)

## Overview

T-Res is an end-to-end pipeline for toponym detection, linking, and resolution on digitised historical newspapers. Given an input text, T-Res identifies the places that are mentioned in it, links them to their corresponding Wikidata IDs, and provides their geographic coordinates. T-Res has been developed to assist researchers explore large collections of digitised historical newspapers, and has been designed to tackle common problems often found when dealing with this type of data.

The pipeline has three main components:


The three components are used in combination in the **Pipeline** class.

We also provide the code to deploy T-Res as an API, and show how to use it. Each of these elements is described in the documentation.


## Documentation

The T-Res documentation can be found at **[TODO]**.

## Resources and directory structure

T-Res relies on several resources in the following directory structure:

```
T-Res/
├── app/
├── evaluation/
├── examples/
├── experiments/
│ └── outputs/
│ └── data/
│ └── lwm/
│ ├── linking_df_split.tsv [*?]
│ ├── ner_fine_dev.json [*+?]
│ └── ner_fine_train.json [*+?]
├── geoparser/
├── resources/
│ ├── deezymatch/
│ │ └── data/
│ │ └── w2v_ocr_pairs.txt [*+?]
│ ├── models/
│ ├── news_datasets/
│ ├── rel_db/
│ │ └── embeddings_database.db [*+?]
│ └── wikidata/
│ ├── entity2class.txt [*]
│ ├── mentions_to_wikidata_normalized.json [*]
│ ├── mentions_to_wikidata.json [*]
│ ├── wikidata_gazetteer.csv [*]
│ └── wikidata_to_mentions_normalized.json [*]
├── tests/
└── utils/
```

These resources are described in detail in the documentation. A question mark (`?`) indicates resources which are only required for some approaches (for example, the `rel_db/embeddings_database.db` file is only required by the REL-based disambiguation approaches). An asterisk (`*`) next to a resource means that its path can be changed when instantiating the T-Res objects, and a plus sign (`+`) means that the file name can be changed in the instantiation.

By default, T-Res expects to be run from the `experiments/` folder, or a directory at the same level (for example, the `examples/` folder).

## Example

This is an example of how to use the default T-Res pipeline:

```python
from geoparser import pipeline

geoparser = pipeline.Pipeline()

output = geoparser.run_text("She was on a visit at Chippenham.")
```

This returns:

```python
[{'mention': 'Chippenham',
'ner_score': 1.0,
'pos': 22,
'sent_idx': 0,
'end_pos': 32,
'tag': 'LOC',
'sentence': 'She was on a visit at Chippenham.',
'prediction': 'Q775299',
'ed_score': 0.651,
'string_match_score': {'Chippenham': (1.0,
['Q775299',
'Q3138621',
'Q2178517',
'Q1913461',
'Q7592323',
'Q5101644',
'Q67149348'])},
'prior_cand_score': {},
'cross_cand_score': {'Q775299': 0.651,
'Q3138621': 0.274,
'Q2178517': 0.035,
'Q1913461': 0.033,
'Q5101644': 0.003,
'Q7592323': 0.002,
'Q67149348': 0.002},
'latlon': [51.4585, -2.1158],
'wkdt_class': 'Q3957'}]
```
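Each entry in the returned list is a plain Python dictionary, so downstream code can pick out just the fields it needs. For instance, using an abridged copy of the output above:

```python
# Abridged copy of the example output above (only some fields kept).
output = [{
    "mention": "Chippenham",
    "prediction": "Q775299",
    "ed_score": 0.651,
    "latlon": [51.4585, -2.1158],
    "wkdt_class": "Q3957",
}]

# Collect (mention, Wikidata ID, coordinates) triples, e.g. for mapping.
resolved = [(r["mention"], r["prediction"], r["latlon"]) for r in output]
print(resolved)
```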

Note that T-Res allows the user to bring their own knowledge base, and to choose among different approaches for performing each of the steps in the pipeline. Please refer to the documentation to learn how.

## Acknowledgements

We adapt some code from:

Classes, methods and functions that have been taken or adapted from above are credited in the docstrings.

In our experiments, we have used resources built from Wikidata and Wikipedia for linking. In order to assess T-Res performance, we have used the [topRes19th](https://doi.org/10.23636/r7d4-kw08) and the [HIPE-2020](https://impresso.github.io/CLEF-HIPE-2020/datasets.html) datasets, and the [HIPE-scorer](https://github.com/hipe-eval/HIPE-scorer/blob/master/LICENSE) for evaluation.

## Cite

docs/source/getting-started/complete-tour.rst (7 additions, 5 deletions)
The complete tour
=================

T-Res has three main classes: the **Recogniser** class (which performs
toponym recognition, a named entity recognition task), the **Ranker**
class (which performs candidate selection and ranking for the named entities
identified by the Recogniser), and the **Linker** class (which selects the
most likely candidate from those provided by the Ranker).

An additional class, the **Pipeline**, wraps these three components into one,
each of them: :ref:`Recogniser <The Recogniser>`, :ref:`Ranker <The Ranker>`
and :ref:`Linker <The Linker>`.

In order to instantiate a pipeline using a customised Recogniser, Ranker and
Linker, just instantiate them beforehand, and then pass them as arguments to
the Pipeline, as follows:

.. code-block:: python
The Recogniser
--------------

The Recogniser performs toponym recognition (i.e. geographic named entity
recognition), using HuggingFace's ``transformers`` library. Users can either:

#. Load an existing model (either directly downloading a model from the
HuggingFace hub or loading a locally stored NER model), or
#. Fine-tune a new model on top of a base model and loading it, or directly
load it if it is already pre-trained.

The following notebooks provide examples of both training and loading a
NER model using the Recogniser, and using it for detecting entities:
docs/source/getting-started/resources.rst (63 additions, 23 deletions)
Resources and directory structure
=================================


T-Res requires several resources to work. Some resources can be downloaded
and loaded directly from the web. Others will need to be generated, following
the instructions provided in this section. In this page, we describe the format
of the files that are required by T-Res, therefore also giving the user the
option to use their own resources (adapted to T-Res).

Toponym recognition and disambiguation training data
----------------------------------------------------

We provide the dataset we used to train T-Res for the tasks of toponym recognition
(i.e. a named entity recognition task) and toponym disambiguation (i.e. an entity
linking task focused on geographical entities) in English. The dataset is based on the
`TopRes19th dataset <https://openhumanitiesdata.metajnl.com/articles/10.5334/johd.56>`_.

description of the format expected by T-Res.
1. Toponym recognition dataset
##############################

.. note::

    You don't need a toponym recognition dataset if you load a NER model directly
    from the HuggingFace hub, or from a local folder. In that case, you can skip
    this section.

T-Res allows directly loading a pre-trained BERT-based NER model, either locally
or from the HuggingFace models hub. If this is your option, you can skip this
section. Otherwise, if you want to train your own NER model using either our
data.
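The full description of the expected format is in the documentation (collapsed in this diff). Purely as an illustration, a token-level NER training record in the spirit of ``ner_fine_train.json`` might pair tokens with BIO tags; the field names below (``id``, ``tokens``, ``ner_tags``) are assumptions, not the confirmed schema:

```python
import json

# Hypothetical NER training record; field names are assumptions for
# illustration only -- check the documentation for the real schema.
record = json.loads(
    '{"id": "example_0", '
    '"tokens": ["A", "visit", "at", "Chippenham"], '
    '"ner_tags": ["O", "O", "O", "B-LOC"]}'
)
# Tokens and tags must stay aligned one-to-one.
assert len(record["tokens"]) == len(record["ner_tags"])
print(record["ner_tags"][-1])
```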
2. Toponym disambiguation dataset
#################################

.. note::

    You won't need a toponym disambiguation dataset if you use the unsupervised
    approach for linking (i.e. ``mostpopular``). You will need a toponym
    disambiguation dataset if you want to use one of the REL-based approaches.

Train and test data examples are required for training a new entity
disambiguation (ED) model. They should be provided in a single tsv file, named
``linking_df_split.tsv``, one document per row, with the following required
columns:
* ``annotations``: list of dictionaries containing the annotated place names.
Each dictionary corresponds to a named entity mentioned in the text, with (at
least) the following fields: ``mention_pos`` (order of the mention in the article),
``mention`` (the actual mention), ``entity_type`` (the type of named entity),
``wkdt_qid`` (the Wikidata ID of the resolved entity), ``mention_start``
(the character start position of the mention in the sentence), ``mention_end``
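As an illustration, one entry of the ``annotations`` list using the fields described above could look as follows (the values are invented; only the field names come from the text):

```python
# One invented annotation entry; field names follow the description above.
annotation = {
    "mention_pos": 0,          # order of the mention in the article
    "mention": "Chippenham",   # the actual mention
    "entity_type": "LOC",      # the type of named entity
    "wkdt_qid": "Q775299",     # Wikidata ID of the resolved entity
    "mention_start": 22,       # character start position in the sentence
    "mention_end": 32,         # character end position in the sentence
}
print(annotation["wkdt_qid"])
```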
T-Res assumes these files in the following default location:
└── wikidata_to_mentions_normalized.json

The sections below describe the contents of the files, as well as their
format, in case you prefer to provide your own resources (which should
have the same format).

``mentions_to_wikidata.json``
#############################
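The body of this section is collapsed in the diff above. Roughly, the file maps each mention string to the Wikidata entities it can refer to; the exact schema below (mention to a dictionary of QID counts) is an assumption for illustration:

```python
import json

# Hypothetical snippet of mentions_to_wikidata.json (structure assumed:
# mention -> {Wikidata QID -> association count}).
snippet = json.loads('{"Chippenham": {"Q775299": 112, "Q3138621": 4}}')
print(snippet["Chippenham"]["Q775299"])
```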
You can load the csv, and show the first five rows, as follows:
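The loading code itself is collapsed in this diff; a minimal standard-library sketch is below. The column names and the one-row sample are assumptions based on this page's description, standing in for the real ``wikidata_gazetteer.csv``:

```python
import csv
import io

# One-row mock of the gazetteer; the real file lives at
# resources/wikidata/wikidata_gazetteer.csv.
mock_csv = io.StringIO(
    "wikidata_id,english_label,latitude,longitude\n"
    "Q775299,Chippenham,51.4585,-2.1158\n"
)
rows = list(csv.DictReader(mock_csv))
print(rows[0]["english_label"], float(rows[0]["latitude"]))
```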
Each row corresponds to a Wikidata geographic entity (i.e. a Wikidata entity
with coordinates).

.. note::

    Note that the latitude and longitude are not used by the disambiguation
    method: they are only provided as a post-processing step when rendering
    the output of the linking. Therefore, the columns can have dummy values
    (of type ``float``) if the user is not interested in obtaining the
    coordinates: the linking to Wikidata will not be affected. Column
    ``english_label`` can likewise be left empty.

``entity2class.txt``
####################

Expand All @@ -350,14 +373,27 @@ mapped to `Q180673 <https://www.wikidata.org/wiki/Q180673>`_, i.e. "cerimonial
county of England", whereas London (`Q84 <https://www.wikidata.org/wiki/Q84>`_)
is mapped to `Q515 <https://www.wikidata.org/wiki/Q515>`_, i.e. "city".

.. note::

    Note that the entity2class mapping is not used by the disambiguation
    method: the Wikidata class is only provided as a post-processing step
    when rendering the output of the linking. T-Res will complain if the
    file is not there, but values can be left empty if the user is not
    interested in obtaining the Wikidata class of the predicted entity.
    The linking to Wikidata will not be affected.

`back to top <#top-resources>`_

Entity and word embeddings
--------------------------

.. note::

    Note that you will not need this if you use the ``mostpopular``
    disambiguation approach.

In order to perform toponym linking and resolution using the REL-based approaches,
T-Res requires a database of word2vec and wiki2vec embeddings.

By default, T-Res expects a database file called ``embeddings_database.db`` with,
at least, one table (``entity_embeddings``) with at least the following columns:
* ``word``: The word or entity token; the vocabulary includes two special
  tokens: ``#ENTITY/UNK#`` and ``#WORD/UNK#``.
* ``emb``: The corresponding word or entity embedding.


In our experiments, we derived the embeddings database from REL's shared resources.

#. Generate a Wikipedia-to-Wikidata index, following `these instructions
<https://github.com/jcklie/wikimapper#create-your-own-index>`_, save it as: ``./resources/wikipedia/index_enwiki-latest.db``.
#. Run `this script <https://github.com/Living-with-machines/wiki2gaz/blob/main/download_and_merge_embeddings_databases.py>`_
to create the embeddings database (**[coming soon]**).

You can load the file, and access a token embedding, as follows:
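The original example is collapsed in this diff. The sketch below builds a tiny in-memory SQLite database with the table and columns described above (``entity_embeddings`` with ``word`` and ``emb``) instead of opening ``resources/rel_db/embeddings_database.db``; the float32-blob encoding of ``emb`` and the token naming are assumptions:

```python
import array
import sqlite3

# In-memory stand-in for resources/rel_db/embeddings_database.db.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE entity_embeddings (word TEXT PRIMARY KEY, emb BLOB)")

# Store one toy embedding as a float32 blob (encoding is an assumption).
vec = array.array("f", [0.1, 0.2, 0.3])
con.execute(
    "INSERT INTO entity_embeddings VALUES (?, ?)", ("ENTITY/Q84", vec.tobytes())
)

# Retrieve and decode the embedding for one token.
(blob,) = con.execute(
    "SELECT emb FROM entity_embeddings WHERE word = ?", ("ENTITY/Q84",)
).fetchone()
emb = array.array("f", blob)
print(len(emb))
```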

used to perform fuzzy string matching to find candidates for entity linking.

The DeezyMatch training set can be downloaded from the `British Library research
repository <https://bl.iro.bl.uk/concern/datasets/12208b77-74d6-44b5-88f9-df04db881d63>`_.
This dataset is only necessary if you want to use the DeezyMatch approach to perform
candidate selection. This is not needed if you use ``perfectmatch``.

T-Res assumes by default the DeezyMatch training set to be named ``w2v_ocr_pairs.txt``
and to be in the following location:
The dataset we provide consists of 1,085,514 string pairs.
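For orientation, DeezyMatch training data consists of string pairs labelled as matching or not. The two-line sample and tab-separated layout below are assumptions in that spirit, not the confirmed format of ``w2v_ocr_pairs.txt``:

```python
import io

# Hypothetical sample: variant string, reference string, match label.
sample = io.StringIO(
    "Chippenham\tChippenhan\tTRUE\n"
    "Chippenham\tBristol\tFALSE\n"
)
pairs = [line.split("\t") for line in sample.read().splitlines()]
print(len(pairs), pairs[0][2])
```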
2. Word2Vec embeddings trained on noisy data
############################################

.. note::

    The 19thC word2vec embeddings **are not needed** if you already have the
    DeezyMatch training set ``w2v_ocr_pairs.txt`` (described in the `section above
    <#deezymatch-training-set>`_).

To create a new DeezyMatch training set using T-Res, you need to provide Word2Vec
models that have been trained on digitised historical news texts. In our experiments,
for the mentioned resources that are required in order to run the pipeline.
│ └── outputs/
│ └── data/
│ └── lwm/
│ ├── linking_df_split.tsv [*?]
│ ├── ner_fine_dev.json [*+?]
│ └── ner_fine_train.json [*+?]
├── geoparser/
├── resources/
│ ├── deezymatch/
│ │ └── data/
│ │ └── w2v_ocr_pairs.txt [?]
│ ├── models/
│ ├── news_datasets/
│ ├── rel_db/
│ │ └── embeddings_database.db [*+?]
│ └── wikidata/
│ ├── entity2class.txt [*]
│ ├── mentions_to_wikidata_normalized.json [*]
Expand All @@ -552,8 +589,11 @@ for the mentioned resources that are required in order to run the pipeline.
├── tests/
└── utils/

A question mark (``?``) indicates resources which are only required for some
approaches (for example, the ``rel_db/embeddings_database.db`` file is only
required by the REL-based disambiguation approaches). An asterisk (``*``)
next to a resource means that its path can be changed when instantiating the
T-Res objects, and a plus sign (``+``) means that the file name can be
changed in the instantiation.

`back to top <#top-resources>`_