Update docs
mcollardanuy committed Jul 31, 2023
1 parent 830dfe8 commit d5b8c9b
Showing 4 changed files with 188 additions and 54 deletions.
README.md (88 additions, 22 deletions)

## Overview

T-Res is an end-to-end pipeline for toponym detection, linking, and resolution on digitised historical newspapers. Given an input text, T-Res identifies the places that are mentioned in it, links them to their corresponding Wikidata IDs, and provides their geographic coordinates. T-Res has been developed to assist researchers explore large collections of digitised historical newspapers, and has been designed to tackle common problems often found when dealing with this type of data.

The pipeline has three main components:


The three components are used in combination in the **Pipeline** class.

We also provide the code to deploy T-Res as an API, and show how to use it. Each of these elements is described in the documentation.


## Documentation

The T-Res documentation can be found at **[TODO]**.

## Resources and directory structure

T-Res relies on several resources in the following directory structure:

```
T-Res/
├── app/
├── evaluation/
├── examples/
├── experiments/
│ └── outputs/
│ └── data/
│ └── lwm/
│ ├── linking_df_split.tsv [*?]
│ ├── ner_fine_dev.json [*+?]
│ └── ner_fine_train.json [*+?]
├── geoparser/
├── resources/
│ ├── deezymatch/
│ │ └── data/
│ │ └── w2v_ocr_pairs.txt [*+?]
│ ├── models/
│ ├── news_datasets/
│ ├── rel_db/
│ │ └── embeddings_database.db [*+?]
│ └── wikidata/
│ ├── entity2class.txt [*]
│ ├── mentions_to_wikidata_normalized.json [*]
│ ├── mentions_to_wikidata.json [*]
│ ├── wikidata_gazetteer.csv [*]
│ └── wikidata_to_mentions_normalized.json [*]
├── tests/
└── utils/
```

These resources are described in detail in the documentation. A question mark (`?`) indicates resources which are only required for some approaches (for example, the `rel_db/embeddings_database.db` file is only required by the REL-based disambiguation approaches). An asterisk (`*`) next to a resource means that its path can be changed when instantiating the T-Res objects, and a plus sign (`+`) means that the file name can be changed in the instantiation.

By default, T-Res expects to be run from the `experiments/` folder, or a directory at the same level (for example, the `examples/` folder).

## Example

This is an example of how to use the default T-Res pipeline:

```python
from geoparser import pipeline

geoparser = pipeline.Pipeline()

output = geoparser.run_text("She was on a visit at Chippenham.")
```

This returns:

```python
[{'mention': 'Chippenham',
'ner_score': 1.0,
'pos': 22,
'sent_idx': 0,
'end_pos': 32,
'tag': 'LOC',
'sentence': 'She was on a visit at Chippenham.',
'prediction': 'Q775299',
'ed_score': 0.651,
'string_match_score': {'Chippenham': (1.0,
['Q775299',
'Q3138621',
'Q2178517',
'Q1913461',
'Q7592323',
'Q5101644',
'Q67149348'])},
'prior_cand_score': {},
'cross_cand_score': {'Q775299': 0.651,
'Q3138621': 0.274,
'Q2178517': 0.035,
'Q1913461': 0.033,
'Q5101644': 0.003,
'Q7592323': 0.002,
'Q67149348': 0.002},
'latlon': [51.4585, -2.1158],
'wkdt_class': 'Q3957'}]
```
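Each entry in the returned list is a plain Python dictionary, so downstream code can pick out just the fields it needs. For instance, using an abridged copy of the output above:

```python
# Abridged copy of the example output above (only some fields kept).
output = [{
    "mention": "Chippenham",
    "prediction": "Q775299",
    "ed_score": 0.651,
    "latlon": [51.4585, -2.1158],
    "wkdt_class": "Q3957",
}]

# Collect (mention, Wikidata ID, coordinates) triples, e.g. for mapping.
resolved = [(r["mention"], r["prediction"], r["latlon"]) for r in output]
print(resolved)
```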

Note that T-Res allows the user to bring their own knowledge base, and to choose among different approaches for performing each of the steps in the pipeline. Please refer to the documentation to learn how.

## Acknowledgements

We adapt some code from:

Classes, methods and functions that have been taken or adapted from above are credited in the docstrings.

In our experiments, we have used resources built from Wikidata and Wikipedia for linking. In order to assess T-Res performance, we have used the [topRes19th](https://doi.org/10.23636/r7d4-kw08) and the [HIPE-2020](https://impresso.github.io/CLEF-HIPE-2020/datasets.html) datasets, and the [HIPE-scorer](https://github.com/hipe-eval/HIPE-scorer/blob/master/LICENSE) for evaluation.

## Cite

docs/source/getting-started/complete-tour.rst (7 additions, 5 deletions)
The complete tour
=================

T-Res has three main classes: the **Recogniser** class (which performs
toponym recognition, a named entity recognition task), the **Ranker**
class (which performs candidate selection and ranking for the named entities
identified by the Recogniser), and the **Linker** class (which selects the
most likely candidate from those provided by the Ranker).

An additional class, the **Pipeline**, wraps these three components into one,
each of them: :ref:`Recogniser <The Recogniser>`, :ref:`Ranker <The Ranker>`
and :ref:`Linker <The Linker>`.

In order to instantiate a pipeline using a customised Recogniser, Ranker and
Linker, just instantiate them beforehand, and then pass them as arguments to
the Pipeline, as follows:

.. code-block:: python
The Recogniser
--------------

The Recogniser performs toponym recognition (i.e. geographic named entity
recognition), using HuggingFace's ``transformers`` library. Users can either:

#. Load an existing model (either directly downloading a model from the
HuggingFace hub or loading a locally stored NER model), or
#. Fine-tune a new model on top of a base model and loading it, or directly
load it if it is already pre-trained.

The following notebooks provide examples of both training and loading a
NER model using the Recogniser, and using it for detecting entities:
docs/source/getting-started/resources.rst (63 additions, 23 deletions)
Resources and directory structure
=================================


T-Res requires several resources to work. Some resources can be downloaded
and loaded directly from the web. Others will need to be generated, following
the instructions provided in this section. In this page, we describe the format
of the files that are required by T-Res, therefore also giving the user the
option to use their own resources (adapted to T-Res).

Toponym recognition and disambiguation training data
----------------------------------------------------

We provide the dataset we used to train T-Res for the tasks of toponym recognition
(i.e. a named entity recognition task) and toponym disambiguation (i.e. an entity
linking task focused on geographical entities) in English. The dataset is based on the
`TopRes19th dataset <https://openhumanitiesdata.metajnl.com/articles/10.5334/johd.56>`_.

description of the format expected by T-Res.
1. Toponym recognition dataset
##############################

.. note::

    You don't need a toponym recognition dataset if you load a NER model directly
    from the HuggingFace hub, or from a local folder. In that case, you can skip
    this section.

T-Res allows directly loading a pre-trained BERT-based NER model, either locally
or from the HuggingFace models hub. If this is your option, you can skip this
section. Otherwise, if you want to train your own NER model using either our
data.
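The full description of the expected format is in the documentation (collapsed in this diff). Purely as an illustration, a token-level NER training record in the spirit of ``ner_fine_train.json`` might pair tokens with BIO tags; the field names below (``id``, ``tokens``, ``ner_tags``) are assumptions, not the confirmed schema:

```python
import json

# Hypothetical NER training record; field names are assumptions for
# illustration only -- check the documentation for the real schema.
record = json.loads(
    '{"id": "example_0", '
    '"tokens": ["A", "visit", "at", "Chippenham"], '
    '"ner_tags": ["O", "O", "O", "B-LOC"]}'
)
# Tokens and tags must stay aligned one-to-one.
assert len(record["tokens"]) == len(record["ner_tags"])
print(record["ner_tags"][-1])
```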
2. Toponym disambiguation dataset
#################################

.. note::

    You won't need a toponym disambiguation dataset if you use the unsupervised
    approach for linking (i.e. ``mostpopular``). You will need a toponym
    disambiguation dataset if you want to use one of the REL-based approaches.

Train and test data examples are required for training a new entity
disambiguation (ED) model. They should be provided in a single tsv file, named
``linking_df_split.tsv``, one document per row, with the following required
columns:
* ``annotations``: list of dictionaries containing the annotated place names.
Each dictionary corresponds to a named entity mentioned in the text, with (at
least) the following fields: ``mention_pos`` (order of the mention in the article),
``mention`` (the actual mention), ``entity_type`` (the type of named entity),
``wkdt_qid`` (the Wikidata ID of the resolved entity), ``mention_start``
(the character start position of the mention in the sentence), ``mention_end``
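As an illustration, one entry of the ``annotations`` list using the fields described above could look as follows (the values are invented; only the field names come from the text):

```python
# One invented annotation entry; field names follow the description above.
annotation = {
    "mention_pos": 0,          # order of the mention in the article
    "mention": "Chippenham",   # the actual mention
    "entity_type": "LOC",      # the type of named entity
    "wkdt_qid": "Q775299",     # Wikidata ID of the resolved entity
    "mention_start": 22,       # character start position in the sentence
    "mention_end": 32,         # character end position in the sentence
}
print(annotation["wkdt_qid"])
```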
T-Res assumes these files in the following default location:
└── wikidata_to_mentions_normalized.json

The sections below describe the contents of the files, as well as their
format, in case you prefer to provide your own resources (which should
have the same format).

``mentions_to_wikidata.json``
#############################
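The body of this section is collapsed in the diff above. Roughly, the file maps each mention string to the Wikidata entities it can refer to; the exact schema below (mention to a dictionary of QID counts) is an assumption for illustration:

```python
import json

# Hypothetical snippet of mentions_to_wikidata.json (structure assumed:
# mention -> {Wikidata QID -> association count}).
snippet = json.loads('{"Chippenham": {"Q775299": 112, "Q3138621": 4}}')
print(snippet["Chippenham"]["Q775299"])
```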
You can load the csv, and show the first five rows, as follows:
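The loading code itself is collapsed in this diff; a minimal standard-library sketch is below. The column names and the one-row sample are assumptions based on this page's description, standing in for the real ``wikidata_gazetteer.csv``:

```python
import csv
import io

# One-row mock of the gazetteer; the real file lives at
# resources/wikidata/wikidata_gazetteer.csv.
mock_csv = io.StringIO(
    "wikidata_id,english_label,latitude,longitude\n"
    "Q775299,Chippenham,51.4585,-2.1158\n"
)
rows = list(csv.DictReader(mock_csv))
print(rows[0]["english_label"], float(rows[0]["latitude"]))
```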
Each row corresponds to a Wikidata geographic entity (i.e. a Wikidata entity
with coordinates).

.. note::

    Note that the latitude and longitude are not used by the disambiguation
    method: they are only provided as a post-processing step when rendering
    the output of the linking. Therefore, the columns can have dummy values
    (of type ``float``) if the user is not interested in obtaining the
    coordinates: the linking to Wikidata will not be affected. Column
    ``english_label`` can likewise be left empty.

``entity2class.txt``
####################

Expand All @@ -350,14 +373,27 @@ mapped to `Q180673 <https://www.wikidata.org/wiki/Q180673>`_, i.e. "cerimonial
county of England", whereas London (`Q84 <https://www.wikidata.org/wiki/Q84>`_)
is mapped to `Q515 <https://www.wikidata.org/wiki/Q515>`_, i.e. "city".

.. note::

    Note that the entity2class mapping is not used by the disambiguation
    method: the Wikidata class is only provided as a post-processing step
    when rendering the output of the linking. T-Res will complain if the
    file is not there, but values can be left empty if the user is not
    interested in obtaining the Wikidata class of the predicted entity.
    The linking to Wikidata will not be affected.

`back to top <#top-resources>`_

Entity and word embeddings
--------------------------

.. note::

    Note that you will not need this if you use the ``mostpopular``
    disambiguation approach.

In order to perform toponym linking and resolution using the REL-based approaches,
T-Res requires a database of word2vec and wiki2vec embeddings.

By default, T-Res expects a database file called ``embeddings_database.db`` with,
at least, one table (``entity_embeddings``) with at least the following columns:
* ``word``: The word or entity token; the vocabulary includes two special
  tokens: ``#ENTITY/UNK#`` and ``#WORD/UNK#``.
* ``emb``: The corresponding word or entity embedding.


In our experiments, we derived the embeddings database from REL's shared resources.

#. Generate a Wikipedia-to-Wikidata index, following `these instructions
<https://github.com/jcklie/wikimapper#create-your-own-index>`_, save it as: ``./resources/wikipedia/index_enwiki-latest.db``.
#. Run `this script <https://github.com/Living-with-machines/wiki2gaz/blob/main/download_and_merge_embeddings_databases.py>`_
to create the embeddings database (**[coming soon]**).

You can load the file, and access a token embedding, as follows:
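The original example is collapsed in this diff. The sketch below builds a tiny in-memory SQLite database with the table and columns described above (``entity_embeddings`` with ``word`` and ``emb``) instead of opening ``resources/rel_db/embeddings_database.db``; the float32-blob encoding of ``emb`` and the token naming are assumptions:

```python
import array
import sqlite3

# In-memory stand-in for resources/rel_db/embeddings_database.db.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE entity_embeddings (word TEXT PRIMARY KEY, emb BLOB)")

# Store one toy embedding as a float32 blob (encoding is an assumption).
vec = array.array("f", [0.1, 0.2, 0.3])
con.execute(
    "INSERT INTO entity_embeddings VALUES (?, ?)", ("ENTITY/Q84", vec.tobytes())
)

# Retrieve and decode the embedding for one token.
(blob,) = con.execute(
    "SELECT emb FROM entity_embeddings WHERE word = ?", ("ENTITY/Q84",)
).fetchone()
emb = array.array("f", blob)
print(len(emb))
```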

used to perform fuzzy string matching to find candidates for entity linking.

The DeezyMatch training set can be downloaded from the `British Library research
repository <https://bl.iro.bl.uk/concern/datasets/12208b77-74d6-44b5-88f9-df04db881d63>`_.
This dataset is only necessary if you want to use the DeezyMatch approach to perform
candidate selection. This is not needed if you use ``perfectmatch``.

T-Res assumes by default the DeezyMatch training set to be named ``w2v_ocr_pairs.txt``
and to be in the following location:
The dataset we provide consists of 1,085,514 string pairs.
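For orientation, DeezyMatch training data consists of string pairs labelled as matching or not. The two-line sample and tab-separated layout below are assumptions in that spirit, not the confirmed format of ``w2v_ocr_pairs.txt``:

```python
import io

# Hypothetical sample: variant string, reference string, match label.
sample = io.StringIO(
    "Chippenham\tChippenhan\tTRUE\n"
    "Chippenham\tBristol\tFALSE\n"
)
pairs = [line.split("\t") for line in sample.read().splitlines()]
print(len(pairs), pairs[0][2])
```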
2. Word2Vec embeddings trained on noisy data
############################################

.. note::

    The 19thC word2vec embeddings **are not needed** if you already have the
    DeezyMatch training set ``w2v_ocr_pairs.txt`` (described in the `section above
    <#deezymatch-training-set>`_).

To create a new DeezyMatch training set using T-Res, you need to provide Word2Vec
models that have been trained on digitised historical news texts. In our experiments,
for the mentioned resources that are required in order to run the pipeline.
│ └── outputs/
│ └── data/
│ └── lwm/
│ ├── linking_df_split.tsv [*?]
│ ├── ner_fine_dev.json [*+?]
│ └── ner_fine_train.json [*+?]
├── geoparser/
├── resources/
│ ├── deezymatch/
│ │ └── data/
│ │ └── w2v_ocr_pairs.txt [?]
│ ├── models/
│ ├── news_datasets/
│ ├── rel_db/
│ │ └── embeddings_database.db [*+?]
│ └── wikidata/
│ ├── entity2class.txt [*]
│ ├── mentions_to_wikidata_normalized.json [*]
Expand All @@ -552,8 +589,11 @@ for the mentioned resources that are required in order to run the pipeline.
├── tests/
└── utils/

A question mark (``?``) indicates resources which are only required for some
approaches (for example, the ``rel_db/embeddings_database.db`` file is only
required by the REL-based disambiguation approaches). An asterisk (``*``)
next to a resource means that its path can be changed when instantiating the
T-Res objects, and a plus sign (``+``) means that the file name can be
changed in the instantiation.

`back to top <#top-resources>`_