Update example notebooks
mcollardanuy committed Jul 31, 2023
1 parent 830dfe8 commit 31503a9
Showing 12 changed files with 100 additions and 141 deletions.
2 changes: 1 addition & 1 deletion examples/load_use_ner_model.ipynb
@@ -7,7 +7,7 @@
"source": [
"# Loading and using a NER model\n",
"\n",
"This notebook shows how to load an existing named entity recognition (NER) model from the HuggingFace hub.\n",
"This notebook shows how to load an existing named entity recognition (NER) model from the HuggingFace hub, using T-Res.\n",
"\n",
"We start by importing some libraries, and the `recogniser` script from the `geoparser` folder:"
]
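For orientation, this is roughly what loading such a NER model from the HuggingFace hub looks like with the `transformers` library alone; the notebook itself goes through T-Res's `recogniser` script (not shown in this hunk), and the model ID below is the one named in the pipeline notebook later in this commit, so treat it as an illustrative assumption.

```python
# Hedged sketch, not part of the diff: pulling the NER model straight from the
# HuggingFace hub with transformers. T-Res wraps this step in its `recogniser` module.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="Livingwithmachines/toponym-19thC-en",  # model ID taken from run_pipeline_basic.ipynb below
    aggregation_strategy="simple",  # merge word pieces into whole mentions
)
print(ner("A remarkable case of rattening has just occurred in the building trade at Sheffield."))
```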
78 changes: 71 additions & 7 deletions examples/run_pipeline_basic.ipynb
@@ -28,7 +28,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Once the `pipeline` script has been imported (in the previous cell), we create a new object of the `Pipeline` class. Since we don't pass any parameters, it will take all the default values: it will detect toponyms using the fine-grained tagset, it will find candidates using the perfect match approach, and will disambiguate them using the most popular approach. You can see the default `Pipeline` values [here](https://github.com/Living-with-machines/toponym-resolution/blob/main/geoparser/pipeline.py)."
"Once the `pipeline` script has been imported (in the previous cell), we create a new object of the `Pipeline` class. Since we don't pass any parameters, it will take all the default values: it will detect toponyms using `Livingwithmachines/toponym-19thC-en` NER model, it will find candidates using the perfect match approach, and will disambiguate them using the most popular approach. You can see the default `Pipeline` values [here](https://living-with-machines.github.io/T-Res/reference/geoparser/pipeline.html)."
]
},
{
@@ -40,6 +40,13 @@
"geoparser = pipeline.Pipeline()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using the pipeline: end-to-end"
]
},
{
"attachments": {},
"cell_type": "markdown",
@@ -54,10 +61,8 @@
"metadata": {},
"outputs": [],
"source": [
"resolved = geoparser.run_text(\"A remarkable case of rattening has just occurred in the building trade at Shefrield, but also in Lancaster. Not in Nottingham though. Not in Ashton either, nor in Salop!\")\n",
" \n",
"for r in resolved:\n",
" print(r)"
"resolved = geoparser.run_text(\"A remarkable case of rattening has just occurred in the building trade at Sheffield.\")\n",
"print(resolved)"
]
},
{
@@ -67,8 +72,67 @@
"outputs": [],
"source": [
"resolved = geoparser.run_sentence(\"A remarkable case of rattening has just occurred in the building trade at Sheffield.\")\n",
"for r in resolved:\n",
" print(r)"
"print(resolved)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using the pipeline: step-wise"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instead of using the end-to-end pipeline, the pipeline can be used step-wise.\n",
"\n",
"Therefore, it can be used to just perform toponym recognition (i.e. NER):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mentions = geoparser.run_text_recognition(\"A remarkable case of rattening has just occurred in the building trade at Sheffield.\")\n",
"print(mentions)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The pipeline can then be used to just perform candidate selection given the output of NER:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"candidates = geoparser.run_candidate_selection(mentions)\n",
"print(candidates)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And finally, the pipeline can be used to perform entity disambiguation, given the output from the previous two steps:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"disamb_output = geoparser.run_disambiguation(mentions, candidates)\n",
"print(disamb_output)"
]
}
],
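Read together, the cells added above amount to the following step-wise sketch (assuming the default `Pipeline` configuration described at the top of the notebook):

```python
# Sketch assembled from the cells added in this notebook: the three pipeline
# steps run one after another instead of via the end-to-end run_text call.
from geoparser import pipeline

geoparser = pipeline.Pipeline()  # defaults: NER tagging, perfect-match ranking, most-popular linking

text = "A remarkable case of rattening has just occurred in the building trade at Sheffield."

mentions = geoparser.run_text_recognition(text)                # 1. toponym recognition (NER)
candidates = geoparser.run_candidate_selection(mentions)       # 2. candidate selection
resolved = geoparser.run_disambiguation(mentions, candidates)  # 3. entity disambiguation
print(resolved)
```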
22 changes: 2 additions & 20 deletions examples/run_pipeline_deezy_mostpopular.ipynb
@@ -32,8 +32,6 @@
"myranker = ranking.Ranker(\n",
" method=\"deezymatch\",\n",
" resources_path=\"../resources/wikidata/\",\n",
" mentions_to_wikidata=dict(),\n",
" wikidata_to_mentions=dict(),\n",
" strvar_parameters={\n",
" # Parameters to create the string pair dataset:\n",
" \"ocr_threshold\": 60,\n",
@@ -52,9 +50,8 @@
" \"dm_output\": \"deezymatch_on_the_fly\",\n",
" # Ranking measures:\n",
" \"ranking_metric\": \"faiss\",\n",
" \"selection_threshold\": 25,\n",
" \"num_candidates\": 3,\n",
" \"search_size\": 3,\n",
" \"selection_threshold\": 50,\n",
" \"num_candidates\": 1,\n",
" \"verbose\": False,\n",
" # DeezyMatch training:\n",
" \"overwrite_training\": False,\n",
@@ -72,9 +69,6 @@
"mylinker = linking.Linker(\n",
" method=\"mostpopular\",\n",
" resources_path=\"../resources/\",\n",
" linking_resources=dict(),\n",
" rel_params=dict(),\n",
" overwrite_training=False,\n",
")"
]
},
@@ -87,18 +81,6 @@
"geoparser = pipeline.Pipeline(myranker=myranker, mylinker=mylinker)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"resolved = geoparser.run_text(\"A remarkable case of rattening has just occurred in the building trade at Shefrield, but also in Lancaster. Not in Nottingham though. Not in Ashton either, nor in Salop!\")\n",
" \n",
"for r in resolved:\n",
" print(r)"
]
},
{
"cell_type": "code",
"execution_count": null,
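For reference, the configuration this notebook ends up with after the commit condenses to roughly the sketch below; only parameters visible in this diff are included, and the DeezyMatch model and vocabulary settings that the hunks truncate are left out rather than guessed.

```python
# Condensed sketch of the updated DeezyMatch + most-popular configuration.
# Parameters truncated in the diff are omitted, so treat this as illustrative only.
from pathlib import Path
from geoparser import ranking, linking, pipeline

myranker = ranking.Ranker(
    method="deezymatch",
    resources_path="../resources/wikidata/",
    strvar_parameters={
        # Parameters to create the string pair dataset:
        "ocr_threshold": 60,
        "top_threshold": 85,
        "min_len": 5,
        "max_len": 15,
        "w2v_ocr_path": str(Path("../resources/models/w2v/").resolve()),
        "w2v_ocr_model": "w2v_*_news",
        "overwrite_dataset": False,
    },
    deezy_parameters={
        # Paths and filenames of DeezyMatch models and data:
        "dm_path": str(Path("../resources/deezymatch/").resolve()),
        "dm_output": "deezymatch_on_the_fly",
        # Ranking measures (values updated in this commit):
        "ranking_metric": "faiss",
        "selection_threshold": 50,
        "num_candidates": 1,
        "verbose": False,
        # DeezyMatch training:
        "overwrite_training": False,
    },
)

mylinker = linking.Linker(
    method="mostpopular",
    resources_path="../resources/",
)

geoparser = pipeline.Pipeline(myranker=myranker, mylinker=mylinker)
```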
20 changes: 3 additions & 17 deletions examples/run_pipeline_deezy_reldisamb+wmtops.ipynb
@@ -35,18 +35,7 @@
"myranker = ranking.Ranker(\n",
" method=\"deezymatch\",\n",
" resources_path=\"../resources/wikidata/\",\n",
" mentions_to_wikidata=dict(),\n",
" wikidata_to_mentions=dict(),\n",
" strvar_parameters={\n",
" # Parameters to create the string pair dataset:\n",
" \"ocr_threshold\": 60,\n",
" \"top_threshold\": 85,\n",
" \"min_len\": 5,\n",
" \"max_len\": 15,\n",
" \"w2v_ocr_path\": str(Path(\"../resources/models/w2v/\").resolve()),\n",
" \"w2v_ocr_model\": \"w2v_*_news\",\n",
" \"overwrite_dataset\": False,\n",
" },\n",
" strvar_parameters=dict(),\n",
" deezy_parameters={\n",
" # Paths and filenames of DeezyMatch models and data:\n",
" \"dm_path\": str(Path(\"../resources/deezymatch/\").resolve()),\n",
@@ -55,9 +44,8 @@
" \"dm_output\": \"deezymatch_on_the_fly\",\n",
" # Ranking measures:\n",
" \"ranking_metric\": \"faiss\",\n",
" \"selection_threshold\": 25,\n",
" \"num_candidates\": 3,\n",
" \"search_size\": 3,\n",
" \"selection_threshold\": 50,\n",
" \"num_candidates\": 1,\n",
" \"verbose\": False,\n",
" # DeezyMatch training:\n",
" \"overwrite_training\": False,\n",
@@ -77,12 +65,10 @@
" mylinker = linking.Linker(\n",
" method=\"reldisamb\",\n",
" resources_path=\"../resources/\",\n",
" linking_resources=dict(),\n",
" rel_params={\n",
" \"model_path\": \"../resources/models/disambiguation/\",\n",
" \"data_path\": \"../experiments/outputs/data/lwm/\",\n",
" \"training_split\": \"originalsplit\",\n",
" \"context_length\": 100,\n",
" \"db_embeddings\": cursor,\n",
" \"with_publication\": False,\n",
" \"without_microtoponyms\": True,\n",
25 changes: 2 additions & 23 deletions examples/run_pipeline_deezy_reldisamb+wpubl+wmtops.ipynb
@@ -35,8 +35,6 @@
"myranker = ranking.Ranker(\n",
" method=\"deezymatch\",\n",
" resources_path=\"../resources/wikidata/\",\n",
" mentions_to_wikidata=dict(),\n",
" wikidata_to_mentions=dict(),\n",
" strvar_parameters={\n",
" # Parameters to create the string pair dataset:\n",
" \"ocr_threshold\": 60,\n",
@@ -55,9 +53,8 @@
" \"dm_output\": \"deezymatch_on_the_fly\",\n",
" # Ranking measures:\n",
" \"ranking_metric\": \"faiss\",\n",
" \"selection_threshold\": 25,\n",
" \"num_candidates\": 3,\n",
" \"search_size\": 3,\n",
" \"selection_threshold\": 50,\n",
" \"num_candidates\": 1,\n",
" \"verbose\": False,\n",
" # DeezyMatch training:\n",
" \"overwrite_training\": False,\n",
@@ -77,12 +74,10 @@
" mylinker = linking.Linker(\n",
" method=\"reldisamb\",\n",
" resources_path=\"../resources/\",\n",
" linking_resources=dict(),\n",
" rel_params={\n",
" \"model_path\": \"../resources/models/disambiguation/\",\n",
" \"data_path\": \"../experiments/outputs/data/lwm/\",\n",
" \"training_split\": \"originalsplit\",\n",
" \"context_length\": 100,\n",
" \"db_embeddings\": cursor,\n",
" \"with_publication\": True,\n",
" \"without_microtoponyms\": True,\n",
@@ -103,22 +98,6 @@
"geoparser = pipeline.Pipeline(myranker=myranker, mylinker=mylinker)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"resolved = geoparser.run_text(\n",
" \"A remarkable case of rattening has just occurred in the building trade next to the Market-street of Shefrield, but also in Lancaster. Not in Nottingham though. Not in Ashton either, nor in Salop!\", \n",
" place=\"Manchester\", \n",
" place_wqid=\"Q18125\"\n",
")\n",
" \n",
"for r in resolved:\n",
" print(r)"
]
},
{
"cell_type": "code",
"execution_count": null,
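Since this notebook keeps `"with_publication": True`, the place of publication can be passed to the end-to-end call; the cell this commit removes did exactly that, roughly as in the sketch below (Manchester, Wikidata Q18125, as in the removed cell).

```python
# Sketch based on the removed cell above: resolving toponyms with the place of
# publication supplied as extra context for the reldisamb linker.
# Assumes the `geoparser` Pipeline built in the cells above.
resolved = geoparser.run_text(
    "A remarkable case of rattening has just occurred in the building trade at Sheffield.",
    place="Manchester",    # place of publication (from the removed cell)
    place_wqid="Q18125",   # its Wikidata ID (from the removed cell)
)
for r in resolved:
    print(r)
```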
27 changes: 3 additions & 24 deletions examples/run_pipeline_deezy_reldisamb+wpubl.ipynb
@@ -35,18 +35,7 @@
"myranker = ranking.Ranker(\n",
" method=\"deezymatch\",\n",
" resources_path=\"../resources/wikidata/\",\n",
" mentions_to_wikidata=dict(),\n",
" wikidata_to_mentions=dict(),\n",
" strvar_parameters={\n",
" # Parameters to create the string pair dataset:\n",
" \"ocr_threshold\": 60,\n",
" \"top_threshold\": 85,\n",
" \"min_len\": 5,\n",
" \"max_len\": 15,\n",
" \"w2v_ocr_path\": str(Path(\"../resources/models/w2v/\").resolve()),\n",
" \"w2v_ocr_model\": \"w2v_*_news\",\n",
" \"overwrite_dataset\": False,\n",
" },\n",
" strvar_parameters=dict(),\n",
" deezy_parameters={\n",
" # Paths and filenames of DeezyMatch models and data:\n",
" \"dm_path\": str(Path(\"../resources/deezymatch/\").resolve()),\n",
@@ -55,9 +44,8 @@
" \"dm_output\": \"deezymatch_on_the_fly\",\n",
" # Ranking measures:\n",
" \"ranking_metric\": \"faiss\",\n",
" \"selection_threshold\": 25,\n",
" \"num_candidates\": 3,\n",
" \"search_size\": 3,\n",
" \"selection_threshold\": 50,\n",
" \"num_candidates\": 1,\n",
" \"verbose\": False,\n",
" # DeezyMatch training:\n",
" \"overwrite_training\": False,\n",
@@ -77,12 +65,10 @@
" mylinker = linking.Linker(\n",
" method=\"reldisamb\",\n",
" resources_path=\"../resources/\",\n",
" linking_resources=dict(),\n",
" rel_params={\n",
" \"model_path\": \"../resources/models/disambiguation/\",\n",
" \"data_path\": \"../experiments/outputs/data/lwm/\",\n",
" \"training_split\": \"originalsplit\",\n",
" \"context_length\": 100,\n",
" \"db_embeddings\": cursor,\n",
" \"with_publication\": True,\n",
" \"without_microtoponyms\": False,\n",
@@ -133,13 +119,6 @@
"for r in resolved:\n",
" print(r)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
10 changes: 9 additions & 1 deletion examples/run_pipeline_modular.ipynb
@@ -64,7 +64,6 @@
" mylinker = linking.Linker(\n",
" method=\"reldisamb\",\n",
" resources_path=\"../resources/\",\n",
" linking_resources=dict(),\n",
" rel_params={\n",
" \"model_path\": \"../resources/models/disambiguation/\",\n",
" \"data_path\": \"../experiments/outputs/data/lwm/\",\n",
@@ -127,6 +126,15 @@
"source": [
"output_disamb = geoparser.run_disambiguation(output, cands)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"output_disamb"
]
}
],
"metadata": {
16 changes: 0 additions & 16 deletions examples/run_pipeline_perfect_mostpopular.ipynb
@@ -30,8 +30,6 @@
"myranker = ranking.Ranker(\n",
" method=\"perfectmatch\",\n",
" resources_path=\"../resources/wikidata/\",\n",
" mentions_to_wikidata=dict(),\n",
" wikidata_to_mentions=dict(),\n",
")\n"
]
},
@@ -44,8 +42,6 @@
"mylinker = linking.Linker(\n",
" method=\"mostpopular\",\n",
" resources_path=\"../resources/\",\n",
" linking_resources=dict(),\n",
" overwrite_training=False,\n",
")"
]
},
@@ -58,18 +54,6 @@
"geoparser = pipeline.Pipeline(myranker=myranker, mylinker=mylinker)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"resolved = geoparser.run_text(\"A remarkable case of rattening has just occurred in the building trade at Shefrield, but also in Lancaster. Not in Nottingham though. Not in Ashton either, nor in Salop!\")\n",
" \n",
"for r in resolved:\n",
" print(r)"
]
},
{
"cell_type": "code",
"execution_count": null,
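For completeness, the simplified configuration this notebook is left with after the commit is essentially the sketch below: exact-string candidate matching plus most-popular disambiguation, with the removed keyword arguments gone.

```python
# Condensed sketch of the perfect-match + most-popular setup after this commit.
from geoparser import ranking, linking, pipeline

myranker = ranking.Ranker(
    method="perfectmatch",
    resources_path="../resources/wikidata/",
)
mylinker = linking.Linker(
    method="mostpopular",
    resources_path="../resources/",
)
geoparser = pipeline.Pipeline(myranker=myranker, mylinker=mylinker)

resolved = geoparser.run_text(
    "A remarkable case of rattening has just occurred in the building trade at Sheffield."
)
print(resolved)
```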