Update example notebooks
mcollardanuy committed Jul 31, 2023
1 parent 830dfe8 commit 31503a9
Showing 12 changed files with 100 additions and 141 deletions.
2 changes: 1 addition & 1 deletion examples/load_use_ner_model.ipynb
@@ -7,7 +7,7 @@
"source": [
"# Loading and using a NER model\n",
"\n",
"This notebook shows how to load an existing named entity recognition (NER) model from the HuggingFace hub.\n",
"This notebook shows how to load an existing named entity recognition (NER) model from the HuggingFace hub, using T-Res.\n",
"\n",
"We start by importing some libraries, and the `recogniser` script from the `geoparser` folder:"
]
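For orientation, this is roughly what loading such a NER model from the HuggingFace hub looks like with the `transformers` library alone; the notebook itself goes through T-Res's `recogniser` script (not shown in this hunk), and the model ID below is the one named in the pipeline notebook later in this commit, so treat it as an illustrative assumption.

```python
# Hedged sketch, not part of the diff: pulling the NER model straight from the
# HuggingFace hub with transformers. T-Res wraps this step in its `recogniser` module.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="Livingwithmachines/toponym-19thC-en",  # model ID taken from run_pipeline_basic.ipynb below
    aggregation_strategy="simple",  # merge word pieces into whole mentions
)
print(ner("A remarkable case of rattening has just occurred in the building trade at Sheffield."))
```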
78 changes: 71 additions & 7 deletions examples/run_pipeline_basic.ipynb
@@ -28,7 +28,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Once the `pipeline` script has been imported (in the previous cell), we create a new object of the `Pipeline` class. Since we don't pass any parameters, it will take all the default values: it will detect toponyms using the fine-grained tagset, it will find candidates using the perfect match approach, and will disambiguate them using the most popular approach. You can see the default `Pipeline` values [here](https://github.com/Living-with-machines/toponym-resolution/blob/main/geoparser/pipeline.py)."
"Once the `pipeline` script has been imported (in the previous cell), we create a new object of the `Pipeline` class. Since we don't pass any parameters, it will take all the default values: it will detect toponyms using `Livingwithmachines/toponym-19thC-en` NER model, it will find candidates using the perfect match approach, and will disambiguate them using the most popular approach. You can see the default `Pipeline` values [here](https://living-with-machines.github.io/T-Res/reference/geoparser/pipeline.html)."
]
},
{
@@ -40,6 +40,13 @@
"geoparser = pipeline.Pipeline()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using the pipeline: end-to-end"
]
},
{
"attachments": {},
"cell_type": "markdown",
@@ -54,10 +61,8 @@
"metadata": {},
"outputs": [],
"source": [
"resolved = geoparser.run_text(\"A remarkable case of rattening has just occurred in the building trade at Shefrield, but also in Lancaster. Not in Nottingham though. Not in Ashton either, nor in Salop!\")\n",
" \n",
"for r in resolved:\n",
" print(r)"
"resolved = geoparser.run_text(\"A remarkable case of rattening has just occurred in the building trade at Sheffield.\")\n",
"print(resolved)"
]
},
{
@@ -67,8 +72,67 @@
"outputs": [],
"source": [
"resolved = geoparser.run_sentence(\"A remarkable case of rattening has just occurred in the building trade at Sheffield.\")\n",
"for r in resolved:\n",
" print(r)"
"print(resolved)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using the pipeline: step-wise"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instead of using the end-to-end pipeline, the pipeline can be used step-wise.\n",
"\n",
"Therefore, it can be used to just perform toponym recognition (i.e. NER):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mentions = geoparser.run_text_recognition(\"A remarkable case of rattening has just occurred in the building trade at Sheffield.\")\n",
"print(mentions)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The pipeline can then be used to just perform candidate selection given the output of NER:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"candidates = geoparser.run_candidate_selection(mentions)\n",
"print(candidates)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And finally, the pipeline can be used to perform entity disambiguation, given the output from the previous two steps:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"disamb_output = geoparser.run_disambiguation(mentions, candidates)\n",
"print(disamb_output)"
]
}
],
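Read together, the cells added above amount to the following step-wise sketch (assuming the default `Pipeline` configuration described at the top of the notebook):

```python
# Sketch assembled from the cells added in this notebook: the three pipeline
# steps run one after another instead of via the end-to-end run_text call.
from geoparser import pipeline

geoparser = pipeline.Pipeline()  # defaults: NER tagging, perfect-match ranking, most-popular linking

text = "A remarkable case of rattening has just occurred in the building trade at Sheffield."

mentions = geoparser.run_text_recognition(text)                # 1. toponym recognition (NER)
candidates = geoparser.run_candidate_selection(mentions)       # 2. candidate selection
resolved = geoparser.run_disambiguation(mentions, candidates)  # 3. entity disambiguation
print(resolved)
```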
22 changes: 2 additions & 20 deletions examples/run_pipeline_deezy_mostpopular.ipynb
@@ -32,8 +32,6 @@
"myranker = ranking.Ranker(\n",
" method=\"deezymatch\",\n",
" resources_path=\"../resources/wikidata/\",\n",
" mentions_to_wikidata=dict(),\n",
" wikidata_to_mentions=dict(),\n",
" strvar_parameters={\n",
" # Parameters to create the string pair dataset:\n",
" \"ocr_threshold\": 60,\n",
@@ -52,9 +50,8 @@
" \"dm_output\": \"deezymatch_on_the_fly\",\n",
" # Ranking measures:\n",
" \"ranking_metric\": \"faiss\",\n",
" \"selection_threshold\": 25,\n",
" \"num_candidates\": 3,\n",
" \"search_size\": 3,\n",
" \"selection_threshold\": 50,\n",
" \"num_candidates\": 1,\n",
" \"verbose\": False,\n",
" # DeezyMatch training:\n",
" \"overwrite_training\": False,\n",
@@ -72,9 +69,6 @@
"mylinker = linking.Linker(\n",
" method=\"mostpopular\",\n",
" resources_path=\"../resources/\",\n",
" linking_resources=dict(),\n",
" rel_params=dict(),\n",
" overwrite_training=False,\n",
")"
]
},
@@ -87,18 +81,6 @@
"geoparser = pipeline.Pipeline(myranker=myranker, mylinker=mylinker)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"resolved = geoparser.run_text(\"A remarkable case of rattening has just occurred in the building trade at Shefrield, but also in Lancaster. Not in Nottingham though. Not in Ashton either, nor in Salop!\")\n",
" \n",
"for r in resolved:\n",
" print(r)"
]
},
{
"cell_type": "code",
"execution_count": null,
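For reference, the configuration this notebook ends up with after the commit condenses to roughly the sketch below; only parameters visible in this diff are included, and the DeezyMatch model and vocabulary settings that the hunks truncate are left out rather than guessed.

```python
# Condensed sketch of the updated DeezyMatch + most-popular configuration.
# Parameters truncated in the diff are omitted, so treat this as illustrative only.
from pathlib import Path
from geoparser import ranking, linking, pipeline

myranker = ranking.Ranker(
    method="deezymatch",
    resources_path="../resources/wikidata/",
    strvar_parameters={
        # Parameters to create the string pair dataset:
        "ocr_threshold": 60,
        "top_threshold": 85,
        "min_len": 5,
        "max_len": 15,
        "w2v_ocr_path": str(Path("../resources/models/w2v/").resolve()),
        "w2v_ocr_model": "w2v_*_news",
        "overwrite_dataset": False,
    },
    deezy_parameters={
        # Paths and filenames of DeezyMatch models and data:
        "dm_path": str(Path("../resources/deezymatch/").resolve()),
        "dm_output": "deezymatch_on_the_fly",
        # Ranking measures (values updated in this commit):
        "ranking_metric": "faiss",
        "selection_threshold": 50,
        "num_candidates": 1,
        "verbose": False,
        # DeezyMatch training:
        "overwrite_training": False,
    },
)

mylinker = linking.Linker(
    method="mostpopular",
    resources_path="../resources/",
)

geoparser = pipeline.Pipeline(myranker=myranker, mylinker=mylinker)
```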
20 changes: 3 additions & 17 deletions examples/run_pipeline_deezy_reldisamb+wmtops.ipynb
@@ -35,18 +35,7 @@
"myranker = ranking.Ranker(\n",
" method=\"deezymatch\",\n",
" resources_path=\"../resources/wikidata/\",\n",
" mentions_to_wikidata=dict(),\n",
" wikidata_to_mentions=dict(),\n",
" strvar_parameters={\n",
" # Parameters to create the string pair dataset:\n",
" \"ocr_threshold\": 60,\n",
" \"top_threshold\": 85,\n",
" \"min_len\": 5,\n",
" \"max_len\": 15,\n",
" \"w2v_ocr_path\": str(Path(\"../resources/models/w2v/\").resolve()),\n",
" \"w2v_ocr_model\": \"w2v_*_news\",\n",
" \"overwrite_dataset\": False,\n",
" },\n",
" strvar_parameters=dict(),\n",
" deezy_parameters={\n",
" # Paths and filenames of DeezyMatch models and data:\n",
" \"dm_path\": str(Path(\"../resources/deezymatch/\").resolve()),\n",
@@ -55,9 +44,8 @@
" \"dm_output\": \"deezymatch_on_the_fly\",\n",
" # Ranking measures:\n",
" \"ranking_metric\": \"faiss\",\n",
" \"selection_threshold\": 25,\n",
" \"num_candidates\": 3,\n",
" \"search_size\": 3,\n",
" \"selection_threshold\": 50,\n",
" \"num_candidates\": 1,\n",
" \"verbose\": False,\n",
" # DeezyMatch training:\n",
" \"overwrite_training\": False,\n",
@@ -77,12 +65,10 @@
" mylinker = linking.Linker(\n",
" method=\"reldisamb\",\n",
" resources_path=\"../resources/\",\n",
" linking_resources=dict(),\n",
" rel_params={\n",
" \"model_path\": \"../resources/models/disambiguation/\",\n",
" \"data_path\": \"../experiments/outputs/data/lwm/\",\n",
" \"training_split\": \"originalsplit\",\n",
" \"context_length\": 100,\n",
" \"db_embeddings\": cursor,\n",
" \"with_publication\": False,\n",
" \"without_microtoponyms\": True,\n",
25 changes: 2 additions & 23 deletions examples/run_pipeline_deezy_reldisamb+wpubl+wmtops.ipynb
@@ -35,8 +35,6 @@
"myranker = ranking.Ranker(\n",
" method=\"deezymatch\",\n",
" resources_path=\"../resources/wikidata/\",\n",
" mentions_to_wikidata=dict(),\n",
" wikidata_to_mentions=dict(),\n",
" strvar_parameters={\n",
" # Parameters to create the string pair dataset:\n",
" \"ocr_threshold\": 60,\n",
@@ -55,9 +53,8 @@
" \"dm_output\": \"deezymatch_on_the_fly\",\n",
" # Ranking measures:\n",
" \"ranking_metric\": \"faiss\",\n",
" \"selection_threshold\": 25,\n",
" \"num_candidates\": 3,\n",
" \"search_size\": 3,\n",
" \"selection_threshold\": 50,\n",
" \"num_candidates\": 1,\n",
" \"verbose\": False,\n",
" # DeezyMatch training:\n",
" \"overwrite_training\": False,\n",
@@ -77,12 +74,10 @@
" mylinker = linking.Linker(\n",
" method=\"reldisamb\",\n",
" resources_path=\"../resources/\",\n",
" linking_resources=dict(),\n",
" rel_params={\n",
" \"model_path\": \"../resources/models/disambiguation/\",\n",
" \"data_path\": \"../experiments/outputs/data/lwm/\",\n",
" \"training_split\": \"originalsplit\",\n",
" \"context_length\": 100,\n",
" \"db_embeddings\": cursor,\n",
" \"with_publication\": True,\n",
" \"without_microtoponyms\": True,\n",
@@ -103,22 +98,6 @@
"geoparser = pipeline.Pipeline(myranker=myranker, mylinker=mylinker)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"resolved = geoparser.run_text(\n",
" \"A remarkable case of rattening has just occurred in the building trade next to the Market-street of Shefrield, but also in Lancaster. Not in Nottingham though. Not in Ashton either, nor in Salop!\", \n",
" place=\"Manchester\", \n",
" place_wqid=\"Q18125\"\n",
")\n",
" \n",
"for r in resolved:\n",
" print(r)"
]
},
{
"cell_type": "code",
"execution_count": null,
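Since this notebook keeps `"with_publication": True`, the place of publication can be passed to the end-to-end call; the cell this commit removes did exactly that, roughly as in the sketch below (Manchester, Wikidata Q18125, as in the removed cell).

```python
# Sketch based on the removed cell above: resolving toponyms with the place of
# publication supplied as extra context for the reldisamb linker.
# Assumes the `geoparser` Pipeline built in the cells above.
resolved = geoparser.run_text(
    "A remarkable case of rattening has just occurred in the building trade at Sheffield.",
    place="Manchester",    # place of publication (from the removed cell)
    place_wqid="Q18125",   # its Wikidata ID (from the removed cell)
)
for r in resolved:
    print(r)
```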
27 changes: 3 additions & 24 deletions examples/run_pipeline_deezy_reldisamb+wpubl.ipynb
@@ -35,18 +35,7 @@
"myranker = ranking.Ranker(\n",
" method=\"deezymatch\",\n",
" resources_path=\"../resources/wikidata/\",\n",
" mentions_to_wikidata=dict(),\n",
" wikidata_to_mentions=dict(),\n",
" strvar_parameters={\n",
" # Parameters to create the string pair dataset:\n",
" \"ocr_threshold\": 60,\n",
" \"top_threshold\": 85,\n",
" \"min_len\": 5,\n",
" \"max_len\": 15,\n",
" \"w2v_ocr_path\": str(Path(\"../resources/models/w2v/\").resolve()),\n",
" \"w2v_ocr_model\": \"w2v_*_news\",\n",
" \"overwrite_dataset\": False,\n",
" },\n",
" strvar_parameters=dict(),\n",
" deezy_parameters={\n",
" # Paths and filenames of DeezyMatch models and data:\n",
" \"dm_path\": str(Path(\"../resources/deezymatch/\").resolve()),\n",
@@ -55,9 +44,8 @@
" \"dm_output\": \"deezymatch_on_the_fly\",\n",
" # Ranking measures:\n",
" \"ranking_metric\": \"faiss\",\n",
" \"selection_threshold\": 25,\n",
" \"num_candidates\": 3,\n",
" \"search_size\": 3,\n",
" \"selection_threshold\": 50,\n",
" \"num_candidates\": 1,\n",
" \"verbose\": False,\n",
" # DeezyMatch training:\n",
" \"overwrite_training\": False,\n",
@@ -77,12 +65,10 @@
" mylinker = linking.Linker(\n",
" method=\"reldisamb\",\n",
" resources_path=\"../resources/\",\n",
" linking_resources=dict(),\n",
" rel_params={\n",
" \"model_path\": \"../resources/models/disambiguation/\",\n",
" \"data_path\": \"../experiments/outputs/data/lwm/\",\n",
" \"training_split\": \"originalsplit\",\n",
" \"context_length\": 100,\n",
" \"db_embeddings\": cursor,\n",
" \"with_publication\": True,\n",
" \"without_microtoponyms\": False,\n",
@@ -133,13 +119,6 @@
"for r in resolved:\n",
" print(r)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
10 changes: 9 additions & 1 deletion examples/run_pipeline_modular.ipynb
@@ -64,7 +64,6 @@
" mylinker = linking.Linker(\n",
" method=\"reldisamb\",\n",
" resources_path=\"../resources/\",\n",
" linking_resources=dict(),\n",
" rel_params={\n",
" \"model_path\": \"../resources/models/disambiguation/\",\n",
" \"data_path\": \"../experiments/outputs/data/lwm/\",\n",
@@ -127,6 +126,15 @@
"source": [
"output_disamb = geoparser.run_disambiguation(output, cands)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"output_disamb"
]
}
],
"metadata": {
16 changes: 0 additions & 16 deletions examples/run_pipeline_perfect_mostpopular.ipynb
@@ -30,8 +30,6 @@
"myranker = ranking.Ranker(\n",
" method=\"perfectmatch\",\n",
" resources_path=\"../resources/wikidata/\",\n",
" mentions_to_wikidata=dict(),\n",
" wikidata_to_mentions=dict(),\n",
")\n"
]
},
@@ -44,8 +42,6 @@
"mylinker = linking.Linker(\n",
" method=\"mostpopular\",\n",
" resources_path=\"../resources/\",\n",
" linking_resources=dict(),\n",
" overwrite_training=False,\n",
")"
]
},
@@ -58,18 +54,6 @@
"geoparser = pipeline.Pipeline(myranker=myranker, mylinker=mylinker)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"resolved = geoparser.run_text(\"A remarkable case of rattening has just occurred in the building trade at Shefrield, but also in Lancaster. Not in Nottingham though. Not in Ashton either, nor in Salop!\")\n",
" \n",
"for r in resolved:\n",
" print(r)"
]
},
{
"cell_type": "code",
"execution_count": null,
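For completeness, the simplified configuration this notebook is left with after the commit is essentially the sketch below: exact-string candidate matching plus most-popular disambiguation, with the removed keyword arguments gone.

```python
# Condensed sketch of the perfect-match + most-popular setup after this commit.
from geoparser import ranking, linking, pipeline

myranker = ranking.Ranker(
    method="perfectmatch",
    resources_path="../resources/wikidata/",
)
mylinker = linking.Linker(
    method="mostpopular",
    resources_path="../resources/",
)
geoparser = pipeline.Pipeline(myranker=myranker, mylinker=mylinker)

resolved = geoparser.run_text(
    "A remarkable case of rattening has just occurred in the building trade at Sheffield."
)
print(resolved)
```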