Fix issue 1414 #1416

RobinL · 2023-07-08T21:18:38Z

Issue

If you run code like:

settings = {   
    "link_type": "link_only",
    "comparisons": [
        exact_match("first_name"),
        exact_match("surname"),
    ],
}

rec = {
    "first_name": "robin",
    "surname": "linacre",
}

linker.find_matches_to_new_records(
    [rec], match_weight_threshold=-1000000
).as_pandas_dataframe()

you get an error like:

Error was: Binder Error: Values list "r" does not have a column named "source_dataset"

or

Error was: Binder Error: Values list "r" does not have a column named "unique_id"

This is frustrating from the users point of view because:

It's not obvious they should need to set a source_dataset column
If they realised it's needed, it's not obvious what the value should be
Similar with the ID, it's not obvious it's needed or what it should be

See #1414 for bug report.

Testing script:

import pandas as pd
import logging
from splink.duckdb.comparison_library import jaro_winkler_at_thresholds
from splink.duckdb.linker import DuckDBLinker

records_l = [
    {"uid": 1, "first_name": "robin", "surname": "linacre"},
    {"uid": 2, "first_name": "james", "surname": "booth"},
]

records_r = [
    {"uid": 3, "first_name": "robin", "surname": "linacre"},
    {"uid": 4, "first_name": "rob", "surname": "linaker"},
    {"uid": 5, "first_name": "john", "surname": "smith"},
]

df_l = pd.DataFrame(records_l)
df_r = pd.DataFrame(records_r)
df = pd.concat([df_l, df_r])

settings = {
    "probability_two_random_records_match": 0.01,
    "unique_id_column_name": "uid",
    "source_dataset_column_name": "src_dtaset",
    "link_type": "link_only",
    "blocking_rules_to_generate_predictions": ["1=1"],
    "comparisons": [
        jaro_winkler_at_thresholds("first_name", [0.9, 0.8]),
        jaro_winkler_at_thresholds("surname", [0.9, 0.8]),
    ],
    "retain_intermediate_calculation_columns": True,
}

linker = DuckDBLinker([df_l, df_r], settings)

linker.predict().as_pandas_dataframe()


logging.getLogger("splink").setLevel(1)
rec = {
    "first_name": "robin",
    "surname": "linacre",
}

linker.find_matches_to_new_records(
    [rec], match_weight_threshold=-1000000
).as_pandas_dataframe()


settings = {
    "probability_two_random_records_match": 0.01,
    "unique_id_column_name": "uid",
    "source_dataset_column_name": "src_dtaset",
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": ["1=1"],
    "comparisons": [
        jaro_winkler_at_thresholds("first_name", [0.9, 0.8]),
        jaro_winkler_at_thresholds("surname", [0.9, 0.8]),
    ],
    "retain_intermediate_calculation_columns": True,
}

linker = DuckDBLinker(df, settings)


rec = {
    "first_name": "robin",
    "surname": "linacre",
}

linker.find_matches_to_new_records(
    [rec], match_weight_threshold=-1000000
).as_pandas_dataframe()

github-actions · 2023-07-08T21:23:04Z

Test: test_2_rounds_1k_duckdb

Percentage change: -0.5%

	date	time	stats_mean	stats_min	commit_info_branch	commit_info_id	machine_info_cpu_brand_raw	machine_info_cpu_hz_actual_friendly	commit_hash
849	2022-07-12	18:40:05	1.89098	1.87463	splink3	`c334bb9`	Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz	2.7934 GHz	`c334bb9`
1839	2023-07-12	19:29:59	1.87465	1.86449	(detached head)	`5eed11d`	Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz	2.2947 GHz	`5eed11d`

Test: test_2_rounds_1k_sqlite

Percentage change: 3.7%

	date	time	stats_mean	stats_min	commit_info_branch	commit_info_id	machine_info_cpu_brand_raw	machine_info_cpu_hz_actual_friendly	commit_hash
851	2022-07-12	18:40:05	4.32179	4.25898	splink3	`c334bb9`	Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz	2.7934 GHz	`c334bb9`
1841	2023-07-12	19:29:59	4.45027	4.41808	(detached head)	`5eed11d`	Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz	2.2947 GHz	`5eed11d`

Click here for vega lite time series charts

- Added logic to add the source dataset column to new records if required and not already present. - Added logic to add the unique ID column to new records if not already present.

RobinL · 2023-07-10T11:34:13Z

splink/linker.py

-        # If our df_concat_with_tf table already exists, use backwards inference to
-        # find all underlying term frequency tables.
+        # If our df_concat_with_tf table already exists, derive the term frequency
+        # tables from df_concat_with_tf rather than computing them


Not related to the PR, but i wanted to clarify this comment, it took me a little while to understand 'backwards inference'

RossKen

This works as expected so is good to go in from my perspective.

This is also related to an issue that @aliceoleary0 has found (not yet written up in github) so there will likely be further developments on this function in the near future.

aliceoleary0 · 2023-07-18T13:11:11Z

@RossKen yep, I will write up the issue you mentioned this week and post a link here for clarity.

fix issue 1414

86c21d2

RobinL linked an issue Jul 8, 2023 that may be closed by this pull request

Values list "r" does not have a column named "source_dataset" - New in 3.9.3 #1414

Closed

2 tasks

RobinL marked this pull request as draft July 8, 2023 21:18

fix circular

b578617

clarify ids

8124a45

RobinL mentioned this pull request Jul 8, 2023

Values list "r" does not have a column named "source_dataset" - New in 3.9.3 #1414

Closed

2 tasks

Add src dataset and unique ID cols to new records

4e8b0a7

- Added logic to add the source dataset column to new records if required and not already present. - Added logic to add the unique ID column to new records if not already present.

RobinL commented Jul 10, 2023

View reviewed changes

RobinL marked this pull request as ready for review July 10, 2023 11:56

RobinL marked this pull request as draft July 10, 2023 12:23

deal with quoting

8029a59

RobinL marked this pull request as ready for review July 12, 2023 20:07

RossKen approved these changes Jul 18, 2023

View reviewed changes

RobinL merged commit afd97eb into master Jul 19, 2023
10 checks passed

RobinL deleted the 1414-values-list-r-does-not-have-a-column-named-source_dataset-new-in-393 branch July 19, 2023 19:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix issue 1414 #1416

Fix issue 1414 #1416

RobinL commented Jul 8, 2023 •

edited

Loading

github-actions bot commented Jul 8, 2023 •

edited

Loading

RobinL Jul 10, 2023

RossKen left a comment

aliceoleary0 commented Jul 18, 2023

Fix issue 1414 #1416

Fix issue 1414 #1416

Conversation

RobinL commented Jul 8, 2023 • edited Loading

Issue

github-actions bot commented Jul 8, 2023 • edited Loading

Test: test_2_rounds_1k_duckdb

Test: test_2_rounds_1k_sqlite

RobinL Jul 10, 2023

Choose a reason for hiding this comment

RossKen left a comment

Choose a reason for hiding this comment

aliceoleary0 commented Jul 18, 2023

RobinL commented Jul 8, 2023 •

edited

Loading

github-actions bot commented Jul 8, 2023 •

edited

Loading