Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue 1414 #1416

Conversation

RobinL
Copy link
Member

@RobinL RobinL commented Jul 8, 2023

Issue

If you run code like:

settings = {   
    "link_type": "link_only",
    "comparisons": [
        exact_match("first_name"),
        exact_match("surname"),
    ],
}

rec = {
    "first_name": "robin",
    "surname": "linacre",
}

linker.find_matches_to_new_records(
    [rec], match_weight_threshold=-1000000
).as_pandas_dataframe()

you get an error like:

Error was: Binder Error: Values list "r" does not have a column named "source_dataset"

or

Error was: Binder Error: Values list "r" does not have a column named "unique_id"

This is frustrating from the users point of view because:

  • It's not obvious they should need to set a source_dataset column
  • If they realised it's needed, it's not obvious what the value should be
  • Similar with the ID, it's not obvious it's needed or what it should be

See #1414 for bug report.


Testing script:
import pandas as pd
import logging
from splink.duckdb.comparison_library import jaro_winkler_at_thresholds
from splink.duckdb.linker import DuckDBLinker

records_l = [
    {"uid": 1, "first_name": "robin", "surname": "linacre"},
    {"uid": 2, "first_name": "james", "surname": "booth"},
]

records_r = [
    {"uid": 3, "first_name": "robin", "surname": "linacre"},
    {"uid": 4, "first_name": "rob", "surname": "linaker"},
    {"uid": 5, "first_name": "john", "surname": "smith"},
]

df_l = pd.DataFrame(records_l)
df_r = pd.DataFrame(records_r)
df = pd.concat([df_l, df_r])

settings = {
    "probability_two_random_records_match": 0.01,
    "unique_id_column_name": "uid",
    "source_dataset_column_name": "src_dtaset",
    "link_type": "link_only",
    "blocking_rules_to_generate_predictions": ["1=1"],
    "comparisons": [
        jaro_winkler_at_thresholds("first_name", [0.9, 0.8]),
        jaro_winkler_at_thresholds("surname", [0.9, 0.8]),
    ],
    "retain_intermediate_calculation_columns": True,
}

linker = DuckDBLinker([df_l, df_r], settings)

linker.predict().as_pandas_dataframe()


logging.getLogger("splink").setLevel(1)
rec = {
    "first_name": "robin",
    "surname": "linacre",
}

linker.find_matches_to_new_records(
    [rec], match_weight_threshold=-1000000
).as_pandas_dataframe()


settings = {
    "probability_two_random_records_match": 0.01,
    "unique_id_column_name": "uid",
    "source_dataset_column_name": "src_dtaset",
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": ["1=1"],
    "comparisons": [
        jaro_winkler_at_thresholds("first_name", [0.9, 0.8]),
        jaro_winkler_at_thresholds("surname", [0.9, 0.8]),
    ],
    "retain_intermediate_calculation_columns": True,
}

linker = DuckDBLinker(df, settings)


rec = {
    "first_name": "robin",
    "surname": "linacre",
}

linker.find_matches_to_new_records(
    [rec], match_weight_threshold=-1000000
).as_pandas_dataframe()

@RobinL RobinL linked an issue Jul 8, 2023 that may be closed by this pull request
2 tasks
@RobinL RobinL marked this pull request as draft July 8, 2023 21:18
@github-actions
Copy link
Contributor

github-actions bot commented Jul 8, 2023

Test: test_2_rounds_1k_duckdb

Percentage change: -0.5%

date time stats_mean stats_min commit_info_branch commit_info_id machine_info_cpu_brand_raw machine_info_cpu_hz_actual_friendly commit_hash
849 2022-07-12 18:40:05 1.89098 1.87463 splink3 c334bb9 Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz 2.7934 GHz c334bb9
1839 2023-07-12 19:29:59 1.87465 1.86449 (detached head) 5eed11d Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz 2.2947 GHz 5eed11d

Test: test_2_rounds_1k_sqlite

Percentage change: 3.7%

date time stats_mean stats_min commit_info_branch commit_info_id machine_info_cpu_brand_raw machine_info_cpu_hz_actual_friendly commit_hash
851 2022-07-12 18:40:05 4.32179 4.25898 splink3 c334bb9 Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz 2.7934 GHz c334bb9
1841 2023-07-12 19:29:59 4.45027 4.41808 (detached head) 5eed11d Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz 2.2947 GHz 5eed11d

Click here for vega lite time series charts

- Added logic to add the source dataset column to new records if required and not already present.
- Added logic to add the unique ID column to new records if not already present.
# If our df_concat_with_tf table already exists, use backwards inference to
# find all underlying term frequency tables.
# If our df_concat_with_tf table already exists, derive the term frequency
# tables from df_concat_with_tf rather than computing them
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to the PR, but i wanted to clarify this comment, it took me a little while to understand 'backwards inference'

@RobinL RobinL marked this pull request as ready for review July 10, 2023 11:56
@RobinL RobinL marked this pull request as draft July 10, 2023 12:23
@RobinL RobinL marked this pull request as ready for review July 12, 2023 20:07
Copy link
Contributor

@RossKen RossKen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works as expected so is good to go in from my perspective.

This is also related to an issue that @aliceoleary0 has found (not yet written up in github) so there will likely be further developments on this function in the near future.

@aliceoleary0
Copy link
Contributor

@RossKen yep, I will write up the issue you mentioned this week and post a link here for clarity.

@RobinL RobinL merged commit afd97eb into master Jul 19, 2023
10 checks passed
@RobinL RobinL deleted the 1414-values-list-r-does-not-have-a-column-named-source_dataset-new-in-393 branch July 19, 2023 19:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Values list "r" does not have a column named "source_dataset" - New in 3.9.3
3 participants