Blocking Topic Guides #1389

Merged
merged 27 commits into from
Jul 17, 2023

Conversation

@RossKen RossKen commented Jul 3, 2023

Type of PR

  • BUG
  • FEAT
  • MAINT
  • DOC

Is your Pull Request linked to an existing Issue or Pull Request?

Closes #1188

Give a brief description for the solution you have provided

Blocking is a common stumbling block and area of misunderstanding for users. The plan is to flesh out and reorganise the existing guides to (hopefully) make the concepts easier to understand.

PR Checklist

  • Added documentation for changes
  • Added feature to example notebooks at tutorial in splink_demos (if appropriate)
  • Added tests (if appropriate)
  • Made changes based off the latest version of Splink
  • Run the linter

github-actions bot commented Jul 3, 2023

Test: test_2_rounds_1k_duckdb

Percentage change: -17.2%

| | date | time | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw | machine_info_cpu_hz_actual_friendly | commit_hash |
|---|---|---|---|---|---|---|---|---|---|
| 849 | 2022-07-12 | 18:40:05 | 1.89098 | 1.87463 | splink3 | c334bb9 | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz | c334bb9 |
| 1860 | 2023-07-17 | 14:27:32 | 1.57644 | 1.55133 | (detached head) | d6624ce | Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz | 2.2947 GHz | d6624ce |

Test: test_2_rounds_1k_sqlite

Percentage change: -11.5%

| | date | time | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw | machine_info_cpu_hz_actual_friendly | commit_hash |
|---|---|---|---|---|---|---|---|---|---|
| 851 | 2022-07-12 | 18:40:05 | 4.32179 | 4.25898 | splink3 | c334bb9 | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz | c334bb9 |
| 1862 | 2023-07-17 | 14:27:32 | 3.76976 | 3.76923 | (detached head) | d6624ce | Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz | 2.2947 GHz | d6624ce |
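
The reported percentage changes appear to be computed from the `stats_min` column rather than `stats_mean`. That is an inference from the numbers above, not something the bot states. A minimal sketch of the arithmetic, where the function name `percentage_change` is hypothetical:

```python
def percentage_change(old: float, new: float) -> float:
    """Relative change from old to new, expressed as a percentage."""
    return (new - old) / old * 100


# stats_min values copied from the two benchmark tables above:
# the 2022-07-12 baseline row first, then the 2023-07-17 row for this PR.
duckdb_change = percentage_change(1.87463, 1.55133)
sqlite_change = percentage_change(4.25898, 3.76923)

print(f"test_2_rounds_1k_duckdb: {duckdb_change:.1f}%")  # -17.2%
print(f"test_2_rounds_1k_sqlite: {sqlite_change:.1f}%")  # -11.5%
```

Note that the two rows in each table were recorded on different CPUs, so the change probably reflects the hardware as well as the code.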

@RossKen RossKen marked this pull request as ready for review July 4, 2023 15:11

RossKen commented Jul 4, 2023

@RobinL I have added you on as a reviewer as, amongst other things, I have chopped up some of your pre-existing topic guides and redistributed the content, but no worries if you don't have the time to review 😊

RossKen commented Jul 5, 2023

Feedback kindly provided by @sama-ds in Slack:

Samuel Atkin
12:00
Something that's felt missing to me is any kind of scale guidelines for blocking rules. It may be an impossible question to answer, but there's a lot of "don't include too many comparisons or it won't run!" and "don't be too tight or it won't be a good model!". The question most beginners are going to have is "Well, what is sensible?". I get that there's no "right" answer here, but there are definitely upper/lower bounds of what is feasible, and I think it would be good to show some sort of effects of this. E.g. say we took a small dataset of 10,000 rows or something. What happens speed-wise if we make blocking rules that include all comparisons? What happens to linking accuracy if we make blocking rules that include only 100 comparisons? Then let's pick some points between there and show how to get to a happy medium.

12:01
Kind of illustrating the iterative process of making tight blocking rules, looking at model outputs, then going "okay we've missed that kind of true positive- let's add something in to catch that and repeat"
12:02
Though I haven't scanned yet so you may have got to this :)
12:06
Okay you have definitely touched on this a bit and listed some recommended bounds- still feel that something to illustrate this process would be beneficial. That said, it may be "bigger" than blocking rules. All the example notebooks show things that are good and correct, but not the iterative process that linking feels like it is in reality. Maybe something like an example workflow of building a linkage model may be beneficial (eg. right now I've profiled my data I see X, Y and Z, we go back and change that in the same, eg. Now I realise the blocking rules are too strict for Z case, which are true positives, so let's go back and change those, eg. now I realise this is creating transitive links between groups, this is causing false positives so let's change that, repeat)- I'll mull it over a little as I go through my first proper run!
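
To give a rough feel for the scale question raised above, here is a minimal sketch in plain Python (not Splink's API) that counts how many pairwise comparisons a blocking rule generates when deduplicating a single dataset. The records, the column names `surname` and `dob`, and the helper names are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def total_pairs(n: int) -> int:
    """Number of unordered record pairs among n records (dedupe of one dataset)."""
    return n * (n - 1) // 2


def pairs_under_blocking_rule(records, key) -> int:
    """Count pairs that share the same blocking key, i.e. fall in the same block."""
    block_sizes = Counter(key(r) for r in records)
    return sum(total_pairs(size) for size in block_sizes.values())


# Hypothetical toy data: 10,000 records, 100 evenly spread surnames,
# 2,500 evenly spread dates of birth.
records = [
    {"id": i, "surname": f"surname_{i % 100}", "dob": f"dob_{i % 2500}"}
    for i in range(10_000)
]

no_blocking = total_pairs(len(records))
by_surname = pairs_under_blocking_rule(records, lambda r: r["surname"])
by_surname_and_dob = pairs_under_blocking_rule(
    records, lambda r: (r["surname"], r["dob"])
)

print(f"no blocking:          {no_blocking:>12,}")         # 49,995,000
print(f"block on surname:     {by_surname:>12,}")          # 495,000
print(f"block on surname+dob: {by_surname_and_dob:>12,}")  # 15,000
```

Even at 10,000 rows, comparing everything produces roughly 50 million pairs, while each tighter rule cuts that by orders of magnitude at the risk of dropping true matches whose key values disagree. Real data is rarely this evenly distributed, so skewed values (a very common surname, say) would produce much larger blocks than this toy example suggests.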

Ross Kennedy
12:14
All good points! Thanks!
12:17
I will go away and have a think about how to build some intuition for what is sensible in terms of scale. One thing that makes it tricky is the different setups and backends people will have - but it is definitely worth having something
12:18
I like the idea of a worked example though - showing all of the iterations would be super helpful
12:19
(and is essentially what we do when physically showing people how to use Splink)
12:20
Which could be an example notebook, or maybe even a video could be helpful with commentary over the top to show the iterations

Samuel Atkin
13:25
We could likely pair the two- to have a video we'd have all the code ready to copy and paste into cells anyhow, so could make a worked-through workbook from it- I really like that idea
13:25
It's a difficult task- but conveying the thought process seems the most important part, as this is very much an art not a science

Ross Kennedy
13:35
It feels like that is the sort of thing that would be helpful in choosing comparisons/comparison levels too
13:40
I don’t think it should be in the scope of this PR, but I will add an issue for doing some sort of walk through videos with accompanying example notebook

RobinL commented Jul 5, 2023

> @RobinL I have added you on as a reviewer as, amongst other things, I have chopped up some of your pre-existing topic guides and redistributed the content, but no worries if you don't have the time to review 😊

No probs, will try to look in next couple of days

@RobinL RobinL left a comment

Looks good! Some suggestions

docs/topic_guides/blocking/blocking_rules.md (outdated, resolved)
docs/topic_guides/blocking/blocking_rules.md (outdated, resolved)
docs/topic_guides/blocking/blocking_rules.md (outdated, resolved)
docs/topic_guides/blocking/blocking_rules.md (resolved)
docs/topic_guides/blocking/performance.md (outdated, resolved)
docs/topic_guides/blocking/performance.md (outdated, resolved)
docs/topic_guides/blocking/performance.md (outdated, resolved)
docs/topic_guides/blocking/performance.md (outdated, resolved)
docs/topic_guides/blocking/predictions.md (resolved)

RossKen commented Jul 17, 2023

Given the implemented feedback and the integration of @ThomasHepworth's Blocking Rule Library, I am going to merge this PR ahead of our release on Wednesday so we can flag the additional blocking rule documentation alongside the library. The docs will still need to be iterated on and improved in future, but it is better to have something for users to refer to than nothing.

@RossKen RossKen merged commit d7c2e26 into master Jul 17, 2023
6 of 10 checks passed
@RossKen RossKen deleted the blocking_docs branch July 17, 2023 15:29

Successfully merging this pull request may close these issues.

[FEAT] Add documentation on hashing for Blocking Rules