Issue with performance on Spark after migrating to Splink 4 #2439

Answered by Priebe1
Priebe1 asked this question in Q&A

Hi Robin,
Thanks for the response! :)
The hint to __splink__blocked_id_pairs helped me pin down the issue. Our fake data generator erroneously did not create unique values in our ID column customercode, which led to duplicated entries in the table :/
In Splink v4 this results in a massive join between __splink__df_concat_with_tf and __splink__blocked_id_pairs, which is joined via
```sql
INNER JOIN __splink__blocked_id_pairs AS b
    ON l.customercode = b.join_key_l
INNER JOIN __splink__df_concat_with_tf AS r
    ON r.customercode = b.join_key_r
```
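For anyone who hits the same symptom: a quick uniqueness check on the ID column before running the linkage catches this early. A minimal PySpark sketch, assuming the input sits in a DataFrame `df` and the ID column is customercode (the table name `customers` is made up for the example):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("customers")  # hypothetical source table

# Any customercode appearing more than once fans out the join on
# __splink__blocked_id_pairs: k duplicates on each side turn one
# blocked pair into k * k joined rows.
duplicates = (
    df.groupBy("customercode")
      .count()
      .filter(F.col("count") > 1)
)
duplicates.show()

# Fail fast before handing non-unique IDs to Splink.
assert duplicates.count() == 0, "customercode is not unique"
```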

In Splink v3, on the other hand, I assume this issue did not come up because the join is designed differently and therefore produces a different Spark execution plan…
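Independent of the v3/v4 difference, the robust fix is to guarantee a unique ID before linkage. A sketch of two ways to do that (my own suggestion, not something Splink prescribes), reusing the `df` from the check above:

```python
import pyspark.sql.functions as F

# Option A: drop exact duplicate IDs (keeps an arbitrary row per customercode).
df_deduped = df.dropDuplicates(["customercode"])

# Option B: mint a surrogate key that is unique by construction and use it
# as the linkage ID instead of customercode.
df_rekeyed = df.withColumn("unique_id", F.monotonically_increasing_id())
```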

Answer selected by RobinL