Issue with performance on Spark after migrating to Splink 4 #2439

Answered by Priebe1
Priebe1 asked this question in Q&A

Hi Robin,
Thanks for the response! :)
The hint to __splink__blocked_id_pairs helped me pin down the issue. Our fake data generator erroneously did not create unique values in our ID column customercode, which led to duplicated entries in the table :/
In Splink v4 this results in a massive join between __splink__df_concat_with_tf and __splink__blocked_id_pairs, which is joined via
```sql
INNER JOIN __splink__blocked_id_pairs AS b
    ON l.customercode = b.join_key_l
INNER JOIN __splink__df_concat_with_tf AS r
    ON r.customercode = b.join_key_r
```
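For anyone who hits the same symptom: a quick uniqueness check on the ID column before running the linkage catches this early. A minimal PySpark sketch, assuming the input sits in a DataFrame `df` and the ID column is customercode (the table name `customers` is made up for the example):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("customers")  # hypothetical source table

# Any customercode appearing more than once fans out the join on
# __splink__blocked_id_pairs: k duplicates on each side turn one
# blocked pair into k * k joined rows.
duplicates = (
    df.groupBy("customercode")
      .count()
      .filter(F.col("count") > 1)
)
duplicates.show()

# Fail fast before handing non-unique IDs to Splink.
assert duplicates.count() == 0, "customercode is not unique"
```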

In Splink v3, on the other hand, I assume this issue did not come up because the join is designed differently and therefore produces a different Spark execution plan…
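Independent of the v3/v4 difference, the robust fix is to guarantee a unique ID before linkage. A sketch of two ways to do that (my own suggestion, not something Splink prescribes), reusing the `df` from the check above:

```python
import pyspark.sql.functions as F

# Option A: drop exact duplicate IDs (keeps an arbitrary row per customercode).
df_deduped = df.dropDuplicates(["customercode"])

# Option B: mint a surrogate key that is unique by construction and use it
# as the linkage ID instead of customercode.
df_rekeyed = df.withColumn("unique_id", F.monotonically_increasing_id())
```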

Answer selected by RobinL