Question regarding comparison across Spark sessions. #105
Unanswered
jacksongoode asked this question in Q&A
Replies: 1 comment
-
@jacksongoode - I'd probably try to use the same SparkSession. Try creating the DataFrame with one version of the code, then create another DataFrame with the new version of the code, and compare the two. There are some leaky components of the SparkSession, so that might not work. I'd probably need to dig into your specific example & code to give you a better answer.
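Within a single SparkSession, "compare the two" is commonly done as a multiset difference in both directions, e.g. `df_a.exceptAll(df_b)` and `df_b.exceptAll(df_a)` in PySpark. The pure-Python sketch below mirrors that logic with `collections.Counter`; the row tuples and the `diff_rows` helper are hypothetical illustrations, not part of any library discussed here.

```python
from collections import Counter

def diff_rows(rows_a, rows_b):
    """Multiset difference in both directions, mirroring
    df_a.exceptAll(df_b) / df_b.exceptAll(df_a) in PySpark.
    Returns (rows only in a, rows only in b), with multiplicity."""
    ca, cb = Counter(rows_a), Counter(rows_b)
    only_a = list((ca - cb).elements())  # rows in a missing from b
    only_b = list((cb - ca).elements())  # rows in b missing from a
    return only_a, only_b

# Hypothetical example: the same pipeline run on main vs. on a PR branch
run_main = [("alice", 1), ("bob", 2), ("bob", 2)]
run_pr = [("alice", 1), ("bob", 2), ("carol", 3)]

only_main, only_pr = diff_rows(run_main, run_pr)
print(only_main)  # [('bob', 2)]
print(only_pr)    # [('carol', 3)]
```

Both frames are equal exactly when both differences are empty; unlike a scalar checksum, this also tells you *which* rows differ.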
-
Hi @MrPowers, we're currently looking for a reliable method to compare DataFrames in Spark across large datasets. Part of this comparison involves testing between commits (i.e. checking whether a PR produces the same data as main). We've been running into issues where the DataFrames appear to change across sessions. Our current check generates a hash for every row, then sums that column into a single checksum.
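One common culprit for a checksum that drifts across sessions, if the per-row hash goes through Python's built-in `hash()` (e.g. inside a UDF), is that Python salts `str`/`bytes` hashes per interpreter process unless `PYTHONHASHSEED` is pinned; Spark's own `pyspark.sql.functions.hash` (Murmur3) and `xxhash64` are deterministic. The sketch below is a generic, hedged illustration of a session-stable, order-independent checksum using `hashlib`; the helper names and the `"|"` delimiter are assumptions, not the method from this discussion.

```python
import hashlib

def row_checksum(row):
    """Stable per-row hash: hashlib (unlike Python's built-in hash())
    is not salted per process, so the value is identical in every session."""
    payload = "|".join(map(str, row)).encode("utf-8")
    return int.from_bytes(hashlib.md5(payload).digest()[:8], "big")

def table_checksum(rows):
    """Order-independent aggregate: sum of row hashes mod 2**64.
    A matching checksum is a cheap first check, not a proof of equality,
    since independent differences could in principle cancel out."""
    return sum(row_checksum(r) for r in rows) % 2**64

rows = [("alice", 1), ("bob", 2)]
shuffled = [("bob", 2), ("alice", 1)]
print(table_checksum(rows) == table_checksum(shuffled))  # True: row order doesn't matter
```

If the current pipeline hashes via a Python UDF, swapping in a non-salted hash (or `F.hash`/`F.xxhash64` directly in Spark SQL) may be enough to make the checksum reproducible across sessions.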
We're wondering: 1) do you know why this could happen across Spark sessions, given that the code and data are static (or is this method inherently flawed)? 2) Could your library compare DataFrames across sessions (perhaps keyed on some identifier)?
Thanks in advance!
Edit: This question would have been better posed in chispa.