Question regarding comparison across Spark sessions. #105
Unanswered
jacksongoode asked this question in Q&A
Replies: 1 comment
-
@jacksongoode - I'd probably try to use the same SparkSession. Try creating the DataFrame with one version of the code, then create another DataFrame with the new version of the code, and compare the two. There are some leaky components of the SparkSession, so that might not work. I'd probably need to dig into your specific example & code to give you a better answer.
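Within a single SparkSession, "compare the two" is commonly done as a multiset difference in both directions, e.g. `df_a.exceptAll(df_b)` and `df_b.exceptAll(df_a)` in PySpark. The pure-Python sketch below mirrors that logic with `collections.Counter`; the row tuples and the `diff_rows` helper are hypothetical illustrations, not part of any library discussed here.

```python
from collections import Counter

def diff_rows(rows_a, rows_b):
    """Multiset difference in both directions, mirroring
    df_a.exceptAll(df_b) / df_b.exceptAll(df_a) in PySpark.
    Returns (rows only in a, rows only in b), with multiplicity."""
    ca, cb = Counter(rows_a), Counter(rows_b)
    only_a = list((ca - cb).elements())  # rows in a missing from b
    only_b = list((cb - ca).elements())  # rows in b missing from a
    return only_a, only_b

# Hypothetical example: the same pipeline run on main vs. on a PR branch
run_main = [("alice", 1), ("bob", 2), ("bob", 2)]
run_pr = [("alice", 1), ("bob", 2), ("carol", 3)]

only_main, only_pr = diff_rows(run_main, run_pr)
print(only_main)  # [('bob', 2)]
print(only_pr)    # [('carol', 3)]
```

Both frames are equal exactly when both differences are empty; unlike a scalar checksum, this also tells you *which* rows differ.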
-
Hi @MrPowers, we're currently looking for a reliable method to compare DataFrames in Spark across large datasets. Part of this comparison involves testing between commits (i.e. checking whether a PR produces the same data as main). We've been running into issues where the DataFrames appear to change across sessions. Our current check generates a hash for every row, then sums that column into a single checksum.
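One common culprit for a checksum that drifts across sessions, if the per-row hash goes through Python's built-in `hash()` (e.g. inside a UDF), is that Python salts `str`/`bytes` hashes per interpreter process unless `PYTHONHASHSEED` is pinned; Spark's own `pyspark.sql.functions.hash` (Murmur3) and `xxhash64` are deterministic. The sketch below is a generic, hedged illustration of a session-stable, order-independent checksum using `hashlib`; the helper names and the `"|"` delimiter are assumptions, not the method from this discussion.

```python
import hashlib

def row_checksum(row):
    """Stable per-row hash: hashlib (unlike Python's built-in hash())
    is not salted per process, so the value is identical in every session."""
    payload = "|".join(map(str, row)).encode("utf-8")
    return int.from_bytes(hashlib.md5(payload).digest()[:8], "big")

def table_checksum(rows):
    """Order-independent aggregate: sum of row hashes mod 2**64.
    A matching checksum is a cheap first check, not a proof of equality,
    since independent differences could in principle cancel out."""
    return sum(row_checksum(r) for r in rows) % 2**64

rows = [("alice", 1), ("bob", 2)]
shuffled = [("bob", 2), ("alice", 1)]
print(table_checksum(rows) == table_checksum(shuffled))  # True: row order doesn't matter
```

If the current pipeline hashes via a Python UDF, swapping in a non-salted hash (or `F.hash`/`F.xxhash64` directly in Spark SQL) may be enough to make the checksum reproducible across sessions.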
We're wondering: 1) do you know why this could happen across Spark sessions, given that the code and data are static (or is this method inherently flawed)? 2) Could your library compare DataFrames across sessions (perhaps keyed on some identifier)?
Thanks in advance!
Edit: This question would have been better posed in chispa.