-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Support for PythonUDAF expression #937
base: dev
Are you sure you want to change the base?
Conversation
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cindyyuanjiang
I don't believe the PR covers the support for PythonUDAF correctly.
There is a bug in handling PythonUDF. So, it is not a good model to follow. Instead, the PR should fix both of them.
If you have eventlogs that has the PythonUDF/PythonUDAF then this is a good start to look more closely at how the SQLPlanParser handles/detects it.
I had intuition that "PythonUDF/PythonUDAF
" won't actually appear in the eventlogs, do they?
@@ -357,7 +357,7 @@ object RDDCheckHelper { | |||
|
|||
object ExecHelper { | |||
private val UDFRegExLookup = Set( | |||
".*UDF.*".r | |||
".*UDF.*".r, ".*UDAF.*".r |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is correct to add the *UDAF*
to account for other UDAFs such as Hive and Scala.
However, the problem that the Q tool uses UDFRegExLookup
to decide whether something is uDF or not.
That being said, any node description containing *UDAF*
is going to be flagged as not-supported because it is a UDF.
Note that the same problem exists for PythonUDF
which is supposed to be supported.
Thanks @amahussein!
|
It is not explicitl mentioned but you can see that BatchEvalPython is marked as unsupported because it thought that PythobnUDF is a UDF expr.
That's the part I was mentioning here that the tools was considering PythonUDF as non-supported.
|
Signed-off-by: cindyyuanjiang <[email protected]>
Thanks @amahussein!
Now we have python UDF as supported, should we modify |
Signed-off-by: cindyyuanjiang <[email protected]>
@@ -313,7 +315,7 @@ object ExecInfo { | |||
duration, | |||
nodeId, | |||
opType, | |||
isSupported, | |||
finalSupported, | |||
children, | |||
stages, | |||
shouldRemove, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to make changes to the usage of variable containsUDF
when passing to createExecNoNode()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. Now containsUDF
denotes that the exec contains Scala UDFs after we changed the matching regex, so for python UDFs, this value would remain false
as intended.
Signed-off-by: cindyyuanjiang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @cindyyuanjiang for this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cindyyuanjiang
- I cannot see how the changes will treat pythonUDAF as supported. All the changes are related to pythonUDF
- I don't see how the Q tool shows the batchEvalPython as supported. That means that the changes done in ExecHelper did change the behavior of the flow. Note that pythonUDF/PythonUDAF are expressions and we should not expect to see the Q tool changes the decision about an exec (
batchEvalPython
) without main changes to the Exec parser. If you confirm that batchEvalPython is always linked to pythonUDF/PythonUDAF, then we should simply add code to catch that exec and mark it as supported without need to parse the content.
CC: @nartal1 any thoughts?
@@ -331,7 +331,7 @@ abstract class AppBase( | |||
} | |||
} | |||
|
|||
private val UDFRegex = ".*UDF.*" | |||
private val UDFRegex = ".*(?<!python)UDF.*" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this won't match pythonUDAF
".*(?<!python)UDF.*".r | ||
) | ||
|
||
private val pythonUDFRegExLookUp = Set( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here as my previous comment. There are three regular expressions that look overlapping.
) | ||
|
||
private val pythonUDFRegExLookUp = Set( | ||
".*pythonUDF.*".r |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not take pythonUDAF
into considerations.
val projectExec = execInfo.filter(_.exec.contains("Project")) | ||
assertSizeAndSupported(2, projectExec) | ||
val batchEvalPythonExec = execInfo.filter(_.exec.contains("BatchEvalPython")) | ||
assertSizeAndSupported(1, batchEvalPythonExec) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is something I missing here. I cannot find anywhere in the code that we handle BatchEvalPython
exec.
How come that the result shows it as supported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BatchEvalPython should not be supported. I will fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have we confirmed that for a fact?
I mean did we verify that it is most likely that BatchEvalPython
is associated with PythonUDF and PythonUDAF? if that's the case then all we need to do is considering that the entire exec is supported with being conservative on the speedup factor.
As far as I can see in the plugin code, the RAPIDS has very limited capability accelerating those expressions anyway. So, the speedup should be near 1.
|
||
// if the expression is RDD because of the node name, then we do not want to add the | ||
// unsupportedExpressions because it becomes bogus. | ||
val finalUnsupportedExpr = if (rddCheckRes.nodeDescRDD) { | ||
val finalUnsupportedExpr = if (rddCheckRes.nodeDescRDD || containsPythonUDF) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is correct.
The code comment just above that change is explaining that we avoid parsing the node content when it is rddCheckRes.nodeDescRDD
. But the new code tries to check if it is pythonUDF anyway which could lead to bogus results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the confusion. I will address this in the next iteration.
I am confused how batchEvalPython is marked as supported and is passing the test. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's be more conservative with the speedup value we assign for PythonUDF and PythonUDAF.
I would suggest a speedup-factor closer to 1.0 actually.
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Thanks @amahussein @nartal1 for the feedback! pythonUDAF pythonUDF Note: Followup: had an offline discussion. I will add a parser for |
Yes, It seems that all |
Signed-off-by: cindyyuanjiang <[email protected]>
Fixes #936
Changes
Added support for
PythonUDAF
expression.Testing
Tested using
spark_rapids
commands with event log generated locally usingpyspark
shell. Confirmed behavior is same asPythonUDF
support in tools.