[SPARK-47818][CONNECT] Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests

### What changes were proposed in this pull request?

While building a DataFrame step by step, each transformation produces a new DataFrame whose schema is empty and lazily computed on access. If user code frequently accesses the schema of these new DataFrames, e.g. via `df.columns`, it results in a large number of Analyze requests to the server. Each request re-analyzes the entire plan, leading to poor performance, especially when constructing highly complex plans. This PR introduces a plan cache in SparkConnectPlanner to reduce the overhead of repeated analysis: significant computation is saved whenever the resolved logical plan of a subtree can be served from the cache.

A minimal example of the problem:

```
import pyspark.sql.functions as F

df = spark.range(10)
for i in range(200):
    if str(i) not in df.columns:  # <-- The df.columns call causes a new Analyze request in every iteration
        df = df.withColumn(str(i), F.col("id") + i)
df.show()
```

With this patch, the runtime of the above code improved from ~110s to ~5s.

### Why are the changes needed?

The performance improvement is significant in cases like the one above.

### Does this PR introduce _any_ user-facing change?

Yes, a static conf `spark.connect.session.planCache.maxSize` and a dynamic conf `spark.connect.session.planCache.enabled` are added.

* `spark.connect.session.planCache.maxSize`: Sets the maximum number of cached resolved logical plans in a Spark Connect session. Setting it to a value less than or equal to zero disables the plan cache.
* `spark.connect.session.planCache.enabled`: When true, the cache of resolved logical plans is enabled if `spark.connect.session.planCache.maxSize` is greater than zero. When false, the cache is disabled even if `spark.connect.session.planCache.maxSize` is greater than zero.

The caching is best-effort and not guaranteed.

### How was this patch tested?

New tests are added in SparkConnectSessionHolderSuite.scala.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#46012 from xi-db/SPARK-47818-plan-cache.

Lead-authored-by: Xi Lyu <[email protected]>
Co-authored-by: Xi Lyu <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
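The mechanism described above can be pictured as a small LRU map from a plan (the cache key) to its resolved logical plan. The following is a minimal, hypothetical Python sketch of that idea, not the actual Scala implementation in Spark Connect; the `PlanCache` class, its method names, and the string keys are all illustrative:

```python
from collections import OrderedDict

class PlanCache:
    """Best-effort LRU cache mapping a plan key to its resolved form,
    mirroring the idea behind spark.connect.session.planCache.maxSize.
    (Illustrative sketch only; names do not come from the Spark codebase.)"""

    def __init__(self, max_size: int):
        self.max_size = max_size
        self._cache: OrderedDict = OrderedDict()

    def get_or_resolve(self, plan_key, resolve):
        # A max size <= 0 disables the cache: always re-analyze.
        if self.max_size <= 0:
            return resolve(plan_key)
        if plan_key in self._cache:
            self._cache.move_to_end(plan_key)  # mark as recently used
            return self._cache[plan_key]
        resolved = resolve(plan_key)
        self._cache[plan_key] = resolved
        if len(self._cache) > self.max_size:
            self._cache.popitem(last=False)  # evict least recently used
        return resolved

# Usage: repeated analysis of the same subtree hits the cache,
# so the expensive resolve step runs only once per distinct plan.
calls = []

def fake_resolve(plan_key):
    calls.append(plan_key)      # count how often we actually "analyze"
    return f"resolved:{plan_key}"

cache = PlanCache(max_size=2)
cache.get_or_resolve("range(10)", fake_resolve)
cache.get_or_resolve("range(10)", fake_resolve)
print(len(calls))  # the second lookup is served from the cache
```

This is analogous to why the `df.columns` loop in the example speeds up: the schema of an unchanged subtree no longer forces a full re-analysis on every Analyze request.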
1 parent 6762d1f · commit a1fc6d5 · 5 changed files with 345 additions and 104 deletions.