[SPARK-51027][SQL] Prevent HiveClient.runSqlHive invocation in non-testing environment
### What changes were proposed in this pull request?

This PR aims to prevent `HiveClient.runSqlHive` invocation in non-testing environment.

### Why are the changes needed?

`HiveClient.runSqlHive` is a testing utility that Apache Spark uses during unit tests to compare results with Hive. We should guarantee that it is not used in non-testing environments.

```
$ git grep runSqlHive | grep test | wc -l
      84

$ git grep runSqlHive | grep -v test
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClient.scala:  def runSqlHive(sql: String): Seq[String]
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:  override def runSqlHive(sql: String): Seq[String] = {
```
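The guard this PR adds can be sketched in isolation. Below is a minimal, hypothetical sketch (not Spark's actual code) of the pattern: a test-only utility asserts a testing flag before doing any work. The names `GuardSketch` and `runSqlHiveLike` are made up for illustration; in Spark the check is `Utils.isTesting`, which to my understanding looks at the `SPARK_TESTING` environment variable or the `spark.testing` system property.

```scala
// Hypothetical sketch of the "testing-only" guard pattern used by this PR.
object GuardSketch {
  // Mirrors the spirit of org.apache.spark.util.Utils.isTesting:
  // true when a test flag is present in the environment or system properties.
  def isTesting: Boolean =
    sys.env.contains("SPARK_TESTING") || sys.props.contains("spark.testing")

  // A stand-in for a test-only utility: fail fast outside a test environment.
  def runSqlHiveLike(sql: String): Seq[String] = {
    assert(isTesting, "spark.testing is not set to true")
    Seq(s"(pretend output for: $sql)")
  }
}
```

Failing fast with an assertion keeps the cost to a single boolean check on the happy path while making accidental production use loudly visible.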

### Does this PR introduce _any_ user-facing change?

No. This is a method of the private trait `HiveClient`.

https://github.com/apache/spark/blob/3badfd2c41d33fe5f1cbd1d7345c749b7af81574/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClient.scala#L42

### How was this patch tested?

In a testing environment this always succeeds, so it was verified manually.

```
$ build/sbt package -Phive -Phive-thriftserver

$ bin/spark-shell
WARNING: Using incubator modules: jdk.incubator.vector
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.1.0-SNAPSHOT
      /_/

Using Scala version 2.13.15 (OpenJDK 64-Bit Server VM, Java 17.0.14)
Type in expressions to have them evaluated.
Type :help for more information.
25/01/28 22:32:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1738132378341).
Spark session available as 'spark'.

scala> :paste -raw
// Entering paste mode (ctrl-D to finish)

package org.apache.spark

object myObject {
    val spark = org.apache.spark.sql.SparkSession.getActiveSession.get
    val client = spark.sharedState.externalCatalog.unwrapped.asInstanceOf[org.apache.spark.sql.hive.HiveExternalCatalog].client
    client.runSqlHive("SHOW TABLES")
}

// Exiting paste mode... now compiling with scalac.

scala> org.apache.spark.myObject
25/01/28 22:33:23 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
25/01/28 22:33:23 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore dongjoon127.0.0.1
java.lang.AssertionError: assertion failed: spark.testing is not set to true
  at scala.Predef$.assert(Predef.scala:279)
  at org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:865)
  ... 43 elided
```
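The `AssertionError` in the transcript above comes from plain `Predef.assert`, which throws `java.lang.AssertionError` with the message prefixed by `assertion failed: ` — exactly what the stack trace shows. A standalone sketch of that failure mode (names hypothetical, not Spark code):

```scala
// Hypothetical reproduction of the failure mode seen in the transcript:
// Predef.assert throws java.lang.AssertionError when its condition is false,
// prefixing the supplied message with "assertion failed: ".
object AssertDemo {
  def runGuarded(testing: Boolean): Seq[String] = {
    assert(testing, "spark.testing is not set to true")
    Seq("OK")
  }
}
```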

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#49722 from dongjoon-hyun/SPARK-51027.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun committed Jan 29, 2025
1 parent 3badfd2 commit 42898c6
Showing 2 changed files with 4 additions and 1 deletion.
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClient.scala

@@ -55,7 +55,7 @@ private[hive] trait HiveClient {

   /**
    * Runs a HiveQL command using Hive, returning the results as a list of strings. Each row will
-   * result in one string.
+   * result in one string. This should be used only in testing environment.
    */
   def runSqlHive(sql: String): Seq[String]
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala

@@ -49,6 +49,7 @@ import org.apache.spark.{SparkConf, SparkException, SparkThrowable}
 import org.apache.spark.deploy.SparkHadoopUtil.SOURCE_SPARK
 import org.apache.spark.internal.{Logging, LogKeys, MDC}
 import org.apache.spark.internal.LogKeys._
+import org.apache.spark.internal.config.Tests.IS_TESTING
 import org.apache.spark.metrics.source.HiveCatalogMetrics
 import org.apache.spark.sql.catalyst.TableIdentifier
 import org.apache.spark.sql.catalyst.analysis.{DatabaseAlreadyExistsException, NoSuchDatabaseException, NoSuchPartitionException, NoSuchPartitionsException, NoSuchTableException, PartitionsAlreadyExistException}
@@ -858,8 +859,10 @@ private[hive] class HiveClientImpl(

   /**
    * Runs the specified SQL query using Hive.
+   * This should be used only in testing environment.
    */
   override def runSqlHive(sql: String): Seq[String] = {
+    assert(Utils.isTesting, s"${IS_TESTING.key} is not set to true")
     val maxResults = 100000
     val results = runHive(sql, maxResults)
     // It is very confusing when you only get back some of the results...
