[SPARK-51027][SQL] Prevent HiveClient.runSqlHive invocation in non-testing environment
### What changes were proposed in this pull request?

This PR aims to prevent `HiveClient.runSqlHive` invocation in non-testing environment.

### Why are the changes needed?

`HiveClient.runSqlHive` is a testing utility that Apache Spark uses during unit tests to compare results with Hive. We should guarantee that it is not used in non-testing environments.

```
$ git grep runSqlHive | grep test | wc -l
      84

$ git grep runSqlHive | grep -v test
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClient.scala:  def runSqlHive(sql: String): Seq[String]
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:  override def runSqlHive(sql: String): Seq[String] = {
```
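The guard this PR adds can be sketched in isolation. Below is a minimal, hypothetical sketch (not Spark's actual code) of the pattern: a test-only utility asserts a testing flag before doing any work. The names `GuardSketch` and `runSqlHiveLike` are made up for illustration; in Spark the check is `Utils.isTesting`, which to my understanding looks at the `SPARK_TESTING` environment variable or the `spark.testing` system property.

```scala
// Hypothetical sketch of the "testing-only" guard pattern used by this PR.
object GuardSketch {
  // Mirrors the spirit of org.apache.spark.util.Utils.isTesting:
  // true when a test flag is present in the environment or system properties.
  def isTesting: Boolean =
    sys.env.contains("SPARK_TESTING") || sys.props.contains("spark.testing")

  // A stand-in for a test-only utility: fail fast outside a test environment.
  def runSqlHiveLike(sql: String): Seq[String] = {
    assert(isTesting, "spark.testing is not set to true")
    Seq(s"(pretend output for: $sql)")
  }
}
```

Failing fast with an assertion keeps the cost to a single boolean check on the happy path while making accidental production use loudly visible.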

### Does this PR introduce _any_ user-facing change?

No. This is a method of the private trait `HiveClient`.

https://github.com/apache/spark/blob/3badfd2c41d33fe5f1cbd1d7345c749b7af81574/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClient.scala#L42

### How was this patch tested?

In a testing environment this always succeeds, so it was verified manually.

```
$ build/sbt package -Phive -Phive-thriftserver

$ bin/spark-shell
WARNING: Using incubator modules: jdk.incubator.vector
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.1.0-SNAPSHOT
      /_/

Using Scala version 2.13.15 (OpenJDK 64-Bit Server VM, Java 17.0.14)
Type in expressions to have them evaluated.
Type :help for more information.
25/01/28 22:32:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1738132378341).
Spark session available as 'spark'.

scala> :paste -raw
// Entering paste mode (ctrl-D to finish)

package org.apache.spark

object myObject {
    val spark = org.apache.spark.sql.SparkSession.getActiveSession.get
    val client = spark.sharedState.externalCatalog.unwrapped.asInstanceOf[org.apache.spark.sql.hive.HiveExternalCatalog].client
    client.runSqlHive("SHOW TABLES")
}

// Exiting paste mode... now compiling with scalac.

scala> org.apache.spark.myObject
25/01/28 22:33:23 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
25/01/28 22:33:23 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore dongjoon127.0.0.1
java.lang.AssertionError: assertion failed: spark.testing is not set to true
  at scala.Predef$.assert(Predef.scala:279)
  at org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:865)
  ... 43 elided
```
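The `AssertionError` in the transcript above comes from plain `Predef.assert`, which throws `java.lang.AssertionError` with the message prefixed by `assertion failed: ` — exactly what the stack trace shows. A standalone sketch of that failure mode (names hypothetical, not Spark code):

```scala
// Hypothetical reproduction of the failure mode seen in the transcript:
// Predef.assert throws java.lang.AssertionError when its condition is false,
// prefixing the supplied message with "assertion failed: ".
object AssertDemo {
  def runGuarded(testing: Boolean): Seq[String] = {
    assert(testing, "spark.testing is not set to true")
    Seq("OK")
  }
}
```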

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#49722 from dongjoon-hyun/SPARK-51027.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun committed Jan 29, 2025
1 parent 3badfd2 commit 42898c6
Showing 2 changed files with 4 additions and 1 deletion.
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClient.scala

@@ -55,7 +55,7 @@ private[hive] trait HiveClient {

   /**
    * Runs a HiveQL command using Hive, returning the results as a list of strings. Each row will
-   * result in one string.
+   * result in one string. This should be used only in testing environment.
    */
   def runSqlHive(sql: String): Seq[String]
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala

@@ -49,6 +49,7 @@ import org.apache.spark.{SparkConf, SparkException, SparkThrowable}
 import org.apache.spark.deploy.SparkHadoopUtil.SOURCE_SPARK
 import org.apache.spark.internal.{Logging, LogKeys, MDC}
 import org.apache.spark.internal.LogKeys._
+import org.apache.spark.internal.config.Tests.IS_TESTING
 import org.apache.spark.metrics.source.HiveCatalogMetrics
 import org.apache.spark.sql.catalyst.TableIdentifier
 import org.apache.spark.sql.catalyst.analysis.{DatabaseAlreadyExistsException, NoSuchDatabaseException, NoSuchPartitionException, NoSuchPartitionsException, NoSuchTableException, PartitionsAlreadyExistException}
@@ -858,8 +859,10 @@ private[hive] class HiveClientImpl(

   /**
    * Runs the specified SQL query using Hive.
+   * This should be used only in testing environment.
    */
   override def runSqlHive(sql: String): Seq[String] = {
+    assert(Utils.isTesting, s"${IS_TESTING.key} is not set to true")
     val maxResults = 100000
     val results = runHive(sql, maxResults)
     // It is very confusing when you only get back some of the results...
