SONARPY-2496 Create rule S7193 PySpark DataFrame toPandas function …

…should be avoided
SonarSource · Jan 30, 2025 · 6f752eb · 6f752eb
1 parent 14c80b8
commit 6f752eb
Show file tree

Hide file tree

Showing 3 changed files with 94 additions and 0 deletions.
diff --git a/rules/S7193/metadata.json b/rules/S7193/metadata.json
@@ -0,0 +1,2 @@
+{
+}
diff --git a/rules/S7193/python/metadata.json b/rules/S7193/python/metadata.json
@@ -0,0 +1,25 @@
+{
+  "title": "FIXME",
+  "type": "CODE_SMELL",
+  "status": "ready",
+  "remediation": {
+    "func": "Constant\/Issue",
+    "constantCost": "5min"
+  },
+  "tags": [
+    "data-science",
+    "pyspark"
+  ],
+  "defaultSeverity": "Major",
+  "ruleSpecification": "RSPEC-7193",
+  "sqKey": "S7193",
+  "scope": "All",
+  "defaultQualityProfiles": ["Sonar way"],
+  "quickfix": "unknown",
+  "code": {
+    "impacts": {
+      "RELIABILITY": "MEDIUM"
+    },
+    "attribute": "EFFICIENT"
+  }
+}
diff --git a/rules/S7193/python/rule.adoc b/rules/S7193/python/rule.adoc
@@ -0,0 +1,67 @@
+This rule raises an issue when `toPandas` is used in PySpark in a way that can lead to memory or performance bottlenecks.
+
+== Why is this an issue?
+
+The use of `toPandas` in PySpark can lead to performance and memory management issues.
+
+PySpark is designed to handle large-scale data processing in a distributed manner, leveraging the power of cluster computing to efficiently manage and process big data. Calling `toPandas` collects all data from a Spark `DataFrame` into a Pandas `DataFrame` on a single machine. This can lead to memory issues and performance bottlenecks, especially with large datasets, which is contrary to the distributed nature of Spark.
+
+For this reason, it is generally advisable to avoid using `toPandas` unless you are certain that the dataset is small enough to be handled comfortably by a single machine. Instead, consider using Spark's built-in functions and capabilities to perform data processing tasks in a distributed manner.
+
+If conversion to Pandas is necessary, ensure that the dataset size is manageable and that the conversion is justified by specific requirements, such as integration with libraries that require Pandas DataFrames.
+
+=== Exceptions
+
+This rule will not raise issues in the following context:
+
+* When visualization is performed with other libraries such as `matplotlib`, `seaborn`, etc.
+* When the `DataFrame` is of a limited size (e.g. following a call to `limit` or an aggregation through `groupBy`)
+* In tests
+
+== How to fix it
+
+To fix this issue, consider using PySpark built-in capabilities without relying on Pandas whenever possible.
+
+=== Code examples
+
+==== Noncompliant code example
+
+[source,python,diff-id=1,diff-type=noncompliant]
+----
+# Converting a PySpark DataFrame to a Pandas DataFrame
+df = spark.createDataFrame([(1, 'Alice'), (2, 'Bob'), (3, 'Charlie')], ["id", "name"])
+pandas_df = df.toPandas()  # Noncompliant: May cause memory issues with large datasets
+print(pandas_df)
+----
+
+==== Compliant solution
+
+[source,python,diff-id=1,diff-type=compliant]
+----
+from pyspark.sql.functions import col
+
+# Using PySpark DataFrame operations
+df = spark.createDataFrame([(1, 'Alice'), (2, 'Bob'), (3, 'Charlie')], ["id", "name"])
+
+# Perform operations using PySpark's distributed capabilities
+filtered_df = df.filter(col("id") > 1)
+filtered_df.show()
+----
+
+== Resources
+=== Documentation
+
+* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.toPandas.html[pyspark.sql.DataFrame.toPandas]
+
+
+ifdef::env-github,rspecator-view[]
+
+'''
+== Implementation Specification
+(visible only on this page)
+
+=== Message
+
+Consider using PySpark's built-in capabilities instead of converting this `DataFrame` to Pandas.
+
+endif::env-github,rspecator-view[]