feat(datasets): Improved Dependency Management for Spark-based Datasets #911
Conversation
Signed-off-by: Minura Punchihewa <[email protected]>
…asets Signed-off-by: Minura Punchihewa <[email protected]>
Hey @noklam,
```python
import os


def parse_glob_pattern(pattern: str) -> str:
    special = ("*", "?", "[")
    clean = []
    for part in pattern.split("/"):
        if any(char in part for char in special):
            break
        clean.append(part)
    return "/".join(clean)


def split_filepath(filepath: str | os.PathLike) -> tuple[str, str]:
    split_ = str(filepath).split("://", 1)
    if len(split_) == 2:  # noqa: PLR2004
        return split_[0] + "://", split_[1]
    return "", split_[0]
```
AFAIK, these are used for Spark only. I do not want other datasets to use it because it's more like a workaround to handle Spark/Databricks magic paths.
Most datasets should be using the utils coming from kedro instead:
kedro-plugins/kedro-datasets/kedro_datasets/pandas/csv_dataset.py (lines 17 to 19 in 1a18c5a):

```python
    Version,
    get_filepath_str,
    get_protocol_and_path,
```
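As a rough sketch of the alternative noklam points at (the call sites below are assumptions, not code from this PR), kedro's own helpers split and rebuild protocol-qualified paths:

```python
# Sketch of kedro's built-in path utilities (assumed import path: kedro.io.core).
from pathlib import PurePosixPath

from kedro.io.core import get_filepath_str, get_protocol_and_path

protocol, path = get_protocol_and_path("s3://my-bucket/data/cars.csv")
# protocol == "s3", path == "my-bucket/data/cars.csv"

full_path = get_filepath_str(PurePosixPath(path), protocol)
# full_path == "s3://my-bucket/data/cars.csv"
```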
@noklam Understood. I will move these to the spark_utils.py module.
Signed-off-by: Minura Punchihewa <[email protected]>
Hey @noklam,
I left some more comments, but I only just noticed they weren't sent properly 😅
```python
    return sorted(matched)


def get_dbutils(spark: SparkSession) -> Any:
```
Is this always a SparkSession? I think it can be a DatabricksSession.
```python
from pyspark.sql import SparkSession


def get_spark() -> Any:
```
I think this should be SparkSession and DatabricksSession? You should also wrap it inside TYPE_CHECKING to avoid import errors.
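A minimal sketch of that suggestion (not the PR's actual code; the `databricks.connect` import path for `DatabricksSession` is an assumption based on databricks-connect):

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Union

if TYPE_CHECKING:
    # Only imported for type checking, so neither pyspark nor
    # databricks-connect is required just to import this module.
    from databricks.connect import DatabricksSession
    from pyspark.sql import SparkSession


def get_spark() -> Union[SparkSession, DatabricksSession]:
    """Return a DatabricksSession when databricks-connect is available, else a SparkSession."""
    try:
        from databricks.connect import DatabricksSession

        return DatabricksSession.builder.getOrCreate()
    except ImportError:
        from pyspark.sql import SparkSession

        return SparkSession.builder.getOrCreate()
```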
Signed-off-by: Minura Punchihewa <[email protected]>
Haha no problem. I've made the suggested improvements to the type hints, including a couple more involving
Thanks for addressing all the comments and for your contribution on the Databricks dataset, great stuff!
Thanks for this contribution @MinuraPunchihewa! ⭐ Can you update the release notes and add your change + your name to contributors?
Signed-off-by: Minura Punchihewa <[email protected]>
Thanks, @merelcht. I've just updated the release notes.
Signed-off-by: Merel Theisen <[email protected]>
Description
This PR adds a `_utils` sub-package to house modules with common utility functions that are used across Spark-based datasets. This avoids the need for `pyspark` to be installed for datasets that will run on Databricks.

Fixes #849
Development notes
The new `_utils` package organizes the utility functions into three main modules. Additional modules can be added to this sub-package to house code that is used in multiple datasets.
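For illustration, a Databricks dataset can then pull these helpers from the shared sub-package instead of defining them locally (the module path below is an assumption based on the discussion above, not necessarily the PR's exact layout):

```python
# Hypothetical usage inside a Databricks dataset; the module path
# kedro_datasets._utils.spark_utils is assumed, not taken from the PR.
from kedro_datasets._utils.spark_utils import get_dbutils, get_spark

spark = get_spark()           # SparkSession or DatabricksSession, whichever is available
dbutils = get_dbutils(spark)  # Databricks dbutils handle for filesystem operations
```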
These changes have been tested with the `ManagedTableDataset` and `ExternalTableDataset`.

Checklist
Updated the `RELEASE.md` file