feat(datasets): Added the Experimental PolarsDatabaseDataset #990

MinuraPunchihewa · 2025-01-14T18:08:47Z

Description

This PR adds the PolarsDatabaseDataset to support interactions with databases using Polars.

Fixes #853

Development notes

I have extended the SQLQueryDataset to implement this dataset.

These changes have been tested,

~~1. Manually, by running the code locally to load and save tensors from and to Safetensors files.~~
~~2. Via the existing and newly added unit tests.~~

Checklist

Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the relevant RELEASE.md file
Added tests to cover my changes
Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)

Signed-off-by: Minura Punchihewa <[email protected]>

MinuraPunchihewa · 2025-01-14T18:11:32Z

Hey @noklam, @deepyaman,
I was able to come up with this implementation for the PolarsDatabaseDataset by extending SQLQueryDataset and it seems to work quite well (at least load() does).

Should we implement save() as well? This would require a table name to be provided as a parameter.

Or do you have different thoughts on how this dataset ought to be implemented?

noklam

It'd be great if there were at least one example of how to use this, maybe with a SQLite database to avoid the setup.

noklam · 2025-01-27T15:25:22Z

kedro-datasets/kedro_datasets_experimental/polars/polars_database_dataset.py

+class PolarsDatabaseDataset(SQLQueryDataset):
+
+    def __init__(  # noqa: PLR0913
+        self,


Suggested change

-        self,
+        self,
+        *,

To make this keyword-only.
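
A minimal sketch of what the suggested `*` does, for readers unfamiliar with the syntax. `ExampleDataset` and its parameters are hypothetical stand-ins, not the PR's actual signature:

```python
class ExampleDataset:
    """Hypothetical stand-in for a dataset class; not the code from this PR."""

    def __init__(self, *, sql=None, filepath=None, load_args=None):
        # The bare `*` forces every parameter after it to be passed by keyword.
        self.sql = sql
        self.filepath = filepath
        self.load_args = load_args or {}


# Keyword call: accepted.
ds = ExampleDataset(sql="SELECT 1")

# Positional call: rejected with a TypeError at call time.
try:
    ExampleDataset("SELECT 1")
except TypeError as exc:
    positional_error = exc
```

With keyword-only arguments, reordering `sql` and `filepath` later cannot silently break existing callers.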

noklam · 2025-01-27T15:25:52Z

kedro-datasets/kedro_datasets_experimental/polars/polars_database_dataset.py

+        )
+
+    def save(self, data: None) -> NoReturn:
+        pass


maybe NotImplemented instead?
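
One way to read this suggestion, sketched under assumptions: `NotImplemented` is the constant returned from binary operator methods, so the exception `NotImplementedError` is presumably what is meant. The `NoReturn` annotation also promises that the method never returns normally, which a bare `pass` does not honour. `ReadOnlyDataset` is a hypothetical class used only for illustration, and the project may prefer its own error type:

```python
from typing import NoReturn


class ReadOnlyDataset:
    """Hypothetical read-only dataset, used only to illustrate the pattern."""

    def save(self, data: None) -> NoReturn:
        # NoReturn means "never returns normally", so the method must raise.
        raise NotImplementedError(
            f"'save' is not supported on {type(self).__name__}"
        )
```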

noklam · 2025-01-27T15:28:17Z

kedro-datasets/kedro_datasets_experimental/polars/polars_database_dataset.py

+    def load(self) -> pl.DataFrame:
+        load_args = copy.deepcopy(self._load_args)
+
+        if self._filepath:
+            load_path = get_filepath_str(PurePosixPath(self._filepath), self._protocol)
+            with self._fs.open(load_path, mode="r") as fs_file:
+                query = fs_file.read()
+        else:
+            query = load_args.pop("sql")
+
+        return pl.read_database(
+            query=query,
+            connection=self._connection_str,
+            **load_args


Can you add some docs or a check to reflect this logic?

If filepath is provided, use it.

Otherwise, use sql.

If both are defined, maybe error out, or at least log which one is being used?

Ideally I would reorder the arguments as well: the dataset takes sql as its first argument, but filepath actually has higher priority, which feels counter-intuitive.
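
A rough sketch of the kind of check being asked for, assuming the stricter option (error out when both are given). `resolve_query` is a hypothetical helper, and it reads a local file with `pathlib` rather than going through the fsspec filesystem used in the diff above:

```python
from pathlib import Path
from typing import Optional


def resolve_query(sql: Optional[str] = None, filepath: Optional[str] = None) -> str:
    """Return the SQL text from exactly one of `sql` or `filepath`."""
    if sql and filepath:
        # Ambiguous input: fail loudly instead of silently preferring one.
        raise ValueError("Provide either 'sql' or 'filepath', not both.")
    if sql:
        return sql
    if filepath:
        # Assumption: a local path; the PR itself opens the file via fsspec.
        return Path(filepath).read_text(encoding="utf-8")
    raise ValueError("Provide one of 'sql' or 'filepath'.")
```

Validating in one place (for example in `__init__`) means `load()` no longer needs an implicit priority between the two arguments.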

deepyaman · 2025-01-27T17:08:04Z

> Hey @noklam, @deepyaman, I was able to come up with this implementation for the PolarsDatabaseDataset by extending SQLQueryDataset and it seems to work quite well (at least load() does).

Sorry for the late response; didn't see this.

While there may be opportunities to reduce code copying more broadly, most datasets just inherit from AbstractDataset or AbstractVersionedDataset. Here, inheriting from pandas.SQLQueryDataset adds a pandas dependency, so we shouldn't do that.

> Should we implement save() as well? This would require a table name to be provided as parameter.

Probably, because Polars supports it. You can make the table name optional but require it for save.
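
A minimal sketch of "optional, but required for save", under stated assumptions: `TableAwareDataset`, `table_name`, and `_write` are hypothetical names, and the actual write is replaced by a placeholder so the example stays self-contained. A real implementation would presumably hand the frame to Polars' `DataFrame.write_database`:

```python
from typing import Any, Optional


class TableAwareDataset:
    """Hypothetical dataset where `table_name` is only needed for saving."""

    def __init__(self, *, sql: Optional[str] = None, table_name: Optional[str] = None):
        self._sql = sql
        self._table_name = table_name  # optional: loading never needs it
        self.last_write = None

    def save(self, data: Any) -> None:
        if not self._table_name:
            # Enforced at save time so load-only catalog entries stay valid.
            raise ValueError("'table_name' must be set in order to save data.")
        self._write(data, self._table_name)

    def _write(self, data: Any, table_name: str) -> None:
        # Placeholder for the real database write (an assumption of this sketch).
        self.last_write = (table_name, data)
```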

MinuraPunchihewa added 2 commits January 14, 2025 23:35

added the initial skeleton for the polars database dataset

e5a704d

Signed-off-by: Minura Punchihewa <[email protected]>

updated the implementation by extending SQLQueryDataset

6fc01a8

Signed-off-by: Minura Punchihewa <[email protected]>

noklam reviewed Jan 27, 2025
