perf(datasets): lazily load datasets in init files #277

Merged · 29 commits · Jul 31, 2023
Commits
7666aed
perf(datasets): lazily load datasets in init files (api)
deepyaman Jul 22, 2023
415f685
perf(datasets): lazily load datasets in init files (pandas)
deepyaman Jul 22, 2023
c8c3541
fix(datasets): fix no name in module in api/pandas
deepyaman Jul 23, 2023
7d22db1
perf(datasets): lazily load datasets in init files (biosequence)
deepyaman Jul 23, 2023
eaac498
perf(datasets): lazily load datasets in init files (dask)
deepyaman Jul 24, 2023
9336717
perf(datasets): lazily load datasets in init files (databricks)
deepyaman Jul 24, 2023
be34221
perf(datasets): lazily load datasets in init files (email)
deepyaman Jul 24, 2023
89a3f54
perf(datasets): lazily load datasets in init files (geopandas)
deepyaman Jul 24, 2023
c86de49
perf(datasets): lazily load datasets in init files (holoviews)
deepyaman Jul 24, 2023
4665fe4
perf(datasets): lazily load datasets in init files (json)
deepyaman Jul 24, 2023
92af64d
fix(datasets): resolve "too few public attributes"
deepyaman Jul 24, 2023
23a923a
perf(datasets): lazily load datasets in init files (matplotlib)
deepyaman Jul 24, 2023
96d9a54
perf(datasets): lazily load datasets in init files (networkx)
deepyaman Jul 24, 2023
d004a54
perf(datasets): lazily load datasets in init files (pickle)
deepyaman Jul 25, 2023
21fb499
perf(datasets): lazily load datasets in init files (pillow)
deepyaman Jul 25, 2023
587b2c2
perf(datasets): lazily load datasets in init files (plotly)
deepyaman Jul 25, 2023
ced6be9
Merge branch 'main' into perf/datasets/lazy-loader
deepyaman Jul 25, 2023
eed604d
perf(datasets): lazily load datasets in init files (polars)
deepyaman Jul 25, 2023
b9855a3
perf(datasets): lazily load datasets in init files (redis)
deepyaman Jul 26, 2023
65e0c03
perf(datasets): lazily load datasets in init files (snowflake)
deepyaman Jul 28, 2023
9bcc8a2
perf(datasets): lazily load datasets in init files (spark)
deepyaman Jul 28, 2023
92f6b75
Merge branch 'main' into perf/datasets/lazy-loader
deepyaman Jul 28, 2023
7f56e1b
perf(datasets): lazily load datasets in init files (svmlight)
deepyaman Jul 29, 2023
11ca34e
perf(datasets): lazily load datasets in init files (tensorflow)
deepyaman Jul 29, 2023
4a051c7
perf(datasets): lazily load datasets in init files (text)
deepyaman Jul 29, 2023
c95be7b
perf(datasets): lazily load datasets in init files (tracking)
deepyaman Jul 29, 2023
51f2f46
perf(datasets): lazily load datasets in init files (video)
deepyaman Jul 29, 2023
f237a8b
perf(datasets): lazily load datasets in init files (yaml)
deepyaman Jul 29, 2023
17b32e5
Update RELEASE.md
deepyaman Jul 29, 2023
18 changes: 11 additions & 7 deletions kedro-datasets/RELEASE.md
@@ -1,9 +1,13 @@
# Upcoming Release:

## Major features and improvements
* Added automatic inference of file format for `pillow.ImageDataSet` to be passed to `save()`
* Implemented lazy loading of dataset subpackages and classes.
* Suppose that SQLAlchemy, a Python SQL toolkit, is installed in your Python environment. With this change, the SQLAlchemy library will not be loaded (for `pandas.SQLQueryDataSet` or `pandas.SQLTableDataSet`) if you load a different pandas dataset (e.g. `pandas.CSVDataSet`).
* Added automatic inference of file format for `pillow.ImageDataSet` to be passed to `save()`.

## Bug fixes and other changes
* Improved error messages for missing dataset dependencies.
* Suppose that SQLAlchemy, a Python SQL toolkit, is not installed in your Python environment. Previously, `from kedro_datasets.pandas import SQLQueryDataSet` or `from kedro_datasets.pandas import SQLTableDataSet` would result in `ImportError: cannot import name 'SQLTableDataSet' from 'kedro_datasets.pandas'`. Now, the same imports raise the more helpful and intuitive `ModuleNotFoundError: No module named 'sqlalchemy'`.

## Community contributions
Many thanks to the following Kedroids for contributing PRs to this release:
@@ -12,7 +16,7 @@ Many thanks to the following Kedroids for contributing PRs to this release:

# Release 1.4.2
## Bug fixes and other changes
* Fixed documentations of `GeoJSONDataSet` and `SparkStreamingDataSet`
* Fixed documentations of `GeoJSONDataSet` and `SparkStreamingDataSet`.
* Fixed problematic docstrings causing Read the Docs builds on Kedro to fail.

# Release 1.4.1:
@@ -32,16 +36,16 @@ Many thanks to the following Kedroids for contributing PRs to this release:
## Major features and improvements
* Added pandas 2.0 support.
* Added SQLAlchemy 2.0 support (and dropped support for versions below 1.4).
* Added a save method to the APIDataSet
* Added a save method to `APIDataSet`.
* Reduced constructor arguments for `APIDataSet` by replacing most arguments with a single constructor argument `load_args`. This makes it more consistent with other Kedro DataSets and the underlying `requests` API, and automatically enables the full configuration domain: stream, certificates, proxies, and more.
* Relaxed Kedro version pin to `>=0.16`
* Relaxed Kedro version pin to `>=0.16`.
* Added `metadata` attribute to all existing datasets. This is ignored by Kedro, but may be consumed by users or external plugins.
* Added `ManagedTableDataSet` for managed delta tables on Databricks.

## Bug fixes and other changes
* Relaxed `delta-spark` upper bound to allow compatibility with Spark 3.1.x and 3.2.x.
* Upgraded required `polars` version to 0.17.
* Renamed `TensorFlowModelDataset` to `TensorFlowModelDataSet` to be consistent with all other plugins in kedro-datasets.
* Renamed `TensorFlowModelDataset` to `TensorFlowModelDataSet` to be consistent with all other plugins in Kedro-Datasets.

## Community contributions
Many thanks to the following Kedroids for contributing PRs to this release:
@@ -102,11 +106,11 @@ Datasets are Kedro’s way of dealing with input and output in a data and machin
The datasets have always been part of the core Kedro Framework project inside `kedro.extras`. In Kedro `0.19.0`, we will remove datasets from Kedro to reduce breaking changes associated with dataset dependencies. Instead, users will need to use the datasets from the `kedro-datasets` repository instead.

## Major features and improvements
* Changed `pandas.ParquetDataSet` to load data using pandas instead of parquet
* Changed `pandas.ParquetDataSet` to load data using pandas instead of parquet.

# Release 0.1.0:

The initial release of `kedro-datasets`.
The initial release of Kedro-Datasets.

## Thanks to our main contributors

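To make the two release notes above concrete, here is a minimal sketch of the new import behaviour, assuming an environment where kedro-datasets and pandas are installed but SQLAlchemy is not (illustrative only, not part of the diff):

import sys

from kedro_datasets.pandas import CSVDataSet  # only the csv_dataset submodule is imported

print("sqlalchemy" in sys.modules)  # False: the SQL datasets were never loaded

try:
    from kedro_datasets.pandas import SQLTableDataSet  # triggers the lazy import of sql_dataset
except ModuleNotFoundError as err:
    print(err)  # No module named 'sqlalchemy'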
11 changes: 7 additions & 4 deletions kedro-datasets/kedro_datasets/api/__init__.py
@@ -2,10 +2,13 @@
and returns them into either as string or json Dict.
It uses the python requests library: https://requests.readthedocs.io/en/latest/
"""
from typing import Any

__all__ = ["APIDataSet"]
import lazy_loader as lazy

from contextlib import suppress
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
APIDataSet: Any

with suppress(ImportError):
from .api_dataset import APIDataSet
__getattr__, __dir__, __all__ = lazy.attach(
__name__, submod_attrs={"api_dataset": ["APIDataSet"]}
)
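For readers unfamiliar with lazy_loader, the attach call above is roughly equivalent to defining a module-level __getattr__ by hand (PEP 562). A simplified, hand-written sketch of what it wires up for this package (the real library additionally provides __dir__ and supports attaching whole submodules):

import importlib

__all__ = ["APIDataSet"]

def __getattr__(name):
    # Called only when the attribute is not already defined on the module,
    # so the heavy import happens on first access rather than at package import time.
    if name == "APIDataSet":
        return importlib.import_module(".api_dataset", __name__).APIDataSet
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")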
11 changes: 7 additions & 4 deletions kedro-datasets/kedro_datasets/biosequence/__init__.py
@@ -1,8 +1,11 @@
"""``AbstractDataSet`` implementation to read/write from/to a sequence file."""
from typing import Any

__all__ = ["BioSequenceDataSet"]
import lazy_loader as lazy

from contextlib import suppress
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
BioSequenceDataSet: Any

with suppress(ImportError):
from .biosequence_dataset import BioSequenceDataSet
__getattr__, __dir__, __all__ = lazy.attach(
__name__, submod_attrs={"biosequence_dataset": ["BioSequenceDataSet"]}
)
11 changes: 7 additions & 4 deletions kedro-datasets/kedro_datasets/dask/__init__.py
@@ -1,8 +1,11 @@
"""Provides I/O modules using dask dataframe."""
from typing import Any

__all__ = ["ParquetDataSet"]
import lazy_loader as lazy

from contextlib import suppress
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
ParquetDataSet: Any

with suppress(ImportError):
from .parquet_dataset import ParquetDataSet
__getattr__, __dir__, __all__ = lazy.attach(
__name__, submod_attrs={"parquet_dataset": ["ParquetDataSet"]}
)
11 changes: 7 additions & 4 deletions kedro-datasets/kedro_datasets/databricks/__init__.py
@@ -1,8 +1,11 @@
"""Provides interface to Unity Catalog Tables."""
from typing import Any

__all__ = ["ManagedTableDataSet"]
import lazy_loader as lazy

from contextlib import suppress
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
ManagedTableDataSet: Any

with suppress(ImportError):
from .managed_table_dataset import ManagedTableDataSet
__getattr__, __dir__, __all__ = lazy.attach(
__name__, submod_attrs={"managed_table_dataset": ["ManagedTableDataSet"]}
)
11 changes: 7 additions & 4 deletions kedro-datasets/kedro_datasets/email/__init__.py
@@ -1,8 +1,11 @@
"""``AbstractDataSet`` implementations for managing email messages."""
from typing import Any

__all__ = ["EmailMessageDataSet"]
import lazy_loader as lazy

from contextlib import suppress
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
EmailMessageDataSet: Any

with suppress(ImportError):
from .message_dataset import EmailMessageDataSet
__getattr__, __dir__, __all__ = lazy.attach(
__name__, submod_attrs={"message_dataset": ["EmailMessageDataSet"]}
)
15 changes: 9 additions & 6 deletions kedro-datasets/kedro_datasets/geopandas/__init__.py
@@ -1,8 +1,11 @@
"""``GeoJSONDataSet`` is an ``AbstractVersionedDataSet`` to save and load GeoJSON files.
"""
__all__ = ["GeoJSONDataSet"]
"""``GeoJSONDataSet`` is an ``AbstractVersionedDataSet`` to save and load GeoJSON files."""
from typing import Any

from contextlib import suppress
import lazy_loader as lazy

with suppress(ImportError):
from .geojson_dataset import GeoJSONDataSet
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
GeoJSONDataSet: Any

__getattr__, __dir__, __all__ = lazy.attach(
__name__, submod_attrs={"geojson_dataset": ["GeoJSONDataSet"]}
)
11 changes: 7 additions & 4 deletions kedro-datasets/kedro_datasets/holoviews/__init__.py
@@ -1,8 +1,11 @@
"""``AbstractDataSet`` implementation to save Holoviews objects as image files."""
from typing import Any

__all__ = ["HoloviewsWriter"]
import lazy_loader as lazy

from contextlib import suppress
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
HoloviewsWriter: Any

with suppress(ImportError):
from .holoviews_writer import HoloviewsWriter
__getattr__, __dir__, __all__ = lazy.attach(
__name__, submod_attrs={"holoviews_writer": ["HoloviewsWriter"]}
)
11 changes: 7 additions & 4 deletions kedro-datasets/kedro_datasets/json/__init__.py
@@ -1,8 +1,11 @@
"""``AbstractDataSet`` implementation to load/save data from/to a JSON file."""
from typing import Any

__all__ = ["JSONDataSet"]
import lazy_loader as lazy

from contextlib import suppress
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
JSONDataSet: Any

with suppress(ImportError):
from .json_dataset import JSONDataSet
__getattr__, __dir__, __all__ = lazy.attach(
__name__, submod_attrs={"json_dataset": ["JSONDataSet"]}
)
10 changes: 6 additions & 4 deletions kedro-datasets/kedro_datasets/matplotlib/__init__.py
@@ -1,8 +1,10 @@
"""``AbstractDataSet`` implementation to save matplotlib objects as image files."""
from typing import Any

__all__ = ["MatplotlibWriter"]
import lazy_loader as lazy

from contextlib import suppress
MatplotlibWriter: Any

with suppress(ImportError):
from .matplotlib_writer import MatplotlibWriter
__getattr__, __dir__, __all__ = lazy.attach(
__name__, submod_attrs={"matplotlib_writer": ["MatplotlibWriter"]}
)
28 changes: 16 additions & 12 deletions kedro-datasets/kedro_datasets/networkx/__init__.py
@@ -1,15 +1,19 @@
"""``AbstractDataSet`` implementation to save and load NetworkX graphs in JSON
, GraphML and GML formats using ``NetworkX``."""
"""``AbstractDataSet`` implementation to save and load NetworkX graphs in JSON,
GraphML and GML formats using ``NetworkX``."""
from typing import Any

__all__ = ["GMLDataSet", "GraphMLDataSet", "JSONDataSet"]
import lazy_loader as lazy

from contextlib import suppress
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
GMLDataSet: Any
GraphMLDataSet: Any
JSONDataSet: Any

with suppress(ImportError):
from .gml_dataset import GMLDataSet

with suppress(ImportError):
from .graphml_dataset import GraphMLDataSet

with suppress(ImportError):
from .json_dataset import JSONDataSet
__getattr__, __dir__, __all__ = lazy.attach(
__name__,
submod_attrs={
"gml_dataset": ["GMLDataSet"],
"graphml_dataset": ["GraphMLDataSet"],
"json_dataset": ["JSONDataSet"],
},
)
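A side effect worth noting, sketched below under the assumption that kedro-datasets with this change is installed (no optional graph dependencies are needed for it to run): the attached names remain discoverable through __all__ and dir() even though none of the submodules have been imported yet.

import kedro_datasets.networkx as networkx_datasets

print(sorted(networkx_datasets.__all__))
# ['GMLDataSet', 'GraphMLDataSet', 'JSONDataSet']
print("GraphMLDataSet" in dir(networkx_datasets))  # True, before any submodule is imported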
70 changes: 32 additions & 38 deletions kedro-datasets/kedro_datasets/pandas/__init__.py
@@ -1,42 +1,36 @@
"""``AbstractDataSet`` implementations that produce pandas DataFrames."""
from typing import Any

__all__ = [
"CSVDataSet",
"DeltaTableDataSet",
"ExcelDataSet",
"FeatherDataSet",
"GBQTableDataSet",
"GBQQueryDataSet",
"HDFDataSet",
"JSONDataSet",
"ParquetDataSet",
"SQLQueryDataSet",
"SQLTableDataSet",
"XMLDataSet",
"GenericDataSet",
]
import lazy_loader as lazy

from contextlib import suppress
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
CSVDataSet: Any
DeltaTableDataSet: Any
ExcelDataSet: Any
FeatherDataSet: Any
GBQQueryDataSet: Any
GBQTableDataSet: Any
GenericDataSet: Any
HDFDataSet: Any
JSONDataSet: Any
ParquetDataSet: Any
SQLQueryDataSet: Any
SQLTableDataSet: Any
XMLDataSet: Any

with suppress(ImportError):
from .csv_dataset import CSVDataSet
with suppress(ImportError):
from .deltatable_dataset import DeltaTableDataSet
with suppress(ImportError):
from .excel_dataset import ExcelDataSet
with suppress(ImportError):
from .feather_dataset import FeatherDataSet
with suppress(ImportError):
from .gbq_dataset import GBQQueryDataSet, GBQTableDataSet
with suppress(ImportError):
from .hdf_dataset import HDFDataSet
with suppress(ImportError):
from .json_dataset import JSONDataSet
with suppress(ImportError):
from .parquet_dataset import ParquetDataSet
with suppress(ImportError):
from .sql_dataset import SQLQueryDataSet, SQLTableDataSet
with suppress(ImportError):
from .xml_dataset import XMLDataSet
with suppress(ImportError):
from .generic_dataset import GenericDataSet
__getattr__, __dir__, __all__ = lazy.attach(
__name__,
submod_attrs={
"csv_dataset": ["CSVDataSet"],
"deltatable_dataset": ["DeltaTableDataSet"],
"excel_dataset": ["ExcelDataSet"],
"feather_dataset": ["FeatherDataSet"],
"gbq_dataset": ["GBQQueryDataSet", "GBQTableDataSet"],
"generic_dataset": ["GenericDataSet"],
"hdf_dataset": ["HDFDataSet"],
"json_dataset": ["JSONDataSet"],
"parquet_dataset": ["ParquetDataSet"],
"sql_dataset": ["SQLQueryDataSet", "SQLTableDataSet"],
"xml_dataset": ["XMLDataSet"],
},
)
11 changes: 7 additions & 4 deletions kedro-datasets/kedro_datasets/pickle/__init__.py
@@ -1,8 +1,11 @@
"""``AbstractDataSet`` implementation to load/save data from/to a Pickle file."""
from typing import Any

__all__ = ["PickleDataSet"]
import lazy_loader as lazy

from contextlib import suppress
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
PickleDataSet: Any

with suppress(ImportError):
from .pickle_dataset import PickleDataSet
__getattr__, __dir__, __all__ = lazy.attach(
__name__, submod_attrs={"pickle_dataset": ["PickleDataSet"]}
)
11 changes: 7 additions & 4 deletions kedro-datasets/kedro_datasets/pillow/__init__.py
@@ -1,8 +1,11 @@
"""``AbstractDataSet`` implementation to load/save image data."""
from typing import Any

__all__ = ["ImageDataSet"]
import lazy_loader as lazy

from contextlib import suppress
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
ImageDataSet: Any

with suppress(ImportError):
from .image_dataset import ImageDataSet
__getattr__, __dir__, __all__ = lazy.attach(
__name__, submod_attrs={"image_dataset": ["ImageDataSet"]}
)
15 changes: 9 additions & 6 deletions kedro-datasets/kedro_datasets/plotly/__init__.py
@@ -1,11 +1,14 @@
"""``AbstractDataSet`` implementations to load/save a plotly figure from/to a JSON
file."""
from typing import Any

__all__ = ["PlotlyDataSet", "JSONDataSet"]
import lazy_loader as lazy

from contextlib import suppress
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
JSONDataSet: Any
PlotlyDataSet: Any

with suppress(ImportError):
from .plotly_dataset import PlotlyDataSet
with suppress(ImportError):
from .json_dataset import JSONDataSet
__getattr__, __dir__, __all__ = lazy.attach(
__name__,
submod_attrs={"json_dataset": ["JSONDataSet"], "plotly_dataset": ["PlotlyDataSet"]},
)
11 changes: 7 additions & 4 deletions kedro-datasets/kedro_datasets/polars/__init__.py
@@ -1,8 +1,11 @@
"""``AbstractDataSet`` implementations that produce pandas DataFrames."""
from typing import Any

__all__ = ["CSVDataSet"]
import lazy_loader as lazy

from contextlib import suppress
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
CSVDataSet: Any

with suppress(ImportError):
from .csv_dataset import CSVDataSet
__getattr__, __dir__, __all__ = lazy.attach(
__name__, submod_attrs={"csv_dataset": ["CSVDataSet"]}
)
11 changes: 7 additions & 4 deletions kedro-datasets/kedro_datasets/redis/__init__.py
@@ -1,8 +1,11 @@
"""``AbstractDataSet`` implementation to load/save data from/to a redis db."""
from typing import Any

__all__ = ["PickleDataSet"]
import lazy_loader as lazy

from contextlib import suppress
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
PickleDataSet: Any

with suppress(ImportError):
from .redis_dataset import PickleDataSet
__getattr__, __dir__, __all__ = lazy.attach(
__name__, submod_attrs={"redis_dataset": ["PickleDataSet"]}
)