Yaml public datasets loader #914

Draft
wants to merge 39 commits into
base: main
f21bc07
construct yaml loader and new dataset registry and library
SiQube Jan 1, 2025
b763659
move bsc.py to bsc.yaml
SiQube Jan 1, 2025
ff17dc0
move sbsat.py to sbsat.yaml
SiQube Jan 1, 2025
27244b0
move codecomprehension.py to codecomprehension.yaml
SiQube Jan 1, 2025
d9da57a
move copco.py to copco.yaml
SiQube Jan 1, 2025
269dc6a
move didec.py to didec.yaml
SiQube Jan 1, 2025
e2c9d79
move emtec.py to emtec.yaml
SiQube Jan 1, 2025
6cc30c7
move fakenews.py to fakenews.yaml
SiQube Jan 1, 2025
16f2c89
move gazebase*.py to gazebase*.yaml
SiQube Jan 1, 2025
17bf864
move gaze_graph.py to gaze_graph.yaml
SiQube Jan 1, 2025
6311b89
move gaze_on_faces.py to gaze_on_faces.yaml
SiQube Jan 1, 2025
806a826
move hbn.py to hbn.yaml
SiQube Jan 1, 2025
c26392d
move judo1000.py to judo1000.yaml
SiQube Jan 1, 2025
808dd1a
add special case of eyetracker within config
SiQube Jan 1, 2025
91aca73
move interead.py to interead.yaml
SiQube Jan 1, 2025
188f1de
move potec.py to potec.yaml
SiQube Jan 1, 2025
1465ca4
move toy_dataset*.py to toy_dataset*.yaml
SiQube Jan 1, 2025
9965874
adjust tests
SiQube Jan 1, 2025
d75d153
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 1, 2025
a9deb54
adjust raise
SiQube Jan 3, 2025
2ce121f
update licenses (#915)
SiQube Jan 7, 2025
afcb9bb
feat: split `TextStimulus` by column values (#879)
izaskr Jan 8, 2025
79a3f0c
build: update nbsphinx requirement from <0.9.6,>=0.8.8 to >=0.8.8,<0.…
dependabot[bot] Jan 8, 2025
637901a
ci: pre-commit autoupdate (#910)
pre-commit-ci[bot] Jan 21, 2025
d4298c6
fix: support missing dummy row in asc files (#918)
dkrako Jan 22, 2025
b035025
fix!: correctly load trial_columns to GazeDataFrame (#928)
dkrako Jan 22, 2025
6e45eae
build: update pyopenssl requirement (#923)
dependabot[bot] Jan 22, 2025
bf79527
build: update pyarrow requirement from <19,>=11.0.0 to >=11.0.0,<20 (…
dependabot[bot] Jan 22, 2025
dc4b0b7
ci: update github actions (#925)
dkrako Jan 23, 2025
ebade28
build: upgrade polars>=1.21 (#935)
SiQube Feb 5, 2025
745aa95
fix typo in citation file (#933)
theDebbister Feb 5, 2025
e59c91c
Improve regex parsing for DISPLAY_COORDS in asc meta data (#938)
AnnaBhlr Feb 8, 2025
ba8a90a
docs: add discord to readme (#937)
SiQube Feb 11, 2025
feb891e
ci: pre-commit autoupdate (#931)
pre-commit-ci[bot] Feb 11, 2025
59da584
fix: fix data loss ratio errors when missing dummy row (#934)
saphjra Feb 11, 2025
8ebace7
ci: use minor versions as upper bound in dependencies (#943)
dkrako Feb 12, 2025
ac9c3b7
construct yaml loader and new dataset registry and library
SiQube Jan 1, 2025
aee76d4
Merge branch 'main' into yaml-public-datasets-loader
SiQube Feb 12, 2025
0c83fca
Merge branch 'main' into yaml-public-datasets-loader
SiQube Feb 12, 2025
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -71,7 +71,7 @@ repos:
     rev: v1.13.0
     hooks:
       - id: mypy
-        additional_dependencies: [pandas-stubs, types-tqdm]
+        additional_dependencies: [pandas-stubs, types-tqdm, types-PyYAML]
   - repo: https://github.com/pre-commit/pre-commit-hooks
     rev: v5.0.0
     hooks:
2 changes: 0 additions & 2 deletions src/pymovements/__init__.py
@@ -33,7 +33,6 @@
 from pymovements.dataset import DatasetDefinition
 from pymovements.dataset import DatasetLibrary
 from pymovements.dataset import DatasetPaths
-from pymovements.dataset import register_dataset
 from pymovements.events import EventDataFrame
 from pymovements.events import EventGazeProcessor
 from pymovements.events import EventProcessor
@@ -52,7 +51,6 @@
     'DatasetLibrary',
     'DatasetPaths',
     'datasets',
-    'register_dataset',

     'events',
     'EventDataFrame',
2 changes: 0 additions & 2 deletions src/pymovements/dataset/__init__.py
@@ -33,7 +33,6 @@
 from pymovements.dataset.dataset import Dataset
 from pymovements.dataset.dataset_definition import DatasetDefinition
 from pymovements.dataset.dataset_library import DatasetLibrary
-from pymovements.dataset.dataset_library import register_dataset
 from pymovements.dataset.dataset_paths import DatasetPaths


@@ -42,5 +41,4 @@
     'DatasetDefinition',
     'DatasetLibrary',
     'DatasetPaths',
-    'register_dataset',
 ]
34 changes: 21 additions & 13 deletions src/pymovements/dataset/dataset.py
@@ -34,6 +34,7 @@
 from pymovements.dataset.dataset_definition import DatasetDefinition
 from pymovements.dataset.dataset_library import DatasetLibrary
 from pymovements.dataset.dataset_paths import DatasetPaths
+from pymovements.dataset.yaml_dataset_loader import YAMLDatasetLoader
 from pymovements.events.frame import EventDataFrame
 from pymovements.events.precomputed import PrecomputedEventDataFrame
 from pymovements.events.processing import EventGazeProcessor
@@ -48,35 +49,42 @@

     Parameters
     ----------
-    definition: str | DatasetDefinition | type[DatasetDefinition]
+    definition: str | DatasetDefinition | Path
         Dataset definition to initialize dataset with.
-    path : str | Path | DatasetPaths
-        Path to the dataset directory. You can set up a custom directory structure by passing a
-        :py:class:`~pymovements.DatasetPaths` instance.
+    path: str | Path | DatasetPaths
+        Path to the dataset directory. You can set up a custom directory structure
+        by passing a :py:class:`~pymovements.DatasetPaths` instance.
     """

     def __init__(
-            self,
-            definition: str | DatasetDefinition | type[DatasetDefinition],
-            path: str | Path | DatasetPaths,
+        self,
+        definition: str | DatasetDefinition | Path,
+        path: str | Path | DatasetPaths,
     ):
         self.fileinfo: pl.DataFrame = pl.DataFrame()
         self.gaze: list[GazeDataFrame] = []
         self.events: list[EventDataFrame] = []
         self.precomputed_events: list[PrecomputedEventDataFrame] = []
         self.precomputed_reading_measures: list[ReadingMeasures] = []

-        if isinstance(definition, str):
-            definition = DatasetLibrary.get(definition)()
-        if isinstance(definition, type):
-            definition = definition()
-        self.definition = deepcopy(definition)
+        # Handle different definition input types
+        if isinstance(definition, (str, Path)):
+            # Check if it's a path to a YAML file
+            if isinstance(definition, Path) or str(definition).endswith('.yaml'):
+                self.definition = YAMLDatasetLoader.load_dataset_definition(definition)
+            else:
+                # Try to load from registered datasets
+                self.definition = DatasetLibrary.get(definition)
+        else:
+            self.definition = deepcopy(definition)

+        # Handle path setup
         if isinstance(path, (str, Path)):
             self.paths = DatasetPaths(root=path, dataset='.')
         else:
             self.paths = deepcopy(path)
-        # Fill dataset directory name with dataset definition name if specified.
+
+        # Fill dataset directory name with dataset definition name if specified
         self.paths.fill_name(self.definition.name)

     def load(

Codecov / codecov/patch: added lines 74, 77 and 79 in src/pymovements/dataset/dataset.py were not covered by tests.
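Review note: the new branch in `Dataset.__init__` decides between three sources for the definition. The sketch below only mirrors the `isinstance` and `endswith('.yaml')` checks shown in this diff so the branch order is easier to reason about. `classify_definition` is a hypothetical helper written for illustration and is not part of pymovements.

```python
from pathlib import Path


def classify_definition(definition: object) -> str:
    """Mirror the branch order in the diff (hypothetical helper, illustration only)."""
    if isinstance(definition, (str, Path)):
        # Any Path, or any string ending in '.yaml', is treated as a YAML file.
        if isinstance(definition, Path) or str(definition).endswith('.yaml'):
            return 'yaml_file'
        # Every other string is looked up by name in the library.
        return 'library_name'
    # Anything else is assumed to be a definition object and is deep-copied.
    return 'definition_object'


print(classify_definition('ToyDataset'))        # library_name
print(classify_definition('my_dataset.yaml'))   # yaml_file
print(classify_definition(Path('my_dataset')))  # yaml_file
print(classify_definition('my_dataset.yml'))    # library_name
```

Under these checks a string ending in `.yml` falls through to the library lookup, while a `Path` is opened as a YAML file regardless of its suffix. Whether that is intended is a question for the author.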
83 changes: 51 additions & 32 deletions src/pymovements/dataset/dataset_library.py
@@ -20,65 +20,84 @@
 """DatasetLibrary module."""
 from __future__ import annotations

-from typing import TypeVar
+from pathlib import Path

 from pymovements.dataset.dataset_definition import DatasetDefinition
+from pymovements.dataset.yaml_dataset_loader import YAMLDatasetLoader


 class DatasetLibrary:
-    """Provides access by name to :py:class:`~pymovements.DatasetDefinition`.
+    """Provides access by name to dataset definitions.

     Attributes
     ----------
-    definitions: dict[str, type[DatasetDefinition]]
-        Dictionary of :py:class:`~pymovements.DatasetDefinition`.
+    definitions: dict[str, DatasetDefinition]
+        Dictionary of dataset definitions, either as classes or instances
     """

-    definitions: dict[str, type[DatasetDefinition]] = {}
+    definitions: dict[str, DatasetDefinition] = {}

     @classmethod
-    def add(cls, definition: type[DatasetDefinition]) -> None:
-        """Add :py:class:`~pymovements.DatasetDefinition` to library.
+    def add(cls, definition: DatasetDefinition | Path | str) -> None:
+        """Add a dataset definition to library.

         Parameters
         ----------
-        definition: type[DatasetDefinition]
-            The :py:class:`~pymovements.DatasetDefinition` to add to the library.
+        definition: DatasetDefinition | Path | str
+            The dataset definition to add. Can be:
+            - A DatasetDefinition class (legacy)
+            - A DatasetDefinition instance (from YAML)
+            - A Path to a YAML file
+            - A string path to a YAML file
         """
-        cls.definitions[definition.name] = definition
+        if isinstance(definition, (str, Path)):
+            # Load from YAML file
+            yaml_def = YAMLDatasetLoader.load_dataset_definition(definition)
+            cls.definitions[yaml_def.name] = yaml_def
+        else:
+            # DatasetDefinition instance (from YAML)
+            cls.definitions[definition.name] = definition

     @classmethod
-    def get(cls, name: str) -> type[DatasetDefinition]:
-        """Get :py:class:`~pymovements.DatasetDefinition` py name.
+    def get(cls, name: str) -> DatasetDefinition:
+        """Get dataset definition by name.

         Parameters
         ----------
         name: str
-            Name of the :py:class:`~pymovements.DatasetDefinition` in the library.
+            Name of the dataset definition in the library.

         Returns
         -------
-        type[DatasetDefinition]
-            The :py:class:`~pymovements.DatasetDefinition` in the library.
+        DatasetDefinition
+            The dataset definition. Could be either a class (legacy) or instance (YAML).
+
+        Raises
+        ------
+        KeyError
+            If dataset name not found in library.
         """
+        if name not in cls.definitions:
+            raise KeyError(
+                f"Dataset '{name}' not found in library. "
+                f"Available datasets: {list(cls.definitions.keys())}",
+            )
         return cls.definitions[name]

+    @classmethod
+    def register_yaml_directory(cls, directory: str | Path) -> None:
+        """Register all YAML dataset definitions in a directory.

-DatsetDefinitionClass = TypeVar('DatsetDefinitionClass', bound=type[DatasetDefinition])
-
-
-def register_dataset(cls: DatsetDefinitionClass) -> DatsetDefinitionClass:
-    """Register a public dataset definition.
-
-    Parameters
-    ----------
-    cls: DatsetDefinitionClass
-        The :py:class:`~pymovements.DatasetDefinition` to register.
+        Parameters
+        ----------
+        directory: str | Path
+            Directory containing YAML dataset definitions
+        """
+        directory = Path(directory)
+        for yaml_file in directory.glob('*.yaml'):
+            cls.add(yaml_file)

-    Returns
-    -------
-    DatsetDefinitionClass
-        The :py:class:`~pymovements.DatasetDefinition` to register.
-    """
-    DatasetLibrary.add(cls)
-    return cls
+    @classmethod
+    def clear(cls) -> None:
+        """Clear all registered datasets."""
+        cls.definitions.clear()

Codecov / codecov/patch: added lines 59, 81 and 103 in src/pymovements/dataset/dataset_library.py were not covered by tests.
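Review note: with this change the library stores definition instances rather than classes, so `get` returns the stored object itself. The standalone sketch below shows that behaviour with stand-in types. `Definition` and `Library` are hypothetical names used only for illustration and carry none of the real `DatasetDefinition` fields or YAML loading.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Definition:
    """Stand-in for a dataset definition; only the name matters here."""

    name: str


class Library:
    """Hypothetical registry keyed by definition name (class-level, shared state)."""

    definitions: dict[str, Definition] = {}

    @classmethod
    def add(cls, definition: Definition) -> None:
        # A later add() with the same name silently replaces the earlier entry.
        cls.definitions[definition.name] = definition

    @classmethod
    def get(cls, name: str) -> Definition:
        if name not in cls.definitions:
            raise KeyError(
                f"Dataset '{name}' not found in library. "
                f"Available datasets: {list(cls.definitions)}",
            )
        return cls.definitions[name]

    @classmethod
    def clear(cls) -> None:
        cls.definitions.clear()


Library.add(Definition(name='ToyDataset'))
first = Library.get('ToyDataset')
second = Library.get('ToyDataset')
print(first is second)  # True: get() hands out the stored instance, not a copy

try:
    Library.get('Missing')
except KeyError as error:
    print(error.args[0])

Library.clear()
print(Library.definitions)  # {}
```

In the `dataset.py` diff, the name-lookup branch assigns `DatasetLibrary.get(definition)` directly, and only the definition-object branch calls `deepcopy`. If the same holds in the real classes, changes made to `dataset.definition` after loading by name would also change the registry entry. This is worth confirming with a test.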
151 changes: 151 additions & 0 deletions src/pymovements/dataset/yaml_dataset_loader.py
@@ -0,0 +1,151 @@
+# Copyright (c) 2025 The pymovements Project Authors
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+"""YAMLDatasetLoader class."""
+from __future__ import annotations
+
+from dataclasses import asdict
+from pathlib import Path
+
+import yaml
+
+from pymovements.dataset.dataset_definition import DatasetDefinition
+from pymovements.gaze.experiment import Experiment
+from pymovements.gaze.eyetracker import EyeTracker
+
+
+# generalized constructor for !* tags
+def type_constructor(
+    loader: yaml.Loader | yaml.FullLoader | yaml.UnsafeLoader,
+    prefix: str,
+    node: yaml.Node,
+) -> type:
+    """Resolve a YAML tag to a corresponding Python type.
+
+    This function is used to handle custom YAML tags (e.g., `!pl.Int64`)
+    by mapping the tag to a Python type or class name. The type name is
+    extracted from the YAML tag and evaluated to return the corresponding
+    Python object. If the type cannot be resolved, an error is raised.
+
+    Parameters
+    ----------
+    loader: yaml.Loader | yaml.FullLoader | yaml.UnsafeLoader
+        The YAML loader being used to parse the YAML document.
+    prefix: str
+        A string prefix for the custom tag (e.g., '!').
+    node: yaml.Node
+        The YAML node containing the tag and associated value.
+
+    Returns
+    -------
+    type
+        The Python type or class corresponding to the YAML tag.
+
+    Raises
+    ------
+    ValueError: If the specified type name in the tag cannot be resolved
+        to a valid Python object.
+
+    Example:
+        # Example YAML document:
+        # !pl.Int64
+        #
+        # Resolves to the Python type `pl.Int64` (assuming `pl` is a valid module).
+
+    """
+    # pylint: disable=unused-argument
+    # extract the type name (e.g., from !pl.Int64 to pl.Int64)
+    type_name = node.tag[1:]
+
+    built_in_types = {
+        'int': int,
+        'float': float,
+        'str': str,
+        'bool': bool,
+        'list': list,
+        'dict': dict,
+        'set': set,
+        'tuple': tuple,
+    }
+
+    # check for built-in types first
+    if type_name in built_in_types:
+        return built_in_types[type_name]
+
+    try:
+        module_name, type_attr = type_name.rsplit('.', 1)
+        module = __import__(module_name)
+        return getattr(module, type_attr)
+
+    except AttributeError as exc:
+        raise ValueError(f"Unknown type: {type_name}") from exc
+
+
+yaml.add_multi_constructor('!', type_constructor)
+
+
+class YAMLDatasetLoader:
+    """Loads dataset definitions from YAML files."""
+
+    @staticmethod
+    def load_dataset_definition(yaml_path: str | Path) -> DatasetDefinition:
+        """Load a dataset definition from a YAML file.
+
+        Parameters
+        ----------
+        yaml_path : str | Path
+            Path to the YAML definition file
+
+        Returns
+        -------
+        DatasetDefinition
+            Initialized dataset definition
+        """
+        with open(yaml_path, encoding='utf-8') as f:
+            data = yaml.load(f, Loader=yaml.Loader)
+
+        # Convert experiment dict to Experiment object if present
+        if 'experiment' in data:
+            if 'eyetracker' in data['experiment']:
+                eyetracker = EyeTracker(**data['experiment'].pop('eyetracker'))
+            else:
+                eyetracker = None
+            data['experiment'] = Experiment(**data['experiment'], eyetracker=eyetracker)
+
+        # Initialize DatasetDefinition with YAML data
+        return DatasetDefinition(**data)
+
+    @staticmethod
+    def save_dataset_definition(definition: DatasetDefinition, yaml_path: str | Path) -> None:
+        """Save a dataset definition to a YAML file.
+
+        Parameters
+        ----------
+        definition : DatasetDefinition
+            Dataset definition to save
+        yaml_path : str | Path
+            Path where to save the YAML file
+        """
+        # Convert to dict and handle experiment object
+        data = asdict(definition)
+        if data['experiment']:
+            data['experiment'] = asdict(data['experiment'])
+
+        with open(yaml_path, 'w', encoding='utf-8') as f:
+            yaml.dump(data, f, sort_keys=False)

Codecov / codecov/patch: added lines 96-97, 146, 148 and 150-151 in src/pymovements/dataset/yaml_dataset_loader.py were not covered by tests.
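Review note: `type_constructor` resolves a tag name in two steps, a built-in lookup followed by `__import__` plus `getattr`, and it converts only `AttributeError` into `ValueError`. Two details may matter here. `__import__('a.b')` returns the top-level package `a`, so a tag naming a type in a nested module is looked up on the wrong object. A tag without a dot, or one naming a module that cannot be imported, raises from the unpacking or from the import rather than from the handled branch. The sketch below is one possible way to cover those cases with `importlib.import_module`. `resolve_type` and `BUILT_IN_TYPES` are hypothetical names, and this is not the code in the PR.

```python
import importlib

BUILT_IN_TYPES = {
    'int': int,
    'float': float,
    'str': str,
    'bool': bool,
    'list': list,
    'dict': dict,
    'set': set,
    'tuple': tuple,
}


def resolve_type(type_name: str) -> type:
    """Resolve a name such as 'int' or 'package.module.Type' to a Python object."""
    if type_name in BUILT_IN_TYPES:
        return BUILT_IN_TYPES[type_name]

    # Split 'package.module.Type' into ('package.module', '.', 'Type').
    module_name, separator, attribute = type_name.rpartition('.')
    if not separator:
        # No dot and not a known built-in: nothing to import from.
        raise ValueError(f'Unknown type: {type_name}')

    try:
        # import_module returns the innermost module, unlike __import__.
        module = importlib.import_module(module_name)
        return getattr(module, attribute)
    except (ImportError, AttributeError) as exc:
        raise ValueError(f'Unknown type: {type_name}') from exc


print(resolve_type('int'))                      # <class 'int'>
print(resolve_type('collections.abc.Mapping'))  # <class 'collections.abc.Mapping'>

for bad_name in ('no_dot', 'collections.NoSuchType', 'no_such_module_xyz.Thing'):
    try:
        resolve_type(bad_name)
    except ValueError as error:
        print(error)
```

All three failure cases end up in the same `ValueError`, which would let the currently uncovered lines 96-97 be exercised by a single parametrized test.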