A new way to define parametrizations #132

tobiasraabe (Maintainer) · 2021-07-25T22:28:49Z

Introduction

Parametrizations are a powerful tool to avoid code duplication and scale tasks up.

Problem

Currently, parametrizations have the following flaws.

A task can only be parametrized over the arguments passed to its function. The task function itself cannot be part of a parametrization.
(optional) Parametrizations are hard to define for some users. The recommendation to write a separate function which creates the inputs of the parametrization has improved the situation, but maybe the interface can be simplified further.
If ids are used for parametrized tasks (which is recommended), they are defined separately from the task itself. (Following the best practices guide partly alleviates the problem, because the ids are created alongside the inputs.)
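
To make the last two points concrete, here is a minimal sketch of the "separate function" pattern mentioned above. The function name, the seeds and the file names are assumptions chosen for illustration only.

```python
from pathlib import Path


def create_parametrization(seeds):
    """Build ids and argument tuples for a parametrized task (hypothetical helper)."""
    ids = [f"seed_{seed}" for seed in seeds]
    arg_values = [(Path(f"data_{seed}.pkl"), seed) for seed in seeds]
    return ids, arg_values


ids, arg_values = create_parametrization(seeds=[0, 1, 2])

# The two lists would then be handed to the parametrize decorator, roughly
# @pytask.mark.parametrize("produces, seed", arg_values, ids=ids).
# The ids live in a separate list and are only matched to the inputs by position.
```

The helper keeps ids and inputs in sync, but both are still defined away from the task function, and the function itself cannot vary across the parametrization.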

Answers

Candidate solutions are posted below as answers and can then be discussed in the thread under each answer. An answer does not need to address all of the problems. It can also address only some of them.

You can also start a thread to discuss your experience with parametrizations: what you find difficult, what works well, and which use cases are not easily supported. Your post does not have to include a resolution. Descriptions and questions are great!

References

pytest-cases
Conversation with 0az on "meta tasks".

Answered by tobiasraabe

Apr 19, 2022

There is now a loop-based approach to parametrizations which basically solves all issues: https://pytask-dev.readthedocs.io/en/stable/tutorials/repeating_tasks_with_different_inputs.html


tobiasraabe (Maintainer, Author) · 2021-07-26T21:39:20Z

This idea evolved in a discussion with 0az. The approach makes it possible to parametrize tasks over functions while providing an "intuitive" interface.

The main idea is to define a dictionary at the module level whose name starts with task_. The keys of the dictionary are the names of the tasks. Each value is a dictionary in which keys like "function", "depends_on" and "produces" point to the known arguments of a task.

For example, assume you have multiple data sets and there are different functions to plot the data. Here are the functions.

def plot_histogram_of_all_variables(depends_on, produces, plot_kwargs, output_format):
    ...

def plot_kde_of_all_variables(depends_on, produces, plot_kwargs, output_format):
    ...

Next, we define the parametrization.

task_dictionary = {
    f"task_{data_name}_{plot_name}": {
        "function": function,
        "depends_on": path_to_data(data_name),
        "produces": path_to_figure(data_name, plot_name),
    }
    for data_name in DATA
    for plot_name, function in [
        ("hist", plot_histogram_of_all_variables), ("kde", plot_kde_of_all_variables)
    ]
}
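
The snippet above relies on DATA, path_to_data and path_to_figure without defining them. A minimal sketch of what they could look like follows. The data set names, the bld directory and the file extensions are assumptions for illustration, and the plot functions are empty stand-ins.

```python
from pathlib import Path

# Hypothetical data set names and build directory, not part of the proposal.
DATA = ["survey", "census"]
BLD = Path("bld")


def path_to_data(data_name):
    """Return the location of a data set (assumed layout)."""
    return BLD / "data" / f"{data_name}.csv"


def path_to_figure(data_name, plot_name):
    """Return the location of a figure (assumed layout)."""
    return BLD / "figures" / f"{data_name}_{plot_name}.png"


def plot_histogram_of_all_variables(depends_on, produces, plot_kwargs, output_format):
    ...


def plot_kde_of_all_variables(depends_on, produces, plot_kwargs, output_format):
    ...


# Same comprehension as in the proposal: one entry per data set and plot type.
task_dictionary = {
    f"task_{data_name}_{plot_name}": {
        "function": function,
        "depends_on": path_to_data(data_name),
        "produces": path_to_figure(data_name, plot_name),
    }
    for data_name in DATA
    for plot_name, function in [
        ("hist", plot_histogram_of_all_variables),
        ("kde", plot_kde_of_all_variables),
    ]
}
```

With two data sets and two plot functions, the dictionary contains four tasks, and each key doubles as the id of its task.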

Additional keys in a task dictionary are treated as keyword arguments of the specific task function. For example, we set the number of bins to a certain value for the histogram plots. Global kwargs, which are set at the top level of the dictionary, are assumed to apply to all tasks. For example, we only want to generate PNGs.

task_dictionary = {
    f"task_{data_name}_{plot_name}": {
        "function": function,
        "depends_on": path_to_data(data_name),
        "produces": path_to_figure(data_name, plot_name),
        "plot_kwargs": {"bins": 20} if plot_name == "hist" else {}
    }
    for data_name in DATA
    for plot_name, function in [
        ("hist", plot_histogram_of_all_variables), ("kde", plot_kde_of_all_variables)
    ]
}
task_dictionary["output_format"] = "png"
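
How the collection step would tell tasks and global kwargs apart is not settled. One possibility, sketched below, is to treat every top-level key that starts with task_ as a task and every other key as a global kwarg. The function resolve_tasks, the RESERVED set and the precedence rule (task-specific values override global ones) are assumptions, not part of the proposal.

```python
# Keys with a fixed meaning inside a single task entry.
RESERVED = {"function", "depends_on", "produces"}


def resolve_tasks(task_dictionary):
    """Merge global kwargs into every task entry (hypothetical helper)."""
    # Assumption: top-level keys not starting with "task_" are global kwargs.
    global_kwargs = {
        key: value
        for key, value in task_dictionary.items()
        if not key.startswith("task_")
    }
    resolved = {}
    for name, spec in task_dictionary.items():
        if not name.startswith("task_"):
            continue
        local_kwargs = {k: v for k, v in spec.items() if k not in RESERVED}
        resolved[name] = {
            "function": spec["function"],
            "depends_on": spec.get("depends_on"),
            "produces": spec.get("produces"),
            # Task-specific kwargs take precedence over global kwargs.
            "kwargs": {**global_kwargs, **local_kwargs},
        }
    return resolved


def plot_stub(depends_on, produces, plot_kwargs, output_format):
    ...


example = {
    "task_survey_hist": {
        "function": plot_stub,
        "depends_on": "survey.csv",
        "produces": "survey_hist.png",
        "plot_kwargs": {"bins": 20},
    },
    "output_format": "png",
}
resolved = resolve_tasks(example)
```

A downside of mixing both kinds of keys in one dictionary is that a global kwarg can never have a name starting with task_.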

Markers

Markers can be added as usual by applying the decorators to the task functions. Another way is to use the special marks key inside the dictionary. It can be populated with mark decorators, which are applied either locally to a single task function or globally to all task functions in the dictionary.

task_dictionary = {
    f"task_{data_name}_{plot_name}": {
        "function": function,
        ...,
        # KDEs cannot be computed on Windows. Who would have thought that!
        "marks": (
            pytask.mark.skipif(ON_WINDOWS, reason="KDEs cannot be computed on Windows.")
            if plot_name == "kde"
            else []
        ),
    }
    for data_name in DATA
    for plot_name, function in [
        ("hist", plot_histogram_of_all_variables), ("kde", plot_kde_of_all_variables)
    ]
}

# All tasks should persist.
task_dictionary["marks"] = pytask.mark.persist

Strictly speaking, marks is not the right name for the key: the object produced by pytask.mark.skip is a marker, and it only generates a mark once it is applied to a function. I am not sure whether this distinction matters, and I am also not sure which name is more intuitive for users.
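
The combination of local and global entries described in this section could be resolved roughly as follows. This is only a sketch: as_list and collect_marks are hypothetical helpers, plain strings stand in for marker objects such as pytask.mark.persist, and applying global markers before local ones is an assumption.

```python
def as_list(marks):
    """Normalize None, a single marker, or a sequence of markers to a list."""
    if marks is None:
        return []
    if isinstance(marks, (list, tuple)):
        return list(marks)
    return [marks]


def collect_marks(task_dictionary):
    """Return the markers of each task: global markers first, then local ones."""
    global_marks = as_list(task_dictionary.get("marks"))
    return {
        name: global_marks + as_list(spec.get("marks"))
        for name, spec in task_dictionary.items()
        if name.startswith("task_")
    }


# Strings stand in for marker objects.
example = {
    "task_survey_hist": {"marks": []},
    "task_survey_kde": {"marks": "skipif_on_windows"},
    "marks": "persist",
}
marks = collect_marks(example)
```

Normalizing to lists means users can pass a single marker, several markers, or an empty list, as in the conditional expression above.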

1 reply

janosg Aug 9, 2021

I like the interface and think it's much easier to learn for new users. Last semester many students struggled with parametrization!

It would still be important to provide best practices. In particular, I think that in many cases the task dictionary should be generated by a function, just like the current parametrization inputs.

Pairing ids and tasks in a dictionary and thus making ids mandatory is a nice side effect.

tobiasraabe (Maintainer, Author) · 2022-04-19T12:02:57Z

There is now a loop-based approach to parametrizations which basically solves all issues: https://pytask-dev.readthedocs.io/en/stable/tutorials/repeating_tasks_with_different_inputs.html

0 replies
