Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define inputs and outputs of steps/tasks as attributes instead of properties to simplify definition #1027

Open
wants to merge 26 commits into
base: develop
Choose a base branch
from

Conversation

plaguss
Copy link
Contributor

@plaguss plaguss commented Oct 9, 2024

Description

This PR does a refactor of inputs/outputs properties of the steps to be defined in terms of attributes.

Defining a custom Step (or Task) now needs to create the inputs and outputs as attributes instead of properties.
Some classes need running some code, so making a general classmethod is not feasible (maybe I'm wrong?), for example the steps in the columns.py format need information from other attributes, or the loader classes need to fetch info from the datasets which only happen after calling .load().

Instead of simplifying the access to these variables, it just simplifies the definition by writing less lines.

  • Common case:
from typing import TYPE_CHECKING
from distilabel.steps import Step, StepInput
from distilabel.steps.typing import StepColumns

if TYPE_CHECKING:
    from distilabel.steps.typing import StepOutput

class MyStep(Step):
    inputs: StepColumns = ["input_field"]
    outputs: StepColumns = ["output_field"]

    def process(self, inputs: StepInput) -> "StepOutput":
        for input in inputs:
            input["output_field"] = input["input_field"]
        yield inputs
  • Attributes with dependencies (do it by defining the model_post_init):
from typing import TYPE_CHECKING
from distilabel.steps import Step, StepInput
from distilabel.steps.typing import StepColumns

if TYPE_CHECKING:
    from distilabel.steps.typing import StepOutput

class RenameColumns(Step):
    rename_mappings: Dict[str, str] = None

    def model_post_init(self, __context: Any) -> None:
        super().model_post_init(__context)
        self.outputs = list(self.rename_mappings.values())  # type: ignore

    def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
        outputs = []
        for input in inputs:
            outputs.append(
                {self.rename_mappings.get(k, k): v for k, v in input.items()}  # type: ignore
            )
        yield outputs

Closes #836

@plaguss plaguss added this to the 1.5.0 milestone Oct 9, 2024
@plaguss plaguss self-assigned this Oct 9, 2024
Copy link

github-actions bot commented Oct 9, 2024

Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-1027/

@plaguss plaguss marked this pull request as ready for review October 9, 2024 14:05
Copy link

codspeed-hq bot commented Oct 9, 2024

CodSpeed Performance Report

Merging #1027 will not alter performance

Comparing input-output-classmethod (4c709ae) with develop (dc06161)

Summary

✅ 1 untouched benchmarks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant