Merge branch 'feat/dataset-instruction-response-pipeline' of https://…

…github.com/argilla-io/distilabel into feat/dataset-instruction-response-pipeline
argilla-io · Jan 28, 2025 · d8e71ea
2 parents f1fb538 + a0c23f6
commit d8e71ea
Showing 225 changed files with 9,341 additions and 1,370 deletions.
diff --git a/.github/ISSUE_TEMPLATE/1-add_documentation_report.yml b/.github/ISSUE_TEMPLATE/1-add_documentation_report.yml
@@ -0,0 +1,26 @@
+name: "\U0001F4DA Add a documentation report"
+description: "Have you spotted a typo or mistake in our docs?"
+title: "[DOCS]"
+labels: ["documentation"]
+assignees: []
+
+body:
+  - type: markdown
+    attributes:
+      value: "Thank you for reporting a documentation mistake! Before you get started, please [search to see](https://github.com/argilla-io/distilabel/issues) if an issue already exists for the bug you encountered."
+
+  - type: textarea
+    id: doc_report
+    attributes:
+      label: "Which page or section is this issue related to?"
+      description: "Please include the URL and/or source."
+    validations:
+      required: false
+
+  - type: textarea
+    id: doc_review
+    attributes:
+      label: "What are you documenting, or what change are you making in the documentation?"
+      description: "If documentation needs to be created, please specify its coverage.\n If there's a typo or something needs revisiting, please indicate it and show code/text/screenshots."
+    validations:
+      required: false
diff --git a/.github/ISSUE_TEMPLATE/2-bug_python.yml b/.github/ISSUE_TEMPLATE/2-bug_python.yml
@@ -0,0 +1,70 @@
+name: "\U0001FAB2 Bug report"
+description: "Report bugs and unexpected behavior."
+title: "[BUG]"
+labels: ["bug", "ml-internal"]
+assignees: []
+
+body:
+  - type: markdown
+    attributes:
+      value: "Thank you for reporting a bug! Before you get started, please [search to see](https://github.com/argilla-io/distilabel/issues) if an issue already exists for the bug you encountered."
+
+  - type: textarea
+    id: bug_description
+    attributes:
+      label: "Describe the bug"
+      description: "A clear and concise description of the bug."
+    validations:
+      required: true
+
+  - type: textarea
+    id: stacktrace
+    attributes:
+      label: "To reproduce"
+      description: "The code to reproduce the behavior."
+      placeholder: |
+        ```python
+        my_python_code
+        ```
+    validations:
+      required: false
+
+  - type: textarea
+    id: expected_behavior
+    attributes:
+      label: "Expected behavior"
+      description: "A clear and concise description of what you expected to happen."
+    validations:
+      required: false
+
+  - type: textarea
+    id: screenshots
+    attributes:
+      label: "Screenshots"
+      description: "If applicable, add screenshots to help explain your problem."
+    validations:
+      required: false
+
+  - type: textarea
+    id: environment
+    attributes:
+      label: "Environment"
+      description: "Since version 1.16.0 you can use `python -m argilla info` command to easily get the used versions."
+      value: |
+        - Distilabel Version [e.g. 1.0.0]:
+        - Python Version [e.g. 3.11]:
+    validations:
+      required: false
+
+  - type: textarea
+    id: additional_context
+    attributes:
+      label: "Additional context"
+      description: "Add any other relevant information."
+    validations:
+      required: false
+
+  - type: markdown
+    attributes:
+      value: |
+        📌  Make sure you have provided all the required information in each section so we can support you properly.
diff --git a/.github/ISSUE_TEMPLATE/3-feature_request.yml b/.github/ISSUE_TEMPLATE/3-feature_request.yml
@@ -0,0 +1,44 @@
+name: "\U0001F195 Feature request"
+description: "Share cool new ideas for the project."
+title: "[FEATURE]"
+labels: ["enhancement", "ml-internal"]
+assignees: []
+
+
+body:
+  - type: markdown
+    attributes:
+      value: "Thank you for sharing your feature request! Please fill out the sections below."
+
+  - type: textarea
+    id: feature_request
+    attributes:
+      label: "Is your feature request related to a problem? Please describe."
+      description: "A clear and concise description of what the problem is."
+      placeholder: "I'm always frustrated when..."
+    validations:
+      required: false
+
+  - type: textarea
+    id: feature_description
+    attributes:
+      label: "Describe the solution you'd like"
+      description: "A clear and concise description of what you want to happen."
+    validations:
+      required: false
+
+  - type: textarea
+    id: feature_alternatives
+    attributes:
+      label: "Describe alternatives you've considered"
+      description: "A clear and concise description of any alternative solutions or features you've considered."
+    validations:
+      required: false
+
+  - type: textarea
+    id: additional_context
+    attributes:
+      label: "Additional context"
+      description: "Add any other context or screenshots about the feature request here."
+    validations:
+      required: false
diff --git a/...ub/ISSUE_TEMPLATE/blank-issue-template.md → .../ISSUE_TEMPLATE/4-blank-issue-template.md b/...ub/ISSUE_TEMPLATE/blank-issue-template.md → .../ISSUE_TEMPLATE/4-blank-issue-template.md
diff --git a/.github/ISSUE_TEMPLATE/config.yml b/.github/ISSUE_TEMPLATE/config.yml
@@ -0,0 +1,5 @@
+blank_issues_enabled: false
+contact_links:
+  - name: 🗯 Community Discussions
+    url: http://hf.co/join/discord
+    about: Our Discord Community loves to discuss distilabel and NLP topics
diff --git a/.github/ISSUE_TEMPLATE/🆕-feature-request.md b/.github/ISSUE_TEMPLATE/🆕-feature-request.md
diff --git a/.github/ISSUE_TEMPLATE/🐛-bug-report.md b/.github/ISSUE_TEMPLATE/🐛-bug-report.md
diff --git a/.github/ISSUE_TEMPLATE/📚-documentation-update.md b/.github/ISSUE_TEMPLATE/📚-documentation-update.md
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
@@ -50,6 +50,12 @@ jobs:
         if: steps.cache.outputs.cache-hit != 'true'
         run: ./scripts/install_dependencies.sh
 
+      - name: Setup tmate session
+        uses: mxschmitt/action-tmate@v3
+        if: ${{ matrix.python-version == '3.12' && github.event_name == 'workflow_dispatch' && inputs.tmate_session }}
+        with:
+          limit-access-to-actor: true
+
       - name: Lint
         run: make lint
 

diff --git a/.gitignore b/.gitignore
@@ -77,4 +77,4 @@ venv.bak/
 # Other
 *.log
 *.swp
-.DS_Store
+.DS_Store
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -11,7 +11,7 @@ repos:
           - --fuzzy-match-generates-todo
 
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.7.2
+    rev: v0.8.1
     hooks:
       - id: ruff
         args: [--fix]

diff --git a/README.md b/README.md
@@ -94,6 +94,7 @@ In addition, the following extras are available:
 - `vertexai`: for using [Google Vertex AI](https://cloud.google.com/vertex-ai) proprietary models via the `VertexAILLM` integration.
 - `vllm`: for using [vllm](https://github.com/vllm-project/vllm) serving engine via the `vLLM` integration.
 - `sentence-transformers`: for generating sentence embeddings using [sentence-transformers](https://github.com/UKPLab/sentence-transformers).
+- `mlx`: for using [MLX](https://github.com/ml-explore/mlx) models via the `MlxLLM` integration.
 
 ### Structured generation
 

diff --git a/docs/api/models/image_generation/image_generation_gallery.md b/docs/api/models/image_generation/image_generation_gallery.md
@@ -0,0 +1,10 @@
+# ImageGenerationModel Gallery
+
+This section contains the existing [`ImageGenerationModel`][distilabel.models.image_generation] subclasses implemented in `distilabel`.
+
+::: distilabel.models.image_generation
+    options:
+        filters:
+        - "!^ImageGenerationModel$"
+        - "!^AsyncImageGenerationModel$"
+        - "!typing"
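
The `filters` option in the hunk above lists regular expressions, where a leading `!` marks a pattern as an exclusion. The following is a minimal sketch of that idea, not the documentation tool's actual implementation: it assumes patterns are applied with `re.search`, that later filters override earlier ones, and that unmatched names are kept. The helper `keep_name` and the list `NAMES` are hypothetical, and the class name `AsyncImageGenerationModel` is taken from `docs/api/models/image_generation/index.md`.

```python
import re
from typing import Sequence

# Filters modelled on the gallery page; "!" means "exclude names matching this".
FILTERS = ["!^ImageGenerationModel$", "!^AsyncImageGenerationModel$", "!typing"]


def keep_name(name: str, filters: Sequence[str]) -> bool:
    """Return True if `name` survives the filters (simplified model)."""
    keep = True  # assumption: with exclusion filters, unmatched names are kept
    for pattern in filters:
        exclude = pattern.startswith("!")
        regex = re.compile(pattern[1:] if exclude else pattern)
        if regex.search(name):
            keep = not exclude  # a later matching filter overrides earlier ones
    return keep


# Hypothetical member names of the documented module.
NAMES = [
    "ImageGenerationModel",
    "AsyncImageGenerationModel",
    "InferenceEndpointsImageGeneration",
    "typing",
]
visible = [name for name in NAMES if keep_name(name, FILTERS)]
print(visible)
```

Because the first two patterns are anchored with `^` and `$`, they only exclude names that match exactly, so a misspelled class name in a filter would leave that class visible in the gallery.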
diff --git a/docs/api/models/image_generation/index.md b/docs/api/models/image_generation/index.md
@@ -0,0 +1,7 @@
+# ImageGenerationModel
+
+This section contains the API reference for the `distilabel` image generation models, both for the [`ImageGenerationModel`][distilabel.models.image_generation.ImageGenerationModel] synchronous implementation, and for the [`AsyncImageGenerationModel`][distilabel.models.image_generation.AsyncImageGenerationModel] asynchronous one.
+
+For more information and examples on how to use existing image generation models or create custom ones, please refer to [Tutorial - ImageGenerationModel](../../../sections/how_to_guides/basic/task/image_task.md).
+
+::: distilabel.models.image_generation.base
diff --git a/docs/api/pipeline/typing.md b/docs/api/pipeline/typing.md
diff --git a/docs/api/step/typing.md b/docs/api/step/typing.md
diff --git a/docs/api/task/image_task.md b/docs/api/task/image_task.md
@@ -0,0 +1,7 @@
+# ImageTask
+
+This section contains the API reference for the `distilabel` image generation tasks.
+
+For more information on how the [`ImageTask`][distilabel.steps.tasks.ImageTask] works, and to see some examples, check the [Tutorial - Task - ImageTask](../../sections/how_to_guides/basic/task/generator_task.md) page.
+
+::: distilabel.steps.tasks.base.ImageTask
diff --git a/docs/api/task/task_gallery.md b/docs/api/task/task_gallery.md
@@ -8,5 +8,6 @@ This section contains the existing [`Task`][distilabel.steps.tasks.Task] subclas
         - "!Task"
         - "!_Task"
         - "!GeneratorTask"
+        - "!ImageTask"
         - "!ChatType"
         - "!typing"
diff --git a/docs/api/task/typing.md b/docs/api/task/typing.md
diff --git a/docs/api/typing.md b/docs/api/typing.md
@@ -0,0 +1,8 @@
+# Types
+
+This section contains the different types used across the distilabel codebase.
+
+::: distilabel.typing.base
+::: distilabel.typing.steps
+::: distilabel.typing.models
+::: distilabel.typing.pipeline
diff --git a/docs/assets/tutorials-assets/math-sheperd.png b/docs/assets/tutorials-assets/math-sheperd.png
diff --git a/docs/sections/getting_started/installation.md b/docs/sections/getting_started/installation.md
@@ -57,6 +57,8 @@ Additionally, as part of `distilabel` some extra dependencies are available, mai
 
 - `sentence-transformers`: for generating sentence embeddings using [sentence-transformers](https://github.com/UKPLab/sentence-transformers).
 
+- `mlx`: for using [MLX](https://github.com/ml-explore/mlx) models via the `MlxLLM` integration.
+
 ### Data processing
 
 - `ray`: for scaling and distributing a pipeline with [Ray](https://github.com/ray-project/ray).

diff --git a/docs/sections/how_to_guides/advanced/distiset.md b/docs/sections/how_to_guides/advanced/distiset.md
@@ -119,6 +119,33 @@ class MagpieGenerator(GeneratorTask, MagpieBase):
 
 The `Citations` section can include any number of BibTeX references. To define them, add as many elements as needed, just like in the example: each citation is a block of the form ` ```@misc{...}``` `. This information is used automatically in the README of your `Distiset` if you call `distiset.push_to_hub`. Alternatively, if no `Citations` section is found but the `References` section contains URLs pointing to `https://arxiv.org/`, we will try to obtain the equivalent BibTeX automatically. This way, Hugging Face can track the paper for you, and it is easier to find other datasets citing the same paper or to visit the paper page directly.
 
+#### Image Datasets
+
+!!! info "Keep reading if you are interested in Image datasets"
+
+    The `Distiset` object has a new method, `transform_columns_to_image`, that transforms the images to `PIL.Image.Image` before pushing the dataset to the Hugging Face Hub.
+
+Since version `1.5.0`, the [`ImageGeneration`](https://distilabel.argilla.io/dev/components-gallery/task/imagegeneration/) task can generate images from text. By default, the whole process works internally with a string representation of the images, which keeps processing simple. However, if the generated dataset is going to be stored on the Hugging Face Hub, a proper image object may be preferable so that, for example, the images show up in the dataset viewer. The following pipeline, extracted from `examples/image_generation.py` at the root of the repository, shows how to do it:
+
+```diff
+# Imports are omitted; we are only interested in the pipeline definition
+with Pipeline(name="image_generation_pipeline") as pipeline:
+    igm = InferenceEndpointsImageGeneration(model_id="black-forest-labs/FLUX.1-schnell")
+    img_generation = ImageGeneration(
+        name="flux_schnell",
+        llm=igm,
+    )
+    ...
+
+if __name__ == "__main__":
+    distiset = pipeline.run(use_cache=False, dataset=ds)
+    # Save the images as `PIL.Image.Image`
++   distiset = distiset.transform_columns_to_image("image")
+    distiset.push_to_hub(...)
+```
+
+Calling [`transform_columns_to_image`][distilabel.distiset.Distiset.transform_columns_to_image] converts the image columns we may have generated (in this case only the `image` column, but a list of columns can be passed). The transformation applies to every leaf node in the pipeline, so if the pipeline produces several subsets, the `image` column is transformed in all of them.
+
 ### Save and load from disk
 
 Take into account that these methods work as `datasets.load_from_disk` and `datasets.Dataset.save_to_disk` so the arguments are directly passed to those methods. This means you can also make use of `storage_options` argument to save your [`Distiset`][distilabel.distiset.Distiset] in your cloud provider, including the distilabel artifacts (`pipeline.yaml`, `pipeline.log` and the `README.md` with the dataset card). You can read more in `datasets` documentation [here](https://huggingface.co/docs/datasets/filesystems#saving-serialized-datasets).
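
The column transformation described in the `Image Datasets` section can be illustrated with a minimal, library-free sketch. It assumes the string representation is base64-encoded image bytes, which is an assumption: the documentation only says "a string representation". The helpers `image_str_to_bytes`, `to_pil` and `transform_columns` are hypothetical names for illustration and are not part of `distilabel`.

```python
import base64
import io
from typing import Dict, List, Sequence, Union


def image_str_to_bytes(image_str: str) -> bytes:
    """Decode a base64 string into raw image bytes (assumed encoding)."""
    return base64.b64decode(image_str)


def to_pil(raw: bytes):
    """Open raw bytes as a `PIL.Image.Image`; requires Pillow to be installed."""
    from PIL import Image  # imported lazily so the rest runs without Pillow

    return Image.open(io.BytesIO(raw))


def transform_columns(
    rows: List[Dict[str, object]],
    columns: Union[str, Sequence[str]],
    decode=image_str_to_bytes,
) -> List[Dict[str, object]]:
    """Apply `decode` to the given column(s) of every row, leaving others as-is."""
    if isinstance(columns, str):
        columns = [columns]
    return [
        {key: decode(value) if key in columns else value for key, value in row.items()}
        for row in rows
    ]


# Placeholder payload standing in for real image bytes.
encoded = base64.b64encode(b"not-a-real-image").decode("ascii")
rows = [{"prompt": "a cat", "image": encoded}]
decoded_rows = transform_columns(rows, "image")
```

Passing a function that chains `image_str_to_bytes` and `to_pil` as `decode` would yield `PIL.Image.Image` objects when the payload is a real image. Accepting either a single column name or a list mirrors the description of `transform_columns_to_image` in the hunk above.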