Commit

Merge branch 'feat/dataset-instruction-response-pipeline' of https://…
…github.com/argilla-io/distilabel into feat/dataset-instruction-response-pipeline
burtenshaw committed Jan 28, 2025
2 parents f1fb538 + a0c23f6 commit d8e71ea
Showing 225 changed files with 9,341 additions and 1,370 deletions.
26 changes: 26 additions & 0 deletions .github/ISSUE_TEMPLATE/1-add_documentation_report.yml
@@ -0,0 +1,26 @@
name: "\U0001F4DA Add a documentation report"
description: "Have you spotted a typo or mistake in our docs?"
title: "[DOCS]"
labels: ["documentation"]
assignees: []

body:
  - type: markdown
    attributes:
      value: "Thank you for reporting a documentation mistake! Before you get started, please [search to see](https://github.com/argilla-io/distilabel/issues) if an issue already exists for the bug you encountered."

  - type: textarea
    id: doc_report
    attributes:
      label: "Which page or section is this issue related to?"
      description: "Please include the URL and/or source."
    validations:
      required: false

  - type: textarea
    id: doc_review
    attributes:
      label: "What are you documenting, or what change are you making in the documentation?"
      description: "If a documentation needs to be created, please specify its coverage.\n If there's a typo or something needs revisiting, please indicate it and show code/text/screenshots."
    validations:
      required: false
70 changes: 70 additions & 0 deletions .github/ISSUE_TEMPLATE/2-bug_python.yml
@@ -0,0 +1,70 @@
name: "\U0001FAB2 Bug report"
description: "Report bugs and unexpected behavior."
title: "[BUG]"
labels: ["bug", "ml-internal"]
assignees: []

body:
  - type: markdown
    attributes:
      value: "Thank you for reporting a bug! Before you get started, please [search to see](https://github.com/argilla-io/distilabel/issues) if an issue already exists for the bug you encountered."

  - type: textarea
    id: bug_description
    attributes:
      label: "Describe the bug"
      description: "A clear and concise description of the bug."
    validations:
      required: true

  - type: textarea
    id: stacktrace
    attributes:
      label: "To reproduce"
      description: "The code to reproduce the behavior."
      placeholder: |
        ```python
        my_python_code
        ```
    validations:
      required: false

  - type: textarea
    id: expected_behavior
    attributes:
      label: "Expected behavior"
      description: "A clear and concise description of what you expected to happen."
    validations:
      required: false

  - type: textarea
    id: screenshots
    attributes:
      label: "Screenshots"
      description: "If applicable, add screenshots to help explain your problem."
    validations:
      required: false

  - type: textarea
    id: environment
    attributes:
      label: "Environment"
      description: "Since version 1.16.0 you can use `python -m argilla info` command to easily get the used versions."
      value: |
        - Distilabel Version [e.g. 1.0.0]:
        - Python Version [e.g. 3.11]:
    validations:
      required: false

  - type: textarea
    id: additional_context
    attributes:
      label: "Additional context"
      description: "Add any other relevant information."
    validations:
      required: false

  - type: markdown
    attributes:
      value: |
        📌 Make sure you have provided all the required information in each section so we can support you properly.
44 changes: 44 additions & 0 deletions .github/ISSUE_TEMPLATE/3-feature_request.yml
@@ -0,0 +1,44 @@
name: "\U0001F195 Feature request"
description: "Share cool new ideas for the project."
title: "[FEATURE]"
labels: ["enhancement", "ml-internal"]
assignees: []


body:
  - type: markdown
    attributes:
      value: "Thank you for sharing your feature request! Please fill out the sections below."

  - type: textarea
    id: feature_request
    attributes:
      label: "Is your feature request related to a problem? Please describe."
      description: "A clear and concise description of what the problem is."
      placeholder: "I'm always frustrated when..."
    validations:
      required: false

  - type: textarea
    id: feature_description
    attributes:
      label: "Describe the solution you'd like"
      description: "A clear and concise description of what you want to happen."
    validations:
      required: false

  - type: textarea
    id: feature_alternatives
    attributes:
      label: "Describe alternatives you've considered"
      description: "A clear and concise description of any alternative solutions or features you've considered."
    validations:
      required: false

  - type: textarea
    id: additional_context
    attributes:
      label: "Additional context"
      description: "Add any other context or screenshots about the feature request here."
    validations:
      required: false
5 changes: 5 additions & 0 deletions .github/ISSUE_TEMPLATE/config.yml
@@ -0,0 +1,5 @@
blank_issues_enabled: false
contact_links:
  - name: 🗯 Community Discussions
    url: http://hf.co/join/discord
    about: Our Discord Community loves to discuss distilabel and NLP topics
20 changes: 0 additions & 20 deletions .github/ISSUE_TEMPLATE/🆕-feature-request.md

This file was deleted.

30 changes: 0 additions & 30 deletions .github/ISSUE_TEMPLATE/🐛-bug-report.md

This file was deleted.

16 changes: 0 additions & 16 deletions .github/ISSUE_TEMPLATE/📚-documentation-update.md

This file was deleted.

6 changes: 6 additions & 0 deletions .github/workflows/test.yml
@@ -50,6 +50,12 @@ jobs:
        if: steps.cache.outputs.cache-hit != 'true'
        run: ./scripts/install_dependencies.sh

      - name: Setup tmate session
        uses: mxschmitt/action-tmate@v3
        if: ${{ matrix.python-version == '3.12' && github.event_name == 'workflow_dispatch' && inputs.tmate_session }}
        with:
          limit-access-to-actor: true

      - name: Lint
        run: make lint

2 changes: 1 addition & 1 deletion .gitignore
@@ -77,4 +77,4 @@ venv.bak/
 # Other
 *.log
 *.swp
-.DS_Store
+.DS_Store
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -11,7 +11,7 @@ repos:
           - --fuzzy-match-generates-todo

   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.7.2
+    rev: v0.8.1
     hooks:
       - id: ruff
         args: [--fix]
1 change: 1 addition & 0 deletions README.md
@@ -94,6 +94,7 @@ In addition, the following extras are available:
- `vertexai`: for using [Google Vertex AI](https://cloud.google.com/vertex-ai) proprietary models via the `VertexAILLM` integration.
- `vllm`: for using [vllm](https://github.com/vllm-project/vllm) serving engine via the `vLLM` integration.
- `sentence-transformers`: for generating sentence embeddings using [sentence-transformers](https://github.com/UKPLab/sentence-transformers).
- `mlx`: for using [MLX](https://github.com/ml-explore/mlx) models via the `MlxLLM` integration.

### Structured generation

10 changes: 10 additions & 0 deletions docs/api/models/image_generation/image_generation_gallery.md
@@ -0,0 +1,10 @@
# ImageGenerationModel Gallery

This section contains the existing [`ImageGenerationModel`][distilabel.models.image_generation] subclasses implemented in `distilabel`.

::: distilabel.models.image_generation
    options:
        filters:
        - "!^ImageGenerationModel$"
        - "!^AsyncImageGenerationModel$"
        - "!typing"
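
The exclusion filters in this gallery page can be illustrated with a small Python sketch. It assumes that each filter is a regular expression searched against the member name and that a leading `!` means "exclude"; it is a simplified model for illustration, not the documentation tool's actual implementation, and `is_documented` is a hypothetical helper.

```python
import re
from typing import List

# Exclusion filters in the style of the gallery page; a leading "!" marks an exclusion.
FILTERS = ["!^ImageGenerationModel$", "!^AsyncImageGenerationModel$", "!typing"]


def is_documented(name: str, filters: List[str]) -> bool:
    """Return False when any exclusion pattern is found in `name`.

    Simplified model: only "!"-prefixed (exclusion) filters are handled.
    """
    for pattern in filters:
        if pattern.startswith("!") and re.search(pattern[1:], name):
            return False
    return True


names = [
    "ImageGenerationModel",
    "AsyncImageGenerationModel",
    "InferenceEndpointsImageGeneration",
    "typing",
]
# Only the concrete subclass survives the filters.
documented = [name for name in names if is_documented(name, FILTERS)]
```

Because `^` and `$` anchor the whole name, a pattern has to spell the class name exactly; a misspelled pattern excludes nothing.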
7 changes: 7 additions & 0 deletions docs/api/models/image_generation/index.md
@@ -0,0 +1,7 @@
# ImageGenerationModel

This section contains the API reference for the `distilabel` image generation models, both for the [`ImageGenerationModel`][distilabel.models.image_generation.ImageGenerationModel] synchronous implementation, and for the [`AsyncImageGenerationModel`][distilabel.models.image_generation.AsyncImageGenerationModel] asynchronous one.

For more information and examples on how to use existing image generation models or create custom ones, please refer to [Tutorial - ImageGenerationModel](../../../sections/how_to_guides/basic/task/image_task.md).

::: distilabel.models.image_generation.base
3 changes: 0 additions & 3 deletions docs/api/pipeline/typing.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/step/typing.md

This file was deleted.

7 changes: 7 additions & 0 deletions docs/api/task/image_task.md
@@ -0,0 +1,7 @@
# ImageTask

This section contains the API reference for the `distilabel` image generation tasks.

For more information on how the [`ImageTask`][distilabel.steps.tasks.ImageTask] works, and to see some examples, check the [Tutorial - Task - ImageTask](../../sections/how_to_guides/basic/task/generator_task.md) page.

::: distilabel.steps.tasks.base.ImageTask
1 change: 1 addition & 0 deletions docs/api/task/task_gallery.md
@@ -8,5 +8,6 @@ This section contains the existing [`Task`][distilabel.steps.tasks.Task] subclas
- "!Task"
- "!_Task"
- "!GeneratorTask"
- "!ImageTask"
- "!ChatType"
- "!typing"
3 changes: 0 additions & 3 deletions docs/api/task/typing.md

This file was deleted.

8 changes: 8 additions & 0 deletions docs/api/typing.md
@@ -0,0 +1,8 @@
# Types

This section contains the different types used across the distilabel codebase.

::: distilabel.typing.base
::: distilabel.typing.steps
::: distilabel.typing.models
::: distilabel.typing.pipeline
Binary file added docs/assets/tutorials-assets/math-sheperd.png
2 changes: 2 additions & 0 deletions docs/sections/getting_started/installation.md
@@ -57,6 +57,8 @@ Additionally, as part of `distilabel` some extra dependencies are available, mai

- `sentence-transformers`: for generating sentence embeddings using [sentence-transformers](https://github.com/UKPLab/sentence-transformers).

- `mlx`: for using [MLX](https://github.com/ml-explore/mlx) models via the `MlxLLM` integration.

### Data processing

- `ray`: for scaling and distributing a pipeline with [Ray](https://github.com/ray-project/ray).
27 changes: 27 additions & 0 deletions docs/sections/how_to_guides/advanced/distiset.md
@@ -119,6 +119,33 @@ class MagpieGenerator(GeneratorTask, MagpieBase):

The `Citations` section can include any number of BibTeX references. To define them, add as many elements as needed, as in the example: each citation is a block of the form ` ```@misc{...}``` `. This information is used automatically in the README of your `Distiset` if you call `distiset.push_to_hub`. Alternatively, if no `Citations` section is found but the `References` section contains URLs pointing to `https://arxiv.org/`, we will try to obtain the equivalent BibTeX entry automatically. This way, Hugging Face can track the paper for you, and it becomes easier to find other datasets citing the same paper or to visit the paper page directly.
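
The arXiv fallback can be pictured with a minimal sketch. It only shows the first step, spotting arXiv identifiers among reference strings, and it is not distilabel's implementation: the `arxiv_ids` helper, the regular expression, and the placeholder identifier `1234.56789` are assumptions made for illustration.

```python
import re
from typing import List

# Captures the identifier in URLs of the form https://arxiv.org/abs/<id> or /pdf/<id>.
ARXIV_URL = re.compile(r"https://arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})")


def arxiv_ids(references: List[str]) -> List[str]:
    """Collect arXiv identifiers found in a list of reference strings."""
    ids: List[str] = []
    for reference in references:
        ids.extend(ARXIV_URL.findall(reference))
    return ids


# Placeholder references; the identifier below is not a real paper.
references = [
    "[Some paper](https://arxiv.org/abs/1234.56789)",
    "[Project page](https://example.com/project)",
]
found = arxiv_ids(references)
```

Each identifier found this way could then be used to look up the corresponding BibTeX entry; that lookup is left out of the sketch.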

#### Image Datasets

!!! info "Keep reading if you are interested in Image datasets"

    The `Distiset` object has a new method, `transform_columns_to_image`, that converts image columns to `PIL.Image.Image` before the dataset is pushed to the Hugging Face Hub.

Since version `1.5.0`, the [`ImageGeneration`](https://distilabel.argilla.io/dev/components-gallery/task/imagegeneration/) task can generate images from text. By default, the whole process works internally with a string representation of the images, which keeps processing simple. If the generated dataset is going to be stored on the Hugging Face Hub, however, a proper image object may be preferable, for example so that the images show up in the dataset viewer. The following pipeline, extracted from `examples/image_generation.py` at the root of the repository, shows how to do it:

```diff
# Assume all the imports are already done, we are only interested
with Pipeline(name="image_generation_pipeline") as pipeline:
    # Model that generates the images, passed to the task below
    igm = InferenceEndpointsImageGeneration(model_id="black-forest-labs/FLUX.1-schnell")
    img_generation = ImageGeneration(
        name="flux_schnell",
        image_generation_model=igm,
    )
    ...

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False, dataset=ds)
    # Save the images as `PIL.Image.Image`
+    distiset = distiset.transform_columns_to_image("image")
    distiset.push_to_hub(...)
```

Calling [`transform_columns_to_image`][distilabel.distiset.Distiset.transform_columns_to_image] converts the image columns we may have generated (in this case only the `image` column, but a list of columns can be passed). The conversion applies to every leaf node in the pipeline, so if there are several subsets, the `image` column is looked up in each of them.
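
To make the "every subset, one or several columns" behaviour concrete, here is a minimal standard-library sketch. It is not how distilabel implements the method: it assumes the string representation is base64 text, decodes it to raw bytes as a stand-in for building a `PIL.Image.Image`, and the `decode_image_columns` helper and the sample subsets are hypothetical.

```python
import base64
from typing import Dict, List, Union

Row = Dict[str, object]
Subsets = Dict[str, List[Row]]


def decode_image_columns(subsets: Subsets, columns: Union[str, List[str]]) -> Subsets:
    """Return a copy of `subsets` where the named columns hold raw bytes.

    Hypothetical helper: base64 text is decoded to bytes as a stand-in for a
    real image object, and columns missing from a subset are skipped.
    """
    if isinstance(columns, str):
        columns = [columns]  # accept a single column name or a list of names
    decoded: Subsets = {}
    for name, rows in subsets.items():
        new_rows = []
        for row in rows:
            new_row = dict(row)  # leave the input rows untouched
            for column in columns:
                value = new_row.get(column)
                if isinstance(value, str):
                    new_row[column] = base64.b64decode(value)
            new_rows.append(new_row)
        decoded[name] = new_rows
    return decoded


# Two subsets, as if the pipeline had two leaf steps.
payload = base64.b64encode(b"fake-image-bytes").decode("ascii")
subsets = {
    "flux_schnell": [{"prompt": "a cat", "image": payload}],
    "other_leaf": [{"prompt": "a dog", "image": payload}],
}
result = decode_image_columns(subsets, "image")
```

The same call with `["image", "thumbnail"]` would convert both columns wherever they exist, which mirrors passing a list of columns in the text above.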

### Save and load from disk

Take into account that these methods work as `datasets.load_from_disk` and `datasets.Dataset.save_to_disk` so the arguments are directly passed to those methods. This means you can also make use of `storage_options` argument to save your [`Distiset`][distilabel.distiset.Distiset] in your cloud provider, including the distilabel artifacts (`pipeline.yaml`, `pipeline.log` and the `README.md` with the dataset card). You can read more in `datasets` documentation [here](https://huggingface.co/docs/datasets/filesystems#saving-serialized-datasets).