
Add first cut dataloader v2 ADR #374

Draft · wants to merge 1 commit into base `main`

Conversation

dushyantbehl (Contributor):
Description of the change

Related issue number

How to verify the PR

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

Signed-off-by: Dushyant Behl <[email protected]>

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.


1. Passing collators directly to SFTTrainer.

In the code, collators are collected by the `get_data_collator` function and passed to `SFTTrainer`. We can retain the same functionality.
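As a minimal sketch of this pattern (illustrative names, not the fms-hf-tuning implementation): a collator is just a callable that maps a list of tokenized examples to a padded batch, and `get_data_collator` returns such a callable, which is then handed to `SFTTrainer`'s `data_collator` argument.

```python
# Sketch of the collator pattern: a collator maps a list of tokenized
# examples to a padded batch. Names and padding scheme are illustrative.

def get_data_collator(pad_token_id: int):
    """Return a collator that right-pads input_ids to the batch max length."""
    def collate(examples):
        max_len = max(len(ex["input_ids"]) for ex in examples)
        batch = {"input_ids": [], "attention_mask": []}
        for ex in examples:
            ids = ex["input_ids"]
            pad = max_len - len(ids)
            batch["input_ids"].append(ids + [pad_token_id] * pad)
            batch["attention_mask"].append([1] * len(ids) + [0] * pad)
        return batch
    return collate

# The returned callable would then be passed along, e.g.
# SFTTrainer(..., data_collator=get_data_collator(tokenizer.pad_token_id))
collator = get_data_collator(pad_token_id=0)
batch = collator([{"input_ids": [5, 6, 7]}, {"input_ids": [8]}])
```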
Collaborator:

Yes, data collators have to be inferred from the dataset and the arguments passed. It is better to infer them in code for labelled datasets.


### Splitting and Interleaving datasets

Other arguments such as `splitter_arguments` can be passed to HF [`datasets.train_test_split`](https://huggingface.co/docs/datasets/v3.0.1/en/package_reference/main_classes#datasets.Dataset.train_test_split) to create a test/train split of the dataset.
Collaborator:

Is this for creating the validation dataset?

Contributor Author:

Yes, in case users want to split the dataset via a ratio.
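For reference, `train_test_split` takes a `test_size` ratio; a pure-Python sketch of the semantics (the real HF method also handles shuffling, seeding, and stratification):

```python
import random

def train_test_split(rows, test_size=0.3, seed=42):
    """Mimic datasets.Dataset.train_test_split: shuffle, then slice by ratio."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(round(len(rows) * test_size))
    return {"train": rows[n_test:], "test": rows[:n_test]}

split = train_test_split(range(10), test_size=0.3)
# len(split["test"]) == 3, len(split["train"]) == 7
```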

Existing functionality in fms-hf-tuning, e.g. `tuning/utils/preprocessing_utils.py::get_preprocessed_dataset`, can be retained as a data
handler which performs tokenization.

The implementation is flexible enough for users to specify their own data handling routines.
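A minimal sketch of how user-specified handlers could plug in: a registry of named routines applied in order to each example. The registry mechanics and handler names here are hypothetical, not the ADR's final API.

```python
# Hypothetical handler registry; each handler maps one example dict to a new one.
HANDLERS = {}

def register_handler(name):
    def wrap(fn):
        HANDLERS[name] = fn
        return fn
    return wrap

@register_handler("lowercase_text")
def lowercase_text(example):
    return {**example, "text": example["text"].lower()}

def apply_handlers(dataset, handler_names):
    """Apply the named handlers, in order, to every example (eager version)."""
    for name in handler_names:
        dataset = [HANDLERS[name](ex) for ex in dataset]
    return dataset

processed = apply_handlers([{"text": "Hello World"}], ["lowercase_text"])
```

In the real design the same idea would run through `datasets.map` so handlers stay lazy and cacheable.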
Ssukriti (Collaborator), Oct 23, 2024:

Users of the product are not going to select which data handler to apply unless they have a special use case like applying templates. Using the tuning type and the type of dataset, proper defaults have to be selected and set in the library. Please indicate how that is going to happen.

If it is a vision-text use case, the appropriate handler has to be used automatically.

The input spec which the user specifies for how to pass information to such a dataloader is this:

```
dataloader:
```
Collaborator:

It's not clear to me if this spec will be the default once this feature is merged in. If so, I urge you to consider also maintaining the simpler earlier interface where you only need to pass a data path in, so that users who do not have such complicated dataloader requirements do not need to go through the trouble of writing the spec.

Contributor Author:

@fabianlim the idea is to keep the simpler interface while also having a detailed spec which can handle complex use cases and various other requirements.
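For illustration, the two levels could coexist along these lines. The field names `training_data_path`, `datasets`, `name`, and `data_paths` below are hypothetical; `dataloader`, `sampling`, `ratio`, and `data_handlers` come from the spec excerpts in this ADR.

```
# Simple path: just point at the data
training_data_path: /data/train.jsonl

# Detailed path: full dataloader spec for complex cases
dataloader:
  datasets:
    - name: chat_data
      data_paths:
        - /data/chat.jsonl
      sampling:
        ratio: 0.3
      data_handlers:
        - name: apply_tokenizer_chat_template
```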

Collaborator:

@dushyantbehl Is my understanding right that we can have a set of inbuilt data configs which people can trigger with say some flag?

For example, we can have:
`--data-config-name pretokenized`
`--data-config-data-path`

So that we don't have to maintain multiple data manipulation paths.
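A sketch of this suggested flag interface, assuming a small registry of inbuilt configs keyed by name; the flag names, config names, and config contents are all illustrative:

```python
import argparse

# Hypothetical inbuilt data configs, keyed by the name the user passes.
INBUILT_DATA_CONFIGS = {
    "pretokenized": {"data_handlers": []},
    "chat": {"data_handlers": [{"name": "apply_tokenizer_chat_template"}]},
}

def build_data_config(argv):
    """Resolve CLI flags to one of the inbuilt data configs."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-config-name", choices=INBUILT_DATA_CONFIGS)
    parser.add_argument("--data-config-data-path")
    args = parser.parse_args(argv)
    config = dict(INBUILT_DATA_CONFIGS[args.data_config_name])
    config["data_path"] = args.data_config_data_path
    return config

cfg = build_data_config(
    ["--data-config-name", "pretokenized", "--data-config-data-path", "train.jsonl"]
)
```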

Collaborator:

I think it would be even better if defaults could be selected automatically for simple file formats and use cases: users enter multimodal JSON and the default handler is selected. Only for users who want to modify it, or who have a complex use case, would we also expose a way to do that.


When the dataloader goes through each `DataSetConfig`

The HF `datasets` library implements functionality to process different types of files via its `load_dataset` factory.
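`load_dataset` infers a builder from the file type (`json`, `csv`, `parquet`, `arrow`, ...). A pure-Python sketch of that dispatch step, with an illustrative extension table rather than the library's actual internals:

```python
from pathlib import Path

# Illustrative mapping from file extension to a dataset builder name,
# mirroring how datasets.load_dataset infers its builder.
BUILDERS = {
    ".json": "json",
    ".jsonl": "json",
    ".csv": "csv",
    ".parquet": "parquet",
    ".arrow": "arrow",
}

def infer_builder(data_path: str) -> str:
    """Pick a builder for a data file based on its extension."""
    ext = Path(data_path).suffix
    try:
        return BUILDERS[ext]
    except KeyError:
        raise ValueError(f"unsupported file type: {ext}") from None

builder = infer_builder("/data/train.jsonl")
# load_dataset(builder, data_files=data_path) would then do the actual loading
```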
Collaborator:

It is not clear to me if the plan is to implement a new DataLoader abstraction or not. I feel that from a data standpoint, the abstraction interface should be the Dataset, not the DataLoader. This is because accelerate has its own DataLoader implementation; you don't want to touch that.


Contributor Author:

The plan is not to superimpose on the dataloader implementations of accelerate or others, but rather to say that this internal abstraction will use the HF/Accelerate API, while others can implement dataloaders which are experimental and not yet available in HF.

Thanks for the pointer on the StatefulDataLoader; we can definitely evaluate it. The idea of having a design like this is to allow the flexibility of implementing custom dataloaders, like the one in our fms-fsdp repo, inside fms-hf-tuning.

Collaborator:

I believe accelerate's stance on the dataloader is that it shouldn't be user-implemented. I feel you can cover most use cases by customising the dataset, data collator, sampler, or batch sampler. Do you have a use case where these are not sufficient?

Ssukriti (Collaborator), Oct 23, 2024:

The use case for a custom dataloader is also not clear to me. From the use cases you have defined so far, we just need:

  1. preprocessing functions for selected data types like parquet, chat, etc.
  2. code for sampling, and thus mixing different formats of datasets in one training run

These do not warrant a new DataLoader; they are just preprocessing functions. Also, a big reason for using standard HF APIs in this repo was simplicity and ease of maintenance, with the direction being that any new logic should be contributed to HF so we can leverage standard APIs.

If someone contributes a dataloader for experimentation, we will have to make that very clear in the docs and separate it from product use cases.


Data collators for TRL use cases, such as chat-based interactions, apply chat templates and proper attention masking on the
tokenized data; `DataCollatorForCompletionOnlyLM`, for example, handles a specific piece of functionality on the data. In this design we consider two approaches for data collators.
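The core of a completion-only collator like `DataCollatorForCompletionOnlyLM` is label masking: tokens before the response get label `-100` so the loss is computed only on the completion. A simplified sketch over plain token-id lists; the real collator locates the response template in the token stream, while here the start index is given directly:

```python
IGNORE_INDEX = -100  # labels with this value are skipped by the loss

def mask_prompt_labels(input_ids, response_start):
    """Copy input_ids into labels, masking everything before the response."""
    labels = list(input_ids)
    for i in range(min(response_start, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

labels = mask_prompt_labels([11, 12, 13, 14, 15], response_start=3)
# labels == [-100, -100, -100, 14, 15]
```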

Collaborator:

Similarly, I feel you do not need to touch the DataCollators, since they are part of the DataLoader abstraction.

Contributor Author:

Agreed on that front; hence we also prefer the approach where we use pre-implemented collators and do not reimplement or break any API.

Supporting a stateful dataloader will mean refactoring the fms-fsdp implementation into our abstract `Dataloader` class.

In brief, things to consider here will be,
1. Data handler support needs to be added to the stateful data loader as we want lazy execution of handlers (as and when data is loaded).
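A minimal sketch of the stateful-dataloader idea, mirroring the `state_dict`/`load_state_dict` interface of `torchdata`'s `StatefulDataLoader`: the loader records how many batches it has yielded so a resumed run can skip them, and lazy data handlers would run inside `__iter__` as batches are produced. This toy class is illustrative, not the fms-fsdp implementation.

```python
class StatefulLoader:
    """Toy stateful loader: resumable iteration over an indexable dataset."""

    def __init__(self, dataset, batch_size=2):
        self.dataset = dataset
        self.batch_size = batch_size
        self._batches_yielded = 0

    def __iter__(self):
        start = self._batches_yielded * self.batch_size
        for i in range(start, len(self.dataset), self.batch_size):
            self._batches_yielded += 1
            # lazy data handlers would run here, on the batch being yielded
            yield self.dataset[i : i + self.batch_size]

    def state_dict(self):
        return {"batches_yielded": self._batches_yielded}

    def load_state_dict(self, state):
        self._batches_yielded = state["batches_yielded"]

loader = StatefulLoader(list(range(6)))
it = iter(loader)
first = next(it)             # [0, 1]
state = loader.state_dict()  # checkpoint after one batch

resumed = StatefulLoader(list(range(6)))
resumed.load_state_dict(state)
remaining = list(resumed)    # [[2, 3], [4, 5]]
```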
Collaborator:

I think accelerate v1.0 should already have the stateful dataloader implemented, if I recall correctly?

Contributor Author:

I saw your pointer and have replied to your question above.

```
sampling:
  ratio: 0.3
data_handlers:
  - name: apply_tokenizer_chat_template
```
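The `sampling.ratio` field suggests weighted mixing across datasets. A pure-Python sketch of ratio-based interleaving; HF's `datasets.interleave_datasets` provides this via its `probabilities` parameter:

```python
import random

def interleave(datasets, ratios, num_samples, seed=0):
    """Draw examples from several datasets with the given mixing ratios."""
    rng = random.Random(seed)
    iters = [iter(d) for d in datasets]  # toy version: assumes enough data
    out = []
    for _ in range(num_samples):
        idx = rng.choices(range(len(datasets)), weights=ratios)[0]
        out.append(next(iters[idx]))
    return out

mixed = interleave(
    [["a"] * 100, ["b"] * 100],
    ratios=[0.3, 0.7],
    num_samples=10,
)
```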
Collaborator:

@dushyantbehl @ashokponkumar doesn't SFTTrainer add the chat template on its own? https://huggingface.co/docs/trl/en/sft_trainer#dataset-format-support if the dataset is of a certain format. Chat is already supported by SFTTrainer; since the majority of the data handler examples you have given apply templates, I am trying to understand the use case.


4 participants