Add first cut dataloader v2 ADR #374
Conversation
Signed-off-by: Dushyant Behl <[email protected]>
Thanks for making a pull request! 😃
1. Passing collators directly to SFTTrainer.
In the code, collators are collected by the `get_data_collator` function and passed to `SFTTrainer`. We can retain the same functionality
Yes, data collators have to be inferred from the dataset and the arguments passed. It is better to infer them in code for labelled datasets.
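As a sketch of what such inference could look like — the function name, collator names, and selection logic here are illustrative assumptions, not the actual `get_data_collator` implementation in fms-hf-tuning:

```python
# Hypothetical sketch of collator inference from dataset columns and
# arguments; the real get_data_collator may use different names and rules.
def infer_collator_name(dataset_columns, packing=False):
    """Pick a collator based on the columns present in the dataset."""
    if packing:
        # Packed datasets need no padding-based collation.
        return None
    if "labels" in dataset_columns:
        # Pre-tokenized, labelled data: pad inputs and labels together.
        return "DataCollatorForSeq2Seq"
    # Untokenized text: mask the prompt so loss is on the completion only.
    return "DataCollatorForCompletionOnlyLM"

print(infer_collator_name({"input_ids", "labels"}))  # labelled dataset
```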
### Splitting and Interleaving datasets
Other arguments such as `splitter_arguments` can be passed to HF [`datasets.Dataset.train_test_split`](https://huggingface.co/docs/datasets/v3.0.1/en/package_reference/main_classes#datasets.Dataset.train_test_split) to create a train/test split of the dataset.
Is this for creating a validation dataset?
Yes, in case users want to split the dataset by ratio.
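A minimal pure-Python sketch of what ratio-based splitting does — HF `datasets.Dataset.train_test_split` implements this (with shuffling and seeding) on `Dataset` objects; this stand-in works on plain lists:

```python
import random

def ratio_split(examples, test_size=0.2, seed=42):
    """Shuffle and split a list of examples into train/test by ratio,
    mimicking the behaviour of datasets.Dataset.train_test_split."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    return {"train": shuffled[n_test:], "test": shuffled[:n_test]}

split = ratio_split(list(range(10)), test_size=0.2)
print(len(split["train"]), len(split["test"]))  # 8 2
```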
in fms-hf-tuning via `tuning/utils/preprocessing_utils.py::get_preprocessed_dataset` can be retained as a data handler which performs tokenization.
The implementation is flexible enough for users to specify their own data handling routines
Users of the product are not going to select which data handler to apply unless they have a special use case, like applying templates. Using the tuning type and the type of dataset, proper defaults have to be selected and set in the library. Please indicate how that is going to happen.
If it is a vision-text use case, the appropriate handler has to be used automatically.
The input spec which the user specifies to pass information to such a data loader is this:
```
dataloader:
```
It's not clear to me if this spec will be the default once this feature is merged in. If so, I urge you to also maintain the simpler earlier interface where you only need to pass in a data path, so users who do not have such complicated dataloader requirements do not need to go through the trouble of writing the spec.
@fabianlim the idea is to keep the simpler interface while also having a detailed spec which can handle complex use cases and various other requirements.
@dushyantbehl Is my understanding right that we can have a set of inbuilt data configs which people can trigger with, say, some flag?
For example, we could have:
--data-config-name pretokenized
--data-config-data-path
so that we don't have to maintain multiple data manipulation paths.
I think it would be even better if defaults could be selected automatically for simple file formats and use cases: the user enters multimodal JSON and the default handler is selected. Only for users who want to modify it, or who have a complex use case, do we also expose a way to do that.
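One way automatic defaults could work is dispatching on the file extension and the dataset's columns. Everything below is a hypothetical sketch — the handler names and mapping are illustrative, not the fms-hf-tuning API:

```python
# Hypothetical default-handler selection; handler names are made up
# for illustration and are not part of fms-hf-tuning.
import os

DEFAULT_HANDLERS = {
    "jsonl": "apply_tokenizer_chat_template",
    "parquet": "tokenize_text_field",
}

def select_default_handler(path, columns):
    """Pick a default data handler from the file type and dataset fields."""
    ext = os.path.splitext(path)[1].lstrip(".")
    if "images" in columns:
        # Multimodal data gets the vision-text handler automatically.
        return "apply_vision_text_template"
    return DEFAULT_HANDLERS.get(ext, "tokenize_text_field")

print(select_default_handler("train.jsonl", {"messages"}))
```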
When the dataloader goes through each `DataSetConfig`
The HF dataloader implements functionality to process different types of files via its `load_dataset` factory.
It is not clear to me if the plan is to implement a new `DataLoader` abstraction or not. I feel that from a data standpoint, the abstraction interface should be the `DataSet`, not the `DataLoader`. This is because `accelerate` has its own `DataLoader` implementation; you don't want to touch that.
The plan is not to superimpose on the data loader implementations of accelerate or others, but rather to say that this internal abstraction will use the HF/Accelerate API, while others can implement data loaders which are experimental and not yet available in HF.
Thanks for the pointer on the StatefulDataloader; we can definitely evaluate that, but the idea of a design like this is to allow the flexibility of implementing custom dataloaders, like the one in our fms-fsdp repo, inside fms-hf-tuning.
I believe accelerate's stance on the data loader is that it shouldn't be user-implemented. I feel you can cover most use cases by customising the dataset, data collator, sampler, or batch sampler. Do you have a use case where these are not sufficient?
The use case of a custom dataloader is also not clear to me. From the use cases you have defined so far, we just need:
- preprocessing functions for selected data types like parquet, chat, etc.
- code for sampling, and thus mixing different formats of datasets in one training run
These do not warrant a new DataLoader; they are just preprocessing functions. Also, a big reason for using standard HF APIs in this repo was simplicity and ease of maintenance, with the direction being that any new logic should be contributed to HF so we can leverage standard APIs.
If someone contributes a dataloader for experimentation, we will have to make that very clear in the docs and separate it from product use cases.
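The sampling-and-mixing piece mentioned above is essentially what `datasets.interleave_datasets` with probabilities does. A minimal pure-Python sketch of the idea (sampling with replacement; the function name is illustrative):

```python
import random

def mix_datasets(datasets, ratios, n_samples, seed=0):
    """Draw n_samples examples, choosing the source dataset for each
    draw according to the given ratios (sampling with replacement),
    similar in spirit to datasets.interleave_datasets with probabilities."""
    rng = random.Random(seed)
    mixed = []
    for _ in range(n_samples):
        (src,) = rng.choices(range(len(datasets)), weights=ratios)
        mixed.append(rng.choice(datasets[src]))
    return mixed

out = mix_datasets([["a1", "a2"], ["b1", "b2"]], ratios=[0.7, 0.3], n_samples=10)
print(out)
```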
Data collators, specifically for TRL use cases like chat-based interactions which apply chat templates and proper attention masking on the tokenized data (as in the case of `DataCollatorForCompletionOnlyLM`), handle specific functionality on the data. In this design we consider two approaches for data collators.
Similarly, I feel you do not need to touch the `DataCollator`s, since they are part of the `DataLoader` abstraction.
Agreed on that front; hence we also prefer the approach where we use pre-implemented collators and do not reimplement or break any API.
Supporting a stateful dataloader will mean refactoring the fms-fsdp implementation into our abstract `Dataloader` class.
In brief, things to consider here will be:
1. Data handler support needs to be added to the stateful data loader, as we want lazy execution of handlers (as and when data is loaded).
I think accelerate v1.0 should already have the stateful dataloader implemented, if I recall correctly?
I saw your pointer and have replied to your question above.
```
sampling:
  ratio: 0.3
data_handlers:
  - name: apply_tokenizer_chat_template
```
@dushyantbehl @ashokponkumar Doesn't SFT trainer add the chat template on its own? https://huggingface.co/docs/trl/en/sft_trainer#dataset-format-support — if the dataset is of a certain format, chat is already supported by SFT trainer. Since the majority of the data handler examples you have given are applying templates, I'm trying to understand the use case.
Description of the change
Related issue number
How to verify the PR
Was the PR tested