
Add first cut dataloader v2 ADR #374

Draft · wants to merge 1 commit into base `main`

Conversation

dushyantbehl (Contributor):
Description of the change

Related issue number

How to verify the PR

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

Signed-off-by: Dushyant Behl <[email protected]>

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.


1. Passing collators directly to SFTTrainer.

In the code, collators are collected by the `get_data_collator` function and passed to `SFTTrainer`. We can retain the same functionality.
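As a minimal sketch of this pattern (illustrative names, not the fms-hf-tuning implementation): a collator is just a callable that maps a list of tokenized examples to a padded batch, and `get_data_collator` returns such a callable, which is then handed to `SFTTrainer`'s `data_collator` argument.

```python
# Sketch of the collator pattern: a collator maps a list of tokenized
# examples to a padded batch. Names and padding scheme are illustrative.

def get_data_collator(pad_token_id: int):
    """Return a collator that right-pads input_ids to the batch max length."""
    def collate(examples):
        max_len = max(len(ex["input_ids"]) for ex in examples)
        batch = {"input_ids": [], "attention_mask": []}
        for ex in examples:
            ids = ex["input_ids"]
            pad = max_len - len(ids)
            batch["input_ids"].append(ids + [pad_token_id] * pad)
            batch["attention_mask"].append([1] * len(ids) + [0] * pad)
        return batch
    return collate

# The returned callable would then be passed along, e.g.
# SFTTrainer(..., data_collator=get_data_collator(tokenizer.pad_token_id))
collator = get_data_collator(pad_token_id=0)
batch = collator([{"input_ids": [5, 6, 7]}, {"input_ids": [8]}])
```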
Collaborator:

Yes, data collators have to be inferred from the dataset and the arguments passed. It is better to infer them in code for labelled datasets.


### Splitting and Interleaving datasets

Other arguments such as `splitter_arguments` can be passed to HF [`datasets.train_test_split`](https://huggingface.co/docs/datasets/v3.0.1/en/package_reference/main_classes#datasets.Dataset.train_test_split) to create a test/train split of the dataset.
Collaborator:

Is this for creating the validation dataset?

Contributor Author:

Yes, in case users want to split the dataset via a ratio.
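For reference, `train_test_split` takes a `test_size` ratio; a pure-Python sketch of the semantics (the real HF method also handles shuffling, seeding, and stratification):

```python
import random

def train_test_split(rows, test_size=0.3, seed=42):
    """Mimic datasets.Dataset.train_test_split: shuffle, then slice by ratio."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(round(len(rows) * test_size))
    return {"train": rows[n_test:], "test": rows[:n_test]}

split = train_test_split(range(10), test_size=0.3)
# len(split["test"]) == 3, len(split["train"]) == 7
```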

Existing functionality in fms-hf-tuning, e.g. `tuning/utils/preprocessing_utils.py::get_preprocessed_dataset`, can be retained as a data
handler which performs tokenization.

The implementation is flexible enough for users to specify their own data handling routines.
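A minimal sketch of how user-specified handlers could plug in: a registry of named routines applied in order to each example. The registry mechanics and handler names here are hypothetical, not the ADR's final API.

```python
# Hypothetical handler registry; each handler maps one example dict to a new one.
HANDLERS = {}

def register_handler(name):
    def wrap(fn):
        HANDLERS[name] = fn
        return fn
    return wrap

@register_handler("lowercase_text")
def lowercase_text(example):
    return {**example, "text": example["text"].lower()}

def apply_handlers(dataset, handler_names):
    """Apply the named handlers, in order, to every example (eager version)."""
    for name in handler_names:
        dataset = [HANDLERS[name](ex) for ex in dataset]
    return dataset

processed = apply_handlers([{"text": "Hello World"}], ["lowercase_text"])
```

In the real design the same idea would run through `datasets.map` so handlers stay lazy and cacheable.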
Ssukriti (Collaborator), Oct 23, 2024:

Users of the product are not going to select which data handler to apply unless they have a special use case like applying templates. Using the tuning type and the type of dataset, proper defaults have to be selected and set in the library. Please indicate how that is going to happen.

If it is a vision-text use case, the appropriate handler has to be used automatically.

The input spec which the user specifies for how to pass information to such a dataloader is this:

```
dataloader:
```
Collaborator:

It's not clear to me if this spec will be the default once this feature is merged in. If so, I urge you to consider also maintaining the simpler earlier interface where you only need to pass a data path in, so that users who do not have such complicated dataloader requirements do not need to go through the trouble of writing the spec.

Contributor Author:

@fabianlim the idea is to keep the simpler interface while also having a detailed spec which can handle complex use cases and various other requirements.
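For illustration, the two levels could coexist along these lines. The field names `training_data_path`, `datasets`, `name`, and `data_paths` below are hypothetical; `dataloader`, `sampling`, `ratio`, and `data_handlers` come from the spec excerpts in this ADR.

```
# Simple path: just point at the data
training_data_path: /data/train.jsonl

# Detailed path: full dataloader spec for complex cases
dataloader:
  datasets:
    - name: chat_data
      data_paths:
        - /data/chat.jsonl
      sampling:
        ratio: 0.3
      data_handlers:
        - name: apply_tokenizer_chat_template
```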

Collaborator:

@dushyantbehl Is my understanding right that we can have a set of inbuilt data configs which people can trigger with say some flag?

For example, we can have:
`--data-config-name pretokenized`
`--data-config-data-path`

So that we don't have to maintain multiple data manipulation paths.
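A sketch of this suggested flag interface, assuming a small registry of inbuilt configs keyed by name; the flag names, config names, and config contents are all illustrative:

```python
import argparse

# Hypothetical inbuilt data configs, keyed by the name the user passes.
INBUILT_DATA_CONFIGS = {
    "pretokenized": {"data_handlers": []},
    "chat": {"data_handlers": [{"name": "apply_tokenizer_chat_template"}]},
}

def build_data_config(argv):
    """Resolve CLI flags to one of the inbuilt data configs."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-config-name", choices=INBUILT_DATA_CONFIGS)
    parser.add_argument("--data-config-data-path")
    args = parser.parse_args(argv)
    config = dict(INBUILT_DATA_CONFIGS[args.data_config_name])
    config["data_path"] = args.data_config_data_path
    return config

cfg = build_data_config(
    ["--data-config-name", "pretokenized", "--data-config-data-path", "train.jsonl"]
)
```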

Collaborator:

I think it would be even better if defaults could be selected automatically for simple file formats and use cases: users enter multimodal JSON and the default handler is selected. Only for users who want to modify it, or who have a complex use case, would we also expose a way to do that.


When the dataloader goes through each `DataSetConfig`

The HF `datasets` library implements functionality to process different types of files via its `load_dataset` factory.
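`load_dataset` infers a builder from the file type (`json`, `csv`, `parquet`, `arrow`, ...). A pure-Python sketch of that dispatch step, with an illustrative extension table rather than the library's actual internals:

```python
from pathlib import Path

# Illustrative mapping from file extension to a dataset builder name,
# mirroring how datasets.load_dataset infers its builder.
BUILDERS = {
    ".json": "json",
    ".jsonl": "json",
    ".csv": "csv",
    ".parquet": "parquet",
    ".arrow": "arrow",
}

def infer_builder(data_path: str) -> str:
    """Pick a builder for a data file based on its extension."""
    ext = Path(data_path).suffix
    try:
        return BUILDERS[ext]
    except KeyError:
        raise ValueError(f"unsupported file type: {ext}") from None

builder = infer_builder("/data/train.jsonl")
# load_dataset(builder, data_files=data_path) would then do the actual loading
```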
Collaborator:

It is not clear to me if the plan is to implement a new DataLoader abstraction or not. I feel that from a data standpoint, the abstraction interface should be the Dataset, not the DataLoader. This is because accelerate has its own DataLoader implementation; you don't want to touch that.


Contributor Author:

The plan is not to superimpose on the dataloader implementations of accelerate or others, but rather to say that this internal abstraction will use the HF/Accelerate API, while others can implement dataloaders which are experimental and not yet available in HF.

Thanks for the pointer on the StatefulDataLoader; we can definitely evaluate it. The idea of having a design like this is to allow the flexibility of implementing custom dataloaders, like the one in our fms-fsdp repo, inside fms-hf-tuning.

Collaborator:

I believe accelerate's stance on the dataloader is that it shouldn't be user-implemented. I feel you can cover most use cases by customising the dataset, data collator, sampler, or batch sampler. Do you have a use case where these are not sufficient?

Ssukriti (Collaborator), Oct 23, 2024:

The use case for a custom dataloader is also not clear to me. From the use cases you have defined so far, we just need:

  1. preprocessing functions for selected data types like parquet, chat, etc.
  2. code for sampling, and thus mixing different formats of datasets in one training run

These do not warrant a new DataLoader; they are just preprocessing functions. Also, a big reason for using standard HF APIs in this repo was simplicity and ease of maintenance, with the direction being that any new logic should be contributed to HF so we can leverage standard APIs.

If someone contributes a dataloader for experimentation, we will have to make that very clear in the docs and separate it from product use cases.


Data collators for TRL use cases, such as chat-based interactions, apply chat templates and proper attention masking on the
tokenized data; `DataCollatorForCompletionOnlyLM`, for example, handles a specific piece of functionality on the data. In this design we consider two approaches for data collators.
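The core of a completion-only collator like `DataCollatorForCompletionOnlyLM` is label masking: tokens before the response get label `-100` so the loss is computed only on the completion. A simplified sketch over plain token-id lists; the real collator locates the response template in the token stream, while here the start index is given directly:

```python
IGNORE_INDEX = -100  # labels with this value are skipped by the loss

def mask_prompt_labels(input_ids, response_start):
    """Copy input_ids into labels, masking everything before the response."""
    labels = list(input_ids)
    for i in range(min(response_start, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

labels = mask_prompt_labels([11, 12, 13, 14, 15], response_start=3)
# labels == [-100, -100, -100, 14, 15]
```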

Collaborator:

Similarly, I feel you do not need to touch the DataCollators, since they are part of the DataLoader abstraction.

Contributor Author:

Agreed on that front; hence we also prefer the approach where we use pre-implemented collators and do not reimplement or break any API.

Supporting a stateful dataloader will mean refactoring the fms-fsdp implementation into our abstract `Dataloader` class.

In brief, things to consider here will be,
1. Data handler support needs to be added to the stateful data loader as we want lazy execution of handlers (as and when data is loaded).
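A minimal sketch of the stateful-dataloader idea, mirroring the `state_dict`/`load_state_dict` interface of `torchdata`'s `StatefulDataLoader`: the loader records how many batches it has yielded so a resumed run can skip them, and lazy data handlers would run inside `__iter__` as batches are produced. This toy class is illustrative, not the fms-fsdp implementation.

```python
class StatefulLoader:
    """Toy stateful loader: resumable iteration over an indexable dataset."""

    def __init__(self, dataset, batch_size=2):
        self.dataset = dataset
        self.batch_size = batch_size
        self._batches_yielded = 0

    def __iter__(self):
        start = self._batches_yielded * self.batch_size
        for i in range(start, len(self.dataset), self.batch_size):
            self._batches_yielded += 1
            # lazy data handlers would run here, on the batch being yielded
            yield self.dataset[i : i + self.batch_size]

    def state_dict(self):
        return {"batches_yielded": self._batches_yielded}

    def load_state_dict(self, state):
        self._batches_yielded = state["batches_yielded"]

loader = StatefulLoader(list(range(6)))
it = iter(loader)
first = next(it)             # [0, 1]
state = loader.state_dict()  # checkpoint after one batch

resumed = StatefulLoader(list(range(6)))
resumed.load_state_dict(state)
remaining = list(resumed)    # [[2, 3], [4, 5]]
```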
Collaborator:

I think accelerate v1.0 should already have the stateful dataloader implemented, if I recall correctly?

Contributor Author:

I saw your pointer and have replied to your question above.

```
sampling:
  ratio: 0.3
data_handlers:
  - name: apply_tokenizer_chat_template
```
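The `sampling.ratio` field suggests weighted mixing across datasets. A pure-Python sketch of ratio-based interleaving; HF's `datasets.interleave_datasets` provides this via its `probabilities` parameter:

```python
import random

def interleave(datasets, ratios, num_samples, seed=0):
    """Draw examples from several datasets with the given mixing ratios."""
    rng = random.Random(seed)
    iters = [iter(d) for d in datasets]  # toy version: assumes enough data
    out = []
    for _ in range(num_samples):
        idx = rng.choices(range(len(datasets)), weights=ratios)[0]
        out.append(next(iters[idx]))
    return out

mixed = interleave(
    [["a"] * 100, ["b"] * 100],
    ratios=[0.3, 0.7],
    num_samples=10,
)
```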
Collaborator:

@dushyantbehl @ashokponkumar doesn't SFTTrainer add the chat template on its own? https://huggingface.co/docs/trl/en/sft_trainer#dataset-format-support if the dataset is of a certain format. Chat is already supported by SFTTrainer; since the majority of the data handler examples you have given apply templates, I am trying to understand the use case.


4 participants