
Combined dataset feature #261

Open · wants to merge 36 commits into main

Conversation

@le1nux (Member) commented Sep 24, 2024

What does this PR do?

This PR addresses issue #258 (inefficiencies in the dataloader) and additionally introduces a combined dataset feature: a dataset can now comprise a list of datasets and iterate over them (sketched below).
As part of fixing the dataloader inefficiencies, the sample skipping functionality is no longer implemented at the dataloader level but in an adapted version of the PyTorch DistributedSampler. I reran a warm start and the learning behavior is equivalent to that of a full, non-warm-started run.
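
Conceptually, the combined dataset wraps a list of datasets and exposes them as a single dataset. A minimal sketch of the idea, assuming a simple cumulative-size lookup (class name and details are illustrative, not the actual modalities implementation):

    from torch.utils.data import Dataset


    class CombinedDatasetSketch(Dataset):
        """Hypothetical sketch: present a list of datasets as one flat dataset."""

        def __init__(self, datasets: list[Dataset]):
            self.datasets = datasets
            # Cumulative sizes map a global index to (dataset index, local index).
            self.cumulative_sizes: list[int] = []
            total = 0
            for dataset in datasets:
                total += len(dataset)
                self.cumulative_sizes.append(total)

        def __len__(self) -> int:
            return self.cumulative_sizes[-1] if self.cumulative_sizes else 0

        def __getitem__(self, idx: int):
            if idx < 0 or idx >= len(self):
                raise IndexError(f"index {idx} out of range")
            previous_boundary = 0
            for dataset_idx, boundary in enumerate(self.cumulative_sizes):
                if idx < boundary:
                    return self.datasets[dataset_idx][idx - previous_boundary]
                previous_boundary = boundary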

[Screenshot 2024-09-27 at 10:36:19]

General Changes

  • Introduced ResumableDistributedSampler, which is a copy of the PyTorch DistributedSampler extended with the ability to skip samples. It is used from now on for warm starts instead of skip_num_samples in the Dataloader. To skip samples, the dataloader previously had to instantiate a ResumableBatchSampler, which internally iterated over all dataset indices. For small datasets this was fine, but for larger datasets (in the trillion-token range) this became a bottleneck at instantiation time:
    self.underlying_batch_sampler = underlying_batch_sampler
    # NOTE: we are only iterating over the indices, not the actual data,
    # so this is relatively cheap
    self.indices = list(iter(self.underlying_batch_sampler))

    Skipping in the ResumableDistributedSampler is now O(1) (see the sketch after this list). The ResumableBatchSampler was removed from the codebase.
  • Replaced the packed index generation routine (inefficient due to a Python for loop)
    return [
        ((i * self.block_size - i) * self._token_size_in_bytes, self.block_size * self._token_size_in_bytes)
        for i in range(num_samples)
    ]

    with a vectorized version (a possible vectorized variant is sketched after this list).
  • Added a new NumberConversion routine num_samples_from_num_tokens (a rough conversion sketch follows below).
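
A minimal sketch of how O(1) skipping can sit on top of the stock DistributedSampler (class name, constructor arguments, and skip semantics are assumptions for illustration, not the actual modalities API):

    from torch.utils.data import Dataset, DistributedSampler


    class ResumableDistributedSamplerSketch(DistributedSampler):
        """Hypothetical sketch: drop the first skip_num_global_samples samples on resume."""

        def __init__(self, dataset: Dataset, skip_num_global_samples: int = 0, **kwargs):
            super().__init__(dataset, **kwargs)
            # Each rank owns every num_replicas-th index, so the per-rank skip is a
            # single integer division; no iteration over batches or data is needed.
            self.skip_per_rank = skip_num_global_samples // self.num_replicas

        def __iter__(self):
            indices = list(super().__iter__())
            # The skip itself is just an offset into the per-rank index list.
            return iter(indices[self.skip_per_rank:])

The important difference from the old ResumableBatchSampler is that nothing has to be replayed at instantiation time; the skip is applied as a plain offset when iteration starts.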
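
A vectorized equivalent of the removed loop could look roughly like this (standalone NumPy sketch; the function name and signature are assumptions, not the code merged in this PR):

    import numpy as np


    def generate_packed_index(num_samples: int, block_size: int, token_size_in_bytes: int) -> list[tuple[int, int]]:
        # Offsets grow by (block_size - 1) tokens per sample, matching the
        # (i * block_size - i) term of the original loop; lengths are constant.
        offsets = np.arange(num_samples, dtype=np.int64) * (block_size - 1) * token_size_in_bytes
        lengths = np.full(num_samples, block_size * token_size_in_bytes, dtype=np.int64)
        return list(zip(offsets.tolist(), lengths.tolist()))

This replaces the per-sample Python loop with array operations and scales to very large indices.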
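
The PR text does not show num_samples_from_num_tokens itself. Assuming the overlapping sample layout implied by the index formula above (consecutive samples start block_size - 1 tokens apart and each spans block_size tokens), the conversion could look roughly like this (signature and formula are assumptions):

    def num_samples_from_num_tokens(num_tokens: int, block_size: int) -> int:
        # Sample i starts at token i * (block_size - 1) and needs block_size tokens,
        # so count how many such windows fit into the token stream.
        if num_tokens < block_size:
            return 0
        return (num_tokens - block_size) // (block_size - 1) + 1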

Breaking Changes

  • Removed RepeatingDataloader. This feature for running multiple epochs was never actively used and would have been complex to maintain when refactoring the sampling. If needed, we could reimplement it.
  • In the settings, the training_progress section now has num_seen_samples instead of local_num_seen_batches, as skipping is now done at the sampler level rather than the dataloader level.
  • The batch_size and fast_forward_batch_id fields in the LLMDataLoader are no longer needed and were removed.

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)
  • I have updated the internal changelog (CHANGELOG_DEV.md)

@flxst (Member) left a comment:

LGTM :) Left a few minor comments.

"""
Initializes the PackedMemMapDatasetBase object.

Args:
raw_data_path (Path): Path to a packed binary file (*.pbin).
Use `modalities data pack_encoded_data` to create one based on a JSONL-file.
sample_key (str): The key to access the sample in the BatchEncoding.
load_index (bool, optional): Flag indicating whether to load the index. Defaults to True.
Comment from a Member:

Wouldn't it be more consistent if this defaulted to False like in PackedMemMapDatasetContinuous (see line 308)?

Reply from @le1nux (Member Author):

In PackedMemMapDatasetContinuous we would never load the index (apart from debugging purposes), which is why I made it default to False. The "Continuous" implementation does not need an index. The PackedMemMapDatasetBase, however, in its default implementation uses the index for packing the data, which is why it defaults to True.

Comment from a Member on CHANGELOG_DEV.md (outdated):

General comments on the changelog:

  • The table in the beginning also needs to be updated
  • It might be useful to reverse the order of PRs so that the newest ones come first

Further review comments (outdated, resolved):
  • src/modalities/dataloader/dataset.py (3 comments)
  • tests/dataloader/samplers/test_distributed_samplers.py (4 comments)