
Combined dataset feature #261

Open · wants to merge 36 commits into main

Conversation

@le1nux (Member) commented Sep 24, 2024

What does this PR do?

This PR addresses issue #258 (inefficiencies in the dataloader) and additionally introduces a combined dataset feature: a dataset can now comprise a list of datasets and iterate over them (sketched below).
As part of fixing the dataloader inefficiencies, the sample skipping functionality is no longer implemented at the dataloader level but in an adapted version of the PyTorch DistributedSampler. I reran a warm start and the learning behavior is equivalent to that of a full, non-warm-started run.
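
Conceptually, the combined dataset wraps a list of datasets and exposes them as a single dataset. A minimal sketch of the idea, assuming a simple cumulative-size lookup (class name and details are illustrative, not the actual modalities implementation):

    from torch.utils.data import Dataset


    class CombinedDatasetSketch(Dataset):
        """Hypothetical sketch: present a list of datasets as one flat dataset."""

        def __init__(self, datasets: list[Dataset]):
            self.datasets = datasets
            # Cumulative sizes map a global index to (dataset index, local index).
            self.cumulative_sizes: list[int] = []
            total = 0
            for dataset in datasets:
                total += len(dataset)
                self.cumulative_sizes.append(total)

        def __len__(self) -> int:
            return self.cumulative_sizes[-1] if self.cumulative_sizes else 0

        def __getitem__(self, idx: int):
            if idx < 0 or idx >= len(self):
                raise IndexError(f"index {idx} out of range")
            previous_boundary = 0
            for dataset_idx, boundary in enumerate(self.cumulative_sizes):
                if idx < boundary:
                    return self.datasets[dataset_idx][idx - previous_boundary]
                previous_boundary = boundary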

[Screenshot 2024-09-27 at 10:36:19]

General Changes

  • Introduced ResumableDistributedSampler, which is a copy of the PyTorch DistributedSampler extended with the ability to skip samples. It is used from now on for warm starts instead of skip_num_samples in the Dataloader. To skip samples, the dataloader previously had to instantiate a ResumableBatchSampler, which internally iterated over all dataset indices. For small datasets this was fine, but for larger datasets (in the trillion-token range) this became a bottleneck at instantiation time:
    self.underlying_batch_sampler = underlying_batch_sampler
    # NOTE: we are only iterating over the indices, not the actual data,
    # so this is relatively cheap
    self.indices = list(iter(self.underlying_batch_sampler))

    Skipping in the ResumableDistributedSampler is now O(1) (see the sketch after this list). The ResumableBatchSampler was removed from the codebase.
  • Replaced the packed index generation routine (inefficient due to a Python for loop)
    return [
        ((i * self.block_size - i) * self._token_size_in_bytes, self.block_size * self._token_size_in_bytes)
        for i in range(num_samples)
    ]

    with a vectorized version (a possible vectorized variant is sketched after this list).
  • Added a new NumberConversion routine num_samples_from_num_tokens (a rough conversion sketch follows below).
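
A minimal sketch of how O(1) skipping can sit on top of the stock DistributedSampler (class name, constructor arguments, and skip semantics are assumptions for illustration, not the actual modalities API):

    from torch.utils.data import Dataset, DistributedSampler


    class ResumableDistributedSamplerSketch(DistributedSampler):
        """Hypothetical sketch: drop the first skip_num_global_samples samples on resume."""

        def __init__(self, dataset: Dataset, skip_num_global_samples: int = 0, **kwargs):
            super().__init__(dataset, **kwargs)
            # Each rank owns every num_replicas-th index, so the per-rank skip is a
            # single integer division; no iteration over batches or data is needed.
            self.skip_per_rank = skip_num_global_samples // self.num_replicas

        def __iter__(self):
            indices = list(super().__iter__())
            # The skip itself is just an offset into the per-rank index list.
            return iter(indices[self.skip_per_rank:])

The important difference from the old ResumableBatchSampler is that nothing has to be replayed at instantiation time; the skip is applied as a plain offset when iteration starts.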
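
A vectorized equivalent of the removed loop could look roughly like this (standalone NumPy sketch; the function name and signature are assumptions, not the code merged in this PR):

    import numpy as np


    def generate_packed_index(num_samples: int, block_size: int, token_size_in_bytes: int) -> list[tuple[int, int]]:
        # Offsets grow by (block_size - 1) tokens per sample, matching the
        # (i * block_size - i) term of the original loop; lengths are constant.
        offsets = np.arange(num_samples, dtype=np.int64) * (block_size - 1) * token_size_in_bytes
        lengths = np.full(num_samples, block_size * token_size_in_bytes, dtype=np.int64)
        return list(zip(offsets.tolist(), lengths.tolist()))

This replaces the per-sample Python loop with array operations and scales to very large indices.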
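
The PR text does not show num_samples_from_num_tokens itself. Assuming the overlapping sample layout implied by the index formula above (consecutive samples start block_size - 1 tokens apart and each spans block_size tokens), the conversion could look roughly like this (signature and formula are assumptions):

    def num_samples_from_num_tokens(num_tokens: int, block_size: int) -> int:
        # Sample i starts at token i * (block_size - 1) and needs block_size tokens,
        # so count how many such windows fit into the token stream.
        if num_tokens < block_size:
            return 0
        return (num_tokens - block_size) // (block_size - 1) + 1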

Breaking Changes

  • Removed RepeatingDataloader. This feature for running multiple epochs was never actively used and would have been complex to maintain when refactoring the sampling. If needed, we could reimplement it.
  • In the settings, the training_progress section now has num_seen_samples instead of local_num_seen_batches, as skipping is now done at the sampler level rather than the dataloader level.
  • The batch_size and fast_forward_batch_id fields in the LLMDataLoader are no longer needed and were removed.

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)
  • I have updated the internal changelog (CHANGELOG_DEV.md)

@flxst (Member) left a comment:

LGTM :) Left a few minor comments.

"""
Initializes the PackedMemMapDatasetBase object.

Args:
raw_data_path (Path): Path to a packed binary file (*.pbin).
Use `modalities data pack_encoded_data` to create one based on a JSONL-file.
sample_key (str): The key to access the sample in the BatchEncoding.
load_index (bool, optional): Flag indicating whether to load the index. Defaults to True.
Comment from a Member:

Wouldn't it be more consistent if this defaulted to False like in PackedMemMapDatasetContinuous (see line 308)?

Reply from @le1nux (Member Author):

In PackedMemMapDatasetContinuous we would never load the index (apart from debugging purposes), which is why I made it default to False. The "Continuous" implementation does not need an index. The PackedMemMapDatasetBase, however, in its default implementation uses the index for packing the data, which is why it defaults to True.

Comment from a Member on CHANGELOG_DEV.md (outdated):

General comments on the changelog:

  • The table in the beginning also needs to be updated
  • It might be useful to reverse the order of PRs so that the newest ones come first

Further review comments (outdated, resolved):
  • src/modalities/dataloader/dataset.py (3 comments)
  • tests/dataloader/samplers/test_distributed_samplers.py (4 comments)