
Avoid using skip() in hf_datasets #838

Open
wants to merge 6 commits into main

Conversation

Contributor

@mori360 mori360 commented Feb 12, 2025

Fix issue #809

The current self._data.skip(self._sample_idx) does not return the correct data for the c4 dataset.
Until a proper fix for .skip() lands, we switch to advancing the iterator with next() instead.
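
A minimal sketch of the next()-based approach (illustrative, not the exact diff; it assumes the _get_data_iter method of the HuggingFace dataset wrapper quoted in the review below):

def _get_data_iter(self):
    # A map-style Dataset that is already exhausted resumes as an empty iterator.
    if isinstance(self._data, Dataset) and self._sample_idx == len(self._data):
        return iter([])

    it = iter(self._data)
    # Replay the first self._sample_idx samples one by one. This is O(sample_idx),
    # hence the warning below about slow resumes, but it yields exactly the samples
    # the original run saw.
    for _ in range(self._sample_idx):
        next(it)
    return it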

Test plan:
We reproduce #809 by resuming from a checkpoint at step 500, then compare the loss curves under 3 conditions:

  1. the original curve, running from step 0 to 750
  2. the resumed curve, still using .skip()
  3. the resumed curve, switched to next() with this PR's change
[Screenshot (2025-02-12): loss curves for the three conditions above]

Warning
For the c4 dataset, if we resume from a large enough step, next() is called self._sample_idx times, so resuming from a checkpoint is much slower than using .skip().

Next step:
add unit tests:

  1. check that the state_dict round-trips identically through dcp.save/load and torch.save/load
  2. test the difference between next() and .skip() (see the sketch after this list)
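
For item 2, a hypothetical test along these lines (the synthetic dataset and test name are illustrative, not the actual torchtitan test suite) could compare the two resume strategies on a small in-memory dataset:

from datasets import Dataset

def test_next_matches_skip_on_iterable_dataset():
    # Tiny synthetic dataset so the expected resumed stream is known exactly.
    ds = Dataset.from_dict({"text": [f"sample {i}" for i in range(10)]})
    iterable = ds.to_iterable_dataset()

    sample_idx = 4  # pretend 4 samples were consumed before checkpointing

    # Resume by replaying with next(), as in this PR.
    it = iter(iterable)
    for _ in range(sample_idx):
        next(it)
    resumed_with_next = [row["text"] for row in it]

    # Resume with IterableDataset.skip(), the previous approach.
    resumed_with_skip = [row["text"] for row in iterable.skip(sample_idx)]

    expected = [f"sample {i}" for i in range(sample_idx, 10)]
    assert resumed_with_next == expected
    assert resumed_with_skip == expected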

@facebook-github-bot facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) Feb 12, 2025
@mori360 mori360 marked this pull request as ready for review February 12, 2025 20:34
@mori360 mori360 requested review from tianyu-l and fegin February 12, 2025 20:34
if isinstance(self._data, Dataset) and self._sample_idx == len(self._data):
    return iter([])

return iter(self._data.skip(self._sample_idx))
Contributor

I think we need to understand whether skip() causes errors for both map-style and iterable datasets, or only in the newly added IterableDataset case.
If it's the latter, we should just revert #521 rather than universally use next() for both, because that would make the healthy case slow too.
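
To illustrate the distinction raised here (assumed datasets usage, not code from this PR): a map-style Dataset can resume cheaply by indexing into the remaining rows, so replaying with next() mainly penalizes the streaming IterableDataset path.

from datasets import Dataset

ds = Dataset.from_dict({"text": [f"sample {i}" for i in range(10)]})
sample_idx = 4

# Map-style: resume by selecting the remaining indices; no replay needed.
remaining = ds.select(range(sample_idx, len(ds)))

# Streaming: must walk past the first sample_idx rows one by one.
it = iter(ds.to_iterable_dataset())
for _ in range(sample_idx):
    next(it)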

Contributor

I would suggest that we land this PR first. It is better to have a slower checkpoint resume than a silent accuracy failure, and this is blocking several accuracy verifications. At the very least, we should make the default C4 dataset work for now.

@tianyu-l tianyu-l linked an issue Feb 13, 2025 that may be closed by this pull request
Contributor

@tianyu-l tianyu-l left a comment

stamp to unblock, but we should follow up with more robust tests.

Successfully merging this pull request may close these issues.

Loss metrics dramatically change after resuming from checkpoint