-
Notifications
You must be signed in to change notification settings - Fork 270
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid using skip() in hf_datasets #838
base: main
Are you sure you want to change the base?
Conversation
if isinstance(self._data, Dataset) and self._sample_idx == len(self._data): | ||
return iter([]) | ||
|
||
return iter(self._data.skip(self._sample_idx)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to understand if skip
causes error in both map-style and Iterable datasets, or only in the newly added IterableDataset case.
If it's the latter we should just revert #521, rather than universally use next
for both, because it would make the healthy case slow too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest that we land the PR first. It is better to have a slower checkpoint resume than an incorrect silent accuracy failure. It's blocking several accuracy verifications. Or at least we should make the default C4 dataset work for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stamp to unblock, but we should follow up with more robust tests.
Fix issue #809
The current
self._data.skip(self._sample_idx)
could not get the correct data for c_4 dataset.Thus we switch to next() first before the fix is landed.
Test plan:
We reproduce the #809 by resuming from checkpoint at step 500, then compare the loss curve in 3 conditions:
Warning
for c_4 dataset, if we resume from a large enough step, we call next() for
self._sample_idx
times, resuming from checkpoint would be much slower than using .skip()Next step:
add unit test: