
Facing issue with resuming training for saved dataset state (>1 epoch) #869

Open
rodosingh opened this issue Jan 28, 2025 · 2 comments
Labels
bug Something isn't working

Comments

@rodosingh

Environment

  • OS: [Ubuntu 22.04]
  • Hardware (GPU, or instance type): [MI300, ROCm==6.1.0]
  • NUM_NODES: [2]
  • GPUs/NODE: [8]

Context

  • When resuming training from a saved state (both the model checkpoint and the S3 dataset state) beyond the first epoch, an error is thrown instead of continuing from the point where the first epoch ended.
  • Please see the error below. I'm also using the choose and repeat functionality of the StreamingDataset class to downsample and upsample datasets, respectively, as required (the demo .yaml file is attached here as well).
0: [rank0]: Traceback (most recent call last):
0: [rank0]:   File "/home/<user>/LLaVA-NeXT/llava/train/train_mem.py", line 4, in <module>
0: [rank0]:     train()
0: [rank0]:   File "/home/<user>/LLaVA-NeXT/llava/train/train.py", line 2034, in train
0: [rank0]:     trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
0: [rank0]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1948, in train
0: [rank0]:     return inner_training_loop(
0: [rank0]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2246, in _inner_training_loop
0: [rank0]:     for step, inputs in enumerate(epoch_iterator):
0: [rank0]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/accelerate/data_loader.py", line 552, in __iter__
0: [rank0]:     current_batch = next(dataloader_iter)
0: [rank0]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
0: [rank0]:     data = self._next_data()
0: [rank0]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1344, in _next_data
0: [rank0]:     return self._process_data(data)
0: [rank0]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
0: [rank0]:     data.reraise()
0: [rank0]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_utils.py", line 706, in reraise
0: [rank0]:     raise exception
0: [rank0]: ValueError: Caught ValueError in DataLoader worker process 0.
0: [rank0]: Original Traceback (most recent call last):
0: [rank0]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
0: [rank0]:     data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
0: [rank0]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 33, in fetch
0: [rank0]:     data.append(next(self.dataset_iter))
0: [rank0]:   File "/home/<user>/LLaVA-NeXT/streaming/streaming/base/dataset.py", line 1501, in __iter__
0: [rank0]:     sample_ids = self._get_work(epoch, sample_in_epoch)
0: [rank0]:   File "/home/<user>/LLaVA-NeXT/streaming/streaming/base/dataset.py", line 1046, in _get_work
0: [rank0]:     epoch_sample_ids = generate_work(self.batching_method, self, p_world, epoch,
0: [rank0]:   File "/home/<user>/LLaVA-NeXT/streaming/streaming/base/batching/__init__.py", line 45, in generate_work
0: [rank0]:     return get(dataset, world, epoch, sample_in_epoch)
0: [rank0]:   File "/home/<user>/LLaVA-NeXT/streaming/streaming/base/batching/random.py", line 57, in generate_work_random_batching
0: [rank0]:     big_ids = get_partitions(dataset.partition_algo, dataset.epoch_size,
0: [rank0]:   File "/home/<user>/LLaVA-NeXT/streaming/streaming/base/partition/__init__.py", line 69, in get_partitions
0: [rank0]:     raise ValueError(f'Resuming further into the dataset ({drop_first}) than it has samples ' +
0: [rank0]: ValueError: Resuming further into the dataset (7824000) than it has samples (6468555)
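The failure reduces to the guard at the bottom of the traceback: the resume offset (7,824,000 samples) is larger than the epoch size (6,468,555 samples). A minimal sketch of that check (hypothetical names; the real guard lives in streaming/base/partition/__init__.py and receives the offset as `drop_first`):

```python
def check_resumption(drop_first: int, num_samples: int) -> None:
    """Refuse to resume further into an epoch than it has samples.

    drop_first  -- samples to skip, i.e. the saved offset within the epoch
    num_samples -- total samples in one epoch of the dataset
    """
    if drop_first > num_samples:
        raise ValueError(f'Resuming further into the dataset ({drop_first}) '
                         f'than it has samples ({num_samples})')

# Reproduces the failure with the numbers from the traceback above:
# check_resumption(7824000, 6468555)  # raises ValueError
```

Notably, 7,824,000 − 6,468,555 = 1,355,445, so the saved offset looks like it was never reduced modulo the epoch size when the second epoch began.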

Corresponding Yaml file that specifies path to S3 shards:

datasets:
- shard_path: 's3://object/data/shards/LLaVA_Stage2/VQA-RAD/'
- shard_path: 's3://object/data/shards/LLaVA_Stage2/infographic_vqa/'
- shard_path: 's3://object/data/shards/LLaVA_Stage2/iconqa/'
  choose: 1365
- shard_path: 's3://object/data/shards/LLaVA_Stage2/TabMWP/'  
- shard_path: 's3://object/data/shards/LLaVA_Stage2/scienceqa_nona_context/'  
  choose: 960
- shard_path: 's3://object/data/shards/LLaVA_Stage2/scienceqa_nona_context/'  
- shard_path: 's3://object/data/shards/LLaVA_Stage2/scienceqa/' 
  repeat: 2
- shard_path: 's3://object/data/shards/LLaVA_Stage2/ureader_kg/' 
- shard_path: 's3://object/data/shards/LLaVA_Stage2/aokvqa/' 
- shard_path: 's3://object/data/shards/LLaVA_Stage2/k12_printing/' 
  choose: 2566
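For context on how choose and repeat interact with the epoch size: as I understand it, choose caps a stream at a fixed per-epoch sample count, while repeat multiplies it. A rough sketch of that accounting (hypothetical helper with made-up sample counts; the real bookkeeping happens inside StreamingDataset):

```python
def effective_epoch_size(streams):
    """Estimate the per-epoch sample count of a mixed dataset.

    Each stream is a dict with 'samples' (raw shard count) and optionally
    'choose' (take exactly this many) or 'repeat' (upsample this many times).
    """
    total = 0
    for s in streams:
        n = s['samples']
        if 'choose' in s:
            n = s['choose']           # downsample: fixed per-epoch count
        elif 'repeat' in s:
            n = n * s['repeat']       # upsample: stream contributes n * repeat
        total += n
    return total

# Illustrative numbers only (not the real shard sizes):
streams = [
    {'samples': 100, 'choose': 40},
    {'samples': 10, 'repeat': 2},
    {'samples': 5},
]
# effective_epoch_size(streams) -> 40 + 20 + 5 = 65
```

The point is that the epoch size (6,468,555 in the traceback) is already the choose/repeat-adjusted total, so the resume offset should never legitimately exceed it.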

Can anyone please take a look at this issue?
If any further info is needed, let me know in the thread.

Thanks for your help!

@rodosingh rodosingh added the bug Something isn't working label Jan 28, 2025
@ethantang-db
Contributor

Just to confirm: do you only want to repeat shard_path: 's3://object/data/shards/LLaVA_Stage2/scienceqa/' twice?

@rodosingh
Author

Thanks @ethantang-db 🙌.

Yes, there are multiple such datasets; just for illustration I have provided this one example of repeat, along with a few of choose.
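Purely as a guess at the failure mode (this is not the library's actual resume logic): the offset in the traceback looks like a global sample counter that was never wrapped into the current epoch. A wrap like the following would keep the offset in range:

```python
def split_resume_state(global_samples_seen: int, epoch_size: int):
    """Hypothetical helper: convert a global sample counter into
    (epoch, sample_in_epoch) so the offset never exceeds the epoch size."""
    return divmod(global_samples_seen, epoch_size)

# With the numbers from the traceback:
# split_resume_state(7824000, 6468555) -> (1, 1355445)
# i.e. epoch 1, about 1.36M samples into the second epoch.
```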
