
NotImplementedError for iterator in IterableData class while debugging CodonTransformer finetuning #17

Open

Cauwth opened this issue Dec 25, 2024 · 8 comments
Labels: bug (Something isn't working)

@Cauwth commented Dec 25, 2024

I am trying to implement a subclass of IterableData to iterate over a JSON file to finetune the model, but I am encountering an error. The IterableData class has an abstract iterator method that is supposed to be implemented in subclasses. However, I am unsure how to correctly implement the iterator method in my IterableJSONData class.

I am not using SLURM.

```python
train_data = IterableJSONData(args.dataset_dir)
```

and the error is:

```
Exception has occurred: NotImplementedError
Caught NotImplementedError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/tianhao/miniconda3/envs/CodonTransformer/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 291, in _worker_loop
    fetcher = _DatasetKind.create_fetcher(
  File "/home/tianhao/miniconda3/envs/CodonTransformer/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 81, in create_fetcher
    return _utils.fetch._IterableDatasetFetcher(
  File "/home/tianhao/miniconda3/envs/CodonTransformer/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 22, in __init__
    self.dataset_iter = iter(dataset)
  File "/data/wth/plant_protein/CodonTransformer/CodonTransformer/CodonUtils.py", line 541, in __iter__
    return itertools.islice(self.iterator, worker_rk, None, worker_nb)
  File "/data/wth/plant_protein/CodonTransformer/CodonTransformer/CodonUtils.py", line 517, in iterator
    raise NotImplementedError
NotImplementedError
```
How should I implement the iterator method in the IterableJSONData subclass so that it properly reads the JSON file line by line and handles multi-process data loading?

I have tried adding code like this to the IterableJSONData class:

[image]

but got another error:

```
Exception has occurred: ValueError
Expected positive integer total_steps, but got -1
  File "/data/wth/plant_protein/CodonTransformer/finetune.py", line 87, in configure_optimizers
    "scheduler": torch.optim.lr_scheduler.OneCycleLR(
  File "/data/wth/plant_protein/CodonTransformer/finetune.py", line 167, in main
    trainer.fit(harnessed_model, data_loader)
  File "/data/wth/plant_protein/CodonTransformer/finetune.py", line 231, in <module>
    main(args)
ValueError: Expected positive integer total_steps, but got -1
```

@gui11aume (Collaborator)

Hi @Cauwth, and thanks for raising the issue. It looks to me like some part of the code is missing. The code was taken from this repo, where IterableJSONData overrides the iterator method to implement it. Would you try replacing the code with the one I pointed to and see if it works out of the box?
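For reference, an override along these lines should satisfy the base class (an untested sketch: it assumes the data is stored as newline-delimited JSON and that the base class exposes iterator as a property, as the traceback above suggests; the `data_path` constructor argument is illustrative):

```python
import json

from CodonTransformer.CodonUtils import IterableData


class IterableJSONData(IterableData):
    """Streams records from a JSON-lines file, one example per line."""

    def __init__(self, data_path, **kwargs):
        # `data_path` is a hypothetical argument name; match it to the
        # actual constructor call used in finetune.py.
        super().__init__(**kwargs)
        self.data_path = data_path

    @property
    def iterator(self):
        # The base class's __iter__ shards this stream across DataLoader
        # workers via itertools.islice, so yielding lazily keeps memory
        # use flat in multi-process loading.
        with open(self.data_path, "r") as f:
            for line in f:
                yield json.loads(line)
```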

@Cauwth (Author) commented Dec 26, 2024

Thank you for your suggestion! I have updated the iterator method in IterableJSONData based on the recommended repository. The updated code is as follows:
[image]

In debug mode, the data reading process seems to work correctly. However, the error still occurs here:

[image]

I was wondering: should I manually compute the total_steps value?

@gui11aume (Collaborator)

Yes, exactly! An iterable dataset is just a stream, so there is no way for the data loader to know how many steps there are. That is not always an issue; you can train until the stream is exhausted, but a learning-rate scheduler needs a number of steps so that it knows when to warm up and when to decay. It's just a matter of specifying the value of total_steps in your case.
You can compute it from the number of examples you have (n_examples), the batch size (batch_size), the number of GPUs (n_gpus), and the gradient accumulation factor (gradient_accumulation) as n_examples / (batch_size * n_gpus * gradient_accumulation). If I remember correctly, you need to divide by gradient_accumulation because the learning rate is updated only on stepping batches, i.e., batches where the backpropagation step is applied.
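
In code, the computation might look like this (all numbers are placeholders, and the tiny model exists only to make the snippet self-contained):

```python
import torch

# Placeholder values; substitute your own run configuration.
n_examples = 100_000
batch_size = 32
n_gpus = 1
gradient_accumulation = 4

# One scheduler step happens per `gradient_accumulation` batches, and
# each batch is spread across `n_gpus` devices.
total_steps = n_examples // (batch_size * n_gpus * gradient_accumulation)

# The value is then handed to the scheduler, e.g. in configure_optimizers:
model = torch.nn.Linear(8, 8)  # stand-in model so the snippet runs
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-4, total_steps=total_steps
)
```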

@Cauwth (Author) commented Dec 28, 2024

It works. Thank you so much!

@Cauwth Cauwth closed this as completed Dec 28, 2024
@gui11aume (Collaborator)

Thank you for raising the issue! We were not aware that there was a problem with the code. I will reopen the issue until we fix the code.

@gui11aume gui11aume reopened this Dec 28, 2024
@gui11aume (Collaborator)

@Adibvafa Can you prepare a pull request to fix issue #17?

@Cauwth (Author) commented Dec 29, 2024

After setting total_steps, the code does run, but sometimes the actual number of training steps exceeds the predefined maximum. This might be because batch_size does not evenly divide the dataset size. I couldn't find a way to resolve this, so I had to set total_steps to a value much larger than the calculated expectation.
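
One possible workaround (a sketch, not verified against this repo) is to round up rather than truncate when computing total_steps, so the final partial batch still counts as a scheduler step:

```python
import math

# Placeholder values; substitute your own run configuration.
n_examples = 100_000
batch_size = 32
n_gpus = 1
gradient_accumulation = 4

# math.ceil counts the last, partially filled batch as a full step, so
# the computed total_steps no longer undershoots the actual step count.
total_steps = math.ceil(n_examples / (batch_size * n_gpus * gradient_accumulation))
```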

@Adibvafa Adibvafa self-assigned this Jan 4, 2025
@Adibvafa Adibvafa added the bug Something isn't working label Jan 4, 2025
@Adibvafa (Owner) commented Jan 4, 2025

I will work on this over the weekend. Thank you for opening this issue!
