load data sequence is confusing #20358

workhours · 2024-10-22T14:08:12Z

Bug description

As I understand it, Lightning consumes data in this order:
1. sanity check: calls `val_dataloader`
2. training: calls `train_dataloader`
3. validation: calls `val_dataloader`

From this sequence, I understand that an epoch cycle starts at `val_dataloader` and ends at `train_dataloader`, and that the validation in step 3 reuses the data from the first `val_dataloader` call.
But check `trainer.current_epoch`: suppose `current_epoch` is 1 at the sanity-check `val_dataloader` call; it then increases to 2 at `train_dataloader`. In that case the epoch cycle seems to start at `train_dataloader` and end at `val_dataloader`.
This makes it unclear how to write `val_dataloader` when loading data dynamically. With an infinite number of epochs there is no problem. But at the last epoch (which I cannot tell is the last one), should I accept that the validation data is None, or should I try to load it as if another cycle were coming?

I think the sanity check logic and the validation logic should be merged into a single data setup that is used twice for different purposes. Calling `val_dataloader` twice and `train_dataloader` once also makes data loading difficult to manage.
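
To make the problem concrete, here is a minimal sketch of one possible workaround: key the dynamically loaded data by epoch number, so that it no longer matters which hook asks first. It is plain Python and does not use Lightning; `EpochKeyedData`, `fake_load`, and the `load_calls` counter are hypothetical names made up for illustration, and it assumes the epoch number passed in is a reliable key.

```python
from typing import Callable, Dict, List


class EpochKeyedData:
    """Hypothetical helper: loads an epoch's data once, whichever hook asks first."""

    def __init__(self, load_fn: Callable[[int], Dict[str, List[int]]]) -> None:
        self._load_fn = load_fn  # user-supplied loader: epoch -> {"train": ..., "val": ...}
        self._cache: Dict[int, Dict[str, List[int]]] = {}
        self.load_calls = 0  # counts real loads, for illustration only

    def get(self, epoch: int, split: str) -> List[int]:
        if epoch not in self._cache:
            # Keep only the requested epoch so memory does not grow over time.
            self._cache = {epoch: self._load_fn(epoch)}
            self.load_calls += 1
        return self._cache[epoch][split]


def fake_load(epoch: int) -> Dict[str, List[int]]:
    # Stand-in for real dynamic loading (files, network, ...).
    return {"train": [epoch * 10 + i for i in range(3)], "val": [epoch * 100]}


data = EpochKeyedData(fake_load)
# Simulated call order from the description: val (sanity check), train, val (validate).
calls = [data.get(0, "val"), data.get(0, "train"), data.get(0, "val")]
```

Inside a module, both `val_dataloader` and `train_dataloader` could then call something like `data.get(self.trainer.current_epoch, "val")`. Whether `current_epoch` has the value one expects at each of those calls is exactly what this issue asks about, so treat that part as an assumption.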

What version are you seeing the problem on?

v2.4

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment

#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):

More info

No response


workhours · 2024-10-22T14:29:27Z

Sorry for submitting this multiple times; the firewall here got in the way.
By the way, `setup_data` is called in both `fit_loop.on_run_start` and `on_advance_start`. Why twice? Is `on_advance_start` the real start of training?

workhours · 2024-10-22T14:37:07Z

The simple scenario: if the user wants to feed data for the next epoch, just provide a single callable interface, one for all types of data (val, train, test, predict, ...), which gives a clear message: if it is called again, it must be a request for the next epoch's data.
As it stands, the framework is so flexible that it is difficult to write data-moving logic in `val_dataloader`, `train_dataloader`, etc.
Alternatively, the framework should provide a clear notification that the current epoch has ended and no more data will be requested.
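
A rough sketch of the kind of interface meant here, in plain Python. This is not an existing Lightning API: `EpochFeed`, `max_epochs` as used here, and the returned dictionary layout are all hypothetical. Each call means "give me the data for the next epoch", and `None` is the explicit signal that no more data will be requested.

```python
from typing import Dict, List, Optional

Splits = Dict[str, List[int]]


class EpochFeed:
    """Hypothetical one-callable feed: every call requests the next epoch's data."""

    def __init__(self, max_epochs: int) -> None:
        self.max_epochs = max_epochs
        self.epoch = -1  # no epoch has been requested yet

    def __call__(self) -> Optional[Splits]:
        # Return all splits for the next epoch, or None once the run is over.
        if self.epoch + 1 >= self.max_epochs:
            return None
        self.epoch += 1
        base = self.epoch * 10  # dummy data derived from the epoch number
        return {"train": [base, base + 1], "val": [base + 2]}


feed = EpochFeed(max_epochs=2)
seen = []
while (splits := feed()) is not None:
    # One request per epoch covers every split, so there is no ordering to guess.
    seen.append(splits)
```

With a single entry point like this, the "is this the last epoch?" question has one answer in one place, instead of being spread across several dataloader hooks.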

workhours added the bug and needs triage labels Oct 22, 2024

github-actions bot added the ver: 2.4.x label Oct 22, 2024
