The trainer runs a single validation step after resume (not sanity) #18110
-
Seeing a similar behavior with this version as well.
-
I debugged the phenomenon step by step. The issue is that while the loops are in the restarting state, 0 steps of training and 1 step of validation are performed. Fixing this properly would mean correcting the restart logic, which seems to require a deep understanding of the PL code (I gave up and decided to tolerate this 1-step validation). After a checkpoint is loaded and training begins, the relevant code path is:
- pytorch-lightning/src/lightning/pytorch/loops/training_epoch_loop.py, lines 199 to 201 (at 1439da4): the check at these lines runs first.
- pytorch-lightning/src/lightning/pytorch/loops/fit_loop.py, lines 197 to 200 (at 1439da4): next, the fit loop enters the validation loop.
- pytorch-lightning/src/lightning/pytorch/loops/evaluation_loop.py, lines 147 to 148 (at 1439da4): the EvaluationLoop then runs the single validation step.
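To observe this from the outside, a small callback along the following lines can print what the trainer is doing when validation starts after a resume. This is only a sketch based on the Lightning 2.x hook signatures; the ResumeProbe name and the assumption that fit_loop.restarting is still set at that point are mine, not something stated in this thread.

```python
import lightning.pytorch as pl


class ResumeProbe(pl.Callback):
    """Sketch of a probe that reports whether validation starts before any
    new training batch has run in the current session (assumed Lightning 2.x)."""

    def __init__(self):
        self.train_batches_seen = 0

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # Count only batches executed in this session, not restored progress.
        self.train_batches_seen += 1

    def on_validation_start(self, trainer, pl_module):
        if trainer.sanity_checking:
            return
        print(
            f"validation starting: restarting={trainer.fit_loop.restarting}, "
            f"new train batches this session={self.train_batches_seen}, "
            f"global_step={trainer.global_step}"
        )
```

Passing `callbacks=[ResumeProbe()]` to the Trainer and resuming from a mid-epoch checkpoint should then show a validation start with zero new training batches.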
-
Hello :wave: @cdancette, I am encountering the same issue: a single validation step is performed after a mid-epoch checkpoint is loaded, and the stage is not reported as a sanity check by the Trainer. Is there any news or an open issue related to this discussion?
-
It seems this issue was already present long ago (#11504) and it was even fixed (#11552), but looking at the current code, the fix appears to be gone... The fix in version 1.5.10 (lines 534-535):
- pytorch-lightning/pytorch_lightning/loops/epoch/training_epoch_loop.py, lines 530 to 538 (at 9ebdc52)
The same code in the next version, 1.6.0, and later versions no longer has the fix:
- pytorch-lightning/pytorch_lightning/loops/epoch/training_epoch_loop.py, lines 522 to 528 (at 44e3edb)
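For intuition only, here is a toy model of why dropping such a guard matters. This is not Lightning's actual code; it only mimics the decision "should mid-epoch validation run now?" with and without a restart guard, under the assumption that a resumed loop wakes up with its progress counters sitting exactly on a val-check boundary.

```python
# Toy model, not Lightning internals: decide whether to trigger mid-epoch
# validation given how many training batches the (possibly restored) progress
# counters say have run in this epoch.
def should_run_validation(batches_done_this_epoch: int,
                          val_check_batch: int,
                          restarting: bool,
                          guard_on_restart: bool) -> bool:
    at_check_boundary = (
        batches_done_this_epoch > 0
        and batches_done_this_epoch % val_check_batch == 0
    )
    if guard_on_restart and restarting:
        # With the guard, a freshly restored loop skips the check until it has
        # actually trained again in this session.
        return False
    return at_check_boundary


if __name__ == "__main__":
    # Resuming from a checkpoint saved exactly at a val-check boundary:
    print(should_run_validation(50, 50, restarting=True, guard_on_restart=True))   # False
    print(should_run_validation(50, 50, restarting=True, guard_on_restart=False))  # True -> spurious run
```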
-
Facing the same issue. It is particularly annoying when using the checkpoint callback with save_top_k, since it may then track a wrong "best metric".
-
I'm also facing this issue.
-
Also facing the same issue. I think fixing this correctly is a bit tricky, so I just ended up with a dirty workaround.
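For reference, one shape such a workaround can take is to guard the epoch-level logging inside the LightningModule. The sketch below is not the poster's actual code; it assumes Lightning 2.x, manual aggregation of validation outputs, and that trainer.fit_loop.restarting is still True during the spurious run (worth verifying on your version). The _shared_step helper is hypothetical.

```python
import torch
import lightning.pytorch as pl


class MyModel(pl.LightningModule):
    """Sketch: skip logging for the partial validation run triggered by resuming."""

    def __init__(self):
        super().__init__()
        self._val_losses = []

    def validation_step(self, batch, batch_idx):
        loss = self._shared_step(batch)  # hypothetical helper computing the loss
        self._val_losses.append(loss.detach())
        return loss

    def on_validation_epoch_end(self):
        losses, self._val_losses = self._val_losses, []
        # Skip both sanity checks and the one-batch run right after a resume,
        # assuming fit_loop.restarting is still set at this point.
        if self.trainer.sanity_checking or self.trainer.fit_loop.restarting:
            return
        self.log("val/loss", torch.stack(losses).mean())
```

If the restarting flag is already cleared by the time this hook runs on your version, an alternative is to count the validation batches actually seen and only log when the full validation set was processed.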
-
I also encountered this problem. Is there a GitHub issue open for this? It definitely deserves attention.
-
This is very annoying and deserves more attention.
-
I am encountering a weird behavior using Lightning (I use the LightningCLI as well).
When I resume a training run that has failed, with python train.py fit, the trainer runs a single validation step that is not a sanity check (I disabled sanity checks, and the state of the trainer is RunningStage.VALIDATING). I run my main training with val_check_interval=0.5.
This is a problem, as it saves in my logs a metric point for this single batch as if it were an evaluation on the full validation set.
My metric curves look like this: when I resume, there is a wrong point (much higher here than the other ones).
Do you know what could cause this issue and how to get around it?
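For anyone trying to reproduce this outside a real project, a minimal sketch along the following lines should show the behavior. The BoringModel/BoringDataModule demo classes, the checkpoint settings, and the ckpts directory are my assumptions for a self-contained example, not the original setup: the first run stops right after a mid-epoch checkpoint is written, and the second run resumes from it with sanity checking disabled, as in the report.

```python
import lightning.pytorch as pl
from lightning.pytorch.callbacks import ModelCheckpoint
from lightning.pytorch.demos.boring_classes import BoringModel, BoringDataModule


def make_trainer(max_steps: int) -> pl.Trainer:
    # Force a mid-epoch checkpoint every 4 training steps.
    ckpt = ModelCheckpoint(dirpath="ckpts", save_last=True, every_n_train_steps=4)
    return pl.Trainer(
        max_steps=max_steps,
        limit_train_batches=8,
        limit_val_batches=4,
        val_check_interval=0.5,   # as in the report
        num_sanity_val_steps=0,   # sanity checking disabled, as in the report
        callbacks=[ckpt],
    )


# First run: stop mid-epoch, right after a checkpoint has been written.
make_trainer(max_steps=4).fit(BoringModel(), datamodule=BoringDataModule())

# Second run: resume from that mid-epoch checkpoint and keep training; the
# extra one-off validation pass should appear right after the restore.
make_trainer(max_steps=16).fit(
    BoringModel(), datamodule=BoringDataModule(), ckpt_path="ckpts/last.ckpt"
)
```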