The trainer runs a single validation step after resume (not sanity) #18110
-
Seeing a similar behavior with this version as well.
-
I debugged the phenomenon step by step. The issue is that while the loops are in the restarting state, 0 steps of training and 1 step of validation are performed. Fixing this properly would mean correcting the restart logic, which seems to require a deep understanding of the PL code (I gave up and decided to tolerate this 1-step validation). After a checkpoint is loaded and training begins, the relevant code path is:
- pytorch-lightning/src/lightning/pytorch/loops/training_epoch_loop.py, lines 199 to 201 (at 1439da4): the check at these lines runs first.
- pytorch-lightning/src/lightning/pytorch/loops/fit_loop.py, lines 197 to 200 (at 1439da4): next, the fit loop enters the validation loop.
- pytorch-lightning/src/lightning/pytorch/loops/evaluation_loop.py, lines 147 to 148 (at 1439da4): the EvaluationLoop then runs the single validation step.
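To observe this from the outside, a small callback along the following lines can print what the trainer is doing when validation starts after a resume. This is only a sketch based on the Lightning 2.x hook signatures; the ResumeProbe name and the assumption that fit_loop.restarting is still set at that point are mine, not something stated in this thread.

```python
import lightning.pytorch as pl


class ResumeProbe(pl.Callback):
    """Sketch of a probe that reports whether validation starts before any
    new training batch has run in the current session (assumed Lightning 2.x)."""

    def __init__(self):
        self.train_batches_seen = 0

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # Count only batches executed in this session, not restored progress.
        self.train_batches_seen += 1

    def on_validation_start(self, trainer, pl_module):
        if trainer.sanity_checking:
            return
        print(
            f"validation starting: restarting={trainer.fit_loop.restarting}, "
            f"new train batches this session={self.train_batches_seen}, "
            f"global_step={trainer.global_step}"
        )
```

Passing `callbacks=[ResumeProbe()]` to the Trainer and resuming from a mid-epoch checkpoint should then show a validation start with zero new training batches.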
-
Hello :wave: @cdancette, I am encountering the same issue: a single validation step is performed after a mid-epoch checkpoint is loaded, and the stage is not reported as a sanity check by the Trainer. Is there any news or an open issue related to this discussion?
-
It seems this issue was already present long ago (#11504) and it was even fixed (#11552), but looking at the current code, the fix appears to be gone... The fix in version 1.5.10 (lines 534-535):
- pytorch-lightning/pytorch_lightning/loops/epoch/training_epoch_loop.py, lines 530 to 538 (at 9ebdc52)
The same code in the next version, 1.6.0, and later versions no longer has the fix:
- pytorch-lightning/pytorch_lightning/loops/epoch/training_epoch_loop.py, lines 522 to 528 (at 44e3edb)
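For intuition only, here is a toy model of why dropping such a guard matters. This is not Lightning's actual code; it only mimics the decision "should mid-epoch validation run now?" with and without a restart guard, under the assumption that a resumed loop wakes up with its progress counters sitting exactly on a val-check boundary.

```python
# Toy model, not Lightning internals: decide whether to trigger mid-epoch
# validation given how many training batches the (possibly restored) progress
# counters say have run in this epoch.
def should_run_validation(batches_done_this_epoch: int,
                          val_check_batch: int,
                          restarting: bool,
                          guard_on_restart: bool) -> bool:
    at_check_boundary = (
        batches_done_this_epoch > 0
        and batches_done_this_epoch % val_check_batch == 0
    )
    if guard_on_restart and restarting:
        # With the guard, a freshly restored loop skips the check until it has
        # actually trained again in this session.
        return False
    return at_check_boundary


if __name__ == "__main__":
    # Resuming from a checkpoint saved exactly at a val-check boundary:
    print(should_run_validation(50, 50, restarting=True, guard_on_restart=True))   # False
    print(should_run_validation(50, 50, restarting=True, guard_on_restart=False))  # True -> spurious run
```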
-
Facing the same issue. It is particularly annoying when using the checkpoint callback with save_top_k, since it may then track a wrong "best metric".
-
I'm also facing this issue.
-
Also facing the same issue. I think fixing this correctly is a bit tricky, so I just ended up with a dirty workaround.
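For reference, one shape such a workaround can take is to guard the epoch-level logging inside the LightningModule. The sketch below is not the poster's actual code; it assumes Lightning 2.x, manual aggregation of validation outputs, and that trainer.fit_loop.restarting is still True during the spurious run (worth verifying on your version). The _shared_step helper is hypothetical.

```python
import torch
import lightning.pytorch as pl


class MyModel(pl.LightningModule):
    """Sketch: skip logging for the partial validation run triggered by resuming."""

    def __init__(self):
        super().__init__()
        self._val_losses = []

    def validation_step(self, batch, batch_idx):
        loss = self._shared_step(batch)  # hypothetical helper computing the loss
        self._val_losses.append(loss.detach())
        return loss

    def on_validation_epoch_end(self):
        losses, self._val_losses = self._val_losses, []
        # Skip both sanity checks and the one-batch run right after a resume,
        # assuming fit_loop.restarting is still set at this point.
        if self.trainer.sanity_checking or self.trainer.fit_loop.restarting:
            return
        self.log("val/loss", torch.stack(losses).mean())
```

If the restarting flag is already cleared by the time this hook runs on your version, an alternative is to count the validation batches actually seen and only log when the full validation set was processed.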
-
I also encountered this problem. Is there a GitHub issue open for this? It definitely deserves attention.
-
This is very annoying and deserves more attention.
-
I am encountering a weird behavior using Lightning (I use the LightningCLI as well).
When I resume a training run that has failed, with python train.py fit, the trainer runs a single validation step that is not a sanity check (I disabled sanity checks, and the state of the trainer is RunningStage.VALIDATING). I run my main training with val_check_interval=0.5.
This is a problem, as it saves in my logs a metric point for this single batch as if it were an evaluation on the full validation set.
My metric curves look like this: when I resume, there is a wrong point (much higher here than the other ones).
Do you know what could cause this issue and how to get around it?
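For anyone trying to reproduce this outside a real project, a minimal sketch along the following lines should show the behavior. The BoringModel/BoringDataModule demo classes, the checkpoint settings, and the ckpts directory are my assumptions for a self-contained example, not the original setup: the first run stops right after a mid-epoch checkpoint is written, and the second run resumes from it with sanity checking disabled, as in the report.

```python
import lightning.pytorch as pl
from lightning.pytorch.callbacks import ModelCheckpoint
from lightning.pytorch.demos.boring_classes import BoringModel, BoringDataModule


def make_trainer(max_steps: int) -> pl.Trainer:
    # Force a mid-epoch checkpoint every 4 training steps.
    ckpt = ModelCheckpoint(dirpath="ckpts", save_last=True, every_n_train_steps=4)
    return pl.Trainer(
        max_steps=max_steps,
        limit_train_batches=8,
        limit_val_batches=4,
        val_check_interval=0.5,   # as in the report
        num_sanity_val_steps=0,   # sanity checking disabled, as in the report
        callbacks=[ckpt],
    )


# First run: stop mid-epoch, right after a checkpoint has been written.
make_trainer(max_steps=4).fit(BoringModel(), datamodule=BoringDataModule())

# Second run: resume from that mid-epoch checkpoint and keep training; the
# extra one-off validation pass should appear right after the restore.
make_trainer(max_steps=16).fit(
    BoringModel(), datamodule=BoringDataModule(), ckpt_path="ckpts/last.ckpt"
)
```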