
Stepwise LR-Scheduler not working across epochs #17544

Open
maltesilber opened this issue May 2, 2023 · 3 comments · May be fixed by #20248
Labels
bug Something isn't working help wanted Open to be worked on lr scheduler ver: 2.0.x
Comments

maltesilber commented May 2, 2023

Bug description

Description

I'm training a model for a fixed number of iterations instead of a fixed number of epochs. The same model is trained on datasets of different sizes, so the number of iterations per epoch varies. Say I want to train a model for 900 iterations, which corresponds to 90 epochs on one of the datasets, and step the learning rate at iterations 300 and 600. To my understanding this is not natively possible in PyTorch Lightning.
I know that I can change the lr scheduler interval to "step" and then set the frequency, like so:

'lr_scheduler': {"scheduler": sched, "interval": "step", "frequency": 300}

However, this only counts steps within a single epoch. If the frequency is larger than the number of iterations per epoch, the scheduler is never stepped. I would expect scheduler.step() to be called every frequency steps across epoch boundaries.
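One possible workaround until this is supported (a minimal sketch, not an official Lightning pattern; the class name and the step_every_n_global_steps argument are made up for illustration): create the scheduler in configure_optimizers but keep the reference yourself instead of returning it, then step it from on_train_batch_end based on trainer.global_step, which counts optimizer steps across all epochs. Note that a scheduler handled this way is not registered with Lightning, so its state is not saved to checkpoints automatically.

import torch
from pytorch_lightning import LightningModule


class ManualSchedulerModule(LightningModule):
    def __init__(self, step_every_n_global_steps=300):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)
        self.step_every_n_global_steps = step_every_n_global_steps
        self._scheduler = None  # created in configure_optimizers, stepped manually

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        return self(batch).sum()

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.layer.parameters(), lr=0.1)
        # Keep the scheduler out of the returned config so Lightning never steps it.
        self._scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return optimizer

    def on_train_batch_end(self, outputs, batch, batch_idx):
        # global_step counts optimizer steps across all epochs, so this fires
        # every step_every_n_global_steps batches regardless of epoch length.
        if self.trainer.global_step % self.step_every_n_global_steps == 0:
            self._scheduler.step()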

What version are you seeing the problem on?

v2_0

How to reproduce the bug

import os

import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()

        # read the current learning rate from the underlying optimizer for logging
        for param_group in self.optimizers().optimizer.param_groups:
            lr = param_group['lr']
        self.log('lr', lr, prog_bar=True, on_step=True, on_epoch=False)
        return {"loss": loss}

    def configure_optimizers(self):
        opt = torch.optim.SGD(self.layer.parameters(), lr=0.1)
        scheduler = torch.optim.lr_scheduler.StepLR(opt, 1)
        return {"optimizer": opt, 'lr_scheduler': {"scheduler": scheduler,
                                                   "interval": "step",
                                                   "frequency": 10}}


def run():
    train_data = DataLoader(RandomDataset(32, 32), batch_size=8)  # 32 samples / batch size 8 = 4 batches per epoch

    model = BoringModel()
    trainer = Trainer(
        accelerator='cpu',
        default_root_dir=os.getcwd(),
        num_sanity_val_steps=0,
        max_epochs=-1,
        max_steps=30,  # ~7.5 epochs; with frequency=10 but only 4 batches per epoch the scheduler never steps
        log_every_n_steps=1
    )
    trainer.fit(model, train_dataloaders=train_data)


if __name__ == "__main__":
    run()
@maltesilber maltesilber added bug Something isn't working needs triage Waiting to be triaged by maintainers labels May 2, 2023
z13670 commented Jun 9, 2023

same problem

maltesilber (Author) commented Jul 25, 2023

Solved it using the LambdaLR scheduler. First define a function that returns the lr multiplier for your schedule:

def step_decay(step_size, gamma):
    # LambdaLR multiplies the optimizer's initial lr by the value returned here,
    # so return a multiplicative factor rather than an absolute learning rate.
    def fn(step):
        return gamma ** (step // step_size)
    return fn

And configure the optimizer so the function gets called on every step:

def configure_optimizers(self):
    lr = 0.5
    optimizer = torch.optim.SGD(self.layer.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer=optimizer,
        lr_lambda=step_decay(step_size=10, gamma=0.1)
    )
    return [optimizer], [{'scheduler': scheduler, 'interval': 'step'}]
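As a quick sanity check of the factor semantics (a standalone snippet, separate from the training code above): LambdaLR multiplies the optimizer's initial lr by whatever the lambda returns, so with lr=0.5, step_size=10 and gamma=0.1 the learning rate should go 0.5, then 0.05, then 0.005.

import torch

params = [torch.nn.Parameter(torch.zeros(1))]
opt = torch.optim.SGD(params, lr=0.5)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lambda step: 0.1 ** (step // 10))

lrs = []
for _ in range(25):
    lrs.append(opt.param_groups[0]['lr'])  # record the lr used for this step
    opt.step()
    sched.step()

print(lrs[0], lrs[10], lrs[20])  # 0.5 0.05 0.005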

@awaelchli awaelchli added help wanted Open to be worked on lr scheduler and removed needs triage Waiting to be triaged by maintainers labels Nov 25, 2023
awaelchli (Contributor) commented Nov 25, 2023

I think it is a reasonable ask for the frequency parameter to apply across epoch boundaries. This is an easy change; in this line of code
https://github.com/Lightning-AI/lightning/blob/af852ff5908e9a99917eeeff05bb4536dbb1cade/src/lightning/pytorch/loops/training_epoch_loop.py#L363

self.batch_idx would have to be changed to self.total_batch_idx, that's all. Anyone from the community is free to contribute this change.
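To illustrate the difference (a toy simulation with made-up names, not the actual loop code): with the reproduction above there are 4 batches per epoch and a frequency of 10, so a check keyed on the per-epoch batch index never fires, while one keyed on the total batch index fires every 10 batches.

def scheduler_step_points(num_epochs, batches_per_epoch, frequency, use_total_idx):
    # Simulate the frequency check across epochs and record where it fires.
    points = []
    total_batch_idx = 0
    for epoch in range(num_epochs):
        for batch_idx in range(batches_per_epoch):
            current_idx = (total_batch_idx if use_total_idx else batch_idx) + 1
            if current_idx % frequency == 0:
                points.append((epoch, batch_idx))
            total_batch_idx += 1
    return points


# 4 batches per epoch, frequency 10 (matches the reproduction above):
print(scheduler_step_points(8, 4, 10, use_total_idx=False))  # [] -> scheduler never steps
print(scheduler_step_points(8, 4, 10, use_total_idx=True))   # fires after the 10th, 20th and 30th batch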

@awaelchli awaelchli added this to the 2.1.x milestone Nov 25, 2023
@awaelchli awaelchli modified the milestones: 2.1.x, 2.2.x Feb 8, 2024
@awaelchli awaelchli modified the milestones: 2.2.x, 2.3.x Jun 13, 2024
@awaelchli awaelchli modified the milestones: 2.3.x, 2.4.x Aug 7, 2024
@falckt falckt linked a pull request Sep 5, 2024 that will close this issue
@lantiga lantiga mentioned this issue Oct 7, 2024