
Stepwise LR-Scheduler not working across epochs #17544

Open
maltesilber opened this issue May 2, 2023 · 3 comments · May be fixed by #20248
Labels
bug Something isn't working help wanted Open to be worked on lr scheduler ver: 2.0.x
Comments

maltesilber commented May 2, 2023

Bug description

Description

I'm training a model for a fixed number of iterations instead of a fixed number of epochs. The same model is trained on datasets of different sizes, so the number of iterations per epoch varies. Say I want to train a model for 900 iterations, which corresponds to 90 epochs on one of the datasets, and step the learning rate at iterations 300 and 600. To my understanding this is not natively possible in PyTorch Lightning.
I know that I can change the lr scheduler interval to "step" and then set the frequency, like so:

'lr_scheduler': {"scheduler": sched, "interval": "step", "frequency": 300}

However, this only counts steps within a single epoch. If the frequency is larger than the number of iterations per epoch, the scheduler is never stepped. I would expect scheduler.step() to be called every frequency steps across epoch boundaries.
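One possible workaround until this is supported (a minimal sketch, not an official Lightning pattern; the class name and the step_every_n_global_steps argument are made up for illustration): create the scheduler in configure_optimizers but keep the reference yourself instead of returning it, then step it from on_train_batch_end based on trainer.global_step, which counts optimizer steps across all epochs. Note that a scheduler handled this way is not registered with Lightning, so its state is not saved to checkpoints automatically.

import torch
from pytorch_lightning import LightningModule


class ManualSchedulerModule(LightningModule):
    def __init__(self, step_every_n_global_steps=300):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)
        self.step_every_n_global_steps = step_every_n_global_steps
        self._scheduler = None  # created in configure_optimizers, stepped manually

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        return self(batch).sum()

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.layer.parameters(), lr=0.1)
        # Keep the scheduler out of the returned config so Lightning never steps it.
        self._scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return optimizer

    def on_train_batch_end(self, outputs, batch, batch_idx):
        # global_step counts optimizer steps across all epochs, so this fires
        # every step_every_n_global_steps batches regardless of epoch length.
        if self.trainer.global_step % self.step_every_n_global_steps == 0:
            self._scheduler.step()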

What version are you seeing the problem on?

v2_0

How to reproduce the bug

import os

import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()

        # read the current learning rate from the underlying optimizer for logging
        for param_group in self.optimizers().optimizer.param_groups:
            lr = param_group['lr']
        self.log('lr', lr, prog_bar=True, on_step=True, on_epoch=False)
        return {"loss": loss}

    def configure_optimizers(self):
        opt = torch.optim.SGD(self.layer.parameters(), lr=0.1)
        scheduler = torch.optim.lr_scheduler.StepLR(opt, 1)
        return {"optimizer": opt, 'lr_scheduler': {"scheduler": scheduler,
                                                   "interval": "step",
                                                   "frequency": 10}}


def run():
    train_data = DataLoader(RandomDataset(32, 32), batch_size=8)  # 32 samples / batch size 8 = 4 batches per epoch

    model = BoringModel()
    trainer = Trainer(
        accelerator='cpu',
        default_root_dir=os.getcwd(),
        num_sanity_val_steps=0,
        max_epochs=-1,
        max_steps=30,  # ~7.5 epochs; with frequency=10 but only 4 batches per epoch the scheduler never steps
        log_every_n_steps=1
    )
    trainer.fit(model, train_dataloaders=train_data)


if __name__ == "__main__":
    run()
@maltesilber maltesilber added bug Something isn't working needs triage Waiting to be triaged by maintainers labels May 2, 2023
z13670 commented Jun 9, 2023

same problem

maltesilber (Author) commented Jul 25, 2023

Solved it using the LambdaLR scheduler. First define a function that returns the lr multiplier for your schedule:

def step_decay(step_size, gamma):
    # LambdaLR multiplies the optimizer's initial lr by the value returned here,
    # so return a multiplicative factor rather than an absolute learning rate.
    def fn(step):
        return gamma ** (step // step_size)
    return fn

And configure the optimizer so the function gets called on every step:

def configure_optimizers(self):
    lr = 0.5
    optimizer = torch.optim.SGD(self.layer.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer=optimizer,
        lr_lambda=step_decay(step_size=10, gamma=0.1)
    )
    return [optimizer], [{'scheduler': scheduler, 'interval': 'step'}]
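As a quick sanity check of the factor semantics (a standalone snippet, separate from the training code above): LambdaLR multiplies the optimizer's initial lr by whatever the lambda returns, so with lr=0.5, step_size=10 and gamma=0.1 the learning rate should go 0.5, then 0.05, then 0.005.

import torch

params = [torch.nn.Parameter(torch.zeros(1))]
opt = torch.optim.SGD(params, lr=0.5)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lambda step: 0.1 ** (step // 10))

lrs = []
for _ in range(25):
    lrs.append(opt.param_groups[0]['lr'])  # record the lr used for this step
    opt.step()
    sched.step()

print(lrs[0], lrs[10], lrs[20])  # 0.5 0.05 0.005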

@awaelchli awaelchli added help wanted Open to be worked on lr scheduler and removed needs triage Waiting to be triaged by maintainers labels Nov 25, 2023
awaelchli (Contributor) commented Nov 25, 2023

I think it is a reasonable ask for the frequency parameter to apply across epoch boundaries. This is an easy change; in this line of code
https://github.com/Lightning-AI/lightning/blob/af852ff5908e9a99917eeeff05bb4536dbb1cade/src/lightning/pytorch/loops/training_epoch_loop.py#L363

self.batch_idx would have to be changed to self.total_batch_idx, that's all. Anyone from the community is free to contribute this change.
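To illustrate the difference (a toy simulation with made-up names, not the actual loop code): with the reproduction above there are 4 batches per epoch and a frequency of 10, so a check keyed on the per-epoch batch index never fires, while one keyed on the total batch index fires every 10 batches.

def scheduler_step_points(num_epochs, batches_per_epoch, frequency, use_total_idx):
    # Simulate the frequency check across epochs and record where it fires.
    points = []
    total_batch_idx = 0
    for epoch in range(num_epochs):
        for batch_idx in range(batches_per_epoch):
            current_idx = (total_batch_idx if use_total_idx else batch_idx) + 1
            if current_idx % frequency == 0:
                points.append((epoch, batch_idx))
            total_batch_idx += 1
    return points


# 4 batches per epoch, frequency 10 (matches the reproduction above):
print(scheduler_step_points(8, 4, 10, use_total_idx=False))  # [] -> scheduler never steps
print(scheduler_step_points(8, 4, 10, use_total_idx=True))   # fires after the 10th, 20th and 30th batch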

@awaelchli awaelchli added this to the 2.1.x milestone Nov 25, 2023
@awaelchli awaelchli modified the milestones: 2.1.x, 2.2.x Feb 8, 2024
@awaelchli awaelchli modified the milestones: 2.2.x, 2.3.x Jun 13, 2024
@awaelchli awaelchli modified the milestones: 2.3.x, 2.4.x Aug 7, 2024
@falckt falckt linked a pull request Sep 5, 2024 that will close this issue
@lantiga lantiga mentioned this issue Oct 7, 2024