
NaNs during training #1

Open
eps696 opened this issue Feb 9, 2020 · 7 comments
eps696 commented Feb 9, 2020

Thank you for this very interesting repo.
I got some issues with training though: the parameters (codebook values for motion, total_loss for appearance) drop to NaN within a few epochs (5-20) when trained with the default hyperparameters.
Increasing the learning rate to 0.5~1e-4 eliminates this behaviour, but that doesn't look like a real solution.
Did you encounter such issues in your practice, and do you have any advice on how to sort this out?

@olegkhomenko

+1
Same problem during both motion and appearance model fine-tuning

@olegkhomenko

@eps696 there is a bug if you are using pytorch >= 1.3.0: replace line 142 in train.py (and in test.py as well)

y = F.grid_sample(frame1, flow.permute(0,2,3,1), padding_mode="border")

with

y = F.grid_sample(frame1, flow.permute(0,2,3,1), padding_mode="border", align_corners=True)
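For context: PyTorch 1.3.0 changed the default of align_corners in F.grid_sample from the old implicit True to False, which silently shifts how normalized grid coordinates in [-1, 1] map to pixel positions. A minimal sketch of the two mapping conventions (pure Python, no torch needed; the unnormalize helper is hypothetical, written here only to illustrate the two formulas):

```python
def unnormalize(coord, size, align_corners):
    """Map a normalized grid coordinate in [-1, 1] to a pixel index,
    mirroring the two conventions used by F.grid_sample."""
    if align_corners:
        # -1 -> 0, +1 -> size - 1: extreme coords hit the corner pixels exactly
        return (coord + 1) / 2 * (size - 1)
    # -1 -> -0.5, +1 -> size - 0.5: extreme coords land on the image edges,
    # half a pixel outside the outermost pixel centers
    return ((coord + 1) * size - 1) / 2

# For a width-8 image the two conventions disagree at the borders:
print(unnormalize(-1.0, 8, True))   # 0.0
print(unnormalize(-1.0, 8, False))  # -0.5
print(unnormalize(1.0, 8, True))    # 7.0
print(unnormalize(1.0, 8, False))   # 7.5
```

With align_corners=False, the old flow values near the border fall half a pixel outside the image, so padding_mode="border" clamping kicks in differently than the network was trained for; passing align_corners=True restores the pre-1.3.0 behaviour.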

@eps696

eps696 commented May 29, 2020

@olegkhomenko thanks; i use pytorch <= 1.2.0, there's no such option

@HellwayXue

Have you solved this problem? could you please tell me the solution?

@eps696

eps696 commented Jun 26, 2020

@HellwayXue at that time i just added some dumb logging after defining the codebook, manually resuming the training whenever that happened:

            if numpy.isnan(numpy.min(codebook)):
                raise Exception(' training failed at epoch %d' % epoch)

you may want to use numpy.nan_to_num instead (as it's used in sort_motion_codebook for instance)
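To make the two options above concrete, here is a small self-contained sketch (the guard_codebook / sanitize_codebook names are hypothetical; the real code just inlines the check after defining the codebook, and sort_motion_codebook in the repo uses nan_to_num):

```python
import numpy as np

def guard_codebook(codebook, epoch):
    """Fail fast if any codebook entry became NaN, as in the logging above."""
    if np.isnan(np.min(codebook)):
        raise Exception(' training failed at epoch %d' % epoch)
    return codebook

def sanitize_codebook(codebook):
    """Alternative: silently replace NaNs with zeros instead of aborting."""
    return np.nan_to_num(codebook)

dirty = np.array([0.1, np.nan, 0.3])
print(sanitize_codebook(dirty))  # [0.1 0.  0.3]
```

Note that np.min propagates NaN, so the isnan(min(...)) idiom checks the whole array in one pass; nan_to_num hides the failure rather than fixing its cause, which is why restarting from the last good checkpoint may still be preferable.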

@HellwayXue

@eps696 Thank you for your reply! But where do these NaNs come from? I've tried gradient clipping and optimizer weight decay, and none of them work. Your suggestion applies after defining the codebook, but I get NaNs even if I don't save it.

@eps696

eps696 commented Jun 28, 2020

@HellwayXue no idea, alas
tbh i didn't really get into the codebook mechanics; as long as it works in general, i'm ok with some flaws
