Tensorflow BackupAndRestore method does not work #20712

antipisa · 2025-01-02T14:48:44Z

I copied the example code here and it raises a ValueError with Python 3.11 and TensorFlow 2.17:

import keras
import numpy as np

class InterruptingCallback(keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        if epoch == 4:
            raise RuntimeError('Interrupting!')

callback = keras.callbacks.BackupAndRestore(backup_dir="/tmp/backup")
model = keras.models.Sequential([keras.layers.Dense(10)])
model.compile(keras.optimizers.SGD(), loss='mse')
try:
    model.fit(np.arange(100).reshape(5, 20), np.zeros(5), epochs=10,
              batch_size=1, callbacks=[callback, InterruptingCallback()],
              verbose=0)
except Exception as e:
    print(e)
history = model.fit(np.arange(100).reshape(5, 20), np.zeros(5),
                    epochs=10, batch_size=1, callbacks=[callback],
                    verbose=0)
len(history.history['loss'])


ValueError: To use the BackupAndRestore method, your model must be built before you call `fit()`. Model is unbuilt. You can build it beforehand by calling it on a batch of data.

Doesn't that defeat the purpose of BackupAndRestore?

mehtamansi29 · 2025-01-02T16:20:26Z

Hi @antipisa -

Thanks for reporting the issue. The BackupAndRestore callback raises this error because the model has not been built yet: no weights have been created, so there are no trainable parameters to back up.
A gist is attached here for reference.

For further reference, here is where the error is raised if the model is not built.

Building the model with an input shape first, for example model.build(input_shape=(None, 20)), makes the code work with the BackupAndRestore callback.
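The suggested fix can be sketched as follows. This is the reproduction from the original report with one added model.build call; the temporary backup directory and the x, y, and epochs_run names are assumptions made for illustration and do not come from the thread.

```python
import tempfile

import keras
import numpy as np


class InterruptingCallback(keras.callbacks.Callback):
    """Simulates a failure by raising at the start of zero-indexed epoch 4."""

    def on_epoch_begin(self, epoch, logs=None):
        if epoch == 4:
            raise RuntimeError("Interrupting!")


# Assumption: a fresh temporary directory, so no stale backup is picked up.
backup_dir = tempfile.mkdtemp()
callback = keras.callbacks.BackupAndRestore(backup_dir=backup_dir)

model = keras.models.Sequential([keras.layers.Dense(10)])
model.compile(keras.optimizers.SGD(), loss="mse")
# Create the weights before fit(); each sample has 20 features.
model.build(input_shape=(None, 20))

x = np.arange(100).reshape(5, 20)
y = np.zeros(5)

# First run: interrupted at the start of epoch 4, after 4 completed epochs.
try:
    model.fit(x, y, epochs=10, batch_size=1,
              callbacks=[callback, InterruptingCallback()], verbose=0)
except RuntimeError as e:
    print(e)

# Second run: the callback restores the backup and resumes from epoch 4.
history = model.fit(x, y, epochs=10, batch_size=1,
                    callbacks=[callback], verbose=0)
epochs_run = len(history.history["loss"])
print(epochs_run)
```

With the model built up front, the second fit call should only run the epochs that were left over from the interrupted run rather than all 10.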

antipisa · 2025-01-02T16:34:00Z

@mehtamansi29 does one have to call model.build both when the model is first instantiated and again when restarting after an interruption?

mehtamansi29 · 2025-01-02T16:57:45Z

Hi @antipisa -

You need to call model.build() once, before training the model for the first time. After an interruption, if the model architecture has not changed, there is no need to reinstantiate the model.
The attached gist shows that the model parameters and architecture do not change after the interruption.

antipisa · 2025-01-02T17:01:28Z

@mehtamansi29 a common use case for this is training that is interrupted by a computer crash or kernel restart, so one would have to re-compile the model, then load and restore from the backup. Since this requires compiling the model again, presumably one must also call build again.

mehtamansi29 · 2025-01-02T17:16:10Z

Yes @antipisa - After a computer crash or kernel restart, you need to build the model again, because the model state (weights, optimizer, ...) is lost. If the model architecture and model configuration don't change, there is no need to reinstantiate the model after the interruption.
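A minimal sketch of the restart case, under stated assumptions: make_model is a hypothetical helper (not part of Keras) that recreates, compiles, and builds the same architecture, and discarding the first model object stands in for a crashed process. The backup directory is a temporary one chosen for illustration.

```python
import tempfile

import keras
import numpy as np


def make_model():
    """Hypothetical helper: recreate, compile, and build the same architecture."""
    model = keras.models.Sequential([keras.layers.Dense(10)])
    model.compile(keras.optimizers.SGD(), loss="mse")
    model.build(input_shape=(None, 20))
    return model


class InterruptingCallback(keras.callbacks.Callback):
    """Simulates a crash by raising at the start of zero-indexed epoch 4."""

    def on_epoch_begin(self, epoch, logs=None):
        if epoch == 4:
            raise RuntimeError("Interrupting!")


backup_dir = tempfile.mkdtemp()  # assumption: must survive the "restart"
x = np.arange(100).reshape(5, 20)
y = np.zeros(5)

# First "process": training is interrupted partway through.
first_model = make_model()
try:
    first_model.fit(
        x, y, epochs=10, batch_size=1, verbose=0,
        callbacks=[keras.callbacks.BackupAndRestore(backup_dir=backup_dir),
                   InterruptingCallback()])
except RuntimeError as e:
    print(e)

# Stand-in for a restart: drop the old objects and recreate everything.
del first_model
resumed_model = make_model()
history = resumed_model.fit(
    x, y, epochs=10, batch_size=1, verbose=0,
    callbacks=[keras.callbacks.BackupAndRestore(backup_dir=backup_dir)])
resumed_epochs = len(history.history["loss"])
print(resumed_epochs)
```

The point of the sketch is that only the model definition has to be repeated after a restart; the weights and the epoch to resume from come from the files in the backup directory, as long as that directory still exists.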

antipisa · 2025-01-02T17:34:40Z

Thank you!

Surya2k1 · 2025-01-02T18:35:52Z

"Yes @antipisa - After a computer crash or kernel restart, you need to build the model again, because the model state (weights, optimizer, ...) is lost. If the model architecture and model configuration don't change, there is no need to reinstantiate the model after the interruption."

My understanding conflicts with this. As training proceeds, the model data is saved to 'backup_dir' according to save_freq. Unless this directory is lost, the callback is expected to restore the model state and resume training, right? Even after a kernel restart or crash, it is expected to have saved the model's latest state and to restore it once training resumes.

github-actions bot assigned mehtamansi29 Jan 2, 2025

mehtamansi29 added the type:support and stat:awaiting response from contributor labels Jan 2, 2025

mehtamansi29 mentioned this issue Jan 2, 2025

Update BackupAndRestore class example #20714

Merged

fchollet closed this as completed in #20714 Jan 2, 2025
