Tensorflow BackupAndRestore method does not work #20712

Closed
antipisa opened this issue Jan 2, 2025 · 8 comments · Fixed by #20714
Labels
type:support User is asking for help / asking an implementation question. Stackoverflow would be better suited.

Comments

@antipisa

antipisa commented Jan 2, 2025

I copied the example code here and it raises a ValueError with Python 3.11 and TensorFlow 2.17:

import keras
import numpy as np


class InterruptingCallback(keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        if epoch == 4:
            raise RuntimeError('Interrupting!')


callback = keras.callbacks.BackupAndRestore(backup_dir="/tmp/backup")
model = keras.models.Sequential([keras.layers.Dense(10)])
model.compile(keras.optimizers.SGD(), loss='mse')
try:
    model.fit(np.arange(100).reshape(5, 20), np.zeros(5), epochs=10,
              batch_size=1, callbacks=[callback, InterruptingCallback()],
              verbose=0)
except Exception as e:
    print(e)
history = model.fit(np.arange(100).reshape(5, 20), np.zeros(5),
                    epochs=10, batch_size=1, callbacks=[callback],
                    verbose=0)
len(history.history['loss'])


ValueError: To use the BackupAndRestore method, your model must be built before you call `fit()`. Model is unbuilt. You can build it beforehand by calling it on a batch of data.

Doesn't that defeat the purpose of BackupAndRestore?

@mehtamansi29
Collaborator

mehtamansi29 commented Jan 2, 2025

Hi @antipisa -

Thanks for reporting the issue. The BackupAndRestore callback raises this error because the model has not been built yet (it has no trainable parameters because its weights have not been created).
A gist is attached here for reference.

For further reference, this is where the error is raised if the model is not built.

If you build the model with an input shape first, for example model.build(input_shape=(None, 20)), the code works fine with the BackupAndRestore callback.
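
A minimal sketch of the workaround described above, adapted from the snippet in the original report. The temporary backup directory, the variable names `x` and `y`, and the scaling of the inputs are assumptions added for illustration; the substantive change is the `model.build(...)` call before the first `fit()`.

```python
import os
import tempfile

import keras
import numpy as np


class InterruptingCallback(keras.callbacks.Callback):
    """Simulates a failure at the start of epoch 4 (zero-based)."""

    def on_epoch_begin(self, epoch, logs=None):
        if epoch == 4:
            raise RuntimeError("Interrupting!")


# Assumption: a fresh temporary directory, so no stale backup is picked up.
backup_dir = os.path.join(tempfile.mkdtemp(), "backup")
callback = keras.callbacks.BackupAndRestore(backup_dir=backup_dir)

# Assumption: inputs scaled to [0, 1) so plain SGD stays numerically stable.
x = np.arange(100, dtype="float32").reshape(5, 20) / 100.0
y = np.zeros(5, dtype="float32")

model = keras.models.Sequential([keras.layers.Dense(10)])
model.compile(keras.optimizers.SGD(), loss="mse")
# Build the model so its weights exist before BackupAndRestore runs.
model.build(input_shape=(None, 20))

try:
    model.fit(x, y, epochs=10, batch_size=1,
              callbacks=[callback, InterruptingCallback()], verbose=0)
except RuntimeError as e:
    print(e)

# The second call picks up from the backup instead of starting at epoch 0.
history = model.fit(x, y, epochs=10, batch_size=1,
                    callbacks=[callback], verbose=0)
print(len(history.history["loss"]))
```

Because the interruption happens at the start of epoch 4, the resumed run should cover epochs 4 through 9, so the printed length is expected to be 6.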

mehtamansi29 added the type:support and stat:awaiting response from contributor labels Jan 2, 2025
@antipisa
Author

antipisa commented Jan 2, 2025

@mehtamansi29 does one have to call model.build both the first time the model is instantiated and then again when restarting after an interruption?

@mehtamansi29
Collaborator

Hi @antipisa -

You need to call model.build() once, before training the model for the first time. After an interruption, if the model architecture has not changed, there is no need to re-instantiate the model.
The attached gist shows that the model parameters and architecture do not change after the interruption.

@antipisa
Author

antipisa commented Jan 2, 2025

@mehtamansi29 a common use case is training that gets interrupted by a computer crash or kernel restart, after which one has to re-create and re-compile the model and then restore it from the backup. Since the model has to be compiled again, presumably one must also call build again.

@mehtamansi29
Collaborator

Yes @antipisa - after a computer crash or kernel restart you need to build the model again, because the model configuration (weights, optimizer, ...) is lost. If the model architecture and model configuration have not changed, there is no need to re-instantiate the model after an interruption.
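
The crash or kernel-restart case discussed above can be approximated in a single script by throwing away the first model object and creating a new one. This is only a sketch under assumptions: the helper `make_model`, the temporary backup directory, and the scaled inputs are hypothetical names and choices made for illustration, and a real restart would simply run the same script again with the same `backup_dir`.

```python
import os
import tempfile

import keras
import numpy as np


def make_model():
    """Recreate the same architecture, compile it, and build it."""
    model = keras.models.Sequential([keras.layers.Dense(10)])
    model.compile(keras.optimizers.SGD(), loss="mse")
    model.build(input_shape=(None, 20))
    return model


class InterruptingCallback(keras.callbacks.Callback):
    """Stand-in for a crash at the start of epoch 4 (zero-based)."""

    def on_epoch_begin(self, epoch, logs=None):
        if epoch == 4:
            raise RuntimeError("Interrupting!")


# Assumption: a temporary directory; a real job would use a stable path.
backup_dir = os.path.join(tempfile.mkdtemp(), "backup")
x = np.arange(100, dtype="float32").reshape(5, 20) / 100.0
y = np.zeros(5, dtype="float32")

# "First process": training fails at the start of epoch 4.
first_model = make_model()
try:
    first_model.fit(
        x, y, epochs=10, batch_size=1, verbose=0,
        callbacks=[keras.callbacks.BackupAndRestore(backup_dir),
                   InterruptingCallback()],
    )
except RuntimeError as e:
    print(e)
del first_model  # the in-memory state is gone, as after a crash

# "Second process": new model object, new callback object, same backup_dir.
resumed_model = make_model()
history = resumed_model.fit(
    x, y, epochs=10, batch_size=1, verbose=0,
    callbacks=[keras.callbacks.BackupAndRestore(backup_dir)],
)
resumed_epochs = len(history.history["loss"])
print(resumed_epochs)
```

In this sketch, rebuilding only recreates the architecture and empty optimizer state; the trained weights and the epoch counter are expected to come from the files in `backup_dir`.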

@antipisa
Author

antipisa commented Jan 2, 2025

Thank you!

@Surya2k1
Contributor

Surya2k1 commented Jan 2, 2025

> Yes @antipisa - after a computer crash or kernel restart you need to build the model again, because the model configuration (weights, optimizer, ...) is lost. If the model architecture and model configuration have not changed, there is no need to re-instantiate the model after an interruption.

This conflicts with my understanding. As training proceeds, the model state is saved to backup_dir according to save_freq. Unless that directory is lost, the callback is expected to restore the model state and resume training, right? Even after a kernel restart or crash, the latest model state should have been saved, and it should be restored once training resumes.

3 participants