-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add option to disable container spec validation when restoring checkpoints #11307
Comments
Seems fine to me so long as it's flag-gated behind a scary-sounding flag, and that the behavior for the now-not-validated fields is consistent (will they all effectively inherit the pre-checkpoint values?) |
@EtiennePerot yes my understanding is they'll effectively inherit the pre-checkpoint flag. I think it makes sense to put this behind a scary flag and to still emit a warning or error-level log so we can alert on validation errors. |
I've created #11323 for this change - let me know if it's acceptable! For the now-not-validated fields, I believe two things may happen. For fields that don't have any semantic effect at restore time, gvisor will behave according to the values given at checkpoint time. For other fields, differences in the checkpoint/restore specs will still cause loud restore failures in subcomponents of the kernel (e.g. a missing bind mount will cause filesystem restoration to fail). |
This PR adds a new flag entitled `skip-restore-spec-validation-unsafe`, defaulting to false. If the flag is set, the container spec given when restoring a checkpoint will no longer be validated against the original container spec given when the checkpoint was taken. In practice, many spec differences are benign, and it can be useful to allow the container specs to vary somewhat between checkpoint and restore. See #11307 . FUTURE_COPYBARA_INTEGRATE_REVIEW=#11323 from cweld510:cweld/skip-spec-validation-unsafe 17b2c1b PiperOrigin-RevId: 710752422
This PR adds a new flag entitled `skip-restore-spec-validation-unsafe`, defaulting to false. If the flag is set, the container spec given when restoring a checkpoint will no longer be validated against the original container spec given when the checkpoint was taken. In practice, many spec differences are benign, and it can be useful to allow the container specs to vary somewhat between checkpoint and restore. See #11307 . FUTURE_COPYBARA_INTEGRATE_REVIEW=#11323 from cweld510:cweld/skip-spec-validation-unsafe 17b2c1b PiperOrigin-RevId: 710752422
Thanks @EtiennePerot for the help and review! |
Description
We recently upgraded to a gvisor version containing extra validation of the container spec used when restoring checkpoints (i.e. all the logic here). Unfortunately, at the same time, we also discovered that we were inadvertently varying the container spec between checkpoint and restore in lots of subtle ways that, although technically incorrect, weren't causing any obvious bugs in our application. For example, varying the arguments to the container's init process between checkpoint and restore doesn't actually cause anything to break, because we consume the arguments into memory just once, before the checkpoint occurs. The new validation logic turns those small variances in the container spec into big failures that cause production issues.
Is it reasonable to add an option to disable the container spec validation? I'm fully aware that varying the container spec between checkpoint and restore is technically wrong, and I intend to fix it -- but in the interim, it would be nice to disable the validation to maintain stability.
Is this feature related to a specific bug?
No response
Do you have a specific solution in mind?
No response
The text was updated successfully, but these errors were encountered: