Gradual Rollout Support #98
Comments
I assume something like this will be needed: [...]
Other ideas are: [...]
@p-strusiewiczsurmacki-mobica I have a few comments on the steps:
apiVersion: network.schiff.telekom.de/v1alpha1
kind: NodeConfiguration
metadata:
  name: <node-name>
spec:
  vrfs: {}
  l2vnis: {}
  [...]
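
If the NodeConfiguration ends up as a kubebuilder-style CRD, the Go API types could look roughly like the sketch below. Apart from the vrfs/l2vnis field names taken from the YAML above, everything here is an assumption for illustration, not the actual operator code:

```go
// Hypothetical kubebuilder-style API types for the NodeConfiguration CRD.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// NodeConfigurationSpec mirrors the YAML sketch above: the per-node
// network configuration rendered by the central controller.
type NodeConfigurationSpec struct {
	VRFs   []VRFConfig   `json:"vrfs,omitempty"`
	L2VNIs []L2VNIConfig `json:"l2vnis,omitempty"`
}

// VRFConfig and L2VNIConfig are placeholders; the real fields would be
// copied from the existing cluster-wide CRDs during rendering.
type VRFConfig struct {
	Name string `json:"name"`
}

type L2VNIConfig struct {
	VNI int `json:"vni"`
}

// NodeConfigurationStatus is where the node-local agent reports whether
// the configuration was applied successfully (used by the rollout logic).
type NodeConfigurationStatus struct {
	ConfigStatus string `json:"configStatus,omitempty"` // e.g. "provisioning", "provisioned", "invalid"
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// NodeConfiguration is named after the node it targets.
type NodeConfiguration struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   NodeConfigurationSpec   `json:"spec,omitempty"`
	Status NodeConfigurationStatus `json:"status,omitempty"`
}
```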
Regarding the other ideas: [...]
@chdxD1 Just 2 more questions before I'll try to incorporate your comments into this: [...]
EDIT:
@chdxD1 I've started working on this on Monday. I've created a draft PR so you can take a quick look if you have some spare time and tell me if I'm going in a good direction with this. It is implemented mostly as described in my previous comment. Right now I need to make [...] I have an additional question as well - what should we do with the mounted config file? Should it stay the way it is now, or should it also be a part of NodeConfig?
Regarding the mounted config file: it might become a CRD instead of a ConfigMap; however, it will be the same for all nodes.
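
If the config file does become a CRD, making it cluster-scoped would match the "same for all nodes" requirement, since every node agent would read the one shared object. A minimal sketch with assumed names:

```go
// Hypothetical cluster-scoped replacement for the mounted config file.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// +kubebuilder:object:root=true
// +kubebuilder:resource:scope=Cluster

// OperatorConfig would carry the settings currently read from the mounted
// config file; as a single cluster-scoped object it is identical for all nodes.
type OperatorConfig struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec OperatorConfigSpec `json:"spec,omitempty"`
}

// OperatorConfigSpec is left open here; its fields would mirror the
// current config file contents.
type OperatorConfigSpec struct{}
```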
OK, I think I've got most of it by now.
In the current design, the network-operator on each node runs independently of the others. Custom resources are read from the cluster and configured on the local node.
If a user applies a faulty configuration, this could render all nodes inoperable at roughly the same time. There is no rollback mechanism in place that reverts a faulty configuration, and no gradual rollout of configuration.
I propose a 2-step approach:
1. Each network-operator on the node becomes capable of rolling back to a previous, working configuration on its own.
2. A central controller renders a dedicated configuration for each node; the node applies it and reports success or failure in the .status part.

The central controller renders each node configuration individually (and in a gradual fashion), thus allowing it to respond to failed node configurations by stopping any further rollout until the input resources are changed again.
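
To make the gradual, stop-on-failure behaviour concrete, here is a rough sketch of how such a rollout loop could work. All types, statuses and helper names are assumptions for illustration, not the operator's real code:

```go
// Sketch of a gradual, stop-on-failure rollout loop.
package rollout

import (
	"context"
	"fmt"
	"time"
)

type ConfigStatus string

const (
	StatusProvisioning ConfigStatus = "provisioning"
	StatusProvisioned  ConfigStatus = "provisioned"
	StatusInvalid      ConfigStatus = "invalid"
)

// NodeConfigClient abstracts writing the rendered per-node configuration
// and reading back the status reported by the node agent.
type NodeConfigClient interface {
	Deploy(ctx context.Context, node string, rendered any) error
	Status(ctx context.Context, node string) (ConfigStatus, error)
}

// RolloutGradually deploys the rendered configuration node by node.
// If any node reports "invalid", the rollout stops so the remaining nodes
// keep their previous, working configuration until the input resources
// are changed again.
func RolloutGradually(ctx context.Context, c NodeConfigClient, nodes []string, rendered map[string]any) error {
	for _, node := range nodes {
		if err := c.Deploy(ctx, node, rendered[node]); err != nil {
			return fmt.Errorf("deploying config for %s: %w", node, err)
		}
		if err := waitForNode(ctx, c, node); err != nil {
			// Stop the rollout; untouched nodes stay on the old config.
			return fmt.Errorf("rollout halted at %s: %w", node, err)
		}
	}
	return nil
}

// waitForNode polls the node's reported status until it is provisioned,
// invalid, or the context is cancelled.
func waitForNode(ctx context.Context, c NodeConfigClient, node string) error {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			status, err := c.Status(ctx, node)
			if err != nil {
				return err
			}
			switch status {
			case StatusProvisioned:
				return nil
			case StatusInvalid:
				return fmt.Errorf("node reported invalid configuration")
			}
			// Still provisioning: keep waiting.
		}
	}
}
```

In the real operator this would be driven from the controller's reconcile loop and the node list would come from the cluster, but the stop-on-failure semantics are the point here.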
Because the connection to the API server might be impacted by a faulty rollout, step two depends on step one. Each network-operator on the node should be individually capable of rolling back the configuration to a previous, working state.
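
One possible shape for that node-local rollback (again only a sketch with assumed names): keep the last known-good configuration, apply the new one, and revert automatically if a health/connectivity check fails afterwards:

```go
// Sketch of node-local rollback to the last known-good configuration.
// Applier, HealthCheck and the persistence format are assumptions.
package agent

import (
	"context"
	"fmt"
)

// Applier applies a rendered configuration to the local node.
type Applier interface {
	Apply(ctx context.Context, config []byte) error
}

// HealthCheck verifies the node is still functional after a change,
// e.g. that the API server is still reachable.
type HealthCheck func(ctx context.Context) error

// ApplyWithRollback applies newConfig and reverts to lastGood if the node
// becomes unhealthy, so a faulty rollout cannot leave the node inoperable
// even when the API server can no longer be reached. It returns the
// configuration that should be treated as known-good afterwards.
func ApplyWithRollback(ctx context.Context, a Applier, check HealthCheck, lastGood, newConfig []byte) ([]byte, error) {
	if err := a.Apply(ctx, newConfig); err != nil {
		// Applying failed outright: restore the previous working state.
		if rbErr := a.Apply(ctx, lastGood); rbErr != nil {
			return lastGood, fmt.Errorf("apply failed (%v) and rollback failed: %w", err, rbErr)
		}
		return lastGood, fmt.Errorf("apply failed, rolled back: %w", err)
	}
	if err := check(ctx); err != nil {
		// Node unhealthy after the change: roll back without waiting for the controller.
		if rbErr := a.Apply(ctx, lastGood); rbErr != nil {
			return lastGood, fmt.Errorf("health check failed (%v) and rollback failed: %w", err, rbErr)
		}
		return lastGood, fmt.Errorf("health check failed, rolled back: %w", err)
	}
	// New configuration is healthy: it becomes the next known-good state.
	return newConfig, nil
}
```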
For step two: The controller (or the K8s API server, via ownership relations) should also clean up node configurations (which can be written as a dedicated CRD) when a node leaves the cluster. As we are a heavy user of Cluster API, this is required to avoid clogging the API server with unnecessary resources.
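
For the ownership-relations variant of the cleanup, the central controller could set the Node as the owner of its NodeConfiguration, so Kubernetes garbage collection removes the config once the node object is deleted. A sketch using controller-runtime, assuming the NodeConfiguration CRD is cluster-scoped like Node:

```go
// Sketch: let the API server garbage-collect a NodeConfiguration when its
// Node is removed, by making the Node the owner of the rendered config.
package controller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// createNodeConfig creates the rendered per-node configuration with an
// owner reference pointing at the Node. nodeConfig stands for the
// hypothetical NodeConfiguration object sketched earlier.
func createNodeConfig(ctx context.Context, c client.Client, scheme *runtime.Scheme,
	node *corev1.Node, nodeConfig client.Object) error {
	// With the Node as controller/owner, Kubernetes garbage collection
	// deletes nodeConfig automatically once the Node object disappears
	// (e.g. when Cluster API rolls or removes the machine).
	if err := controllerutil.SetControllerReference(node, nodeConfig, scheme); err != nil {
		return err
	}
	return c.Create(ctx, nodeConfig)
}
```

Whether the controller deletes configs itself or relies on this garbage collection, the effect is the same: no stale NodeConfiguration objects accumulate as Cluster API replaces nodes.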