-
Notifications
You must be signed in to change notification settings - Fork 114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Verify status changes on managed interfaces #530
Verify status changes on managed interfaces #530
Conversation
Thanks for your PR,
To skip the vendors CIs use one of:
|
Hi @mlguerrero12 thanks for the PR! |
Hello @SchSeba indeed this is supposed to mitigate https://issues.redhat.com/browse/OCPBUGS-22337 |
@dmoessne can you share the details publicly ? currently the assumption is that config daemon owns sriov configuration (unless otherwise instructed e.g via externally managed feature) so no one is expected to change the configuration. based on that assumption comparing the states is expected to be sufficient. that said, im not against checking the system for the actual state however we need to understand IF we actually need this as it brings with it overhead in the daemon reconcile loop. |
@SchSeba, @adrianchiris, we have a customer that manually changed the mtu of some VFs that were previously configured with a sriov policy. The sriov config daemon detects these changes (via pollNicStatus which runs every 30s) and updates the status of the sriov network node state object. However, the reconciliation loop doesn't trigger because it only does when the generation field changes (this is when the spec of the sriov-network-node-state changes). We already informed the customer that they shouldn't modify the mtu without using sriov policies. I'm currently working on a PR to also make the operator react when the status doesn't match what was set in the spec. This customer was relying on this bug. The problem was that due to a recent issue with etcd encryption, the kube-api server pods were restarted (weekly). The resource version of objects were incremented due to this, including the one for the "device-plugin-config" config map. The resource version of this config map is part of the sriov-network-node-state spec. Due to this, the sriov config daemon started its reconciliation loop. The function OnNodeStateChange of the generic plugin detected changes in the MTU and returned "needDrain = true". The node was drained and this is the reason why the customer complained. They didn't understand why the node was repeatedly draining. The Apply function of the generic plugin never reconciled the status after the draining because it only does when the spec changes. This is what I'm fixing with this PR. Regarding the cost, yes, it has an extra cost but it's just some for loops to compare some parameters of the interfaces. Nothing to be concerned about I would think but I could speed this up with some routines. To sum up, customer shouldn't be changing the mtu without using sriov policies of VFs that are managed by the sriov operator. We identified some issues in the logic of the sriov daemon. One of them is solved with this PR. |
Thanks for your PR,
To skip the vendors CIs use one of:
|
d4716bf
to
212a2ad
Compare
Thanks for your PR,
To skip the vendors CIs use one of:
|
212a2ad
to
9daecca
Compare
Thanks for your PR,
To skip the vendors CIs use one of:
|
I added a commit to start the reconciliation loop of the sriov config daemon when status of interfaces are externally modified (see my previous comment). Updated title and description of PR. |
Thanks for your PR,
To skip the vendors CIs use one of:
|
9daecca
to
091f7dc
Compare
Thanks for your PR,
To skip the vendors CIs use one of:
|
Pull Request Test Coverage Report for Build 8818177298Details
💛 - Coveralls |
091f7dc
to
fad55a8
Compare
Thanks for your PR,
To skip the vendors CIs use one of:
|
fad55a8
to
ae9592c
Compare
Thanks for your PR,
To skip the vendors CIs use one of:
|
ae9592c
to
4e630f6
Compare
Thanks for your PR,
To skip the vendors CIs use one of:
|
4e630f6
to
356043c
Compare
Thanks for your PR,
To skip the vendors CIs use one of:
|
356043c
to
98150f5
Compare
26d85b4
to
46cee51
Compare
Thanks for your PR,
To skip the vendors CIs use one of:
|
failing test is not related |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM just small nits to add comments and we can merge this one.
nice work!
46cee51
to
4739f7b
Compare
Thanks for your PR,
To skip the vendors CIs use one of:
|
4739f7b
to
af847fe
Compare
Thanks for your PR,
To skip the vendors CIs use one of:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM thanks!
@Eoghan1232 @e0ne, please have a look at this when you have time. Thanks! |
af847fe
to
2885578
Compare
Thanks for your PR,
To skip the vendors CIs use one of:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thx
Thanks for your PR,
To skip the vendors CIs use one of:
|
The generic plugin was applying config changes only if the desired spec of interfaces was different from the last applied spec. This logic is different from the one in OnNodeStateChange where the real status of the interfaces is used to detect changes. By removing the LastState parameter (and related code), the generic plugin will also use the real status of interfaces to decide whether to apply changes or not. The SyncNodeState function has this logic.
Users could modify the settings of VFs which have been configured by the sriov operator. This PR starts the reconciliation loop when these changes are detected in the generic plugin. Signed-off-by: Marcelo Guerrero <[email protected]>
Logic to check missing kernel arguments is placed in a method to be used by both OnNodeStateChange and CheckStatusChanges.
2885578
to
0d1a11a
Compare
Thanks for your PR,
To skip the vendors CIs use one of:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
merging this one thanks for all the help folks! |
Users could modify the settings of VFs which have been configured by the sriov operator. This PR starts the reconciliation loop when these changes are detected in the generic plugin.