-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
High Package Approval Time #462
Comments
related to #398 |
Triaged |
I have run multiple Nephio end to end tests locally, but I am not seeing this issue on Porch. The packages are approved in a reasonable amount of time seconds/small number of minutes. However, the resources associated with the packages themselves can take a long time to come up (~10 minutes for cluster creation over capi). This is not a porch issue but an issue in the package itself. Logging from e2e test to follow once the PR below containing better logging is merged. |
Thank you for looking into this. I totally agree and I learned about it when I was working on Onesummit demo. The problem is mostly related to the package. About cluster creation, I can understand 10 mins. But the same is not the case when NF package has a dependencies CR. So it can be coming from functions, interface-fn, nfdeploy-fn or dnn-fn. I hope logging is able to show us the reason. |
Below is the timestamps extracted from the log from this test run on Saturday: |
This PR generates the CSV from the log. nephio-project/test-infra#268 |
The |
This is a sample of a recent execution
BTW, the sandbox creation takes ~10 min |
Some observations from debugging. Testing using the UPF edge deploy flow - https://github.com/nephio-project/test-infra/blob/main/e2e/tests/free5gc/005.sh The mutator pipeline execution can take some time (~5 min) and the conditions related to the readiness gates are updated to true in good time. But the delay appears to be related to the approval controller not "getting" the latest packagerevision until at least 5 mins later. and fails to verify the readiness gates - https://github.com/nephio-project/nephio/blob/main/controllers/pkg/reconcilers/approval/reconciler.go#L131 With some expertise from others, we determined that this may be related to how the Watch notification is impl in porch. @johnbelamaric Would you have some time to direct us on the best approach here? |
I am not sure I have a great answer here. It's hard to know without more investigation as to exactly what is going on. Can you track the propagation of the change in the cases where you are seeing delays? That is, if approval controller is not seeing the change, is it stuck on an older resource version (commit)? Or is the problem that the Porch API server is somehow not seeing the change? If it's the first one (porch API sees the change, approval controller doesn't), then the issues is either a watch event that failed to publish (porch issue) or one that somehow was missed by the approval controller. If it's the second one (porch API does not see the change), then we need to look deeper into how there is a mismatch between Git and the Porch API server. I did have a PR that touched this area, that has never merged (I merged it a long time ago but the Google internal TNA team found some issues so I backed it out): nephio-project/porch#8. That covers how the internal cache is maintained withing Porch and was primarily focused on performance degradation over time. But it may be relevant in this case as it pertains to consistency of the cache. |
Ok. Some progress, I think. TEST SUMMARY001.sh: PASS in 182 seconds This is an just a single run and I did observe that the amf pkg got stuck in a proposed state and needed to be push on manually. |
That looks a lot better! |
Second run on fresh VM. TEST SUMMARY001.sh: PASS in 191 seconds Not very consistent so far, but better timings. We still have an issue with the amf/smf deploy. Can't see why but the amf pkg gets approved quite early and then gets stuck trying to publish a v2 revision: err |
I think the "initial" policy won't approve a v2! Thus the name - it approves an initial version but not subsequent edits. Looks like previously it only worked because some initial deltas got batched into a single version. Now with faster approvals, the approval controller is winning a race it used to lose! Hacky workaround: add an approval delay annotation to that package to delay approval a little bit. It's akin to fixing a race condition with a sleep - bad practice but may help temporarily. We will need a better policy than "initial" to handle this properly. |
Added a 1m delay to the amf pvs. events owning PackageVariant not Ready Checked the pv and the status is:
https://github.com/nephio-project/porch/blob/main/pkg/engine/engine.go#L992 |
Tweaked the requeue duration to 15 secs. TEST SUMMARY001.sh: PASS in 144 seconds |
Change approval controller PR Get to hit the api directly instead of reading from local cache. Adjust the reque duration to prevent race condition. During debugging the approval delay issue reported [here](#462) it became apparent that the packagerev being fetched was a cached version which didn't get updated for quite some time. To circumvent this, we are retrieving the PR using the apiReader interface which bypasses the local cache and hits the k8s api directly.
Much better numbers! :) Congratulations! |
Just one minor question: have you considered using the existing porchRESTClient for the Get() call in the approval controller to bypass the cache, instead of introducing yet another client? |
While checking for R2, I realized that the package approval time is more than R1. Sometimes even after 15 mins the approval doesn't happen so I have to approve it manually.
I tried it several times and it seems that the issue is reproducible.
@johnbelamaric if needed I can add the logs of the porch controller.
The text was updated successfully, but these errors were encountered: