Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

High Package Approval Time #462

Open
arora-sagar opened this issue Dec 15, 2023 · 19 comments
Open

High Package Approval Time #462

arora-sagar opened this issue Dec 15, 2023 · 19 comments
Assignees
Labels
area/platform area/porch Porch related issues bug Something isn't working help wanted Extra attention is needed triaged
Milestone

Comments

@arora-sagar
Copy link

While checking for R2, I realized that the package approval time is more than R1. Sometimes even after 15 mins the approval doesn't happen so I have to approve it manually.

I tried it several times and it seems that the issue is reproducible.

@johnbelamaric if needed I can add the logs of the porch controller.

@arora-sagar arora-sagar added this to the R2 milestone Dec 15, 2023
@johnbelamaric
Copy link
Member

#452

@liamfallon liamfallon added the area/porch Porch related issues label Mar 26, 2024
@gvbalaji gvbalaji added this to R4 Apr 3, 2024
@gvbalaji gvbalaji moved this to Todo in R4 Apr 3, 2024
@gvbalaji gvbalaji modified the milestones: R2, R3 Apr 3, 2024
@liamfallon
Copy link
Member

related to #398

@liamfallon liamfallon added bug Something isn't working help wanted Extra attention is needed labels May 17, 2024
@liamfallon
Copy link
Member

Triaged
Triage Comment: Urgent

@liamfallon
Copy link
Member

I have run multiple Nephio end to end tests locally, but I am not seeing this issue on Porch. The packages are approved in a reasonable amount of time seconds/small number of minutes.

However, the resources associated with the packages themselves can take a long time to come up (~10 minutes for cluster creation over capi). This is not a porch issue but an issue in the package itself.

Logging from e2e test to follow once the PR below containing better logging is merged.
nephio-project/test-infra#267

@arora-sagar
Copy link
Author

I have run multiple Nephio end to end tests locally, but I am not seeing this issue on Porch. The packages are approved in a reasonable amount of time seconds/small number of minutes.

However, the resources associated with the packages themselves can take a long time to come up (~10 minutes for cluster creation over capi). This is not a porch issue but an issue in the package itself.

Logging from e2e test to follow once the PR below containing better logging is merged. nephio-project/test-infra#267

Thank you for looking into this. I totally agree and I learned about it when I was working on Onesummit demo. The problem is mostly related to the package. About cluster creation, I can understand 10 mins. But the same is not the case when NF package has a dependencies CR. So it can be coming from functions, interface-fn, nfdeploy-fn or dnn-fn.

I hope logging is able to show us the reason.

@liamfallon
Copy link
Member

Below is the timestamps extracted from the log from this test run on Saturday:

approve.csv

@liamfallon
Copy link
Member

This PR generates the CSV from the log. nephio-project/test-infra#268

@electrocucaracha
Copy link
Member

The run_test function generates some metrics (test execution time and MB downloaded). There are also some warning messages triggered when the waiting time exceeds 2/3 of expected (https://github.com/nephio-project/test-infra/blob/main/e2e/lib/k8s.sh#L44)

@electrocucaracha
Copy link
Member

This is a sample of a recent execution

Test tasks Time Download
001.sh 591 secs 1,105 MB
002.sh 1038 secs 2,212 MB
003.sh 47 secs 177 MB
004.sh 100 secs 222 MB
005.sh 664 secs 118 MB
006.sh 855 secs 31 MB
007.sh 1019 secs 143 MB
008.sh 321 secs 10 MB
009.sh 17 secs 0 MB

BTW, the sandbox creation takes ~10 min

@efiacor
Copy link
Contributor

efiacor commented Aug 6, 2024

Some observations from debugging.

Testing using the UPF edge deploy flow - https://github.com/nephio-project/test-infra/blob/main/e2e/tests/free5gc/005.sh

The mutator pipeline execution can take some time (~5 min) and the conditions related to the readiness gates are updated to true in good time.

But the delay appears to be related to the approval controller not "getting" the latest packagerevision until at least 5 mins later.
https://github.com/nephio-project/nephio/blob/main/controllers/pkg/reconcilers/approval/reconciler.go#L94

and fails to verify the readiness gates - https://github.com/nephio-project/nephio/blob/main/controllers/pkg/reconcilers/approval/reconciler.go#L131

With some expertise from others, we determined that this may be related to how the Watch notification is impl in porch.
https://github.com/nephio-project/porch/blob/main/pkg/registry/porch/watch.go

@johnbelamaric Would you have some time to direct us on the best approach here?
Of all the existing controllers in the nephio repo, this one is at least of some use to generic users of the nephio operator so it would be great if we could improve it.

@johnbelamaric
Copy link
Member

@johnbelamaric Would you have some time to direct us on the best approach here?

I am not sure I have a great answer here. It's hard to know without more investigation as to exactly what is going on. Can you track the propagation of the change in the cases where you are seeing delays? That is, if approval controller is not seeing the change, is it stuck on an older resource version (commit)? Or is the problem that the Porch API server is somehow not seeing the change?

If it's the first one (porch API sees the change, approval controller doesn't), then the issues is either a watch event that failed to publish (porch issue) or one that somehow was missed by the approval controller.

If it's the second one (porch API does not see the change), then we need to look deeper into how there is a mismatch between Git and the Porch API server. I did have a PR that touched this area, that has never merged (I merged it a long time ago but the Google internal TNA team found some issues so I backed it out): nephio-project/porch#8. That covers how the internal cache is maintained withing Porch and was primarily focused on performance degradation over time. But it may be relevant in this case as it pertains to consistency of the cache.

@efiacor
Copy link
Contributor

efiacor commented Aug 13, 2024

Ok. Some progress, I think.
If we change the packagerev Get in the approval controller to use a Reader, we get a substantial reduction in timings for the free5gc e2e run.

TEST SUMMARY

001.sh: PASS in 182 seconds
002.sh: PASS in 315 seconds
003.sh: PASS in 50 seconds
004.sh: PASS in 48 seconds
005.sh: PASS in 273 seconds
006.sh: PASS in 464 seconds
007.sh: PASS in 413 seconds
008.sh: PASS in 307 seconds
009.sh: PASS in 31 seconds


This is an just a single run and I did observe that the amf pkg got stuck in a proposed state and needed to be push on manually.

@johnbelamaric
Copy link
Member

That looks a lot better!

@efiacor
Copy link
Contributor

efiacor commented Aug 13, 2024

Second run on fresh VM.

TEST SUMMARY

001.sh: PASS in 191 seconds
002.sh: PASS in 290 seconds
003.sh: PASS in 45 seconds
004.sh: PASS in 52 seconds
005.sh: PASS in 374 seconds
006.sh: PASS in 400 seconds
007.sh: PASS in 565 seconds
008.sh: PASS in 55 seconds
009.sh: PASS in 25 seconds


Not very consistent so far, but better timings.

We still have an issue with the amf/smf deploy. Can't see why but the amf pkg gets approved quite early and then gets stuck trying to publish a v2 revision: err
approval policy "initial" not met

See git commit graph:
Screenshot from 2024-08-13 23-36-59

@johnbelamaric
Copy link
Member

I think the "initial" policy won't approve a v2! Thus the name - it approves an initial version but not subsequent edits. Looks like previously it only worked because some initial deltas got batched into a single version. Now with faster approvals, the approval controller is winning a race it used to lose!

Hacky workaround: add an approval delay annotation to that package to delay approval a little bit. It's akin to fixing a race condition with a sleep - bad practice but may help temporarily.

We will need a better policy than "initial" to handle this properly.

@efiacor
Copy link
Contributor

efiacor commented Aug 13, 2024

Added a 1m delay to the amf pvs.
It gets proposed along the lines somewhere and then stalls with:

events owning PackageVariant not Ready

Checked the pv and the status is:

  • lastTransitionTime: "2024-08-13T23:43:39Z"
    message: 'Internal error occurred: cannot update a package revision with lifecycle
    value "Proposed"; package must be Draft'
    reason: Error
    status: "False"
    type: Ready

https://github.com/nephio-project/porch/blob/main/pkg/engine/engine.go#L992

@efiacor
Copy link
Contributor

efiacor commented Aug 15, 2024

Tweaked the requeue duration to 15 secs.
Added a 3 min delay on the AMF & SMF PVs.
All ran though without manual intervention.

TEST SUMMARY

001.sh: PASS in 144 seconds
002.sh: PASS in 277 seconds
003.sh: PASS in 63 seconds
004.sh: PASS in 58 seconds
005.sh: PASS in 748 seconds
006.sh: PASS in 412 seconds
007.sh: PASS in 619 seconds
008.sh: PASS in 48 seconds
009.sh: PASS in 21 seconds


nephio-prow bot pushed a commit that referenced this issue Aug 22, 2024
Change approval controller PR Get to hit the api directly instead of
reading from local cache.
Adjust the reque duration to prevent race condition.

During debugging the approval delay issue reported
[here](#462) it became
apparent that the packagerev being fetched was a cached version which
didn't get updated for quite some time.
To circumvent this, we are retrieving the PR using the apiReader
interface which bypasses the local cache and hits the k8s api directly.
@kispaljr
Copy link
Contributor

Much better numbers! :) Congratulations!

@kispaljr
Copy link
Contributor

kispaljr commented Aug 26, 2024

Just one minor question: have you considered using the existing porchRESTClient for the Get() call in the approval controller to bypass the cache, instead of introducing yet another client?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/platform area/porch Porch related issues bug Something isn't working help wanted Extra attention is needed triaged
Projects
Status: Done
Development

No branches or pull requests

7 participants