Retrying applying CRs in the e2e jobs. #1293
Conversation
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ybettan

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
**Walkthrough**

The pull request involves modifications to two Bash scripts used in Kubernetes deployment and build processes. The primary changes introduce a retry mechanism for applying resources and creating pods. Both scripts (ci/prow/e2e-incluster-build and ci/prow/e2e-hub-spoke-incluster-build) now retry the relevant `oc` commands until they succeed or a timeout expires.
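For orientation, the core of the change is a timeout-bounded `until` loop around the apply step; a minimal sketch of the pattern (the `ci/e2e` path and durations match the diffs quoted later in the review):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Retry `oc apply` for up to 1 minute: each failed attempt (for example,
# because a webhook service is not serving yet) is followed by a 3-second
# pause; `timeout` kills the loop if it never succeeds within the window.
timeout 1m bash -c 'until oc apply -k ci/e2e; do sleep 3; done'
```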
Assessment against linked issues
The changes directly address the issue by implementing a retry mechanism that allows services more time to become ready before applying Custom Resources (CRs).
✅ Deploy Preview for openshift-kmm ready!
Actionable comments posted: 1
🧹 Nitpick comments (5)
ci/prow/e2e-incluster-build (2)
`45-45`: **Add error handling for timeout**

The script should handle cases where the timeout is reached. Currently, if the resources can't be applied within 1 minute, the script will fail without a clear error message.
```diff
-timeout 1m bash -c 'until oc apply -k ci/e2e; do sleep 3; done'
+if ! timeout 1m bash -c 'until oc apply -k ci/e2e; do sleep 3; done'; then
+  echo "ERROR: Failed to apply resources after 1 minute of retries"
+  exit 1
+fi
```
Line range hint `48-51`: **Simplify and improve pod name matching**

The current jq queries for matching pod names could be simplified and made more robust by:
- Using explicit label selectors with `oc get pods`
- Simplifying the jq query
```diff
-timeout 1m bash -c 'until oc get pods -o json | jq -er ".items[].metadata.name | select(.? | match(\"build\"))"; do sleep 1; done'
-POD_NAME=$(oc get pods -o json | jq -r '.items[].metadata.name | select(.? | match("build"))')
+timeout 1m bash -c 'until oc get pods -l job-name=kmm-ci-build -o name; do sleep 1; done'
+POD_NAME=$(oc get pods -l job-name=kmm-ci-build -o name | sed 's|pod/||')

-timeout 1m bash -c 'until oc get pods -o json | jq -er ".items[].metadata.name | select(.? | match(\"sign\"))"; do sleep 1; done'
-POD_NAME=$(oc get pods -o json | jq -r '.items[].metadata.name | select(.? | match("sign"))')
+timeout 1m bash -c 'until oc get pods -l job-name=kmm-ci-sign -o name; do sleep 1; done'
+POD_NAME=$(oc get pods -l job-name=kmm-ci-sign -o name | sed 's|pod/||')
```

Also applies to: 54-57
ci/prow/e2e-hub-spoke-incluster-build (3)
`50-50`: **Document timeout values**

The script uses various timeout values (1m, 3m, 15m) for different operations. Consider adding comments explaining the rationale behind these timeout values to help maintainers understand and adjust them if needed.
```diff
+# Retry applying resources for up to 1 minute to handle webhook service availability
 timeout 1m bash -c 'until oc apply -k ci/e2e-hub; do sleep 3; done'
```
Line range hint `52-54`: **Standardize build wait patterns**

The build wait patterns could be standardized with the pod wait patterns from the other script for consistency.
```diff
-timeout 1m bash -c 'until oc -n ${HUB_OPERATOR_NAMESPACE} get builds -o json | jq -er ".items[].metadata.name | select(.? | match(\"build\"))"; do sleep 1; done'
-export build_build=$(oc -n ${HUB_OPERATOR_NAMESPACE} get builds -o json | jq -r '.items[].metadata.name | select(.? | match("build"))')
+timeout 1m bash -c 'until oc -n ${HUB_OPERATOR_NAMESPACE} get builds -l job=kmm-ci-build -o name; do sleep 1; done'
+export build_build=$(oc -n ${HUB_OPERATOR_NAMESPACE} get builds -l job=kmm-ci-build -o name | sed 's|build/||')

-timeout 1m bash -c 'until oc -n ${HUB_OPERATOR_NAMESPACE} get builds -o json | jq -er ".items[].metadata.name | select(.? | match(\"sign\"))"; do sleep 1; done'
-export sign_build=$(oc -n ${HUB_OPERATOR_NAMESPACE} get builds -o json | jq -r '.items[].metadata.name | select(.? | match("sign"))')
+timeout 1m bash -c 'until oc -n ${HUB_OPERATOR_NAMESPACE} get builds -l job=kmm-ci-sign -o name; do sleep 1; done'
+export sign_build=$(oc -n ${HUB_OPERATOR_NAMESPACE} get builds -l job=kmm-ci-sign -o name | sed 's|build/||')
```

Also applies to: 57-59
`50-50`: **Consider extracting common retry patterns**

Both scripts implement similar retry patterns for `oc apply` and resource waiting. Consider extracting these patterns into helper functions in a shared script to reduce duplication and ensure consistency.

Example shared functions:
```bash
# helpers.sh
wait_for_resource() {
  local namespace=$1
  local resource_type=$2
  local selector=$3
  local timeout=${4:-1m}
  if ! timeout "${timeout}" bash -c "until oc -n ${namespace} get ${resource_type} ${selector} &>/dev/null; do sleep 1; done"; then
    echo "ERROR: Timeout waiting for ${resource_type} in namespace ${namespace}"
    return 1
  fi
}

retry_apply() {
  local path=$1
  local timeout=${2:-1m}
  if ! timeout "${timeout}" bash -c "until oc apply -k ${path}; do sleep 3; done"; then
    echo "ERROR: Failed to apply resources at ${path} after ${timeout} of retries"
    return 1
  fi
}
```
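If the helpers above were adopted, a calling script might look roughly like this (the helpers.sh path, namespace, and label selector are illustrative assumptions, not taken from the repository):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical location for the shared helpers proposed above.
source ci/prow/helpers.sh

# Retry applying the e2e kustomization for up to 1 minute.
retry_apply ci/e2e 1m

# Wait up to 1 minute for the build pod to show up; the label selector
# here is only an example and would have to match the real job labels.
wait_for_resource default pods "-l job-name=kmm-ci-build" 1m
```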
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- ci/prow/e2e-hub-spoke-incluster-build (1 hunks)
- ci/prow/e2e-incluster-build (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Analyze (go)
The webhook services are not always available right after the deployments are ready. This causes a race condition between the services becoming ready to receive requests and the CRs being applied to the cluster. By retrying to apply the CRs, we give the services the time they need to become ready.

Signed-off-by: Yoni Bettan <[email protected]>
0835c06 to 8156f88 (Compare)
Actionable comments posted: 0
🧹 Nitpick comments (1)
ci/prow/e2e-hub-spoke-incluster-build (1)
Line range hint `1-54`: **Consider architectural improvements for better reliability.**

The script has several areas where reliability could be improved:
- Inconsistent timeout durations (1m vs 3m vs 15m)
- Limited error handling and reporting
- No cleanup on failure
Consider these improvements:
- Define timeout constants at the top of the script:
```bash
readonly RESOURCE_TIMEOUT="5m"
readonly BUILD_TIMEOUT="10m"
readonly ACM_TIMEOUT="15m"
```
- Add a trap for cleanup:
```bash
cleanup() {
  # Add cleanup logic here
  echo "Cleaning up resources..."
}
trap cleanup EXIT
```
- Add an error reporting function:
```bash
report_error() {
  local component=$1
  local message=$2
  echo "Error in ${component}: ${message}"
  # Add logging or metric collection
}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- ci/prow/e2e-hub-spoke-incluster-build (1 hunks)
- ci/prow/e2e-incluster-build (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- ci/prow/e2e-incluster-build
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Analyze (go)
🔇 Additional comments (1)
ci/prow/e2e-hub-spoke-incluster-build (1)
`54-54`: **Verify the timeout and retry intervals.**

The retry mechanism is a good addition to handle race conditions with webhook services. However, consider the following aspects:
- The 1-minute timeout might be too short for larger deployments.
- The 3-second sleep interval between retries might lead to unnecessary load on the API server.
Consider this improved version with exponential backoff:
```diff
-timeout 1m bash -c 'until oc apply -k ci/e2e-hub; do sleep 3; done'
+timeout 5m bash -c '
+  retries=0
+  until oc apply -k ci/e2e-hub; do
+    retries=$((retries + 1))
+    if [ $retries -gt 10 ]; then
+      echo "Failed to apply resources after 10 retries"
+      exit 1
+    fi
+    sleep $((2 ** retries))
+  done
+'
```

Let's verify the resource application patterns in the codebase:
✅ Verification successful
The timeout and retry intervals are consistent with established patterns.
The current implementation using a 1-minute timeout and 3-second sleep interval aligns with similar resource application patterns throughout the codebase. While exponential backoff could be an improvement, the current approach is consistent with the project's conventions and sufficient for the use case.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Check for similar resource application patterns that might need the same retry mechanism
rg -A 2 "oc apply" --type bash
```

Length of output: 61
Script:
```bash
#!/bin/bash
# Search for oc apply patterns with context
rg -C 3 "oc apply"

# Search for timeout patterns
rg -C 2 "timeout.*bash"

# Search for retry/until patterns
rg -C 2 "until.*do"
```

Length of output: 42582
/retest
2 similar comments
/retest

/retest

/lgtm
Merged d80a1e4 into rh-ecosystem-edge:main
The webhook services are not always available right after the deployments are ready. This causes a race condition between the services becoming ready to receive requests and the CRs being applied to the cluster.

By retrying to apply the CRs, we give the services the time they need to become ready.
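As applied in the hub-spoke job (and analogously in the single-cluster job), the fix is the single retry line already quoted in the review above:

```bash
# Give the webhook services up to 1 minute to become ready: retry the
# apply every 3 seconds instead of failing on the first rejected request.
timeout 1m bash -c 'until oc apply -k ci/e2e-hub; do sleep 3; done'
```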
/assign @yevgeny-shnaidman @TomerNewman
Fixes #1291
Summary by CodeRabbit

- Retry logic added around `oc apply` commands to handle potential transient failures.