Allow `imagePullBackOff` for the specified duration #7666

pritidesai · 2024-02-14T00:04:55Z

Changes

We have implemented imagePullBackOff to fail fast. The problem with this approach is that the error can be transient, depending on the infrastructure. Often the node where the pod is scheduled hits a registry rate limit. An image pull failure caused by a rate limit returns the same warning (reason: Failed and message: ImagePullBackOff) as other failures such as an authentication failure or a missing image. In the rate-limit case, the pod can recover once the limit has expired: Kubernetes can then pull the image successfully and bring the pod up. The fail-fast approach instead fails the taskRun, and hence the pipelineRun.

This PR introduces a default configuration that sets a cluster-level timeout, allowing imagePullBackOff to retry for a given duration. Once that duration has passed, the controller returns a permanent failure.

#5987
#7184

/kind feature

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
Has Tests included if any functionality added or changed
pre-commit Passed
Follows the commit message standard
Meets the Tekton contributor standards (including functionality, content, code)
Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

Configure default-imagepullbackoff-timeout to allow imagePullBackOff to retry and wait for the specified duration before failing the pipeline.

tekton-robot · 2024-02-14T00:10:10Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/apis/config/default.go	91.2%	85.5%	-5.7
pkg/reconciler/taskrun/taskrun.go	86.6%	83.9%	-2.7

tekton-robot · 2024-02-14T00:13:33Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/apis/config/default.go	91.2%	85.5%	-5.7
pkg/reconciler/taskrun/taskrun.go	86.6%	83.9%	-2.7

vdemeester

SGTM, but I think we should use a duration type instead of just a number (that represents minutes). It's more "user-friendly" in my opinion.

vdemeester · 2024-02-14T09:35:55Z

config/config-defaults.yaml

@@ -87,6 +87,10 @@ data:
    # no default-resolver-type is specified by default
    default-resolver-type:

+    # default-imagepullbackoff-timeout contains the default number of minutes to wait
+    # before requeuing the TaskRun to retry
+    # default-imagepullbackoff-timeout: "5"


We probably want to use time values instead, like 1m, 5m, 10s or 1h.

or we need to add "minutes" in the field name.

I was looking at this too and thinking that in all cases I can think of K8s uses "seconds". I think that's simplest vs. supporting something fancier.

I do not have a strong opinion either way. I like time values because they are clearer and more flexible: the timeout can be specified in seconds, minutes, hours, etc. But at the same time, it loses consistency with the existing taskRun timeout field, which uses minutes.

We have time.Duration implemented now. Do we want to change that to adding minutes in the field name, or use seconds?

Good point -- we might want to keep our time units consistent across the board in the future, and I think time.Duration is truly a better basis.

pkg/apis/config/default.go

skaegi · 2024-02-14T18:32:37Z

pkg/reconciler/taskrun/taskrun.go

@@ -222,10 +224,30 @@ func (c *Reconciler) ReconcileKind(ctx context.Context, tr *v1.TaskRun) pkgrecon
 	return nil
 }

-func (c *Reconciler) checkPodFailed(tr *v1.TaskRun) (bool, v1.TaskRunReason, string) {
+func (c *Reconciler) checkPodFailed(tr *v1.TaskRun, ctx context.Context) (bool, v1.TaskRunReason, string) {


K8s API convention is that "context" should be the first parameter if needed.

It's even a Go convention 😇

tekton-robot · 2024-02-14T19:10:20Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/apis/config/default.go	91.2%	91.9%	0.7
pkg/reconciler/taskrun/taskrun.go	86.6%	87.2%	0.6

tekton-robot · 2024-02-14T19:11:53Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/apis/config/default.go	91.2%	91.9%	0.7
pkg/reconciler/taskrun/taskrun.go	86.6%	87.2%	0.6

We have implemented imagePullBackOff as fail fast. The issue with this approach is that the node where the pod is scheduled often hits a registry rate limit. The image pull failure caused by the rate limit returns the same warning (reason: Failed and message: ImagePullBackOff). The pod can potentially recover after waiting long enough for the limit to expire. Kubernetes can then successfully pull the image and bring the pod up.

Introducing a default configuration to specify a cluster-level timeout to allow the imagePullBackOff to retry for a given duration. Once that duration has passed, return a permanent failure.

tektoncd#5987
tektoncd#7184

Signed-off-by: Priti Desai <[email protected]>

wait for a given duration in case of imagePullBackOff

Signed-off-by: Priti Desai <[email protected]>

tekton-robot · 2024-02-14T19:26:57Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/apis/config/default.go	91.2%	91.9%	0.7
pkg/reconciler/taskrun/taskrun.go	86.6%	87.2%	0.6

tekton-robot · 2024-02-14T19:27:33Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/apis/config/default.go	91.2%	91.9%	0.7
pkg/reconciler/taskrun/taskrun.go	86.6%	87.2%	0.6

tekton-robot · 2024-02-14T19:28:42Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/apis/config/default.go	91.2%	91.9%	0.7
pkg/reconciler/taskrun/taskrun.go	86.6%	87.2%	0.6

JeromeJu

Thanks @pritidesai

tekton-robot · 2024-02-15T17:20:04Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JeromeJu, skaegi, vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [JeromeJu,vdemeester]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

JeromeJu · 2024-02-15T17:24:29Z

/lgtm
Thanks for supporting this @pritidesai

We have implemented imagePullBackOff as fail fast. The issue with this approach is that the node where the pod is scheduled often hits a registry rate limit. The image pull failure caused by the rate limit returns the same warning (reason: Failed and message: ImagePullBackOff). The pod can potentially recover after waiting long enough for the limit to expire. Kubernetes can then successfully pull the image and bring the pod up.

Introducing a default configuration to specify a cluster-level timeout to allow the imagePullBackOff to retry for a given duration. Once that duration has passed, return a permanent failure.

tektoncd#5987
tektoncd#7184

This is a manual cherry-pick of tektoncd#7666

Signed-off-by: Priti Desai <[email protected]>

afrittoli

Thanks @pritidesai
Just a couple of minor things, maybe for a follow up PR

afrittoli · 2024-02-15T16:53:03Z

docs/additional-configs.md

+  name: config-defaults
+  namespace: tekton-pipelines
+data:
+  default-imagepullbackoff-timeout: "5"


The example needs to be updated to 5m as well

thanks @afrittoli - ptal - #7679. Thanks!

afrittoli · 2024-02-15T16:58:18Z

pkg/reconciler/taskrun/taskrun.go

+					if imagePullBackOffTimeOut.Seconds() != 0 {
+						p, err := c.KubeClientSet.CoreV1().Pods(tr.Namespace).Get(ctx, tr.Status.PodName, metav1.GetOptions{})
+						if err != nil {
+							message := fmt.Sprintf(`The step %q in TaskRun %q failed to pull the image %q and the pod with error: "%s."`, step.Name, tr.Name, step.ImageID, err)


NIT: error messages should start in lowercase

thanks @afrittoli - ptal - #7679. Thanks!

afrittoli · 2024-02-15T17:54:07Z

pkg/reconciler/taskrun/taskrun.go

+				if sidecar.Waiting.Reason == ImagePullBackOff {
+					imagePullBackOffTimeOut := config.FromContextOrDefaults(ctx).Defaults.DefaultImagePullBackOffTimeout
+					// only attempt to recover from the imagePullBackOff if specified
+					if imagePullBackOffTimeOut.Seconds() != 0 {
+						p, err := c.KubeClientSet.CoreV1().Pods(tr.Namespace).Get(ctx, tr.Status.PodName, metav1.GetOptions{})
+						if err != nil {
+							message := fmt.Sprintf(`The sidecar %q in TaskRun %q failed to pull the image %q and the pod with error: "%s."`, sidecar.Name, tr.Name, sidecar.ImageID, err)
+							return true, v1.TaskRunReasonImagePullFailed, message
+						}
+						for _, condition := range p.Status.Conditions {
+							// check the pod condition to get the time when the pod was scheduled
+							// keep trying until the pod schedule time has exceeded the specified imagePullBackOff timeout duration
+							if condition.Type == corev1.PodScheduled {
+								if c.Clock.Since(condition.LastTransitionTime.Time) < imagePullBackOffTimeOut {
+									return false, "", ""
+								}
+							}
+						}
+					}
+				}


NIT: this might be a reusable function instead of having the same code twice

There is a minor difference: the two blocks operate on different data types. It's a little tricky because one holds a reference to a step while the other holds a reference to a sidecar.

pritidesai · 2024-02-15T18:28:29Z

Thank you @vdemeester @skaegi @JeromeJu @afrittoli for the reviews!

We just upgraded our deployment to 0.53 and would like to cherry pick this in 0.53 and 0.56. Thoughts?

pritidesai · 2024-02-15T18:59:48Z

/cherry-pick release-v0.56.x

tekton-robot · 2024-02-15T19:00:33Z

@pritidesai: new pull request created: #7678

In response to this:

/cherry-pick release-v0.56.x

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

tekton-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. labels Feb 14, 2024

tekton-robot requested review from afrittoli and bobcatfish February 14, 2024 00:05

tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 14, 2024

vdemeester reviewed Feb 14, 2024

afrittoli self-assigned this Feb 14, 2024

skaegi reviewed Feb 14, 2024

pritidesai force-pushed the imagepullbackoff-1 branch from 5d7e065 to 78a1b3d Compare February 14, 2024 19:02

tekton-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 14, 2024

pritidesai force-pushed the imagepullbackoff-1 branch from 78a1b3d to 940e91a Compare February 14, 2024 19:05

pritidesai force-pushed the imagepullbackoff-1 branch from 940e91a to 9bf76db Compare February 14, 2024 19:21

JeromeJu self-assigned this Feb 14, 2024

skaegi approved these changes Feb 14, 2024

pritidesai added this to the Pipelines v0.57 milestone Feb 15, 2024

vdemeester approved these changes Feb 15, 2024

tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 15, 2024

JeromeJu reviewed Feb 15, 2024

JeromeJu approved these changes Feb 15, 2024

tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 15, 2024

tekton-robot merged commit fd17c74 into tektoncd:main Feb 15, 2024
13 checks passed

afrittoli reviewed Feb 15, 2024

pritidesai mentioned this pull request Feb 15, 2024

[release-v0.53.x] wait for a given duration in case of imagePullBackOff #7677

Merged

8 tasks

tekton-robot mentioned this pull request Feb 15, 2024

[release-v0.56.x] Allow imagePullBackOff for the specified duration #7678

Merged

This was referenced Jul 1, 2024

Configurable grace period for TaskRun pods in ImagePullBackOff #5987

Closed

Tekton shouldn't fail pipelinerun/taskrun for kubernetes container starting warning's. #7184

Closed
