When does a timeout counter actually start? #4078

Open
badamowicz opened this issue Jul 7, 2021 · 33 comments
Labels
area/api Indicates an issue or PR that deals with the API. kind/documentation Categorizes issue or PR as related to documentation. kind/question Issues or PRs that are questions around the project or a particular feature lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@badamowicz

Given a Task with a timeout set to 30m, and given that the Pod created by the TaskRun goes to status Pending due to missing resources and exceeds those 30m: will this cause a timeout to be triggered?
We believe we are currently facing this issue in some of our Pods. But we also think the timeout counter shouldn't start before the Pod's status is Running.
Can someone explain? Thanks!

@ghost

ghost commented Jul 7, 2021

/kind question

@tekton-robot tekton-robot added the kind/question Issues or PRs that are questions around the project or a particular feature label Jul 7, 2021
@bobcatfish
Collaborator

Hey @badamowicz ! I think what you're describing is right: the timeout countdown starts as soon as the "StartTime" is set in the TaskRun/PipelineRun status, which would be the moment at which the Controller starts reconciling the TaskRun.

What you're describing would be an alternative approach, but it would have the downside of never timing out a pod that can't be scheduled - e.g. imagine you've created a TaskRun which wants to use a volume that doesn't exist; with the current logic, that would eventually time out, but if the timeout waited for the pod to start running, it would never time out. It's possible we could add the functionality you're describing (and in fact we seem to have a trend toward more and more fine-grained control over timeouts: https://github.com/tektoncd/community/blob/main/teps/0046-finallytask-execution-post-timeout.md)

(At the very least we should probably explain this in the docs 🤔 https://github.com/tektoncd/pipeline/blob/main/docs/pipelineruns.md#configuring-a-failure-timeout)
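
To make this concrete, here's a minimal sketch (values are illustrative) of where those two pieces live on a TaskRun - the timeout sits in spec.timeout, and the controller sets status.startTime when it first reconciles the run, even while the pod is still Pending:

apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: example-taskrun
spec:
  taskRef:
    name: some-task
  timeout: "30m"                      # the countdown is measured from status.startTime
status:
  startTime: "2021-07-07T10:00:00Z"   # set on first reconcile, not when the pod reaches Running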

We believe we are currently facing this issue in some of our Pods. But we also think the timeout counter shouldn't start before the Pod's status is Running.

Are you able to provide more info about what these pods are waiting for? It'd help to understand more about your system and why you want to allow pods to wait for some resources to become available before running (most of the design choices we've made assume that when you create a TaskRun, you want it to run immediately vs. eventually when everything it needs is ready).

@bobcatfish bobcatfish added area/api Indicates an issue or PR that deals with the API. kind/documentation Categorizes issue or PR as related to documentation. triage/needs-information Indicates an issue needs more information in order to work on it. labels Jul 8, 2021
@badamowicz
Author

Hi @bobcatfish!

Are you able to provide more info about what these pods are waiting for? It'd help to understand more about your system and why you want to allow pods to wait for some resources to become available before running

Our pipelines may indeed behave in a rather special way. What actually happens is this:

  1. Build the artifacts of a large Java application.
  2. Then start some 35 Tasks in parallel, each performing Selenium tests against the deployed application. Each of these tasks consists of:
     - a Pod with a WebSphere Liberty server along with the deployed application
     - an Oracle pod as a sidecar
     - a Pod with a mock server as a sidecar
  3. Do some cleanup work.

Now those 35 parallel TaskRuns cause a huge peak in CPU and memory consumption on the cluster's nodes and may eventually force Pods of other PipelineRuns into the Pending status. This is expected behaviour and OK for us. However, we had to increase all timeouts to avoid those pending Pods being killed due to timeouts.

So for use cases like this it could be nice to have a configuration option like in this example:

- name: some-taskrun
  taskRef:
    name: some-task
  timeout:
    value: 30m
    ignorePending: 'true'

What do you think?

@bobcatfish bobcatfish removed the triage/needs-information Indicates an issue needs more information in order to work on it. label Jul 9, 2021
@bobcatfish
Collaborator

bobcatfish commented Jul 9, 2021

thanks for explaining more about your use case!

I'm a bit hesitant about adding this because I feel like there are going to be other scenarios folks will run into where they'll want some variation on this logic - for example, maybe folks want the behavior you're describing, but only when the pod is waiting for resources to be scheduled, and not in other cases (e.g. when a volume a pod needs hasn't been created, or when an image pull fails).

That makes me wonder if this is the kind of thing it would make sense to solve by adding an optional component rather than adding to the existing API - e.g. you could run your PipelineRuns with timeouts off and have a separate controller that watches PipelineRuns in your system and uses whatever logic it wants to decide timeouts (e.g. ignoring Pending).

I feel bad suggesting this though, because writing your own controller isn't the easiest ask - I think it should be easier in the long run, and we're taking some small steps toward that; for example, the proposal to add a custom task SDK (tektoncd/community#461) involves making it easier to write a controller, and there are also example controllers in our experimental repo (e.g. https://github.com/tektoncd/experimental/tree/main/task-loops).

The other thing that comes to mind is that this request reminds me of TEP-0015, which was all about adding a "pending" status for PipelineRuns. IIRC the motivation was to allow folks to create as many PipelineRuns as they want, but have something else observe the cluster and decide when they should actually run (i.e. when resources were actually available - which I think is similar to what you're describing). Pursuing that option would also involve writing a custom controller 😬
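
For reference, this is roughly what that "pending" mechanism looks like on a PipelineRun (a sketch; an external controller would be responsible for clearing the field once resources are available):

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: example-pipelinerun
spec:
  pipelineRef:
    name: some-pipeline
  status: PipelineRunPending   # the run is created but held; remove this value to let it start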

Curious what other folks think too @tektoncd/core-maintainers - I'll put this on the agenda for the next API working group as well to get some more perspectives.

@badamowicz
Author

I'll put this on the agenda for the next API working group as well to get some more perspectives

Great! Thanks for discussing this issue so deeply! Looking forward to the results of the working group.

@bobcatfish bobcatfish self-assigned this Jul 12, 2021
@bobcatfish
Collaborator

Hey @badamowicz - we discussed this in the API working group today (notes and link to the recording). I'd say sentiments were on the fence between adding an 'execution timeout' feature like the one you're describing and noting that the core problem here feels like it's more about scheduling - very similar to the problem described in TEP-0015.

Speaking of which - @jbarrick-mesosphere @vincent-pli @FogDong - have you been able to leverage the functionality added in #3522 to create logic around running Pending pipelines when resources are available? If so, it'd be great to learn more; it might help with @badamowicz's use case here as well 🙏

(Long story short, @badamowicz, I think we're still hoping we can avoid adding this feature, but I'd say it's not entirely off the table!)

@zhouhaibing089
Contributor

I'd like to compare what others have in their timeout configuration. For example, in Prow:

https://github.com/kubernetes/test-infra/blob/49f933a3eda05d96589b6d97e0fce10df952ac6d/prow/config/config.go#L491-L499

It supports three different timeout configurations:

  1. PodUnscheduledTimeout
  2. PodPendingTimeout
  3. PodRunningTimeout

I don't really have any point to make, just providing some information here.
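
For illustration, those settings live in Prow's controller-level config and might look something like this (key names taken from the linked config.go; the durations are made up):

plank:
  pod_unscheduled_timeout: 5m   # pod was never scheduled onto a node
  pod_pending_timeout: 10m      # pod scheduled but still Pending (e.g. pulling images)
  pod_running_timeout: 48h      # pod is Running but the job hasn't finished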

@badamowicz
Author

Hi @bobcatfish !
I'm not sure if #3522 would really help us. However, we will for sure discuss and test this in our team.

The example/solution provided by @zhouhaibing089 looks great to me. Having such configuration options, at least on TaskRuns, would be a highly elegant solution in my opinion.

Will get back as soon as we've discussed #3522. Thanks!

@badamowicz
Author

Hi @bobcatfish !
We've discussed #3522 in our team. But we think this is - at least not yet - a solution for us, because it would require a CRD observing the cluster's resources and scheduling the PipelineRuns accordingly. This would essentially mimic core K8s behaviour.

I think this issue is also a bit related to #699, though they do not discuss resources there.

We'd like to keep this issue going and our idea is to file a detailed feature request. What do you (and others) think?

@bobcatfish
Collaborator

Thanks for sharing that info about Prow @zhouhaibing089! One thing that's interesting about it is that, from what I can tell, that configuration isn't specified on each individual job (i.e. you can't have one job with a different PodPendingTimeout than another) but globally - which kinda aligns with my suggestion that instead of adding this functionality to Pipelines, one could add another component (controller) that observes running pipelines and makes timeout-related decisions.

We'd like to keep this issue going and our idea is to file a detailed feature request. What do you (and others) think?

I'm still hoping we can avoid adding this at the Pipeline level, but I'm definitely open to it. I've added this to the agenda for the next API working group to discuss again; I'd like to hear more from other @tektoncd/core-maintainers on their thoughts.

One possible compromise: I wonder if a controller-level config option could do the trick, which would make it so that timeouts are only applied after scheduling? I'm guessing that wouldn't totally work for you, though, because you still DO want the TaskRun to time out eventually even if it can't be scheduled?
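
For concreteness, such a switch could live next to the existing timeout default in the config-defaults ConfigMap - a rough sketch only; the second key below is hypothetical and not an implemented option:

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-defaults
  namespace: tekton-pipelines
data:
  default-timeout-minutes: "60"            # existing option
  # hypothetical: only start the timeout countdown once the pod has been scheduled
  start-timeout-after-scheduling: "true"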

@badamowicz
Author

badamowicz commented Jul 15, 2021

One possible compromise: I wonder if a controller-level config option could do the trick, which would make it so that timeouts are only applied after scheduling?

Yes! In our particular use case this would definitely help. That's about what I've described here. However, I'm aware that this also has the drawback that a Pending status caused by other reasons (e.g. a missing volume) would no longer be recognized - as you mentioned. But as a first step towards a "big" solution which will also cover other use cases, this would be very helpful for us.

But we will await the results of the next API working group first. Thanks!

@ghost

ghost commented Jul 15, 2021

Another option might be to parameterize the test-running task, passing a limit through as a param to the code executing the test suite. It wouldn't be Tekton-enforced, but this would be a way of having very fine-grained control over the precise meaning of "timeout" while also keeping the value of that timeout in the YAML. Does your test runner support this as a configuration option? I wonder whether there are any reasons to prefer a Tekton-enforced execution timeout over the test runner's own.
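
As a sketch of that idea (the task name, image, and script are placeholders - the point is that the test runner, not Tekton, enforces the limit):

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: run-selenium-tests
spec:
  params:
    - name: test-timeout
      type: string
      default: "30m"
  steps:
    - name: run-tests
      image: example.com/selenium-runner:latest   # placeholder image
      script: |
        #!/bin/sh
        # hand the limit to the test runner itself; Tekton does not enforce it
        timeout "$(params.test-timeout)" ./run-test-suite.sh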

@ghglza

ghglza commented Jul 16, 2021

@sbwsg No, I'm afraid we would have to implement timeouts at the test-runner layer ourselves.
This (and the same goes for building a dedicated controller) would, from our perspective, essentially convert a CI/CD management project into a software development effort, which is unfortunately out of scope for us in terms of capacity and lifecycle management.
I would dare to speculate that this is similar for most users of Tekton, so some finer-grained control of timeouts without having to code would IMHO be a very welcome feature!
But of course this is one of the areas where hitting the sweet spot between API- and code-bloat within Tekton vs. duplicated coding requirements on the side of the users is really tricky... if I could choose, I'd vote for timeout categories similar to Prow's.

@bobcatfish
Collaborator

Oh yeah that's a great idea too @sbwsg

But of course this is one of the areas where hitting the sweet spot between API- and code-bloat within Tekton vs. duplicated coding requirements on the side of the users is really tricky

Really appreciate your patience and understanding around this @ghglza @badamowicz !!

Building on the idea of having controller-level timeouts, maybe we could add something like a "global pending timeout". If it's set, we'd have the behavior you've requested: the TaskRun timeout timer wouldn't start until the pod went from Pending to Running (I think that's the right point based on pod phases). During that window the pending timeout would be applied, and once the pod is Running, the TaskRun timer would start.

This is assuming that the pending timeout is global and not something we want to configure on a per-TaskRun/PipelineRun basis (which I'm thinking is probably a fair assumption, since this is about cluster-wide conditions?).
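
Sketching the behaviour (the config key is hypothetical; only the pod phases are real):

# config-defaults, hypothetical key:
#   default-pending-timeout-minutes: "20"
#
# for a TaskRun with spec.timeout: "30m":
#   while the pod is Pending  -> the 20m pending timeout applies
#   once the pod is Running   -> the 30m TaskRun timeout starts counting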

@ghglza

ghglza commented Jul 21, 2021

@bobcatfish that sounds great!
For me it also seems that a "pending timeout" does not belong in pipeline-specific settings -- IMHO it would (at least as a start) even suffice if this timeout were configurable only at controller startup.

@badamowicz
Author

I basically agree with @ghglza. However, I can see there were also other options discussed in the last API working group meeting.

@bobcatfish: How do you intend to proceed with this issue in general? Is there still an ongoing internal discussion? Will we have to file a feature request?

Thanks!

@bobcatfish
Collaborator

Hey @badamowicz, apologies for the delay! The discussion at the API working group last week wasn't super conclusive, but it seems like having some controller-level configuration around this could work. I've sent an email to tekton-dev to try to get a few more eyes on this and make sure I don't recommend something to you that other folks strongly disagree with 🤞

(Also, I'm about to go on vacation for 2 weeks, but I'll make sure to follow up on this when I'm back! In the meantime, please feel free to join our API working group if any of the times work for you - https://github.com/tektoncd/community/blob/main/working-groups.md#api - and/or continue the discussion on Slack.)

Thanks again for your patience 🙏

@badamowicz
Author

badamowicz commented Jul 27, 2021

Thanks again for your patience

You're welcome! We also do appreciate the deep investigation you're putting into this issue. Have a nice holiday!

@bobcatfish
Collaborator

Thanks @badamowicz ! :D it was nice to have some time away

So, quick update: there were no strong replies to the thread in tekton-dev, so it seems like adding some kind of controller-level "pending timeout" could be an acceptable solution. If that's something you think might work for you, the next step for proposing a feature is for someone to create a TEP describing the proposal.

@bobcatfish bobcatfish removed their assignment Aug 16, 2021
@bobcatfish
Collaborator

Hello again @badamowicz and everyone - in a surprising change of events, I'd actually like to go back on what I said and advocate for what @badamowicz was asking for all along 🙏 namely being able to configure this at a per-run level. The reason is that we've been exploring how to have parity between Google Cloud Build (GCB) and Tekton, i.e. how to implement builds configured with GCB using Tekton types, and we had missed queueTtl - which seems to roughly map to the same concept as the pending timeout @badamowicz has requested.
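
(For reference, queueTtl in a GCB build config looks roughly like this - a sketch based on the linked schema docs, with a trivial placeholder step:)

steps:
  - name: 'ubuntu'
    args: ['echo', 'hello']
queueTtl: '3600s'   # how long the build may sit in the queue before it is expired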

I'll follow up on the mailing list and in the API working group as well 🙏 sorry for EVEN MORE back and forth here

@vdemeester
Member

I'll follow up on the mailing list and in the API working group as well sorry for EVEN MORE back and forth here

no is temporary, yes is forever 😝

bobcatfish added a commit to bobcatfish/community that referenced this issue Oct 14, 2021
This TEP addresses tektoncd/pipeline#4078 by
proposing that at runtime `TaskRuns` can be configured with a new
timeout that applies before they actually start executing - and only
after that are the existing timeouts applied.

This is meant to make it possible to control when our existing timeout
feature kicks in within resource-constrained environments, and also to get parity
with [Google Cloud Build's existing queueTtl feature](https://cloud.google.com/build/docs/build-config-file-schema#queuettl).
@bobcatfish
Collaborator

Okay, after that big delay I've finally got the TEP together @badamowicz! It's at tektoncd/community#540 - please let me know what you think (I also included you as an author, but no problem if you'd rather I remove that XD). I'm very keen to know whether or not this will meet your needs, particularly since some details are a bit different from your initial proposal - for example, you mentioned starting the existing timeout only after a pod is Running, whereas I propose using the Pod condition PodScheduled - which means that in my proposal the time to pull images would be counted as part of the execution time.
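
To illustrate the difference (the pendingTimeout field below is purely hypothetical and only meant to show the idea - see the TEP for the actual design):

apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: example-taskrun
spec:
  taskRef:
    name: some-task
  timeout: "30m"          # would start counting once the pod's PodScheduled condition is true
  # hypothetical field, not the final API: a separate limit that applies
  # before the pod has been scheduled
  pendingTimeout: "10m"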

@badamowicz
Author

badamowicz commented Oct 15, 2021

That's great news @bobcatfish! The TEP looks OK to us. Looking forward to the release containing the new timeout features! Thanks for putting so much effort into this issue!

And it's a great honor to be mentioned as an author!

bobcatfish added a commit to bobcatfish/community that referenced this issue Oct 15, 2021
@kscherer
Contributor

My team was having a similar issue. We did not expect a Task timeout to start while the Task was in the Pending state. There is already a Pipeline timeout that effectively acts as a "global" timeout.

Until the TEP is implemented and released, we will handle the Task timeout using a sidecar. We already have a sidecar for progress monitoring and hang checks; adding a task timeout to it will not be difficult.

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 27, 2022
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 26, 2022
@tekton-robot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@bobcatfish
Collaborator

whoops this shouldn't have gone rotten! still being addressed (slowly) in TEP-0092

/reopen
/lifecycle frozen

@tekton-robot tekton-robot reopened this Apr 11, 2022
@tekton-robot
Collaborator

@bobcatfish: Reopened this issue.

In response to this:

whoops this shouldn't have gone rotten! still being addressed (slowly) in TEP-0092

/reopen
/lifecycle frozen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot tekton-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Apr 11, 2022
bobcatfish added a commit to bobcatfish/community that referenced this issue Apr 11, 2022
bobcatfish added a commit to bobcatfish/community that referenced this issue Apr 15, 2022
tekton-robot pushed a commit to tektoncd/community that referenced this issue Apr 18, 2022
@EmmaMunley
Contributor

/assign @EmmaMunley

@jerop
Member

jerop commented Feb 6, 2023

@badamowicz @ghglza @kscherer - we are revisiting this problem and wanted to check in to see whether you have any additional considerations; please review the update to TEP-0092 from @EmmaMunley: tektoncd/community#948

cc @vdemeester @pritidesai @bobcatfish
