When does a timeout counter actually start? #4078
Given a Task with a timeout set to 30m, and given the Pod created by the TaskRun goes into status Pending due to missing resources and exceeds those 30m: will this cause a timeout to be triggered?
We believe we are currently facing this issue in some of our Pods. But we also think the timeout counter shouldn't start before the Pod's status is Running.
Can someone explain? Thanks!
/kind question |
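For illustration, a minimal TaskRun matching this scenario might look like the following (the names are placeholders; only the 30m timeout matters here):

```yaml
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: example-run          # placeholder name
spec:
  taskRef:
    name: some-long-task     # placeholder: any Task
  timeout: "30m"             # the timeout in question
```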
Hey @badamowicz ! I think what you're describing is right: the timeout countdown starts as soon as the "StartTime" is set in the TaskRun/PipelineRun status, which is the moment at which the controller starts reconciling the TaskRun. What you're describing would be an alternative approach, but it would have the downside of never timing out a pod that can't be scheduled. E.g. imagine you've created a TaskRun which wants to use a volume that doesn't exist: with the current logic, that would eventually time out, but if the timeout only started once the pod was running, it would never time out. It's possible we could add the functionality you're describing (and in fact we seem to have a trend toward more and more fine-grained control over timeouts: https://github.com/tektoncd/community/blob/main/teps/0046-finallytask-execution-post-timeout.md). At the very least we should probably explain this in the docs 🤔 https://github.com/tektoncd/pipeline/blob/main/docs/pipelineruns.md#configuring-a-failure-timeout
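To make that concrete, here is roughly what the relevant bit of TaskRun status looks like once reconciliation has started (the values are illustrative): the countdown is measured from `startTime`, even while the pod is still Pending.

```yaml
status:
  startTime: "2021-07-12T09:00:00Z"   # set when the controller starts reconciling (illustrative value)
  conditions:
    - type: Succeeded
      status: Unknown
      reason: Pending                 # the pod hasn't started running yet
```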
Are you able to provide more info about what these pods are waiting for? It'd help to understand more about your system and why you want to allow pods to wait for some resources to become available before running (most of the design choices we've made assume that when you create a TaskRun, you want it to run immediately vs. eventually when everything it needs is ready). |
Hi @bobcatfish!
We may indeed have a rather special behaviour in our pipelines. What is actually happening is this: one of our PipelineRuns fans out into about 35 TaskRuns that execute in parallel.
Now those 35 parallel TaskRuns cause a huge peak of CPU and memory consumption on the cluster's nodes and may eventually force Pods of other PipelineRuns into status Pending. This is expected behaviour and OK for us. However, we had to increase all timeouts to avoid those pending Pods being killed due to timeout. So for use cases like this it could be nice to have a configuration option like in this example:
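Something along these lines (the field name is completely made up, just to illustrate the idea of not counting Pending time against the timeout):

```yaml
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: load-test-run            # placeholder
spec:
  taskRef:
    name: load-test              # placeholder
  timeout: "30m"
  # made-up field, for illustration only: don't start counting the
  # timeout until the pod has left the Pending phase
  timeoutStartsAfterPending: true
```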
What do you think? |
Thanks for explaining more about your use case! I'm a bit hesitant about adding this b/c I feel like there are going to be other scenarios where folks will want some variation on this logic. For example, maybe folks want the behavior you are describing, but only when the pod is waiting for resources to be scheduled, and not in other cases (e.g. when a volume a pod needs hasn't been created, or when an image pull fails). That makes me wonder if this is the kind of thing it would make sense to solve by adding an optional component vs. adding to the existing API: e.g. you could run your PipelineRuns with timeouts off, and have a separate controller that watches PipelineRuns in your system and uses whatever logic it wants to decide timeouts (e.g. ignoring Pending). I feel bad suggesting this tho b/c writing your own controller isn't the easiest ask. I think it should be easier in the long run, and we're taking some small steps toward this; for example, the proposal to add a custom task SDK (tektoncd/community#461) involves making it easier to write a controller, and there are also example controllers in our experimental repo (e.g. https://github.com/tektoncd/experimental/tree/main/task-loops). The other thing that comes to mind is that this request reminds me of TEP-0015, which was all about adding a "pending" status for PipelineRuns. IIRC the motivation was to allow folks to create as many PipelineRuns as they want, but have something else observe the cluster and decide when they should actually run (i.e. when resources were actually available), which I think is similar to what you're describing. Pursuing that option would also involve writing a custom controller 😬 Curious what other folks think too @tektoncd/core-maintainers - I'll put this on the agenda for the next API working group as well to get some more perspectives |
Great! Thanks for discussing this issue so deeply! Looking forward to the results of the working group. |
Hey @badamowicz - we discussed this in the API working group today (notes and link to the recording). I'd say sentiments were on the fence between adding an 'execution timeout' feature like you're describing and noting that it feels like the core problem here is more around scheduling - very similar to the problem described in TEP-0015. Speaking of which - @jbarrick-mesosphere @vincent-pli @FogDong - have you been able to leverage the functionality added in #3522 to create logic around running Pending pipelines when resources are available? If so, it'd be great to learn more; it might be that that would help with @badamowicz 's use case here as well 🙏 (long story short @badamowicz I think we're still hoping we can avoid adding this feature but I'd say it's not entirely off the table!) |
I'd like to compare what others have in their timeout configuration. For example, Prow supports three different timeout configurations.
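From what I remember of Prow's plank configuration, they look roughly like this (the key names and the example values should be double-checked against the Prow docs):

```yaml
plank:
  pod_pending_timeout: 15m      # how long a pod may sit in Pending before the job is aborted
  pod_unscheduled_timeout: 5m   # how long a pod may stay unscheduled
  pod_running_timeout: 48h      # overall limit once the pod is running
```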
I don't really have any point to make, just providing some information here. |
Hi @bobcatfish ! The example/solution provided by @zhouhaibing089 looks great to me. Having such configuration options, at least on TaskRuns, would be a highly elegant solution in my opinion. Will get back as soon as we've discussed #3522. Thanks! |
Hi @bobcatfish ! I think this issue is also a bit related to #699, though they do not discuss resources there. We'd like to keep this issue going, and our idea is to file a detailed feature request. What do you (and others) think? |
Thanks for sharing that info about prow @zhouhaibing089 ! one thing that's interesting about that is that from what I can tell, that configuration isn't specified on each individual job (i.e. you can't have one job with a different PodPendingTimeout than another) but globally - which kinda aligns with my suggestion that instead of adding this functionality to Pipelines, one could add another component (controller) that observes running pipelines and makes timeout related decisions.
I'm still hoping we can avoid adding this at the Pipeline level but I'm definitely open to it. I've added this to the agenda for the next API working group to discuss again; would like to hear more from other @tektoncd/core-maintainers on their thoughts. One possible compromise: I wonder if a controller-level config option could do the trick, which would make it so that timeouts are only applied after scheduling? I'm guessing that wouldn't totally work for you though, because you still DO want the TaskRun to time out eventually even if it can't be scheduled? |
Yes! In our particular use case this would definitely help. That's about what I've described here. However, I'm aware that this also has the drawback that a Pending status caused by other reasons (e.g. a missing volume) would no longer be recognized - as you mentioned. But as a first step towards a "big" solution which will also cover other use cases, this would be very helpful for us. But we'll await the results of the next API working group first. Thanks! |
Another option might be to |
@sbwsg no I'm afraid we would have to implement timeouts on the test-runner layer by ourselves. |
Oh yeah that's a great idea too @sbwsg
Really appreciate your patience and understanding around this @ghglza @badamowicz !! Building on the idea of having controller-level timeouts, maybe we could add something like a "global pending timeout": if it's set, we'd have the behavior you've requested, i.e. the TaskRun timeout timer wouldn't start until the pod went from Pending to Running (I think that's the right point based on pod phases). While the pod is Pending, the pending timeout would be applied, and once the pod is Running, the TaskRun timer would start. This is assuming that the pending timeout is global and not something we want to configure on a TaskRun/PipelineRun basis (which I'm thinking is probably a fair assumption, since this is about cluster-wide conditions?) |
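As a very rough sketch only (the pending-timeout key does not exist and is made up for illustration; `default-timeout-minutes` is the existing key), such a cluster-wide setting could live next to the other defaults:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-defaults
  namespace: tekton-pipelines
data:
  default-timeout-minutes: "60"           # existing cluster-wide default for run timeouts
  # hypothetical, made up for illustration: applied while the pod is Pending;
  # the normal TaskRun timeout would only start once the pod is Running
  default-pending-timeout-minutes: "30"
```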
@bobcatfish that sounds great! |
I basically agree with @ghglza. However, I can see there were also other options discussed in the last API working group meeting. @bobcatfish: How do you intend to proceed with this issue in general? Is there still an ongoing internal discussion? Will we have to file a feature request? Thanks! |
Hey @badamowicz apologies for the delay! The discussion at the API working group last week wasn't super conclusive but it seems like having some controller level configuration around this is something that could work. I've sent an email to tekton-dev to try to get a few more eyes on this and make sure I don't recommend something to you that other folks strongly disagree with 🤞 (also I'm about to go on vacation for 2 weeks but I'll make sure to follow up on this when I'm back! in the meantime also please feel free to join our API working group if any of the times work for you - https://github.com/tektoncd/community/blob/main/working-groups.md#api - and/or continue the discussion on slack ) Thanks again for your patience 🙏 |
You're welcome! We also really appreciate the deep investigation you're putting into this issue. Have a nice holiday! |
Thanks @badamowicz ! :D it was nice to have some time away. So, quick update: there were no strong replies to the thread in tekton-dev, so it seems like adding some kind of controller-level "pending timeout" could be an acceptable solution. If that's something you think might work for you, the next step for proposing a feature is that someone will need to create a TEP describing the proposal. |
Hello again @badamowicz and everyone - in a surprising turn of events, I'd actually like to go back on what I said and advocate for what @badamowicz was asking for all along 🙏 namely being able to configure this on a per-run level. The reason for this is that we've been exploring how to achieve parity between Google Cloud Build (GCB) and Tekton, i.e. how to implement builds configured with GCB using Tekton types, and we had missed queueTtl - which seems to roughly map to the same concept as the pending timeout @badamowicz has requested. I'll follow up on the mailing list and in the API working group as well 🙏 sorry for EVEN MORE back and forth here |
no is temporary, yes is forever 😝 |
This TEP addresses tektoncd/pipeline#4078 by proposing that at runtime `TaskRuns` can be configured with a new timeout that applies before they actually start executing - and only after that are the existing timeouts applied. This is meant to make it possible to control when our existing timeout feature kicks in in resource-constrained environments, and also to get parity with [Google Cloud Build's existing queueTtl feature](https://cloud.google.com/build/docs/build-config-file-schema#queuettl).
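Roughly the shape being proposed, sketched with a made-up field name (see the TEP itself for the actual proposed API):

```yaml
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: queued-run               # placeholder
spec:
  taskRef:
    name: my-task                # placeholder
  # hypothetical field, analogous to GCB's queueTtl: how long the run may
  # wait before it starts executing; only after execution starts does the
  # existing timeout below begin counting
  pendingTimeout: "20m"
  timeout: "30m"
```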
Okay, after that big delay I've finally got the TEP together @badamowicz ! It's at tektoncd/community#540 - please let me know what you think (I also included you as an author, but np if you'd rather I remove that XD). I'm very keen to know whether this will meet your needs or not, particularly since some details are a bit different from your initial proposal - for example, you mentioned starting the existing timeout only after a pod is Running. |
That's great news @bobcatfish ! The TEP looks OK for us. Looking forward to the release containing the new timeout features! Thanks for putting so much effort into this issue! And it's a great honor to be mentioned as an author! |
My team was having a similar issue. We did not expect a Task timeout to start while the Task was in the Pending state. There is a Pipeline timeout that is already effectively a "global" timeout. Until the TEP is implemented and released, we will be handling the Task timeout using a sidecar. We already have a sidecar for progress monitoring and hang checks; adding a task timeout to it will not be difficult. |
Issues go stale after 90d of inactivity. /lifecycle stale Send feedback to tektoncd/plumbing. |
Stale issues rot after 30d of inactivity. /lifecycle rotten Send feedback to tektoncd/plumbing. |
Rotten issues close after 30d of inactivity. /close Send feedback to tektoncd/plumbing. |
@tekton-robot: Closing this issue in response to the /close above. |
whoops this shouldn't have gone rotten! still being addressed (slowly) in TEP-0092 /reopen |
@bobcatfish: Reopened this issue in response to the /reopen above. |
/assign @EmmaMunley |
@badamowicz @ghglza @kscherer - we are revisiting this problem and wanted to check in to see if you have any additional considerations. Please review the update to TEP-0092 from @EmmaMunley: tektoncd/community#948 |