When does a timeout counter actually start? #4078

Open
badamowicz opened this issue Jul 7, 2021 · 33 comments
Labels
area/api Indicates an issue or PR that deals with the API. kind/documentation Categorizes issue or PR as related to documentation. kind/question Issues or PRs that are questions around the project or a particular feature lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@badamowicz

Given a Task with a timeout set to 30m, and given that the Pod created by the TaskRun goes to status Pending due to missing resources and exceeds those 30m: will this cause a timeout to be triggered?
We believe we are currently facing this issue in some of our Pods. But we also think the timeout counter shouldn't start before the Pod's status is Running.
Can someone explain? Thanks!

@ghost

ghost commented Jul 7, 2021

/kind question

@tekton-robot tekton-robot added the kind/question Issues or PRs that are questions around the project or a particular feature label Jul 7, 2021
@bobcatfish
Collaborator

Hey @badamowicz ! I think what you're describing is right: the timeout countdown starts as soon as the "StartTime" is set in the TaskRun/PipelineRun status, which would be the moment at which the Controller starts reconciling the TaskRun.

What you're describing would be an alternative approach, but it would have the downside of never timing out a pod that can't be scheduled - e.g. imagine you've created a TaskRun which wants to use a volume that doesn't exist; with the current logic, that would eventually time out, but if the timeout waited for the pod to start running, it would never time out. It's possible we could add the functionality you're describing (and in fact we seem to have a trend toward more and more fine-grained control over timeouts: https://github.com/tektoncd/community/blob/main/teps/0046-finallytask-execution-post-timeout.md)

(At the very least we should probably explain this in the docs 🤔 https://github.com/tektoncd/pipeline/blob/main/docs/pipelineruns.md#configuring-a-failure-timeout)
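
To make this concrete, here's a minimal sketch (values are illustrative) of where those two pieces live on a TaskRun - the timeout sits in spec.timeout, and the controller sets status.startTime when it first reconciles the run, even while the pod is still Pending:

apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: example-taskrun
spec:
  taskRef:
    name: some-task
  timeout: "30m"                      # the countdown is measured from status.startTime
status:
  startTime: "2021-07-07T10:00:00Z"   # set on first reconcile, not when the pod reaches Running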

We believe we are currently facing this issue in some of our Pods. But we also think the timeout counter shouldn't start before the Pod's status is Running.

Are you able to provide more info about what these pods are waiting for? It'd help to understand more about your system and why you want to allow pods to wait for some resources to become available before running (most of the design choices we've made assume that when you create a TaskRun, you want it to run immediately vs. eventually when everything it needs is ready).

@bobcatfish bobcatfish added area/api Indicates an issue or PR that deals with the API. kind/documentation Categorizes issue or PR as related to documentation. triage/needs-information Indicates an issue needs more information in order to work on it. labels Jul 8, 2021
@badamowicz
Author

Hi @bobcatfish!

Are you able to provide more info about what these pods are waiting for? It'd help to understand more about your system and why you want to allow pods to wait for some resources to become available before running

Our pipelines may indeed behave in a rather special way. What actually happens is this:

  1. Build the artifacts of a large Java application.
  2. Then start some 35 Tasks in parallel, each performing Selenium tests against the deployed application. Each of these tasks consists of:
     - a Pod with a WebSphere Liberty server along with the deployed application
     - an Oracle pod as a sidecar
     - a Pod with a mock server as a sidecar
  3. Do some cleanup work.

Now those 35 parallel TaskRuns cause a huge peak in CPU and memory consumption on the cluster's nodes and may eventually force Pods of other PipelineRuns into the Pending status. This is expected behaviour and OK for us. However, we had to increase all timeouts to avoid those pending Pods being killed due to timeouts.

So for use cases like this it could be nice to have a configuration option like in this example:

- name: some-taskrun
  taskRef:
    name: some-task
  timeout:
    value: 30m
    ignorePending: 'true'

What do you think?

@bobcatfish bobcatfish removed the triage/needs-information Indicates an issue needs more information in order to work on it. label Jul 9, 2021
@bobcatfish
Collaborator

bobcatfish commented Jul 9, 2021

thanks for explaining more about your use case!

I'm a bit hesitant about adding this because I feel like there are going to be other scenarios folks will run into where they'll want some variation on this logic - for example, maybe folks want the behavior you're describing, but only when the pod is waiting for resources to be scheduled, and not in other cases (e.g. when a volume a pod needs hasn't been created, or when an image pull fails).

That makes me wonder if this is the kind of thing it would make sense to solve by adding an optional component rather than adding to the existing API - e.g. you could run your PipelineRuns with timeouts off and have a separate controller that watches PipelineRuns in your system and uses whatever logic it wants to decide timeouts (e.g. ignoring Pending).

I feel bad suggesting this though, because writing your own controller isn't the easiest ask - I think it should be easier in the long run, and we're taking some small steps toward that; for example, the proposal to add a custom task SDK (tektoncd/community#461) involves making it easier to write a controller, and there are also example controllers in our experimental repo (e.g. https://github.com/tektoncd/experimental/tree/main/task-loops).

The other thing that comes to mind is that this request reminds me of TEP-0015, which was all about adding a "pending" status for PipelineRuns. IIRC the motivation was to allow folks to create as many PipelineRuns as they want, but have something else observe the cluster and decide when they should actually run (i.e. when resources were actually available - which I think is similar to what you're describing). Pursuing that option would also involve writing a custom controller 😬
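
For reference, this is roughly what that "pending" mechanism looks like on a PipelineRun (a sketch; an external controller would be responsible for clearing the field once resources are available):

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: example-pipelinerun
spec:
  pipelineRef:
    name: some-pipeline
  status: PipelineRunPending   # the run is created but held; remove this value to let it start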

Curious what other folks think too @tektoncd/core-maintainers - I'll put this on the agenda for the next API working group as well to get some more perspectives.

@badamowicz
Author

I'll put this on the agenda for the next API working group as well to get some more perspectives

Great! Thanks for discussing this issue so deeply! Looking forward to the results of the working group.

@bobcatfish bobcatfish self-assigned this Jul 12, 2021
@bobcatfish
Collaborator

Hey @badamowicz - we discussed this in the API working group today (notes and link to the recording). I'd say sentiments were on the fence between adding an 'execution timeout' feature like the one you're describing and noting that the core problem here feels like it's more about scheduling - very similar to the problem described in TEP-0015.

Speaking of which - @jbarrick-mesosphere @vincent-pli @FogDong - have you been able to leverage the functionality added in #3522 to create logic around running Pending pipelines when resources are available? If so, it'd be great to learn more; it might help with @badamowicz's use case here as well 🙏

(Long story short, @badamowicz, I think we're still hoping we can avoid adding this feature, but I'd say it's not entirely off the table!)

@zhouhaibing089
Contributor

I'd like to compare what others have in their timeout configuration. For example, in Prow:

https://github.com/kubernetes/test-infra/blob/49f933a3eda05d96589b6d97e0fce10df952ac6d/prow/config/config.go#L491-L499

It supports three different timeout configurations:

  1. PodUnscheduledTimeout
  2. PodPendingTimeout
  3. PodRunningTimeout

I don't really have any point to make, just providing some information here.
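
For illustration, those settings live in Prow's controller-level config and might look something like this (key names taken from the linked config.go; the durations are made up):

plank:
  pod_unscheduled_timeout: 5m   # pod was never scheduled onto a node
  pod_pending_timeout: 10m      # pod scheduled but still Pending (e.g. pulling images)
  pod_running_timeout: 48h      # pod is Running but the job hasn't finished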

@badamowicz
Author

Hi @bobcatfish !
I'm not sure if #3522 would really help us. However, we will for sure discuss and test this in our team.

The example/solution provided by @zhouhaibing089 looks great to me. Having such configuration options, at least on TaskRuns, would be a highly elegant solution in my opinion.

Will get back as soon as we've discussed #3522. Thanks!

@badamowicz
Author

Hi @bobcatfish !
We've discussed #3522 in our team. But we think this is - at least not yet - a solution for us, because it would require a CRD observing the cluster's resources and scheduling the PipelineRuns accordingly. This would essentially mimic core K8s behaviour.

I think this issue is also a bit related to #699, though they do not discuss resources there.

We'd like to keep this issue going and our idea is to file a detailed feature request. What do you (and others) think?

@bobcatfish
Collaborator

Thanks for sharing that info about Prow @zhouhaibing089! One thing that's interesting about it is that, from what I can tell, that configuration isn't specified on each individual job (i.e. you can't have one job with a different PodPendingTimeout than another) but globally - which kinda aligns with my suggestion that instead of adding this functionality to Pipelines, one could add another component (controller) that observes running pipelines and makes timeout-related decisions.

We'd like to keep this issue going and our idea is to file a detailed feature request. What do you (and others) think?

I'm still hoping we can avoid adding this at the Pipeline level, but I'm definitely open to it. I've added this to the agenda for the next API working group to discuss again; I'd like to hear more from other @tektoncd/core-maintainers on their thoughts.

One possible compromise: I wonder if a controller-level config option could do the trick, which would make it so that timeouts are only applied after scheduling? I'm guessing that wouldn't totally work for you, though, because you still DO want the TaskRun to time out eventually even if it can't be scheduled?
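
For concreteness, such a switch could live next to the existing timeout default in the config-defaults ConfigMap - a rough sketch only; the second key below is hypothetical and not an implemented option:

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-defaults
  namespace: tekton-pipelines
data:
  default-timeout-minutes: "60"            # existing option
  # hypothetical: only start the timeout countdown once the pod has been scheduled
  start-timeout-after-scheduling: "true"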

@badamowicz
Author

badamowicz commented Jul 15, 2021

One possible compromise: I wonder if a controller-level config option could do the trick, which would make it so that timeouts are only applied after scheduling?

Yes! In our particular use case this would definitely help. That's about what I've described here. However, I'm aware that this also has the drawback that a Pending status caused by other reasons (e.g. a missing volume) would no longer be recognized - as you mentioned. But as a first step towards a "big" solution which will also cover other use cases, this would be very helpful for us.

But we will await the results of the next API working group first. Thanks!

@ghost

ghost commented Jul 15, 2021

Another option might be to parameterize the test-running task, passing a limit through as a param to the code executing the test suite. It wouldn't be Tekton-enforced, but this would be a way of having very fine-grained control over the precise meaning of "timeout" while also keeping the value of that timeout in the YAML. Does your test runner support this as a configuration option? I wonder whether there are any reasons to prefer a Tekton-enforced execution timeout over the test runner's own.
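
As a sketch of that idea (the task name, image, and script are placeholders - the point is that the test runner, not Tekton, enforces the limit):

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: run-selenium-tests
spec:
  params:
    - name: test-timeout
      type: string
      default: "30m"
  steps:
    - name: run-tests
      image: example.com/selenium-runner:latest   # placeholder image
      script: |
        #!/bin/sh
        # hand the limit to the test runner itself; Tekton does not enforce it
        timeout "$(params.test-timeout)" ./run-test-suite.sh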

@ghglza

ghglza commented Jul 16, 2021

@sbwsg No, I'm afraid we would have to implement timeouts at the test-runner layer ourselves.
This (and the same goes for building a dedicated controller) would, from our perspective, essentially convert a CI/CD management project into a software development effort, which is unfortunately out of scope for us in terms of capacity and lifecycle management.
I would dare to speculate that this is similar for most users of Tekton, so some finer-grained control of timeouts without having to code would IMHO be a very welcome feature!
But of course this is one of the areas where hitting the sweet spot between API- and code-bloat within Tekton vs. duplicated coding requirements on the side of the users is really tricky... if I could choose, I'd vote for timeout categories similar to Prow's.

@bobcatfish
Collaborator

Oh yeah that's a great idea too @sbwsg

But of course this is one of the areas where hitting the sweet spot between API- and code-bloat within Tekton vs. duplicated coding requirements on the side of the users is really tricky

Really appreciate your patience and understanding around this @ghglza @badamowicz !!

Building on the idea of having controller-level timeouts, maybe we could add something like a "global pending timeout". If it's set, we'd have the behavior you've requested: the TaskRun timeout timer wouldn't start until the pod went from Pending to Running (I think that's the right point based on pod phases). During that window the pending timeout would be applied, and once the pod is Running, the TaskRun timer would start.

This is assuming that the pending timeout is global and not something we want to configure on a per-TaskRun/PipelineRun basis (which I'm thinking is probably a fair assumption, since this is about cluster-wide conditions?).
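
Sketching the behaviour (the config key is hypothetical; only the pod phases are real):

# config-defaults, hypothetical key:
#   default-pending-timeout-minutes: "20"
#
# for a TaskRun with spec.timeout: "30m":
#   while the pod is Pending  -> the 20m pending timeout applies
#   once the pod is Running   -> the 30m TaskRun timeout starts counting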

@ghglza

ghglza commented Jul 21, 2021

@bobcatfish that sounds great!
For me it also seems that a "pending timeout" does not belong in pipeline-specific settings -- IMHO it would (at least as a start) even suffice if this timeout were configurable only at controller startup.

@badamowicz
Author

I basically agree with @ghglza. However, I can see there were also other options discussed in the last API working group meeting.

@bobcatfish: How do you intend to proceed with this issue in general? Is there still an ongoing internal discussion? Will we have to file a feature request?

Thanks!

@bobcatfish
Collaborator

Hey @badamowicz, apologies for the delay! The discussion at the API working group last week wasn't super conclusive, but it seems like having some controller-level configuration around this could work. I've sent an email to tekton-dev to try to get a few more eyes on this and make sure I don't recommend something to you that other folks strongly disagree with 🤞

(Also, I'm about to go on vacation for 2 weeks, but I'll make sure to follow up on this when I'm back! In the meantime, please feel free to join our API working group if any of the times work for you - https://github.com/tektoncd/community/blob/main/working-groups.md#api - and/or continue the discussion on Slack.)

Thanks again for your patience 🙏

@badamowicz
Author

badamowicz commented Jul 27, 2021

Thanks again for your patience

You're welcome! We also do appreciate the deep investigation you're putting into this issue. Have a nice holiday!

@bobcatfish
Collaborator

Thanks @badamowicz ! :D it was nice to have some time away

So, quick update: there were no strong replies to the thread in tekton-dev, so it seems like adding some kind of controller-level "pending timeout" could be an acceptable solution. If that's something you think might work for you, the next step for proposing a feature is for someone to create a TEP describing the proposal.

@bobcatfish bobcatfish removed their assignment Aug 16, 2021
@bobcatfish
Collaborator

Hello again @badamowicz and everyone - in a surprising change of events, I'd actually like to go back on what I said and advocate for what @badamowicz was asking for all along 🙏 namely being able to configure this at a per-run level. The reason is that we've been exploring how to have parity between Google Cloud Build (GCB) and Tekton, i.e. how to implement builds configured with GCB using Tekton types, and we had missed queueTtl - which seems to roughly map to the same concept as the pending timeout @badamowicz has requested.
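
(For reference, queueTtl in a GCB build config looks roughly like this - a sketch based on the linked schema docs, with a trivial placeholder step:)

steps:
  - name: 'ubuntu'
    args: ['echo', 'hello']
queueTtl: '3600s'   # how long the build may sit in the queue before it is expired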

I'll follow up on the mailing list and in the API working group as well 🙏 sorry for EVEN MORE back and forth here

@vdemeester
Member

I'll follow up on the mailing list and in the API working group as well sorry for EVEN MORE back and forth here

no is temporary, yes is forever 😝

bobcatfish added a commit to bobcatfish/community that referenced this issue Oct 14, 2021
This TEP addresses tektoncd/pipeline#4078 by
proposing that at runtime `TaskRuns` can be configured with a new
timeout that applies before they actually start executing - and only
after that are the existing timeouts applied.

This is meant to make it possible to control when our existing timeout
feature kicks in within resource-constrained environments, and also to get parity
with [Google Cloud Build's existing queueTtl feature](https://cloud.google.com/build/docs/build-config-file-schema#queuettl).
@bobcatfish
Collaborator

Okay, after that big delay I've finally got the TEP together @badamowicz! It's at tektoncd/community#540 - please let me know what you think (I also included you as an author, but no problem if you'd rather I remove that XD). I'm very keen to know whether or not this will meet your needs, particularly since some details are a bit different from your initial proposal - for example, you mentioned starting the existing timeout only after a pod is Running, whereas I propose using the Pod condition PodScheduled - which means that in my proposal the time to pull images would be counted as part of the execution time.
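
To illustrate the difference (the pendingTimeout field below is purely hypothetical and only meant to show the idea - see the TEP for the actual design):

apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: example-taskrun
spec:
  taskRef:
    name: some-task
  timeout: "30m"          # would start counting once the pod's PodScheduled condition is true
  # hypothetical field, not the final API: a separate limit that applies
  # before the pod has been scheduled
  pendingTimeout: "10m"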

@badamowicz
Author

badamowicz commented Oct 15, 2021

That's great news @bobcatfish! The TEP looks OK to us. Looking forward to the release containing the new timeout features! Thanks for putting so much effort into this issue!

And it's a great honor to be mentioned as an author!

bobcatfish added a commit to bobcatfish/community that referenced this issue Oct 15, 2021
@kscherer
Contributor

My team was having a similar issue. We did not expect a Task timeout to start while the Task was in the Pending state. There is already a Pipeline timeout that effectively acts as a "global" timeout.

Until the TEP is implemented and released, we will handle the Task timeout using a sidecar. We already have a sidecar for progress monitoring and hang checks; adding a task timeout to it will not be difficult.

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 27, 2022
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 26, 2022
@tekton-robot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@bobcatfish
Collaborator

whoops this shouldn't have gone rotten! still being addressed (slowly) in TEP-0092

/reopen
/lifecycle frozen

@tekton-robot tekton-robot reopened this Apr 11, 2022
@tekton-robot
Collaborator

@bobcatfish: Reopened this issue.

In response to this:

whoops this shouldn't have gone rotten! still being addressed (slowly) in TEP-0092

/reopen
/lifecycle frozen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot tekton-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Apr 11, 2022
bobcatfish added a commit to bobcatfish/community that referenced this issue Apr 11, 2022
bobcatfish added a commit to bobcatfish/community that referenced this issue Apr 15, 2022
tekton-robot pushed a commit to tektoncd/community that referenced this issue Apr 18, 2022
@EmmaMunley
Contributor

/assign @EmmaMunley

@jerop
Member

jerop commented Feb 6, 2023

@badamowicz @ghglza @kscherer - we are revisiting this problem and wanted to check in to see whether you have any additional considerations; please review the update to TEP-0092 from @EmmaMunley: tektoncd/community#948

cc @vdemeester @pritidesai @bobcatfish
