Template with retry and exit hook stuck in Pending
#13239
Comments
This can certainly be confusing, but it is expected behavior -- the final node's exit code determines the success or failure of the overall Workflow. For example, think about the opposite use-case: "I want my Workflow to succeed if my exit hook succeeds, even though an earlier step failed." As such, it is currently not possible to determine (without relying on the exit code) whether a successful final node should result in a failure -- you would need to be able to indicate your high-level intent somewhere. See #12530 for a feature request to add something to the spec to indicate this behavior separately from an exit code. The user-land workaround for now is to modify your exit code appropriately to match your intent.
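As a rough sketch of that workaround (not from this thread): the exit hook itself exits non-zero when the step it guards did not succeed, so the Workflow ends up Failed. The parameter name and the value passed into it are illustrative -- {{workflow.status}} is the documented source for workflow-level exit handlers, while a task-level hook would need whatever status expression fits its scope:
# Hypothetical hook template; it echoes the status it was given and exits 1
# unless that status is "Succeeded", which propagates the failure to the Workflow.
- name: exit-task
  inputs:
    parameters:
      - name: status
  script:
    image: alpine:3.19
    command: [sh]
    source: |
      echo "guarded step finished with status: {{inputs.parameters.status}}"
      [ "{{inputs.parameters.status}}" = "Succeeded" ] || exit 1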
The visualization might not be the most intuitive in this case. There's a switch in the graph UI (the lightning bolt IIRC) for a slower algorithm that might(?) be more intuitive here as it could group nodes differently. Without access to your completed Workflow's status, it's hard to say more.
Yea, that certainly sounds like a bug -- a regression, given that it actually worked as expected before. It likely stems from #12402's change in handling task results (we've had a few of those -- fixed one race condition, created more new race conditions 🙃). I'm not sure which of the changes in 3.5.5 would cause that from a quick glance though, especially as you don't appear to be using any…
Your Controller ConfigMap and the YAML of the running Workflow's status (with the node stuck in Pending) would help with reproducing this.
Hi agilgur5! Thank you very much for your detailed answer! I think what confused me about part 1 was the behaviour of the workflow when I remove either the retry or the exit hook:
But I think I understand now: it is intended that the exit hook becomes more or less a final step of the retry when both a retry and an exit hook are defined.
Yes you are correct, which makes total sense.
Thank you, this solution is completely sufficient for us. It might be confusing at first when debugging, but I think with a clear exit message it won't be much of a problem.
About part 2, here is the status of the stuck workflow:
status:
phase: Running
startedAt: '2024-06-25T07:33:17Z'
finishedAt: null
estimatedDuration: 20
progress: 0/2
nodes:
argo-retry-hook-issue-54bnq:
id: argo-retry-hook-issue-54bnq
name: argo-retry-hook-issue-54bnq
displayName: argo-retry-hook-issue-54bnq
type: DAG
templateName: start
templateScope: local/
phase: Running
startedAt: '2024-06-25T07:33:17Z'
finishedAt: null
estimatedDuration: 20
progress: 0/2
children:
- argo-retry-hook-issue-54bnq-2742882792
argo-retry-hook-issue-54bnq-2742882792:
id: argo-retry-hook-issue-54bnq-2742882792
name: argo-retry-hook-issue-54bnq.failing-retry-step
displayName: failing-retry-step
type: Retry
templateName: failing-retry-step
templateScope: local/
phase: Failed
boundaryID: argo-retry-hook-issue-54bnq
message: Error (exit code 1)
startedAt: '2024-06-25T07:33:17Z'
finishedAt: '2024-06-25T07:33:27Z'
estimatedDuration: 10
progress: 0/2
resourcesDuration:
cpu: 0
memory: 3
outputs:
exitCode: '1'
children:
- argo-retry-hook-issue-54bnq-352626555
- argo-retry-hook-issue-54bnq-3765049435
argo-retry-hook-issue-54bnq-352626555:
id: argo-retry-hook-issue-54bnq-352626555
name: argo-retry-hook-issue-54bnq.failing-retry-step(0)
displayName: failing-retry-step(0)
type: Pod
templateName: failing-retry-step
templateScope: local/
phase: Failed
boundaryID: argo-retry-hook-issue-54bnq
message: Error (exit code 1)
startedAt: '2024-06-25T07:33:17Z'
finishedAt: '2024-06-25T07:33:20Z'
estimatedDuration: 3
progress: 0/1
resourcesDuration:
cpu: 0
memory: 3
nodeFlag:
retried: true
outputs:
exitCode: '1'
hostNodeName: docker-desktop
argo-retry-hook-issue-54bnq-3765049435:
id: argo-retry-hook-issue-54bnq-3765049435
name: argo-retry-hook-issue-54bnq.failing-retry-step.onExit
displayName: failing-retry-step.onExit
type: Pod
templateName: exit-task
templateScope: local/
phase: Pending
boundaryID: argo-retry-hook-issue-54bnq
startedAt: '2024-06-25T07:33:27Z'
finishedAt: null
estimatedDuration: 2
progress: 0/1
nodeFlag:
hooked: true
storedTemplates:
namespaced/argo-retry-hook-issue/additional-step:
name: additional-step
inputs: {}
outputs: {}
metadata: {}
container:
name: main
image: argoproj/argosay:v2
command:
- /argosay
args:
- echo
- Hello second step
resources: {}
namespaced/argo-retry-hook-issue/exit-task:
name: exit-task
inputs: {}
outputs: {}
metadata: {}
container:
name: main
image: argoproj/argosay:v2
command:
- /argosay
args:
- echo
- Hello exit task
resources: {}
namespaced/argo-retry-hook-issue/failing-retry-step:
name: failing-retry-step
inputs: {}
outputs: {}
metadata: {}
container:
name: main
image: argoproj/argosay:v2
command:
- /argosay
args:
- exit 1
resources: {}
retryStrategy:
limit: '1'
retryPolicy: OnError
namespaced/argo-retry-hook-issue/start:
name: start
inputs: {}
outputs: {}
metadata: {}
dag:
tasks:
- name: failing-retry-step
template: failing-retry-step
arguments: {}
hooks:
exit:
template: exit-task
arguments: {}
- name: additional-step
template: additional-step
arguments: {}
dependencies:
- failing-retry-step
conditions:
- type: PodRunning
status: 'False'
resourcesDuration:
cpu: 0
memory: 3
storedWorkflowTemplateSpec:
templates:
- name: start
inputs: {}
outputs: {}
metadata: {}
dag:
tasks:
- name: failing-retry-step
template: failing-retry-step
arguments: {}
hooks:
exit:
template: exit-task
arguments: {}
- name: additional-step
template: additional-step
arguments: {}
dependencies:
- failing-retry-step
- name: failing-retry-step
inputs: {}
outputs: {}
metadata: {}
container:
name: main
image: argoproj/argosay:v2
command:
- /argosay
args:
- exit 1
resources: {}
retryStrategy:
limit: '1'
retryPolicy: OnError
- name: additional-step
inputs: {}
outputs: {}
metadata: {}
container:
name: main
image: argoproj/argosay:v2
command:
- /argosay
args:
- echo
- Hello second step
resources: {}
- name: exit-task
inputs: {}
outputs: {}
metadata: {}
container:
name: main
image: argoproj/argosay:v2
command:
- /argosay
args:
- echo
- Hello exit task
resources: {}
entrypoint: start
arguments: {}
ttlStrategy:
secondsAfterCompletion: 8640000
podGC:
strategy: OnPodSuccess
podDisruptionBudget:
minAvailable: 1
workflowTemplateRef:
name: argo-retry-hook-issue
volumeClaimGC:
strategy: OnWorkflowCompletion
workflowMetadata:
labels:
example: 'true'
artifactRepositoryRef:
default: true
artifactRepository: {}
artifactGCStatus:
notSpecified: true
In addition, here is the YAML of the ConfigMap:
apiVersion: v1
data:
config: |
instanceID: xry
workflowDefaults:
spec:
podDisruptionBudget:
minAvailable: 1
podGC:
strategy: OnPodSuccess
ttlStrategy:
secondsAfterCompletion: 8640000
volumeClaimGC:
strategy: OnWorkflowCompletion
nodeEvents:
enabled: true
kind: ConfigMap
metadata:
labels:
helm.sh/chart: argo-workflows-0.30.0
name: argo-workflows-workflow-controller-configmap
namespace: default
I hope these settings will help to reproduce it; I will also play around a bit with the default settings in the ConfigMap.
Mmmh, even removing all workflowDefaults does not help, nor does using steps instead of the DAG approach.
This is just based on the last task's exit code, so without an exit hook it is more straightforward.
Again I'd recommend toggling the "lightning bolt" button to try the slower, alternative layout, which sometimes makes more sense. In this case the fast layout (which uses a Coffman-Graham Sort) is just showing that those two tasks ran in parallel -- they both kicked off when the retry node completed.
🤔 I think this might be because…
That's good to know that DAGs will be the future, as we're currently switching from steps to DAGs!
Okay, I think we are fine with the solution that the exit hook itself throws if the task status is Failed, so that the main workflow also fails. Thanks for your help and explanation here! That only leaves part 2, about v3.5.5 or higher 😬 Did you get a chance to test my settings? If you manage to run my workflow with those settings successfully on your end, then I will dig deeper into my settings, as Argo itself would not be the problem.
I haven't yet, it's still on the to-do list. But nothing seemed to catch my eye in your ConfigMap.
Hi @agilgur5, did you find time to check whether you can reproduce my issue?
Yes, this is probably fixed by one of the recent WorkflowTaskResults fixes. Thanks @jswxstw for following up on all these stuck Workflows due to WorkflowTaskResults regressions!
It's a template-level hook in that case, and in the middle of the template, so I think this behavior is correct: the last task's exit code is what determines the outcome.
Per your docs reference above and my same comment linked above, since the last hook is the final node, its exit code determines the overall result.
Hi guys, I'm gonna be honest, I'm still a bit confused that the workflow is marked as successful if one of its retried tasks failed but the last exit hook succeeded, especially since this does not happen when I don't use a retry. So I'm looking forward to #12530. Thanks for your discussion on it nonetheless! Still, I think for now I can use your workaround, even though it will confuse people during debugging.
Pre-requisites
I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
What happened/what did you expect to happen?
Hi argo team,
I have two problems with the exact same workflow template, depending on the version I'm using.
In my workflow, I have a DAG task which has both a retry option (limit and policy don't seem to matter; it doesn't work with any setting) and a simple exit lifecycle hook. After that step, I have an additional step which should not be executed if the retry fails (either because the limit is reached or because the error does not match the retryPolicy).
Here are my issues:
1. Versions up to v3.4.17 (latest v3.4.X version), or up to v3.5.4:
The workflow runs the task with the retry, failing for some reason (in my example workflow intentionally), then it either retries or not, depending on the retryPolicy and the error type.
Afterwards the hook runs as expected, but then the workflow seems to continue: the next step takes the exit hook step as its predecessor and is thus skipped. The whole workflow is then marked as succeeded, while the retry step clearly failed. This is very unexpected behaviour to me!
I would expect the workflow to fail directly after the exit hook of the failing step has been executed, as one of its steps has failed.
2. Versions v3.5.5 up to v3.5.8 (currently the latest version):
I am not entirely sure, but I can imagine that this commit about retry (fix: retry node with expression status Running -> Pending #12637) from you in version v3.5.5 might have fixed the above issue. However, I now run into a new problem: the exit hook step is stuck in Pending, while I can clearly see in the Kubernetes cluster that the pod completed almost instantly.
The logs attached further down, from the workflow controller and from the wait container of the exit hook pod (which show that the pod has been executed and completed), display this error. The workflow controller keeps repeating the last 4 lines in a loop until I terminate or stop the workflow.
I kindly ask you to test both issues above; maybe you have already fixed the first one, but then the second one blocks me.
If you need further details, I'll be happy to share them.
Version
v3.4.17, v3.5.4, v3.5.5, v3.5.8
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
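The attachment itself is not included here; the following is a reconstruction from the storedWorkflowTemplateSpec in the status above (the template name comes from the workflowTemplateRef, and since the controller ConfigMap sets instanceID: xry, the submitted Workflow would also need the matching controller-instanceid label):
# Reconstruction from the storedWorkflowTemplateSpec above, not the original attachment.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: argo-retry-hook-issue
spec:
  entrypoint: start
  templates:
    - name: start
      dag:
        tasks:
          - name: failing-retry-step
            template: failing-retry-step
            hooks:
              exit:
                template: exit-task
          - name: additional-step
            template: additional-step
            dependencies:
              - failing-retry-step
    - name: failing-retry-step
      retryStrategy:
        limit: '1'
        retryPolicy: OnError
      container:
        image: argoproj/argosay:v2
        command: [/argosay]
        args: ["exit 1"]
    - name: additional-step
      container:
        image: argoproj/argosay:v2
        command: [/argosay]
        args: [echo, "Hello second step"]
    - name: exit-task
      container:
        image: argoproj/argosay:v2
        command: [/argosay]
        args: [echo, "Hello exit task"]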
Logs from the workflow controller
Logs from in your workflow's wait container