Workflow is Error but taskset node is not Error when Agent pod failed #14200

Open · 4 tasks done
Tuilot opened this issue Feb 17, 2025 · 8 comments · May be fixed by #14230
Labels
area/agent (Argo Agent that runs for HTTP and Plugin templates) · solution/suggested (A solution to the bug has been suggested. Someone needs to implement it.) · type/bug

Comments

Tuilot commented Feb 17, 2025

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

When I submit the workflow to the argo namespace, the agent pod fails.
The workflow then turns to the Error state, but the taskset node is still Pending.

# kubectl -n argo get po hello-plugin-1340600742-agent -owide
NAME                            READY   STATUS    RESTARTS   AGE     IP               NODE                NOMINATED NODE   READINESS GATES
hello-plugin-1340600742-agent   4/4     Evicted   0          3m19s   192.168.28.240   train070            <none>           <none> 
# kubectl -n argo get workflowtaskset hello-plugin -oyaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTaskSet
metadata:
  creationTimestamp: "2025-02-17T09:50:11Z"
  generation: 1
  labels:
    workflows.argoproj.io/completed: "true"
  name: hello-plugin
  namespace: argo
  ownerReferences:
  - apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    name: hello-plugin
    uid: 0709dfa5-3603-487d-ba45-40e1f19ccc87
  resourceVersion: "5377104186"
  selfLink: /apis/argoproj.io/v1alpha1/namespaces/argo/workflowtasksets/hello-plugin
  uid: 43313c93-90f3-41cc-8467-0d1c845c9a60
spec:
  tasks:
    hello-plugin-2816962999:
      inputs: {}
      metadata: {}
      name: hello-plugin
      outputs: {}
      plugin:
        hello: {}
status:
  nodes:
    hello-plugin-2816962999:
      message: Queuing
      phase: Pending
# kubectl -n argo get wf hello-plugin -oyaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  annotations:
    workflows.argoproj.io/pod-name-format: v1
  creationTimestamp: "2025-02-17T09:50:11Z"
  generation: 4
  labels:
    workflows.argoproj.io/completed: "true"
    workflows.argoproj.io/phase: Error
    workflows.argoproj.io/workflow-archiving-status: Archived
  name: hello-plugin
  namespace: argo
  resourceVersion: "5377104189"
  selfLink: /apis/argoproj.io/v1alpha1/namespaces/argo/workflows/hello-plugin
  uid: 0709dfa5-3603-487d-ba45-40e1f19ccc87
spec:
  arguments: {}
  entrypoint: main
  templates:
  - dag:
      tasks:
      - arguments: {}
        name: hello
        template: hello-plugin
    inputs: {}
    metadata: {}
    name: main
    outputs: {}
  - inputs: {}
    metadata: {}
    name: hello-plugin
    outputs: {}
    plugin:
      hello: {}
status:
  artifactRepositoryRef:
    artifactRepository: {}
    default: true
  conditions:
  - status: "False"
    type: PodRunning
  - status: "True"
    type: Completed
  finishedAt: "2025-02-17T09:53:43Z"
  nodes:
    hello-plugin:
      children:
      - hello-plugin-2816962999
      displayName: hello-plugin
      finishedAt: "2025-02-17T09:53:43Z"
      id: hello-plugin
      name: hello-plugin
      outboundNodes:
      - hello-plugin-2816962999
      phase: Error
      progress: 0/1
      startedAt: "2025-02-17T09:50:11Z"
      templateName: main
      templateScope: local/hello-plugin
      type: DAG
    hello-plugin-2816962999:
      boundaryID: hello-plugin
      displayName: hello
      finishedAt: "2025-02-17T09:53:43Z"
      id: hello-plugin-2816962999
      message: Queuing
      name: hello-plugin.hello
      phase: Pending  # <-- node still Pending
      progress: 0/1
      startedAt: "2025-02-17T09:50:11Z"
      templateName: hello-plugin
      templateScope: local/hello-plugin
      type: Plugin
  phase: Error  # <-- workflow phase
  progress: 0/1
  startedAt: "2025-02-17T09:50:11Z"

Version(s)

latest

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  namespace: argo
  name: hello-plugin
spec:
  entrypoint: main
  templates:
  - dag:
      tasks:
      - arguments: {}
        name: hello
        template: hello-plugin
    name: main
  - name: hello-plugin
    plugin:
      hello: {}

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}
time="2025-02-17T17:53:43.164Z" level=info msg="Processing workflow" namespace=argo workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg=updateAgentPodStatus namespace=argo workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=error msg="Mark error node" error="agent pod failed with reason:\"The node was low on resource: ephemeral-storage.\"" namespace=argo nodeName=hello-plugin.hello workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg="node hello-plugin-2816962999 phase Pending -> Error" namespace=argo workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg="node hello-plugin-2816962999 message: agent pod failed with reason:\"The node was low on resource: ephemeral-storage.\"" namespace=argo workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg="Outbound nodes of hello-plugin set to [hello-plugin-2816962999]" namespace=argo workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg="node hello-plugin phase Running -> Error" namespace=argo workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg="node hello-plugin finished: 2025-02-17 09:53:43.165321874 +0000 UTC" namespace=argo workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg="Checking daemoned children of hello-plugin" namespace=argo workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg=reconcileAgentPod namespace=argo workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg="Updated phase Running -> Error" namespace=argo workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg="Marking workflow completed" namespace=argo workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg="Marking workflow as pending archiving" namespace=argo workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg="Checking daemoned children of " namespace=argo workflow=hello-plugin
time="2025-02-17T17:53:43.170Z" level=info msg="Workflow update successful" namespace=argo phase=Error resourceVersion=5377104182 workflow=hello-plugin

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
jswxstw added the area/agent label (Argo Agent that runs for HTTP and Plugin templates) on Feb 17, 2025
jswxstw (Member) commented Feb 17, 2025

So weird, this issue should have been fixed by #12723.

The logs below show that node hello-plugin-2816962999 has been marked as Error.

time="2025-02-17T17:53:43.165Z" level=error msg="Mark error node" error="agent pod failed with reason:\"The node was low on resource: ephemeral-storage.\"" namespace=argo nodeName=hello-plugin.hello workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg="node hello-plugin-2816962999 phase Pending -> Error" namespace=argo workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg="node hello-plugin-2816962999 message: agent pod failed with reason:\"The node was low on resource: ephemeral-storage.\"" namespace=argo workflow=hello-plugin

In addition, completed taskset nodes in WorkflowTaskSet should also be removed.

func (woc *wfOperationCtx) removeCompletedTaskSetStatus(ctx context.Context) error {

hello-plugin-2816962999:
    boundaryID: hello-plugin
    displayName: hello
    finishedAt: "2025-02-17T09:53:43Z" # finishedAt is not nil, so it has already been marked as completed.
    id: hello-plugin-2816962999
    message: Queuing # I have never seen this message here before; it should be 'agent pod failed with reason...'
    name: hello-plugin.hello
    phase: Pending
    progress: 0/1
    startedAt: "2025-02-17T09:50:11Z"
    templateName: hello-plugin
    templateScope: local/hello-plugin
    type: Plugin
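
For reference, a minimal self-contained sketch of the kind of cleanup that function is expected to perform. The types below are hypothetical stand-ins, illustrative only; the real implementation operates on the WorkflowTaskSet resource itself.

package main

import "fmt"

// Hypothetical, simplified stand-ins for the argo-workflows types; illustrative only.
type nodeResult struct {
	Phase   string
	Message string
}

type workflowTaskSet struct {
	SpecTasks   map[string]struct{}   // tasks the agent still has to run
	StatusNodes map[string]nodeResult // results reported by the agent
}

// pruneCompleted removes entries for nodes the controller already considers
// completed, which is the gist of what removeCompletedTaskSetStatus is meant to do.
func pruneCompleted(ts *workflowTaskSet, completed func(id string) bool) {
	for id := range ts.StatusNodes {
		if completed(id) {
			delete(ts.SpecTasks, id)
			delete(ts.StatusNodes, id)
		}
	}
}

func main() {
	ts := &workflowTaskSet{
		SpecTasks:   map[string]struct{}{"hello-plugin-2816962999": {}},
		StatusNodes: map[string]nodeResult{"hello-plugin-2816962999": {Phase: "Pending", Message: "Queuing"}},
	}
	// In the reported case the workflow node is already Error (completed),
	// so both the task and its stale "Queuing"/Pending status should be pruned.
	pruneCompleted(ts, func(string) bool { return true })
	fmt.Printf("tasks=%d, nodes=%d\n", len(ts.SpecTasks), len(ts.StatusNodes))
}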

Tuilot (Author) commented Feb 17, 2025

@jswxstw You're right: the message for hello-plugin-2816962999 has been updated repeatedly, so 'agent pod failed with reason...' was overwritten.

Tuilot (Author) commented Feb 17, 2025

Within a single operate call, all non-fulfilled taskset nodes are marked as Error because the agent pod failed, and then the taskset is reconciled again.
This redundant taskset reconciliation is likely the root cause of the problem.

jswxstw (Member) commented Feb 18, 2025

During updateAgentPodStatus, all uncompleted taskset nodes are marked as Error because the agent pod failed. The workflow is then in the Error state, and wfOperationCtx.taskSet is empty.

The workflow would have ended by this point. In what situation would the redundant reconcileTaskSet you mentioned occur?
I cannot reproduce this issue, and there is a similar case in tests:

func TestHTTPTemplate(t *testing.T) {

Tuilot (Author) commented Feb 18, 2025

@jswxstw This test modification triggers the error. All of the processes mentioned above happen within a single operate call.

t.Run("ExecuteHTTPTemplate", func(t *testing.T) {
		ctx := context.Background()
		woc := newWorkflowOperationCtx(wf, controller)
		woc.operate(ctx)
		pod, err := controller.kubeclientset.CoreV1().Pods(woc.wf.Namespace).Get(ctx, woc.getAgentPodName(), metav1.GetOptions{})
		assert.NoError(t, err)
		assert.NotNil(t, pod)
		ts, err := controller.wfclientset.ArgoprojV1alpha1().WorkflowTaskSets(wf.Namespace).Get(ctx, "hello-world", metav1.GetOptions{})
		assert.NoError(t, err)
		assert.NotNil(t, ts)
		assert.Len(t, ts.Spec.Tasks, 1)
		ts.Status.Nodes = make(map[string]wfv1.NodeResult)
		ts.Status.Nodes["hello-world"] = wfv1.NodeResult{
			Phase:   wfv1.NodePending,
			Message: "Queuing",
		}
		_, err = controller.wfclientset.ArgoprojV1alpha1().WorkflowTaskSets(wf.Namespace).UpdateStatus(ctx, ts, metav1.UpdateOptions{})
		assert.Nil(t, err)
		wf, err = controller.wfclientset.ArgoprojV1alpha1().Workflows(wf.Namespace).Get(ctx, "hello-world", metav1.GetOptions{})
		assert.Nil(t, err)
		// simulate agent pod failure scenario
		pod.Status.Phase = v1.PodFailed
		pod.Status.Message = "manual termination"
		pod, err = controller.kubeclientset.CoreV1().Pods(woc.wf.Namespace).UpdateStatus(ctx, pod, metav1.UpdateOptions{})
		assert.Nil(t, err)
		assert.Equal(t, v1.PodFailed, pod.Status.Phase)
		// sleep 1 second to wait for informer getting pod info
		time.Sleep(time.Second)
		woc = newWorkflowOperationCtx(wf, controller)
		woc.operate(ctx)
		assert.Equal(t, wfv1.WorkflowError, woc.wf.Status.Phase)
		assert.Equal(t, `agent pod failed with reason:"manual termination"`, woc.wf.Status.Message)
		assert.Len(t, woc.wf.Status.Nodes, 1)
		assert.Equal(t, wfv1.NodeError, woc.wf.Status.Nodes["hello-world"].Phase)
		assert.Equal(t, `agent pod failed with reason:"manual termination"`, woc.wf.Status.Nodes["hello-world"].Message)
		ts, err = controller.wfclientset.ArgoprojV1alpha1().WorkflowTaskSets(wf.Namespace).Get(ctx, "hello-world", metav1.GetOptions{})
		assert.NoError(t, err)
		assert.NotNil(t, ts)
		assert.Empty(t, ts.Spec.Tasks)
		assert.Empty(t, ts.Status.Nodes)
	})

Tuilot (Author) commented Feb 18, 2025

@jswxstw

The workflow would have ended by this point. In what situation would the redundant reconcileTaskSet you mentioned occur? I cannot reproduce this issue, and there is a similar case in tests:

The redundant reconcileTaskSet refers to unnecessarily reconciling the taskSet when woc.taskSet is empty. The correct behavior would be to skip the reconciliation in that case, as sketched below.
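
A minimal, self-contained sketch of that guard, using hypothetical stand-in types rather than the real wfOperationCtx; the actual fix is the one proposed in #14230.

package main

import "fmt"

// Hypothetical stand-in for the controller's operation context; illustrative only.
type operationCtx struct {
	// taskSet holds the non-fulfilled taskset tasks collected during this operate call.
	taskSet map[string]struct{}
}

// taskSetReconciliation sketches the proposed guard: when every taskset node has
// already been fulfilled (e.g. marked Error after the agent pod failed), there is
// nothing to reconcile, so return early instead of re-applying stale
// WorkflowTaskSet status such as "Queuing"/Pending over the Error nodes.
func (woc *operationCtx) taskSetReconciliation() {
	if len(woc.taskSet) == 0 {
		fmt.Println("no non-fulfilled taskset tasks; skipping reconciliation")
		return
	}
	// ... the real controller would merge agent-reported results into workflow nodes here
}

func main() {
	woc := &operationCtx{taskSet: map[string]struct{}{}}
	woc.taskSetReconciliation()
}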

jswxstw (Member) commented Feb 19, 2025

@Tuilot Would you like to submit a PR to fix this?

jswxstw added the solution/suggested label (A solution to the bug has been suggested. Someone needs to implement it.) on Feb 19, 2025
Tuilot (Author) commented Feb 19, 2025

@jswxstw yes
