Workflow is Error but taskset node is not Error when Agent pod failed #14200

Open · 4 tasks done
Tuilot opened this issue Feb 17, 2025 · 8 comments · May be fixed by #14230
Labels
area/agent (Argo Agent that runs for HTTP and Plugin templates) · solution/suggested (A solution to the bug has been suggested. Someone needs to implement it.) · type/bug

Comments

Tuilot commented Feb 17, 2025

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

When I submit the workflow to the argo namespace, the agent pod fails.
The workflow then turns to the Error state, but the taskset node is still Pending.

# kubectl -n argo get po hello-plugin-1340600742-agent -owide
NAME                            READY   STATUS    RESTARTS   AGE     IP               NODE                NOMINATED NODE   READINESS GATES
hello-plugin-1340600742-agent   4/4     Evicted   0          3m19s   192.168.28.240   train070            <none>           <none> 
# kubectl -n argo get workflowtaskset hello-plugin -oyaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTaskSet
metadata:
  creationTimestamp: "2025-02-17T09:50:11Z"
  generation: 1
  labels:
    workflows.argoproj.io/completed: "true"
  name: hello-plugin
  namespace: argo
  ownerReferences:
  - apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    name: hello-plugin
    uid: 0709dfa5-3603-487d-ba45-40e1f19ccc87
  resourceVersion: "5377104186"
  selfLink: /apis/argoproj.io/v1alpha1/namespaces/argo/workflowtasksets/hello-plugin
  uid: 43313c93-90f3-41cc-8467-0d1c845c9a60
spec:
  tasks:
    hello-plugin-2816962999:
      inputs: {}
      metadata: {}
      name: hello-plugin
      outputs: {}
      plugin:
        hello: {}
status:
  nodes:
    hello-plugin-2816962999:
      message: Queuing
      phase: Pending
# kubectl -n argo get wf hello-plugin -oyaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  annotations:
    workflows.argoproj.io/pod-name-format: v1
  creationTimestamp: "2025-02-17T09:50:11Z"
  generation: 4
  labels:
    workflows.argoproj.io/completed: "true"
    workflows.argoproj.io/phase: Error
    workflows.argoproj.io/workflow-archiving-status: Archived
  name: hello-plugin
  namespace: argo
  resourceVersion: "5377104189"
  selfLink: /apis/argoproj.io/v1alpha1/namespaces/argo/workflows/hello-plugin
  uid: 0709dfa5-3603-487d-ba45-40e1f19ccc87
spec:
  arguments: {}
  entrypoint: main
  templates:
  - dag:
      tasks:
      - arguments: {}
        name: hello
        template: hello-plugin
    inputs: {}
    metadata: {}
    name: main
    outputs: {}
  - inputs: {}
    metadata: {}
    name: hello-plugin
    outputs: {}
    plugin:
      hello: {}
status:
  artifactRepositoryRef:
    artifactRepository: {}
    default: true
  conditions:
  - status: "False"
    type: PodRunning
  - status: "True"
    type: Completed
  finishedAt: "2025-02-17T09:53:43Z"
  nodes:
    hello-plugin:
      children:
      - hello-plugin-2816962999
      displayName: hello-plugin
      finishedAt: "2025-02-17T09:53:43Z"
      id: hello-plugin
      name: hello-plugin
      outboundNodes:
      - hello-plugin-2816962999
      phase: Error
      progress: 0/1
      startedAt: "2025-02-17T09:50:11Z"
      templateName: main
      templateScope: local/hello-plugin
      type: DAG
    hello-plugin-2816962999:
      boundaryID: hello-plugin
      displayName: hello
      finishedAt: "2025-02-17T09:53:43Z"
      id: hello-plugin-2816962999
      message: Queuing
      name: hello-plugin.hello
      phase: Pending  # <-- node still Pending
      progress: 0/1
      startedAt: "2025-02-17T09:50:11Z"
      templateName: hello-plugin
      templateScope: local/hello-plugin
      type: Plugin
  phase: Error  # <-- workflow phase
  progress: 0/1
  startedAt: "2025-02-17T09:50:11Z"

Version(s)

latest

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  namespace: argo
  name: hello-plugin
spec:
  entrypoint: main
  templates:
  - dag:
      tasks:
      - arguments: {}
        name: hello
        template: hello-plugin
    name: main
  - name: hello-plugin
    plugin:
      hello: {}

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}
time="2025-02-17T17:53:43.164Z" level=info msg="Processing workflow" namespace=argo workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg=updateAgentPodStatus namespace=argo workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=error msg="Mark error node" error="agent pod failed with reason:\"The node was low on resource: ephemeral-storage.\"" namespace=argo nodeName=hello-plugin.hello workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg="node hello-plugin-2816962999 phase Pending -> Error" namespace=argo workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg="node hello-plugin-2816962999 message: agent pod failed with reason:\"The node was low on resource: ephemeral-storage.\"" namespace=argo workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg="Outbound nodes of hello-plugin set to [hello-plugin-2816962999]" namespace=argo workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg="node hello-plugin phase Running -> Error" namespace=argo workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg="node hello-plugin finished: 2025-02-17 09:53:43.165321874 +0000 UTC" namespace=argo workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg="Checking daemoned children of hello-plugin" namespace=argo workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg=reconcileAgentPod namespace=argo workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg="Updated phase Running -> Error" namespace=argo workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg="Marking workflow completed" namespace=argo workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg="Marking workflow as pending archiving" namespace=argo workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg="Checking daemoned children of " namespace=argo workflow=hello-plugin
time="2025-02-17T17:53:43.170Z" level=info msg="Workflow update successful" namespace=argo phase=Error resourceVersion=5377104182 workflow=hello-plugin

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
jswxstw added the area/agent label (Argo Agent that runs for HTTP and Plugin templates) on Feb 17, 2025
jswxstw (Member) commented Feb 17, 2025

So weird, this issue should have been fixed by #12723.

The logs below show that node hello-plugin-2816962999 has been marked as Error.

time="2025-02-17T17:53:43.165Z" level=error msg="Mark error node" error="agent pod failed with reason:\"The node was low on resource: ephemeral-storage.\"" namespace=argo nodeName=hello-plugin.hello workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg="node hello-plugin-2816962999 phase Pending -> Error" namespace=argo workflow=hello-plugin
time="2025-02-17T17:53:43.165Z" level=info msg="node hello-plugin-2816962999 message: agent pod failed with reason:\"The node was low on resource: ephemeral-storage.\"" namespace=argo workflow=hello-plugin

In addition, completed taskset nodes in WorkflowTaskSet should also be removed.

func (woc *wfOperationCtx) removeCompletedTaskSetStatus(ctx context.Context) error {

hello-plugin-2816962999:
    boundaryID: hello-plugin
    displayName: hello
    finishedAt: "2025-02-17T09:53:43Z" # finishedAt is not nil, so it has already been marked as completed.
    id: hello-plugin-2816962999
    message: Queuing # I have never seen this message here before; it should be 'agent pod failed with reason...'
    name: hello-plugin.hello
    phase: Pending
    progress: 0/1
    startedAt: "2025-02-17T09:50:11Z"
    templateName: hello-plugin
    templateScope: local/hello-plugin
    type: Plugin
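
For reference, a minimal self-contained sketch of the kind of cleanup that function is expected to perform. The types below are hypothetical stand-ins, illustrative only; the real implementation operates on the WorkflowTaskSet resource itself.

package main

import "fmt"

// Hypothetical, simplified stand-ins for the argo-workflows types; illustrative only.
type nodeResult struct {
	Phase   string
	Message string
}

type workflowTaskSet struct {
	SpecTasks   map[string]struct{}   // tasks the agent still has to run
	StatusNodes map[string]nodeResult // results reported by the agent
}

// pruneCompleted removes entries for nodes the controller already considers
// completed, which is the gist of what removeCompletedTaskSetStatus is meant to do.
func pruneCompleted(ts *workflowTaskSet, completed func(id string) bool) {
	for id := range ts.StatusNodes {
		if completed(id) {
			delete(ts.SpecTasks, id)
			delete(ts.StatusNodes, id)
		}
	}
}

func main() {
	ts := &workflowTaskSet{
		SpecTasks:   map[string]struct{}{"hello-plugin-2816962999": {}},
		StatusNodes: map[string]nodeResult{"hello-plugin-2816962999": {Phase: "Pending", Message: "Queuing"}},
	}
	// In the reported case the workflow node is already Error (completed),
	// so both the task and its stale "Queuing"/Pending status should be pruned.
	pruneCompleted(ts, func(string) bool { return true })
	fmt.Printf("tasks=%d, nodes=%d\n", len(ts.SpecTasks), len(ts.StatusNodes))
}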

Tuilot (Author) commented Feb 17, 2025

@jswxstw You're right: the message for hello-plugin-2816962999 has been updated repeatedly, so 'agent pod failed with reason...' was overwritten.

Tuilot (Author) commented Feb 17, 2025

Within a single operate call, all non-fulfilled taskset nodes are marked as Error because the agent pod failed, and then the taskset is reconciled again.
This redundant taskset reconciliation is likely the root cause of the problem.

jswxstw (Member) commented Feb 18, 2025

During updateAgentPodStatus, all uncompleted taskset nodes are marked as Error because the agent pod failed. The workflow is then in the Error state, and wfOperationCtx.taskSet is empty.

The workflow would have ended by this point. In what situation would the redundant reconcileTaskSet you mentioned occur?
I cannot reproduce this issue, and there is a similar case in tests:

func TestHTTPTemplate(t *testing.T) {

Tuilot (Author) commented Feb 18, 2025

@jswxstw This test modification triggers the error. All of the processes mentioned above happen within a single operate call.

t.Run("ExecuteHTTPTemplate", func(t *testing.T) {
		ctx := context.Background()
		woc := newWorkflowOperationCtx(wf, controller)
		woc.operate(ctx)
		pod, err := controller.kubeclientset.CoreV1().Pods(woc.wf.Namespace).Get(ctx, woc.getAgentPodName(), metav1.GetOptions{})
		assert.NoError(t, err)
		assert.NotNil(t, pod)
		ts, err := controller.wfclientset.ArgoprojV1alpha1().WorkflowTaskSets(wf.Namespace).Get(ctx, "hello-world", metav1.GetOptions{})
		assert.NoError(t, err)
		assert.NotNil(t, ts)
		assert.Len(t, ts.Spec.Tasks, 1)
		ts.Status.Nodes = make(map[string]wfv1.NodeResult)
		ts.Status.Nodes["hello-world"] = wfv1.NodeResult{
			Phase:   wfv1.NodePending,
			Message: "Queuing",
		}
		_, err = controller.wfclientset.ArgoprojV1alpha1().WorkflowTaskSets(wf.Namespace).UpdateStatus(ctx, ts, metav1.UpdateOptions{})
		assert.Nil(t, err)
		wf, err = controller.wfclientset.ArgoprojV1alpha1().Workflows(wf.Namespace).Get(ctx, "hello-world", metav1.GetOptions{})
		assert.Nil(t, err)
		// simulate agent pod failure scenario
		pod.Status.Phase = v1.PodFailed
		pod.Status.Message = "manual termination"
		pod, err = controller.kubeclientset.CoreV1().Pods(woc.wf.Namespace).UpdateStatus(ctx, pod, metav1.UpdateOptions{})
		assert.Nil(t, err)
		assert.Equal(t, v1.PodFailed, pod.Status.Phase)
		// sleep 1 second to wait for informer getting pod info
		time.Sleep(time.Second)
		woc = newWorkflowOperationCtx(wf, controller)
		woc.operate(ctx)
		assert.Equal(t, wfv1.WorkflowError, woc.wf.Status.Phase)
		assert.Equal(t, `agent pod failed with reason:"manual termination"`, woc.wf.Status.Message)
		assert.Len(t, woc.wf.Status.Nodes, 1)
		assert.Equal(t, wfv1.NodeError, woc.wf.Status.Nodes["hello-world"].Phase)
		assert.Equal(t, `agent pod failed with reason:"manual termination"`, woc.wf.Status.Nodes["hello-world"].Message)
		ts, err = controller.wfclientset.ArgoprojV1alpha1().WorkflowTaskSets(wf.Namespace).Get(ctx, "hello-world", metav1.GetOptions{})
		assert.NoError(t, err)
		assert.NotNil(t, ts)
		assert.Empty(t, ts.Spec.Tasks)
		assert.Empty(t, ts.Status.Nodes)
	})

Tuilot (Author) commented Feb 18, 2025

@jswxstw

The workflow would have ended by this point. In what situation would the redundant reconcileTaskSet you mentioned occur? I cannot reproduce this issue, and there is a similar case in tests:

The redundant reconcileTaskSet refers to unnecessarily reconciling the taskSet when woc.taskSet is empty. The correct behavior would be to skip the reconciliation in that case, as sketched below.
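
A minimal, self-contained sketch of that guard, using hypothetical stand-in types rather than the real wfOperationCtx; the actual fix is the one proposed in #14230.

package main

import "fmt"

// Hypothetical stand-in for the controller's operation context; illustrative only.
type operationCtx struct {
	// taskSet holds the non-fulfilled taskset tasks collected during this operate call.
	taskSet map[string]struct{}
}

// taskSetReconciliation sketches the proposed guard: when every taskset node has
// already been fulfilled (e.g. marked Error after the agent pod failed), there is
// nothing to reconcile, so return early instead of re-applying stale
// WorkflowTaskSet status such as "Queuing"/Pending over the Error nodes.
func (woc *operationCtx) taskSetReconciliation() {
	if len(woc.taskSet) == 0 {
		fmt.Println("no non-fulfilled taskset tasks; skipping reconciliation")
		return
	}
	// ... the real controller would merge agent-reported results into workflow nodes here
}

func main() {
	woc := &operationCtx{taskSet: map[string]struct{}{}}
	woc.taskSetReconciliation()
}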

jswxstw (Member) commented Feb 19, 2025

@Tuilot Would you like to submit a PR to fix this?

jswxstw added the solution/suggested label (A solution to the bug has been suggested. Someone needs to implement it.) on Feb 19, 2025
Tuilot (Author) commented Feb 19, 2025

@jswxstw yes
