Enhance wlm-operator #70

Open
zjgemi opened this issue Aug 9, 2022 · 0 comments
zjgemi (Collaborator) commented Aug 9, 2022

There are some TODOs for wlm-operator, including but not limited to:

  • Develop a robust agent for forwarding the red-box socket. It should retry after network interruptions.
  • Make the configurator more robust to interruptions in the socket forwarding.
  • wlm-operator is able to get the logs of Slurm jobs, whereas Argo's resource template only outputs something like:
time="2022-07-13T02:39:55.042Z" level=info msg="Get slurmjobs 200"
time="2022-07-13T02:39:55.043Z" level=info msg="failure condition '{status.status == [Failed]}' evaluated false"
time="2022-07-13T02:39:55.043Z" level=info msg="success condition '{status.status == [Succeeded]}' evaluated false"
time="2022-07-13T02:39:55.044Z" level=info msg="0/1 success conditions matched"
time="2022-07-13T02:39:55.045Z" level=info msg="Waiting for resource slurmjob.wlm.sylabs.io/wlm-rhhbc-hello-dphos-hello-slurm-run-4203105651 in namespace argo resulted in retryable error: Neither success condition nor the failure condition has been matched. Retrying..."

wlm-operator could persist the job logs on the local side.

  • To avoid modifying Argo, dflow uses three steps to complete a wlm template: a prepare step, a run step and a collect step. The prepare step copies the input artifacts from the container to a host path. The run step mounts the host directory and applies the wlm resource, which uploads the input files to the remote cluster, submits a Slurm job, and finally downloads the output files to a mounted host directory. The collect step copies the output artifacts from the host back to the container for Argo to collect. Can this procedure be simplified?
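
For discussion, the three-step flow in the last item could be sketched roughly as below. This is only an illustration of the data movement, not dflow's or wlm-operator's actual code: the function names, the `inputs`/`outputs` directory layout, and the `submit` callable (a stand-in for applying the wlm resource, which uploads, submits the Slurm job and downloads) are all hypothetical.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def prepare(container_inputs: Path, host_dir: Path) -> Path:
    """Prepare step: copy input artifacts from the container to a host path."""
    dest = host_dir / "inputs"
    shutil.copytree(container_inputs, dest, dirs_exist_ok=True)
    return dest


def run(host_dir: Path, submit: Callable[[Path, Path], None]) -> Path:
    """Run step: hand the mounted host directory to the job submitter.

    `submit` is a hypothetical stand-in for applying the wlm resource, which
    would upload the inputs, submit the Slurm job and download the outputs.
    """
    outputs = host_dir / "outputs"
    outputs.mkdir(parents=True, exist_ok=True)
    submit(host_dir / "inputs", outputs)
    return outputs


def collect(host_outputs: Path, container_outputs: Path) -> Path:
    """Collect step: copy output artifacts from the host back to the container."""
    shutil.copytree(host_outputs, container_outputs, dirs_exist_ok=True)
    return container_outputs


def fake_submit(inputs: Path, outputs: Path) -> None:
    """Stand-in for the remote job: upper-case the text of every input file."""
    for f in inputs.iterdir():
        (outputs / f.name).write_text(f.read_text().upper())


def demo() -> str:
    """Run all three steps in a temporary directory and return the result."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        container_in = root / "container" / "in"
        container_in.mkdir(parents=True)
        (container_in / "msg.txt").write_text("hello")
        host = root / "host"
        host.mkdir()

        prepare(container_in, host)
        outputs = run(host, fake_submit)
        result = collect(outputs, root / "container" / "out")
        return (result / "msg.txt").read_text()
```

Written this way, the prepare and collect steps are plain copies between the container and the host path, which is presumably where any simplification would have to come from.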
njzjz pushed a commit to njzjz/dflow that referenced this issue Nov 28, 2022