[WIP] Slurm agent #3005

Draft: 5 commits to merge into base branch master

JiangJiaWei1103
Contributor

JiangJiaWei1103 commented Dec 16, 2024

Tracking issue

flyteorg/flyte#5634

Why are the changes needed?

What changes were proposed in this pull request?

Add asynchronous create and get methods to the Slurm agent, built on asyncssh:

  1. create: Submit a batch script to the remote Slurm cluster with sbatch
  2. get: Check the job state
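The agent code itself is not shown in this description, so the following is only a minimal sketch of how the two methods could work, not the PR's implementation. It assumes that sbatch prints its usual "Submitted batch job <id>" line and that the job state is read from the JobState field of `scontrol show job`; the helper names (parse_job_id, parse_job_state) and the host and script arguments are hypothetical.

```python
import re
import shlex
from typing import Optional


def parse_job_id(sbatch_stdout: str) -> str:
    """Extract the job ID from sbatch output such as 'Submitted batch job 42'."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    if match is None:
        raise ValueError(f"Unexpected sbatch output: {sbatch_stdout!r}")
    return match.group(1)


def parse_job_state(scontrol_stdout: str) -> Optional[str]:
    """Extract the JobState field from 'scontrol show job <id>' output."""
    match = re.search(r"\bJobState=(\S+)", scontrol_stdout)
    return match.group(1) if match else None


async def create(host: str, remote_script: str) -> str:
    """Hypothetical create: submit a script already on the cluster, return its job ID."""
    import asyncssh  # imported lazily so the parsing helpers work without it

    async with asyncssh.connect(host) as conn:
        result = await conn.run(f"sbatch {shlex.quote(remote_script)}", check=True)
    return parse_job_id(str(result.stdout))


async def get(host: str, job_id: str) -> Optional[str]:
    """Hypothetical get: ask Slurm for the current state of a submitted job."""
    import asyncssh

    async with asyncssh.connect(host) as conn:
        result = await conn.run(f"scontrol show job {shlex.quote(job_id)}", check=True)
    return parse_job_state(str(result.stdout))
```

With a reachable cluster, these coroutines could be driven with asyncio.run, for example asyncio.run(create("my-slurm-host", "/tmp/test.py")), where the host name is a placeholder. Keeping the parsing separate from the SSH calls makes it possible to unit test the state handling without a Slurm installation.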

Todos

  • Switch to ShellTask

How was this patch tested?

We tested these two methods in the following development environment:

  • Local: MacBook with flytekit installed
  • Remote: Single Ubuntu server with slurmctld and slurmd running
    • We plan to write a single-host setup tutorial and organize useful resources here
  • The local and remote machines communicate over an SSH connection, and files can be transferred over SFTP
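As a rough illustration of the SSH and SFTP path described above, a batch script could be staged locally and then copied to the server. This is a sketch under assumptions rather than the PR's code: the function names, the staged file name, and the host argument are all hypothetical, and only the asyncssh calls connect, start_sftp_client, and put are taken from the asyncssh library.

```python
import tempfile
from pathlib import Path
from typing import Optional


def stage_batch_script(batch_script: str, directory: Optional[str] = None) -> Path:
    """Write the batch script to a local file so it can be sent over SFTP."""
    # Fall back to a fresh temporary directory when none is given.
    target_dir = Path(directory or tempfile.mkdtemp())
    path = target_dir / "flyte_slurm_job.sh"  # hypothetical file name
    path.write_text(batch_script)
    return path


async def upload(host: str, local_path: Path, remote_path: str) -> None:
    """Copy a local file to the remote server through an SFTP session."""
    import asyncssh  # imported lazily so staging works without asyncssh installed

    async with asyncssh.connect(host) as conn:
        async with conn.start_sftp_client() as sftp:
            await sftp.put(str(local_path), remote_path)
```

Staging to a file first keeps the transfer step independent of how the script text was produced, so the same upload helper would work for any local file.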

The following script demonstrates a simple Slurm task:

# tiny_slurm.py
import os
from typing import Any, Dict

from flytekit import kwtypes, workflow
from flytekitplugins.slurm import Slurm, SlurmTask


class CFG:
    run_remote = False


slurm_tiny_job = SlurmTask(
    name="demo-slurm",
    task_config=Slurm(
        batch_script="""#!/bin/bash
#SBATCH --account=flyte
#SBATCH --partition=debug

echo "Display hostname with srun..."
srun -N 1 hostname""",
        local_path="./test.py",
        remote_path="/tmp/test.py",
    ),
    inputs=kwtypes(dummy=str),
)


@workflow
def hi_slurm(dummy: str) -> Dict[str, Any]:
    """Return slurm job information."""
    res = slurm_tiny_job(dummy=dummy)

    return res


if __name__ == "__main__":
    from flytekit.clis.sdk_in_container import pyflyte
    from click.testing import CliRunner

    runner = CliRunner()
    path = os.path.realpath(__file__)

    # Local run
    print(">>> LOCAL EXEC <<<")
    result = runner.invoke(pyflyte.main, ["run", path, "hi_slurm", "--dummy", "dummy_input"])
    print(result.output)

The test results are as follows:

  • Local
    Screenshot 2024-12-19 at 12 06 01 AM

  • Remote with Slurm
    Screenshot 2024-12-19 at 12 06 32 AM

Setup process

As stated above

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link


codecov bot commented Dec 19, 2024

Codecov Report

Attention: Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.

Project coverage is 75.30%. Comparing base (f99d50e) to head (9644b99).
Report is 11 commits behind head on master.

Files with missing lines            Patch %   Lines
flytekit/extend/backend/utils.py    0.00%     2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #3005       +/-   ##
===========================================
+ Coverage   51.08%   75.30%   +24.21%     
===========================================
  Files         201      201               
  Lines       21231    21247       +16     
  Branches     2731     2729        -2     
===========================================
+ Hits        10846    15999     +5153     
+ Misses       9787     4486     -5301     
- Partials      598      762      +164     

