
Test for pbs files on slurm #166

Open · aalitaiga wants to merge 35 commits into SMART-Lab:master from aalitaiga:adrien_slurm

Conversation

aalitaiga:

No description provided.

@aalitaiga aalitaiga force-pushed the adrien_slurm branch 2 times, most recently from 355547e to 6f31489 on August 11, 2017 at 14:40

#PBS -N arrayJob
#PBS -o arrayJob_%A_%a.out
#PBS -t 1-2
Collaborator:

We can ignore arrays as long as SmartDispatch is not using them.

)

#
# pbs_string2 = """\
Collaborator:

Why a second PBS template?

text_file.write(command)
process = Popen("sbatch test.pbs", stdout=PIPE, stderr=PIPE, shell=True)
stdout, stderr = process.communicate()
assert_true("Submitted batch job" in stdout)
Collaborator:

You can define the function as a method of TestSlurm and use self.assertIn rather than assert_true.
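A minimal sketch of that suggestion; the method name `submit_pbs` is hypothetical, while the file name `test.pbs` and the sbatch invocation mirror the code under review:

```python
import unittest
from subprocess import Popen, PIPE


class TestSlurm(unittest.TestCase):

    def submit_pbs(self, command):
        # Write the job file, as in the PR under review.
        with open("test.pbs", "w") as text_file:
            text_file.write(command)
        process = Popen("sbatch test.pbs", stdout=PIPE, stderr=PIPE,
                        shell=True, universal_newlines=True)
        stdout, stderr = process.communicate()
        # assertIn reports both operands on failure, unlike assert_true.
        self.assertIn("Submitted batch job", stdout)
```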

stdout, stderr = process.communicate()
job_params = [c.strip() for c in stdout.split("\n")[1:] if c != '']
# import ipdb; ipdb.set_trace()
assert_true(all(p == param for p in job_params))
Collaborator:

Again, this would give a more detailed error message: self.assertSequenceEqual(job_params, (param for _ in job_params))
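As a hedged illustration, comparing against a plain list of the expected value keeps assertSequenceEqual's element-by-element diff working (a generator, as in the snippet above, does not support the indexing the diff machinery relies on); the sample values here are made up:

```python
import unittest


class _Harness(unittest.TestCase):
    # A runTest method lets us instantiate the TestCase directly.
    def runTest(self):
        pass


harness = _Harness()
param = "gpu:2"
job_params = ["gpu:2", "gpu:2", "gpu:2"]
# Passes silently when every element equals `param` ...
harness.assertSequenceEqual(job_params, [param] * len(job_params))
# ... and reports the first differing index when one does not.
try:
    harness.assertSequenceEqual(["gpu:2", "gpu:1"], [param, param])
    raised = False
except AssertionError:
    raised = True
```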


def tearDown(self):
process = Popen("rm *.out test.pbs", stdout=PIPE, stderr=PIPE, shell=True)
stdout, stderr = process.communicate()
Collaborator:

I think it'd be better to use glob instead of Popen.

for file_name in (glob.glob('*.out') + ["test.pbs"]):
    os.remove(file_name)

#SBATCH --job-name=arrayJob
#SBATCH --output=arrayJob_%A_%a.out
#SBATCH --time=01:00:00
#SBATCH --gres=gpu
Collaborator:

-l naccelerators=x will give gpu:x with squeue -o gres, so you should use --gres=gpu:x in the sbatch file too. You should not test only for 1; give it a random integer too.

#SBATCH --output=arrayJob_%A_%a.out
#SBATCH --time=01:00:00
#SBATCH --gres=gpu
#SBATCH --constraint=gpu6gb
Collaborator:

gpu:x and the constraint should be formattable in the sbatch string, since we want to try different values.
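A sketch of what such a formattable template could look like, combining this comment with the random-GPU-count suggestion above (the field names `gpus` and `constraint` are illustrative, not taken from the PR):

```python
import random

sbatch_template = """\
#SBATCH --job-name=arrayJob
#SBATCH --output=arrayJob_%A_%a.out
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:{gpus}
#SBATCH --constraint={constraint}
"""

# Not testing for 1 only: pick a random GPU count, per the review above.
n_gpus = random.randint(1, 4)
sbatch_string = sbatch_template.format(gpus=n_gpus, constraint="gpu6gb")
```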

output_array = output_array or param_array
for param, output in zip(param_array, output_array):
com = pbs_string.format(
string.format(command.format(param))
Collaborator:

Just for clarity:
command -> command_template
com -> param_command


def _add_cluster_specific_rules(self):
for pbs in self.pbs_list:
node_resource = pbs.resources.pop('nodes')
Collaborator:

It doesn't seem wise to pop ['nodes'] if there is other information than gpus and ppn. I would rather substitute "gpus=[0-9]+:" with "". However, we would need to check whether resources['nodes'] becomes empty afterwards and remove it if so. It is very likely to happen, but the check is necessary if we want to keep other information when present.
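A sketch of that substitution, using the "nodes=1:ppn=2:gpus=2" example discussed in this thread:

```python
import re

resources = {'nodes': '1:ppn=2:gpus=2'}
# Strip only the gpus= part; keep ppn and anything else that is present.
resources['nodes'] = re.sub(r"gpus=[0-9]+:?", "", resources['nodes']).strip(':')
# Remove the key entirely if nothing is left after the substitution.
if not resources['nodes']:
    del resources['nodes']
```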

Author:

If we go from "nodes=1:ppn=2:gpus=2" to "nodes=1", is that okay, or should I remove nodes completely in this case?

Collaborator:

It is ok

Parameters
----------
**options : dict
each key is the name of a SBATCH option (see `Options`)
Collaborator:

There is no Options section in this docstring

"""

for option_name, option_value in options.items():
self.sbatch_options[option_name] = option_value
Collaborator:

We would need to add back the "-"/"--" here if we drop it in add_sbatch_flags.

if len(option_name) == 1:
    option_name = "-" + option_name
else:
    option_name = "--" + option_name

Collaborator:

Isn't there a duplicate at line 81? You can change it to option_name = "-" + option_name and leave line 81 like it is.


for flag in flags:
split = flag.find('=')
options[flag[:split]] = flag[split+1:]
Collaborator:

It would be more consistent with the PBS options treatment if we dropped the "-"/"--" in front of the option names. I would also raise a ValueError if neither is present.

@@ -39,6 +39,13 @@
nvidia-smi
"""

# Checking which cluster is running the tests first
process = Popen("sacctmgr list cluster", stdout=PIPE, stderr=PIPE, shell=True)
Collaborator:

would subprocess.check_output be simpler to use? It's used widely in this repo.
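A sketch of what that could look like (the function name here is hypothetical):

```python
import subprocess


def list_clusters():
    # check_output raises CalledProcessError on a non-zero exit code,
    # instead of silently handing back stderr as the Popen call does here.
    return subprocess.check_output("sacctmgr list cluster", shell=True,
                                   universal_newlines=True)
```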

Collaborator:

Also I would write functions for qstat and slurm cluster detections. Something like

def get_qstat_cluster_name():
    """
    ... Some doc ...

    Raises
    ------
    OSError
        The cluster may not have qstat installed.
    """
    output = Popen(["qstat", "-B"], stdout=PIPE).communicate()[0]
    return output.split('\n')[2].split(' ')[0]


def get_slurm_cluster_name():
    """
    ... Some doc ...
    Raises
    ------
    OSError
        The cluster may not have slurm installed.
    """
    # ... Some code ...

Those would be used in detect_cluster()

@@ -115,6 +115,7 @@ def detect_cluster():
output = Popen(["qstat", "-B"], stdout=PIPE).communicate()[0]
except OSError:
# If qstat is not available we assume that the cluster is unknown.
# TODO: handle MILA + CEDAR + GRAHAM
Collaborator (@bouthilx, Sep 21, 2017):

You can insert the function get_slurm_cluster_name() here for reference, but let it still return None. We don't want to add support for specific clusters now. That will be another ticket.

@bouthilx (Collaborator):
@aalitaiga You should rename pbs_slurm_test.py, otherwise Travis will keep reporting failures. We could add --ignore-files= in .travis.yml, but we don't want those tests to run locally either. They are only useful if we want to test new settings on the slurm configuration, so they should only be run alone. A name like verify_slurms_pbs_wrapper.py would be self-explanatory.

@@ -78,7 +78,12 @@ def add_sbatch_flags(self, flags):

for flag in flags:
split = flag.find('=')
options[flag[:split]] = flag[split+1:]
if flag.startswith('--'):
options[flag[2:split]] = flag[split+1:]
Collaborator:

This would be simpler.

if not flag.startswith("-"):
  raise ValueError("Invalid SBATCH flag ({})".format(flag))

options[flag[:split].lstrip("-")] = flag[split+1:]

gpus = re.match(".*gpus=([0-9]+)", pbs.resources['nodes']).group(1)
ppn = re.match(".*ppn=([0-9]+)", pbs.resources['nodes']).group(1)
pbs.resources['nodes'] = re.sub("ppn=[0-9]+", "", pbs.resources['nodes'])
pbs.resources['nodes'] = re.sub(":gpus=[0-9]+", "", pbs.resources['nodes'])
Collaborator:

You don't look for empty pbs.resources['nodes'] because there is always nodes=1 remaining? That would be fine in my opinion, I just want to be sure to understand for which reason you do not check.

Author:

Yes it is the reason

@@ -115,7 +115,7 @@ def detect_cluster():
output = Popen(["qstat", "-B"], stdout=PIPE).communicate()[0]
except OSError:
# If qstat is not available we assume that the cluster is unknown.
# TODO: handle MILA + CEDAR + GRAHAM
cluster_name = get_slurm_cluster_name()
Collaborator:

Please leave the TODO there. Otherwise, it is not obvious why it still returns None.

@@ -131,6 +131,11 @@ def detect_cluster():
cluster_name = "hades"
return cluster_name

def get_slurm_cluster_name():
Collaborator:

We need a unit test for this

Author:

There's no unit test for 'detect_cluster'; should I make one for 'get_cluster_name'?

Collaborator:

There should be one for each function.

@@ -73,6 +73,21 @@ def add_pbs_flags(self, flags):
pbs.add_resources(**resources)
pbs.add_options(**options)

def add_sbatch_flags(self, flags):
Collaborator:

We need a unit test for add_sbatch_flags


class SlurmJobGenerator(JobGenerator):

def _add_cluster_specific_rules(self):
Collaborator:

We need a unit test for this

@@ -206,6 +209,7 @@ def parse_arguments():

parser.add_argument('-p', '--pool', type=int, help="Number of workers that will be consuming commands. Default: Nb commands")
parser.add_argument('--pbsFlags', type=str, help='ADVANCED USAGE: Allow to pass a space seperated list of PBS flags. Ex:--pbsFlags="-lfeature=k80 -t0-4"')
parser.add_argument('--sbatchFlags', type=str, help='ADVANCED USAGE: Allow to pass a space seperated list of SBATCH flags. Ex:--sbatchFlags="-qos=high --output=file.out"')
Collaborator:

We need a unit test for this.

@bouthilx (Collaborator) commented Sep 29, 2017:

We need unit tests for

  • add_sbatch_flags
  • add_sbatch_options
  • command line option
  • job generator pbs file replacements (ppn, gpus, etc.)

Make sure to test that what should pass passes and what should fail fails. We need to try many different things to test the command line options. For instance, we need to try adding bad options or adding options already included by the job generator. We need to make sure to probe every crack, as this seems to me like an error-prone setup (replacing options for one framework with options that a wrapper of another framework will recognize as what we want them to be for the latter framework 😓).

elif flag.startswith('-'):
options[flag[1:split]] = flag[split+1:]
if split == -1:
raise ValueError("Invalid SBATCH flag ({})".format(flag))
Collaborator:

We know it's invalid probably because it should be a single - or the = is missing. The error message should give a hint about it.
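A hypothetical sketch of how the message could hint at the cause (the helper name and the exact wording are made up, not from the PR):

```python
def parse_sbatch_flag(flag):
    split = flag.find('=')
    if not flag.startswith('-'):
        raise ValueError(
            "Invalid SBATCH flag ({}): it should start with '-' or '--'".format(flag))
    if split == -1 and flag.startswith('--'):
        raise ValueError(
            "Invalid SBATCH flag ({}): long options need '=value'".format(flag))
    if split == -1:
        # Short option like "-phigh": the name is the single letter.
        return flag[1], flag[2:]
    return flag[:split].lstrip('-'), flag[split + 1:]
```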

raise ValueError("Invalid SBATCH flag ({})".format(flag))
options[flag[:split].lstrip("-")] = flag[split+1:]
elif flag.startswith('-') and split == -1:
options[flag[1:2]] = flag[2:]
Collaborator:

simply flag[1]? 😛

def test_add_sbatch_flag_invalid(self):
invalid_flags = ["--qos high", "gpu", "-lfeature=k80"]
for flag in invalid_flags:
assert_raises(ValueError, self._test_add_sbatch_flags, "--qos high")
Collaborator:

You pass the same flag each time.

self.pbs = job_generator.pbs_list

def test_ppn_ncpus(self):
assert_true("ppn" not in str(self.pbs[0]))
Collaborator:

Don't forget to make a dummy mock of self._add_cluster_specific_rules() to verify that, without the rules applied, ppn was indeed present in the PBS and ncpus was not.
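A self-contained sketch of that idea, using a simplified stand-in class rather than the real SlurmJobGenerator (the attribute and resource string here are illustrative only):

```python
from unittest import mock


class DummyJobGenerator(object):
    def __init__(self):
        self.pbs = "nodes=1:ppn=2"
        self._add_cluster_specific_rules()

    def _add_cluster_specific_rules(self):
        self.pbs = self.pbs.replace("ppn", "ncpus")


# Patch the hook with a no-op to prove ppn was present beforehand ...
with mock.patch.object(DummyJobGenerator, "_add_cluster_specific_rules"):
    raw = DummyJobGenerator()
assert "ppn" in raw.pbs and "ncpus" not in raw.pbs

# ... and that the real hook is what rewrites it to ncpus.
translated = DummyJobGenerator()
assert "ppn" not in translated.pbs and "ncpus" in translated.pbs
```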

@aalitaiga aalitaiga force-pushed the adrien_slurm branch 3 times, most recently from b42e8bd to a02e32b on October 11, 2017 at 21:51
@coveralls:

Coverage Status

Coverage increased (+0.8%) to 95.346% when pulling a02e32b on aalitaiga:adrien_slurm into 9133e15 on SMART-Lab:master.

@aalitaiga aalitaiga force-pushed the adrien_slurm branch 2 times, most recently from 5bd3c86 to 39107f1 on October 12, 2017 at 01:06
@coveralls:

Coverage Status

Coverage increased (+0.8%) to 95.346% when pulling 39107f1 on aalitaiga:adrien_slurm into 9133e15 on SMART-Lab:master.

aalitaiga and others added 28 commits October 17, 2017 20:55
Why:

For each option, add_sbatch_option would add the option in both forms, --[OPTION_NAME] and [OPTION_NAME].
It will need many conversions, not only on resources, so it is better to make it clean.
Slurm has no queues, so PBS option -q is invalid and non-convertible.
$PBS_JOBID was used to set the stdout/err of the job as well as in the
commands. Replace them with $SLURM_JOB_ID.

Also, workers were accessing os.environ[PBS_JOBID], so we added a second fetch on SLURM_JOB_ID in case os.environ[PBS_JOBID] came back undefined.
Slurm cannot be passed environment variables defined locally on
command-line like PBS_FILENAME is. To bypass this, we add a definition
in the prolog, making PBS_FILENAME available to all commands and epilog.

NOTE: We leave PBS_FILENAME definition in command-line too such that any
user using $PBS_FILENAME inside a custom pbsFlag can still do so.
The PBS option -V is not converted properly to SBATCH --export ALL. We remove it and replace it with --export=ALL in the sbatch options.
Slurm does not have an equivalent environment variable set like PBS_WALLTIME. To avoid confusion, all PBS_WALLTIME variables are renamed to SBATCH_TIMELIMIT (the environment variable one would use to set --time with sbatch). As SBATCH_TIMELIMIT is not set automatically, we add it to the prolog to make it available to all commands and the epilog.

NOTE: PBS_WALLTIME is set in seconds, but we only have HH:MM:SS-like strings at the time of building the PBS file. We needed to add a walltime_to_seconds helper function to convert HH:MM:SS-like strings into seconds, so that SBATCH_TIMELIMIT is set in seconds like PBS_WALLTIME.
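A hedged sketch of what such a helper could look like (the actual implementation in the PR may differ):

```python
def walltime_to_seconds(walltime):
    """Convert 'HH:MM:SS' or 'DD:HH:MM:SS' walltime strings to seconds."""
    parts = [int(part) for part in walltime.split(":")]
    if len(parts) == 4:          # DD:HH:MM:SS
        days, hours, minutes, seconds = parts
    elif len(parts) == 3:        # HH:MM:SS
        days = 0
        hours, minutes, seconds = parts
    else:
        raise ValueError("Unsupported walltime format: {}".format(walltime))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds
```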
It is possible to query the system to see if some commands are available
using distutils.spawn.find_executable(command_name).

Clusters where more than one launcher is available will still get launchers selected based on string matching. For instance, get_launcher("helios") would always return msub no matter what is available on the system.
JobGenerators are selected by job_generator_factory based on the cluster's name. We use a more flexible, duck-typing approach for Slurm clusters. If the cluster name is not known, or does not match any of the if-case clauses in the factory, then we look at which launchers are available on the system. If it is sbatch, then a SlurmJobGenerator is built, and a JobGenerator otherwise.
The command `sacctmgr` fails on some computers (mila01 namely), but the
current behavior gives the impression sbatch is simply not available.

Printing the stderr makes it more obvious that sbatch should be available but that something is broken behind sacctmgr. It only appears when using the -vv option, though.
Adding a script to do automatic verifications to assert validity of the
current code.

The verifications are not automatic unit tests: the script automatically checks that the process executed successfully, but the administrator still needs to verify manually, by reading the logs, that the requested resources were provided.

Verifications can easily be combined, building on top of each other, from complex ones to simpler ones.

Here is a list of all the verifications currently implemented for slurm
clusters:

 1. very_simple_task                              (1  CPU)
 2. verify_simple_task_with_one_gpu               (1  CPU 1 GPU)
 3. verify_simple_task_with_many_gpus             (1  CPU X GPU)
 4. verify_many_task                              (X  CPU)
 5. verify_many_task_with_many_cores              (XY CPU)
 6. verify_many_task_with_one_gpu                 (X CPU X GPU)
 7. verify_many_task_with_many_gpus               (X CPU Y GPU)
 8. verify_simple_task_with_autoresume_unneeded   (1 CPU)
 9. verify_simple_task_with_autoresume_needed     (1 CPU)
10. verify_many_task_with_autoresume_needed       (X CPU)
My initial thought was that get_launcher should raise an error when no launcher is found on the system, since no job could be launched. I realized that this would break the --doNotLaunch option that users may want to use on systems with no launcher, just to create the files.
The tests were failing because the account was not specified.
The tests were failing because the account was not specified.
There was a missing parenthesis which was causing a bad conversion of
"DD:HH:MM:SS" to seconds.

The unit test was also missing the same parenthesis. I added a unit test
to make sure such an error could not occur again.
@coveralls:

Coverage Status

Coverage increased (+1.2%) to 95.738% when pulling 33c048b on aalitaiga:adrien_slurm into 9133e15 on SMART-Lab:master.

3 participants