- Define a user config.
- Check whether the job is runnable: if yes, run it; otherwise, ask for queuing.
- Send back the printed output of the script.
- Once the job has terminated successfully, retrieve all output data over ssh and clean everything up.
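
The run-or-queue decision in the second step could be sketched as follows. This is only an illustration under stated assumptions, not slurmlib's actual implementation: it assumes availability is judged from the text printed by `sinfo -h -t idle -o %D` (one idle-node count per line), and the helper names `count_idle_nodes` and `choose_action` are hypothetical.

```python
def count_idle_nodes(sinfo_output: str) -> int:
    """Sum the per-line node counts printed by `sinfo -h -t idle -o %D`."""
    total = 0
    for line in sinfo_output.splitlines():
        line = line.strip()
        # Skip blank or unexpected lines instead of failing on them.
        if line.isdigit():
            total += int(line)
    return total


def choose_action(idle_nodes: int, nodes_needed: int = 1) -> str:
    """Return "run" if enough nodes are idle, otherwise "queue" (hypothetical policy)."""
    return "run" if idle_nodes >= nodes_needed else "queue"
```

Keeping the parsing separate from the decision makes both parts easy to check without access to a cluster.
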
Example of what it would look like:

    import slurmlib
    import torch
    import pandas

    # Define a job here:
    class Net(torch.nn.Module):
        ...

    # Define the function to run here.
    def train(arg1, arg2, kwarg1=v, kwarg2=v2):
        net = Net()
        for i in ...:
            ...
        output.to_csv("results.csv")

    if __name__ == '__main__':
        slurmlib.Job(gpu=1, feature='tesla')
        slurmlib.run(train, arg1, arg2, kwarg1=v, kwarg2=v2)
        # The function has to save its results to a file so they can be pulled back from Slurm.
        output = pandas.read_csv("results.csv")

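
To make the `gpu` and `feature` options in the example above more concrete, here is a minimal sketch of how such options might be translated into standard Slurm `sbatch` flags (`--gres=gpu:N` and `--constraint=NAME`). The helper `sbatch_flags` and the script name `job.sh` are hypothetical; slurmlib itself may map these options differently.

```python
from typing import List


def sbatch_flags(gpu: int = 0, feature: str = "") -> List[str]:
    """Translate job options into sbatch command-line flags (hypothetical helper)."""
    flags = []
    if gpu > 0:
        # Request `gpu` GPUs as a generic resource.
        flags.append(f"--gres=gpu:{gpu}")
    if feature:
        # Restrict the job to nodes that advertise this feature.
        flags.append(f"--constraint={feature}")
    return flags


# job.sh is a placeholder name for the batch script.
command = ["sbatch", *sbatch_flags(gpu=1, feature="tesla"), "job.sh"]
print(" ".join(command))  # prints: sbatch --gres=gpu:1 --constraint=tesla job.sh
```
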
You first need a working ssh setup with the cluster!
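
One way to verify the ssh connection before installing is a non-interactive check like the sketch below. It assumes key-based authentication and an `ssh` client on the PATH; the function names are hypothetical, and the host alias you pass in is whatever you use for your cluster.

```python
import subprocess
from typing import List


def ssh_check_command(host: str) -> List[str]:
    """Build an ssh command that runs `true` on the host without prompting."""
    # BatchMode=yes makes ssh fail instead of asking for a password;
    # ConnectTimeout=10 gives up after 10 seconds if the host is unreachable.
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10", host, "true"]


def ssh_works(host: str) -> bool:
    """Return True if the non-interactive ssh check exits with status 0."""
    try:
        result = subprocess.run(ssh_check_command(host), capture_output=True)
    except FileNotFoundError:
        # No ssh client is installed on this machine.
        return False
    return result.returncode == 0
```

If `ssh_works` returns `False` for your cluster host, fix the ssh configuration before going further.
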

    git clone https://github.com/Diviyan-Kalainathan/slurmlib.git
    cd slurmlib
    python setup.py install
    cd ~/.ssh  # Configure the files default_ssh_config and slurm_config.yml