-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathSpartan_Introduction.Rmd
434 lines (284 loc) · 20.3 KB
/
Spartan_Introduction.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
---
title: "Spartan Introduction - Coding Club"
author: "David Wilkinson"
date: "21 March 2017"
output:
html_document:
toc: true
toc_float: true
pdf_document:
toc: true
---
# What is Spartan?
***
<center>![](Images/spartanlogo.png)</center>
***
Spartan is the University of Melbourne's general purpose hybrid traditional high performance computing system (HPC) with cloud instances from the NeCTAR Research Cloud and attached Research Data Storage Services (RDSS).
It is designed to suit the needs of researchers whose desktop/laptop is not up to the particular task. Models running slow, datasets are too big, not enough cores, application licensing issues, etc.
Spartan consists of:
* a management node for system administrators,
* a log in node for users to connect to the system and submit jobs,
* a small number of 'bare metal' compute nodes for multinode tasks,
* any 'bare metal' user-procured hardware (e.g., departmental nodes),
* vHPC cloud compute nodes for overflow and GPGPU tasks, and
* general cloud compute nodes.
# Accessing Spartan
## Getting an Account
To gain access to Spartan you need to create an account.
* Via Karaage at [link](https://dashboard.hpc.unimelb.edu.au/karaage)
* Need to create a project
+ Need a project leader (you), and you can invite collaborators for joint projects
+ Need a project title/description to demonstrate research goals and/or research support
* Takes ~2 days for approval
***
<center>![](Images/Karaage.png)</center>
***
## Required Programs
To connect to Spartan you will need a Secure Shell (SSH) and a Secure File Transfer Protocol (SFTP) client. The SSH client is your interface with Spartan while the SFTP client is used to transfer files from your local computer to your Spartan home directory.
### Windows Users
Use PuTTY as your SSH client. This is an easy set-up with the following five steps:
1. Set your host name: spartan.hpc.unimelb.edu.au
2. Set your port number: leave as default (whereas Boab users need a defined port)
3. Set your connection type: SSH
4. Name your session to make it easy for future log-ins: Whatever you like i.e. Spartan
5. Save your session details
***
<center>![](Images/PuTTY_steps.png)</center>
***
Your first log-in to Spartan via the SSH client creates your home directory on Spartan so it is important to do that before setting up your SFTP client.
Use WinSCP as your SFTP client. This is an easy process with the following six steps:
1. Set your file protocol: SFTP
2. Set your host name: spartan.hpc.unimelb.edu.au
3. Set your port number: leave as default (whereas Boab users need a defined port)
4. Enter your username
5. Enter your password (note, this will show more characters than you entered when saved)
6. Log-in
***
<center>![](Images/WinSCP_steps.png)</center>
***
Inside a WinSCP session you will have dual file explorer windows: your local machine (left) and your Spartan home directory (right). If you have not made your initial log-in to Spartan via your SSH client you will have a blank white screen in the right-hand window. Transferring files between the two directories is achieved with a simple drag-and-drop interface.
***
<center>![](Images/WinSCP_home.png)</center>
***
Your two clients are now set up and everything is ready to access Spartan. After set-up you don't need to deal directly with PuTTY as a separate program anymore as there is a button in the WinSCP toolbar to open a session.
***
<center>![](Images/WinSCP_PuTTY.png)</center>
***
### Mac/Linux Users
Unlike for Windows, OSX and (most) linux operating systems already have SSH installed and have a fully functional terminal built in.
So to SSH into the Spartan log-in node, you just need to open up a terminal (the `Terminal` application on a mac), and issue the command:
```{}
```
Replacing `myusername` with your username. Then enter your password (no characters will show, that's normal) and hit return. That should look something like this:
***
<center>![](Images/terminal_ssh_steps.png)</center>
***
As for Windows, your first log-in to Spartan via the SSH client creates your home directory on Spartan so it is important to do that before setting up your SFTP client.
SFTP is also installed on most OSX and linux versions, so you could transfer files to Spartan directly from the terminal (do `man sftp` in the terminal if you're interested in that). However it's normally easier to use a SFTP client with a graphical user interface. One nice option for OSX (that also works on Windows) is [Cyberduck](https://cyberduck.io). Once installed, you can add a new connection to Spartan with the following steps:
1. Clicking on the `+` in the lower-left corner
2. Set the connection type to SFTP
3. Give the connection a name
4. Set the server name: `spartan.hpc.unimelb.edu.au`
5. Set the port number to 22
6. Enter your username
***
<center>![](Images/cyberduck_sftp_setup.png)</center>
***
You can then close the connection details box, and double-click on the new connection to initiate it. This will pop up a file explorer like the image below, listing the files in your working directory on Spartan. You can drag and drop files between there and your computer.
***
<center>![](Images/cyberduck_session.png)</center>
***
## Log in to Spartan
Now that you have everything in place to access Spartan, open an SSH session (in PuTTY or Terminal). as before, this wont display characters as you type your password.
***
<center>![](Images/Spartan_log_in.png)</center>
***
The text that loads at the beginning gives you the usual university IT policy spiel, a couple of getting started prompts, a warning about the log-in node, and the obligatory 300 reference.
***
<center>![](Images/Spartan_300.png)</center>
***
## Help
Typing `man Spartan` loads the university's Spartan FAQ. Rather than loading the entire document it loads it a screen at a time and you have to navigate by keyboard commands. For now, you will only need these:
* `h`: help, shows you the keyboard shortcuts
* `q`: quit the document and go back to the Spartan interface
* `^` is used in place of the control/command button, so `^V` is Ctrl+V
* `Enter/Return` will let you advance by one line
* `^Y` will let you go back by one line
* `^V` will let you advance by one window
* `^B` will let you go back by one window
***
<center>![](Images/Spartan_key_shortcuts.png)</center>
***
If you want help with a particular function in Spartan you can type `man <function name>` (without < >). This works like `?` in R.
# Using Spartan
The log-in node on Spartan is a shared resource between all users and is only allocated the memory to handle log-ins and job submission. Do not run jobs in the log-in node or the admins will get upset and kill the job. There are two ways to get out of the log-in node and into dedicated compute nodes: `sinteractive` and `sbatch`.
## sinteractive
The `sinteractive` command will give you access to a compute node (as soon as available) where you can work interactively with your job. While you can submit an entire job in this method, that is better saved for `sbatch` and `sinteractive` used for testing/debugging. There are default settings for `sinteractive` which should be enough for most uses, but they can be modified if needed like so:
```{}
sinteractive --time=00:10:00 --nodes=1 --ntasks=1 --cpus-per-task=2
```
This will give you access to one node to perform one task with two processors for ten minutes (default values I think). Lets open a default node and explore some basic commands. Type `sinteractive`. This submits a request for a default compute node (1), and you have to wait for the resource to become available (2) before you can continue.
***
<center>![](Images/Spartan_sinteractive.png)</center>
***
The `ls` command will show you everything in your current directory. Folders show up in blue.
***
<center>![](Images/Spartan_ls.png)</center>
***
The `cd` command will let you change your current directory. As Spartan is Linux based it uses / not \ for directories (unlike Windows). The \ character is used to say that the preceding term is a special character. Lets change into a sub-folder and see what is inside. Spartan lets you pre-fill file paths/names using the tab key like R, but it will pre-fill everything until a choice needs to be made, and won't provide options at that stage, but will complete everything once no more divergences exist.
***
<center>![](Images/Spartan_cd.png)</center>
***
The programs available on Spartan as referred to as modules, and you can see a complete list using `module avail`.
***
<center>![](Images/Spartan_module_avail.png)</center>
***
Loading a particular module requires two commands: `module load <module name>` (to load the module into the node's environment) and `<module name>` (to start the program). To load R for example:
***
<center>![](Images/Spartan_module_load.png)</center>
***
## sbatch
The `sbatch` command is used for direct job submission to Spartan using the batch system called SLURM (Simple Linux Utility for Resource Management).
***
<center>![](Images/slurm.jpg)</center>
***
This system tracks resources throughout the cluster and builds a queue of jobs to run, and when resources are available the job scheduler will direct your job to a compute node to run.
***
<center>![](Images/Spartan_queue_example.png)</center>
***
To submit a job using the `sbatch` command you need to write a slurm file that sets up your instance (amount of memory, number of processors, etc) and runs your model. Note for Windows users: write these in Notepad++ as Windows and Linux use different line break notation (/r/n vs /n) and standard text files written on Windows won't run on Spartan.
***
<center>![](Images/Slurm_file_example.png)</center>
***
Slurm files can be as simple or as complex as required depending on the job you want to run. Slurm files always need to begin with `#!/bin/bash` on the first line. Why? It tells the shell what kind of interpretor to run (in this case bash). On subsequent lines you set up your instance first using `#SBATCH <your command here>` and then the commands to run your job. Some useful `#SBATCH` commands to consider are:
* `#SBATCH --job-name=<name>`: Lets you give your job a name alongside its job id. Useful if you're running lots of jobs at once. By default it sets it to your file name, or replace <name> with your name of choice like this:
```{}
#SBATCH --job-name=MyJob
```
* `#SBATCH -p <partition>`: This is where you select which partition on Spartan your job will run. For most jobs you will set your partition as `cloud` which can run single node jobs of up to 8 CPUs and up to 50GB of memory. For multi-node jobs or >50GB of memory set it as `physical` (up to 200GB), and if you have access to a dedicated partition then use `your partition name`. In most cases you will use the following:
```{}
#SBATCH -p cloud
```
* `#SBATCH --time=<>`: As Spartan is a communal resource and jobs are allocated a share from a queue you need to specify a maximum amount of walltime that you want your instance to remain open. As you aren't likely to know how long your model will need to run for (outside of a rough guess) it is recommended that you give a conservative estimate. If necessary you can contact Spartan support and get your time extended. There are multiple formats for entering a time value depending on the scale of your job: "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds". Many SLURM documentations will list setting `--time=0` as a way to set an indefinite walltime but this will automatically be rejected by Spartan. For example, a one hour instance could be called with the following:
```{}
#SBATCH --time=01:00:00 # hours:minutes:seconds format
```
* `#SBATCH --nodes=<number>`: You need to request an allocation of compute nodes. Most jobs will be single node jobs, but there is the ability to run jobs over multiple nodes that talk to each other. It is not recommended to try running multiple communicating nodes via the cloud partition, use the physical partition instead. To call a single node use the following:
```{}
#SBATCH --nodes=1
```
* `#SBATCH --ntasks=<number>`: This line informs the SLURM controller that job steps within the allocation will launch a maximum of *number* tasks and to provide sufficient resources. Most jobs will need to a perform a single task which can be set as follows:
```{}
#SBATCH --ntasks=1
```
* `#SBATCH --cpus-per-task=<number>`: This informs the SLURM controller that each task will need *number* of processors per task. Cloud nodes have access to eight cores each. To allocate four processors (like on a quad-core desktop without multi-threading) you would set:
```{}
#SBATCH --cpus-per-task=4
```
* `#SBATCH --mem=<number>`: This is where you nominate the maximum amount of memory required per node (in megabytes). Cloud nodes have access to up to 50GB of memory, physical nodes are used for large jobs of up to 200GB. To request 10GB of memory (remembering that 1GB = 1024MB) you would use:
```{}
#SBATCH --mem=10240
```
* `#SBATCH --mail`: The final group of useful `sbatch` commands that you would regularly use sets up email notifications of various events during job submission/running. There are two separate commands here: `--mail-user=<>` to set who gets notified, and `--mail-type=<>` to chooses what you get notified about. Some useful mail options include:
+ `BEGIN`: the model is out of the queue and started to run (includes start time and time in queue)
+ `END`: the model has completed (includes run-time)
+ `FAIL`: the model has failed (includes run-time)
+ `REQUEUE`: the model has been re-queued (i.e. someone with priority has had you booted off)
+ `ALL`: all of the above plus `STAGE_OUT`
+ `TIME_LIMIT_50`: reached 50% of your allocated time
+ `TIME_LIMIT_80`: reached 80% of your allocated time
+ `TIME_LIMIT_90`: reached 90% of your allocated time
* You can string multiple notifications types together by separating them with commas, so you could do something like the following:
```{}
#SBATCH --mail-user=<your email here>
#SBATCH --mail-type=ALL,TIME_LIMIT_90
```
After setting up your instance you then need to supply the instructions for what you want your job to do. This will often just be a change of directory, loading the module (program) you need, and calling the script you want to run. Calling the script you want to run follows a set format of commands: <module name> [options] [< infile] [> out-file].The first two of these are covered earlier, so lets look at complete example for R before looking at options in more details:
```{}
cd Coding\ Club/R/
module load R
R --vanilla < tutorial.R
```
As you can see this merges the second command in loading a module with loading a script. You can see the different options available for calling an R script using `man R` (after loading the module into the environment). Some useful options include:
* `--save`: Save workspace at the end of the session
* `--no-save`: Don't save workspace at the end of the session
* `--vanilla`: A wrapper around `--no-save` and a few other commands useful for starting R as a blank slate (no loading pre-saved objects, etc)
`< infile` is where you call the script you want to run e.g. `< myFile.R`, and `> outfile` presumably lets you set specific output files (it isn't compulsory, and I've never had reason to use it).
Now we can put all of this together to create our SLURM file:
```{r eval=FALSE}
#!/bin/bash
#SBATCH --job-name=Coding_Club_Example
#SBATCH -p cloud
#SBATCH --time=1:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=10240
#SBATCH --mail-user="[email protected]"
#SBATCH --mail-type=ALL
module load R
R --vanilla < tutorial.R
```
Assuming this Slurm file is named "MyFile.slurm", you can submit your job to the queue from the log-in node using the following command
```{}
sbatch MyFile.slurm
```
***
<center>![](Images/Spartan_sbatch.png)</center>
***
## Useful Auxiliary Commands
Now that we know how to submit jobs to Spartan either through `sinteractive` or `sbatch`, lets look at some other useful commands in Spartan. Spartan uses a Unix-based operating system so a lot of Unix xommands work here. These commands range from help functions, in built text-editors, checking job status, or job control. Some of these may have been used earlier in this document but are important enough for another mention:
* `man <function name>`: This is Spartan's help function. For example, `man Spartan` will call the Spartan FAQ, `man sbatch` will call the help file for the `sbatch` command, and `man R` will call the help file for running R in Spartan.
***
<center>![](Images/Spartan_man.png)</center>
***
* `pwd`: Print working directory
* `ls`: List all files/directories in current directory
* `cd <file path>`: Change current directory. `cd` sets you to your home directory, `cd -` goes to previous directory, `cd ..` goes back one step towards root
* `rm <filename>`: Delete a file
* `mkdir <folder name>`: Create a new folder in current directory
* `rmdir <folder name>`: Delete a folder (must be empty)
* `nano <filename>`: This is Spartan's in-built text editor. This can be used for on the fly alterations to a file (i.e. increase memory limit), but I use it most often for reading the slurm output files for checking why a model run failed.
***
<center>![](Images/Spartan_nano.png)</center>
***
* `head <filename>` and `tail <filename>`: Print the first/last ten lines of a file
* `history`: List commands you've used previously. You can navigate previous commands in your current session using the up/down keys like in R, but this lists previous commands over previous sessions as well
* `echo <text>`: Prints text. Useful for debugging
* `squeue`: This command is used for checking on job status.
+ `squeue`: This base command will show all jobs in the queue for all users
+ `squeue -u <username>`: This will show all jobs for a particular user
+ `squeue -u <username> -t RUNNING`: This will show all running jobs for a particular user (can also use `PENDING`)
***
<center>![](Images/Spartan_squeue.png)</center>
***
* `scancel`: This command is used for cancelling jobs in the queue. For example:
+ `scancel -u <username>`: cancels all jobs for that username
+ `scancel -u <jobid>`: cancels the job called by <jobid>
+ `scancel --name <JobName>`: cancels jobs by name
+ `scancel -t PENDING -u <username>`: This cancels all pending jobs for a particular user
* `sstat`: This command is used to show memory information of running jobs:
+ For non-admin users there is no need to specify a username as you are restricted to your own jobs only by default
+ You can use the `--format` option to specify the details you want to see. You can list multiple comma-separated fields to view more details
+ `sstat --helpformat` will list the available fields to view
+ Browsing `man sstat` will let you see descriptions for each field
+ For example: `sstat -j <jobid> --format JobID,NTasks,Nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize`
* `sacct`: This command is used to show memory information of completed jobs:
+ For non-admin users there is no need to specify a username as you are restricted to your own jobs only by default
+ You can use the `--format` option to specify the details you want to see. You can list multiple comma-separated fields to view more details
+ `sacct --helpformat` will list the available fields to view (many more options that `sstat`)
+ Browsing `man sacct` will let you see descriptions for each field
+ For example: `sacct -j <jobid> --format JobID,jobname,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize,ExitCode,CPUTime`
# Additional information and links
* There are simple example slurm files for the most common modules at `/usr/local/common/` (including R, MATLAB, and Python)
* [Online Spartan FAQ](https://dashboard.hpc.unimelb.edu.au/spartan-faq). Same as using `man spartan`
* [Online SLURM FAQ](http://www.ceci-hpc.be/slurm_faq.html)
* `srun` and `salloc` functions are used for parallel jobs
* [Support for multi-core/multi-threaded architecture](https://slurm.schedmd.com/mc_support.html)
* [Running R in parallel using the rslurm package](https://cran.r-project.org/web/packages/rslurm/vignettes/rslurm-vignette.html)
```{r eval=FALSE, echo=FALSE}
"GNU Terry Pratchett"
```
***
***