
Update Spartan guides
Doi90 committed Aug 19, 2020
1 parent 8173d7c commit dc2f26d
Showing 27 changed files with 1,699 additions and 1,251 deletions.
25 changes: 11 additions & 14 deletions 03-Spartan_Introduction.Rmd
@@ -9,16 +9,13 @@

***

Spartan is the University of Melbourne's high performance computing system (HPC). It combines high-performance bare-metal compute nodes (somewhere on campus) with cloud instances from the NeCTAR Research Cloud and attached Research Data Storage Services (RDSS).

It is designed to suit the needs of researchers whose desktop/laptop is not up to the particular task. Models running slow, datasets are too big, not enough cores, application licensing issues, etc.
Spartan is the University of Melbourne's high performance computing (HPC) system. It is designed to suit the needs of researchers whose desktop or laptop is not up to a particular task: models running too slowly, datasets that are too big, not enough cores, application licensing issues, and so on.

Spartan consists of:

* a management node for system administrators,
* two log in nodes for users to connect to the system and submit jobs,
* 'bare metal' compute nodes,
* cloud compute nodes, and
* 'bare metal' compute nodes, and
* GPGPU compute nodes.

## Accessing Spartan
@@ -202,7 +199,7 @@ The programs available on Spartan are referred to as modules, and you can see a c

***

A massive list of all modules isn't that useful! We can return more targetted lists either by adding a keyword to `module avail` or by using `module spider`. `module avail` searches for the keyword in the "full" module name so something like `module avail R` will return every module with the letter *r* in it so use something more specific like `module avail R/3.5` instead to return all versions of the R module that are version 3.5.*. `module spider` is a fuzzy match to the "actual" name of the module so `module spider R` will return all modules that are the closest match to the keyword, and a list of other module "actual" names that might also match.
A massive list of all modules isn't that useful! We can return more targeted lists either by adding a keyword to `module avail` or by using `module spider`. `module avail` matches the keyword anywhere in the "full" module name, so something like `module avail r` will return every module with the letter *r* in it; use something more specific, such as `module avail r/3.6`, to return only the 3.6.* versions of the R module. `module spider` is a fuzzy match against the "actual" name of the module, so `module spider r` will return the modules that most closely match the keyword, along with a list of other module "actual" names that might also match.
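For example, the two search styles described above look like this on the command line (the exact modules returned will depend on what is installed on Spartan at the time):

```{}
# Keyword search within the full module name: list the R 3.6.* builds
module avail r/3.6

# Fuzzy search against the "actual" module name
module spider r
```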

***
<center>![](Images/Spartan_module_avail_R.png)</center>
@@ -242,7 +239,7 @@ To submit a job using the `sbatch` command you need to write a `slurm` script th
# To give your job a name, replace "MyJob" with an appropriate name
#SBATCH --job-name=Rsample
#SBATCH -p cloud
#SBATCH -p physical
# For R need to run on single CPU
#SBATCH --ntasks=1
@@ -255,7 +252,7 @@ To submit a job using the `sbatch` command you need to write a `slurm` script th
#SBATCH --mail-type=ALL
# Load the environment variables for R
module load R/3.5.0-spartan_gcc-6.2.0
module load r/3.6.0
# The command to actually run the job
R --vanilla < tutorial.R
@@ -269,10 +266,10 @@ R --vanilla < tutorial.R
#SBATCH --job-name=MyJob
```

* `#SBATCH -p <partition>`: This is where you select which partition on Spartan your job will run. The `cloud` partition can only run single node jobs of up to 12 CPUs and 100GB of memory. The `physical` partition can run single or multi-node jobs of up to 12CPUS and 250GB of memory. There are also other specialty partitions with larger requirements (up to 1500GB of memory) or GPUs as well. If you have access to a dedicated partition then use `your partition name`. In most cases you will use the following:
* `#SBATCH -p <partition>`: This is where you select which partition on Spartan your job will run on. By default you only have access to the `physical` partition, which can run single- or multi-node jobs of up to 72 CPUs and 1500GB of memory (although getting access to that amount of resources in a single job will take time). There are also specialty partitions for jobs with larger requirements or for GPUs (see the sketch after the example below). If you have access to a dedicated partition then use its name instead. In most cases you will use the following:

```{}
#SBATCH -p cloud
#SBATCH -p physical
```
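For GPU work, a request might look like the sketch below. The partition name is illustrative (check Spartan's documentation for the GPU partitions you have access to); `--gres` is the standard SLURM directive for requesting generic resources such as GPUs:

```{}
#SBATCH -p gpgpu        # illustrative GPU partition name; confirm against Spartan's documentation
#SBATCH --gres=gpu:1    # request one GPU per node
```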

* `#SBATCH --time=<>`: As Spartan is a communal resource and jobs are allocated a share from a queue, you need to specify the maximum amount of walltime your job is allowed to run for. As you aren't likely to know exactly how long your model will take (beyond a rough guess), it is recommended that you give a conservative estimate. If necessary you can contact Spartan support and have your time extended. There are multiple formats for entering a time value depending on the scale of your job: "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds". Much SLURM documentation lists setting `--time=0` as a way to request an indefinite walltime, but Spartan will automatically reject this. For example, a one hour job could be requested with the following (the day-based formats are shown after this example):
@@ -281,7 +278,7 @@
#SBATCH --time=01:00:00 # hours:minutes:seconds format
```
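For longer jobs the day-based formats can be more readable; for example, a limit of three and a half days could be written in either of these equivalent forms:

```{}
#SBATCH --time=3-12           # days-hours format
#SBATCH --time=3-12:00:00     # days-hours:minutes:seconds format
```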

* `#SBATCH --nodes=<number>`: You need to request an allocation of compute nodes. Most jobs will be single node jobs, but there is the ability to run jobs over multiple nodes that talk to each other. It is not recommended to try running multiple communicating nodes via the cloud partition, use the physical partition instead. Multi-node jobs will require using `OpenMPI` to allow the different nodes to communicate. To call a single node use the following:
* `#SBATCH --nodes=<number>`: You need to request an allocation of compute nodes. Most jobs will be single-node jobs, but jobs can also be run across multiple nodes that communicate with each other. Multi-node jobs require `OpenMPI` to allow the different nodes to communicate. To request a single node use the following:

```{}
#SBATCH --nodes=1
@@ -299,7 +296,7 @@
#SBATCH --cpus-per-task=4
```

* `#SBATCH --mem=<number>`: This is where you nominate the maximum amount of memory required per node (in megabytes). Cloud nodes have access to up to 100GB of memory, standard physical nodes are used for large jobs of up to 250GB. Some physical nodes and other specialist partitions will have much larger limits (up to 1500GB). To request 10GB of memory (remembering that 1GB = 1024MB) you would use:
* `#SBATCH --mem=<number>`: This is where you nominate the maximum amount of memory required per node (in megabytes). Physical nodes have up to 1500GB of memory. To request 10GB of memory (remembering that 1GB = 1024MB) you would use:

```{}
#SBATCH --mem=10240
@@ -336,7 +333,7 @@ Now we can put all of this together to create our SLURM file:
#!/bin/bash
#SBATCH --job-name=Coding_Club_Example
#SBATCH -p cloud
#SBATCH -p physical
#SBATCH --time=1:00:00
@@ -349,7 +346,7 @@ Now we can put all of this together to create our SLURM file:
#SBATCH --mail-user="[email protected]"
#SBATCH --mail-type=ALL
module load R/3.5.0-GCC-6.2.0
module load r/3.6.0
Rscript --vanilla tutorial.R
```
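Once the script is saved (assumed here to be called `tutorial.slurm`; the filename is just illustrative), it can be submitted from a login node and monitored with the standard SLURM commands:

```{}
# Submit the job to the queue
sbatch tutorial.slurm

# Check the status of your queued and running jobs (replace your_username)
squeue -u your_username
```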
2 changes: 1 addition & 1 deletion 04-Install_R_Packages_on_Spartan.Rmd
@@ -169,7 +169,7 @@ export ftp_proxy=$http_proxy
## Open R session
module load R/3.5.0-GCC-6.2.0
module load r/3.6.0
R
```
20 changes: 10 additions & 10 deletions 07-Spartan_Batch_Submission.Rmd
@@ -125,7 +125,7 @@ Putting it together the whole script will look something like this for an `R` sc
#
#SBATCH --ntasks=1
#
#SBATCH -p cloud
#SBATCH -p physical
#
#SBATCH --mem=10000
#
@@ -139,7 +139,7 @@ j=$2
module purge
module load R/3.5.0-GCC-6.2.0
module load r/3.6.0
cd directory_path
@@ -216,7 +216,7 @@ Example *batch submission script*: `batch_submission.slurm`
for simulation in {1..300}
do
sbatch /data/cephfs/[project_id]/scripts/slurm/job_submission.slurm $simulation
sbatch /data/gpfs/projects/[project_id]/scripts/slurm/job_submission.slurm $simulation
done
```
@@ -230,7 +230,7 @@ Example *job submission script*: `job_submission.slurm`
#
#SBATCH --ntasks=1
#
#SBATCH -p cloud
#SBATCH -p physical
#
#SBATCH --mem=10000
#
@@ -243,9 +243,9 @@ simulation=$1
module purge
module load R/3.5.0-GCC-6.2.0
module load r/3.6.0
cd /data/cephfs/[project_id]
cd /data/gpfs/projects/[project_id]
Rscript --vanilla scripts/R/script.R $simulation
```
@@ -291,7 +291,7 @@ do
for growth_rate in {1..5}
do
sbatch /data/cephfs/[project_id]/scripts/slurm/job_submission.slurm $pop_start_size $growth_rate
sbatch /data/gpfs/projects/[project_id]/scripts/slurm/job_submission.slurm $pop_start_size $growth_rate
done
done
@@ -306,7 +306,7 @@ Example *job submission script*: `job_submission.slurm`
#
#SBATCH --ntasks=1
#
#SBATCH -p cloud
#SBATCH -p physical
#
#SBATCH --mem=10000
#
@@ -320,9 +320,9 @@ growth_rate=$2
module purge
module load R/3.5.0-GCC-6.2.0
module load r/3.6.0
cd /data/cephfs/[project_id]
cd /data/gpfs/projects/[project_id]
Rscript --vanilla scripts/R/script.R $pop_start_size $growth_rate
```
