Author: Adam Reid, Head of Bioinformatics, Gurdon Institute, University of Cambridge Email: [email protected]
Some of this material has been remixed from training materials developed for the University of Cambridge Bioinformatics Training Facility course on High Performance Computing, licensed CC BY 4.0
- Teach users the basics of how to use the Gurdon Institute compute cluster (skynet)
- You will be able to access the cluster, move files to and from your own computer and the internet
- You will be able to submit jobs using the Slurm submission system
- You will be able to access Rstudio server
- You will be aware of suitable places for storing files
- Some Linux command-line knowledge
- A cluster account (contact Charles)
- Part 1 Introduction to the cluster
- Part 2 Submitting jobs
- Part 3 Rstudio, Jupyter lab and installing software
- Appendix
The terms cluster, compute cluster, HPC (high performance computing) and farm are often used interchangeably to mean the same thing - several computers connected together in a network. Each computer is referred to as a node in the network.
The main usage of HPC clusters is to run resource-intensive and/or parallel tasks.
For example: running thousands of simulations, each one taking several hours; assembling a genome from sequencing data, which requires computations on large volumes of data in memory; or mapping hundreds of RNA-seq samples to a reference.
These tasks would be extremely challenging to complete on a regular computer. However, they are just the kind of task that an HPC cluster excels at.
When working on a cluster it is important to understand what kinds of resources are available to us.
These are the main resources we need to consider:
- CPU (central processing unit) is the "brain" of the computer, performing a wide range of operations and calculations. CPUs can have several "cores", which means they can run tasks in parallel, increasing the throughput of calculations per second. A typical personal computer may have a CPU with 4-8 cores. A single compute node on the HPC may have 32-48 cores (and often these are faster than the CPUs in our own computers).
- RAM (random access memory) is quick-access storage where data is temporarily held while being processed by the CPU. A typical personal computer may have 8-32GB of RAM. A single compute node on an HPC may often have >100GB RAM.
- GPUs (graphics processing units) are similar to CPUs, but are more specialised in the types of operations they can perform. While less flexible than CPUs, each GPU can do thousands of calculations in parallel. This makes them extremely well suited to graphical tasks, but also more generally to matrix computations, so they are often used in machine learning applications.
Usually, HPC clusters are available to members of large institutions (such as universities or research institutes) or sometimes from cloud providers. This means that:
- There are many users, who may simultaneously be using the cluster.
- Each user may want to run several jobs concurrently.
- Often large volumes of data are being processed, so there is a need for high-performance storage (allowing fast reading and writing of files).
So, at any one time, across all the users, there might be many thousands of processes running on the cluster! There has to be a way to manage all this workload, and this is why HPC clusters are typically organised somewhat differently from what we might be used to when we work on our own computers.
Here is a schematic of a cluster; we go into its details in the following sections.
There are two types of nodes on a cluster:
- login nodes (also known as head or submit nodes).
- compute nodes (also known as worker nodes).
The login nodes are the computers that the user connects to and from which they interact with the cluster. Depending on the size of the cluster, there is often only one login node, but larger clusters may have several. Login nodes are used to interact with the filesystem (move around the directories), download and move files, edit and/or view text files, and perform other small routine tasks.
The compute nodes are the machines that will actually do the hard work of running jobs. These are often high-spec computers with many CPUs and high RAM (or powerful GPU cards), suitable for computationally demanding tasks. Often, there are several "flavours" of compute nodes on the same cluster. For example some compute nodes may have fewer CPUs but higher memory (suitable for memory-intensive tasks), while others may have the opposite (suitable for highly-parallelisable tasks).
Users do not have direct access to the compute nodes; instead they submit jobs via a job scheduler.
The filesystem on a HPC cluster often consists of storage partitions that are shared across all the nodes, including both the login and compute nodes. This means that data can be accessed from all the computers that compose the HPC cluster.
Although the filesystem organisation may differ depending on the institution, typical HPC servers often have two types of storage:
The user's home directory (e.g. /mnt/home3/user) is the default directory that one lands on when logging in to the HPC. This is often quite small and possibly backed up. The home directory can be used for storing things like configuration files or locally installed software.
A scratch space (e.g. /mnt/scratch/user), which is high-performance, large-scale storage. This type of storage may be private to the user or shared with a group. It is usually not backed up, so the user needs to ensure that important data are stored elsewhere. This is the main partition where data is processed.
At the Gurdon Institute we have:
- home3 (/mnt/home3/group/user): limited total space (1 TB), backed up. Useful for installing software and backing up results.
- scratch (/mnt/scratch/group/user): no quotas (~1 PB total), not backed up. Useful for running compute jobs and generating lots of intermediate files.
- Datastore (/mnt/Sequencing on the cluster, or smb://datastore.computing.gurdon.cam.ac.uk/Sequencing): where to find your Gurdon and CI sequencing data.
For backed up directories it is important to avoid large numbers of files and non-work related files as this slows down back-ups and reduces space available for others.
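To keep an eye on your usage, standard commands work on any node (a sketch; substitute your own group and user names):
# How much space is my home directory using?
du -sh /mnt/home3/<group>/<user>
# How full are the storage partitions overall?
df -h /mnt/home3 /mnt/scratch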
Sequencing data is kept in the Sequencing Datastore. This is accessible from your computer by mounting the samba share smb://datastore.computing.gurdon.cam.ac.uk, e.g. on a Mac using Finder -> Go -> Connect to Server and pasting in the link.
This Sequencing Datastore is also mounted on the compute cluster at /mnt/Sequencing.
For samples sequenced at CI, data is automatically downloaded to:
//Datastore/Sequencing/(GL Folder)/CI FASTQ/
on the samba share and /mnt/Sequencing/(GL Folder)/CI FASTQ/
on cb-milan1
(the FTP server is checked nightly at 03:00)
For samples sequenced at the Gurdon Institute, raw data is manually compressed and copied to:
//Datastore/Sequencing/(GL Folder)/Run Folder/
FASTQ files are manually copied to (Kay Harnish):
//Datastore/Sequencing/(GL Folder)/Basespace FASTQ/
Manually demultiplexed FASTQ files (CI or GI):
//Datastore/Sequencing/(GL Folder)/CB FASTQ/
Team workspaces are not visible to members of other teams, so to allow people to share data between the Bioinformatics group and other groups we have set up a shared space with unrestricted access. This means that anything can be deleted by anyone, so please be careful.
This can be accessed on the cluster here: /mnt/bioinfo_sharing/sharing/
or from your Mac using Finder: under the Go menu, select Connect to Server, type smb://cb-share1.gurdon.private.cam.ac.uk in the box and choose to connect as Guest.
If you are using a Mac, open Terminal (e.g. from Spotlight Search). If you are on a PC, open Command Prompt.
Then type:
ssh <user>@cb-milan1.gurdon.private.cam.ac.uk
For example, to copy a file from the cluster to my computer (run from my computer):
scp <user>@cb-milan1.gurdon.private.cam.ac.uk:/mnt/home3/reid/ajr236/graph.png .
To copy a file from my computer to the cluster (run from my computer):
scp graph.png <user>@cb-milan1.gurdon.private.cam.ac.uk:/mnt/home3/reid/ajr236/
To download a file directly from the internet to the cluster, use wget:
wget http://ftp.flybase.net/genomes/dmel/current/gtf/dmel-all-r6.43.gtf.gz
N.b. To run visualisations you will need to copy your files to your local computer
- ssh to the cluster
- Make a new directory in your homespace called tutorial and move into that directory
- Get the automated_gene_summaries.tsv.gz file from FlyBase using wget
- Copy the file from the cluster to your local machine using scp: open a new terminal and scp automated_gene_summaries.tsv.gz from @cb-milan1.gurdon.private.cam.ac.uk:
- Unzip the file on the command line using gunzip
- Check the file contents
- Get just the first ten lines and make a new file called head.txt
- Copy head.txt to the cluster with scp
HPC servers usually have job scheduling software that manages all the jobs that the users submit to be run on the compute nodes. This allows efficient usage of the compute resources (CPUs and RAM), and the user does not have to worry about affecting other people's jobs.
We discussed nodes and CPUs above. When submitting your jobs to a cluster there are several other useful terms to know:
- Job : a unit of work, with resource requests and steps to be run
- Thread : single process running on a core
- Queue/Partition : a group of nodes with particular characteristics e.g. resources, software
The job scheduler uses an algorithm to prioritise the jobs, weighing aspects such as:
- how much time did you request to run your job?
- how many resources (CPUs and RAM) do you need?
- how many other jobs have you got running at the moment?
Based on these, the algorithm will rank each of the jobs in the queue to decide on a "fair" way to prioritise them. Note that this priority dynamically changes all the time, as jobs are submitted or cancelled by the users, and depending on how long they have been in the queue. For example, a job requesting many resources may start with a low priority, but the longer it waits in the queue, the more its priority increases.
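If you are curious how your queued jobs have been ranked, Slurm can show its working (a sketch; the availability of these tools depends on how the cluster is configured):
# Show the priority factors computed for a pending job
sprio -j <jobid>
# Show the scheduler's estimated start times for your pending jobs
squeue --start -u <username>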
This can be thought of as being like managing tables at a restaurant: each job is seated where there is room for it, and larger parties may have to wait longer for a table.
Why not just run jobs on the head node?
- Running everything on the head node doesn't allow fair distribution of resources
- With the scheduler, you know that your job will get the resources it needs
- It helps with record-keeping and troubleshooting
- Jobs run in the background (go for a coffee!)
- More power – faster runs
- Parallelisation!!!
When should I use the head nodes?
- Editing text files
- Basic UNIX commands – mv, cp, rm etc.
- Managing data
- Short, low-memory processes
- Testing – with simple example data for fast turn around
- For submitting jobs
- Monitoring jobs
- Post processing
On the Gurdon cluster (and on the University of Cambridge HPC cluster) the job scheduler is Slurm.
There are several Slurm programs which allow submission and management of jobs:
sinfo will tell you about the partitions and what resources are available. The node states it reports include:
- down = not in use
- drng/drain = removed by administrator
- mix = some CPUs allocated, some idle
- idle = available
- alloc = node in use
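For illustration, sinfo output looks something like this (the node names and counts below are invented to match the examples in this document, not real output):
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
1804*        up 7-00:00:00      4    mix node[01-04]
1804*        up 7-00:00:00      2   idle node[05-06]
2004         up 7-00:00:00      1  alloc node20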
squeue lists the jobs currently queued or running. To see only your own jobs:
squeue -u <username>
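The output looks something like this (the job details here are invented for the example; in the ST column, R means running and PD means pending):
$ squeue -u <username>
  JOBID PARTITION  NAME    USER ST  TIME NODES NODELIST(REASON)
 822393      1804   job  ajr236  R  0:42     1 node20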
To get detailed parameters about a job:
sacct --format jobname,account,state,AllocCPUs,reqmem,maxrss,averss,elapsed -j <jobid>
To get parameters about a finished job, such as how long it took, how much memory it used and whether it failed, use:
seff <jobid>
This will give you output such as the following (along with some warnings from Perl, which you can ignore):
Job ID: 822393
Cluster: skynet
User/Group: /reid
State: FAILED (exit code 2)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:00:00 core-walltime
Job Wall-clock time: 00:00:00
Memory Utilized: 1.14 MB
Memory Efficiency: 0.01% of 7.81 GB
Memory Utilized is particularly useful for determining how much memory to ask for in the future when running similar jobs.
To kill a job:
scancel <jobid>
- srun : interactive job submission, typically for MPI jobs – "proper" parallel computing
- salloc : interactive shell with job allocation – a bit like logging into a virtual machine
- sbatch : submits jobs for non-interactive execution; suits "embarrassingly" parallel computing, i.e. most bioinformatics; requires writing a bash script
- slurm_sub.py : a custom script which makes running sbatch easy and has added reporting
sbatch scripts have several advantages:
- They allow multiple steps to be run in a single job
- Ideal for small bioinformatics workflows
- Submitted in the background and allows for good reporting
- Set up scheduling parameter templates rather than typing them at execution
#!/bin/bash
#SBATCH --job-name=parallel_job # Job name
#SBATCH --mail-type=END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH [email protected] # Where to send mail
#SBATCH --nodes=1 # Run all processes on a single node
#SBATCH --ntasks=1 # Run a single task
#SBATCH --cpus-per-task=4 # Number of CPU cores per task
#SBATCH --mem=1gb # Job memory request
#SBATCH --time=00:05:00 # Time limit hrs:min:sec
#SBATCH --output=parallel_%j.log # Standard output and error log
pwd; hostname; date
echo "Running sleep program on $SLURM_CPUS_ON_NODE CPU cores"
sleep 10
date
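To run this script, save it to a file (the name parallel_job.sh below is just an example) and submit it with sbatch:
# Submit the batch script; Slurm replies with e.g. "Submitted batch job 822393"
sbatch parallel_job.sh
# Once the job runs, stdout and stderr appear in parallel_<jobid>.log, per --output above
cat parallel_822393.log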
slurm_sub.py is a custom script which makes submitting sbatch jobs easier. It is located at /mnt/home3/slurm/slurm_sub.py. Simple things can be run very simply, e.g. slurm_sub.py sleep 10
- STDOUT goes to job.o
- STDERR goes to job.e
- The batch script is job.sbatch.sh
- -j to change job name from ‘job’
- -p, -n, -N, -m parameters just like sbatch
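Putting the options together, a typical call looks like this (the job name and command are illustrative, mirroring the exercises later on):
# Run a command under a custom job name instead of the default 'job'
slurm_sub.py -j align1 bwa mem -o sample1.sam dm6_chrM.fa sample1_1.fastq sample1_2.fastq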
sbatch parameter | Flag | Default value |
---|---|---|
Job name | -J or --job-name | <script name> |
Partition/queue | -p or --partition | 1804 |
Number of tasks | -n or --ntasks | - |
Number of nodes | -N or --nodes | 1 |
Standard output | -o | slurm-<jobid>.out |
Standard error | -e | slurm-<jobid>.out |
Mailing | --mail-type [NONE, BEGIN, END, FAIL, ALL] and --mail-user | (none) |
RAM | --mem | 4000Mb |
If you don't specify resource requirements you will get the following resources by default:
- 7 days of running time (equivalent to -t 7-00:00:00)
- 1804 partition, with Ubuntu 18.04 operating system
- 1 CPU (equivalent to -c 1)
- 4GB RAM (equivalent to --mem=4000)
Choosing your parameters can be tricky, because you probably don't know how much you need. It is always sensible to do a test run to see how much RAM, how much time and how many CPUs are required. You can run seff once the test has finished to see what was used, and then extrapolate to your full dataset.
Often, HPC servers have different types of compute node setups (e.g. queues for fast jobs, or long jobs, or high-memory jobs, etc.). SLURM calls these "partitions" and you can use the -p option to choose which partition your job runs on. Usually, which partitions are available on your HPC should be provided by the admins.
It's worth keeping in mind that partitions have separate queues, and you should always try to choose the partition that is most suited to your job.
We have 1804 which runs Ubuntu 18.04 and 2004 which runs Ubuntu 20.04. 2004 uses more recent machines which have more resources.
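For example, to override the defaults and target the newer partition (a sketch; my_script.sh is a placeholder for your own batch script):
# 8 CPUs, 16 GB RAM, 12 hours, on the 2004 partition
sbatch -p 2004 -c 8 --mem=16000 -t 12:00:00 my_script.sh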
If we run the program hostname on the head node, we get the head node's name. If we submit hostname to the cluster with srun, we get the name of the compute node the job ran on. If we submit sleep 100 to the cluster as a batch job using slurm_sub.py, the prompt returns immediately while the job runs in the background.
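Illustratively (the original showed screenshots; the hostnames below are examples consistent with this document, not real output):
$ hostname
cb-milan1
$ srun hostname
node05
$ slurm_sub.py sleep 100
# The prompt returns immediately; when the job finishes, its stdout and stderr
# can be found in job.o and job.e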
- Use srun to submit the command hostname – which node did the job run on?
- Adjust the parameters of srun to run the same command multiple times (-n) – did they run on the same or different nodes?
- Use slurm_sub.py to submit the command sleep 100. Use squeue -u <user> to find out which node it is running on. What were the output status and error messages from the job? What time did the job start and when did it finish?
- Run the sleep 100 command again, but kill it before it finishes. What was the output/error status of the job this time?
- Copy the files sample1_1.fastq, sample1_2.fastq and dm6_chrM.fa from /mnt/bioinfo_sharing/sharing/course_material/cluster/ to your tutorial directory. Map the data to the reference fasta dm6_chrM.fa with the following commands, running each job using slurm_sub.py. Do not copy and paste the commands – they may fail due to hidden characters.
slurm_sub.py -j index1 bwa index dm6_chrM.fa
slurm_sub.py -j mem1 bwa mem -o sample1.sam dm6_chrM.fa sample1_1.fastq sample1_2.fastq
slurm_sub.py -j view1 samtools view -b -o sample1.bam sample1.sam
slurm_sub.py -j sort1 samtools sort -o sample1_sort.bam sample1.bam
slurm_sub.py -j index2 samtools index sample1_sort.bam
- Try viewing the mapped reads in IGV on your laptop, or use samtools view sample1_sort.bam | less on the cluster. Always check your output!
BONUS EXERCISE: Write a shell script which captures the steps and runs the whole thing. Then run on several sets of fastq files.
Different head nodes have different versions of R installed and these can each be accessed on your laptop from different URLs. cb-head3 and cb-head4 can be considered legacy nodes which will be removed as the new cluster develops. Older R versions will be maintained by virtualisation.
R v4.2.0 on cb-milan1 http://cb-milan1.gurdon.private.cam.ac.uk:8787/
R v4.1.0 on cb-head4 http://cb-head4.gurdon.private.cam.ac.uk:8787/
R v3.5.2 on cb-head3 http://cb-head3.gurdon.private.cam.ac.uk:8787/
When you install an R package on cb-milan1 it will not be available from R on other nodes; each node is kept separate. Your own R package installs are kept in e.g. ~/R/x86_64-pc-linux-gnu-library/4.2/, with a different folder for each R version.
If you install a more up-to-date version than is available centrally (in /usr/local/lib/R/site-library), then both will be available in the “packages” menu.
Default loading e.g. “library(Seurat)” will be to your local version.
You may want to run Jupyter lab (or notebook) on the cluster so you can, for instance, analyse scRNA-seq data using python and scanpy.
Login to the cluster:
ssh <user>@cb-milan1.gurdon.private.cam.ac.uk
Start a screen:
screen -S jupyterlab
Allocate resources on slurm:
salloc -p 1804 -c 2 --mem=50G
Submit an interactive job to slurm:
srun --pty bash
Activate a conda environment if needed:
conda activate scanpy
Note which node the job is running on. Then start Jupyter lab/notebook on a specific port, e.g. port 8181. N.b. when running Jupyter notebook you may need to run export XDG_RUNTIME_DIR="" first.
jupyter lab --no-browser --ip 0.0.0.0 --port 8181
Copy the link that shows up in your screen - something like:
http://127.0.0.1:8181/lab?token=53da29195682403ce73074170f3a260b80c36004571a1b89
Start a new terminal on your laptop/desktop and ssh again, this time with port forwarding, substituting <node> with the actual node your job is running on. Port 8181 on the local machine will be connected by a tunnel to port 8181 on the cluster node (which we specified above).
ssh <user>@cb-milan1.gurdon.private.cam.ac.uk -L 8181:<node>:8181
Then paste the Jupyter lab URL you copied into your browser - tada!
Kill the screen - when everything is done, the screen will hang around unless you kill it (n.b. this kills all your screens):
pkill screen
- You can install software and run it from your home directory (you may want to add paths to your .bashrc file)
- You can use conda for R/python packages (local but allows multiple environments and “easy” installation)
- If something might be generally useful and/or is hard (or impossible without root access) to install, you can ask Charles to install it centrally.
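For example, to create and use a conda environment like the scanpy one referenced earlier (a sketch; the channel to install from may vary on our cluster):
# Create an environment named "scanpy" with the scanpy package from conda-forge
conda create -n scanpy -c conda-forge scanpy
# Activate it before use, e.g. inside an interactive job
conda activate scanpy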
Linux/Unix - http://www.mathcs.emory.edu/~valerie/courses/fall10/155/resources/unix_cheatsheet.html
Slurm - https://www.chpc.utah.edu/presentations/SlurmCheatsheet.pdf
Quick reference | Details |
---|---|
Logging in | ssh <user>@cb-milan1.gurdon.private.cam.ac.uk |
Scratch | /mnt/scratch – no quotas (~1 PB total); for running compute jobs |
home3 | /mnt/home3 – limited space (1 TB); for installing software, backing up results, Conda environments |
Datastore | /mnt/Sequencing (from head node only), or smb://datastore.computing.gurdon.cam.ac.uk/Sequencing mounted locally on your Mac or PC |
RStudio server | v4.2.0 http://cb-milan1.gurdon.private.cam.ac.uk:8787/; v4.1.0 http://cb-head4.gurdon.private.cam.ac.uk:8787/; v3.5.2 http://cb-head3.gurdon.private.cam.ac.uk:8787/ |
# SSH to the cluster
ssh <user>@cb-milan1.gurdon.private.cam.ac.uk
# Make a new working directory
mkdir tutorial
# Change to that directory
cd tutorial
# Get interesting file from the internet
wget http://ftp.flybase.org/releases/FB2021_06/precomputed_files/genes/automated_gene_summaries.tsv.gz
# From local machine, copy the file from the cluster to that machine
scp <user>@cb-milan1.gurdon.private.cam.ac.uk:~/tutorial/automated_gene_summaries.tsv.gz .
# Unzip the file
gunzip automated_gene_summaries.tsv.gz
# Make a new file with just the first 10 lines in it
head -10 automated_gene_summaries.tsv > head.txt
# From local machine, copy the new file to the working directory on the cluster
scp head.txt <user>@cb-milan1.gurdon.private.cam.ac.uk:~/tutorial/
- Use srun to submit the command hostname – which node did the job run on?
srun hostname
- Adjust the parameters of srun to run the same command multiple times (-n) – did they run on the same or different nodes?
srun --ntasks 2 hostname
- Use slurm_sub.py to submit the command sleep 100. Use squeue -u <user> to find out which node it is running on. What were the output status and error messages from the job? What time did the job start and when did it finish?
# submit the sleep command using slurm_sub.py
slurm_sub.py sleep 100
# Check that the job is running and on which node
squeue -u <user>
# When the job is finished, look at the start and finish times to work out how long it took
# Also look to see if the job finished properly e.g. “Job finished successfully, return value: 0”
cat job.o
# Error file should be empty
cat job.e
- Run the sleep 100 command again, but kill it before it finishes. What was the output/error status of the job this time?
# submit the sleep command using slurm_sub.py
slurm_sub.py sleep 100
# Kill the job (get the job id from the message shown upon submission, or from squeue)
scancel <jobid>
# There is no status message in stdout
cat job.o
# Error file should say something like “slurmstepd: *** JOB 5010440 ON node20 CANCELLED AT 2022-01-31T15:49:14 ***”
cat job.e
- Copy the files sample1_1.fastq, sample1_2.fastq and dm6_chrM.fa from /mnt/bioinfo_sharing/sharing/course_material/cluster/ to your tutorial directory. Map the data to the reference fasta dm6_chrM.fa with the following commands, running each job using slurm_sub.py
# Copy files
cp /mnt/bioinfo_sharing/sharing/course_material/cluster/sample1_*.fastq .
cp /mnt/bioinfo_sharing/sharing/course_material/cluster/dm6_chrM.fa .
# Make an index of the reference sequence
bwa index dm6_chrM.fa
# Run the mapping step
slurm_sub.py -j bwamem1 bwa mem dm6_chrM.fa sample1_1.fastq sample1_2.fastq -o sample1.sam
# Convert sam to bam
slurm_sub.py -j sam2bam1 samtools view -b sample1.sam -o sample1.bam
# Sort the bam file
slurm_sub.py -j sort1 samtools sort -o sample1_sorted.bam sample1.bam
# Index the bam file
slurm_sub.py -j index1 samtools index sample1_sorted.bam
map_batch.sh:
#!/bin/bash
sample_name='sample2'
# Map the reads against the reference (kept in the parent directory here)
bwa mem ../dm6_chrM.fa ${sample_name}_1.fastq ${sample_name}_2.fastq -o ${sample_name}.sam
# Convert SAM to BAM
samtools view -b ${sample_name}.sam -o ${sample_name}.bam
# Sort the BAM file (braces are needed: $sample_name_sorted would be read as one unset variable)
samtools sort -o ${sample_name}_sorted.bam ${sample_name}.bam
# Index the sorted BAM file
samtools index ${sample_name}_sorted.bam
slurm_sub.py -j sample2_batch bash map_batch.sh
Or
sbatch -J sample2_batch map_batch.sh
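For the bonus exercise, one approach (a sketch, assuming paired files named <sample>_1.fastq and <sample>_2.fastq in the current directory) is to change the first line of map_batch.sh to take the sample name as an argument, then submit one job per sample:
# In map_batch.sh, replace the hard-coded name with the first script argument:
sample_name=$1
# Then submit one job per sample, deriving each name from its _1.fastq file
for fq in *_1.fastq; do
    sample=${fq%_1.fastq}
    slurm_sub.py -j ${sample}_batch bash map_batch.sh $sample
done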
https://ubuntu.com/tutorials/command-line-for-beginners#1-overview
https://www.futurelearn.com/courses/linux-for-bioinformatics
http://www.ee.surrey.ac.uk/Teaching/Unix/
https://swcarpentry.github.io/shell-novice/
An actual command line interface on the web for practicing
https://training.csx.cam.ac.uk/bioinformatics/course/bioinfo-introhpc
http://bioinformatics-core-shared-training.github.io/shell-novice/
https://bioinformatics-core-shared-training.github.io/Managing-your-research-data/
https://github.com/bioinformatics-core-shared-training/nextflow_september_2021
http://bioinformatics-core-shared-training.github.io/hpc/
Licensed CC BY 4.0