Data analysis project on GenomeDK

This repository serves as a walk-through to begin a data analysis project on the GenomeDK cluster, be that a PiB, BSc, MSc, or Ph.D. project. For more seasoned users, it can serve as a showcase of best practices. Some basic experience executing commands in a "terminal" is assumed.

The cluster is a very large collection of computers with a shared file system. It does not have a screen and a keyboard you can go use. However, by connecting to the cluster from your computer, you can create and edit files much like if they were on your machine. The goal here is to take you through all the steps required to make this possible.

NB: This is not a leisure read. Make sure you follow it to the letter, doing each step in order.

Below you will see something like <cluster user name> or <project folder> a lot. Whenever you see something like that, you should replace the whole thing, including < and >, with whatever it says. E.g., if your cluster user name is preben, you should replace <cluster user name> with preben.

Before you can begin, you need access to the cluster. The cluster is called GenomeDK and has its own website with lots of information and documentation. To get an account on the cluster, you must request one here. Below, username will represent your user name.

On the cluster, you have a home folder that only you can access. That is where you end up when you log in. Collaborative projects, or projects that use or generate a lot of data, belong in project folders. If you do a project, we will set up a dedicated project folder for this.

Your project is a research project. Even if it is meant for training, think of it as training reproducibility too. Your efforts, experiences, results, and developed expertise are valuable to all who come after you. So do not make your project a dead end; make it a shoulder to stand on.

Reproducibility basically means that anyone should be able to do exactly what you did to exactly reproduce your results. They should be able to get the same results in terms of numbers, plots, etc. This requires care and attention to detail, but it is not difficult. This page will help you get set up to do reproducible research.

Setting up your own machine

Before we get to the cluster, we need to get you properly set up on your own machine.

Install Python

If you have not done so already, you should install a distribution of Python called Anaconda. Anaconda is not only an easy way of installing Python on Windows, Mac, and Linux; it also comes with the conda package management system (see below). To install Anaconda, head to this page. When the download has been completed, you must follow the default installation.

The Terminal

Most of the programs we will use in this course are command-line applications. I.e., programs that are executed by writing their name and any arguments in a terminal rather than clicking on an icon and using a graphical user interface. There are many different programs that can serve as a terminal. If you have a Windows machine, you must use the Anaconda Powershell Prompt (not the Anaconda Prompt and not the CMD). You installed Anaconda Powershell Prompt along with Anaconda Python. If you have a Mac, the terminal you will use is called Terminal. The Terminal application is pre-installed on Mac. So from now on, whenever I refer to the terminal, I mean Anaconda Powershell Prompt on Windows and Terminal on Mac. We will assume some familiarity with using a terminal and with executing commands on the command line. If you have not used a terminal before, or if you are a bit rusty, you should run through this primer before you go on.

Conda environments

You need to install packages and programs for use in your analyses and pipelines. Sometimes, however, the versions of packages you need for one project conflict with the versions you need for other projects that you work on in parallel. Such conflicts seem like an unsolvable problem. Would it not be fantastic if you could create a small world insulated from the rest of your Anaconda installation? Then that small world would only contain the packages you needed for a single project. If each project had its own isolated world, then there would be no such conflicts. Fortunately, there is a tool that lets you do just that, and its name is Conda. The small worlds that Conda creates are called "environments," and you can create as many as you like. You can then switch between them as you switch between your bioinformatics projects. Conda also downloads and installs the packages for you, and it makes sure that the packages you install in each environment are compatible. It even makes sure that packages needed by packages (dependencies) are installed too. By creating an environment for each project, the libraries installed for each project do not interfere.

Create an environment on your local machine

When you install Anaconda or Miniconda, Conda makes a single base environment for you. It is called "base," and this is why it says "(base)" at your terminal prompt. You need a conda environment for your project on both your local machine and on the cluster. Let us call both of them 'bircproject' (you can call it anything you like).

The environment on your local machine does not need a lot of packages since it mainly serves to let you connect to the cluster. This creates the environment and installs slurm-jupyter from my conda channel:

conda create -n bircproject -c kaspermunch slurm-jupyter

Say yes (press Enter) when asked to install packages.

VPN

To be able to connect to the cluster, you need to be on the AU internal network. You can do that by either physically being on campus or by connecting to the AU network using VPN. To install VPN, use the instructions on this page. Before you can use the VPN, you also need to enable two-step verification. You can see how to do that on the same page. If you are not physically on campus, you need to activate your VPN before you can log in to the cluster. Your password for VPN is the same one you use to log on to Blackboard.

Connecting to the cluster using ssh

SSH is short for secure shell. A shell is the software that lets you run commands in your terminal window. The secure shell (SSH) lets you securely log in to another computer so you can navigate the folders and run commands on that machine. So when you open your terminal window, your commands run on your local machine, but when you "ssh" (yes, it is a verb too) into the cluster, your commands now run on the cluster. Before you go on, try to run the command hostname in your terminal. You can see that it prints something that tells you that you are on your own computer.

You connect to the cluster from the terminal by executing this command (remember to replace <cluster user name> with your actual cluster user name):

ssh <cluster user name>@login.genome.au.dk

When you do, you are prompted for the password for your cluster username. Enter that and press enter. You are now in your home folder on the cluster. Your terminal looks the same as before, but it will print:

  _____                                ______ _   __
 |  __ \                               |  _  \ | / /
 | |  \/ ___ _ __   ___  _ __ ___   ___| | | | |/ /
 | | __ / _ \ '_ \ / _ \| '_ ` _ \ / _ \ | | |    \
 | |_\ \  __/ | | | (_) | | | | | |  __/ |/ /| |\  \
  \____/\___|_| |_|\___/|_| |_| |_|\___|___/ \_| \_/

If you run the hostname command again, you can see that you are on fe-open-01. Now log out of the cluster again. You do that using the exit command or by pressing Ctrl-d. Now you are back on your own machine. Try hostname again and see what your own machine is called.

You will need to log in to the cluster many many times, so you should set up your ssh connection to the cluster so you can connect securely without typing the password every time. You do not need to know how this works, but if you are interested, here is roughly how:

First, you have to understand what public/private encryption keys are. A private key is a very long, random sequence of bits. It is kept secret and never leaves your own machine. A public key is another string of bits that is derived from the private key. You can generate the public key from the private key, but you cannot get the private key from the public key: it is a one-way process. Using the public key, anyone can encrypt a message, and it will only be possible to decrypt it using the private key. In other words, anyone with your public key can send you encrypted messages that only you will be able to read. So, if the cluster has your public key saved, it can authenticate you roughly like this: the cluster sends your machine a message that is encrypted using your public key. Your machine then decrypts the message using its private key and sends it back. If the cluster agrees it is decrypted correctly, it logs you in.
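The encrypt-with-public, decrypt-with-private idea can be tried by hand. The sketch below is only an illustration and assumes the openssl command-line tool is installed. It is not how ssh itself stores or uses keys, and the file names (demo_private.pem and so on) are made up for the example.

```shell
# Illustration only: a throwaway key pair in a temporary folder (hypothetical file names).
workdir=$(mktemp -d)

# 1. Generate a private key, then derive the public key from it (the one-way step).
openssl genpkey -algorithm RSA -out "$workdir/demo_private.pem" 2>/dev/null
openssl pkey -in "$workdir/demo_private.pem" -pubout -out "$workdir/demo_public.pem"

# 2. The "cluster" encrypts a challenge using only the public key.
printf 'a random challenge' > "$workdir/challenge.txt"
openssl pkeyutl -encrypt -pubin -inkey "$workdir/demo_public.pem" \
    -in "$workdir/challenge.txt" -out "$workdir/challenge.enc"

# 3. "Your machine" decrypts the challenge using the private key.
openssl pkeyutl -decrypt -inkey "$workdir/demo_private.pem" \
    -in "$workdir/challenge.enc" -out "$workdir/response.txt"

# 4. The response equals the challenge only if the matching private key was used.
if cmp -s "$workdir/challenge.txt" "$workdir/response.txt"; then
    roundtrip="match"
else
    roundtrip="mismatch"
fi
echo "$roundtrip"
```

The private key file never has to leave the folder it was created in; only the public key and the encrypted challenge need to be shared.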

First, check if you have these two authentication files on your local machine (you can do so by running ls -a ~/.ssh in the terminal):

~/.ssh/id_rsa
~/.ssh/id_rsa.pub

You most likely do not. If so, you generate a pair of authentication keys with the command below. Just press Enter when asked "Enter file in which to save the key". Do not enter a passphrase when prompted - just press enter:

ssh-keygen -t rsa
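The check-then-generate logic can be written as a small shell snippet. This is a minimal sketch: have_keypair is a hypothetical helper name, and the file names id_rsa and id_rsa.pub are the ones listed in this guide.

```shell
# Hypothetical helper: succeeds only if both key files exist in the given folder.
have_keypair() {
    [ -f "$1/id_rsa" ] && [ -f "$1/id_rsa.pub" ]
}

if have_keypair "$HOME/.ssh"; then
    echo "Found an existing key pair in ~/.ssh, no need to generate one."
else
    echo "No key pair found. Generate one with: ssh-keygen -t rsa"
fi
```

Checking first matters because generating a new pair under the same file names would replace keys you may already be using elsewhere.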

Now use ssh to create a directory ~/.ssh on the cluster (assuming your username on the cluster is <cluster user name>). You will be prompted for your password.

ssh <cluster user name>@login.genome.au.dk mkdir -p .ssh

Finally, append the public ssh key on your local machine to the file .ssh/authorized_keys on the cluster and enter your password (replace <cluster user name> with your cluster user name):

cat ~/.ssh/id_rsa.pub | ssh <cluster user name>@login.genome.au.dk 'cat >> .ssh/authorized_keys'

From now on, you can log into the cluster from your local machine without being prompted for a password.

Setting up your home on the cluster

Now log in to the cluster:

ssh <cluster user name>@login.genome.au.dk

(see, no password).

Install Python on your cluster account

You need to install miniconda (a minimal Anaconda version) in your cluster home dir. Log in to the cluster and run this command to download the miniconda install script:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

Then use this command to run the install script:

bash Miniconda3-latest-Linux-x86_64.sh

Follow the default installation, and say yes when it asks you if it should run conda init for you.

NB: Now log out of the cluster and log back in. This is needed to make the conda command available to you.

Creating an environment on the cluster

Log in to the cluster and run this command to create a conda environment for your project on the cluster. This environment should contain the packages that you need for your project. Unless your supervisor tells you otherwise, the command below should install what you need to get started on a project:

conda create --name bircproject -c gwforg -c conda-forge -c bioconda -c kaspermunch slurm-jupyter gwf jupyterlab jupyterlab-git pandas seaborn biopython 

If your supervisor wants you to install a specific set of packages for your project, they may have given you a file called environment.yml. In that case, you instead create your environment like this:

conda env create -n bircproject -f environment.yml

You will probably end up needing more packages than you initially included, but you can easily install them later. E.g., to see how to install biopython using conda, just google "conda biopython". The top link instructs you to install it like this: conda install -c conda-forge biopython.

Important: Whenever you log into the cluster to work on your project, you should activate your bircproject environment like this:

conda activate bircproject

When your environment is active, it says (bircproject) on the command prompt instead of (base).
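Forgetting to activate the environment is an easy mistake, so a small check can help. The sketch below relies on the CONDA_DEFAULT_ENV variable that conda sets when an environment is activated. require_env is a hypothetical helper name, not part of conda.

```shell
# Hypothetical helper: report whether the named conda environment is the active one.
require_env() {
    if [ "${CONDA_DEFAULT_ENV:-}" = "$1" ]; then
        echo "ok: $1 is active"
    else
        echo "not active, run: conda activate $1"
    fi
}

require_env bircproject
```

The helper only reports the state; it does not activate anything for you.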

Set up Jupyter

Jupyter is a notebook environment where you can easily combine text, code, and plots. Using the slurm-jupyter tool, you can run a jupyter notebook on the cluster but see it in the browser on your own machine. That way, your analysis runs on the cluster file system where your data is but the notebook interface is sent to your browser window.

Before you can connect to a jupyter session on the cluster, you need to do a bit of configuration of the jupyter settings on the cluster. slurm-jupyter comes with a script that automates that procedure. Just log into the cluster, activate your environment, and run:

config-slurm-jupyter.sh

The script will ask for a lot of information. You can just press Enter for all of them except when prompted for what password you want to use: Then, you must type your cluster password.

Backup and version control

Your files on the cluster are not backed up! If you want to back up files, you need to put them in a folder called BACKUP. However, a better way is to use git and GitHub (see below).

GitHub

The page you are reading now is a GitHub repository. In addition to the documentation you are currently reading, it serves as a template for your project.

Start by creating your own github account if you do not have one.

Then you need to follow the instructions on this page to add ssh keys to GitHub.

Now log in to your GitHub account on the web. Go to the birc-project repository. In addition to this page, the repository contains a folder structure and config files, which serve as a good starting point for a project repository. You must now "fork" the repository. Forking creates your own copy of the birc-project repository under your own GitHub account. To fork the repository, click the button in the top right corner that says Fork.

Once your own birc-project repository is ready, go to the repository front page. There you will find your own copy of this page README.md. Read on from there.

Git

Git is a version control tool that you use from the terminal. A folder under Git control is called a repository. birc-project is such a repository. Git does not interfere with your files and it does not save them. It lets you monitor the state of your files so you can easily see if any files are added, modified, or removed, and it allows you to (manually) maintain a record of which files were changed when, how, and for what reason.

Start by logging into the cluster and run these two commands to let Git know who you are:

git config --global user.name "<Your GitHub user name>"
git config --global user.email "<your email address>"

Note that to run git commands tracking your birc-project, you must stand in your birc-project folder when you run them.

These tutorials are good. The five most important commands to learn are git status, git add, git rm, git commit, and git push.

You should read the documentation, but in a nutshell this is how it works:

git status shows you the status of your files. It may show you that you have created or modified one or more files.

To journal the creation or modification of a file, you run git add <the_changed_file>. This "stages" the change: it adds the file to a "group" of changes that represents some modification to your project. You can add more files to that "group" by running git add again. Maybe you added two Python files that together let you run some analysis. Now you want to store the group of changes in the journal maintained by Git and associate it with a description. For that you use git commit -m 'description'. If you added two Python files, your description could be 'Added two python files for my analysis'. Now your changes are recorded in the Git journal for the birc-project repository on the cluster, but that does not serve as a backup in case you accidentally delete your entire folder or the cluster burns down. To back up your repository, you need to synchronize your birc-project repository on the cluster with the birc-project repository on GitHub. You do that by running git push.
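The status, add, and commit cycle can be tried safely in a throwaway repository before you use it on your real project. The sketch below assumes git is installed; the folder, the file name analysis.py, and the demo user details are made up for the example.

```shell
# Throwaway repository in a temporary folder (not your real birc-project folder).
demo=$(mktemp -d)
cd "$demo"
git init --quiet
git config user.name "Demo User"             # hypothetical identity, local to this repo
git config user.email "demo@example.com"

echo 'print("hello")' > analysis.py          # hypothetical new file

git status --short                           # prints "?? analysis.py" (not yet tracked)
git add analysis.py                          # stage the change
git status --short                           # prints "A  analysis.py" (staged)
git commit --quiet -m 'Added a Python file for my analysis'
git log --oneline                            # one line per recorded commit

# git push would then send the commit to GitHub; it is left out here
# because this throwaway repository has no remote to push to.
```

In a real project you would run the same commands while standing in your birc-project folder.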

There is a cheat sheet here and some good visual guides here and here.

Remember: Any files you push to GitHub are backed up. Make a habit of pushing all changes you make.

Working on the cluster

The project folder

The project folder is a folder that is set up on the cluster to hold your project. I call it projectfolder here, but it will be called something sensible like baboonadmixture.

It is accessible to only you and anyone else you collaborate with (such as your supervisor). The project folder is in your home directory and should hold the following subfolders:

projectfolder
    /data
    /people
        /username
        /supervisor_username

The <projectfolder>/people/<username> folder is your domain. This is where you keep all the files that relate to your project.

The name of the project folder is also the name of the account that you should bill any work on the cluster to. When you run srun, sbatch, or slurm-jupyter, you must specify that project name using the -A or --account options (see below for more details on that).

Cloning your git repository to the cluster

Go to your folder under the project folder (<projectfolder>/people/<username>) and clone your repository:

git clone git@github.com:username/birc-project.git

(replace username with your GitHub username).

You now have a folder called <projectfolder>/people/<username>/birc-project and this is where you will do all your work for the project.

If you cd into birc-project and run ls, you will see a number of folders.

  • data: stores small (tens of megabytes) data files you want to keep.
  • docs: stores text files explaining how you set up your project and analyses.
  • binder: stores an environment.yml file documenting the conda environment used in the project.
  • sandbox: stores experiments that are not yet part of your project workflow.
  • scripts: stores Python scripts that produce intermediate and final results.
  • steps: stores intermediary files ("steps" on the way to final results).
  • notebooks: stores Jupyter notebooks with code, documentation, and results.
  • figures: stores result plots and figures you make.
  • results: stores the small result files of your project (tens of megabytes).

Files in all those folders are under Git control, except files in the steps folder. Those files are not backed up in any way, but should instead be reproducible using the code and information in your other folders.
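One common way to keep a folder such as steps out of Git is an entry in a .gitignore file. Whether the birc-project template does it exactly this way is an assumption. The sketch below uses a throwaway repository and a made-up file name to show how you can check whether Git ignores a given file.

```shell
# Throwaway repository for illustration (not your real birc-project folder).
repo=$(mktemp -d)
git init --quiet "$repo"
mkdir "$repo/steps"

# Assumption: the steps folder is excluded through a .gitignore entry.
echo 'steps/' > "$repo/.gitignore"
echo 'intermediate result' > "$repo/steps/alignment.tmp"   # hypothetical file

# git check-ignore exits successfully when the given path is ignored.
if git -C "$repo" check-ignore -q steps/alignment.tmp; then
    ignored="yes"
else
    ignored="no"
fi
echo "steps/alignment.tmp ignored by Git: $ignored"
```

An ignored file never shows up in git status, so it cannot end up on GitHub by accident.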

Visual Studio Code

If you did not do so when you installed Anaconda, you should download and install Visual Studio Code. VS Code is great for developing scripts and editing text files. Once you have installed VS Code, you should install the "Remote Development" extension. You do that by clicking the funny squares in the left bar and searching for "Remote Development". Once installed, you can click the small green square in the lower-left corner to connect to the cluster. Select "Connect current window to host", then "Add new SSH host", then type <username>@login.genome.au.dk, then select the config file .ssh/config. Now you can click the small green square in the lower-left corner to connect to the cluster by selecting login.genome.au.dk. It may take a bit, but once it is done installing a remote server, you will have access to the files in your home folder on the cluster.

How to run a Jupyter notebook on the cluster

Jupyter runs best in the Chrome browser or Safari on Mac. For the best experience, install one of them before you go on. It does not need to be your default browser; slurm-jupyter will use it anyway. Now make sure you are on your own machine and that your bircproject environment is activated. Then run this command to start a jupyter notebook on the cluster and send the display to your browser:

slurm-jupyter -C -u <cluster_user_name> -A <projectfolder> -e bircproject

(replace <cluster_user_name> with your cluster user name and <projectfolder> with your project folder name).

Watch the terminal to see what is going on. After a while, a jupyter notebook should show up in your browser window. The first time you do this, your browser may refuse to show jupyter because the connection is unsafe. In Safari you are prompted with this window where you click "details":

image

Then you get this window and click "visit this website":

image

In Chrome, you can simply type the characters "thisisunsafe" while in the Chrome window:

image

Once ready, jupyter may ask for your cluster password. To close the jupyter notebook, press Ctrl-c in the terminal. Closing the browser window does not close down the jupyter on the cluster. You can read this tutorial to learn how to use a jupyter notebook.

Running interactive commands on the cluster

When you log into the cluster, you land on the "front-end" of the cluster. Think of it as the lobby of a giant hotel. If you execute the hostname command, you will get fe-open-01. fe-open-01 is the name of the front-end machine. The "front-end" is a single machine shared by anyone who logs in. So you cannot run resource-intensive jobs there, but quick commands that finish in less than ten seconds are ok. In the exercises for this course, you will run software that takes a much longer time to finish. So you need one of the computing machines on the cluster, so you can work on that instead. You ask for a computing machine by running this command:

srun --mem-per-cpu=1g --time=3:00:00 --account=<projectfolder> --pty bash

That says that you need at most one gigabyte of memory, that you need it for at most three hours (the duration of the exercise), and that the computing expenses should be billed to the project <projectfolder>. When you execute the command, your terminal will say "srun: job 40924828 queued and waiting for resources". That means that you are waiting for a machine. Once it prints "srun: job 40924828 has been allocated resources", you have been logged into a computing node. If you execute the hostname command, you will get something like s05n20. s05n20 is a computing machine. The same way you moved from your own computer to the front-end machine of the cluster by logging in using ssh, the command above moves you from the front-end to a compute machine. Now you can execute any command you like without causing trouble for anyone.

Now try to log out of the compute node by executing the exit command or by pressing Ctrl-d. If you execute the hostname command again, you will get the name of the front-end machine (e.g., fe-open-01), showing that you are back at the front-end.
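Because the prompt looks much the same everywhere, it is easy to lose track of which machine you are on. As a rough sketch, the function below guesses the kind of machine from its hostname. The patterns are assumptions based only on the example names in this guide (fe-open-01 and s05n20), and where_am_i is a made-up name.

```shell
# Hypothetical helper: classify a hostname using the example names from this guide.
where_am_i() {
    case "$1" in
        fe[0-9-]*)       echo "front-end" ;;        # e.g. fe-open-01
        s[0-9][0-9]n*)   echo "compute node" ;;     # e.g. s05n20
        *)               echo "not on the cluster (or an unknown machine)" ;;
    esac
}

where_am_i "$(hostname)"
```

Running it before starting a heavy command is a quick way to confirm that you are not on the shared front-end.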

Queueing commands on the cluster

For non-interactive work, it is better to submit your command as a job to the cluster. When you do that, the job gets queued along with many other jobs, and as soon as the requested resources are available on the cluster, the job will start on one of the many machines. To submit a job, you must first create a file (a "batch script") that contains both the requested computer resources and the command you want to run.

Create a file called myscript.sh with exactly this content:

#!/bin/bash
#SBATCH --mem=1gb
#SBATCH --time=01:00:00
#SBATCH --account=<projectfolder>
#SBATCH --job-name=firstjob

echo "I can submit cluster jobs now!" > success.txt

(replace <projectfolder> with your project folder name)

The first line says this is a bash script. The following three lines say that your job needs at most one gigabyte of memory, that it will run for at most one hour, and that the expenses should be billed to the project <projectfolder>. The last #SBATCH line gives the name of the job. Here we have called it firstjob, but you should name it something sensible.

You submit the job using the sbatch command:

sbatch myscript.sh

Now your job is queued. Use the mj command to see what jobs you have queued or running. That will show something like this:

                                                                        Alloc
Job ID           Username Queue    Jobname    SessID NDS  S Elap Time   nodes
---------------- -------- -------- ---------- ------ ---  - ----------  -----
34745986         kmt      normal   firstjob       --   1  R 0-00:19:27  s03n56

If you want to cancel this job before it finishes, you can use the scancel command:

scancel 34745986

Once your job finishes, it has created the file success.txt and written "I can submit cluster jobs now!" to it. To see that, you can use the cat command:

cat success.txt

When you run a program or script on the command line, it usually also prints some information in the terminal. When you run a job on the cluster, there is no terminal to print to. Instead, this is written to two files that you can read when the job finishes. In this case, the files are called firstjob.stdout and firstjob.stderr. To see what is in them, you can use the cat command:

cat firstjob.stdout

and

cat firstjob.stderr
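The example batch script does not itself say where this output goes, and with Slurm's default settings a job's output typically ends up in a single file named after the job ID (slurm-<jobid>.out). One way to get the two files named above is to request them explicitly with the --output and --error options of sbatch. This is a sketch under that assumption, not necessarily how the cluster is configured:

```shell
#!/bin/bash
#SBATCH --mem=1gb
#SBATCH --time=01:00:00
#SBATCH --account=<projectfolder>
#SBATCH --job-name=firstjob
#SBATCH --output=firstjob.stdout
#SBATCH --error=firstjob.stderr

# Normal output ends up in firstjob.stdout when the script is run through sbatch.
echo "This line goes to standard output"

# Error messages end up in firstjob.stderr when the script is run through sbatch.
echo "This line goes to standard error" >&2

echo "I can submit cluster jobs now!" > success.txt
```

As before, replace <projectfolder> with your project folder name before submitting the script.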

That is basically it.

How to copy files to and from the cluster

You may need to transfer files back and forth between your own machine and the cluster. To copy a file called file in a directory called dir on the cluster to the current folder on your own machine, you can use the scp command:

scp <cluster_user_name>@login.genome.au.dk:dir/file .

To copy a file called file in the current folder on your own machine to a folder called dir on the cluster, you do this:

scp ./file <cluster_user_name>@login.genome.au.dk:dir/
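To copy a whole folder rather than a single file, scp has a -r (recursive) option. The two functions below are a sketch that wraps the commands above. The function names are made up, and nothing is copied until you call them with your own user name and paths.

```shell
# Hypothetical helpers wrapping scp; nothing runs until you call them.
CLUSTER_HOST="login.genome.au.dk"

# Usage: from_cluster <cluster_user_name> <path on cluster> <local destination>
from_cluster() {
    scp -r "$1@$CLUSTER_HOST:$2" "$3"
}

# Usage: to_cluster <cluster_user_name> <local path> <destination on cluster>
to_cluster() {
    scp -r "$2" "$1@$CLUSTER_HOST:$3"
}

# Example (not run here): from_cluster preben dir/results ./results_copy
```

Paths on the cluster that do not start with / are interpreted relative to your home folder there, just as in the single-file commands.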

Reproducible research

This is the last part, but the most important.

More detail on what that means...

Do not save data files, save how you made them

In the world of data projects, there are three kinds of data files.

  1. Those representing the input to your project (sequencing reads, raw data, etc.)
  2. Those representing the output from your project (numbers, notebooks, plots, tables, etc.)
  3. Those representing intermediary steps to get from files of type 1 to files of type 2.

Type 1 files are usually hard/expensive to reproduce and are saved indefinitely on the cluster. Type 2 files are generally very small and inexpensive to save indefinitely on the cluster. Type 3 files can be large and are easily regenerated if your project is reproducible. So type 3 files are not saved indefinitely. In fact, type 3 files should be deleted as soon as the project is finished. Toward the end of this tutorial, I will help you get set up to distinguish these three types of files.

Export your conda environment

When you have created your conda environment (and whenever you add new packages to your environment), you should update the environment.yml file in the binder directory. That file allows anyone to create the same conda environment, ensuring that they have all the packages needed to run your code. You do that like this:

conda env export --from-history -f ./binder/environment.yml

Supervisor note: this also makes it easy to create a binder for playing with the student's notebooks without interfering with the project. Go to Binder and fill in the https address of the student repository, e.g., https://github.com/username/birc-project.

Building workflows with GWF

https://gwf.app/

Managing notebooks