Week 4. scanpy, docker, advanced topics

Most of this exercise will be done in a Jupyter notebook. The notebook will be run in a docker container. In this document, there are instructions for getting the docker container started, copying the jupyter notebook, and logging into your jupyter notebook. Once you have logged into jupyter, you can explore the jupyter notebooks, which will demonstrate single-cell analysis using scanpy.

1. Connect to the server

Find your assigned server name on this website. Instructions to connect to the server can be found on this page.

2. Pull the docker image

docker1 pull biohpc/scrna2024
docker1 images

The first command downloads the image from dockerhub, if not already on the server. The second command lists all currently available images on the server, you should see biohpc/scrna2024 on the list.

3. Clone the workshop repository

Chose a working directory, 'cd' to it, and then download a copy of this repository from github. For working dir, you may chose a directory in /workdir/$USER, or in /home2/$USER. In this example I will use /workdir/$USER/scrna_workshop.

# first, make the directory if it does not already exist, then cd into it
mkdir -p /workdir/$USER/scrna_workshop
cd /workdir/$USER/scrna_workshop

# now use git clone to make a copy of workshop materials in this directory
git clone https://github.com/bixBeta/scRNA-WS24.git

About choosing a working directory

We usually recommend doing your work in some subdirectory of /workdir/$USER. /workdir/$USER is local disk, so input/output operations will be fastest using this directory. However, you only have access to /workdir/$USER during your reservation. If you want to save anything, you will need to copy it to more permanent storage (such as your home directory) before the reservation ends.

More recently, we also have the option of using /home2/$USER for computation. This is networked storage, but configured to be fast enough to support computations. It still may not be as fast as local disk. The benefit is that /home2/$USER is permanent storage, so you will not lose any data when your reservation ends. However, it is subject to a quota (usually 200Gb), unless you or your lab has purchased additional storage.

4. Start a docker container

The container is a virtual machine, running the operating system stored in the biohpc/scrna2024 image. The image is designed to start a jupyter notebook as soon as the container starts. The command will look like this (read on for some explanation):

docker1 run -p 8009:8888 -v `pwd`:/data \
    -v /workdir/sc_workshop_2024:/workdir/sc_workshop_2024:ro  \
    --rm -d biohpc/scrna2024 /data

Details about the docker1 run command options

-p 8009:8888: maps the port 8009 on your assigned server to port 8888 inside the container. By default, jupyter uses port 8888, but this is just the port inside the virtual machine. To access it on your computer, you need to map this to a port on the host machine. By using this command, you should be able to access the jupyter server at http://cbsuXXXXX.biohpc.cornell.edu:8009. (replace cbsuXXXXX with your assigned machine). If you get an error about the port being busy, it may be being used by another workshop participant, so try another port between 8009-8039. Remember that 8016 is already being used for RStudio.
-v `pwd`:/data: This mounts your current directory (returned by the `pwd` command) to the directory /data inside the docker container. At BioHPC, the docker1 version of docker restricts the directories you can mount; you should be able to mount any directory in /workdir, /local, or /home2 owned by you. docker1 also automatically mounts /workdir/$USER to /workdir inside the container by default.
-v /workdir/scrna_workshop_2024:/data/scrna_workshop_2024:ro: This is an additional mount command (you can mount as many directories as you like); this time mounting /workdir/scrna_workshop_2024 on the server to /data/scrna_workshop_2024 inside the container. The :ro at the end means that it will be a read-only mount. docker1 allows read-only mounting of directories owned by others, if the owner has granted access (using docker1 access command). By mounting this directory, you will be able to access the example data without making your own copy.
--rm just means to delete the container when it finishes running. It is not necessary, but helps keep the server clear of old zombie containers.
-d means to run the container in the background.
The /data at the end of the command is passed to the container, this directory will be used as jupyter's root directory.

If successful, the command should ouput a long string of characters - this is the container ID. Usually only the first 12 characters of the ID are used. You can also verify the container is running with the command docker1 ps. This lists all running containers on the server.

If not successful, you may need to try a different port instead of 8009 (see detail section above for more info about choosing a port).

5. Log into your jupyter notebook

First, you need to get the notebook URL, which includes a long authentication token. Copy your container ID from the previous section, and replace the XXXXXX in the following command with the container ID:

docker1 exec XXXXXXXX jupyter lab list

This will return the address of your jupyter notebook. It will start with http://XXXXXX:8888. You need to change the XXXXX part to your server name (i.e., cbsumma01.biohpc.cornell.edu), and change 8888 to the port you have mapped to (i.e., 8009). You can do this manually, or automate this with the following code:

# change port and containerID below as appropriate
port=8009 
containerID=XXXXXXX

# then run this command
docker1 exec $containerID jupyter lab list | tail -n 1 |
   awk -F '[ /]' -v host=$HOSTNAME -v port=$port '{print "http://"host":"port$4}'

The last command should output the URL of your jupyter notebook. You should be able to connect to it if you are on the Cornell campus network. If you are off campus, you will need to set up a tunnel (see instructions in details below).

Connecting from outside the Cornell network

Set up custom firewall: find your current IP. You can use this web page: https://whatismyipaddress.com/ to find your current IP. Login to https://biohpc.cornell.edu/Default.aspx, under your user ID, click "Custom firewall". Use pulldown to locate your server, and enter your IP.
Set up tunnel: Run this command on Mac Terminal or Windows Command Prompt. Replace USERNAME with your BioHPC user ID, cbsuXXXX with your assigned server, and 8009 with your mapped port (in both places): ssh -N -L 8009:localhost:8009 USERNAME@cbsuXXXXX.biohpc.cornell.edu
After you run this command, it may look like the terminal hangs there. That is ok. Keep this terminal Wwndow open.
Visit the jupyter notebook URL above, but replace the servername with the word 'localhost'. The URL should look something like: http://localhost:8009/?token=this1is2a3very4long5token6string.

6. Now you should be in jupyter!

You should see your files in the left-side pane (assuming folder icon on upper-right is chosen). You can navigate to the scRNA-WS24/Lessons directory, and double-click on the Week4-scanpy.ipynb file to start working through the jupyter notebook.

7. Claim files when finished

Because it is running in docker, the files created by the notebook will not belong to you on the server. From the server, you can execute the command docker1 claim to 'claim' the files and fix the permissions.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Week4.md

Week4.md

Week 4. scanpy, docker, advanced topics

1. Connect to the server

2. Pull the docker image

3. Clone the workshop repository

4. Start a docker container

5. Log into your jupyter notebook

6. Now you should be in jupyter!

7. Claim files when finished

Files

Week4.md

Latest commit

History

Week4.md

File metadata and controls

Week 4. scanpy, docker, advanced topics

1. Connect to the server

2. Pull the docker image

3. Clone the workshop repository

4. Start a docker container

5. Log into your jupyter notebook

6. Now you should be in jupyter!

7. Claim files when finished