This week, we cover some basics of how sequence is generated, delve into the details of data and formats, and talk through basic QC of your sequence.
Let's play around with some real sequence data and QC it. The expectation is that you'll do this on your laptop/desktop (though it should also be possible to do all of this on the cluster or cloud, provided that docker is installed).
This is not a step-by-step tutorial, but all of the commands you need to complete these steps are introduced in the lecture slides. Remember to use the `-h` or `--help` flags to get usage, or use `man <command>` to see additional help on particular commands. Ask for help in the Slack #bfx_workshop channel if you get stuck!
1. Get an interactive job in a docker container. The image chrisamiller/docker-genomic-analysis has a lot of common genomics tools installed that may be useful for the first few steps. Be sure to mount your working directory so that you have access inside the container! (See the slides for examples of this.)
2. We're going to work with data from a human cell line posted here: https://storage.googleapis.com/bfx_workshop_tmp/Exome_Tumor.tar Make a directory called "week03", and download the tarball to your computer using the command line (`wget` or `curl`).
3. Use `tar -xvf` to extract the directory from the tar file, then cd into the directory and look around with `ls`. We're not going to use all of this data in this week's homework. Let's focus on the contents of `Exome_Tumor.tar`. Untar it, then unzip the fastq files.
4. Look at the first three records (not first three lines!) of each fastq file. Take a close look at the read names and how they match up across files.
5. How many paired-end sequences do these files contain?
6. What is the read length? Is the read length consistent for every record?
7. How many total nucleotides of sequence are contained in these two files?
8. Use `gzip` to recompress these two fastq files to save space.
9. Exit that docker container (type `exit`) and launch a new docker session using a container that has the fastqc tool: `quay.io/biocontainers/fastqc:0.11.9--0`
10. Run fastqc on these data: `fastqc *.fastq.gz` What is the asterisk doing here? Note the files it produces: an HTML file with a user-friendly summary, and a zip file that you can dig into if you need more details or want to parse the output files by hand.
11. Exit the docker container, browse to the HTML files, and open them up. Do you see any potential issues with the sequence data? (Handy shortcuts: If you're using WSL/Ubuntu on a Windows machine, try typing `explorer.exe myfile.html` to open the HTML file in a browser. Mac users can use `open myfile.html`.)
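
The mount mentioned in the first step can be sketched as follows. This is a minimal sketch, not the command from the slides: the host path `$PWD/week03`, the mount point `/data`, and the use of `/bin/bash` as the shell inside the image are all assumptions to adjust for your setup. The script only prints the command so that you can review it before running it yourself.

```shell
# Hypothetical host directory and mount point; change both to match your setup.
host_dir="$PWD/week03"
container_dir="/data"
image="chrisamiller/docker-genomic-analysis"

# -it   : interactive terminal
# --rm  : remove the container when you exit
# -v    : mount HOST_DIR at CONTAINER_DIR inside the container
# -w    : start the shell in CONTAINER_DIR
# /bin/bash is assumed to exist in the image.
docker_cmd="docker run -it --rm -v $host_dir:$container_dir -w $container_dir $image /bin/bash"

# Print the command instead of running it, so it can be checked first.
echo "$docker_cmd"
```

Files written under the mounted directory from inside the container are the same files you see on the host, so they are still there after you type `exit`. If your path contains spaces, quote the `-v` argument when you type the command by hand.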
Send the answers to questions 5-7, plus a screenshot of part of the fastqc output, to Jenny as proof of completion.
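
The FASTQ questions above (record layout, read counts, read lengths, total bases, recompression, and the asterisk) all rest on the four-line record structure and on how the shell expands globs. The sketch below works through those mechanics on made-up toy reads, not the workshop data. The file names `toy_R1.fastq` and `toy_R2.fastq` and the `/1` and `/2` read-name suffixes are illustrative assumptions, and the real files may name their reads differently.

```shell
# Work in a scratch directory so nothing here touches the real homework data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Toy paired-end reads (made up for illustration). Each FASTQ record is
# 4 lines: @name, bases, a '+' separator, and one quality character per base.
printf '%s\n' \
  '@read1/1' 'ACGTACGT' '+' 'IIIIIIII' \
  '@read2/1' 'ACGTAC' '+' 'IIIIII' \
  '@read3/1' 'ACGTACGT' '+' 'IIIIIIII' > toy_R1.fastq
printf '%s\n' \
  '@read1/2' 'TTGCATGC' '+' 'IIIIIIII' \
  '@read2/2' 'TTGCATGC' '+' 'IIIIIIII' \
  '@read3/2' 'TTGCAT' '+' 'IIIIII' > toy_R2.fastq

# The first two records are the first 8 lines (2 records x 4 lines each).
head -n 8 toy_R1.fastq

# Read names are line 1 of each record; paste shows R1 and R2 names side by side.
awk 'FNR % 4 == 1' toy_R1.fastq > names_R1.txt
awk 'FNR % 4 == 1' toy_R2.fastq > names_R2.txt
paste names_R1.txt names_R2.txt

# Records per file = line count / 4.
records=$(( $(wc -l < toy_R1.fastq) / 4 ))
echo "records in R1: $records"

# Bases are line 2 of each record. Tally how often each read length occurs.
length_table=$(awk 'FNR % 4 == 2 { n[length($0)]++ }
                    END { for (len in n) print len, n[len] }' \
               toy_R1.fastq toy_R2.fastq | sort -n)
echo "$length_table"

# Total bases = sum of the lengths of every sequence line in both files.
total_bases=$(awk 'FNR % 4 == 2 { total += length($0) } END { print total }' \
              toy_R1.fastq toy_R2.fastq)
echo "total bases: $total_bases"

# Compress both files. The shell expands the glob into matching file names
# before the command (here echo) ever sees it.
gzip toy_R1.fastq toy_R2.fastq
matched=$(echo *.fastq.gz)
echo "glob expanded to: $matched"

# Compressed files can be read as a stream without unzipping them on disk.
gz_records=$(( $(gzip -dc toy_R1.fastq.gz | wc -l) / 4 ))
echo "records in compressed R1: $gz_records"
```

The same commands apply to the workshop files once the toy file names are swapped for the real ones. The length table is worth a look on real data, since more than one row means the reads are not all the same length.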