Skip to content

Latest commit

 

History

History
812 lines (602 loc) · 23.6 KB

shell_intro.md

File metadata and controls

812 lines (602 loc) · 23.6 KB

Unix Shell: scripting and automating

UQ Library 2024-09-16

Installation

The Unix shell we use for this lesson is called Bash.

Installation instructions are available on this page.

Data

Download the data archive from this link and extract its contents on your Desktop.

Introduction

The shell is a program that enables us to send commands to the computer and receive output. It is also referred to as the “terminal” or “command line”. When we use the shell, we use a command-line interface (or CLI) instead of a graphical user interface (or GUI). We type a command, and press enter to execute it.

Why use the shell?

  • The shell’s main advantages are its high action-to-keystroke ratio, its support for light task automation, and its capacity to access networked machines.
  • The shell’s main disadvantages are its primarily textual nature and how cryptic its commands and operation can be.

The Unix shell has been around longer than most of its users have been alive. It has survived so long because it’s a power tool that allows people to do complex things with just a few keystrokes. More importantly, it helps them combine existing programs in new ways and automate repetitive tasks so they aren’t typing the same things over and over again. Use of the shell is fundamental to using a wide range of other powerful tools and computing resources (including “high-performance computing” supercomputers). This lesson will start you on a path towards using these resources effectively.

Format

We will learn doing some live-coding, which means we will all be using the shell and typing the same things – a great way to learn. No need to take notes as they are available online for later reference.

Our data: Nelle’s research

The data we use as an example for this lesson is a collection of 1520 files that contain information about protein abundance in samples collected by a marine biologist, Nelle Nemo. They need to be run through a program called goostats but that would take too much time if each file was run manually.

The shell might be helpful to automate this repetitive task.

First, we’ll need to understand how to navigate our file system using the shell.

Navigating the file system

The part of the operating system responsible for managing files and directories is called the file system. It organises our data into files, which hold information, and directories (also called “folders”), which hold files or other directories.

To navigate our file system in the shell, let’s learn a few useful commands. Type the following command and press enter:

pwd

pwd stand for “print working directory” and outputs the name of the directory we are currently located in. For most, it will be the home directory of the current user.

Now, try this command:

ls

ls stands for “listing”. It lists the contents of the current working directory.

Commands can often take extra parameters, called flags (also called “options”). We can add the flag -a (for “all”) to our ls command in order to also list hidden elements:

ls -a

To find out more about a particular command, including what flags exist for it, use the --help flag after it, like so:

ls --help

You might have to use the command man ls (for manual) on some systems

To look at the contents of a different directory, we can specify it by adding the directory’s name as an argument:

ls Desktop

As you can see, a command can take both flags and arguments. For example, the command:

ls -lh Documents

… associates the two flags -l (for “long listing”) and -h (for “human-readable”) to output extra information and make file sizes more user-friendly, and specifies that we want to list what the Documents directory contains.

To navigate into our data directory, we’ll use a new command called cd for “change directory”.

cd Desktop
cd data-shell
cd data

We just navigated down three levels of directories, one at a time, starting from our home directory. It is also possible to do that in one command:

cd Desktop/data-shell/data

You can always check where you are currently with pwd, and have a look at where you can navigate next with ls.

I you want to go back to the data-shell directory, there is a shortcut to move up to the parent directory:

cd ..

Similarly, the shortcut to specify the current working directory is a single dot: ..

cd on its own will bring you back to your home directory.

We have been using relative paths so far, always referring to where we currently are in the file system, but we can also specify absolute paths by using a leading /, which represents the root directory (i.e. the highest in your file system). For example, you can always use one of the following commands to go to the data-shell folder, wherever you are (replace “username” by your user name):

cd /Users/username/Desktop/data-shell
cd /home/username/Desktop/data-shell

Two more shortcuts are handy when it comes to changing or specifying directories: ~ is the home directory, and - is the previous directory we were in.

Working with files and directories

We now know how to explore files and directories, but how do we create, modify and delete them?

In the data-shell directory, let’s create a new directory called thesis thanks to the mkdir command (for “make directory”):

cd ../..
mkdir thesis

To work more comfortably with the shell, it is a good idea to name files and directory without using whitespaces, as they are usually used to separate arguments in commands.

Using ls will now list the newly created directory.

We can check that the new directory is in fact empty:

ls thesis

Let’s move into it and create a new text file called draft.txt using a text editor called Nano:

cd thesis
nano draft.txt

Type a few lines of text, and save with Ctrl+O. (Nano uses the symbol ^ for the control key.) Nano also checks that you are happy with the file name: press enter at the prompt, and exit the editor with Ctrl+X.

Nano does not leave any ouptut, but you can check that the file exists with ls. You can also see the contents of a text file with the cat command (it stands for “concatenate”):

cat

If you are not happy with your work, you can remove the file with the rm command, but beware: in the shell, deleting is forever! There is no rubbish bin.

rm draft.txt

Let’s re-create that file and then move up one directory to /Users/username/Desktop/data-shell using cd ..:

nano draft.txt
ls
cd ..

If we try to delete the thesis directory, we get an error message:

rm thesis

This happens because rm by default only works on files, not directories.

To really get rid of thesis we must also delete the file draft.txt. We can do this with the recursive flag for rm:

rm -r thesis

Removing the files in a directory recursively can be a very dangerous operation. If we’re concerned about what we might be deleting we can add the “interactive” flag -i to rm which will ask us for confirmation before each step.

rm -ri thesis

This removes everything in the directory, then the directory itself, asking at each step for you to confirm the deletion. Type “y” and press Enter to confirm.

Let’s create the directory and file one more time:

mkdir thesis
nano thesis/draft.txt
ls thesis

The name of our new file is not very informative. We can change it with the mv command (for “move”):

mv thesis/draft.txt thesis/quotes.txt

The first argument tells mv what we’re “moving”, while the second is where it’s to go.

mv can silently overwrite any existing file with the same name, which is why using the -i flag is also a good idea here.

Let’s move quotes.txt into the current working directory, by using the . shortcut:

mv thesis/quotes.txt .

We can now check that thesis is empty, and that quotes.txt exists in the current directory:

ls thesis
ls quotes.txt

The cp command copies a file. Let’s copy the file into the thesis directory, with a new name, and check that the original file and the copy both exist:

cp quotes.txt thesis/quotations.txt
ls quotes.txt thesis/quotations.txt

Now, let’s delete the original file and check with ls that it is actually gone:

rm quotes.txt
ls quotes.txt

Filters and pipes

Filters and pipes are the two building blocks for more complex commands. Filters are commands that allow the transformation of a stream of input into a stream of output, whereas pipes send the output of a command as the input of another one. Many commands fit the definition of filters and constitute “small pieces” that can be “loosely joined”, i.e. stringed in new ways. The “pipes and filters” programming model is permitted by the Unix focus on creating small single-purpose tools that work well together.

In the molecules directory, let’s use the wc command (for “word count”):

cd molecules
wc *.pdb

The * wildcard is used to match zero of more characters. Other wildcards include ? to match one single character.

Notice how the output has three numbers for each file? They are the number of lines, words and characters. Flags for wc include -l for restricting the output to line numbers, -w for words, and -c for characters.

To figure out which file is the shortest, we can first redirect the number of lines into a new file thanks to >, so we can reuse it later on:

wc -l *.pdb > lengths.txt

This creates the file, or overwrites it if it already exists. >> on the other hand will append to an existing file.

The sort command will print the alphabetically sorted data to screen. Using the -n flag will sort it numerically instead:

sort -n lengths.txt

We now know that the top line is the shortest file. However, intermediate files make a long process complicated to follow, and clutter your hard drive. We can instead run two commands together:

wc -l *.pdb | sort -n

The vertical bar, |, is called a pipe. It tells the shell we want to use the output of the command on the left as the input for the command on the right.

head and tail will respectively show the beginning and the end of some text. It is possile to overwrite the default of 10 lines with a flag that specifies how many lines we want returned. Let’s use head in our process to only show the first line of the sorted text:

wc -l *.pdb | sort -n | head -1

We can string as many pipes and filters as we want, which makes it possible to do the whole task in one pipeline.

The pipeline can be read as a sentence: “Count the number of lines in all the PDB files, then sort them numerically, then return only the first line.”

Nelle’s pipeline

Nelle has run samples through the assay machines and created 17 files located in the north-pacific-gyre/2012-07-03 directory.

A useful feature in CLIs is “tab completion”. To access folders with longer names, it is often possible to auto-complete the folder name by hitting the Tab key after typing a few letters: typing cd nor and pressing the Tab key will auto-complete to cd north-pacific-gyre/. Another press of the Tab key will add 2012-07-03/ to the command as it is the only item in the folder. If there are several options, pressing the Tab key twice will bring up a list.

To check the consistency of her data, she types:

wc -l *.txt | sort -n | head -5

One file seems to be 60 lines shorter than the others. Before re-running that sample, she checks if other files have too much data:

wc -l *.txt | sort -n | tail -5

To re-run a command you typed not long ago, or to slightly modify it, use the up arrow to navigate your history of commands.

The numbers look good, but the “Z” in there is not expected: everything should be marked either “A” or “B”, by convention. To find others, she types:

ls *Z.txt

Those two files do not match with any depth she recorded, and she therefore won’t use them in her analysis. In case she still might need them later on, she won’t delete them; in the future, she might instead select the files she wants with a wildcard expression, like in this example:

ls *[AB].txt

This will match all files ending in A.txt or B.txt.

Loops

How can we perform the same action on many different files?

Loops are key to productivity improvements through automation as they allow us to execute commands repetitively. Similar to wildcards and tab completion, using loops also reduces the amount of typing (and typing mistakes).

In the creatures directory (reached with cd ../../creatures), using the following command to create backups of our data files will throw an error:

cp *.dat backup-*.dat

The issue is that it expands to giving cp more than two inputs, and therefore expects the last one to be a directory where the copies can go.

The way around that is to use a loop, to do some operation once for each element in a list.

To solve our file copying problem we can use this loop:

for filename in *.dat
do
    cp $filename backup-$filename
done

In this loop, filename is a variable which is assigned a different file name in each run. The variable can be named whatever we want, but a descriptive name is better.

When running this loop, the shell does the following: * expand *.dat to create a list of files * execute the loop body for each of those files: * copy the currently processed file and prepend “original-” to its name. * close the loop with “done”

You can check that your loop will do what you expect it to do beforehand, by prefixing the command in the loop body with echo:

for filename in *.dat
do
    echo cp $filename original-$filename
done

echo is also useful to give extra information while the loop executes, as we’ll see later on.

If your file names contain spaces, you will have to use quotation marks around the filenames and the variable calls. But it is simpler to always avoid using whitespaces when naming files and directories!

Nelle’s pipeline

Nelle now wants to calculate stats on her data files with her lab’s program called goostats. The program takes two arguments: an input file (the raw data) and an output file (to store the stats).

Located in the north-pacific-gyre/2012-07-03, she designs the following loop:

for datafile in NENE*[AB].txt
    do bash goostats $datafile stats-$datafile
done

bash is a program that executes the contents of a script (here, the “goostats” script). More about scripts in a little bit!

When she runs it, the shell seems stalled and nothing gets printed to the screen. She kills the running command with Ctrl + C, uses the up arrow to edit the command and adds anecho` line to the loop body in order to know which file is being processed:

for datafile in NENE*[AB].txt; do echo $datafile; bash goostats $datafile stats-$datafile; done

Notice how you have to separate distinct parts of your code with a ; when it is written in one single line.

It looks like processing her whole dataset (1518 files) will take about two hours. She checks that a sample output file looks good, runs her loop and lets the computer process it all.

Here is another example of how useful a loop can be: to create a logical directory structure. Say a researcher wants to organise experiments measuring reaction rate constants with different compounds and different temperatures. They could use a nested loop like this one:

for species in cubane ethane methane butane
do
    for temperature in 25 30 37 40 50 60
    do
        mkdir $species-$temperature
    done
done

This nested loop would create 24 directories in less than a second. How much time would that take with a graphical file browser?

Shell scripts

How can I save and reuse commands?

We are finally ready to see what makes the shell such a powerful programming environment. We are going to take the commands we repeat frequently and save them in files so that we can re-run all those operations again later by typing a single command. For historical reasons, a bunch of commands saved in a file is usually called a shell script, but make no mistake: these are actually small programs.

To store her analytics and make them reproducible, Nelle creates a script. She create a new file with Nano:

nano do-stats.sh

… and write the following inside it:

for datafile in NENE*[AB].txt
do
    echo $datafile
    bash goostats $datafile stats-$datafile
done

Writing Nelle’s loop in the command line wasn’t very comfortable. When you start writing blocks of code that do more complex things, you want to use a code editor like Nano. Notice how it highlights parts of your code differently? This is called “syntax highlighting”.

Nelle can now run her script with:

bash do-stats.sh

This works well, but what if she wants to make her script more versatile, so the user can decide what the data files are? She modifies the script:

# Calculate stats for data files.
# Usage: bash do-stats.sh file(s)_to_process
for datafile in "$@"
do
    echo $datafile
    bash goostats $datafile stats-$datafile
done

In this new version, she added comments, with lines starting with #. These comments will be ignored by the shell, but will help others understand what the script does, and how to use it.

She also used the special variable $@, which means “any number of arguments”. The user can now provide one or several file names when using the script.

If you prefer to use a specific number of arguments, and use them according to their position, use the variable $1, $2, $3, etc.

Her script now lets her decide what files to process, but she has to remember to exclude the “Z” files.

Designing a script always involves tradeoffs between flexibility and complexity.

Finding things

How can I find files, and find things in files?

grep (for “global / regular expression / print”) is a command that finds and prints lines in files that match a pattern. To test this, we are going to work on a file that contains three haikus. To have a look at it, run the following commands:

cd
cd Desktop/data-shell/writing
cat haiku.txt

To find lines that contatin the word “not”, run the following:

grep not haiku.txt

The output is the three lines in the file that contain the letters “not”.

If we look for the pattern “The”:

grep The haiku.txt

… the output will show two lines, with one instance of those letters contained within a larger word: “Thesis”.

To restrict to lines containing “The” on its own, we can use the -w flag (for “word”):

grep -w The haiku.txt

We can also search for a phrase:

grep -w "is not" haiku.txt

We don’t have to use quotes for patterns without spaces, but we still can do that to be consistent. Another useful flag is -n (for line number):

grep -n "it" haiku.txt

We can also combine flags with this command. Let’s add the -i flag to make the search case-insensitive:

grep -nwi "the" haiku.txt

We can also invert our search with the -v flag, i.e. to output the lines that do not contain the pattern “the”:

grep -nwv "the" haiku.txt

The are many more flags available for grep. You can see a full list with the command grep --help.

grep’s real power comes from the fact that patterns can contain regular expressions. Regular expressions are both complex and powerful. For example, you can search for lines that have an “o” in the second position:

grep -E '^.o' haiku.txt

We use quotes and the -E flag (for “extended regular expression”) to prevent the shell from interpreting it. The ^ anchors the match to the start of the line; the . matches a single character; the o matches an actual lowercase “o”.

Let’s try to analyse a bigger file, like the text from Little Women by Louisa May Alcott. We want to figure out which of the four sisters in the book (Jo, Meg, Beth and Amy) is the most mentioned, something we can achieve with a for loop and grep:

cd data
for sis in Jo Meg Beth Amy
do
    echo $sis:
    grep -ow $sis LittleWomen.txt | wc -l
done

We use the -o flag (for “only matching”) in order to account for multiple occurences on a single line.

While grep finds lines in files, the find command finds files themselves. Let’s move into the writing directory and test it:

cd ..
find .

When given the current working directory as the only argument, find’s output is the names of every file and directory under the current working directory. We can start filtering the output with the -type flag. d is for directory, and f is for files:

find . -type d
find . -type f

We can also match by name:

find . -name *.txt

The issue here is that the shell expanded the wildcard before running the command. To find all the text files in the directory tree, we have to use quotes:

find . -name '*.txt'

If we want to combine find with other commands, we might need a different method than building a pipeline. For example, to count the lines in each one of the found files, one would intuitively try the following:

find . -name '*.txt' | wc -l

… which would only return the number of files find found. In order to pass each of the found files as separate arguments, we can use the following syntax instead:

wc -l $(find . -name '*.txt')

When the shell executes this command, it first expands whatever is inside $() before running the rest of the command, just like for wildcards. In short, $(command) inserts a command’s output in place.

Here is an example combining grep and find:

grep "FE" $(find .. -name '*.pdb')

This command will list all the PDB files that contain iron atoms.

Further resources

To learn more about the Unix Shell:

Licence

This short course is based on the longer course The Unix Shell developped by the non-profit organisation The Carpentries. The original material is licensed under a Creative Commons Attribution license (CC-BY 4.0), and this modified version uses the same license. You are therefore free to:

  • Share — copy and redistribute the material in any medium or format
  • Adapt — remix, transform, and build upon the material

… as long as you give attribution, i.e. you give appropriate credit to the original author, and link to the license.