Week 1: Linux, Command Line

Part I: Setting up a Linux machine

Several of the tools presented in this class work (and were developed) in a UNIX-style operating system. In particular, we will be using GNU/Linux:

Linux is the name of the operating system kernel; a kernel is a collection of low-level code that interfaces with hardware and provides basic functionality, such as a file system. For example, it contains code that allows it to recognize when a keyboard has been plugged into the USB port and register it so that you can use it properly.
GNU is a collection of programs, including libraries for developing software, text editors, a web-browser, etc.

Because GNU and Linux are free software, there is a variety of community-developed distributions of it. A Linux distribution ('distro') is essentially a collection of packages that are considered 'sane defaults' by a community. Importantly, Linux distributions contain package managers, which are pieces of software that allow you to manage, add and remove applications. If you haven't tried Linux before, we recommend trying a distribution that is easy to install and has plenty of documentation available online. Ubuntu and Debian are two such distributions.

Installing as a virtual machine

If you haven't used Linux before, it is probably a good idea to try it in a virtual machine. To do so, try the following:

Download and install VirtualBox, the software that will emulate the Linux machine.
Download a stable version of Ubuntu or Debian and make a virtual machine in VirtualBox. When prompted, make sure to allocate at least 2 GB of RAM and at least 16 GB of hard drive space.

Installing on a dedicated machine

It is also possible to install GNU/Linux as a standalone operating system, either by itself or alongside Windows ('dual boot'). If you wish to do so, make sure to back up all your important files and read the installation instructions carefully.

Part II: The shell

What is the shell? The shell is a program that exposes an OS's services to a user (or another program). Contrary to popular belief, what we call the shell is not the same as what is called the terminal. However, in Linux the main way to interact with the shell is via the terminal, which provides a command-line interface (meaning: not graphical) to the shell, which is why we will sometimes use the words "shell" and "terminal" interchangeably.

There are several different shells available in Linux. Most distributions come with the bash shell. Other shells include zsh, fish, tcsh, and csh.

Getting started

To get started, fire up your distribution's terminal. Depending on your distribution, it might be called "Terminal", "Console" or some variant thereof. This will start a command-line interface (CLI) where you will be able to type commands and see their output. You should see something like the following (called the prompt):

[user@computer ~]$

What does this mean? The shell keeps track of which user is currently logged in and displays their login name (user). It also displays the name of the computer they are logged in to. The last piece of information is the working directory, which is the current location in your computer's filesystem. Here ~ is a special character that denotes your home directory (assuming one is defined; more on this later). Finally, anything to the right of the $ is a command that you type.

For example, my prompt looks like the one below:

[vchariso@vchariso-pc ~]$

Getting Help

Shell commands usually provide a manual page and/or can be invoked with the --help flag, which documents their behavior, arguments, return values, etc. To access the manual page, type man <command>. For example, the following:

[vchariso@vchariso-pc ~]$ man cd

will open up the manual page for the cd command (described below). Knowing about the existence of and using these manual pages is essential for working in the shell. Most newcomers forget that they even exist, and spend precious time googling for documentation (even though several manual pages also provide usage examples) that is already available.

Note: To use the man command, you need to know which command you are looking for help with. If you have a general idea of what you want to do but do not know which command to use for it, you can use the apropos utility, which will search inside the manual pages for a keyword (and also supports regular expressions; more on these later). For example, if you want to find out which command to use to make a directory, you can type the following:

[vchariso@vchariso-pc ~]$ apropos "make directories"
mkdir (1)            - make directories
mkdir (1p)           - make directories

Indeed, man mkdir will convince you that mkdir is what you want to use to make directories. The different numbers ((1) vs. (1p)) correspond to different sections in the manual pages. To understand what these are, you can read more here.

Another help utility you might occasionally need is type:

[vchariso@vchariso-pc ~]$ type python
python is /usr/bin/python
[vchariso@vchariso-pc ~]$ type echo
echo is a shell builtin

Navigation

The shell is not very useful without knowing how to move around between different directories. The cd command does exactly that. For example, to change to a directory called "Documents", we type:

[vchariso@vchariso-pc ~]$ cd Documents
[vchariso@vchariso-pc ~/Documents]$

Since you will eventually build a mental map of where your files are starting from your home directory, typing cd on the shell (without any arguments) will return you to your home directory:

[vchariso@vchariso-pc ~/Documents/Books]$ cd
[vchariso@vchariso-pc ~]$

Paths: Absolute vs. Relative

When navigating in UNIX, it's important to distinguish between absolute (also known as full paths) and relative paths.

Absolute paths: They always start with "/" (the so-called root directory). For example, to find out the absolute path to the working directory, you can type pwd (from Print Working Directory):

[vchariso@vchariso-pc ~/Documents]$ pwd
/home/vchariso/Documents

In fact, "~" is a so-called shell expansion for the user's home directory, and the following two commands are equivalent:

[vchariso@vchariso-pc ~/SomeFolder]$ cd ~
[vchariso@vchariso-pc ~/SomeFolder]$ cd /home/vchariso

You can read more about shell expansions here.

Relative paths: a path that doesn't start with "/" is a relative path. More or less, relative paths are formed by prepending the current working directory to them. For example:

[vchariso@vchariso-pc ~/Documents]$ pwd
/home/vchariso/Documents
[vchariso@vchariso-pc ~/Documents]$ cd Books
[vchariso@vchariso-pc ~/Documents/Books]$ pwd
/home/vchariso/Documents/Books

When you type cd Books in the above, the path Books is resolved relative to your working directory, exactly as if you had typed the full path /home/vchariso/Documents/Books.
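
To try both kinds of path without touching your real files, the following sketch builds a small throwaway directory tree. The names Documents and Books simply mirror the example above, the variable names are only illustrative, and mktemp -d is assumed to be available (it is on most Linux systems). It also uses "..", which refers to the parent of a directory.

```shell
# Build a scratch tree; mktemp -d prints the name of a new, empty directory.
base=$(mktemp -d)
mkdir "$base/Documents"
mkdir "$base/Documents/Books"

cd "$base/Documents"          # absolute path: it starts with "/"
cd Books                      # relative path: resolved against the working directory
via_relative=$(pwd)

cd "$base/Documents/Books"    # the same destination, written as an absolute path
via_absolute=$(pwd)

cd ..                         # relative path to the parent directory
parent=$(pwd)

echo "relative: $via_relative"
echo "absolute: $via_absolute"
echo "parent:   $parent"
```

The first two lines of output should be identical, since both cd commands end up in the same place.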

Making & inspecting directories

To inspect the contents of your current directory, simply type ls:

[vchariso@vchariso-pc ~/Documents]$ ls
file.txt  Folder  program.sh

This shows that my Documents directory contains 2 files and 1 folder: file.txt, program.sh, and Folder. In fact, Folder is also a type of file (but a special one!).

However, unless your terminal environment uses colors or a special font to indicate different types of files, the output of the above ls command does not give you any information about whether or not Folder is a folder or just a terribly-named ordinary file. To get this type of information, you can invoke ls with an extra argument:

[vchariso@vchariso-pc ~/Documents]$ ls -F
file.txt  Folder/  program.sh*

I read about the -F argument on the ls manual page. Here, an indicator is appended to the file name to indicate its type. For example, a "/" is appended to "Folder" to indicate that it is an actual directory, and a "*" is appended to program.sh to indicate it is an executable file (i.e. a program).

Another option is to use ls -l, which prints a lot more information:

[vchariso@vchariso-pc ~/Documents]$ ls -l
total 4
-rw-r--r-- 1 vchariso   vchariso    0 Jan 25 22:00 file.txt
drwxr-xr-x 2 vchariso   vchariso 4096 Jan 25 22:01 Folder
-rwxr-xr-x 1 vchariso   vchariso    0 Jan 25 22:00 program.sh

The above will print more detailed information, including the date and time each file was last modified, its user and group owners, and its size in bytes. (Here, "total 4" is not a byte count: it is the total number of disk blocks, typically 1 KiB each, allocated to the listed files.)

File permissions

What about the weird -rw-r--r-- bits at the beginning of the first line? This part is a sequence indicating the file's permissions - make sure to consult the table here! You can see that Folder has a d character in its permissions, which indicates it is a directory. Also, note that the x bit means that a file is executable, which shows us that program.sh is executable (even though it seems to be empty).

Somewhat (un?)surprisingly, UNIX allows you to change a file's permissions (as long as you are the owner of the file). For example, you might want program.sh to not be executable until you have inspected its contents (again, forget that it is empty for now). To do so, you can use the chmod command. Consider the following two calls:

[vchariso@vchariso-pc ~/Documents]$ chmod -x program.sh
[vchariso@vchariso-pc ~/Documents]$ chmod +x program.sh

The first command removes the executable mode bit from program.sh, while the second one adds it. Another (advanced) way of using chmod is specifying the permission bits explicitly:

[vchariso@vchariso-pc ~/Documents]$ chmod u=rw,g=r,o=r program.sh

Here, "u", "g", and "o" stand for "user", "group", and "other". The above indicates:

the user that owns the file can read and write (i.e. edit) it
any additional users that are in the group that owns the file can only read it (only makes sense if the group contains more than just the current user)
any other users (i.e. not in the current group) can only read the file
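
As a rough sketch of how these forms relate, the commands below set the same permissions symbolically and with an octal number, then toggle the executable bit. Everything happens in a scratch directory created with mktemp -d (an assumption about your system), the file name program.sh mirrors the example above, and the perms variable is only illustrative.

```shell
cd "$(mktemp -d)"               # scratch directory, so no real files are touched
touch program.sh                # create an empty file

chmod u=rw,g=r,o=r program.sh   # symbolic form: one clause per class of user
ls -l program.sh                # the permissions column starts with -rw-r--r--

chmod 644 program.sh            # octal form of the same permissions (rw- is 6, r-- is 4)
ls -l program.sh

chmod +x program.sh             # add the executable bit
ls -l program.sh
chmod -x program.sh             # and remove it again

# Keep only the first 10 characters of the listing, i.e. the permission string.
perms=$(ls -l program.sh | cut -c1-10)
echo "final permissions: $perms"
```

The octal form is compact but harder to read; the symbolic form spells out which class of user gets which permission.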

To make a directory, you can use the command mkdir (from MaKe DIRectory):

[vchariso@vchariso-pc ~]$ mkdir test
[vchariso@vchariso-pc ~]$ cd test
[vchariso@vchariso-pc ~/test]$

By default, mkdir will only create directories with one level of nesting, i.e.

[vchariso@vchariso-pc ~]$ mkdir test/test

will succeed if there already is a directory called test in your working directory, and create another directory called test inside the former one. But if the parent directories do not exist yet, mkdir will fail with an error message:

[vchariso@vchariso-pc ~]$ mkdir one/two/three
mkdir: cannot create directory 'one/two/three': No such file or directory

How do we get around this issue? We'll ask the manual pages for help! Typing man mkdir will open up the manual page of mkdir, where you will see that you can use the -p argument if multiple directories need to be created. The following will work:

[vchariso@vchariso-pc ~]$ mkdir -p one/two/three
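
A quick way to check this behavior for yourself, sketched in a scratch directory (mktemp -d is assumed to be available):

```shell
cd "$(mktemp -d)"          # scratch directory
mkdir -p one/two/three     # creates one, one/two and one/two/three as needed
mkdir -p one/two/three     # running it again is not an error
ls -F one/two              # lists: three/
```

According to the mkdir manual page, -p also suppresses the error for directories that already exist, which is why the second call succeeds quietly.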

There is a wealth of shell commands we will encounter in the coming weeks, and it's completely normal to feel overwhelmed at the moment. For now, I encourage you to set up your Linux machine and browse around using the shell to get a feel for it. Think about tasks you ultimately want to accomplish (e.g. maybe a text processing pipeline) and try to find some commands that could help you do it using apropos and read about their usage using man.

Here are some quick exercises to get you started:

Exercise 1: Which command would you use to list the contents in the current directory, sorted by increasing order of file size? (Hint: man ls)

Solution

The following should work: ls --sort=size -r. The first argument instructs ls to sort contents by file size, and the second argument (-r) to reverse the order of the result.

Exercise 2: Suppose your current directory contains two files called test1.txt and test2.txt. You type the following commands:

$ touch test.txt
$ ls -lt

Which file do you expect to appear first in the directory listing? Why? (Hint: look up ls and touch)

Solution

The touch command updates the access and modification times of test.txt (and creates it if it did not exist before, as is the case here). Because the -t flag tells ls to sort by modification time, newest first, test.txt will appear first.

Exercise 3: The whatis command displays one-line manual page descriptions. You are curious about what printf does, and decide to look it up. You get the following output:

$ whatis printf
printf (1)           - format and print data
printf (1p)          - write formatted output
printf (3)           - formatted output conversion
printf (3p)          - print formatted output

What do these numbers indicate? Can you tell which of these printfs is the one used by the shell?

Solution

These numbers indicate different parts of the manual. According to the output of man man, the first section of the manual contains pages for executable programs or shell commands, while the third section is about library calls. Therefore, the first two candidates are about the printf used by the shell.

Writing scripts

Here comes the fun part - our first script! Open a file in your distribution's text editor, name it example.sh, and write the following:

#!/bin/bash

echo "Hello World!"

There are two ways to run this program. The first, and more straightforward one, is to open up a terminal and navigate to the directory containing example.sh, and type bash example.sh. The other way is to run

$ chmod +x example.sh
$ ./example.sh
Hello World!

What is this doing? The first line adds the x (execute) permission to the file's mode, which makes it executable. The second line instructs the shell to execute example.sh. We prepend ./ to the filename to indicate that the file is located in the current working directory; without it, the shell would only search the directories listed in the PATH variable, which normally do not include the working directory.

But how does the shell know that this is a bash executable? That's because of the first line in your script:

#!/bin/bash

This is an interpreter directive (often called a "shebang"), which tells the operating system that this file is intended to be run by the executable found at /bin/bash (i.e. the bash shell itself).
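
If you prefer to stay in the terminal, the whole experiment can be sketched as below. It assumes that bash is installed at /bin/bash and that mktemp -d is available; the "here document" syntax (<<'EOF' ... EOF) is not covered in these notes and is used only to create the file without an editor.

```shell
cd "$(mktemp -d)"       # scratch directory

# Everything between the two EOF markers is written to example.sh verbatim.
cat > example.sh <<'EOF'
#!/bin/bash

echo "Hello World!"
EOF

bash example.sh         # first way: pass the file to bash explicitly
chmod +x example.sh
./example.sh            # second way: run it directly, relying on the #! line
```

Both invocations should print the same greeting.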

Variables

Now that you wrote your first script, let us look into different syntactical constructs you can use. Arguably one of the most important ones is defining and using variables. To define a variable, you use the format <NAME>=<VALUE>. For example:

$ myvar=10
$ anothervar=hello

Shell variable names start with a letter or underscore and may contain any number of following letters, digits, or underscores. By default, bash interprets all variable values as strings, unless you explicitly declare them differently. There are 4 variable types in bash:

string variables (default)
integer variables
constant variables (i.e., read-only after they are declared)
array variables (rarely encountered in practice; not all shells support them)

To access / retrieve a variable's value, you need to add the "$" symbol in front of the variable name:

$ echo $myvar
10
$ echo $anothervar
hello

When assigning a value that contains spaces, use quotes:

$ anothervar="hello world"
$ echo $anothervar
hello world
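
The four variable types listed earlier can be sketched as follows. This block requires bash, because declare -i and arrays are not part of the POSIX shell, and all variable names are made up for illustration.

```shell
#!/bin/bash
# Requires bash: declare -i and arrays are bash features.

text=5                    # string (the default): "5" is just a character
text=$text+1
echo "$text"              # prints: 5+1

declare -i count=5        # integer: assignments are evaluated arithmetically
count=count+1
echo "$count"             # prints: 6

readonly answer=42        # constant: a later "answer=43" would be rejected
echo "$answer"            # prints: 42

fruits=(apple banana cherry)     # array, indexed from 0
echo "${fruits[1]}"              # prints: banana
echo "${#fruits[@]}"             # prints: 3 (the number of elements)
```

Note the contrast between the first two cases: without the integer declaration, "+1" is simply appended to the string.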

Quoting and variable substitution

We saw that $ is used to access a variable's content. The process of doing so is called variable substitution. Examine the three versions below:

$ echo $myvar           # output: 10
$ echo '$myvar'         # output: $myvar
$ echo "$myvar"         # output: 10

The above reveals two different types of quoting:

strong quoting, i.e. using single quotes. In this case, no variable substitution takes place.
weak quoting, i.e. using double quotes. Weak quoting does not interfere with substitution.

Generally, it is recommended to use weak quoting (i.e., write "$myvar"), especially when the content of a variable might contain whitespace. See here for a discussion.
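
To see why whitespace matters, here is a small sketch (the variable name greeting is only illustrative). When a variable is left unquoted, the shell splits its value into separate words before echo ever sees it.

```shell
greeting="hello     world"    # five spaces between the words

echo $greeting      # unquoted: the value is split into two words,
                    # so echo prints them with a single space: hello world
echo "$greeting"    # weak quoting: the value is kept intact, spaces and all
echo '$greeting'    # strong quoting: prints the literal text $greeting
```

The same splitting happens when a variable holds a filename containing spaces, which is the main practical reason to quote.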

Shell variables are mutable, which means you can update their values after you have defined them, as we did with anothervar above. If a variable is not defined, its value is the so-called null value, and accessing it returns nothing:

$ echo $undefined_variable
<a blank line should be printed here>

Note: variables you define are not persistent across shell sessions. If you close your terminal after the above commands are issued and type echo $myvar, you will get a blank line. Even if you call a bash script from bash that tries to access this variable, it will not find it unless you explicitly export it. To convince yourselves, write a script called check.sh like below:

#!/bin/bash
echo "The value is: $myvar"

and then open up a shell, navigate to the directory containing check.sh, and type:

$ myvar=10
$ bash check.sh
The value is:

On the other hand, if you explicitly make the value of myvar available, scripts that you invoke from this shell will be able to access it:

$ myvar=10
$ export myvar
$ bash check.sh
The value is: 10

Keep that in mind when thinking about what the scripts you write will try to access. In fact, it is always a better idea to make sure that your scripts accept all the required information in terms of command-line arguments, which we examine below.

Special variables

There is a number of "special" variables in bash, whose values you may access but not set. These variables typically hold function / script parameters, process IDs, and so on. For example, having echo $1 inside a script will output the first positional parameter the script was called with. Consider the following script:

#!/bin/bash
# example.sh
echo "The first argument was: $1"

You should expect the following output:

$ bash example.sh first_param
The first argument was: first_param
$ bash example.sh 1 2 3
The first argument was: 1
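
A few more of these special variables can be explored directly at the prompt. The sketch below uses set --, which replaces the positional parameters of the current shell, to imitate a script called with three arguments; the argument values are made up.

```shell
# Imitate a call such as: bash example.sh alpha beta gamma
set -- alpha beta gamma

echo "first argument:      $1"     # alpha
echo "third argument:      $3"     # gamma
echo "number of arguments: $#"     # 3
echo "all arguments:       $*"     # alpha beta gamma
echo "ID of this shell:    $$"     # a process ID; the value differs on every run
```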

You can read more about internal variables in bash here.

Control Flow

The term "control flow" refers to constructs that decide which statements run and how often, such as if statements and loops. Bash supports both of these constructs. An if statement in bash looks like the following:

# example.sh
if <condition_1>
then
    <statements_1>
elif <condition_2>
then
    <statements_2>
else
    <more_statements>
fi

Note 1: You always need to add the then keyword after an if or an elif.

Note 2: The elif ... and/or else part is optional. As such, the following are all valid examples:

# example_noelse.sh
if <condition_1>
then
    <statements_1>
elif <condition_2>
then
    <statements_2>
fi

# example_noelif.sh
if <condition_1>
then
    <statements_1>
else
    <statements_2>
fi

# example_onlyif.sh
if <condition>
then
    <statements>
fi

Note 3: A common gotcha is to put the then keyword on the same line as the condition without separating the two:

if <condition> then
    <statements>
fi

The correct way to write this is to include a semicolon, as you would when writing two commands one after the other on the same line:

if <condition>; then
    <statements>
fi

Examples of conditions

We examined the skeleton of if statements above, so a natural question to ask is: what kind of conditions do we usually have? We already saw that bash treats the content of variables essentially as strings, so the answer is not obvious.

Indeed, the answer seems confusing at first: the <condition> blocks above are a sequence of statements. If the last statement executed exits successfully, the condition is met and we proceed to the "then" block. If the last statement executed does not exit successfully, the condition is not met, and we proceed to the next elif or else branch (if there is one). Here is an example:

$ if echo "hi"; then echo "hello"; fi
hi
hello

What just happened? The command echo "hi" was executed, and it exited successfully (printing "hi" along the way). Therefore the condition was met, and we proceeded to the "then" block of our if statement.

Note: All shell commands have an exit status code that they emit after their execution. By convention, successful execution returns a status code of 0. All non-zero status codes (1 through 255) are considered failures, and their meaning can vary depending on the program. Manual pages usually document the precise meaning of each status code for the program at hand.
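
You can inspect the exit status of the most recent command through the special variable $?. The sketch below reuses mkdir from earlier in a scratch directory (mktemp -d is assumed to be available); the variable names are illustrative, and 2> /dev/null merely hides the error messages.

```shell
cd "$(mktemp -d)"                    # scratch directory

mkdir one                            # succeeds
ok_status=$?                         # $? holds the status of the last command
echo "mkdir one   -> $ok_status"

mkdir a/b/c 2> /dev/null             # fails: the parent a/b does not exist
bad_status=$?
echo "mkdir a/b/c -> $bad_status"

if mkdir one 2> /dev/null            # fails too: one already exists
then
    branch="then"
else
    branch="else"
fi
echo "the if statement took the $branch branch"
```

The first status should be 0 and the second non-zero, and the if statement should end up in its else branch because its condition command failed.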

Of course, the example above was contrived. We usually want to test more interesting conditions. For that purpose, we commonly use the test command, which evaluates a unary or binary expression and exits with status 0 if the expression is true (and with a non-zero status otherwise).

Here are some examples involving the use of test:

# test if file.txt exists
test -e file.txt

# test if file.txt is a directory
test -d file.txt

# test if file.txt is readable
test -r file.txt

# test if file.txt is a regular file (i.e. not a directory or other special file)
test -f file.txt

# test if variable var1 is greater than variable var2, when interpreted as integers
test "$var1" -gt "$var2"

# same, but test if var1 is greater than or equal:
test "$var1" -ge "$var2"

# test if $var1 is equal to $var2 (interpreted as strings):
test "$var1" = "$var2"

You can find out more about possible conditions that you can check using test by checking the manual page: man test.
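
Putting test together with the if skeleton from earlier, a complete example might look like the sketch below. The file and variable names are made up, and the scratch directory comes from mktemp -d, which is assumed to be available.

```shell
cd "$(mktemp -d)"       # scratch directory
touch file.txt          # an empty regular file
mkdir Folder

target=Folder           # try file.txt or a made-up name here as well
if test -d "$target"
then
    kind="a directory"
elif test -f "$target"
then
    kind="a regular file"
else
    kind="missing, or some other kind of file"
fi
echo "$target is $kind"

var1=7
var2=12
if test "$var1" -gt "$var2"
then
    larger=$var1
else
    larger=$var2
fi
echo "the larger number is $larger"
```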

There is also a variant form of test, written [ ... ], which works identically to the test command:

# the two below are equivalent:
$ test <expression>
$ [ <expression> ]

Exercise 4: the spaces after [ and before ] above are important. Can you guess why?

Solution

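The output of type [ tells us that [ is a shell builtin command, and like any command, its name must be separated from its arguments by whitespace. If we omit the space after [, bash looks for a command called, for example, [-e and fails to find it. Likewise, ] must be passed as a separate, final argument, so it needs a space before it.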

For example, we can rewrite some of the above test commands using its variant form:

# test if file.txt exists
[ -e file.txt ]

# test if file.txt is a directory
[ -d file.txt ]

# test if file.txt is readable
[ -r file.txt ]

# test if file.txt is a regular file (i.e. not a directory or other special file)
[ -f file.txt ]

Logical operators

You can combine more than one expression in your test constructs, using the logical AND, OR, and NOT operators. For example:

[ -e file.txt -a -x parse_file.py ]

The above tests if file.txt exists and parse_file.py is executable (we use -a for the AND condition). However, it is recommended (for the sake of writing portable code) to use shell-level tests, using the !, &&, and || operators:

# file.txt exists and parse_file.py is executable
[ -e file.txt ] && [ -x parse_file.py ]

# file.txt is readable and parse_file.py is not a directory
[ -r file.txt ] && ! [ -d parse_file.py ]

# file.txt is a directory or a character special file
[ -d file.txt ] || [ -c file.txt ]

Shell-level tests also allow you to chain expressions, like below:

[ -d dir_name ] && cd dir_name

The above does the following: it first runs test -d dir_name. If it succeeds (dir_name indeed pointed to a directory), it runs the second command which changes the working directory to dir_name. Here is another demonstration:

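{ [ -d dir_name ] && cd dir_name; } || echo "dir_name is not a directory"

Here, we first check if dir_name points to a directory and make it our working directory if so; otherwise, we evaluate the other part of the OR (||) construct, which outputs "dir_name is not a directory". The braces group the first two commands; we use { ...; } rather than ( ... ) because parentheses would run cd in a subshell, leaving the working directory of the current shell unchanged.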
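
One detail of grouping is worth checking for yourself: commands inside ( ... ) run in a subshell, so a cd there does not change the directory of the shell you are typing in, whereas { ...; } runs them in the current shell. A minimal sketch (the variable names are only illustrative):

```shell
start=$(pwd)

( cd / )                   # parentheses: cd happens in a subshell ...
after_parens=$(pwd)        # ... so we are still where we started

{ cd /; }                  # braces: cd happens in the current shell
after_braces=$(pwd)        # now we are in /

echo "after ( ): $after_parens"
echo "after { }: $after_braces"
cd "$start"                # go back to where we began
```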

Exercise 5: Using the appropriate commands, write a one-liner that tests if file.txt is readable, and: * if it is readable, it displays its content. * if it is not, it changes its permissions so that it is readable.

Solution

The following command does exactly that:

[ -r file.txt ] && cat file.txt || chmod +r file.txt
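
A caveat about this idiom: A && B || C is not a true if/else, because C also runs when A succeeds but B fails (for instance, if cat itself reported an error). The sketch below shows the effect with the true and false commands, followed by an if/else version that avoids it. The scratch directory (from mktemp -d, assumed to be available) and the file content are made up for illustration.

```shell
# "true" always succeeds and "false" always fails.
result=$(true && false || echo "fallback ran")
echo "$result"               # prints: fallback ran

cd "$(mktemp -d)"            # scratch directory
echo "some text" > file.txt  # create a small readable file

if [ -r file.txt ]
then
    cat file.txt             # runs because the file is readable
else
    chmod +r file.txt        # would run only if the readability test failed
fi
```

For a quick one-liner the && / || form is fine; for anything where the middle command can fail, the explicit if/else is the safer choice.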

Looping

There are three looping constructs: for, while, and until.

The for loop iterates over a list of objects and executes the loop body for each object. For example:

for i in *.out
do
    cat "$i"
done

This loops over all files that end with .out in the current directory (see here and here for an explanation on how the * pattern-matching operator is used) and displays their content. The in ... part is optional; if you omit it, the shell will loop over all the command line arguments, if any:

for i
do
    <process arguments here>
done

Looping over ranges of integers is easy using the syntax below:

for i in {1..10}
do
    echo $i
done

Equivalently, you can use the seq command, which also allows you to specify increments. For example, the following will output 1 3 5 7 9 (on separate lines):

for i in $(seq 1 2 10)   # pattern: seq <start> <increment> <stop>
do
    echo $i
done

Note: here, we wrap the seq 1 2 10 command using $(...) because we want to capture its output and loop over it. Here is what happens if we don't use this syntax:

for i in seq 1 2 10
do
    echo $i
done
# output:
# seq
# 1
# 2
# 10
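
Only the for loop is illustrated in this section, so here is a minimal sketch of the other two constructs. while repeats its body as long as the condition succeeds, and until repeats it as long as the condition fails. The counter name and the $(( ... )) arithmetic expansion are not taken from the text above.

```shell
count=1
while [ "$count" -le 3 ]      # keep going while count <= 3
do
    echo "while: $count"      # prints 1, 2, 3
    count=$((count + 1))      # arithmetic expansion: add one to count
done

until [ "$count" -gt 5 ]      # keep going until count > 5
do
    echo "until: $count"      # prints 4, 5
    count=$((count + 1))
done
```

The until loop picks up where the while loop stopped, since count is 4 when the first loop ends.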

Exercise 6: Write a bash script that lists all files in the current directory that are not themselves directories, sorted in decreasing file size. You can assume that none of the filenames contain spaces.

Solution

There are two steps here: ls -S will print the contents of the current directory in decreasing file size, but will also include directories (which typically report a size of 4096 bytes). To filter them out, we will use the [ -d <file> ] test to exclude directories:

for f in $(ls -S)
do
    [ ! -d "$f" ] && echo "$f"
done

Note that if any of the filenames contain spaces, this loop will misbehave, because the output of ls is split into words at every whitespace character. You can read more about this here.
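
If sorting is not required, one way around the problem is to loop over the * pattern instead of the output of ls, since each matching filename then stays a single word. The sketch below is only an illustration: it gives up the ordering by size, the filenames are made up, and mktemp -d is assumed to be available.

```shell
cd "$(mktemp -d)"                  # scratch directory
touch "my notes.txt" plain.txt     # one of the names contains a space
mkdir Folder

count=0
for f in *
do
    if [ ! -d "$f" ]               # skip directories
    then
        echo "$f"
        count=$((count + 1))
    fi
done
echo "non-directories found: $count"     # 2
```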
