
Week 2: More shell, regex, version control

Part I: More shell

I/O redirection

Whenever you execute a command or script in Linux, three files are always open. These files are mapped to the standard input, standard output, and standard error streams (STDIN, STDOUT, STDERR). By default, STDIN is your keyboard and STDOUT is the terminal from which you are executing the command. For STDERR, it depends: error messages from shell commands typically appear on the terminal, but some programs direct their error log to an appropriate file.
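For example, if you ask ls to list both an existing and a nonexistent path (the path name below is made up), the directory listing is written to STDOUT while the complaint about the missing path goes to STDERR. Both appear on your terminal by default, but, as we will see shortly, they can be redirected independently:

$ ls . /this_path_does_not_exist
# the listing of the current directory goes to STDOUT;
# the "No such file or directory" message goes to STDERR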

Here are 3 examples of output redirection:

$ echo "foo" > file.txt
$ echo "bar" >> file.txt
$ :> file.txt

The first command outputs "foo", but the > operator redirects the standard output to file.txt instead of the terminal (which is why you won't see "foo" printed when you execute it). However, if the file was nonempty, its previous contents are lost. This is not the case with the second command: with >>, any output is appended to the file. Finally, the last command uses the : shell builtin, which is a null operator, and redirects its output to file.txt. Since : produces no output, this effectively empties the file.
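To convince yourself of the difference between > and >>, you can inspect the file after each step; the contents shown below are what you should expect to see:

$ echo "foo" > file.txt
$ cat file.txt
foo
$ echo "bar" >> file.txt
$ cat file.txt
foo
bar
$ :> file.txt
$ cat file.txt
# (no output: the file is now empty)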

Input redirection is done with the help of the < operator. Consider the read command, which reads standard input into variables (splitting the input into tokens on whitespace). For example:

$ cat file.txt
foo bar
$ read var_1 var_2 < file.txt
$ echo $var_1
foo
$ echo $var_2
bar

Another useful construct related to input redirection is what is known as a here document. It is best described using an example:

$ cat <<EndOfMsg
This is a line.
This is another line.
More lines might follow.
EndOfMsg
# output:
# This is a line.
# This is another line.
# More lines might follow.

In the above, everything between the first and second appearances of EndOfMsg gets redirected to the standard input of cat (cat expects the user to type an input string by default, but here input is redirected). For this to work properly, the closing EndOfMsg must appear on a line by itself. Finally, note the use of the << operator (instead of <).

Note: You can combine input and output redirection at will. Run the following examples on your terminal and see what they do:

$ echo "1 2 3" > file.txt
$ cat < file.txt > output1.txt
$ cat > output2.txt < file.txt
$ cat > output3.txt <<EndOfMsg
Foo
Bar
Baz
EndOfMsg
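Once you have run them, one way to check your understanding is to inspect the files they produced:

$ cat output1.txt
$ cat output2.txt
$ cat output3.txt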

Finally, it is also possible to redirect one of the three streams to another. The syntax is i>&j, where i and j are file descriptors. By default, 0 is STDIN, 1 is STDOUT, and 2 is STDERR. Here is an example, where we run the ls command, redirect its output to output.log, and redirect the standard error to the standard output (which has already been redirected to output.log):

$ ls -l > output.log 2>&1

In fact, you can use these file descriptors more generally. Some examples can be found here.
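Two more common patterns, using made-up path names: redirecting only STDERR to its own file, or discarding it entirely by sending it to /dev/null:

$ ls -l . /some_missing_path 2> errors.log   # the listing still appears on the terminal
$ ls -l . /some_missing_path 2> /dev/null    # error messages are silently discarded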

One of the advantages of the UNIX shell is composability: creating a pipeline of simple commands to accomplish complex tasks. This is why the last I/O redirection operator we will look at is the pipe, |. This operator "chains" input and output streams, connecting the STDOUT of the previous program to the STDIN of the next.

Here is a simple example that uses the grep command (introduced in more detail below). By default, grep <pattern> <input> searches for lines containing <pattern> in its input, which can be either STDIN or a list of files:

$ who | grep "vasilis"
vasilis tty7         2021-02-08 09:17 (:0)

Here, the output of who (a list of users currently logged into the system) becomes the standard input for grep, which searches for the pattern vasilis.

You can chain more than one pipe. For example, here we count the number of files in the current directory whose names contain the word "backup" using the wc utility; wc -l counts the number of lines in the standard input.

$ ls | grep "backup" | wc -l

Here is a similar example: suppose you have a list of contacts, contacts.txt, with lines in the format <FIRSTNAME> <SURNAME> <PHONENUMBER>. You want to find all the lines that contain the name "Johnson", sort them in increasing order according to first name, strip the phone numbers, and write the result to a file called johnsonses.txt:

$ grep "Johnson" contacts.txt | sort -k1 | awk '{print $1, $2}' > johnsonses.txt

Here, awk '{print $1, $2}' prints the first and second columns for every line in its standard input, which corresponds to the first and last names of all the Johnsonses. Similarly, you can collect all their phone numbers in a separate file:

$ grep "Johnson" contacts.txt | awk '{print $3}' > johnsonses_numbers.txt

Many nontrivial uses of piping arise in text processing. For that reason, we will now move on to regular expressions.

Part II: Regular expressions

Regular expressions (regex for short) are patterns that match characters in strings. They are a mix of "ordinary" characters (like substrings you wish to match exactly) and "special" characters that allow for repetitions, combinations, and other interesting features.

Regular expressions are supported by several languages and command-line tools. For example, the grep utility in UNIX allows you to probe files for patterns using regular expression syntax, the sed utility allows you to perform substitutions using regular expressions, and so on. Python also provides a module called re for working with regular expressions. During this part of the course, we will be putting on our UNIX hat and working with command-line tools, but feel free to use Python to practice regular expressions in your spare time.
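As a small taste of sed (the strings here are made up, and we will not rely on sed in what follows), the s/pattern/replacement/ command substitutes the first match of a pattern on each input line:

$ echo "I like to write code" | sed 's/code/regular expressions/'
I like to write regular expressions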

Setup & Basic Constructs

The most common use of regular expressions is filtering a collection of strings, keeping those that match a given pattern. Writing correct and unambiguous patterns is the essence of writing regular expressions.

Consider the simplest task of filtering through a set of strings, returning all those that contain the sequence "vasilis". For example:

$ cat example.txt
My name is vasilis and I like to write code.
My name is also Vasilis but I don't like to write code.
I spoke to Mateo and he told me about Python.
I spoke to vasilid and he taught me about regular expressions.
asdfasdfasdfasdfvasilisasdfasdf.

This file contains a collection of sentences (one per line), and we wish to output each line that contains the sequence "vasilis". To do so, we can write a simple grep command as follows:

$ egrep "vasilis" example.txt
My name is vasilis and I like to write code.
asdfasdfasdfasdfvasilisasdfasdf.

The grep utility works as follows: it treats its first argument as the pattern, and the second argument is typically the input file. It applies the pattern to each line in the file and prints all the lines that match. Note that patterns are case-sensitive; for example, the second line, which contains the word "Vasilis", was not matched because the leading "V" would need to be lower case to match the pattern. It is good practice to enclose the pattern in double quotes when using grep in a script.

Note: We are using egrep here for reasons that will be clarified later; namely, to make sure that meta-characters are treated as expected.

Character ranges

To circumvent the above problem (matching only "vasilis" but not "Vasilis") we introduce a fundamental construct: character ranges. If there is a part of your pattern where more than one character is acceptable, you can enclose the set of characters in square brackets:

$ egrep "[Vv]asilis" example.txt
My name is vasilis and I like to write code.
My name is also Vasilis but I don't like to write code.
asdfasdfasdfasdfvasilisasdfasdf.

When using a character range, there are some tricks to simplify the resulting pattern. For example, if you want to match any digit between 0 and 9, you can write [0123456789] or [0-9] - the two are equivalent. The same is true for [abcdefghijklmnopqrstuvwxyz] and [a-z]. If you want to be case-insensitive, you can also mix the two: [a-zA-Z] will match any letter between "a" and "z" as well as their capital versions.
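For a quick illustration (the input lines are made up), [0-9] matches any line that contains at least one digit:

$ printf 'route 66\nno digits here\nA1 paper\n' | egrep "[0-9]"
route 66
A1 paper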

A note of caution: whatever you put inside the brackets will be treated as a collection of characters to match (or not match), not as a string. For example, writing [vasilis] will match one letter from the set {a, i, l, s, v} rather than the string.

What if we want to exclude a set of characters from our pattern? In this case we can use the caret (^) inside the square brackets. For example:

$ egrep "vasili[^s]" example.txt
I spoke to vasilid and he taught me about regular expressions.

Here, we match all strings containing the sequence "vasili" immediately followed by any character other than "s". If there is more than one character you wish to avoid, you can add them all inside the same brackets after the caret:

$ grep "vasili[^ds]" example.txt

Metacharacters

In the above, the brackets as well as the caret are so-called metacharacters, i.e., characters that take on a special function and meaning inside regexes. If we want to match the metacharacter itself, we typically add a backslash in front of it (referred to as "escaping" the character). Note the difference between the following two:

$ egrep "\[vasilis\]" example.txt
...
$ egrep "[vasilis]" example.txt

In the first example, we escape [ and ] in order to indicate that we want to treat them as ordinary characters and match the substring "[vasilis]". In the second example, we are not escaping them and instead end up with a character range that will match any character from the set {a,i,l,s,v}.

Note: forgetting to escape a metacharacter is one of the most common mistakes for newcomers to regular expressions. Make sure you remember the ones you learn!

Here is another metacharacter: the so-called Kleene star (*). The star operator indicates that the preceding character can be matched any number of times, including zero. Consider, for example, trying to match all strings of the form "hello", "helllo", "hellllo", etc. Here, the words we are looking for start with "he", followed by at least two "l" characters, and end with "o". The following will work fine:

$ cat example_star.txt
hello
helllo
hellllo
helo
$ egrep "helll*o" example_star.txt
hello
helllo
hellllo

Here, we are telling grep to match any string containing "hell" followed by any number of occurrences of "l", followed by "o". A similar operator to the Kleene star is the Kleene plus (+), which matches at least one occurrence of the preceding character (recall that * can match as few as zero of them). For example:

$ cat example_plus.txt
heo
helo
hello
$ egrep "hel+o" example_plus.txt
helo
hello
$ egrep "hel*o" example_plus.txt
heo
helo
hello

Another useful construct is specifying the number of occurrences explicitly. The general syntax for that is <character>{lower_bound,upper_bound}. For example:

$ egrep "hel{2,3}o" example_star.txt
hello
helllo

The above matched all strings starting with "he", followed by 2 or 3 "l"s, followed by "o". You can also omit the upper or lower bound:

$ egrep "hel{2,}o" example_star.txt
hello
helllo
hellllo

The example above matches at least 2 "l"'s. On the other hand, the command below matches at most 2 "l"'s:

$ egrep "hel{,2}o" example_star.txt
hello
helo

Note: omitting the lower bound will allow zero occurrences of the sub-expression to be matched. For example:

$ egrep "hel{,2}o" <(echo heo)
heo

Note that the curly braces are also metacharacters, as demonstrated below:

$ cat example_meta.txt
hello
hel{,2}o
$ egrep "hel{,2}o" example_meta.txt
hello
$ egrep "hel\{,2\}o" example_meta.txt
hel{,2}o

Exercise: Write a regex matching US-style phone numbers, i.e., a 3-digit area code followed by a dash and 7 more digits. Note the first digit in the area code cannot be zero.

Solution
$ egrep "[1-9][0-9]{2,2}-[0-9]{7,7}" <file>

Here is one more: the "optional" metacharacter. Consider the following scenario: you are profiling a piece of code and generate a log file that reports how many function calls were performed during a test run. You wish to match lines that look like

24 calls found.
3  calls found.
1  call found.

Here, you decide to match any lines that contain "call", optionally followed by one "s" character. Two equivalent ways to do it:

$ egrep "calls{0,1} found" output.log
$ egrep "calls? found" output.log

Here, "?" applies to the preceding character and indicates that we should try to match "call" or "calls" (whichever produces a successful match).

Exercise: write a regular expression that matches a string starting with "a", followed by any sequence of letters, followed by at most 1 number, and ending in "z".

Solution

The following will work: a[a-zA-Z]*[0-9]?z
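To test it (the input lines are made up), note that "abc55z" is rejected because two digits appear before the final "z":

$ egrep "a[a-zA-Z]*[0-9]?z" <(printf 'abcz\nabc5z\nabc55z\naz\n')
abcz
abc5z
az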

Conditional matches

This is another useful construct: suppose you are parsing a file containing paths to other files and want to list all image files that end in .jpeg or .png. Naively, you can write a regular expression that matches all ".jpeg" substrings, another that matches all ".png" substrings, and append the output of both to a file:

$ egrep "\.jpeg" paths.log >> output.txt
$ egrep "\.png" paths.log >> output.txt

Notice that we are escaping the dot, since it is also a metacharacter (it matches any single character). Since either ending is an acceptable match, you can instead use the following:

$ egrep "\.(jpeg|png)" paths.log

Optionally, since some endings might be capitalized, you can use the -i flag of the egrep command to ignore case. This will also match, e.g., a line containing file.PNG.
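For example, using the same paths.log file:

$ egrep -i "\.(jpeg|png)" paths.log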

More resources

You can find useful overviews of regular expression syntax here. Beyond grep and egrep, two programs that use regular expressions regularly (pun unintended) are sed and awk. You can find cheatsheets for both online.

Part III: git and version control

Have you ever found yourself naming your files script.py, then script_1.py, script_2.py (or, even worse, script_1(1).py) because you want to be able to go back to the previous version in case anything goes wrong? If you are this person, git is the tool for you.

Basic workflow

The git workflow mostly adheres to the following pattern:

  1. Create a new project (either local or remotely in a code repository)
  2. Make incremental changes to the project (e.g. add / edit / remove files)
  3. "Commit" the last batch of changes (with a message summarizing what they change in the project)
  4. "Push" the changes to a remote repository
  5. Repeat steps 2-4 until the project is completed (or abandoned :))

There are several variations to this, and the way people implement each step depends on the nature of the project. For example, if you are working at a software company, you likely want to maintain several "views" of the project:

  • A "stable" view, containing the version of the code that you serve to your customers. This code contains the implicit promise that is well-tested and free of any known software vulnerabilities.
  • A "testing" view, which is a version of your software that is experimental but stable enough to offer to the public, that acts as a beta-tester. This code is not bug-free, but having several users try it is key to finding any additional bugs.
  • A "development" view (or more!), where new features are currently implemented (a work in progress). Typically, this view is intended to be used by experienced users and other developers, but not the end-user.

Git offers tools that make this workflow remarkably easy (via the concept of branches, which we will introduce soon).

Creating a repository, adding and committing changes

If you just installed git, you first need to set a username and an email. This is done via git config:

$ git config --global user.name <your_username_here>
$ git config --global user.email <your_email_here>

This means that you will be using this username and email for all your git projects / repositories. You can also create a local configuration that only applies to a particular project (e.g. if you need to use your company's email domain or a particular username, or any other project-specific settings beyond the username and email). You can read all about it with git config --help.
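For instance, a per-repository configuration is created by running git config without the --global flag from inside the repository; the values below are just placeholders:

# run from inside the repository
$ git config user.name "work_username"
$ git config user.email "your_name@company.com"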

All git projects are developed in so-called repositories. The most common use case is when you create a repository in an online service, such as Github, and then create a local copy for the computer you are working on. To make a local copy, you use the git clone command:

$ git clone <repository_address>.git
# or
# git clone <repository_address>.git <local_directory>

The first command above creates a local folder with the same name as the remote repository, while the second command specifies the name of the local directory (to be created). For now, we will assume that you are cloning a remote repository, which is the most common use case (if you are instead cloning a repository from a path on the same machine, you can use the --local option in your call).

How does git track changes? The git system maintains a local index of changes to files (also sometimes called the staging area). For example, if you changed a file, you use git add <file> (or git rm if you deleted something, or git mv if you renamed it) to record the changes to the staging area. You can repeat this process multiple times:

# change file1 (e.g. via an editor), record the changes
$ git add file1.txt
# change file2, record the changes
$ git add file2.txt
# delete a file completely, record the changes
$ git rm file3.txt
# rename a file to something else
$ git mv file4.txt new_file4.txt
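At any point you can run git status to see what is currently staged; for the sequence above, the output should look roughly like the following (the exact wording depends on your git version):

$ git status
On branch master
Changes to be committed:
	modified:   file1.txt
	modified:   file2.txt
	deleted:    file3.txt
	renamed:    file4.txt -> new_file4.txt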

At this point, the staging area records the changes to all four files. Now comes the important part: committing your changes. Whenever you git commit something, git creates a new snapshot of your project and assigns a unique identifier to it, called a commit hash.1

$ git commit -m "Added files 1 and 2, deleted file3, renamed file4 -> new_file4"

Here we see the git commit command in action. The -m flag specifies a commit message, which is a short summary describing the changes introduced by the new snapshot. It is recommended to make your commit messages as informative as possible, as they give you an idea of what changed in each snapshot without having to look at the code itself.
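You can view the history of commits (newest first), together with abbreviated commit hashes and messages, using git log; the hash below is made up, and older commits would follow:

$ git log --oneline
f3a2b1c Added files 1 and 2, deleted file3, renamed file4 -> new_file4
...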

To keep commit messages short but informative, it is a good idea to make a habit of committing changes in small chunks rather than introducing huge blocks of changes. For example, consider the sequence below:

# edit script.py
$ git add script.py
$ git commit -m "Added get_parameters(), get_info()"
# edit it some more
$ git add script.py
$ git commit -m "Fixed error in set_parameter()"
# rename it
$ git mv script.py utils/script.py
$ git commit -m "Move to utilities"

Contrast the above with the briefer (but messier) sequence:

# edit script.py
$ git add script.py
# edit it some more
$ git add script.py
# rename it
$ git mv script.py utils/script.py
$ git commit -m "Added some functions, fixed a typo and introduced new util"

In addition to being far more informative (at the cost of extra commit messages), the first sequence makes it easier to identify where a mistake happened by looking at the history (and reverting problematic changes is easier for the same reason). As a rule of thumb, you should create a new commit for every major change you make to a component of your code (that being said, you should not create a new commit for each new typo you find & fix).

Pushing and pulling changes

In keeping with our assumption that you are working with code on a remote repository, we now wish to push our changes to the remote repository so that other people can grab the updated code. This is the role played by git push:

$ git push <remote_name> <branch_name>

This pushes your code to the remote repository (identified by <remote_name>), updating the given branch <branch_name>. This publishes your local changes and makes them available to other users.
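In the most common setup (the conventional names origin and master are explained below), this looks like:

$ git push origin master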

Tip: Saving time

If your remote is pointing to an HTTPS address, you will be asked for your credentials every time you perform a git push. To avoid typing your password all the time, you can tell git to keep it in memory for a few minutes. To do so, type

git config --global credential.helper cache

By default, this keeps your credentials in memory for 15 minutes. To change the default duration, you can specify the time (in seconds):

git config --global credential.helper "cache --timeout 3600"

The above specifies that git should cache your credentials in memory for an hour (3600 seconds). Note the use of quotes here.

The names used most often are origin for the remote (by convention, the address from which you cloned the repository) and master for the branch (by convention, the "main" branch of your code). Below we introduce and explain these concepts in some more detail.

Branches

Branches are essentially different "paths in time" for your code. The "main" branch is called master by convention, and all other branches were derived from master at some point in time. The purpose of branches is best explained in terms of a development workflow.

Suppose you and your collaborator are working on a project and want to work on two different features at the same time. Since you will be working on your local copies of the project, you will be creating different snapshots that are interspersed with each other in time. In other words, if someone could see the snapshots of the project in the order in which you and your collaborator committed them, the order would not make sense. This has very real implications when you eventually both want to "push" your changes, since git does not know how to combine out-of-order changes (except in very special cases). This is because every commit has a parent commit2, and your commits and your collaborator's commits do not have consistent parents (except for your very first commit after you start working independently).

With branches, you and your collaborator can create parallel timelines and merge them at the end. For example, you create a branch called feature_1 and your collaborator creates a branch called feature_2. You create commits independently on the respective branches, and then you merge each of the two branches into master.

To work with branches, you typically use the git checkout command:

$ git checkout -b <new_branch>  # creates new branch
$ git checkout <existing_branch>  # switches to an existing branch
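Putting this together, a sketch of the feature-branch workflow described above might look like the following (the branch name feature_1 is just an example; git merge is how one branch's commits are brought into another):

$ git checkout -b feature_1      # create the feature branch and switch to it
# ... edit files, git add, git commit as usual ...
$ git checkout master            # switch back to the main branch
$ git merge feature_1            # merge the feature branch into master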

Branching is a very important feature of git, but things can get technical when explaining the mechanics behind them. Reading the reference tutorial for branching is a must for every git user.

Remotes

Remotes are repositories whose branches you are tracking. More often than not, you will only work with a single remote (the one where your project started). A common scenario for working with more than one remote is maintaining a code repository on multiple hosting services (e.g. Github and Bitbucket). Because working with multiple remotes is somewhat uncommon, this is a topic better deferred to the reference tutorial.
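You can list the remotes you are tracking (and their addresses) with git remote -v; for a freshly cloned repository, the output looks roughly like this (the address is a placeholder):

$ git remote -v
origin	https://github.com/<your_username>/<your_repo>.git (fetch)
origin	https://github.com/<your_username>/<your_repo>.git (push)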

Exercise

To get started with git, activate your Cornell Github account and create your first repository!


  1. The commit hash is derived from the content of the snapshot and related metadata, such as the author name, time of submission, parent snapshots, etc. 

  2. The only commits that have more than one parent are merge commits, which result from merging two branches.