Version control

Version control is software used to track modifications to a collection of files over time. The goal is not only to determine exactly who made which modifications, but also to have access to the full history of the project at any time.

This page is not meant to be a tutorial on using any particular version control system; there are excellent tutorials available online (as well as your system's man pages) to consult.

There are many version control systems (VCS) available today, and they can largely be categorized as either centralized or distributed version control systems. While we will focus our attention on one particular distributed version control system (DVCS) – Git – we will discuss several here that enjoy common use.

Why version control?

Bob does not use version control. Bob is coding with the understanding that his current local copy of the project is the only one that exists. This means that merely starting work on the project immediately destroys his only stable version. Small mistakes can cost additional hours to fix, and catastrophic errors can cost days, all because Bob cannot revert to a previous version. Bob is a rational actor, so knowing these things, he will err on the side of caution and avoid taking risks to improve his code. He may try to copy his project and store it elsewhere while he works on it, but the lack of documentation and labels can make it difficult for him to remember the differences between versions, and any additional organizational work creates annoying overhead. And if you think things couldn’t get worse, other people do not want to work with Bob because workflows without version control are unnecessarily restrictive. Bob is stuck in a quagmire. Don’t be like Bob; use version control.

Part I: Working locally

To understand the premise behind version control, one can use a directed acyclic graph to visualize a project’s revision history. We refer to a repository as the set of files in the project that are tracked with version control. At any point, one would be able to save the full current state of the repository as a checkpoint within the revision history. We refer to such a checkpoint as a commit (this can be used as a noun or a verb). In the graph representation, each commit is represented as a node. The root of the graph denotes the initial commit of the repository. Each arc is defined such that the child node is the next commit in the revision history with respect to the corresponding parent node.
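The graph description above can be made concrete with a small Python sketch. The commit names (c1 through c4) and the parents mapping are hypothetical and chosen only for illustration; a real version control system stores far more per commit.

```python
# Hypothetical revision history: each commit maps to the list of its parents.
# The root (initial commit) has no parents. Both "c3" and "c4" follow "c2".
parents = {
    "c1": [],
    "c2": ["c1"],
    "c3": ["c2"],
    "c4": ["c2"],
}

def history(commit):
    """Return the set containing `commit` and every commit that precedes it."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

print(sorted(history("c2")))  # ['c1', 'c2']
print(sorted(history("c4")))  # ['c1', 'c2', 'c4']
```

Storing parent links (rather than child links) is a design choice in this sketch: it makes "everything that led up to this commit" a simple walk toward the root.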

The strength of version control is that it allows you to go back in history and look at previous states of the repository. The version of the repository currently sitting in your project directory corresponds to a certain commit. The reference to this commit is called HEAD. Any reasonable version control software allows you to switch HEAD to any other commit in the revision history and load the files associated with that state. Another way of saying this is that you can check out another commit.
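As a rough illustration of HEAD and checking out, here is a toy Python model. The Repo class, its attribute names, and the file contents are all assumptions made for this sketch; it stores a full copy of every file per commit, which is far simpler than what real systems do.

```python
import copy

class Repo:
    """Toy repository: files are a dict of name -> contents."""

    def __init__(self):
        self.files = {}       # the working copy in the project directory
        self.snapshots = {}   # commit id -> saved copy of all files
        self.head = None      # id of the commit the working copy corresponds to

    def commit(self, commit_id):
        """Save the full current state as a checkpoint and move HEAD to it."""
        self.snapshots[commit_id] = copy.deepcopy(self.files)
        self.head = commit_id

    def checkout(self, commit_id):
        """Point HEAD at an existing commit and load its files."""
        self.files = copy.deepcopy(self.snapshots[commit_id])
        self.head = commit_id

repo = Repo()
repo.files["main.c"] = "int main(void) { return 1; }"
repo.commit("first")
repo.files["main.c"] = "int main(void) { return 0; }"
repo.commit("second")

# Go back in history: the working copy now matches the first commit again.
repo.checkout("first")
print(repo.head, repo.files["main.c"])
```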

The simplest kind of revision history is where one person adds features to the project in a linear fashion. For the corresponding graph, this means that the in-degree and out-degree of each node are no more than one. But version control allows one more flexibility in determining workflow. For example, suppose you have a project in a stable state and want to add a feature. Ideally, you would be able to record the history of multiple tracks of development, or branches in Git terminology. The default branch for a newly created Git repository is traditionally called the master branch. Forcing yourself to always work and commit on the same branch makes life harder and takes away a lot of the flexibility afforded by version control systems. A more desirable approach entails creating a new branch and working on the new feature there, committing as you go. Just as you are able to check out previous commits, you can check out different branches as well, and HEAD will then point to the most recent commit of that branch (referred to as the head of that branch). So if you are working on the new feature, you are free to switch back to the original branch, and the tracked files will then reflect their original state in the master branch. Using the directed graph interpretation, a branch is simply a selected node along with all of its ancestors, that is, every node on a path from the root to the selected node.
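One way to picture branches is as named pointers into the commit graph. The sketch below assumes a history without merges, so each commit has at most one parent; the commit ids, the branch name feature, and the helper functions are hypothetical.

```python
parents = {"a": None, "b": "a"}      # commit id -> parent id
branches = {"master": "b"}           # branch name -> head commit of that branch
current_branch = "master"

def create_branch(name):
    """Start a new branch at the commit the current branch points to."""
    branches[name] = branches[current_branch]

def checkout(name):
    """Switch to another branch; HEAD is then the head of that branch."""
    global current_branch
    current_branch = name

def commit(commit_id):
    """Add a commit on top of the current branch and advance only that branch."""
    parents[commit_id] = branches[current_branch]
    branches[current_branch] = commit_id

def branch_commits(name):
    """All commits on a branch: its head plus every ancestor back to the root."""
    result, node = [], branches[name]
    while node is not None:
        result.append(node)
        node = parents[node]
    return result

create_branch("feature")
checkout("feature")
commit("c")
commit("d")
checkout("master")

print(branch_commits("master"))   # ['b', 'a']
print(branch_commits("feature"))  # ['d', 'c', 'b', 'a']
```

Committing on the feature branch leaves master untouched, which is why switching back restores the original state.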

When you reach a point where the project with the added feature is stable again, you would like to be able to incorporate that work back in the original master branch. This is referred to as merging one branch into another. Suppose you are merging the feature branch into the master branch of your repository. In the graph interpretation, if there exists a path from the head of the master branch to the head of the feature branch, then merging just entails incorporating all the new commits into the master branch. This is equivalent to moving the head of the master branch up to the head of the feature branch, which is why this type of merge is known as a fast-forward merge.
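The fast-forward condition is a reachability test on the graph, which a short Python sketch can show. The history and the function names is_ancestor and try_fast_forward are assumptions for illustration, not part of any real tool.

```python
# commit id -> list of parent ids (hypothetical history)
parents = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"]}
branches = {"master": "b", "feature": "d"}

def is_ancestor(older, newer):
    """True if `older` can be reached from `newer` by following parent links."""
    stack, seen = [newer], set()
    while stack:
        node = stack.pop()
        if node == older:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False

def try_fast_forward(target, source):
    """Move `target` up to `source` when the head of `target` is its ancestor."""
    if is_ancestor(branches[target], branches[source]):
        branches[target] = branches[source]
        return True
    return False

print(try_fast_forward("master", "feature"))  # True
print(branches["master"])                     # d
```

When the test fails, the two branches have diverged and a true merge is needed instead.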

But as any seasoned veteran of version control systems knows, life isn’t always that simple. Suppose at some point during the development of your new feature you found a bug in the original code in your master branch. Nothing is stopping you from fixing it right away in the master branch, and that’s the added benefit of using version control. Keep in mind that the bug is still present in your new feature branch, so when you merge later on, you might have to reconcile the differences between the two versions, especially if they heavily impact each other. If the differences only occur in non-overlapping parts of the project, the merge will be simple and everything will be automatic. However, if there is some amount of overlap, then you need to resolve these conflicts one by one. There are programs called merge tools that can help you with this.
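The difference between an automatic merge and a conflict can be sketched as a three-way comparison against the common starting point. This sketch works at whole-file granularity, which is a simplifying assumption (real tools compare regions within files); the merge function and the file names are hypothetical.

```python
def merge(base, ours, theirs):
    """Three-way merge of {filename: contents} dicts at whole-file granularity.

    Returns (merged, conflicts). A file conflicts when both sides changed it
    relative to `base` and ended up with different contents.
    """
    merged, conflicts = {}, []
    for name in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(name), ours.get(name), theirs.get(name)
        if o == t:            # both sides agree (including both unchanged)
            result = o
        elif o == b:          # only their side changed the file
            result = t
        elif t == b:          # only our side changed the file
            result = o
        else:                 # overlapping changes: needs manual resolution
            conflicts.append(name)
            continue
        if result is not None:   # None means the file was deleted
            merged[name] = result
    return merged, conflicts

base = {"main.c": "v1", "util.c": "v1"}
master = {"main.c": "v1 + bug fix", "util.c": "v1"}
feature = {"main.c": "v1", "util.c": "v1 + feature"}

print(merge(base, master, feature))
# ({'main.c': 'v1 + bug fix', 'util.c': 'v1 + feature'}, [])

feature["main.c"] = "v1 + different edit"
print(merge(base, master, feature))
# ({'util.c': 'v1 + feature'}, ['main.c'])
```

In the first call the changes do not overlap, so the merge is automatic; in the second, both branches edited main.c and the file is reported as a conflict.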

This concludes a crude introduction to the premise behind version control systems (with a strong bias towards Git), without getting into the commands behind them. There are many great tutorials online for whichever version control system you want to use, and I strongly encourage you to check them out.

See the directed graph for yourself

In Git, you can see your revision history as a graph with the command git log --graph. At the time of this writing, the command git log --graph --oneline produces the following output:

* bc630a3 Minor language changes.
*   50787e8 Merge branch 'master' of ssh://bitbucket.org/prsteele/orie-6125-sp2016
|\
* | 291163d Re-ordered the syllabus.
|/
*   cf0424e Merge branch 'vc'
|\
| * d749d42 Updated the commit hash.
| * 889c83c Updated the version control writeup.
* | 582b13f Improved the IEE 754 section.
|/
* 8b4b680 Added a commit reference.
* d2bd989 Added venv to the gitignore.
* 0d4bea5 Working on the version control section.
* a555334 Working on IEEE754.
* a2f463f Splitting off architecture and IEEE 754.
* 339a649 Improved the architecture writeup.
* 57450c5 Working on arch.
* ce83730 Removed the remainder of the old build-systems.
* c967e14 Added links to the syllabus.
* f025200 Added a bit on benchmarks.
* a3e8952 Progress on build systems.
* be0904e Working on build systems.
* ab6accf Working on getting a skeleton up there.
* 941dc18 Testing seems to be somewhat finished.
* 07c1979 Added a bit on hypothesis testing.
* 3f79e77 Progress on the testing writeup.
* 495f803 Cleaned up the testing section.
* 26e45a5 Moving to Sphinx.
*   3e5018e Merge remote-tracking branch 'origin/master'
|\
| *   7ddeee9 Merge branch 'build-systems'
| |\
| * | a884188 Initial commit.
|  /
* | 9060111 Progress on the syllabus.
* | 774846e Messy.
* | ac05c19 Working on the testing writeup.
* | 80ea2d0 Minor fixes.
|/
* 04ded6c Reasonable first pass.
* 095e8c9 Fixed a silly bug in main.c.
* 0872b41 Added an ignore for SCons.
* cbff3d0 Working on the writeup.
* c7b19e7 Working on a lecture for build systems.

As you can see, the history is mostly linear except for some branching and merging. (If you are interested, the commit that generated this history is bc630a313a193b045eeb98c0fad620d894db02ef; you can see the first 7 characters of this commit in the first line of the log.)

Part II: Working with remotes

Up until now, we assumed that all work was present only on the local machine. But in most situations, it is desirable to set up a remote repository that keeps track of the changes you make. This is helpful so you (and others with your permission) can access your own work from different places.

Many times, a repository will already exist, and you will want to be able to make a local copy of the repository to enable yourself to run the code or even make changes. This is called cloning the repository. Similarly, if you already have a copy of the repository, you might find that your version is out-of-date with respect to the main branch of the remote repository. To download the most recent commits, you perform a pull operation. Pulling entails two actions: first, the remote repository is fetched, meaning that all new commits to the branch you are pulling are copied into your local repository, where Git records them on a remote-tracking branch (such as origin/master). This branch is then merged into the local branch you have. If you made any commits before pulling, they will have to be reconciled with the commits from the remote branch when merging. Lastly, if you wanted to update the remote repository with your own commits (assuming you have permission), you perform a push operation. If your local branch is even slightly out-of-date and is missing commits from the remote branch, you will need to pull first, resolve any conflicts, then push again. Otherwise, the remote branch will now have all your commits and will be updated to your local version (similar to a fast-forward merge).
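The rule for when a push succeeds can be sketched in a few lines of Python. The sketch assumes a merge-free history (one parent per commit), and the commit ids and the functions contains and push are hypothetical names used only for illustration.

```python
# commit id -> parent id (hypothetical histories sharing commits "a" and "b")
parents = {"a": None, "b": "a", "c": "b", "x": "b"}

def contains(head, commit):
    """True if `commit` is `head` or one of its ancestors."""
    node = head
    while node is not None:
        if node == commit:
            return True
        node = parents[node]
    return False

def push(local_head, remote_head):
    """Return the new remote head, or raise if the local branch is out of date."""
    if not contains(local_head, remote_head):
        raise RuntimeError("rejected: pull and merge first")
    return local_head   # the remote simply moves up, like a fast-forward

remote = "b"
remote = push("c", remote)      # accepted: "c" builds on "b"
print(remote)                   # c

try:
    push("x", remote)           # "x" is missing commit "c" from the remote
except RuntimeError as err:
    print(err)                  # rejected: pull and merge first
```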

Of course, working with other people makes everything more complicated. In general, there are two different types of version control systems that deal with the complexities of team development quite differently.

Centralized version control systems

A centralized version control system has a notion of a privileged copy of the repository. Each person working on the project can check out copies of files from the privileged repository, make changes, and commit those changes back to the repository. Anyone checking out a copy of the modified files later will receive the updates.

There is an obvious problem here, which is what happens when two or more people try to modify the same file. As an example, Alice and Bob both check out the README file for a project they are working on at roughly the same time. Alice adds her name to the list of authors, and commits this change. This commit succeeds, because no one has modified the file since she checked out her copy. Bob also adds his name to the authors list, but when he tries to commit he will get an error, since he has not yet merged changes from the privileged repository made by Alice. Bob will need to somehow merge these changes.
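The Alice and Bob scenario can be mimicked with a toy model in which the privileged repository tracks a revision number per file. The CentralRepo class and its methods are assumptions made for this sketch and do not correspond to the interface of any particular system.

```python
class CentralRepo:
    """Toy privileged repository tracking one revision number per file."""

    def __init__(self, files):
        self.files = dict(files)
        self.revision = {name: 1 for name in files}

    def checkout(self, name):
        """Hand out the contents along with the revision they came from."""
        return self.files[name], self.revision[name]

    def commit(self, name, contents, base_revision):
        """Accept a change only if it was made against the latest revision."""
        if base_revision != self.revision[name]:
            raise RuntimeError(f"{name} is out of date; merge first")
        self.files[name] = contents
        self.revision[name] += 1

repo = CentralRepo({"README": "Authors:\n"})
alice_text, alice_rev = repo.checkout("README")
bob_text, bob_rev = repo.checkout("README")

repo.commit("README", alice_text + "Alice\n", alice_rev)    # succeeds
try:
    repo.commit("README", bob_text + "Bob\n", bob_rev)      # rejected
except RuntimeError as err:
    print(err)   # README is out of date; merge first
```

Bob would now check out the file again, reapply his change on top of Alice's, and commit against the new revision.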

One way that a centralized version control system can prevent this situation is file locking. When you want to make a change to a file, you check out a copy of this file and lock it; this only succeeds if the file is currently unlocked. The privileged repository will no longer allow anyone else to commit to the locked file until the first user commits her changes and relinquishes the lock. This can make working in large teams difficult.
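A minimal sketch of file locking, assuming a hypothetical LockingRepo class that only tracks who holds which lock (file contents are left out to keep the example short):

```python
class LockingRepo:
    """Toy central repository where a file must be locked before committing."""

    def __init__(self):
        self.locks = {}   # filename -> user currently holding the lock

    def lock(self, name, user):
        """Try to take the lock; succeeds only if the file is unlocked."""
        if name in self.locks:
            return False
        self.locks[name] = user
        return True

    def commit(self, name, user):
        """Commit a change and relinquish the lock; only the holder may do so."""
        if self.locks.get(name) != user:
            raise PermissionError(f"{user} does not hold the lock on {name}")
        del self.locks[name]

repo = LockingRepo()
print(repo.lock("README", "alice"))  # True
print(repo.lock("README", "bob"))    # False: Alice holds the lock
repo.commit("README", "alice")       # Alice commits and releases the lock
print(repo.lock("README", "bob"))    # True
```

The second call shows the cost of this scheme: Bob can do nothing with the file until Alice is finished.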

One advantage of centralized version control systems is that for large projects each team member need not have all the files on their system at once; whenever a file is needed it can be requested on-the-fly.

Notable centralized version control systems are CVS, SVN and Perforce. SVN is intended to replace CVS, and largely has; both are open source (CVS under the GPL, SVN under the Apache License). Perforce is a proprietary system.

There is a free book discussing how to utilize SVN.

Distributed version control systems

Unlike centralized version control systems, distributed version control systems have no notion of a privileged repository. Rather, each team member maintains a private copy of the full repository, and makes any changes they desire locally. When they are ready they can push changes to, or pull changes from, peer repositories. Note that it is still possible to have a centrally networked repository; however, this repository has no special status beyond each personal repository.

Advantages of distributed version control systems include speed and reliability. Since all changes are being performed locally, there is no network access required to commit changes (unless you are publishing said changes to a remote repository). Since each team member has a full copy of the repository, there is also a small amount of protection against data loss.

The most well-known distributed version control systems are Git and Mercurial. To first order, these systems are comparable, and choosing which to use might come down to personal preference (there are even extensions allowing Mercurial and Git repositories to interact).

There are many tutorials for Git online, but the standard reference materials are quite good. The man pages can be a bit intimidating if you don’t already know what you are doing, but are very useful if you are trying to remember an infrequently used command or flag.

Workflows

There are many ways to use version control effectively. You can choose any one you like, but I would suggest the “feature branch workflow” as it works nicely for collaboration in small groups.

Terminology summary

There are several basic operations that are common to most version control systems (with a strong bias towards Git).

Repository

A collection of files under version control.

Cloning

Making a (local) copy of a (remote) repository.

Checkout

Switching the contents of your working copy to a point in the project history.

Commit (noun)

A checkpoint in the revision history that records the full state of the repository.

Commit (verb)

Creating a checkpoint in the revision history. If you have created a set of changes to a collection of files and want to mark this work in the history, you commit your changes.

Branch (noun)

A line of development within the repository. There can be many branches in a single repository. For example, a new feature might be developed on a feature branch, while a stable copy of the working project might live on the master branch.

Branch (verb)

Creating a new branch in the repository.

Pulling

Fetching and merging changes from a remote repository with your local repository. This action should be performed whenever you want to get changes from someone else.

Pushing

Publishing changes from your local repository to a remote repository.

Merging

Bringing changes from one repository or branch into a local branch.