Test Driven Development

< crschmidt> No software is bug free

< FrankW> #/bin/sh

< FrankW> echo “Hello World”

< FrankW> That’s pretty bug free.

< crschmidt> FrankW: you missed a !

—Anon., bash.org

Introduction

A software test is a program that compares the actual output of another program or function with the expected output over a wide range of inputs; it then reports any discrepancies it finds. Test driven development (TDD) is a software design process focused on a development cycle consisting of three components (a short code sketch of the full cycle follows the list):

  1. Planning a new piece of functionality.

    This step involves writing tests (that currently fail) that encapsulate the desired functionality.

  2. Implementing the functionality.

    This step generally ends when all test cases pass, indicating that the new functionality works as expected.

  3. Refactoring the new code.

    We have code that achieves the desired functionality from step 2; however, this code might not fit the style of the project, be needlessly complicated, or run too slowly. Since we have tests from step 1, we can verify that our refactoring does not break the new functionality.
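
As a minimal sketch of this cycle in Python (the add function and the test are invented for illustration), we first write a failing test, then write just enough code to make it pass, and finally refactor with the test as a safety net:

# Step 1: write the test first; it fails because add does not exist yet.
def test_add():
    assert add(2, 3) == 5
    assert add(-1, 1) == 0

# Step 2: write the simplest implementation that makes the test pass.
def add(a, b):
    return a + b

# Step 3: refactor freely (rename, restructure, optimize) and rerun
# test_add after each change to confirm the behavior is unchanged.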

The impetus behind TDD is the idea that “testable code is good code.” That is, by designing your code to be testable you are likely to create code that is modular and handles global state in a sane manner. It would be very difficult to test code that stores all of its state in global variables that any function can update, since the behavior of any single function would then depend on state set up elsewhere in the program. Modular code is easier to write and reason about, since each function can be planned and analyzed separately.

While TDD is generally considered good practice, it does not need to be followed strictly. In practice you might often perform step 2 before step 1, especially if you are exploring a new problem and are not entirely sure what your solution will look like yet. However, you should still strive to write testable code, since you will want to come back to step 1 later on.

We will now discuss various concepts related to testing, namely unit tests, regression tests, test coverage, hypothesis testing, and benchmark (timing or stress) tests.

Unit tests

Perhaps the simplest test you can write is a unit test. The goal of a unit test is to test a small “unit” of code which generally has a simple goal or purpose. For example, if you were writing a calculator application, one possible unit test would be to test the add function. You generally try to write unit tests to cover all functionality of your program across a variety of inputs, including difficult corner cases or known error cases.

Conceptually a unit test (or collection of unit tests) is just a program that executes a number of tests and reports the results. However, unit tests are very common and a great deal of work has gone into making them easier to write. These unit testing frameworks deal with the boilerplate code of handling error reporting, timing tests, and selecting which tests to run.

Each unit test should be independent of the other tests being run. This means that, for example, the order in which you run tests should not affect their outcomes. For this reason most testing frameworks allow you to define setup and tear down functions that are run before and after each test to create a fresh testing environment. For example, if you were testing code that analyzes data stored in a database, your setup function might create a new database and populate it with a small set of sample data; the tear down function would then delete this database. This way, a unit test that modifies the database cannot affect another unit test that reads from it.

We will utilize the pytest module.
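
As a minimal sketch (the file name, functions, and sample database schema are ours), a pytest test is simply a function whose name begins with test_, and a fixture plays the role of the setup and tear down functions described above:

# Contents of test_example.py; run with `pytest` in the same directory.
import sqlite3

import pytest

def add(a, b):
    # Stand-in for the calculator code under test.
    return a + b

def test_add():
    assert add(2, 3) == 5

@pytest.fixture
def db():
    # Setup: build a fresh in-memory database with a small set of sample data.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE points (x REAL, y REAL)")
    conn.execute("INSERT INTO points VALUES (1.0, 2.0)")
    yield conn
    # Tear down: discard the database so the next test starts clean.
    conn.close()

def test_point_count(db):
    # Every test that requests the db fixture gets its own fresh database.
    assert db.execute("SELECT COUNT(*) FROM points").fetchone()[0] == 1

Because the fixture runs once per test that requests it, a test that inserts extra rows cannot affect a later test that only reads.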

Regression Tests

A regression test is a test, written after a bug is discovered, that triggers that bug. This ensures that future changes to the code do not re-introduce that particular bug.

When you discover a bug in your program, your first goal is likely to isolate a set of inputs or environmental factors that cause the bug to surface. You should then write a test that supplies these inputs and checks for the buggy behavior. Once you modify your program to fix the bug, the test should no longer fail.

The distinction between a unit test and a regression test is subtle. Almost every unit test could be considered a regression test, since it ensures that a specific piece of functionality is present. However, unit tests are generally designed to test small, atomic pieces of functionality, whereas a regression test might need to construct an elaborate sequence of inputs spanning many functions to trigger a bug.
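
As a hedged sketch (the function and the bug are invented for illustration), a regression test pins down the exact input that once exposed a bug:

def mean(values):
    # Suppose an earlier version crashed with ZeroDivisionError on an empty list.
    if not values:
        return 0.0
    return sum(values) / len(values)

def test_mean_of_empty_list():
    # Regression test for that bug: an empty input must not raise.
    assert mean([]) == 0.0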

Test Coverage

The test coverage of a test suite is the fraction of lines of source code that are executed by at least one test in the suite. Verifying coverage by hand would be tedious and error-prone; unsurprisingly, there are tools to compute this for us.

It is important to note that while “100% code coverage” is an admirable goal, attaining that goal does not imply that your program is correct. Don’t get sidetracked from writing good tests by chasing a number.
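
If you are working in Python, one common option (our suggestion, not the only tool available) is the coverage.py package, which can wrap a pytest run, e.g.

coverage run -m pytest    # run the test suite while recording which lines execute
coverage report -m        # print per-file coverage percentages and the missed lines

The pytest-cov plugin exposes the same information directly through pytest, e.g. pytest --cov=yourpackage, where yourpackage is a placeholder for the package you want measured.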

Hypothesis Testing

One downside of unit tests and regression tests is that we only test for scenarios that the programmer can think of and bothers to write. It would be wonderful if we could guard against a larger class of errors, which is what we accomplish with hypothesis testing.

When proving the correctness of an algorithm one often comes up with algorithmic invariants that are maintained throughout the execution of the program. For example:

  • The simplex algorithm only ever visits basic feasible solutions, and never degrades the objective value.
  • When merging sublists together, merge sort ensures that each sublist is itself sorted.

In addition to these internal invariants, we can often reason about the behavior of the function as a whole. For example:

  • Any stable sorting algorithm is idempotent; that is, sorting a list many times has the same outcome as sorting it once.
  • A function that reverses a list is its own inverse.

We can express these ideas as hypotheses or, if we can prove them, propositions.

A hypothesis testing library allows you to write tests that are supposed to succeed across all valid inputs, where you specify the family of inputs to draw from. The library will then test the hypothesis with a large number of random inputs. If it finds an input that causes the hypothesis to fail, it then tries to reduce that input to find a minimal counterexample to the hypothesis. A suite of hypothesis tests often amounts to formally specifying the full behavior of a function, and so serves as another form of documentation.

Hypothesis testing can be used when creating a new implementation of an existing algorithm. If you have a (possibly slow, possibly poorly written, but correct) implementation of an algorithm, a natural proposition is that the new algorithm produces the same output as the old algorithm. A particularly useful example of this would be writing algorithms for which there exists a natural but slow implementation that is easy to verify, and another implementation that is faster but more difficult to reason about.

Consider the MAXSAT problem. This problem is NP-hard, and so we might be interested in creating a fast approximation algorithm for it. To verify the approximation ratio, we could write an \(O(2^n)\) brute-force algorithm to exactly solve small instances of the problem and compare its results to those of the approximation algorithm.

A good motivating example for hypothesis tests in this course would be to generate new test data sets randomly to run our existing analysis code on. For example, you could write a hypothesis test of the form “the slope of a regression line doesn’t change too much if we add normal noise to all data points.”

If you are interested in writing hypothesis tests in Python you can use the hypothesis module.
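
As a minimal sketch (the reference sort and the test names are ours), here are two hypothesis tests: one checks the reverse-a-list property from the bullet list above, and the other compares Python's built-in sorted against a slow but easy-to-verify reference implementation, in the spirit of the previous paragraphs:

from hypothesis import given, strategies as st

def reference_sort(xs):
    # Deliberately naive insertion sort: slow, but easy to convince yourself it is correct.
    out = []
    for x in xs:
        i = 0
        while i < len(out) and out[i] <= x:
            i += 1
        out.insert(i, x)
    return out

@given(st.lists(st.integers()))
def test_reverse_is_its_own_inverse(xs):
    # Reversing a list twice should return the original list.
    assert list(reversed(list(reversed(xs)))) == xs

@given(st.lists(st.integers()))
def test_sorted_agrees_with_reference(xs):
    # The fast built-in sort should produce the same output as the slow reference.
    assert sorted(xs) == reference_sort(xs)

Running these under pytest, the hypothesis module generates, and if necessary shrinks, random lists of integers for us.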

Benchmark tests

Timing tests, stress tests, or benchmark tests are all designed to test the performance of code. While unit and regression tests can help ensure an “upgrade” to your code doesn’t break anything, if you are interested in tracking changes in execution speed you need to take measurements. A suite of benchmark tests can be used to formally track the changes in performance over time.

Unlike unit, regression, or hypothesis tests, there is no particular reason that a benchmark test needs to be written in the same language as the program itself (unless you are trying to benchmark internal functions, and not just the entire execution of the program). In fact, a simple way to run benchmarks is the time program, e.g.

time python3 myprogram.py

Note that bash has a built-in time command whose default behavior differs from /usr/bin/time, although they are very similar.
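
If you do want to time an individual Python function rather than a whole program, the standard-library timeit module is one option, e.g.

python3 -m timeit -s "data = list(range(10000))" "sorted(data)"

Here the -s flag supplies setup code that runs once and is excluded from the reported timing.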