Test Driven Development

< crschmidt> No software is bug free

< FrankW> #/bin/sh

< FrankW> echo “Hello World”

< FrankW> That’s pretty bug free.

< crschmidt> FrankW: you missed a !

—Anon., bash.org

Introduction

A software test is a program that compares the actual output of another program or function with the expected output over a wide range of inputs; the test then reports any discrepancies it finds. Test-driven development (TDD) is a software design process built around a development cycle consisting of three steps:

  1. Planning a new piece of functionality.

    This step involves writing tests (that currently fail) that encapsulate the desired functionality.

  2. Implementing the functionality.

    This step generally ends when all test cases pass, indicating that the new functionality works as expected.

  3. Refactoring the new code.

    We have code that achieves the desired functionality from step 2; however, this code might not fit the style of the project, be needlessly complicated, or run too slowly. Since we have tests from step 1, we can verify that our refactoring does not break the new functionality.

The impetus behind TDD is the idea that “testable code is good code.” That is, by designing your code to be testable you are likely to create code that is modular and handles state in a sane manner. Code that stores all of its state in global variables that any function can update is very difficult to test, since no single function can be exercised in isolation from the rest of the program. Modular code is easier to write and reason about, since each function can be planned and analyzed separately.
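
As an illustration (a hypothetical sketch; the function names are our own), compare the two styles:

# Hard to test: the function depends on and mutates hidden global state,
# so any test must carefully prepare and inspect that state.
balance = 0

def deposit(amount):
    global balance
    balance += amount

# Easier to test: all state is explicit in the arguments and return
# value, so the function can be exercised in isolation.
def deposit_pure(balance, amount):
    return balance + amount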

While TDD is generally considered good practice, it does not need to be followed strictly. In practice you may often perform step 2 before step 1, especially if you are exploring a new problem and are not entirely sure what your solution will look like yet. However, you should still strive to write testable code, since you will want to come back to step 1 later on.

We will now discuss various concepts related to testing, namely unit tests, regression tests, test coverage, hypothesis testing, and benchmark tests.

Unit tests

Perhaps the simplest test you can write is a unit test. The goal of a unit test is to test a small “unit” of code which generally has a simple goal or purpose. For example, if you were writing a calculator application one possible unit test would be to test the add function. You generally try to write unit tests to cover all functionality of your program across a variety of inputs, including difficult corner cases or known error cases.

Conceptually a unit test (or collection of unit tests) is just a program that executes a number of tests and reports the results. However, unit tests are very common, and a great deal of work has gone into making such tests easier to write. Unit testing frameworks handle the boilerplate of error reporting, timing tests, and selecting which tests to run.

Each unit test should be independent of the other tests being run. This means, for example, that the order in which you run tests should not affect their outcomes. For this reason most testing frameworks let you define setup and teardown functions that run before and after each test to create a fresh testing environment. For example, if you were testing code that analyzes data stored in a database, your setup function might create a new database and populate it with a small set of sample data; the teardown function would then delete this database. This way, a unit test that modifies the database cannot affect another test that reads from it.
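
As a sketch of this pattern using Python's built-in sqlite3 module (the users table and its contents are hypothetical, unrelated to the examples below):

import sqlite3
import unittest

class TestUserQueries(unittest.TestCase):

    def setUp(self):
        """Create a fresh in-memory database for each test."""

        self.db = sqlite3.connect(':memory:')
        self.db.execute('CREATE TABLE users (name TEXT)')
        self.db.executemany('INSERT INTO users VALUES (?)',
                            [('alice',), ('bob',)])

    def tearDown(self):
        """Discard the database after each test."""

        self.db.close()

    def test_count(self):
        """Each test sees exactly the sample data from setUp."""

        (count,) = self.db.execute('SELECT COUNT(*) FROM users').fetchone()
        self.assertEqual(2, count)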

We will explore the nose and unittest modules for unit testing Python code; nose actually extends the unittest module, adding nice functionality such as timing and test coverage calculations. These frameworks are conceptually similar to most unit testing frameworks in popular languages. The table below lists similar frameworks in various languages.

Language     Framework
Python       nose
Python       unittest
C++          Google Test
Java         JUnit
Haskell      HUnit
JavaScript   Jasmine

Example

We consider implementing a stack in Python. Conceptually, a stack supports the push and pop operations; we support several more for convenience. In particular, we want to support the following methods:

  • __init__(self)

    Create a new empty stack.

  • push(self, x)

    Push an object onto the top of the stack.

  • pop(self)

    Remove the top object from the stack and return it.

  • peek(self)

    Return the top object of the stack without removing it.

  • __len__(self)

    Return the number of objects on the stack.

  • clear(self)

    Remove all elements from the stack.

We can begin by writing unit tests to check the functionality of each of these methods.

import unittest

from stack import Stack

class TestStack(unittest.TestCase):
    """A collection of unit tests for the Stack interface.

    """

    def setUp(self):
        """Create an empty Stack for each test"""
        
        self.stack = Stack()

    def test_push_1(self):
        """Push a single element on the stack."""

        self.stack.push(0)

        self.assertEqual(1, len(self.stack))

    def test_push_2(self):
        """Push many elements on the stack."""

        for i in range(10):
            self.stack.push(i)

        self.assertEqual(10, len(self.stack))

    def test_pop_empty(self):
        """We expect an exception to be thrown if we pop an empty stack"""

        with self.assertRaises(ValueError):
            self.stack.pop()

    def test_pop_1(self):
        """Push, then pop."""

        self.stack.push(1)
        self.assertEqual(1, self.stack.pop())

    def test_pop_2(self):
        """Push many, then pop."""

        for i in range(10):
            self.stack.push(i)
        self.assertEqual(9, self.stack.pop())

    def test_pop_3(self):
        """Push many, and pop them all"""

        values = list(range(10))
        
        for i in values:
            self.stack.push(i)

        for i in values[::-1]:
            self.assertEqual(i, self.stack.pop())

    def test_len_1(self):
        """The length of an empty stack is 0"""

        self.assertEqual(0, len(self.stack))

    def test_len_2(self):
        """Check the length after many pushes"""
        
        size = 0
        for i in range(10):
            self.stack.push(i)
            size += 1
            self.assertEqual(size, len(self.stack))

    def test_len_3(self):
        """Check the length after pushes and pops"""

        self.stack.push(0)
        self.assertEqual(1, len(self.stack))
        self.stack.push(0)
        self.assertEqual(2, len(self.stack))
        self.stack.push(0)
        self.assertEqual(3, len(self.stack))
        self.stack.pop()
        self.assertEqual(2, len(self.stack))
        self.stack.pop()
        self.assertEqual(1, len(self.stack))
        self.stack.push(0)
        self.assertEqual(2, len(self.stack))
        self.stack.push(0)
        self.assertEqual(3, len(self.stack))

    def test_peek_1(self):
        """Push, then peek."""

        self.stack.push(1)
        self.assertEqual(1, self.stack.peek())
        self.assertEqual(1, len(self.stack))

    def test_peek_2(self):
        """Push many, then peek."""

        for i in range(10):
            self.stack.push(i)
        self.assertEqual(9, self.stack.peek())

    def test_peek_nullipotent(self):
        """Peek cannot change the stack"""

        self.stack.push(0)

        for i in range(10):
            self.assertEqual(0, self.stack.peek())

By giving the class a name that begins with Test, we allow nose or unittest to automatically discover and run our tests. We inherit from unittest.TestCase to gain access to the assert* methods; see the unittest documentation for details.

We can run these tests with

$ python3 -m nose

or

$ python3 -m unittest discover

Of course right now all tests will fail, since we cannot even import our Stack class. Let’s implement it:

class Stack:
    """A LIFO queue.

    """

    def __init__(self):
        """Initialize a new empty Stack.

        """

        self.contents = []
        self.size = 0

    def push(self, x):
        """Push a new element on top of the stack.

        """

        self.contents.append(x)
        self.size += 1

    def pop(self):
        """Return and remove the top element of the stack.

        """

        self.size -= 1
        try:
            ret = self.contents[-1]
            self.contents.pop()
            return ret
        except IndexError:
            raise ValueError('Empty stack')

    def peek(self):
        """Return the top element of the stack.

        """

        try:
            return self.contents[-1]
        except IndexError:
            raise ValueError('Empty stack')

    def clear(self):
        """Clear the Stack.

        """
        
        self.contents = None

    def __len__(self):
        """Returns the number of elements in the stack.

        """

        return self.size

We can now run our unit tests, and we see the following output.

$ python3 -m nose test_stack
............
----------------------------------------------------------------------
Ran 12 tests in 0.005s

OK

Regression Tests

A regression test is a test, written after discovering a bug, that triggers that bug. This ensures that future changes to the code do not re-introduce that particular bug.

When you discover a bug in your program, your first goal is likely to isolate a set of inputs or environmental factors that cause the bug to surface. You should then write a test that uses these inputs to trigger the bug and then test for that bug. Once you modify your program to fix the bug, the test should no longer fail.

The distinction between a unit test and a regression test is subtle. Almost every unit test could be considered a regression test, since it ensures that a specific piece of functionality is present. However, unit tests are generally designed to test small, atomic pieces of functionality, whereas a regression test might need to construct an elaborate sequence of inputs spanning many functions to trigger a bug.

Example

Despite its name, unittest can be used for regression tests.

Consider our Stack implementation from above, and the following code:

>>> from list_stack import ListStack
>>> stack = ListStack()
>>> try:
>>>     stack.pop()
>>> except ValueError:
>>>     pass
>>> stack.push(1)
>>> print(len(stack))
0

This code should clearly print 1, since we have pushed a single element onto an otherwise empty stack. We should write a regression test that triggers this bug.

import unittest

from stack import Stack

class StackRegressionTests(unittest.TestCase):

    def setUp(self):
        """Create an empty Stack for testing.

        """

        self.stack = Stack()

    def test_negative_size(self):
        """Pop shouldn't cause the size to become negative."""

        with self.assertRaises(ValueError):
            self.stack.pop()

        self.assertEqual(0, len(self.stack))

We can run just the regression tests with

$ python3 -m nose stack_regression_tests

or

$ python3 -m unittest stack_regression_tests

or we can instead automatically find and run all tests with

$ python3 -m nose

or

$ python3 -m unittest discover

Test Coverage

The test coverage of a test suite is the proportion of lines of source code that are executed by at least one test in the suite. Verifying coverage by hand would be tedious and error-prone; unsurprisingly, there are tools to compute this for us. Using nose, we can run

$ python3 -m nose test_stack --with-cover --cover-erase
............
Name    Stmts   Miss  Cover   Missing
-------------------------------------
stack      25      1    96%   50
----------------------------------------------------------------------
Ran 12 tests in 0.001s

OK

This tells us that one statement is not tested, namely line 50 of stack.py:

45    def clear(self):
46        """Clear the Stack.
47
48        """
49
50        self.contents = None

We should write a test to ensure that the clear method behaves as expected.
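
For example, we might add something like the following to our TestStack class (a sketch; the exact assertions are a design choice):

    def test_clear(self):
        """After clearing, the stack is empty and still usable."""

        for i in range(10):
            self.stack.push(i)
        self.stack.clear()

        self.assertEqual(0, len(self.stack))

        # A cleared stack should accept new elements.
        self.stack.push(1)
        self.assertEqual(1, self.stack.pop())

Such a test would in fact fail against our implementation above, since clear sets contents to None and never resets size.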

It is important to note that while “100% code coverage” is an admirable goal, attaining that goal does not imply that your program is correct. Don’t get sidetracked from writing good tests by chasing a number.

Hypothesis Testing

One downside of unit tests and regression tests is that we only test for scenarios that the programmer thought of and bothered to write down. It would be wonderful if we could guard against a larger class of errors; this is what we accomplish with hypothesis testing.

When proving the correctness of an algorithm one often comes up with algorithmic invariants that are maintained throughout the execution of the program. For example:

  • The simplex algorithm only ever visits basic feasible solutions, and never degrades the objective value.
  • When merging sublists together, merge sort ensures that each sublist is itself sorted.

In addition to these internal invariants, we can often reason about the behavior of the function as a whole. For example:

  • Any stable sorting algorithm is idempotent; that is, sorting a list twice has the same outcome as sorting it once.
  • A function that reverses a list is its own inverse.

We can express these ideas as hypotheses or, if you can prove them, propositions.

A hypothesis testing library allows you to write tests that are supposed to succeed across all valid inputs, where you specify the family of inputs to draw from. The library then tests the hypothesis with a large number of random inputs. If it finds an input that causes the hypothesis to fail, it tries to reduce that input to find a minimal counterexample to the hypothesis. A suite of hypothesis tests often amounts to a formal specification of the full behavior of a function, and so serves as another form of documentation.
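
For example, the list-reversal property above can be written as a short hypothesis test (a minimal sketch; the hypothesis library itself is introduced below):

from hypothesis import given
import hypothesis.strategies as st

@given(st.lists(st.integers()))
def test_reverse_involution(x):
    """Reversing a list twice yields the original list."""

    assert x[::-1][::-1] == x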

Hypothesis testing can also be used when creating a new implementation of an existing algorithm. If you have a (possibly slow, possibly poorly-written, but correct) implementation of an algorithm, a natural proposition is that the new algorithm produces the same output as the old one. This is particularly useful for algorithms that have a natural but slow implementation that is easy to verify, along with a faster implementation that is more difficult to reason about.

Consider the MAXSAT problem. This problem is NP-hard, and so we might be interested in creating a fast approximation algorithm for it. To verify the approximation ratio, we could write a \(O(2^n)\) brute-force algorithm to exactly solve small inputs of the problem and compare the results to the approximation algorithm.
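
A brute-force solver along these lines might look like the following sketch (the clause encoding, in which a nonzero integer \(k\) denotes the literal \(x_k\) and \(-k\) its negation, is our own assumption):

from itertools import product

def brute_force_maxsat(clauses, n):
    """Return the maximum number of clauses satisfiable at once.

    Each clause is a list of nonzero integers: the literal k means
    variable k is true, and -k means variable k is false.
    """

    best = 0
    for bits in product([False, True], repeat=n):
        # Count the clauses satisfied by this assignment; a clause is
        # satisfied if any of its literals agrees with the assignment.
        satisfied = sum(
            any((lit > 0) == bits[abs(lit) - 1] for lit in clause)
            for clause in clauses
        )
        best = max(best, satisfied)
    return best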

The QuickCheck package for Haskell is the gold standard for proposition testing. Many derivative implementations have been made for various languages; some are listed below.

Language   Framework
Haskell    QuickCheck
Python     hypothesis
R          quickcheck

Example

Our goal is to create a sorting function that is equivalent to Python’s built-in list.sort method: in-place, stable, and correct. We choose to implement quicksort, but can’t think of any easy invariants we can express as a proposition. However, we can rely on Python’s built-in list.sort method to check the correctness of our implementation.

We define our quicksort implementation,

def quicksort(x):
    """Sort the list x in-place.

    """

    _quicksort(x, 0, len(x))

def _quicksort(x, a, b):
    """Sort the elements x[a:b] in-place.

    """

    if a >= b:
        return

    # Partition about the pivot; the pivot will go to position i
    pivot = x[b - 1]
    i = a
    for j in range(a, b):
        if x[j] < pivot:
            x[i], x[j] = x[j], x[i]
            i += 1
    x[i], x[b - 1] = x[b - 1], x[i]

    _quicksort(x, a, i)
    _quicksort(x, i + 1, b)

along with some simple tests

import unittest
from hypothesis import given
import hypothesis.strategies as st

from quicksort import quicksort

class F:
    """A simple class that defines a custom ordering.

    """

    def __init__(self, z):
        self.z = z

    def __lt__(self, other):
        return self.z < other.z

    def __str__(self):
        return 'F({})@{}'.format(self.z, id(self))

class TestQuicksort(unittest.TestCase):

    @given(st.lists(st.integers()))
    def test_ints(self, x):

        _x = list(x)
        x.sort()
        quicksort(_x)

        self.assertListEqual(x, _x)

    @given(st.lists(st.builds(F, st.integers())))
    def test_F(self, fs):
        _fs = list(fs)

        fs.sort()
        quicksort(_fs)

        self.assertListEqual(fs, _fs)

Note that we still used the unittest library to create the tests, and simply decorated them with decorators from hypothesis. If we run our tests, we get

$ python3 -m nose test_quicksort 
F.
======================================================================
FAIL: test_F (test_quicksort.TestQuicksort)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/prsteele/Documents/Courses/orie-6125-sp2016/doc/testing/test_quicksort.py", line 33, in test_F
    def test_F(self, fs):
  File "/home/prsteele/.local/lib/python3.4/site-packages/hypothesis/core.py", line 540, in wrapped_test
    print_example=True, is_final=True
  File "/home/prsteele/.local/lib/python3.4/site-packages/hypothesis/executors.py", line 57, in default_new_style_executor
    return function(data)
  File "/home/prsteele/.local/lib/python3.4/site-packages/hypothesis/core.py", line 103, in run
    return test(*args, **kwargs)
  File "/home/prsteele/Documents/Courses/orie-6125-sp2016/doc/testing/test_quicksort.py", line 39, in test_F
    self.assertListEqual(fs, _fs)
nose.proxy.AssertionError: Lists differ: [<tes[22 chars]t 0x7f936c0c7470>, <test_quicksort.F object at 0x7f936c0db6d8>] != [<tes[22 chars]t 0x7f936c0db6d8>, <test_quicksort.F object at 0x7f936c0c7470>]

First differing element 0:
F(0)@140271149675632
F(0)@140271149758168

- [<test_quicksort.F object at 0x7f936c0c7470>,
-  <test_quicksort.F object at 0x7f936c0db6d8>]
? ^                                           ^

+ [<test_quicksort.F object at 0x7f936c0db6d8>,
? ^                                           ^

+  <test_quicksort.F object at 0x7f936c0c7470>]
-------------------- >> begin captured stdout << ---------------------
Falsifying example: test_F(self=<test_quicksort.TestQuicksort testMethod=test_F>, fs=[<test_quicksort.F at 0x7f936c0c7470>, <test_quicksort.F at 0x7f936c0db6d8>])

--------------------- >> end captured stdout << ----------------------

----------------------------------------------------------------------
Ran 2 tests in 0.174s

FAILED (failures=1)

The output is a bit cluttered, but the lines

First differing element 0:
F(0)@140271149675632
F(0)@140271149758168

- [<test_quicksort.F object at 0x7f936c0c7470>,
-  <test_quicksort.F object at 0x7f936c0db6d8>]
? ^                                           ^

+ [<test_quicksort.F object at 0x7f936c0db6d8>,
? ^                                           ^

+  <test_quicksort.F object at 0x7f936c0c7470>]
-------------------- >> begin captured stdout << ---------------------
Falsifying example: test_F(self=<test_quicksort.TestQuicksort testMethod=test_F>, fs=[<test_quicksort.F at 0x7f936c0c7470>, <test_quicksort.F at 0x7f936c0db6d8>])

tell us what we need to know: the falsifying example was the list [F(0), F(0)], and we fail because our sort is not stable.
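
To make the tests pass we need a stable algorithm. One possibility (a sketch of an alternative, not the only fix) is a merge sort, which is naturally stable as long as the merge step prefers the left sublist on ties:

def merge_sort(x):
    """Sort the list x in-place and stably.

    """

    if len(x) <= 1:
        return

    mid = len(x) // 2
    left, right = x[:mid], x[mid:]
    merge_sort(left)
    merge_sort(right)

    i = j = k = 0
    while i < len(left) and j < len(right):
        # Take from the right only on a strict comparison; ties go to
        # the left sublist, which preserves stability.
        if right[j] < left[i]:
            x[k] = right[j]
            j += 1
        else:
            x[k] = left[i]
            i += 1
        k += 1

    # At most one of these slices is nonempty.
    x[k:] = left[i:] + right[j:]

Substituting merge_sort for quicksort makes both hypothesis tests pass, at the cost of allocating temporary sublists.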

Benchmark tests

Timing tests, stress tests, and benchmark tests are all designed to measure the performance of code. While unit and regression tests can help ensure that an “upgrade” to your code doesn’t break anything, if you are interested in tracking changes in execution speed you need to take measurements. A suite of benchmark tests can be used to formally track changes in performance over time.

Unlike unit, regression, or hypothesis tests, there is no particular reason that a benchmark test needs to be written in the same language as the program itself (unless you are trying to benchmark internal functions, and not just the entire execution of the program). In fact, a simple way to run benchmarks is the time program.

$ /usr/bin/time -v sleep 2 
	Command being timed: "sleep 2"
	User time (seconds): 0.00
	System time (seconds): 0.00
	Percent of CPU this job got: 0%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.00
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 1616
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 74
	Voluntary context switches: 2
	Involuntary context switches: 1
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

Note that bash has a built-in time command whose default behavior differs from /usr/bin/time, although they are very similar.
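
To benchmark an internal function from within Python, the standard library's timeit module is a reasonable starting point. The following sketch times our quicksort implementation on random inputs (the list sizes and repeat count are arbitrary choices):

import random
import timeit

from quicksort import quicksort

def benchmark(n, repeats=5):
    """Return the best wall-clock time to sort a random list of n floats."""

    data = [random.random() for _ in range(n)]

    # Copy the data on each run so that every trial sorts an unsorted
    # list; report the minimum to reduce timing noise.
    times = timeit.repeat(lambda: quicksort(list(data)),
                          number=1, repeat=repeats)
    return min(times)

if __name__ == '__main__':
    for n in (1000, 10000, 100000):
        print(n, benchmark(n))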