Test Driven Development¶
< crschmidt> No software is bug free
< FrankW> #/bin/sh
< FrankW> echo “Hello World”
< FrankW> That’s pretty bug free.
< crschmidt> FrankW: you missed a !
—Anon., bash.org
Introduction¶
A software test is a program that compares the actual output of another program or function with the expected output over a wide range of inputs; this program then reports any discrepancies it finds. Test driven development (TDD) is a software design process focused on a development cycle consisting of three components:
- Planning a new piece of functionality.
This step involves writing tests (that currently fail) that encapsulate the desired functionality.
- Implementing the functionality.
This step generally ends when all test cases pass, indicating that the new functionality works as expected.
- Refactoring the new code.
We have code that achieves the desired functionality from step 2; however, this code might not fit the style of the project, be needlessly complicated, or run too slowly. Since we have tests from step 1, we can verify that our refactoring does not break the new functionality.
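As a minimal sketch of this cycle (the add function here is a hypothetical example, not from any particular project):

```python
# Step 1: a test for functionality that does not exist yet.
# At this point, running the test fails, since add is undefined.
def test_add():
    assert add(2, 3) == 5
    assert add(-1, 1) == 0

# Step 2: implement just enough to make the test pass.
def add(x, y):
    return x + y

# Step 3: refactor freely; re-running test_add verifies that the
# behavior is preserved.
test_add()
```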
The impetus behind TDD is the idea that “testable code is good code.” That is, by designing your code to be testable, you are likely to create code that is modular and handles global state in a sane manner. It would be very difficult to test code that stores all of its state in global variables that any function can update, since no single function could be tested without first recreating the global state it depends on. Modular code is easier to write and reason about, since each function can be planned and analyzed separately.
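As an illustration (both functions here are invented for the example), compare a function that manipulates a global variable with one that takes its state explicitly:

```python
# Hard to test: the result depends on hidden global state, so a test
# of this function must set up and restore the global around each call.
total = 0

def add_to_total(x):
    global total
    total += x
    return total

# Easier to test: the same logic as a pure function of its inputs.
def add_amount(total, x):
    return total + x

assert add_amount(0, 5) == 5   # no setup or tear down needed
assert add_amount(5, 5) == 10
```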
While TDD is generally considered good practice, it does not need to be followed strictly. In practice you might often perform step 2 before step 1, especially if you are exploring a new problem and are not entirely sure what your solution will look like yet. However, you should still strive to write testable code, since you will want to come back to step 1 later on.
We will now discuss various concepts related to testing, namely unit tests, regression tests, test coverage, hypothesis testing, and benchmark (timing or stress) tests.
Unit tests¶
Perhaps the simplest test you can write is a unit test. The goal of a
unit test is to test a small “unit” of code which generally has a
simple goal or purpose. For example, if you were writing a calculator
application one possible unit test would be to test the add
function. You generally try to write unit tests to cover all
functionality of your program across a variety of inputs, including
difficult corner cases or known error cases.
Conceptually a unit test (or collection of unit tests) is just a program that executes a number of tests and reports the results. However, unit tests are very common, and a great deal of work has been done to make writing such tests easier. These unit testing frameworks deal with the boilerplate code of handling error reporting, timing tests, and selecting which tests to run.
Each unit test should be independent of the other tests being run. This means that, for example, the order in which you run tests should not affect their outcome. For this reason most testing frameworks allow you to define setup and tear down functions that are run before and after each test to create a fresh testing environment. For example, if you were testing code that analyzes data stored in a database, your setup function might create a new database and populate it with a small set of sample data; the tear down function would then delete this database. This way, a unit test that modifies the database cannot affect another unit test that reads from it.
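For example, here is a sketch of this pattern using the standard library's sqlite3 module and a temporary database file (the table and queries are invented for illustration):

```python
import os
import sqlite3
import tempfile
import unittest


class TestUserQueries(unittest.TestCase):

    def setUp(self):
        # Create a fresh database with sample data before each test.
        fd, self.path = tempfile.mkstemp(suffix='.db')
        os.close(fd)
        self.conn = sqlite3.connect(self.path)
        self.conn.execute('CREATE TABLE users (name TEXT)')
        self.conn.execute("INSERT INTO users VALUES ('alice'), ('bob')")
        self.conn.commit()

    def tearDown(self):
        # Delete the database so tests cannot affect one another.
        self.conn.close()
        os.remove(self.path)

    def test_count(self):
        n, = self.conn.execute('SELECT COUNT(*) FROM users').fetchone()
        self.assertEqual(2, n)

    def test_insert(self):
        # Modifying the database here cannot affect test_count, since
        # each test gets its own database.
        self.conn.execute("INSERT INTO users VALUES ('carol')")
        n, = self.conn.execute('SELECT COUNT(*) FROM users').fetchone()
        self.assertEqual(3, n)
```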
We will explore the nose and unittest modules for unit testing
Python code; nose
actually extends the unittest module,
adding nice functionality such as timing and test coverage
calculations. These frameworks are conceptually similar to most unit
testing frameworks in popular languages. The table below lists similar
frameworks in various languages.
Language | Framework |
---|---|
Python | nose |
Python | unittest |
C++ | Google Test |
Java | junit |
Haskell | HUnit |
Javascript | Jasmine |
Example¶
We consider implementing a stack in Python. Conceptually, a stack
supports the push
and pop
operations; we support several more
for convenience. In particular, we want to support the following
methods:
__init__(self)
Create a new empty stack
push(self, x)
Push an object to the top of the stack
pop(self)
Remove the top object from the stack and return it
peek(self)
Return the top object from the stack
__len__(self)
Return the number of objects on the stack
clear(self)
Remove all elements from the Stack.
We can begin by writing unit tests to check the functionality of each of these methods.
```python
import unittest

from stack import Stack


class TestStack(unittest.TestCase):
    """A collection of unit tests for the Stack interface.
    """

    def setUp(self):
        """Create an empty Stack for each test"""
        self.stack = Stack()

    def test_push_1(self):
        """Push a single element on the stack."""
        self.stack.push(0)
        self.assertEqual(1, len(self.stack))

    def test_push_2(self):
        """Push many elements on the stack."""
        for i in range(10):
            self.stack.push(i)
        self.assertEqual(10, len(self.stack))

    def test_pop_empty(self):
        """We expect an exception to be thrown if we pop an empty stack"""
        with self.assertRaises(ValueError):
            self.stack.pop()

    def test_pop_1(self):
        """Push, then pop."""
        self.stack.push(1)
        self.assertEqual(1, self.stack.pop())

    def test_pop_2(self):
        """Push many, then pop."""
        for i in range(10):
            self.stack.push(i)
        self.assertEqual(9, self.stack.pop())

    def test_pop_3(self):
        """Push many, and pop them all"""
        values = list(range(10))
        for i in values:
            self.stack.push(i)
        for i in values[::-1]:
            self.assertEqual(i, self.stack.pop())

    def test_len_1(self):
        """The length of an empty stack is 0"""
        self.assertEqual(0, len(self.stack))

    def test_len_2(self):
        """Check the length after many pushes"""
        size = 0
        for i in range(10):
            self.stack.push(i)
            size += 1
            self.assertEqual(size, len(self.stack))

    def test_len_3(self):
        """Check the length after pushes and pops"""
        self.stack.push(0)
        self.assertEqual(1, len(self.stack))
        self.stack.push(0)
        self.assertEqual(2, len(self.stack))
        self.stack.push(0)
        self.assertEqual(3, len(self.stack))
        self.stack.pop()
        self.assertEqual(2, len(self.stack))
        self.stack.pop()
        self.assertEqual(1, len(self.stack))
        self.stack.push(0)
        self.assertEqual(2, len(self.stack))
        self.stack.push(0)
        self.assertEqual(3, len(self.stack))

    def test_peek_1(self):
        """Push, then peek."""
        self.stack.push(1)
        self.assertEqual(1, self.stack.peek())
        self.assertEqual(1, len(self.stack))

    def test_peek_2(self):
        """Push many, then peek."""
        for i in range(10):
            self.stack.push(i)
        self.assertEqual(9, self.stack.peek())

    def test_peek_nullipotent(self):
        """Peek cannot change the stack"""
        self.stack.push(0)
        for i in range(10):
            self.assertEqual(0, self.stack.peek())
```
By making a class whose name begins with Test, we allow nose or unittest to automatically discover and run our tests. We inherit from unittest.TestCase to get access to the assert* methods; see the unittest documentation for details.
We can run these tests with
$ python3 -m nose
or
$ python3 -m unittest discover
Of course right now all tests will fail, since we cannot even import
our Stack
class. Let’s implement it:
```python
class Stack:
    """A LIFO queue.
    """

    def __init__(self):
        """Initialize a new empty Stack.
        """
        self.contents = []
        self.size = 0

    def push(self, x):
        """Push a new element on top of the stack.
        """
        self.contents.append(x)
        self.size += 1

    def pop(self):
        """Return and remove the top element of the stack.
        """
        self.size -= 1
        try:
            ret = self.peek()
            self.contents.pop()
            return ret
        except ValueError:
            raise ValueError('Empty stack')

    def peek(self):
        """Return the top element of the stack.
        """
        try:
            return self.contents[-1]
        except IndexError:
            raise ValueError('Empty stack')

    def clear(self):
        """Clear the Stack.
        """
        self.contents = None

    def __len__(self):
        """Return the number of elements in the stack.
        """
        return self.size
```
We can now run our unit tests, and we see the following output.
$ python3 -m nose test_stack
............
----------------------------------------------------------------------
Ran 12 tests in 0.005s
OK
Regression Tests¶
A regression test is a test, written after discovering a bug, that triggers that bug. This ensures that future changes to the code do not re-introduce that particular bug.
When you discover a bug in your program, your first goal is likely to isolate a set of inputs or environmental factors that cause the bug to surface. You should then write a test that uses these inputs to trigger the bug and check for the correct behavior. Once you modify your program to fix the bug, the test should no longer fail.
The distinction between a unit test and a regression test is subtle. Almost every unit test could be considered a regression test, since it ensures that a specific piece of functionality is present. However, unit tests are generally designed to test small, atomic pieces of functionality, whereas a regression test might need to construct an elaborate sequence of inputs spanning many functions to trigger a bug.
Example¶
Despite its name, unittest can be used for regression tests. Consider our Stack implementation from above, and consider the following code:
```python
>>> from stack import Stack
>>> stack = Stack()
>>> try:
...     stack.pop()
... except ValueError:
...     pass
>>> stack.push(1)
>>> print(len(stack))
0
```
This code should clearly print out 1
, since we have pushed only
one element to an otherwise empty stack. We should write a regression
test to exploit this bug.
```python
import unittest

from stack import Stack


class StackRegressionTests(unittest.TestCase):

    def setUp(self):
        """Create an empty Stack for testing.
        """
        self.stack = Stack()

    def test_negative_size(self):
        """Pop shouldn't cause size to be negative"""
        with self.assertRaises(ValueError):
            self.stack.pop()
        self.assertEqual(0, len(self.stack))
```
We can run just the regression tests with
$ python3 -m nose stack_regression_tests
or
$ python3 -m unittest stack_regression_tests
or can instead automatically find and run all tests with
$ python3 -m nose
or
$ python3 -m unittest discover
Test Coverage¶
The test coverage of a test suite is the fraction of the source code that is executed by at least one test in the suite. Verifying coverage by hand would be tedious and error-prone; unsurprisingly, there are tools to compute this for us. Using nose, we can run
$ python3 -m nose test_stack --with-cover --cover-erase
............
----------------------------------------------------------------------
Ran 12 tests in 0.001s
OK
The coverage report tells us that one statement is not tested, namely line 50 of stack.py:
```python
def clear(self):
    """Clear the Stack.
    """
    self.contents = None
```
We should write a test to ensure that the clear
method behaves as
expected.
It is important to note that while “100% code coverage” is an admirable goal, attaining that goal does not imply that your program is correct. Don’t get sidetracked from writing good tests by chasing a number.
Hypothesis Testing¶
One downside of unit tests and regression tests is that we only test for scenarios that the programmer can think of and takes the time to write. It would be wonderful if we could guard against a larger class of errors, which is what we accomplish with hypothesis testing.
When proving the correctness of an algorithm one often comes up with algorithmic invariants that are maintained throughout the execution of the program. For example:
- The simplex algorithm only ever visits basic feasible solutions, and never degrades the objective value.
- When merging sublists together, merge sort ensures that each sublist is itself sorted.
In addition to these internal invariants, we can often reason about the behavior of the function as a whole. For example:
- Any stable sorting algorithm is idempotent; that is, sorting a list many times has the same outcome as sorting it once.
- A function to reverse a list is its own inverse.
We can express these ideas as hypotheses or, if we can prove them, propositions.
A hypothesis testing library allows you to write tests that are supposed to succeed across all valid inputs, where you specify the family of inputs to draw from. The library will then test the hypothesis with a large number of random inputs. If it finds an input that causes the hypothesis to fail, it then tries to reduce that input to find a minimal counterexample to the hypothesis. A suite of hypothesis tests often amount to formally specifying the full behavior of a function, and so serve as another form of documentation.
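The idea can be sketched by hand for the two propositions above by drawing random lists of integers; a real framework such as hypothesis additionally shrinks any failing input to a minimal counterexample:

```python
import random

def reverse(x):
    """Return a reversed copy of x."""
    return x[::-1]

def random_list():
    """Draw a random input: a list of 0 to 20 small integers."""
    return [random.randint(-100, 100) for _ in range(random.randint(0, 20))]

# Proposition: reversing a list twice returns the original list.
for _ in range(1000):
    x = random_list()
    assert reverse(reverse(x)) == x

# Proposition: sorting is idempotent.
for _ in range(1000):
    x = random_list()
    once = sorted(x)
    assert sorted(once) == once
```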
Hypothesis testing can be used when creating a new implementation of an existing algorithm. If you have a (possibly slow, possibly poorly-written, but correct) implementation of an algorithm, a natural proposition is that the new algorithm produces the same output as the old algorithm. This is particularly useful for algorithms that have a natural but slow implementation that is easy to verify, alongside a faster implementation that is more difficult to reason about.
Consider the MAXSAT problem. This problem is NP-hard, and so we might be interested in creating a fast approximation algorithm for it. To verify the approximation ratio, we could write a \(O(2^n)\) brute-force algorithm to exactly solve small inputs of the problem and compare the results to the approximation algorithm.
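A sketch of this comparison (the clause encoding and the naive randomized heuristic here are illustrative choices, not a serious MAXSAT solver):

```python
import itertools
import random

# A clause is a tuple of nonzero ints: literal +i means "variable i is
# true" and -i means "variable i is false".

def num_satisfied(clauses, assignment):
    """Count clauses with at least one true literal."""
    return sum(
        any((lit > 0) == assignment[abs(lit)] for lit in clause)
        for clause in clauses)

def exact_maxsat(clauses, n):
    """O(2^n) brute force: try every assignment of variables 1..n."""
    return max(
        num_satisfied(clauses, dict(zip(range(1, n + 1), bits)))
        for bits in itertools.product([False, True], repeat=n))

def approx_maxsat(clauses, n, trials=50):
    """Heuristic: the best of several uniform random assignments."""
    return max(
        num_satisfied(clauses, {i: random.random() < 0.5
                                for i in range(1, n + 1)})
        for _ in range(trials))

# Hypothesis: on small instances the heuristic never exceeds the
# exact optimum.
for _ in range(20):
    n = 4
    clauses = [tuple(random.choice([-1, 1]) * random.randint(1, n)
                     for _ in range(2))
               for _ in range(6)]
    assert approx_maxsat(clauses, n) <= exact_maxsat(clauses, n)
```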
The QuickCheck package for Haskell is the gold standard for proposition testing. Many derivative implementations have been made for various languages; some are listed below.
Language | Framework |
---|---|
Haskell | QuickCheck |
Python | hypothesis |
R | quickcheck (R) |
Example¶
Our goal is to create a sorting function that is equivalent to Python’s built-in list.sort method; that is, in-place, stable, and correct. We choose to implement quicksort, but can’t think of any easy invariants we can express as a proposition. However, we can rely on Python’s built-in list.sort method to check the correctness of our implementation.
We define our quicksort implementation,
```python
def quicksort(x):
    _quicksort(x, 0, len(x))


def _quicksort(x, a, b):
    """Sort the elements x[a:b] in-place.
    """
    if a >= b:
        return

    # Partition about the pivot; the pivot will go to position i
    pivot = x[b - 1]
    i = a
    for j in range(a, b):
        if x[j] < pivot:
            x[i], x[j] = x[j], x[i]
            i += 1
    x[i], x[b - 1] = x[b - 1], x[i]

    _quicksort(x, a, i)
    _quicksort(x, i + 1, b)
```
along with some simple tests
```python
import unittest

from hypothesis import given
import hypothesis.strategies as st

from quicksort import quicksort


class F:
    """A simple class that defines a custom ordering.
    """

    def __init__(self, z):
        self.z = z

    def __lt__(self, other):
        return self.z < other.z

    def __str__(self):
        return 'F({})@{}'.format(self.z, id(self))


class TestQuicksort(unittest.TestCase):

    @given(st.lists(st.integers()))
    def test_ints(self, x):
        _x = list(x)
        x.sort()
        quicksort(_x)
        self.assertListEqual(x, _x)

    @given(st.lists(st.builds(F, st.integers())))
    def test_F(self, fs):
        _fs = list(fs)
        fs.sort()
        quicksort(_fs)
        self.assertListEqual(fs, _fs)
```
Note that we still used the unittest
library for creating tests,
and just decorated the tests using decorators from hypothesis
. If
we run our tests, we get
$ python3 -m nose test_quicksort
F.
======================================================================
FAIL: test_F (test_quicksort.TestQuicksort)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/prsteele/Documents/Courses/orie-6125-sp2016/doc/testing/test_quicksort.py", line 33, in test_F
def test_F(self, fs):
File "/home/prsteele/.local/lib/python3.4/site-packages/hypothesis/core.py", line 540, in wrapped_test
print_example=True, is_final=True
File "/home/prsteele/.local/lib/python3.4/site-packages/hypothesis/executors.py", line 57, in default_new_style_executor
return function(data)
File "/home/prsteele/.local/lib/python3.4/site-packages/hypothesis/core.py", line 103, in run
return test(*args, **kwargs)
File "/home/prsteele/Documents/Courses/orie-6125-sp2016/doc/testing/test_quicksort.py", line 39, in test_F
self.assertListEqual(fs, _fs)
nose.proxy.AssertionError: Lists differ: [<tes[22 chars]t 0x7f936c0c7470>, <test_quicksort.F object at 0x7f936c0db6d8>] != [<tes[22 chars]t 0x7f936c0db6d8>, <test_quicksort.F object at 0x7f936c0c7470>]
First differing element 0:
F(0)@140271149675632
F(0)@140271149758168
- [<test_quicksort.F object at 0x7f936c0c7470>,
- <test_quicksort.F object at 0x7f936c0db6d8>]
? ^ ^
+ [<test_quicksort.F object at 0x7f936c0db6d8>,
? ^ ^
+ <test_quicksort.F object at 0x7f936c0c7470>]
-------------------- >> begin captured stdout << ---------------------
Falsifying example: test_F(self=<test_quicksort.TestQuicksort testMethod=test_F>, fs=[<test_quicksort.F at 0x7f936c0c7470>, <test_quicksort.F at 0x7f936c0db6d8>])
--------------------- >> end captured stdout << ----------------------
----------------------------------------------------------------------
Ran 2 tests in 0.174s
FAILED (failures=1)
The output is a bit cluttered, but the lines
First differing element 0:
F(0)@140271149675632
F(0)@140271149758168
- [<test_quicksort.F object at 0x7f936c0c7470>,
- <test_quicksort.F object at 0x7f936c0db6d8>]
? ^ ^
+ [<test_quicksort.F object at 0x7f936c0db6d8>,
? ^ ^
+ <test_quicksort.F object at 0x7f936c0c7470>]
-------------------- >> begin captured stdout << ---------------------
Falsifying example: test_F(self=<test_quicksort.TestQuicksort testMethod=test_F>, fs=[<test_quicksort.F at 0x7f936c0c7470>, <test_quicksort.F at 0x7f936c0db6d8>])
tell us what we need to know. Namely, the case that failed was passing
[F(0), F(0)]
; we’re failing because our sort is not stable.
Benchmark tests¶
Timing tests, stress tests, or benchmark tests are all designed to test the performance of code. While unit and regression tests can help ensure an “upgrade” to your code doesn’t break anything, if you are interested in tracking changes in execution speed you need to take measurements. A suite of benchmark tests can be used to formally track the changes in performance over time.
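Within Python, the standard-library timeit module is a convenient way to take such measurements; for example, to compare two ways of building a list of squares:

```python
import timeit

# timeit runs each statement `number` times and reports the total
# elapsed time in seconds.
loop = timeit.timeit(
    'xs = []\n'
    'for i in range(1000):\n'
    '    xs.append(i * i)',
    number=1000)
comprehension = timeit.timeit(
    'xs = [i * i for i in range(1000)]',
    number=1000)

print('loop:          {:.4f}s'.format(loop))
print('comprehension: {:.4f}s'.format(comprehension))
```

On CPython the comprehension is typically faster, but the exact numbers vary by machine, which is why tracking benchmark results over time is more informative than a single measurement.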
Unlike unit, regression, or hypothesis tests, there is no particular
reason that a benchmark test needs to be written in the same language
as the program itself (unless you are trying to benchmark internal
functions, and not just the entire execution of the program). In fact,
a simple way to run benchmarks is the time
program.
$ /usr/bin/time -v sleep 2
Command being timed: "sleep 2"
User time (seconds): 0.00
System time (seconds): 0.00
Percent of CPU this job got: 0%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.00
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1616
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 74
Voluntary context switches: 2
Involuntary context switches: 1
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Note that bash
has a built-in time
command whose default
behavior differs from /usr/bin/time
, although they are very
similar.