[Dilbert cartoon on agile programming]

Wally’s comment is insightful. Giving a thing a name makes it possible to discuss that thing. Witness the enormous discussion of Test Driven Development on the internet.

The purpose of this post is to give a name to a testing practice I have found useful, and to explain why it is useful. I call the practice Head to Head testing, or HTH. I propose HTH as a solution to a problem that arises in test driven development, and will introduce HTH by first explaining the problem. My explanation will use functional programming notation, but the underlying issues are the same in object-oriented or any other style of programming. This is because the issues are mathematical in nature.

Suppose my task is to write a function to compute logarithms base 10. TDD demands that I begin by constructing some tests. OK; here they are.

(assert (= (log 1) 0))
(assert (= (log 10) 1))
(assert (= (log 100) 2))
(assert (= (log 1000) 3))

What do these tests accomplish? A function that fails these tests is certainly an inadequate log function, but passing this suite proves little. Infinitely many functions satisfy these tests. Let’s step back and consider the logic of testing.

Suppose we are testing the implementation of a function F defined on a finite domain D. An ideal test suite would have one test case for each element of D, and would exhaustively check whether the implementation of F was correct at each point in D. An implementation that passed this ideal test suite would be demonstrably correct.

The problem is that exhaustive testing is impossible in practice. Many functions have infinite domains, or domains so large that exhaustive testing is not physically feasible. Since we cannot test our implementation at each point in its domain, we are forced to settle for testing on a subset of the domain. But which subset? I contend that we have now entered the realm of statistical sampling.

My test suite for the log function proposed to test at the points {1, 10, 100, 1000}. Looked at through the lens of statistical sampling, this test suite suffers from two defects. The lesser defect is that we have no idea whether a test of size N=4 is sufficient. The fatal defect is that the points chosen for testing were not chosen randomly. This makes the test suite statistically worthless.

Imagine by way of analogy that you engaged me to conduct a survey. I offer to use my friends as the sample population. You ought to conclude either that I am pulling your leg or that I am an incompetent statistician. But in the programming world, this approach to sampling is commonly accepted. Test suites are hand-written by programmers who write whatever test cases they deem appropriate. This flies in the face of statistical practice. A statistically valid test requires randomly sampled data, not data hand-chosen by a human being.

It could be argued that as software testers we are not in search of statistically valid tests, but of errors. And that is true as far as it goes, but the search for errors demands randomized data for two reasons. First, the computer can generate untold millions of random test cases for each one a tester writes by hand. But the deeper purpose of randomization is to avoid blind spots or bias in the selection of test data. If I design the tests, presumably I choose the data based on what I think might go wrong in the software. What about all the things that never occur to me to test? In the long run, randomized tests go everywhere. They have no long-term biases or blind spots.

Suppose I decide to do the right thing by testing the log function on randomly sampled data. How could I do that? It’s not hard. Here’s how to conduct a single trial. Randomly pick two positive real numbers x and y. The corresponding test is this:

(assert (= (log (* x y))
           (+ (log x) (log y))))

The test exploits the fact that log(x * y) must equal log(x) + log(y). It would be a simple matter to automate this test by constructing a stream of randomly generated pairs (x y). Doing this would make it possible to test the log function at hundreds of millions of points rather than the relative few a developer or tester could test using hand-written tests.
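Here is one way that automation might look; a minimal sketch in Clojure, in which my-log10 is a placeholder standing in for the implementation under test, and the bounds and tolerance are arbitrary choices of mine. Exact floating-point equality would be too strict in practice, so each trial checks the identity within a small tolerance.

(defn my-log10 [x]                 ; placeholder for the implementation under test
  (/ (Math/log x) (Math/log 10)))

(defn random-positive []           ; a random positive real; the bounds are arbitrary
  (+ 0.001 (rand 1000.0)))

(defn log-identity-holds? [x y]
  (< (Math/abs (- (my-log10 (* x y))
                  (+ (my-log10 x) (my-log10 y))))
     1e-9))

(defn run-log-trials [n]
  (doseq [[x y] (repeatedly n #(vector (random-positive) (random-positive)))]
    (assert (log-identity-holds? x y) (str "identity failed at x=" x " y=" y))))

(run-log-trials 1000000)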

That’s all well and good, but my example is exceptionally well suited to randomized testing. In my example I could exploit a mathematical property of the log function to construct tests with known expected outcomes. It is not obvious how to do this while testing non-mathematical software. That is where Head to Head testing enters the picture.

The HTH approach is as follows. Suppose we are implementing a function F. We build not one but two independent implementations of F; call the implementations g and h. We also create a suitable set of randomized data D, and test at each point d in D whether g(d) = h(d). If g and h disagree at d, then at least one of them is in error. Notice that we do not have to trust that either g or h is correct, nor do we need to know the correct value of F at d. A disagreement tells us only that at least one implementation is wrong, so both must be investigated. This is a feature, not a bug. Examining the differences between the two implementations will lead us to the flaw much more quickly and easily than would the study of a single implementation.
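Here is a minimal sketch of what such an HTH harness might look like, reusing my-log10 and random-positive from the sketch above. The names g, h, same?, and gen-input are illustrative, not part of any standard: g and h are the two implementations, gen-input draws one random point from the domain, and same? decides whether two results agree (exact = for discrete results, a tolerance for floating point).

(defn hth-disagreements [g h same? gen-input n]
  ;; run n random trials; return every input on which g and h disagree
  (for [d (repeatedly n gen-input)
        :when (not (same? (g d) (h d)))]
    d))

(defn approx= [a b]
  (< (Math/abs (- a b)) 1e-9))

;; Example: pit my-log10 against the library's log10 and show the first few
;; inputs, if any, on which the two disagree.
(take 5 (hth-disagreements my-log10
                           #(Math/log10 %)
                           approx=
                           random-positive
                           1000000))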

A nice side effect of this approach is that a failed HTH test does not result in a mere assertion of failure. We are handed a test case known to trigger the failure, a very useful thing for debugging. And it is rare to find a lone bug. Very likely, running more tests will result in a growing collection of data points that trigger failures.

Return for a moment to the question of sample size. How many test cases are appropriate? As software testers we have a tremendous advantage over statisticians. In general it is very difficult to determine how large a sample size N is required to create a statistical test of known power. But we are not doing mathematical statistics, we are testing software. Just make N huge. Better still, make N infinite. Run HTH tests continuously. Many advocate continuous builds; why not continuous testing? By continuous testing I don’t mean a test suite that runs as part of the build, with successful completion acting as a gate on code check-in. I mean tests that run 24/7 on dedicated hardware, never completing, just searching tirelessly for bugs. Machines are cheap and bugs are expensive.
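As a sketch of what that could look like, here is an endless loop built from the same pieces as above: it keeps drawing random inputs, comparing the two implementations, and reporting every disagreement it finds. The names are illustrative, not prescriptive.

(defn test-forever [g h same? gen-input]
  (loop [trial 1]
    (let [d (gen-input)]
      (when-not (same? (g d) (h d))
        (println "Trial" trial "- disagreement on input:" (pr-str d))))
    (recur (inc trial))))

;; e.g. (test-forever my-log10 #(Math/log10 %) approx= random-positive)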

Is it practical to build two independent implementations of a function F as part of testing? Sometimes it’s not that hard to do. Other times you don’t have to, because someone else has already written the other implementation for you. Imagine you are writing a statistical function. Chances are you are not the first to write that function. Test your version HTH against a commercial product, or an open source package. Let your competitor help you test. Maybe your implementation of F is multi-threaded. Run F head to head against itself on different boxes, with different thread pool sizes. Run the latest version of F against a previous version for regression testing, but do it against randomly generated data instead of a few dozen hand-written tests.
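A randomized regression test might look like the following sketch, where sort-v1 stands in for the previous release and sort-v2 for the rewritten version; both are illustrative stand-ins I invented for this example, and the harness is hth-disagreements from the earlier sketch.

(defn sort-v1 [coll]                      ; stand-in for the previous release
  (sort coll))

(defn sort-v2 [coll]                      ; stand-in for the rewritten version
  (reduce (fn [sorted x]
            (let [[smaller larger] (split-with #(<= % x) sorted)]
              (concat smaller [x] larger)))
          ()
          coll))

(defn random-int-vector []
  (vec (repeatedly (rand-int 50) #(rand-int 1000))))

;; Show the first few random inputs, if any, on which old and new disagree.
(take 5 (hth-disagreements sort-v2 sort-v1 = random-int-vector 100000))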

But is it really practical to build two independent implementations of a function F in support of testing? I think it is. Go look at the history of your code in version control. How many times have you built and rebuilt the thing? How many times was another round of changes triggered by a bug? You are going to build more than one version. You can do it reactively in response to bugs, or you can do it in advance, by design. Your choice.

HTH is a conceptual approach. You can find ways to use it if you try. And it’s not an either-or thing. By all means, keep writing test cases by hand. Just don’t stop there.

These thoughts were set into motion by the recent Clojure bowling problem and the ensuing discussion of the role of TDD in functional programming.