Summary: Testing shows the presence, not the absence of bugs.
Recently, someone suggested to me that, thanks to test suites, things like changing compiler version or versions of library dependencies was "no big deal". If dependency changes still result in a passing test suite, then they have caused no harm. I disagree, and fortunately for me, Dijkstra explains it far more eloquently than I ever could:
Testing shows the presence, not the absence of bugs. Dijkstra (1969)
While a test suite can give you confidence in changes you make, it does not provide guarantees. Below are just a few reasons why.
The test suite does not cover all the code
For any reasonably sized code base (> 100 lines), covering all the lines of code is difficult. There are a number of factors that mean that mean a test suite is unlikely to provide 100% coverage:
- Producing tests is a resource intensive activity, and most projects do not have the necessary manpower to test everything.
- Sometimes there is no good way to test simple sugar functions - the definition is a specification of what the function should do.
- Testing corner cases is difficult. As the corners get more obscure, the difficulty increases.
- Testing error conditions is even harder. Some errors conditions have code to deal with them, but are believed to be unreachable.
The test suite does not cover all the ways through the code
Assuming the test suite really does cover every line of the code, making it cover every path through the code is almost certainly computationally infeasible. Consider a program taking a handful of boolean options. While it might be feasible to test each individual option in the true and false states, testing every state in conjunction with every other state requires an exponential amount of time. For programs with loops, testing every number of loop iterations is likely to be highly time consuming.
There is plenty of code you can't see
Even if you cover every line of source code, the compiler may still thwart your valiant efforts. Optimising compilers like to inline code (make copies of it) and specialise code (freeze in some details that would otherwise be dynamic). After such transformations, the compiler might spot undefined behaviour (something almost all C/C++ programs contain) and make modifications that break your code. You might have tested all the source code, but you have not tested all the code generated by the compiler. If you are writing in C/C++, and undefined behaviour and optimisation doesn't scare you a lot, you should read this LLVM article series.
Functions have huge inputs
Testing functions typically involves supplying their input and inspecting their output. Usually the input space is too large to enumerate - which is likely to be the case even if your function takes in an integer. As soon as your function takes a string or array, enumeration is definitely infeasible. Often you can pick cases at which the code is likely to go wrong (0, 1, -1, maxBound
) - but maybe it only fails for Carmichael numbers. Random testing can help, and is always advisable, but the effort to deploy random testing is typically quite a bit higher than input/output samples, and it is no panacea.
Functions are not functions
Testing functions usually assumes they really are functions, which depend only on their input. In pure functional languages that is mostly true, but in C/C++ it is less common. For example, functions that have an internal cache might behave differently under parallelism, especially if their cache is not managed properly. Functions may rely on global variables, so they might perform correctly until some seemingly unrelated operation is performed. Even Haskell programs are likely to depend on global state such as the FPU flags, which may be changed unexpectedly by other code.
In my experience, the non-functional nature of functions is one of the biggest practical difficulties, and is also a common place where dependency changes cause frustration. Buggy code can work successfully for years until an improved memory allocator allows a race condition to be hit.
Performance testing is hard
Even if your code gives the correct results, it may take too long or use too much memory. Alas, testing for resource usage is difficult. Resource numbers, especially runtime, are often highly variable between runs - more so if tests are run on shared hardware or make use of parallelism. Every dependency change is likely to have some impact on resource usage, perhaps as dependencies themselves chose to trade time for memory. Spotting erroneous variations often requires a human to make a judgement call.
What is the solution?
Tests help, and are valuable, and you should aim to test as much as you can. But for any reasonably sized program, your tests will never be complete, and the program will always contain unknown bugs. Most likely someone using your code will stumble across one of these bugs. In this case, it's often possible (and indeed, highly desirable) to add a new test case specifically designed to spot this error. Bugs have a habit of recurring, and a bug that happens twice is just embarrassing.
Thinking back to dependency versions, there is often strength in numbers. If all your users are on the same version of all the dependencies, then any bug that is very common is likely to be found by at least one user and fixed for all.
Thinking more generally, it is clear that many of these issues are somewhat ameliorated by pure functional programming. I consider testability and robustness to be one of the great strengths of Haskell.