Programming for coverage
I have been wanting to type up some notes on effective code coverage for a while. I am doing a talk at NDC Techtown 2025 on a few code coverage topics, which is an excellent opportunity to do so. These notes, and the talk, will touch on a few different topics: how coverage can guide testing, effective metrics, coverage as a concept, and how different programming styles affect the possibility and feasibility of coverage.
Notes on coverage
For me, coverage sits in this odd place where it is both underappreciated and applied so poorly that it becomes counterproductive. I think line coverage, which is just a poor metric we should deprecate, is partly to blame. Line coverage feels like it is not worth the effort because full line coverage is hopelessly inadequate at finding defects, while at the same time being somewhat expensive to achieve. In my experience, some code is just hard to reach (high cost), such as out-of-memory errors or specific helper functions, and the payoff of merely exercising each line is poor (low value). Lines may be perfectly fine in isolation, but problematic when combined with other code (or data), which is why context-less line coverage is so underwhelming.
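To make that concrete, here is a small, contrived sketch (the function and its tests are hypothetical, not from any real codebase): two tests are enough for full line coverage, yet the defect only appears for a combination of inputs that neither test exercises.

```cpp
#include <cassert>
#include <vector>

// Hypothetical example: every line can be covered without ever exercising the
// combination of inputs that actually fails.
int sum_window(const std::vector<int>& xs, bool skip_first, bool skip_last) {
    std::size_t begin = skip_first ? 1 : 0;
    std::size_t end   = skip_last ? xs.size() - 1 : xs.size();
    int sum = 0;
    for (std::size_t i = begin; i < end; ++i) sum += xs[i];
    return sum;
}

int main() {
    // These two tests reach 100% line coverage...
    assert(sum_window({1, 2, 3}, true, false) == 5);
    assert(sum_window({1, 2, 3}, false, true) == 3);
    // ...but the untested combination sum_window({}, false, true) underflows
    // 'end' to SIZE_MAX and reads out of bounds.
}
```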
A common “fix” for this is to set an arbitrary coverage target, say 80%, but this plays right into Goodhart’s law, commonly stated as “when a measure becomes a target, it ceases to be a good measure”. Code coverage is not a goal in itself; functioning and correct code is. Furthermore, not all code is created equal; there are segments of the program that warrant much more scrutiny and testing, but this is not reflected in coverage targets. If the goal is to identify “blind spots” that are (mostly) untouched by tests, you don’t really need a target, you need to analyze the coverage report. Meeting the threshold may give a false sense of security (“this is well tested, we have 80% coverage”) while important code slips through the cracks with poor test coverage. Code coverage doesn’t find defects; testing and validation does. Code coverage is, however, quite useful for guiding testing.
Before we move on, a note on coverage requirements. Safety-critical systems tend to require 100% MC/DC in order to ensure adequate testing. Avionics software has to comply with DO-178B and DO-178C, which require MC/DC for critical (Level A) systems, and the automotive standard ISO 26262 requires MC/DC for safety-critical systems (e.g. the brakes, not necessarily the music system). MC/DC does not guarantee a defect-free system, but it does require a minimum number of effective tests. The 100% coverage requirement means that even the hard-to-reach code needs full coverage.
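As a rough illustration of what MC/DC asks for (the function below is made up for the example): each condition must be shown to independently flip the decision while the others are held fixed, which for N conditions means at least N + 1 test cases rather than the 2^N of an exhaustive truth table.

```cpp
#include <cassert>

// Hypothetical decision with three conditions.
bool should_brake(bool obstacle, bool driver_override, bool self_test_ok) {
    return (obstacle && self_test_ok) || driver_override;
}

int main() {
    // Four tests give MC/DC for three conditions: for each condition there is
    // a pair of tests that differ only in that condition and disagree on the outcome.
    assert(should_brake(true,  false, true));    // baseline: brakes
    assert(!should_brake(false, false, true));   // 'obstacle' independently matters
    assert(!should_brake(true,  false, false));  // 'self_test_ok' independently matters
    assert(should_brake(true,  true,  false));   // 'driver_override' independently matters
}
```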
More sophisticated criteria, like MC/DC and (prime) path coverage, require more test cases than simpler metrics like block or edge coverage, but in return they are excellent guides for writing good test cases. Difficult-to-reach code tends to be difficult to write the test setup for, but once that setup is in place it becomes feasible to write many test cases and use the stronger coverage metrics.
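A hedged sketch of that dynamic (the FakeReader and parser below are invented for the example): the one-off cost is faking the input source; once that fake exists, each additional hard-to-reach case is only a couple of lines.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical: the parser only sees input through a reader, so the awkward
// part of the test is standing in for that reader.
struct FakeReader {
    std::vector<std::string> lines;
    std::size_t next = 0;
    bool eof() const { return next >= lines.size(); }
    std::string read_line() { return lines.at(next++); }
};

int parse_header_version(FakeReader& r) {
    if (r.eof()) throw std::runtime_error("empty input");
    const std::string line = r.read_line();
    if (line.rfind("version=", 0) != 0) throw std::runtime_error("missing version");
    return std::stoi(line.substr(8));
}

int main() {
    FakeReader ok{{"version=3"}};
    assert(parse_header_version(ok) == 3);

    // With the fake in place, each error path is one short test case.
    FakeReader empty{};
    try { parse_header_version(empty); assert(false); } catch (const std::runtime_error&) {}

    FakeReader garbage{{"ver=3"}};
    try { parse_header_version(garbage); assert(false); } catch (const std::runtime_error&) {}
}
```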
Should we design for coverage?
Should we write our programs with coverage in mind? I suppose it depends on what exactly “write with coverage in mind” means, and I don’t think there’s an obvious answer to that. Code is never written in a vacuum, and the programmer has to balance all sorts of properties: familiarity, clarity, speed, complexity, testability, etc. Coverage would be yet another such property, or maybe an aspect of testability, and writing testable code is generally a good thing. This in no way means we should prioritise coverage above all else; if “coverability” is at odds with clarity, it may well be justified to favour clarity.
There are several metrics that try to objectively quantify the complexity (or testability) of a function, such as McCabe’s cyclomatic complexity, Nejmeh’s NPATH, and Bagnara et al.’s ACPATH. These metrics are closely related to counting the prime paths, as the prime paths are the maximal simple paths and simple cycles, where a simple path by definition has no repeated nodes. By reducing the number of paths we reduce the complexity and the number of tests necessary to cover them.
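A small worked example (the numbers are computed for this made-up function, not a general rule): two sequential, independent decisions are enough for the metrics to diverge.

```cpp
#include <cassert>

// Hypothetical function with two independent, sequential decisions.
int classify(int x, int y) {
    int score = 0;
    if (x > 0) score += 1;   // decision 1
    if (y > 0) score += 2;   // decision 2
    return score;
}

int main() {
    // Branch coverage is satisfied by two tests...
    assert(classify(1, 1) == 3);
    assert(classify(-1, -1) == 0);
    // ...but there are 2 * 2 = 4 end-to-end paths (which is what NPATH counts,
    // and they are the prime paths of this acyclic control flow graph), while
    // McCabe's cyclomatic complexity is decisions + 1 = 3.
    assert(classify(1, -1) == 1);
    assert(classify(-1, 1) == 2);
}
```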
One of the best ways to reduce the cost of testing is to refactor. Put bluntly, adding a test “because coverage said so” is never really a good answer. I do think, however, that “to make testing easier” is a very good reply to the question “why did you rewrite/refactor this code?”.
To summarise, the cost of testing and coverage is driven by how easy it is to exercise a specific code path, and by the number of test cases. Put differently, the cost of testing is driven by the complexity of the program. The ability to exercise specific code paths is closely related to function size, (global) state dependency, and the complexity of data and control flow, all of which are good to minimise when possible.
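As a sketch of what reducing (global) state dependency can look like (both functions are hypothetical): the first version reads the wall clock, so its “expired” path is awkward to reach from a test; passing the clock in makes both paths trivial to exercise.

```cpp
#include <cassert>
#include <ctime>

// Hypothetical "before": depends on global state (the wall clock), so the
// expired path is hard to reach from a test.
bool session_expired_before(std::time_t started_at) {
    return std::time(nullptr) - started_at > 3600;
}

// Hypothetical "after": the dependency is passed in, and both paths become
// one-line test cases.
bool session_expired(std::time_t now, std::time_t started_at) {
    return now - started_at > 3600;
}

int main() {
    assert(session_expired(5000, 0));
    assert(!session_expired(100, 0));
}
```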
Where does coverage fit in the development loop?
Unit tests are often (usually?) written in tandem with the code they support. This allows for early detection of problems, acts as a sort of scratch pad for testing the interface of the unit, and provides some real-world feedback. Limiting structural complexity makes units easier to test; code is easier to reach, and there are fewer paths to exercise. Yet coverage is often postponed or applied as an afterthought. Why?
I have lately experimented with weaving coverage into the write-compile-test cycle. I measure prime path coverage before and after running the test to verify that it actually runs the code I expect it to. I spend some time looking at the control flow graph, and try to figure out whether it looks different from what it should for the problem I’m solving. It has been a pleasant experience, and it has given me a lot of “huh, I never thought about it that way” moments, the occasional deep insight, and a fresh way of thinking about what I write.
While tempting, I try not to add test cases just because the coverage report says a path is not covered. By first figuring out why the path isn’t already covered by the tests I have written, I can make an informed decision on whether that path is meaningful to test at all. For example, do we bother testing for allocation failures when resizing strings if we just propagate the exception? Maybe; it depends. Maybe there should not be an exception at all, and we should instead preallocate buffers outside the function, or use a different error signaling mechanism. In a way, the feedback from coverage does not just guide testing, it guides programming.
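A minimal sketch of the two alternatives mentioned above (the names and the error-code choice are mine, not a recommendation): the throwing version makes the allocation-failure path practically unreachable in a unit test, while reserving capacity up front and returning an error code gives the test an explicit path to exercise.

```cpp
#include <cassert>
#include <string>
#include <system_error>

// Hypothetical "just propagate the exception" version: resize() may throw
// std::bad_alloc, but provoking that in a unit test is hard and rarely useful.
void append_padding_throwing(std::string& s, std::size_t n) {
    s.resize(s.size() + n, ' ');
}

// Hypothetical alternative: the caller preallocates, and failure becomes an
// ordinary, easily testable return value. resize() within capacity does not allocate.
std::errc append_padding(std::string& s, std::size_t n) {
    if (s.capacity() - s.size() < n)
        return std::errc::not_enough_memory;
    s.resize(s.size() + n, ' ');
    return std::errc{};
}

int main() {
    std::string s;
    s.reserve(32);
    assert(append_padding(s, 1000) == std::errc::not_enough_memory);  // failure path, no OOM required
    assert(append_padding(s, 8) == std::errc{});
    assert(s.size() == 8);
}
```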
The classic coverage-when-done, as part of a larger validation and QA effort, still makes sense, but I have found it quite nice to also integrate coverage into the core programming loop.
Wrapping up
This will be the first post in a series on code coverage. The next posts will include more specifics on programming styles, techniques, refactoring, and how it all affects testing.