In terms of unit test statistics, we are all pretty familiar with code coverage, which is often used as an indicator of code quality. I tend to view this statistic as a form of "negative testing", by which I mean that any code not covered by tests should be considered at definite risk. The inverse is not necessarily true: code that is covered by tests cannot automatically be considered risk-free and should still be treated as at potential risk. Covering all code branches with tests doesn't guarantee that our code works correctly. For example, take the following very simple method:
In this case it will only take a single test to obtain 100% coverage but I wouldn't consider that method sufficiently tested.
So this brings me to thinking about a notion of depth of coverage: a measure of how thoroughly a method, or code branch, has been tested. In my example a single test gives us 100% coverage but doesn't actually test the quality of the code very much. My regex pattern may be way off the mark, but if my single test passes then the stats say I'm good. If I were to add more test cases for different input strings, I would be adding more value to the test suite, but my statistics wouldn't reflect this. By adding the extra tests I might say I'm adding more depth of coverage. It is hard to quantify "how much depth is enough depth", as it is a subjective measure that depends on your own code, but it is an interesting idea.
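As a rough illustration of the point (a made-up validator, not the method above; the name `is_phone_number` and the pattern are assumptions for the sake of the example), here is a sketch in Python where one test reaches 100% coverage while extra inputs expose a flaw in the regex:

```python
import re

# Hypothetical pattern, deliberately flawed: re.match anchors the start
# of the string but nothing anchors the end.
PATTERN = re.compile(r"\d{3}-\d{4}")


def is_phone_number(text: str) -> bool:
    """Return True if text looks like a number of the form 555-1234."""
    return PATTERN.match(text) is not None


# A single passing test executes every line of is_phone_number,
# so a coverage tool would report 100% for it.
assert is_phone_number("555-1234")

# Extra inputs add "depth": same coverage figure, more behaviour checked.
extra_cases = {
    "555-12345": False,    # one digit too many
    "555-1234abc": False,  # trailing junk
    "55-1234": False,      # one digit too few
    "": False,             # empty input
}

# Collect the inputs for which the validator gives the wrong answer.
failures = [
    text for text, expected in extra_cases.items()
    if is_phone_number(text) != expected
]
print(failures)
```

The first two extra inputs are wrongly accepted, yet the coverage number is identical before and after adding them. Switching to `PATTERN.fullmatch` would be one way to fix this particular pattern.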
Is this a statistic that is recognised at all by the testing and development community? Are there any tools out there that report this kind of statistical analysis? Perhaps something that reports how often code is run by your test suite.
Some existing tools do report the number of executions per line, for instance, on top of binary line coverage. However, that's also a weak indicator of depth, because you might have just executed that line of code with the exact same data over and over again.
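To show why a raw execution count says little, here is a minimal sketch, assuming a hand-rolled decorator (`track_inputs` is hypothetical, not part of any coverage tool) that records both how often a function runs and how many distinct argument tuples it has seen:

```python
from collections import Counter


def track_inputs(func):
    """Wrap func and count how often each distinct argument tuple is seen."""
    seen = Counter()

    def wrapper(*args):
        seen[args] += 1
        return func(*args)

    wrapper.seen = seen  # expose the counts for inspection
    return wrapper


@track_inputs
def clamp(value, low, high):
    """Restrict value to the range [low, high]."""
    return max(low, min(value, high))


# Simulate a test suite that hits the same code with the same data.
for _ in range(100):
    clamp(5, 0, 10)

executions = sum(clamp.seen.values())  # total number of calls
distinct_inputs = len(clamp.seen)      # number of different argument tuples
print(executions, distinct_inputs)
```

A hundred executions sounds deep, but only one distinct input was ever exercised, and neither bound of `clamp` was ever actually applied.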
With that said, I think you're on to something here with "depth coverage". Professional testers will tell you all about equivalence classes and boundary values. Covering all boundary values and all equivalence classes with at least one sample would be a step closer to true "depth coverage". Unfortunately, we'd have to teach the computer to understand the semantics of our domain in order for it to be able to identify the equivalence classes and boundary values. Now that would be quite a feat!
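For what it's worth, here is a small sketch of what covering equivalence classes and boundary values looks like when a person supplies the domain knowledge. The function `ticket_price` and its age bands are invented purely for illustration:

```python
def ticket_price(age: int) -> int:
    """Hypothetical pricing rule: child under 13, adult 13 to 64, senior 65+."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age < 13:
        return 5
    if age < 65:
        return 10
    return 7


# One representative per equivalence class, plus the values on either
# side of each boundary. A human chose these; the tool did not.
cases = [
    (0, 5),    # lower boundary of the child class
    (6, 5),    # representative child
    (12, 5),   # last child age
    (13, 10),  # first adult age
    (40, 10),  # representative adult
    (64, 10),  # last adult age
    (65, 7),   # first senior age
    (80, 7),   # representative senior
]

# Collect (age, expected, actual) for every case that disagrees.
mismatches = [
    (age, expected, ticket_price(age))
    for age, expected in cases
    if ticket_price(age) != expected
]
print(mismatches)
```

Three tests (one per return statement) would already give full branch coverage here; the other five are exactly the "depth" that the coverage figure can't see.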
I recognise that this approach is open to abuse in much the same way code coverage is. We can always write tests that traverse the code but assert nothing. All of these measurements are useless unless you have confidence that your team will write meaningful tests and not try to cheat the system.
It would be unreasonable to expect that a testing tool would have intricate knowledge of our system domain to be able to identify the equivalence classes and boundary values. For the tool to be useful in this way it would have to be able to derive this without being taught and I think the sheer complexity of such a system would be approaching the realms of 'voodoo magic'. I doubt I'd have much confidence in it.
The key point with any statistical measure like this is that it only has value if you have confidence in the team writing the tests: the confidence to say that if a piece of code has been tested 4 times, then we have tested that code with 4 meaningful test scenarios.