Gregor's Ramblings

The Metrics Maid

April 28, 2006


Some of the most hated people in San Francisco must be the meter maids, the DPT people who drive around in golf carts and hand out tickets to anyone who overslept street cleaning or did not have enough quarters for the meter. On some projects, the most hated people are the metric maids, the people who go around and try to sum up a developer’s hard work and intellectual genius in a number between 1 and 10.

Many managers love metrics: “You can’t manage it if you can’t measure it”. I am actually a big proponent of extracting and visualizing information from large code bases or running systems (see Visualizing Dependencies). But when we try to boil the spectrum between good and evil down to a single number, we have to be careful about what that number actually expresses.

I Got You Covered

A frequently used metric for the quality of code is test coverage, typically expressed as a percentage: the number of executable lines of code hit by test cases divided by the total number of executable lines of code. If I have 80% test coverage, 80 out of every 100 executable lines are hit by my tests. It is fairly safe to claim that a code module with 80% test coverage is better than one with 20% coverage. However, is one with 90% really better than one with 80%? Are two modules with 80% coverage equally “good”? We quickly find that a single number cannot do justice to everything that goes on in a few thousand lines of source code. Still, an abstraction is a nice thing to have, so let’s see how far we can get with this metric.
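For the record, the arithmetic behind the metric is as simple as it sounds. The little sketch below (the class and method names are mine, not from any particular coverage tool) just restates the definition in code:

    // Minimal sketch of the line-coverage percentage as defined above.
    public class CoverageMetric {
        // e.g. 80 executable lines hit out of 100 executable lines -> 80.0
        public static double percentCovered(int linesHit, int totalExecutableLines) {
            return 100.0 * linesHit / totalExecutableLines;
        }

        public static void main(String[] args) {
            System.out.println(percentCovered(80, 100)); // prints 80.0
        }
    }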

May the Source Be With You

First we need to qualify the metric with how it was obtained. 80% coverage obtained via fine-grained, automated unit tests probably counts for more than 80% coverage obtained via coarse regression tests. We also need to keep in mind that code coverage counts every line that was hit by a test, intentionally or unintentionally. Nor does the metric make any statement as to whether a test actually verified that the specific line works correctly. It is a little scary to realize that test cases without assert statements achieve the same coverage as tests with asserts (a small sketch of this follows the proposals below). I have seen a few proposals to remedy this problem:

One proposal is to count only lines executed in the class specifically targeted by the unit test. For example, if fooTest exercises 60 out of 100 lines in Foo but also hits 50 lines in Bar, the lines in Bar do not count towards Bar's test coverage, because this test was not intended to test Bar. In fact, Bar being hit at all might be the result of insufficient mocking or stubbing in the test for Foo. Such behavior should not be rewarded.

A particularly interesting technique is mutation testing. This class of tools modifies the source code (or byte code) in the class under test to see whether the change breaks any tests. If no test breaks as a result of the random change, the mutated line does not count as thoroughly unit tested. Ivan Moore wrote Jester, which performs mutation testing for Java. It can produce nice reports that show what was actually changed and whether the change resulted in a failed test.
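To make both points concrete, here is a minimal sketch (the classes are made up and a JUnit-style test framework is assumed). Both tests execute exactly the same lines of PriceCalculator, so a coverage tool scores them identically; yet only the second test would fail if a mutation tool such as Jester flipped the discount logic.

    // PriceCalculator.java -- hypothetical class under test
    public class PriceCalculator {
        public double discountedPrice(double price, boolean premiumCustomer) {
            if (premiumCustomer) {
                return price * 0.9;   // 10% discount; a mutation might turn this into price * 1.9
            }
            return price;
        }
    }

    // PriceCalculatorTest.java -- both tests yield identical line coverage
    import org.junit.Test;
    import static org.junit.Assert.assertEquals;

    public class PriceCalculatorTest {
        @Test
        public void executesLinesButVerifiesNothing() {
            PriceCalculator calc = new PriceCalculator();
            calc.discountedPrice(100.0, true);    // no asserts: a mutated discount goes unnoticed
            calc.discountedPrice(100.0, false);
        }

        @Test
        public void executesTheSameLinesAndVerifiesThem() {
            PriceCalculator calc = new PriceCalculator();
            assertEquals(90.0, calc.discountedPrice(100.0, true), 0.001);
            assertEquals(100.0, calc.discountedPrice(100.0, false), 0.001);
        }
    }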

Metric Magic

A while ago we refactored a pretty miserable chunk of code. It took us more than a day to extract the meaningful concepts, refactor the code, and add meaningful unit tests. Being proud of our work, we ran a test coverage tool against the classes to give ourselves a pat on the back for how much better our test coverage was. Sadly, the pat on the back felt more like a kick in the pants: for one class the test coverage crept up by a mere 2 percentage points, and for the other one it actually decreased! It turned out that the portion of the code we had refactored already had decent test "coverage" (as determined by the tool), even though it was very poorly written. Rewriting it (and simplifying the logic) reduced the number of lines in this code segment, so the overall percentage of tested code in the class actually went down because the remaining, still poorly tested code did not shrink.
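The arithmetic behind that surprise is easy to reconstruct. The numbers below are invented, but they show the effect: when the well-tested portion of a class shrinks and the poorly tested remainder does not, the overall percentage drops even though every line we touched got better.

    // Made-up numbers illustrating how a refactoring can lower overall coverage.
    public class CoverageParadox {
        public static void main(String[] args) {
            // Before: a 300-line section at 80% coverage (240 lines covered)
            // plus a 700-line poorly tested remainder at 10% (70 lines covered).
            double before = 100.0 * (240 + 70) / (300 + 700);   // 31.0%

            // After: the well-tested section is refactored down to 100 fully
            // covered lines; the poorly tested remainder keeps its 700 lines.
            double after = 100.0 * (100 + 70) / (100 + 700);    // 21.25%

            System.out.printf("before: %.1f%%, after: %.2f%%%n", before, after);
        }
    }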

Dangling the Carrot

You should also be cautious about which metric carrot you dangle in front of the developers. In Managing Software Maniacs, Ken Whitaker already warned that you should "reward, never incent". The biggest danger of incentives is that you might get exactly what you asked for, but not what you had in mind. If achieving a certain test coverage goal brings a bonus, you have to hope that developers have high enough morals not to come up with a quick scheme to inflate coverage (such as tests without asserts). If your developers really are that sophisticated and honest, maybe they need no incentive at all to do the right thing. A lot can be said about the dangers of incentives for developers, but incentives based on metrics seem particularly dangerous because a simple number can rarely capture the real intent.

Recommended Use

Even with all these caveats metrics can be quite useful. I frequently use metrics in the following situations:

  • Hotspotting - If I measure cyclomatic complexity across a large code base and most methods come out with a value of 5-10, but a dozen or so methods have a value of 20, I should probably have a closer look at the outliers.
  • Trending - If I observe the unit test coverage in my system to be around 70%, that by itself does not mean much -- some people might consider it quite high, others will frown upon the number. However, if my coverage used to be 70% a few months ago and is 65% now, that is probably a bad sign: new code being checked into the system apparently has lower coverage than the existing code. There might be a variety of reasons for this (maybe the system design calls for the addition of a large number of dumb data transfer objects that are not deemed complex enough to warrant tests), but unless there is a good explanation you should be concerned.
  • Verifying assumptions - I actually run coverage tools most frequently to see whether my coverage matches my expectations. Coverage tools such as Emma create excellent reports that color-code lines that were executed as part of a test versus those that weren't. Again, I do not strive for a certain magic number; instead I quickly scan the red lines (the ones that were not covered) to check whether I thought they were covered by a test (see the small sketch below). A mismatch between my expectations and the actual line coverage might indicate that I misunderstood the logic, or that a test succeeds without ever hitting the functionality I had in mind.
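As a small illustration (the class is invented and a JUnit-style test is assumed), a coverage report over the code below would color both early returns red as long as only the happy-path test exists. Whether that matches what I expected my tests to cover is exactly the question the colored report answers.

    // OrderValidator.java -- I expect my tests to exercise the error handling here.
    public class OrderValidator {
        public boolean isValid(String item, int quantity) {
            if (item == null) {
                return false;      // red in the report means no test ever passed a null item
            }
            if (quantity <= 0) {
                return false;      // red here means the non-positive quantity case was never tested
            }
            return true;           // the happy path is almost always green
        }
    }

    // OrderValidatorTest.java -- a happy-path-only test leaves both early returns uncovered.
    import org.junit.Test;
    import static org.junit.Assert.assertTrue;

    public class OrderValidatorTest {
        @Test
        public void acceptsValidOrder() {
            assertTrue(new OrderValidator().isValid("book", 1));
        }
    }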

Oh, and I forgot one useful application of metrics: during performance review times they can be a handy weapon. Who can argue with "my code has 95% test coverage"? Just make sure your boss does not read this blog...
