Gregor's Ramblings

The Metrics Maid

April 28, 2006

Some of the most hated people in San Francisco must be the meter maids, the DPT people who drive around in golf carts and hand out tickets to anyone who overslept street cleaning or did not have enough quarters for the meter. On some projects, the most hated people are the metric maids, the people who go around and try to sum up a developer’s hard work and intellectual genius in a number between 1 and 10.

Many managers love metrics: “You can’t manage it if you can’t measure it.” I am actually a big proponent of extracting and visualizing information from large code bases or running systems (see Visualizing Dependencies). But when we try to boil the spectrum between good and evil down to a single number, we have to be careful about what that number actually expresses.

I Got You Covered

A frequently used metric for the quality of code is test coverage, typically expressed as a percentage: the number of (executable) lines of code hit by test cases divided by the total number of (executable) lines of code. If I have 80% test coverage, 80 out of every 100 lines are being hit by my tests. It is fairly safe to claim that a code module with 80% test coverage is better than one with 20% coverage. However, is one with 90% really better than one with 80%? Are two modules with 80% coverage equally “good”? We quickly find that a single number cannot do justice to everything that is going on in a few thousand lines of source code. Still, finding an abstraction is nice, so let’s see how far we can get with this metric.
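The arithmetic behind the metric is simple; here is a minimal sketch in Python (the function and variable names are mine for illustration, not from any particular coverage tool):

```python
# Line coverage as described above: executable lines hit by tests
# divided by the total number of executable lines.
def line_coverage(lines_hit, executable_lines):
    return len(lines_hit & executable_lines) / len(executable_lines)

executable = set(range(1, 101))        # a module with 100 executable lines
hit = set(range(1, 81))                # tests happen to touch 80 of them
print(line_coverage(hit, executable))  # 0.8
```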

May the Source Be With You

First we need to qualify the metric by how it is obtained. 80% coverage obtained via fine-grained, automated unit tests probably counts for more than 80% coverage obtained via coarse regression tests. We also need to keep in mind that code coverage counts every line that was hit by a test, intentionally or unintentionally. Nor does the metric make any statement about whether any test actually verified that a specific line works correctly. It is a little scary to realize that test cases with no assert statements achieve the same coverage as tests with asserts. I have seen a few proposals to remedy this problem:

You might only count lines executed in the class specifically targeted by the unit test. For example, if fooTest exercises 60 out of 100 lines in Foo but also hits 50 lines in Bar, the lines in Bar do not count towards Bar's test coverage because this test was not intended to test Bar. In fact, Bar being hit at all might have been the result of insufficient mocking or stubbing of the test for Foo. Such behavior should not be rewarded.
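A sketch of how such attribution might work, using hypothetical data (the names fooTest, Foo, and Bar follow the example above; the data structures are assumptions, not any real tool's format):

```python
# Hypothetical per-test hit counts: lines hit in each class by each test.
raw_hits = {"fooTest": {"Foo": 60, "Bar": 50}}

# Each test declares the one class it is intended to test.
declared_target = {"fooTest": "Foo"}

def attributed_hits(test, cls):
    # Count hits only toward the class the test was written for;
    # incidental hits in other classes are discarded, not rewarded.
    if declared_target.get(test) != cls:
        return 0
    return raw_hits.get(test, {}).get(cls, 0)

print(attributed_hits("fooTest", "Foo"))  # 60
print(attributed_hits("fooTest", "Bar"))  # 0
```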

A particularly interesting technique is mutation testing. This class of tools modifies the source code (or byte code) in the class under test to see whether the change breaks any tests. If no tests break as a result of the change, the mutated line does not count as thoroughly unit tested. Ivan Moore wrote Jester, which performs mutation testing for Java. It can produce nice reports that show what was actually changed and whether the change resulted in a failed test.

Metric Magic

A while ago we refactored a pretty miserable chunk of code. It took us more than a day to extract the meaningful concepts, refactor the code, and add meaningful unit tests. Proud of our work, we ran a test coverage tool against the classes to give ourselves a pat on the back for how much better our test coverage was. Sadly, the pat on the back felt more like a kick in the pants. For one class the test coverage crept up by a mere 2 percentage points, and for the other it actually decreased! It turned out that the portion of the code we refactored already had decent test "coverage" (as determined by the tool), even though it was very poorly written. Rewriting it (and simplifying the logic) reduced the number of lines in this code segment, so the overall percentage of tested code in the class actually decreased because the remaining, still poorly tested code did not shrink.
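The effect is pure arithmetic; made-up numbers (not the real ones from our project) show how it can happen:

```python
# Before refactoring: 200 well-covered lines plus 300 poorly tested ones.
covered_before, total_before = 200, 500
print(covered_before / total_before)   # 0.4

# Refactoring shrinks the covered portion to 100 lines (still fully
# covered), while the 300 poorly tested lines stay untouched.
covered_after, total_after = 100, 400
print(covered_after / total_after)     # 0.25 -- coverage fell, quality rose
```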

Dangling the Carrot

You should also be cautious about which metric carrot you dangle in front of developers. In Managing Software Maniacs, Ken Whitaker warned that you should "reward, never incent". The biggest danger of incentives is that you might get exactly what you asked for, but not what you had in mind. If achieving a certain test coverage goal brings a bonus, you have to count on developers having high enough morals not to come up with a quick scheme to inflate coverage (such as tests without asserts). And if your developers really are that sophisticated and honest, maybe they need no incentive at all to do the right thing. A lot can be said about the dangers of incentives for developers, but incentives based on metrics seem particularly dangerous because a simple number can rarely represent the real intent.

Recommended Use

Even with all these caveats, metrics can be quite useful. I frequently use metrics in the following situations:

Oh, and I forgot one useful application of metrics: during performance review times they can be a handy weapon. Who can argue with "my code has 95% test coverage"? Just make sure your boss does not read this blog...


Gregor is a software architect with Google. He is a frequent speaker on asynchronous messaging and service-oriented architectures and co-authored Enterprise Integration Patterns (Addison-Wesley). His mission is to make integration and distributed system development easier by harvesting common patterns and best practices from many different technologies.