Engineering zero-defect software
<p>I’ve been pounding on GPSD with the Coverity static analyzer’s self-build procedure for several days. It is my great pleasure to report that we have just reached zero defect reports in 72.8 KLOC. Coverity says this code is clean. And because I think this should be an example unto others, I shall explain how others can do likewise.</p>
<p><span id="more-4340"></span></p>
<p>OK, if you’re scratching your head: Coverity is a code-analysis tool, an extremely good one, probably at this moment the best in the world (though LLVM’s open-source ‘scan-build’ is chasing it and seems likely to pass it sometime down the road). It’s proprietary and normally costs mucho bucks, but as a marketing/goodwill gesture the company allows open source projects to register with them and get remote use of an instance hosted at the company’s data center.</p>
<p>I dislike proprietary tools in general, but I also believe GPSD’s reliability is extremely important. Navigation systems are life-critical; bugs in them can kill people. Therefore I’ll take all the help I can get pushing down our error rate, and to hell with ideological purity if that gets in the way.</p>
<p>Coverity won’t find everything, of course; it’s certainly not going to rescue you from a bad choice of algorithm. But it’s very, very good at finding the sorts of lower-level mistakes that human beings are very <em>bad</em> at spotting: memory allocation errors, resource leaks, null-pointer dereferences and the like. These are what drive bad code to crashes, catatonia, and heisenbugs.</p>
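<p>To make that defect class concrete, here is a minimal C sketch of the pattern behind many such reports: a resource acquired early that a rarely-taken error path forgets to release. The function and file names here are hypothetical, not GPSD’s; the single-exit cleanup idiom shown is one common way to keep those paths leak-free, and it is the shape of fix a scanner report usually prompts.</p>
<pre><code>#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

/* Hypothetical loader illustrating the leak-on-error-path defect class.
 * A naive version would "return -1" straight out of the parse-failure
 * branch, leaking both fp and buf; static analyzers flag exactly that. */
static int load_config(const char *path)
{
    int status = -1;
    char *buf = NULL;
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;               /* nothing acquired yet, safe to bail */

    buf = malloc(4096);
    if (buf == NULL)
        goto cleanup;            /* error path still releases fp */

    if (fgets(buf, 4096, fp) == NULL)
        goto cleanup;            /* rarely-taken path, commonly leaky */

    status = 0;                  /* parse succeeded */

cleanup:
    free(buf);                   /* free(NULL) is defined to be a no-op */
    fclose(fp);
    return status;
}

int main(void)
{
    return load_config("/etc/example.conf") == 0 ? 0 : 1;
}
</code></pre>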
<p>Excluding false positives and places where Coverity was being a bit anal-retentive without finding an actual bug, I found 13 real defects on this pass, all on rarely-used code paths, which makes sense for reasons I’ll explain shortly. That’s less than 1 defect per 5 KLOC (KLOC = 1000 logical lines of code), which is pretty good considering our last scan was in 2007. Another way to look at that data is that, even while adding large new features like AIS support and NMEA2000 and re-engineering the entire reporting protocol, we’ve introduced a bit fewer than three detectable defects <em>per year</em> over the last five years.</p>
<p>Those of you who are experienced software engineers will be picking your jaws up off the floor at that statistic. For those of you who aren’t: this is at least two <em>orders of magnitude</em> better than typical. There are probably systems architects at Fortune 500 companies who would kill their own mothers for defect rates that low. Mythically, military avionics software and the stuff they load on the Space Shuttle are supposed to be this good, except I’ve heard from insiders that rather often they aren’t.</p>
<p>So, how did we do it? On no budget and with all of three core developers, only one working anywhere even near full time?</p>
<p>You’ll be expecting me to say the power of open source, and that’s not wrong. Sunlight is the best disinfectant, many eyeballs make bugs shallow, etc. etc. While I agree that’s close to a necessary condition for defect rates this low, it’s not sufficient. There are very specific additional things we did, things I sometimes had to push on my senior devs about because they at times looked like unnecessary overhead or obsessive tailchasing.</p>
<p>Here’s how you engineer software for zero defects:</p>
<h2>1. Be open source.</h2>
<p>And not just because you get helpful bug reports from strangers, either, although that does happen and can be very important. Actually, my best bug-finders are semi-regulars who don’t have commit access to the code but keep a close eye on it anyway. Like, there’s this Russian guy who often materializes on IRC late at night and can barely make himself understood in English, but his patches speak clearly and loudly.</p>
<p>But almost as importantly, being open source plugs you into things like the Debian porterboxes. A couple of weeks ago I spent several days chasing down port failures that I thought might indicate fragile or buggy spots in the code. It was hugely helpful that I could ssh into all manner of odd machines running Linux, including a System/390 mainframe, and run my same test suite on all of them to spot problems due to endianness or word-size or signed-char-vs.-unsigned-char differences.</p>
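<p>A tiny, self-contained C example of the kind of difference those machines expose (the example is mine, not GPSD code): whether plain <em>char</em> is signed is implementation-defined, and byte order changes what a multi-byte value looks like in memory, so code that works on an x86 box can silently misbehave on a big-endian or unsigned-char platform.</p>
<pre><code>#include &lt;stdio.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

int main(void)
{
    /* Plain char is signed on x86 Linux but unsigned on ARM and
     * S/390 Linux.  Code that stuffs 0xFF into a plain char and then
     * compares against -1 behaves differently across those ports. */
    char c = (char)0xFF;
    printf("plain char 0xFF compares %s to -1\n",
           (c == -1) ? "equal" : "unequal");

    /* Endianness: the first byte in memory of a uint32_t differs
     * between little- and big-endian machines. */
    uint32_t word = 0x01020304u;
    uint8_t first;
    memcpy(&amp;first, &amp;word, 1);
    printf("first byte in memory: 0x%02x (%s-endian)\n",
           first, (first == 0x04) ? "little" : "big");
    return 0;
}
</code></pre>
<p>Run the same program on a porterbox and on your desktop and the divergence is obvious in seconds; that is exactly the class of port failure I was chasing.</p>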
<p>Closed-source shops, in general, don’t have any equivalent of the Debian porterboxes because they can’t afford them; their support coalition isn’t broad enough. When you play with the open-source kids, you’re in the biggest gang with the best toys.</p>
<h2>2. Invest your time heavily in unit tests and regression tests.</h2>
<p>GPSD has around 90 unit tests and regression tests, including sample device output for almost every sensor type we support. I put a lot of effort into making the tests easy and fast to run so they can be run often, and they are, almost every time executable code is modified. This makes it actively difficult for random code changes to break our device drivers without somebody noticing right quick.</p>
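<p>The core of such a regression test is mechanical, and that is the point. Here is a minimal C sketch of the golden-file pattern, under my own hypothetical names rather than GPSD’s actual harness: feed captured device output through the decoder, then compare the result against blessed output so any behavioral drift fails loudly.</p>
<pre><code>#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

/* Hypothetical decoder standing in for a real driver entry point. */
static void decode(const char *raw, char *out, size_t outlen)
{
    snprintf(out, outlen, "DECODED|%s", raw);
}

int main(void)
{
    /* Captured sentences paired with the blessed ("golden") decodes.
     * A real harness would read both from log and check files. */
    static const struct { const char *raw, *golden; } cases[] = {
        { "SENTENCE-A", "DECODED|SENTENCE-A" },
        { "SENTENCE-B", "DECODED|SENTENCE-B" },
    };
    size_t i, total = sizeof(cases) / sizeof(cases[0]), failures = 0;

    for (i = 0; i &lt; total; i++) {
        char got[256];
        decode(cases[i].raw, got, sizeof(got));
        if (strcmp(got, cases[i].golden) != 0) {
            fprintf(stderr, "FAIL %zu: got \"%s\", want \"%s\"\n",
                    i, got, cases[i].golden);
            failures++;
        }
    }
    printf("%zu/%zu cases passed\n", total - failures, total);
    return failures ? 1 : 0;
}
</code></pre>
<p>Because the check data is plain text, regenerating it after an intended behavior change is cheap, which is part of what keeps tests fast and painless enough to run on every modification.</p>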
<p>Which isn’t to say those drivers can’t be wrong, just that the ways they can be wrong are constrained to be either (a) a protocol-spec-level misunderstanding of what the driver is supposed to be doing, or (b) an implementation bug somewhere in the program’s state space that is obscure and difficult to reach. Coverity only turned up two driver bugs: static buffer overruns in methods for changing the device’s reporting protocol and line speed, which escaped notice because those paths can’t be exercised in our test harnesses, only on a live device.</p>
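<p>For readers who haven’t met that defect class, a contrived sketch (again not GPSD code): a fixed-size command buffer filled by an unbounded formatting call can overrun on unexpectedly long input, and the conventional repair is a length-checked write.</p>
<pre><code>#include &lt;stdio.h&gt;

/* Hypothetical control method that builds a speed-change command.
 * The unsafe version would be:
 *     char buf[16];
 *     sprintf(buf, "$CTRL,SPEED=%d*00\r\n", speed);
 * which overruns buf for large speed values; that is the shape of bug
 * described above.  snprintf plus a truncation check is the fix. */
static int set_speed(int speed)
{
    char buf[32];
    int n = snprintf(buf, sizeof(buf), "$CTRL,SPEED=%d*00\r\n", speed);
    if (n &lt; 0 || (size_t)n &gt;= sizeof(buf))
        return -1;               /* refuse to send a truncated command */
    fputs(buf, stdout);          /* a real driver would write to the device */
    return 0;
}

int main(void)
{
    return set_speed(115200) == 0 ? 0 : 1;
}
</code></pre>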
<p>This is also why Coverity didn’t find defects on commonly-used code paths. If there’d been any, the regression tests probably would have smashed them out long ago. I put in a great deal of boring, grubby, finicky work getting our test framework in shape, but it has paid off hugely.</p>
<h2>3. Use every fault scanner you can lay your hands on.</h2>
<p>Ever since our first Coverity scan in 2007 I’d been trying to get a repeat set up, but Coverity was unresponsive and their internal processes were, until recently, clearly rather a shambles. But there were three other static analyzers I had been applying on a regular basis: splint, cppcheck, and scan-build.</p>
<p>Of these, splint is (a) the oldest, (b) the most effective at turning up bugs, and (c) far and away the biggest pain in the ass to use. My senior devs dislike, with some reason, the cryptic, cluttery magic comments you have to drop all over your source to pass hints to splint and to suppress its extremely voluminous and picky output; a sample of what those annotations look like follows. The thing is, splint checking turns up real bugs at a low but consistent rate: one or two each release cycle.</p>
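<p>For the curious, here is roughly what those magic comments look like. This is a generic illustration of splint’s annotation style, not code from GPSD; the annotations shown (<em>null</em> and <em>only</em>) are standard splint vocabulary, but treat the exact placement as a sketch.</p>
<pre><code>#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;

/* The annotations ride inside special comments that ordinary compilers
 * ignore.  "null" marks a return value that may legitimately be NULL;
 * "only" transfers ownership of heap storage to the caller, and splint
 * then checks that every path frees it exactly once. */
static /*@null@*/ /*@only@*/ char *dup_string(const char *s)
{
    char *copy = malloc(strlen(s) + 1);
    if (copy == NULL)
        return NULL;            /* permitted, because of the null annotation */
    strcpy(copy, s);
    return copy;
}

int main(void)
{
    char *p = dup_string("hello");
    if (p != NULL)
        free(p);                /* splint complains if this free is missing */
    return 0;
}
</code></pre>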
<p>cppcheck is much newer and much less prone to false positives; likewise scan-build. But here’s what experience tells me: each of these three tools finds overlapping but <em>different</em> sets of bugs. Coverity is, by reputation at least, capable enough that it might dominate one or more of them, but why take chances? Best to use all four and constrain the population of undiscovered bugs to as small a fraction of the state space as we can.</p>
<p>And you can bet heavily that as new fault scanners for C/C++ code become available I’ll be jumping right on top of them. I like it when programs find low-level bugs for me; that frees me to concentrate on the high-level ones they can’t find.</p>
<h2>4. Be methodical and merciless.</h2>
<p>I don’t think magic or genius is required to get defect densities as low as GPSD’s. It’s more a matter of sheer bloody-minded persistence: the willingness to do the up-front work required to apply and discipline fault scanners, write test harnesses, and automate your verification process so you can run a truly rigorous validation with the push of a button.</p>
<p>Many more projects could do this than do. And many more projects should.</p>