This repository has been archived on 2017-04-03. You can view files and clone it, but cannot push or open issues or pull requests.
blog_post_tests/20140923090332.blog

19 lines
4.6 KiB
Plaintext
Raw Normal View History

2014-11-19 15:42:25 +00:00
Never let an invariant go untested
<p>I&#8217;ve been blog-silent the last couple of days because I&#8217;ve been chasing down the bug I mentioned in <a href="http://esr.ibiblio.org/?p=6255">Request for help I need a statistician</a>.</p>
<p>I have since found and fixed it. Thereby hangs a tale, and a cautionary lesson.</p>
<p><span id="more-6258"></span></p>
<p>Going in, my guess was that the problem was in the covariance-matrix algebra used to compute the DOP (dilution-of-precision) figures from the geometry of the satellite skyview.</p>
<p>(I was originally going to write a longer description than that sentence &#8211; but I ruefully concluded that if that sentence was a meaningless noise to you the longer explanation would be too. All you mathematical illiterates out there can feel free to go off and have a life or something.)</p>
<p>My suspicion particularly fell on a function that did partial matrix inversion. Because I only need the diagonal elements of the inverted matrix, the most economical way to compute them seemed to be by minor subdeterminants rather than a whole-matrix method like Gauss-Jordan elimination. My guess was that I&#8217;d fucked that up in some fiendishly subtle way.</p>
<p>The one clue I had was a broken symmetry. The results of the computation should be invariant under permutations of the rows of the matrix &#8211; or, less abstractly, it shouldn&#8217;t matter which order you list the satellites in. But it did.</p>
<p>How did I notice this? Um. I was refactoring some code &#8211; actually, refactoring the data structure the skyview was kept in. For <s>hysterical raisins</s> historical reasons the azimuth/elevation and signal-strength figures for the sats had been kept in parallel integer arrays. There was a persistent bad smell about the code that managed these arrays that I thought might be cured if I morphed them into an array of structs, one struct per satellite.</p>
<p>Yeeup, sure enough. I flushed two minor bugs out of cover. Then I rebuilt the interface to the matrix-algebra routines. And the sats got fed to them in a different order than previously. And the regression tests broke loudly, oh shit.</p>
<p>There are already a couple of lessons here. First, <em>have</em> a freakin&#8217; regression test. Had I not I might have sailed on in blissful ignorance that the code was broken.</p>
<p>Second, though &#8220;If it ain&#8217;t broke, don&#8217;t fix it&#8221; is generally good advice, it is overridden by this: If you don&#8217;t know that it&#8217;s broken, but it smells bad, trust your nose and <em>refactor the living hell out of it</em>. Odds are good that something will shake loose and fall on the floor.</p>
<p>This is the point at which I thought I needed a statistician. And I found one &#8211; but, I thought, to constrain the problem nicely before I dropped it on him, it would be a good idea to isolate out the suspicious matrix-inversion routine and write a unit test for it. Which I did. And it passed with flying colors.</p>
<p>While it was nice to know I had not actually screwed the pooch in that <em>particular</em> orifice, this left me without a clue where the actual bug was. So I started instrumenting, testing for the point in the computational pipeline where row-symmetry broke down.</p>
<p>Aaand I found it. It was a stupid little subscript error in the function that filled the covariance matrix from the satellite list &#8211; k in two places where i should have been. Easy mistake to make, impossible for any of the four static code checkers I use to see, and damnably difficult to spot with the Mark 1 eyeball even if you <em>know</em> that the bug has to be in those six lines somewhere. Particularly because the wrong code didn&#8217;t produce crazy numbers; they looked plausible, though the shape of the error volume was distorted.</p>
<p>Now let&#8217;s review my mistakes. There were two, a little one and a big one. The little one was making a wrong guess about the nature of the bug and thinking I needed a kind of help I didn&#8217;t. But I don&#8217;t feel bad about that one; ex ante it was still the most reasonable guess. The highest-complexity code in a computation is generally the most plausible place to suspect a bug, especially when you know you don&#8217;t grok the algorithm.</p>
<p>The big mistake was <em>poor test coverage</em>. I should have written a unit test for the specialized matrix inverter when I first coded it &#8211; and I should have tested for satellite order invariance.</p>
<p>The general rule here is: to constrain defects as much as possible, <em>never let an invariant go untested.</em></p>