19 lines
4.6 KiB
Plaintext
19 lines
4.6 KiB
Plaintext
Never let an invariant go untested
|
||
<p>I’ve been blog-silent the last couple of days because I’ve been chasing down the bug I mentioned in <a href="http://esr.ibiblio.org/?p=6255">Request for help – I need a statistician</a>.</p>
|
||
<p>I have since found and fixed it. Thereby hangs a tale, and a cautionary lesson.</p>
|
||
<p><span id="more-6258"></span></p>
|
||
<p>Going in, my guess was that the problem was in the covariance-matrix algebra used to compute the DOP (dilution-of-precision) figures from the geometry of the satellite skyview.</p>
|
||
<p>(I was originally going to write a longer description than that sentence – but I ruefully concluded that if that sentence was a meaningless noise to you the longer explanation would be too. All you mathematical illiterates out there can feel free to go off and have a life or something.)</p>
|
||
<p>My suspicion particularly fell on a function that did partial matrix inversion. Because I only need the diagonal elements of the inverted matrix, the most economical way to compute them seemed to be by minor subdeterminants rather than a whole-matrix method like Gauss-Jordan elimination. My guess was that I’d fucked that up in some fiendishly subtle way.</p>
|
||
<p>The one clue I had was a broken symmetry. The results of the computation should be invariant under permutations of the rows of the matrix – or, less abstractly, it shouldn’t matter which order you list the satellites in. But it did.</p>
|
||
<p>How did I notice this? Um. I was refactoring some code – actually, refactoring the data structure the skyview was kept in. For <s>hysterical raisins</s> historical reasons the azimuth/elevation and signal-strength figures for the sats had been kept in parallel integer arrays. There was a persistent bad smell about the code that managed these arrays that I thought might be cured if I morphed them into an array of structs, one struct per satellite.</p>
|
||
<p>Yeeup, sure enough. I flushed two minor bugs out of cover. Then I rebuilt the interface to the matrix-algebra routines. And the sats got fed to them in a different order than previously. And the regression tests broke loudly, oh shit.</p>
|
||
<p>There are already a couple of lessons here. First, <em>have</em> a freakin’ regression test. Had I not I might have sailed on in blissful ignorance that the code was broken.</p>
|
||
<p>Second, though “If it ain’t broke, don’t fix it” is generally good advice, it is overridden by this: If you don’t know that it’s broken, but it smells bad, trust your nose and <em>refactor the living hell out of it</em>. Odds are good that something will shake loose and fall on the floor.</p>
|
||
<p>This is the point at which I thought I needed a statistician. And I found one – but, I thought, to constrain the problem nicely before I dropped it on him, it would be a good idea to isolate out the suspicious matrix-inversion routine and write a unit test for it. Which I did. And it passed with flying colors.</p>
|
||
<p>While it was nice to know I had not actually screwed the pooch in that <em>particular</em> orifice, this left me without a clue where the actual bug was. So I started instrumenting, testing for the point in the computational pipeline where row-symmetry broke down.</p>
|
||
<p>Aaand I found it. It was a stupid little subscript error in the function that filled the covariance matrix from the satellite list – k in two places where i should have been. Easy mistake to make, impossible for any of the four static code checkers I use to see, and damnably difficult to spot with the Mark 1 eyeball even if you <em>know</em> that the bug has to be in those six lines somewhere. Particularly because the wrong code didn’t produce crazy numbers; they looked plausible, though the shape of the error volume was distorted.</p>
|
||
<p>Now let’s review my mistakes. There were two, a little one and a big one. The little one was making a wrong guess about the nature of the bug and thinking I needed a kind of help I didn’t. But I don’t feel bad about that one; ex ante it was still the most reasonable guess. The highest-complexity code in a computation is generally the most plausible place to suspect a bug, especially when you know you don’t grok the algorithm.</p>
|
||
<p>The big mistake was <em>poor test coverage</em>. I should have written a unit test for the specialized matrix inverter when I first coded it – and I should have tested for satellite order invariance.</p>
|
||
<p>The general rule here is: to constrain defects as much as possible, <em>never let an invariant go untested.</em></p>
|