Never let an invariant go untested
I’ve been blog-silent the last couple of days because I’ve been chasing down the bug I mentioned in Request for help – I need a statistician.
I have since found and fixed it. Thereby hangs a tale, and a cautionary lesson.
Going in, my guess was that the problem was in the covariance-matrix algebra used to compute the DOP (dilution-of-precision) figures from the geometry of the satellite skyview.
(I was originally going to write a longer description than that sentence – but I ruefully concluded that if that sentence was a meaningless noise to you the longer explanation would be too. All you mathematical illiterates out there can feel free to go off and have a life or something.)
My suspicion particularly fell on a function that did partial matrix inversion. Because I only need the diagonal elements of the inverted matrix, the most economical way to compute them seemed to be by minor subdeterminants rather than a whole-matrix method like Gauss-Jordan elimination. My guess was that I’d fucked that up in some fiendishly subtle way.
The one clue I had was a broken symmetry. The results of the computation should be invariant under permutations of the rows of the matrix – or, less abstractly, it shouldn’t matter which order you list the satellites in. But it did.
How did I notice this? Um. I was refactoring some code – actually, refactoring the data structure the skyview was kept in. For hysterical raisins historical reasons the azimuth/elevation and signal-strength figures for the sats had been kept in parallel integer arrays. There was a persistent bad smell about the code that managed these arrays that I thought might be cured if I morphed them into an array of structs, one struct per satellite.
Yeeup, sure enough. I flushed two minor bugs out of cover. Then I rebuilt the interface to the matrix-algebra routines. And the sats got fed to them in a different order than previously. And the regression tests broke loudly, oh shit.
There are already a couple of lessons here. First, have a freakin’ regression test. Had I not I might have sailed on in blissful ignorance that the code was broken.
Second, though “If it ain’t broke, don’t fix it” is generally good advice, it is overridden by this: If you don’t know that it’s broken, but it smells bad, trust your nose and refactor the living hell out of it. Odds are good that something will shake loose and fall on the floor.
This is the point at which I thought I needed a statistician. And I found one – but, I thought, to constrain the problem nicely before I dropped it on him, it would be a good idea to isolate out the suspicious matrix-inversion routine and write a unit test for it. Which I did. And it passed with flying colors.
While it was nice to know I had not actually screwed the pooch in that particular orifice, this left me without a clue where the actual bug was. So I started instrumenting, testing for the point in the computational pipeline where row-symmetry broke down.
Aaand I found it. It was a stupid little subscript error in the function that filled the covariance matrix from the satellite list – k in two places where i should have been. Easy mistake to make, impossible for any of the four static code checkers I use to see, and damnably difficult to spot with the Mark 1 eyeball even if you know that the bug has to be in those six lines somewhere. Particularly because the wrong code didn’t produce crazy numbers; they looked plausible, though the shape of the error volume was distorted.
Now let’s review my mistakes. There were two, a little one and a big one. The little one was making a wrong guess about the nature of the bug and thinking I needed a kind of help I didn’t. But I don’t feel bad about that one; ex ante it was still the most reasonable guess. The highest-complexity code in a computation is generally the most plausible place to suspect a bug, especially when you know you don’t grok the algorithm.
The big mistake was poor test coverage. I should have written a unit test for the specialized matrix inverter when I first coded it – and I should have tested for satellite order invariance.
The general rule here is: to constrain defects as much as possible, never let an invariant go untested.