Phase-of-moon-dependent bugs suck
I just had a rather hair-raising experience with a phase-of-moon-dependent bug.
I released GPSD 3.11 this last Saturday (three days ago) to meet a deadline for a Debian freeze. Code tested ninety-six different ways, run through four different static analyzers, the whole works. Because it was a hurried release I deliberately deferred a bunch of cleanups and feature additions in my queue. Got it out on time and it’s pretty much all good – we’ve since turned up two minor build failures in two unusual feature-switch cases, and one problem with the NTP interface code that won’t affect reasonable hardware.
I’ve been having an extremely productive time since chewing through all the stuff I had deferred. New features for gpsmon, improvements for GPSes watching GLONASS birds, a nice space optimization for embedded systems, some code to prevent certain false-match cases in structured AIS Type 6 and Type 8 messages, merging some Android port tweaks, a righteous featurectomy or two. Good clean fun – and of course I was running my regression tests frequently and noting when I’d done so in my change comments.
Everything was going swimmingly until about two hours ago. Then, as I was verifying a perfectly innocent-appearing tweak to the SiRF-binary driver, the regression tests went horribly, horribly wrong. Not just the SiRF binary testloads, all of them.
My friends, do you know what it looks like when the glibc detects a buffer overflow at runtime? Pages and pages of hex garble, utterly incomprehensible and a big flare-lit clue that something bad done happened.
“Yoicks!” I muttered, and backed out the latest change. Ran “scons check” again. Kaboom! Same garble. Wait – I’d run regressions successfully on that revision just a few minutes previously, or so I thought.
Don’t panic. Back up to the last revision were the change comment includes the reassuring line “All regression tests passed.” Rebuild. “scons check”. Aaaand…kaboom!
Oh shit oh dear. Now I have real trouble. That buffer overflow has apparently been lurking in ambush for some time, with regression tests passing despite it because the phase of the moon was wrong or something.
The first thing you do in this situation is try to bound the damage and hope it didn’t ship in the last release. I dropped back to the release 3.11 revision, rebuilt and tested. No kaboom. Phew!
These are the times when git bisect is your friend. Five test runs later I found the killer commit – a place where I had tried recovering from bad file descriptor errors in the daemon’s main select call (which can happen if an attached GPS dies under pessimal circumstances) and garbage-collecting the storage for the lost devices.
Once I had the right commit it was not hard to zero in on the code that triggered the problem. By inspection, the problem had to be in a particular 6-line loop that was the meat of the commit. I checked out the head version and experimentally conditioned out parts of it until I had the kaboom isolated to one line.
It was a subtle – and entirely typical – sort of systems-programming bug. The garbage-collection code iterated over the array of attached devices conditionally freeing them. What I forgot when I coded this was that that sort of operation is only safe on device-array slots that are currently allocated and thus contain live data. The test operation on a dead slot – an FD_ISSET() – was the kaboomer.
The bug was random because the pattern of stale data in the dead slots was not predictable. It had to be just right for the kaboom to happen. The kaboom didn’t happen for nearly three days, during which I am certain I ran the regression tests well over 20 times a day. (Wise programmers pay attention to making their test suites fast, so they can be run often without interrupting concentration.)
It cannot be said too often: version control is your friend. Fast version control is damn near your best friend, with the possible exception of a fast and complete test suite. Without these things, fixing this one could have ballooned from 45 minutes of oh-shit-oh-dear to a week – possibly more – of ulcer-generating agony.
Version control is maybe old news, but lots of developers still don’t invest as much effort on their test suites as they should. I’m here to confirm that it makes programming a hell of a lot less hassle when you build your tests in parallel with your code, do the work to make them cover well and run fast, then run them often. GPSD has about 100 tests; they run in just under 2 minutes, and I run them at least three or four times an hour.
This puts out little fires before they become big ones. It means I get to spend less time debugging and more time doing fun stuff like architecture and features. The time I spent on them has been multiply repaid. Go and do thou likewise.