blog_post_tests/20110317021948.blog

The bug that didn&#8217;t bite in the night-time: an anti-disaster story
<p>A very curious thing happened with GPSD this week.  In fact it&#8217;s so odd I&#8217;m still having trouble believing it.  In software engineering we often have trouble getting seemingly simple things to work reliably.  How does one react when an incredibly complex, fragile piece of bit-twiddling code works &#8211; perfectly &#8211; after six years without real-world testing, during which the surrounding architecture underwent such massive changes that any rational person would have expected the feature to bit-rot into garbage?</p>
<p>No, really, this one is <em>weird</em>.  Let me unfold to you the strange tale of The RTCM2 Analyzer That Shouldn&#8217;t Have Worked.  Really.  At All.</p>
<p><span id="more-3044"></span></p>
<p>A principal source of errors in GPS position fixes is variable time delays in satellite signals induced by variations in ionospheric conditions.  A way to compensate for this involves what&#8217;s called <em>differential correction</em> &#8211; receiving signals from a ground station at a fixed location and comparing their satellite pseudoranges with yours. Because the station&#8217;s location is known with high precision, your receiver can use its information to compute the local ionospheric delay (as long as you&#8217;re within a couple hundred kilometers of the station) and refine its fixes.</p>
<p>There are a couple of different ways you can get such corrections.  The old-school method is to get a specialized radio that listens for broadcasts from differential-correction stations and reports them to you is a digital bitstream over a serial (RS232 or USB) link.  The new-school way is to get that bitstream from servers that make it available over the Internet.</p>
<p>If all you want is to correct your GPS, all you actually need to do is shove that bitstream at the GPS&#8217;s serial or USB port as it comes in (though some GPSes want you to push the data into a special-purpose auxiliary serial port).  Many GPSes simply interpret it on the fly. If yours claims to have DGPS support, it will have this capability.  You don&#8217;t actually have to know what&#8217;s in the stream of corrections.</p>
<p>Actually understanding the corrections as they come in &#8211; making something from them that a human can see sense in &#8211; is much more difficult. But there are reasons you might want this.  If you&#8217;re doing atmospheric physics, the network of differential correction stations is, in effect, a huge distributed ionosphere observatory.  If you&#8217;re concerned about really high-precision geolocation, you may need to monitor the health of the differential-correction network in order to know when your confidence intervals are nice and tight.</p>
<p>Or you could be like me, annoyed beyond reasonableness at the idea of data flowing through your software that is opaque to you.  In mid-2005 I decided to try to actually analyze the correction data flowing through GPSD.</p>
<p>This was no small undertaking.  The protocols used for differential corrections are <em>nasty</em>.  The most widely used of them is called RTCM2.  It actually has two layers, each deeply hideous in its own special way.</p>
<p>The lower layer is the downlink protocol used by GPS satellites.  It&#8217;s a bitstream, not a bytestream, and that raw bitstream (segmented into 8-bit characters along boundaries that don&#8217;t mean anything) is what you get from an RTCM2 source over a serial or socket link.  This lower layer doesn&#8217;t formally have a name, but because it&#8217;s specified in a famously headache-inducing document called ISGPS200, those unlucky few of us who have had to deal with it tend to think of it as the ISGPS layer.</p>
<p>You start with your ISGPS bitstream.  Your first challenge is to turn it into data frames.  This involves sliding it bit-by-bit into a 36-bit buffer and looking for two fixed header bytes checked by six bits of parity information. When you sync successfully, you get a thirty-bit data word.</p>
<p>I&#8217;m making this sound simpler than it is.  I started with an RTCM2 decoder in C written by two guys named Wolfgang Rupprecht and John Sager around 1999 (they&#8217;ve both since disappeared off the net), and my first job was to pry the RTCM2 layer loose from the ISGPS bit-sync code.  I did it &#8211; you can look in the results in <a href="http://git.berlios.de/cgi-bin/gitweb.cgi?p=gpsd;a=blob;f=isgps.c;h=3c71f1798b2c8065659654e78eca058c7f625cff;hb=bcf86f9ed2e8045b900f2288d14c4f0552d244e9">isgps.c</a> in the GPSD codebase &#8211; but all by mechanical refactoring steps.  Full of magic numbers, shifts, and inversions; I didn&#8217;t understand it then and don&#8217;t now.  Oh, and it broke GCC&#8217;s optimizer.  Twice.</p>
<p>And that upper layer?  This is the RTCM2-specific part. Once you have your sequence of 30-bit data frames, you get to slice packed bitfields out of them according to one of a bunch of different message formats, identified by a type field in the first or header word.  Then you apply scaling divisors because many of the data fields are float quantities.</p>
<p>This upper layer is simple in principle, but getting any bitfield length wrong reduces all following to garbage.  Debugging this sort of thing is tedious and painful; is it the lower layer, the upper layer, or yet another damned optimizer bug?</p>
<p>The simplest(!) way to tackle it turned out to be to define a big ugly C union full of bitfields, cast the 30-bit-buffer pointer to a pointer to that union, and read the fields. Would fail on architectures that force padding between bitfields, but since modern CPUs all have barrel shifters I was reasonably sure I could say #pragma pack(1) and have the right thing happen. </p>
<p>(I lied.  There are actually <em>two</em> big ugly unions with bitfields &#8211; one for big-endian machines, one for little-endian. Look in <a href="http://git.berlios.de/cgi-bin/gitweb.cgi?p=gpsd;a=blob;f=driver_rtcm2.c;h=fa28a25d789aca8c4932a8dfd12bd2cc922bcb73;hb=bcf86f9ed2e8045b900f2288d14c4f0552d244e9">driver_rtcm2.c</a> in the GPSD codebase. If you&#8217;re not frightened of that code, you are perhaps blessed in your ignorance.)</p>
<p>The only reason I had any confidence in the results was because Sager&#8217;s code included a short segment of RTCM2 and an ASCII dump of same.  At every refactoring step I checked that my code still gave the same output from the input.  But I wasn&#8217;t <em>really</em> confident&#8230;because the Sager sample only included one type (type 9) and I therefore had no way to test the decodes of types 1, 3, 4, 5, and 16 (and later, when they upgraded the protocol, 13, 14 and 31).</p>
<p>Later, I got a different check for the ISGPS layer.  We were able to use it to decode a special GPS message called a &#8216;subframe&#8217; and see the current leap-second value we expected at the place we expected it.  This did nothing to verify the RTCM2 layer, though.</p>
<p>One of my continuing headaches over the next six years was that I couldn&#8217;t get my hands on a decent test pair for this thing.  What I needed was a pair of files like the Sager sample and its dump, but with a larger set of message types in it.  High and low did I search the Internet, but in vain.</p>
<p>And then, last week, a guy who&#8217;s been working with us on our support for Ntrip (a differential-correction service popular in Europe) finally handed me a test pair consisting of a raw RTCM2 bitstream segment and an ASCII dump of it made by some ugly proprietary Windows program.  The dump included reports for message types 1, 3, 14, and 31, but no type 9s as in the Sager sample.  (This was OK, as the type 1 and type 9 message formats are effectively identical).</p>
<p>I had what I&#8217;d been looking for for six years&#8230;but to say that I feared actually collating that dump with GPSD&#8217;s output would be to wallow in understatement.  So thick was the murk in the depths of my decoder that I had no freaking idea if I&#8217;d even be <em>able</em> to reconcile any differences.</p>
<p>And indeed it didn&#8217;t look good, initially.  The first reports appeared to match, except that a field called the zcount (a sort of timestamp) was ever so slightly off.  Groan.  Eventually I figured out that the time boundaries of the binary and the ASCII dump didn&#8217;t quite coincide.  If I looked a few reports after the first in the dump I saw something that looked like a match.</p>
<p>But it still didn&#8217;t look right.  There were sentences of type 14 and 31 in the binary data that my decoder didn&#8217;t interpret because they weren&#8217;t <em>defined</em> six years ago.  Groveling through some obscure documentation, I added more complexity to my huge pile of bitfield declarations and parsing code.  Better, but&#8230;</p>
<p>I still thought I had lossage&#8230;until I noticed that all those wrong satellite ID fields in the type 31s were wrong by a constant offset of 40, and realized that this was because the author of the other program had mapped GLONASS satellite numbers upwards so as not to collide with the GPS satellite IDs in the type 1 messages.</p>
<p>I looked at my JSON dump.  I looked at the ASCII dump from Windows-land.  Four message types I&#8217;d never tested before.  All.  Matched.  <em>Perfectly.</em>  Even down to the low digits on the float quantities.</p>
<p>My jaw dropped open.  &#8220;That just isn&#8217;t possible!&#8221; I thought to myself.  &#8220;<em>Something</em> here has to be wrong. What are the odds?&#8221;</p>
<p>I looked again.  And it was still right.  Every bit of it.</p>
<p>Anybody who isn&#8217;t a pretty hard-core programmer probably died of boredom about a thousand words ago.  But in case you made it this far, and don&#8217;t know much about what writing low-level bit-twiddling code is like&#8230;this is <em>freaky</em>. Once-in-a-blue-moon freaky.  You&#8217;re-full-of-shit-I-don&#8217;t-believe-you freaky. Not so much like dodging a bullet as like dancing stark-naked through aimed fire from a platoon of machine-gunners.</p>
<p>So blindingly unlikely is this, in fact, that my skill level doesn&#8217;t enter into the odds.  OK, I think I&#8217;m pretty good, and there&#8217;s objective evidence to back that up&#8230;but if I were Ken Thompson, Alan Turing, and Thoth Trismegistus rolled up into one ball of pulsating awesome I <em>still</em> wouldn&#8217;t have expected right-first-time on code like this.</p>
<p>So, what actually happened here?</p>
<p>I can isolate two things.  First, the ISGPS layer, fearsome black art though it is, must have gotten better test coverage from the Sager sample and the odd subframe decode than I realized during those six years. That code terrifies me, and it will terrify you if you ever read it (&#8220;You are not expected to understand any of this.&#8221;) but in retrospective fact the only things that ever broke it were GCC optimizer bugs.</p>
<p>Second, hanging on to the invariant that the Sager sample decoded correctly through all the changes in the rest of the codebase was a <em>very good idea</em>, even if it was only one sentence type.  It did nothing to verify the rest of the decoder, but at least it ensured that the framework around the rest didn&#8217;t get subtly bent out of shape while I wasn&#8217;t looking.</p>
<p>And I got really, really, really, <em>really</em> lucky.  </p>
<p>So lucky that I&#8217;m still feeling a bit dazed by it.  And I don&#8217;t think I can draw any lessons from this.  It&#8217;s too random and weird.  But we have so many legends of software disaster that I thought a story of anti-disaster would be worth sharing anyway.</p>