Scenes from the Life of a System Architect
<p>I&#8217;ve been doing some heavy work on the core code of gpsd recently, and realized it would be a good idea to explain the whys and wherefores to my co-developers on the project. After I wrote my explanation and reread it, I realized I had managed to generate something that might be relatively accessible, and perhaps even interesting, to people who aren&#8217;t intimate with the GPSD codebase. </p>
<p>I guess I&#8217;m aiming this at junior programmers and particularly curious non-programmers. It&#8217;s a slice of what software systems design &mdash; the thing that project leads and architects do &mdash; is like in the real world, where the weight of history is often as pressing as today&#8217;s requirements list. I think this note shows an example of doing it right, or at least not getting it too badly wrong.</p>
<p>If you find the technicalese in here difficult, it may be useful to refer back to some of my previous posts about this project: </p>
<p><a href="http://esr.ibiblio.org/?p=1494">GPSD and Code Excellence</a></p>
<p><a href="http://gpsd.berlios.de/protocol-evolution.html">GPSD-NG: A Case Study in Application Protocol Evolution</a></p>
<p><a href="http://esr.ibiblio.org/?p=801">Why GPSes suck, and what to do about it</a></p>
<p><span id="more-1859"></span></p>
<p>Those of you following the commit list will have noticed a flurry of changes to the gpsd core code recently. It occurred to me this afternoon that I should explain what I&#8217;m doing, because it&#8217;s not a good thing that I still seem to be the only person who really understands the part of the architecture I&#8217;m modifying.</p>
<h2>0. Prerequisites</h2>
<p>To understand what follows, you have to remember that GPSes have reporting cycles. Start of cycle is when the firmware computes a fix. It then ships a burst of sentences reporting the fix. Wait a bit &#8211; usually 1 second &#8211; and repeat. This becomes an issue because when your device speaks NMEA 0183 you have to aggregate reports from all the sentences in a reporting cycle to get a complete fix.</p>
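<p>To make the aggregation problem concrete, here&#8217;s a sketch of the kind of merging involved &#8211; illustrative C only, with invented structure and field names, not the actual gpsd code:</p>
<pre>
/* Illustrative only: merge partial NMEA reports into one fix per cycle. */
#include <string.h>

struct fix {
    double time;          /* UTC seconds, from RMC or GGA */
    double lat, lon;      /* degrees, from RMC, GGA, or GLL */
    double alt;           /* meters, from GGA only */
    double speed, track;  /* from RMC or VTG */
};

/* Called once per sentence; each sentence type fills in only what it knows. */
static void merge_sentence(struct fix *cycle, const char *tag,
                           const struct fix *partial)
{
    if (strcmp(tag, "GGA") == 0) {
        cycle->time = partial->time;
        cycle->lat  = partial->lat;
        cycle->lon  = partial->lon;
        cycle->alt  = partial->alt;      /* only GGA carries altitude */
    } else if (strcmp(tag, "RMC") == 0) {
        cycle->time  = partial->time;
        cycle->lat   = partial->lat;
        cycle->lon   = partial->lon;
        cycle->speed = partial->speed;
        cycle->track = partial->track;
    }
    /* ...VTG, GLL, and friends each contribute their own subset.
     * Only when the burst ends is the accumulated fix complete enough
     * to report. */
}
</pre>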
<p>You also need to know that the part of the code I&#8217;m modifying is the dispatcher layer. This is where the daemon&#8217;s main select loop lives, waiting on input from client sessions, active GPSes, and the special control port. The dispatcher allocates and deallocates sockets for the client sessions and the control port, and handles user commands. It calls the packet sniffer to accumulate traffic from the GPSes into packets that can be analyzed for fixes, applies some policy and error modeling, and ships reports to client sessions.</p>
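<p>If you haven&#8217;t written one of these, the overall shape is roughly this &#8211; a bare-bones sketch, not gpsd&#8217;s code, with placeholder handler functions:</p>
<pre>
/* Skeleton of a select()-based dispatcher loop (illustrative, not gpsd's). */
#include <stdbool.h>
#include <sys/select.h>

extern int listen_fd, control_fd;              /* placeholders */
extern fd_set all_fds;
extern int max_fd;
extern bool is_device(int fd), is_client(int fd);
extern void accept_client(int fd), read_control(int fd);
extern void feed_packet_sniffer(int fd), handle_client_command(int fd);

void dispatcher_loop(void)
{
    for (;;) {
        fd_set rfds = all_fds;
        if (select(max_fd + 1, &rfds, NULL, NULL, NULL) < 1)
            continue;
        for (int fd = 0; fd <= max_fd; fd++) {
            if (!FD_ISSET(fd, &rfds))
                continue;
            if (fd == listen_fd)
                accept_client(fd);              /* new client session socket */
            else if (fd == control_fd)
                read_control(fd);               /* the special control port */
            else if (is_device(fd))
                feed_packet_sniffer(fd);        /* GPS traffic -> packets -> fixes */
            else if (is_client(fd))
                handle_client_command(fd);      /* e.g. ?WATCH, and later ?POLL */
        }
    }
}
</pre>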
<p>To do these things, the dispatcher calls out to a core library that assembles packets and a bunch of drivers to analyze them once assembled. The details of that level aren&#8217;t important for this discussion. What&#8217;s mainly significant here is the fact that they&#8217;re separated from the dispatcher logic by an API I was pretty careful about designing.</p>
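<p>The flavor of that boundary is a table of per-driver entry points; something like this hypothetical vtable (the real driver structure differs in detail):</p>
<pre>
/* Hypothetical driver-side API, just to show the shape of the boundary. */
struct gps_device;                     /* opaque per-device session state */

struct gps_driver {
    const char *type_name;             /* e.g. "NMEA 0183", "SiRF binary" */
    /* consume one assembled packet and update the device's fix data */
    int  (*parse_packet)(struct gps_device *dev);
    /* optional hooks the dispatcher never needs the details of */
    void (*wakeup)(struct gps_device *dev);
    void (*configure)(struct gps_device *dev);
};
</pre>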
<h2>1. The motivation: bring back polling</h2>
<p>The modifications I&#8217;m doing have been triggered by Chris Kuethe&#8217;s request that I bring back support for client-initiated polling in the new JSON-based protocol. At the moment, all clients are expected to set watcher mode, then poll their socket often enough to handle the bursts of time-position-velocity JSON that gpsd will send at them once per cycle.</p>
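<p>For concreteness, here&#8217;s about the smallest possible watcher-mode client, talking raw protocol over a socket &#8211; a sketch with error handling trimmed, assuming the daemon is at the usual localhost:2947:</p>
<pre>
/* Minimal watcher-mode client: set ?WATCH, then read the JSON bursts. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(2947) };
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
    if (sock < 0 || connect(sock, (struct sockaddr *)&addr, sizeof addr) != 0)
        return 1;

    const char *watch = "?WATCH={\"enable\":true,\"json\":true};\n";
    write(sock, watch, strlen(watch));

    char buf[4096];
    ssize_t n;
    while ((n = read(sock, buf, sizeof buf - 1)) > 0) {  /* one TPV burst per cycle */
        buf[n] = '\0';
        fputs(buf, stdout);
    }
    close(sock);
    return 0;
}
</pre>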
<p>The original gpsd protocol was polling based, but had significant design defects. One of them was that it was normal to poll for a bunch of partial reports (latitude/longitude, altitude, speed, etc.) but they weren&#8217;t timestamped, so it wasn&#8217;t easy to tell when the data was stale.</p>
<p>There were also some unpleasant edge cases arising from the fact that, in order to conserve power on battery-driven devices, the daemon didn&#8217;t try to activate any attached device until you polled for position data. Thus, you couldn&#8217;t actually do a single-shot poll reliably under old protocol &#8211; the first time you tried it, the poll would wake up some device but return no data, and you would probably start getting good data on the second or third poll (depending on the timing of the next cycle edge).</p>
<p>There were other problems as well, and I eventually gave up on client-initiated polling entirely; when I redesigned the client API for use with the new protocol, I left it out. Chris has a decent use case for it, though, and I&#8217;m aiming towards bringing it back. The way it will probably work is that the client starts by setting a ?WATCH that doesn&#8217;t do the once-per-cycle JSON bursts, but just clues the daemon to activate devices. Thereafter the client will be able to poll for the state of the cache from the last fix(es) &#8211; that&#8217;s more than one if there are multiple sensors attached.</p>
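<p>In client terms, the exchange might look something like this &#8211; purely a sketch of the intent, reusing the raw-socket setup from the watcher example above; the actual verbs and reply format aren&#8217;t settled yet:</p>
<pre>
/* Hypothetical polling mode; the verbs shown are a guess at the eventual
 * shape, not settled protocol. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

void poll_forever(int sock, unsigned idle_seconds)
{
    /* Activate attached devices but suppress the once-per-cycle bursts. */
    const char *watch = "?WATCH={\"enable\":true,\"json\":false};\n";
    write(sock, watch, strlen(watch));

    char buf[4096];
    for (;;) {
        write(sock, "?POLL;\n", 7);         /* ask for the cached last fix(es) */
        ssize_t n = read(sock, buf, sizeof buf - 1);
        if (n <= 0)
            break;
        buf[n] = '\0';
        fputs(buf, stdout);                 /* one report per attached sensor's last fix */
        sleep(idle_seconds);                /* poll at the client's own pace */
    }
}
</pre>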
<h2>2. Clearing the way</h2>
<p>But there&#8217;s major work needed to prepare the ground. The good news is that it&#8217;s entirely in the dispatcher layer and the core library, and won&#8217;t destabilize the packet sniffer or the device drivers. Changes in that lower layer are <b>*messy*</b> and I approach them with trepidation.</p>
<p>Basically, before I can re-implement client polling in a clean way I need to clear away a lot of scar tissue and complexity. I ripped out the command interpreter for old protocol weeks ago, but there&#8217;s stuff underneath the new one that&#8217;s pretty tangled. Most (but not all) of it is a result of the infamous &#8211; and now dead &#8211; J command.</p>
<p>Those of you who have been around for a while will remember that old protocol used to have a per-user-session policy switch controlling the circumstances under which old data was held over and merged into new data; you could have display-jitter reduction at the risk of occasionally seeing stale data from the previous reporting cycle, or you could guarantee fresh data only but pay for it with display jitter. </p>
<p>The reason J was needed was that the sentence mix shipped by NMEA devices is highly variable, and we didn&#8217;t have a reliable way to pin down the end of cycle &#8211; users had to know about their devices and choose the right policy. I was never happy with this, because zero configuration is one of my goals, but that UI issue was nearly trivial compared to the complexifying effect on the dispatcher internals.</p>
<p>Because the J policy switch was per user, you could have user sessions with opposite smoothing policies (minimize jitter vs. avoid staleness) getting data from the same underlying device. Several steps of implication later, this meant the dispatcher layer had to maintain a pair of this-fix and last-fix buffers per client session, copy data up into them from the device-driver layer, and apply some policy-specific logic before emitting an actual report.</p>
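<p>Schematically, the old arrangement looked something like this &#8211; field names invented, the point being where the buffers lived rather than their exact contents:</p>
<pre>
/* Sketch of the old per-client-session buffering (names invented). */
struct gps_fix {                  /* a time/position/velocity snapshot */
    double time, lat, lon, alt, speed, track;
};

struct subscriber {
    int fd;                       /* the client's socket */
    int policy;                   /* old J switch: minimize jitter vs. avoid staleness */
    struct gps_fix this_fix;      /* copied up from the driver layer each cycle... */
    struct gps_fix last_fix;      /* ...and held over from the previous cycle */
};
/* Every cycle the dispatcher filled each subscriber's buffer pair and ran
 * the subscriber's policy over it before emitting a report. */
</pre>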
<p>Eventually I figured out how to do automatic end-of-cycle detection and the J command went away. But the tricky buffer-shuffling used to implement it didn&#8217;t! I knew I&#8217;d have to clean that up someday, but didn&#8217;t want to tackle it sooner than I had to; I knew it would require major surgery on the dispatcher.</p>
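<p>The detection itself is a side issue here, but for the curious, one workable heuristic &#8211; not necessarily exactly what gpsd does &#8211; is to learn which sentence tag closes the burst and then treat that tag as the cycle boundary:</p>
<pre>
/* One plausible end-of-cycle heuristic (illustrative, not gpsd's exact logic):
 * during the first cycles, remember which sentence tag arrived just before
 * the fix timestamp changed; afterwards, that tag marks end of cycle. */
#include <stdbool.h>
#include <string.h>

static char end_marker[8];        /* learned closing tag, e.g. "GLL" */
static char prev_tag[8];
static double prev_time = -1.0;

bool end_of_cycle(const char *tag, double timestamp)
{
    bool end = false;
    if (end_marker[0] != '\0')
        end = (strcmp(tag, end_marker) == 0);              /* learned phase */
    else if (prev_time >= 0.0 && timestamp != prev_time)
        strncpy(end_marker, prev_tag, sizeof end_marker - 1);
    strncpy(prev_tag, tag, sizeof prev_tag - 1);
    prev_time = timestamp;
    return end;
}
</pre>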
<p>When Chris made his request, though, I decided that it would be begging for obscure bugs to reimplement polling on top of the user-session-level buffering, and concluded that I needed to rip the latter out first.</p>
<p>One of the hard parts is already done. There still needs to be a buffer pair (this fix and last fix) for things like computing speed and course if the GPS doesn&#8217;t supply them. But it can be per-device and live in the core library rather than per-user-session and live in the dispatcher. I successfully dropped those buffers down a level a couple of days ago; as a happy side effect, this has already reduced gpsd&#8217;s memory footprint some.</p>
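<p>The kind of computation that buffer pair supports is simple dead reckoning between consecutive fixes &#8211; roughly the following, a flat-earth approximation that&#8217;s fine at one-second spacing and not necessarily the formula gpsd uses:</p>
<pre>
/* Derive speed and course from two consecutive fixes when the GPS doesn't
 * report them (flat-earth approximation; illustrative, not gpsd's code). */
#include <math.h>

#define EARTH_RADIUS_M 6371000.0
#define DEG2RAD(d) ((d) * M_PI / 180.0)

/* lat/lon in degrees, times in seconds; yields speed in m/s, course in degrees true. */
void derive_motion(double lat1, double lon1, double t1,
                   double lat2, double lon2, double t2,
                   double *speed, double *course)
{
    double dnorth = DEG2RAD(lat2 - lat1);
    double deast  = DEG2RAD(lon2 - lon1) * cos(DEG2RAD((lat1 + lat2) / 2.0));
    double dist   = EARTH_RADIUS_M * hypot(dnorth, deast);

    *speed  = (t2 > t1) ? dist / (t2 - t1) : 0.0;
    *course = fmod(atan2(deast, dnorth) * 180.0 / M_PI + 360.0, 360.0);
}
</pre>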
<p>What I&#8217;m working on now is getting rid of the channel structure. Under old protocol client sessions listened to a specific list of devices; under new protocol they listen to everything unless they&#8217;ve selected a <b>*single*</b> device. The difference means that the channel structure array, which exists to manage (potentially multiple) subscriber-to-device links, isn&#8217;t really needed anymore.</p>
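<p>In data-structure terms the simplification is from this sort of thing to this sort of thing &#8211; hypothetical declarations, for flavor only:</p>
<pre>
/* Old: an array of subscriber-to-device links, one per "channel". */
#define MAX_CHANNELS 64               /* made-up limit */
struct subscriber;
struct gps_device;

struct channel {
    struct subscriber *sub;
    struct gps_device *dev;
};
extern struct channel channels[MAX_CHANNELS];

/* New protocol needs only this: a subscriber hears every device unless it
 * has picked exactly one, so a single optional pointer per subscriber does. */
struct subscriber_new {
    int fd;
    struct gps_device *watched;       /* NULL means listen to everything */
};
</pre>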
<p>Once I abolish channels I&#8217;ll have a clean, fairly simple dispatcher architecture that&#8217;s matched to the way new protocol actually works. That will be a good base on which to re-implement polling.</p>
<h2>3. Removing code is good</h2>
<p>One very nice thing about the work I&#8217;m doing is that it&#8217;s mostly ripping out chunks of code that aren&#8217;t needed anymore, with some refactoring to separate out those chunks first. While it&#8217;s possible to introduce bugs during code removal, they tend to be the unsubtle kind that lead to immediate crashes or gross regression-test failures. It&#8217;s thus relatively easy to have confidence in the results when I take a step forward in the process.</p>
<p>The line count, memory usage, and overall complexity of the dispatcher are going to drop significantly even after I have re-implemented polling. Polling will be handled by a relatively small, contiguous span of code in the command interpreter that raids the device-level buffers to generate a report; the removed code, by contrast, will have come from all through the dispatcher, especially the grottier parts that kept data structures with different roles in proper synchrony.</p>
<p>This is all good. Decreased line count means increased maintainability, and given the high reliability requirements of something that&#8217;s going to be used in navigational systems (and thus potentially life-critical), that&#8217;s especially important for gpsd.</p>
<p>UPDATE: I did successfully remove the channel structure, and I added a polling command &#8211; in 23 contiguous lines of code.</p>