CIA and the perils of overengineering

But long before I saw this diagram there were several aspects of the design that seemed rather iffy. Why one centralized server? Why the elaboration of XML-RPC? Why do you have to register your project on cia.vc to use the notification, with the mapping from your project to IRC channels lurking in an opaque database on a distant server, rather than being simply declared in the (arguments to your) repository hook?

The answer seems to be that the original designer fell in love with the idea of data-mining and filtering the notification stream. It is quite visible on the CIA site how much of the code is concerned with automatically massaging the commit stream into pretty reports. I’m told there is a complicated and clever feature involving XML rewrite rules that allows one to filter commit reports from any number of projects by the file subtrees they touch, then aggregate the result into a synthetic notification channel distinct from any of the ones those projects declared themselves.

Bletch! Bloat, feature creep, and overkill! With chrome like this piled on top of the original simple concept of a notification relay, the resulting complexity collapse should no longer be any surprise. Additionally, this is a near-perfect case study in how to make your service scale up poorly and be maximally vulnerable to single-point failures – if that one database gets lost or corrupted, everybody’s notifications will go haywire. The CIA design would have been over-centralized even if the implementation weren’t broken.

Of course the way to prove this kind of indictment is to do better. But once I got this far in my thinking, I realized that wouldn’t be difficult. And started to write code. The result is irkerd, a simple service daemon. One end of it listens on a socket for JSON requests that specify a server/channel pair and a message string. The other end behaves like a specialized IRC client that maintains concurrent session state for any number of IRC-server instances. All irkerd is, really, is a message bus that routes notification requests to the right servers. (And is multithreaded so it won’t block on a server stall, and times out inactive sessions.)
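To make the request shape concrete, here is a minimal sketch of what a repository hook might send. The text above says only that a request carries a server/channel pair and a message string; the field names `to` and `privmsg`, the IRC URL form, and the host and port defaults below are illustrative assumptions, not a specification of irkerd's wire format.

```python
import json
import socket

# Assumed values for illustration only; the essay does not state them.
IRKER_HOST = "localhost"
IRKER_PORT = 6659


def build_request(server, channel, message):
    """Encode one notification as a single line of JSON.

    The keys "to" and "privmsg" are hypothetical field names.
    """
    target = f"irc://{server}/{channel}"
    return json.dumps({"to": target, "privmsg": message}) + "\n"


def send_notification(server, channel, message,
                      host=IRKER_HOST, port=IRKER_PORT):
    """Open a TCP connection to the daemon, send one request, and close."""
    payload = build_request(server, channel, message).encode("utf-8")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)
```

A hook would then need a single call such as `send_notification("irc.example.net", "#commits", "myproject: fixed the build")`, with the project-to-channel mapping declared right there in the hook rather than in a remote database.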

That’s it. Less than 400 lines of Python replaces CIA’s core notification service. The code for a repo hook to talk to it is simpler than any existing CIA hook. And it doesn’t require a centralized server. The right way to deploy this thing will be to host multiple instances of irker on repository sites, not publicly visible (because otherwise they could too easily be used to spam IRC channels) but available to the repository’s hooks running inside the site firewall.

Filtering? Aggregation? As previously noted, they don’t need to be in the transmission path. One or more IRC bots could be watching #commits, generating reports visible on the web, and aggregating synthetic feeds. The only agreement needed to make this happen is minimal regularity in the commit message formats that the hooks ship to IRC, which is really no more onerous than the current requirement to gin up an XML-RPC blob in a documented format.

I must note one drawback to this way of partitioning things. Because IRC has a message length limit, naively shipping commits with very long metadata (due to, for example, large lists of modified files) would make only a truncated version available on IRC (and thus, to an IRC watcher bot gathering statistics).

It might be that this was the original motivation for using an XML-RPC transport on CIA’s input end. Indeed, when I first recognized the problem I started sketching a design for an auxiliary daemon that would do nothing but accept XML-RPC requests in something very close to CIA’s preferred format, then forward short digests of them to an irker instance for shipping to IRC. This auxiliary could collect statistics based on the un-truncated metadata…

Fortunately, I experienced a rush of good sense before I actually started coding this thing. It would have hugely complicated deployment and testing to handle an unusual case – observably from #commits, most commit messages are short and touch few files. We get a much simpler system if we accept two reduction rules:

1. If a commit notification would be longer than 510 bytes, we omit the filenames list. An empty filenames list is to be interpreted by filtering software as “may touch any file in the project”.

2. Then…we just ship it. If the IRC server truncates it at 510 bytes, so be it. Humans watching the commit stream won’t need more than that to put the commit in context (especially not for projects which use git’s first-line-is-a-summary convention) and the hypothetical statistics-gathering bots won’t understand natural language well enough to care that it’s truncated.
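The two rules above fit in a few lines of hook code. This is only a sketch: the function name and the message layout (project, author, bracketed file list, log text) are hypothetical, and only the 510-byte threshold and the drop-the-file-list behavior come from the rules themselves.

```python
IRC_LIMIT = 510  # byte threshold named in rule 1


def format_notification(project, author, files, log):
    """Build a one-line commit notification, applying rule 1.

    The layout "project: author [files] log" is an assumed format.
    """
    # Collapse whitespace so the notification is a single line
    # (an assumption of this sketch, since an IRC message is one line).
    text = " ".join(log.split())
    full = f"{project}: {author} [{' '.join(files)}] {text}"
    if len(full.encode("utf-8")) <= IRC_LIMIT:
        return full
    # Rule 1: too long, so omit the file list entirely. An empty list
    # means "may touch any file in the project".
    # Rule 2: return the result as-is, with no truncation on our side.
    return f"{project}: {author} [] {text}"
```

Note that the length check counts encoded bytes rather than characters, since the limit is stated in bytes. The function never truncates; if the result is still too long, that is left to the IRC server, exactly as rule 2 says.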

This is how you keep things simple. And that is how you prevent your projects from collapsing under complexity.

I wrote irkerd to accomplish two things: (1) Light a fire under the CIA salvage crew, attempting to speed up their success, and (2) provide a viable alternative in case they didn’t succeed. To this I now add (3) illustrate what healthy minimalism in software design looks like. Antoine de St-Exupéry said it best: Perfection (in the design of software, as well as his airplanes) is achieved not when there is nothing more to add, but rather when there is nothing more to take away.

Accordingly, note the nonexistence of irkerd configuration options and the complete absence of anything resembling a control dotfile. I even, quite deliberately, omitted the usual option to change the port that irker listens on. Because if you think you need an option like that, you actually have a problem you need to solve at your firewall.

But releasing irkerd, of course, is not the end of the story. For it to do any good, instances of the daemon and its repo hook will need to be running and documented at sites like SourceForge, GitHub, Gitorious, Gna, and Savannah. As I noted at the beginning of this essay, I expect pushing the deployment along will eat up a lot of my time in the near future – probably more time than it took to write and test the code. These forge sites are all chronically understaffed and have long issue backlogs.

Still, at least we now have a simple and robust design, and working code. And – this can’t be emphasized enough – single-site outages will no longer be fatal. If there’s one thing the history of the Internet should have taught us, it’s that you get robust and scalable services not by centralizing but by distributing them. It’s too bad the designer of CIA never internalized that lesson, and there can be no better finish to this tale of failure than by reinforcing it.

UPDATE: I’ve changed my mind about statistics-gathering. I no longer think a bot watching the #commits channel is any kind of good idea – the notifications are too easily spoofed, or could just be garbled by software or configuration errors. If you want activity statistics, Ohloh shows the way – analysis tools operating on the repositories.