End-to-end arguments in software design
My title is, of course, a reference to the 1984 paper End-to-End Arguments in System Design by Saltzer, Reed, and Clark. They enunciated what has since become understood as perhaps the single most central and successful principle of the design of the Internet. If you have not read it, do chase the link; it well deserves its status as a classic.
The authors wrote mostly about the design of communications networks. But the title referred not to “network design” but to system design, and some of the early language in the paper hints that the authors thought they had discovered a design rule with implications beyond networking. I shall argue that indeed they had – there is a version of the end-to-end principle that applies productively and forcefully to the design of (non-networked) software. I shall develop that version, illustrating it with a case study from experience.
To apply the end-to-end principle to software design, we need a way to state it that is general enough to be lifted out of the specific context of network design. The network-design version is this: intelligence belongs at the network endpoints, not in the pipes. Trying to make the pipes “smart” duplicates functions like error correction and receipt acknowledgment that the endpoints are going to have to do anyway to ensure end-to-end integrity. Trying to make the pipes smart also introduces tricky failure modes when the smarts inevitably go wrong.
Now I’m going to tell a story about a software failure. As I write, I have spent most of the last two weeks spinning up a replacement for a widely used social-computing service that irreparably crashed on us. The replacement involves a service daemon I named “irkerd”, which is expected to relay notification requests submitted as simple JSON objects on a listening socket to Internet Relay Chat channels.
To do this, irkerd (which is written in Python) relies on a Python IRC library designed to speak the client end of the message protocol defined by RFC 2812. The library is pretty good; without it, the irkerd implementation would have taken much, much longer. In turn, irkerd has stressed the library in ways it hadn’t been stressed before. I’ve been contributing fixes and patches back to the library, and the maintainer has shipped a couple of point releases as a result.
Yesterday, just as the irkerd codebase was stabilizing after some early problems with thread safety, some of my test users on the irker chat channel began reporting a new fatal bug – a Unicode decoding error being thrown from deep inside the IRC library. Investigation and an email query to the maintainer revealed that this was the result of a recent design decision and a consequent change to the library internals in the previous day’s point release.
Previously, the IRC library had made no assumptions about the character encoding of the chat data it received from IRC servers. It simply passed those strings as uninterpreted payloads of events made visible by the library to the calling application (in this case irkerd). Because irkerd is a sending relay rather than an interactive client, it just throws that received chat data away. The only received traffic it cares about is the IRC server’s responses to login and message-transmission commands, which are plain ASCII.
In this point release, the maintainer changed the library to perform UTF-8 decoding early in the processing of the chat strings, so the event payloads would be guaranteed Unicode. Which was fine and dandy until a server shipped irkerd a chat line with a bad continuation byte. At that point the early UTF-8 decode threw an exception from deep inside the library, crashed one of irkerd’s main threads, and hung the daemon.
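For the concrete-minded, here is a minimal sketch of that failure mode. It is not the IRC library's code; `raw_line`, `eager_decode`, and `pass_through` are hypothetical names invented for illustration, and the sample line is made up. It only shows that a strict UTF-8 decode raises on a bad continuation byte, while handing the bytes through untouched cannot fail this way.

```python
# Hypothetical raw line such as an IRC server might send. The byte 0x28
# following 0xC3 is not a valid UTF-8 continuation byte.
raw_line = b":nick!user@host PRIVMSG #chan :caf\xc3\x28\r\n"


def eager_decode(line: bytes) -> str:
    """Stand-in for a library that decodes every line up front."""
    return line.decode("utf-8")  # errors="strict" is the default


def pass_through(line: bytes) -> bytes:
    """Stand-in for a library that makes no encoding assumption."""
    return line


try:
    eager_decode(raw_line)
    outcome = "decoded"
except UnicodeDecodeError as exc:
    # In a daemon, an uncaught exception here would end the thread.
    outcome = "failed: " + exc.reason

payload = pass_through(raw_line)  # never raises, whatever the bytes are
```

Here `outcome` records the decode failure, while `payload` is simply the original bytes, ready for the application to interpret or discard.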
The library maintainer thought he’d be doing calling applications a favor by performing UTF-8 decoding so they don’t have to. What he did instead was introduce a new fatal failure mode on bad chat data – a particularly annoying one for irkerd, which only sees that data because the protocol requires it to, and really wants to just ignore the chat lines.
Assumptions about character encoding properly belong not in the IRC library but in the calling application. There are several reasons for this, but they all come down to two points: (1) only decoding exceptions raised locally can be handled locally, and (2) how to handle them is a policy decision that the application must make – so the library should not try to pre-empt it.
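To illustrate why this is a policy decision, here is a hedged sketch of two applications receiving the same raw bytes and making different choices. The function names and sample lines are assumptions of mine, not irkerd's actual code or the library's API, and the check for a numeric server reply is deliberately simplistic.

```python
from typing import Optional


def is_numeric_reply(line: bytes) -> bool:
    """True if the second field is a three-digit numeric, e.g. b'001'."""
    parts = line.split(b" ", 2)
    return len(parts) >= 2 and len(parts[1]) == 3 and parts[1].isdigit()


def relay_policy(line: bytes) -> Optional[str]:
    """A sending relay: read numeric replies, ignore all chat traffic."""
    if is_numeric_reply(line):
        # Replies are expected to be ASCII; never raise if they are not.
        return line.decode("ascii", errors="replace")
    return None  # chat data is dropped without ever being decoded


def client_policy(line: bytes) -> str:
    """An interactive client: show everything, mark undecodable bytes."""
    return line.decode("utf-8", errors="replace")


welcome = b":irc.example.net 001 irker :Welcome\r\n"
bad_chat = b":nick!user@host PRIVMSG #chan :caf\xc3\x28\r\n"

relay_seen = [relay_policy(welcome), relay_policy(bad_chat)]
client_seen = client_policy(bad_chat)
```

Neither policy is the right one in general, which is the point: the relay never touches the bad bytes at all, and the client substitutes a replacement character instead of dying. A library that decodes strictly on the application's behalf forecloses both choices.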
The library maintainer violated a software version of the end-to-end principle. It reads like this: in any software data path, whether networked or not, interior components that can be indifferent to the nature of the data they are handling should remain indifferent. Every assumption introduced while handling data implies new exceptions and failure cases; to minimize your failures, minimize your assumptions.
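As a small illustration of an interior component that stays indifferent, consider a sketch of a line framer. It needs to know only the CR-LF delimiter that RFC 2812 uses to terminate messages; it makes no assumption about what the bytes in between mean. `LineFramer` is a hypothetical class written for this essay, not part of any real library.

```python
from typing import List


class LineFramer:
    """Split a byte stream into CR-LF terminated lines without decoding."""

    def __init__(self) -> None:
        self._buffer = b""

    def feed(self, chunk: bytes) -> List[bytes]:
        """Accept a chunk of bytes; return any lines it completes."""
        self._buffer += chunk
        # Everything before the last delimiter is a complete line; the
        # remainder (possibly empty) waits for the next chunk.
        *lines, self._buffer = self._buffer.split(b"\r\n")
        return lines


framer = LineFramer()
first = framer.feed(b"PING :a\r\nPRIV")
second = framer.feed(b"MSG #chan :\xc3\x28\r\n")
```

Because the framer never interprets its payload, malformed UTF-8 in `second` is just more bytes to it, and it has no decoding failure case to get wrong.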
I will finish by noting that I wrote this essay because web searches on “end-to-end arguments” and “end-to-end principle” suggested that the above software-generalized version of it may not have been written down before – all the references I could find were about network design. I believe that this is one of those folk theorems that every sufficiently experienced software system designer eventually learns without necessarily becoming conscious of it. I hope that by writing it down, so it can be learned more rapidly and explicitly, I will have helped improve the practice of software design.