Repositories in Translation

I’ve been doing a lot of repository conversions recently, lifting ancient project histories from Subversion or even CVS into modern distributed version control systems. I’ve written about the technical problems with these conversions elsewhere, but they also raise issues that are almost philosophical – and not unlike, actually, the challenges natural-language translators face when moving a literary work between human languages.

In translations between human languages, there’s always an issue with constructions and idioms that are present in one language but not the other. This leads to a fundamental question about whether to prefer a literal or figurative translation. For example, consider the French phrase “mon petit chou”; translating it literally gives you “my little cabbage”, which sounds very silly in English even if you know enough French to be well aware that it’s a term of endearment analogous to, say, “sweetie”. So most translators will opt for a free rendering that conveys authorial intent rather than wording – and then perhaps find they’re in technical trouble a few paragraphs later if another character makes a pun about cabbages.

Repository conversions raise startlingly similar issues. For a really low-level example of idioms mismatching, consider changeset references. In Subversion, a changeset reference is a number that increments from one; the conventional way of writing a reference just looks like “r235” for the 235th commit. All the commits take place on one server and are time-ordered, so monotonically increasing commit numbers make sense.

This isn’t so in DVCSes, which are built around history merging. They’re like moving from Newtonian to Einsteinian physics; there’s no reliable linear time ordering from a single clock anymore, just lots of change histories in different peer repos diverging and reconverging. A commit ID is a hash of the contents of an entire code tree (and the hashes of its parents); it describes content, not sequence.

So, when you’re converting a Subversion repo to a DVCS, and the change comments contain references to other changesets – like, say, “r235” – what do you do? You can’t leave it untranslated; it would just be a meaningless tag that would impair the readability of the history. That would be bad, because project histories (in a way more literary works never do) have a function. People read them to understand code and fix bugs. We never want to throw away the information that the author of commit A intended to refer to a specific other commit B in his comment; that could be important.

The tactic I use is to replace commit references in the language of any given VCS with a combination of committer name and commit timestamp (or unambiguous part thereof). In principle I could use the target DVCS’s hash for the commit, but why translate to a magic cookie that’s going to break – and need to be retranslated – if the codebase is ever moved to another DVCS?

So “Reverted r235, it introduced a crash bug” becomes “Reverted Fred Foonly’s commit at 2011-05-19T03:30:26Z, it introduced a crash bug.” I’m doing what a literary translator has to do: render intent instead of wording. And I’m justifying it in detail because hackers get weirdly twitchy about “rewriting history”.
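A minimal sketch of this kind of reference rewrite, in Python. The lookup table, the regular expression, and the function name are all assumptions made for illustration; in a real conversion the table would be built from the Subversion log, and this is not meant to show how any particular conversion tool does it.

```python
import re
from datetime import datetime, timezone

# Hypothetical lookup table: Subversion revision number -> (committer name,
# commit time). A real conversion would build this from the Subversion log.
REVISIONS = {
    235: ("Fred Foonly", datetime(2011, 5, 19, 3, 30, 26, tzinfo=timezone.utc)),
}

# Matches references such as "r235". The word boundaries keep the pattern
# from firing inside longer identifiers such as "var235".
SVN_REF = re.compile(r"\br(\d+)\b")


def translate_refs(comment, revisions=REVISIONS):
    """Replace each known rNNN reference with committer name plus timestamp."""

    def replace(match):
        entry = revisions.get(int(match.group(1)))
        if entry is None:
            # Unknown revision: leave it alone for a human to look at.
            return match.group(0)
        name, when = entry
        stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        return f"{name}'s commit at {stamp}"

    return SVN_REF.sub(replace, comment)


print(translate_refs("Reverted r235, it introduced a crash bug"))
```

References to revisions that are missing from the table pass through unchanged, so nothing is silently mistranslated.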

There’s actually good reason for this twitchiness. People irresponsibly or maliciously messing with change histories threaten all the reputation incentives that make the whole hacker social machine hum. The weirdness and emotional intensity about it comes partly from how vitally important those incentives are, and partly from the fact that many hackers understand them only half-consciously and not very well. Consequently, they overprotect; I’ve actually had people accuse me of having made an evil tool because reposurgeon can easily edit comment histories.

Pressed, a hacker will admit that you pretty much have to rewrite commit references during a repo translation in order for them to make sense – but odds are he’ll feel uncomfortable, as though by modifying the original comment you’ve done something vaguely unclean or disreputable.

Now, one could translate like this: “Reverted Fred Foonly’s commit at 2011-05-19T03:30:26Z, it introduced a crash bug. (This reference was to ‘r235’ in the Subversion history.)” But, really, there wouldn’t be much functional point to this. The r235 is still meaningless in the new context; adding that parenthetical would be at best a ritual gesture of appeasement towards Thou Shalt Not Rewrite History.

There is actually good reason not to perform that gesture. It’s a functional reason; the comment is there to convey a meaning, which the gesture distracts from. We’re actually near the territory of literary translation here, where one of the things a translator is not supposed to do is obtrude between the reader and the author.

Here’s another idiom mismatch where we don’t even have an equivalent of the ritual gesture available: tag names. A tag is just an arbitrary name for a version in the repository history. The most common use for tags is to identify shipped releases. So, for example, in the GPSD repository there’s a tag “3.3” for the release 3.3 I just shipped. (One wants this, of course, so that when someone reports a bug in public release 3.3 one can see exactly what that code was and what has changed since.)

There’s a translation issue here because older VCSes often had limitations on the form of a tag name – had to begin with a letter, couldn’t contain dots or other punctuation, that sort of thing. So tags from older systems tend to look like this: “release-3-3”.

When I translate a tag like that into a modern, less restrictive VCS, I change it into “3.3”. It’s not what the original committer wrote, but it is probably what he/she was actually thinking. Intent, not wording. More generally, a really high-quality conversion of a Subversion or CVS repo into (for example) git should, as much as possible, look like what the developers would have created if they had been using git from day one – as opposed to carefully preserving Subversion/CVS idioms in amber because that’s more authentic.
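The tag rewrite can be sketched in a few lines of Python. The set of recognized prefixes ("release", "rel", "v") and the function name are assumptions for illustration; every project has its own legacy naming habits, so the pattern would need adjusting per repository.

```python
import re

# Assumed legacy shape: an optional-separator prefix followed by numeric
# components joined with "-" or "_", for example "release-3-3" or "v2_10_1".
LEGACY_TAG = re.compile(r"^(?:release|rel|v)[-_]?(\d+(?:[-_]\d+)*)$", re.IGNORECASE)


def modernize_tag(tag):
    """Return a dotted version string for a legacy release tag.

    Tags that do not look like release tags are returned unchanged.
    """
    match = LEGACY_TAG.match(tag)
    if match is None:
        return tag
    return re.sub(r"[-_]", ".", match.group(1))


print(modernize_tag("release-3-3"))
```

Anything that does not match the assumed release pattern is left as it was, which keeps the rewrite from mangling tags whose intent is not obvious.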

Why? Again, repo translations are not primarily history or literature; they are intended to support the difficult work of understanding the code history so problems can be solved. The best translation of ancient history in a repo interferes as little as possible with the process of understanding the actual code.

Fortunately, this rule doesn’t collide with the social requirement that past hackers be properly credited (or blamed!) for their work. Nothing in a repository translation will ever require that anybody’s name be struck from the annals. Actually, translating to a modern DVCS requires that committers be identified by an unambiguous internet-wide identifier (an email address) rather than the old-style local usernames of CVS and Subversion. Thus, the attribution information in a DVCS is actually better than in older systems. It is impossible for hackers to object to this :-).
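One way to picture that identity mapping is a lookup table from old local usernames to full identities. The table contents and the function name below are hypothetical, and the "Name <email>" output format is simply the form git uses for authors.

```python
# Hypothetical table mapping old-style local usernames to (full name, email).
AUTHOR_MAP = {
    "fred": ("Fred Foonly", "fred@foonly.com"),
}


def dvcs_identity(username, author_map=AUTHOR_MAP):
    """Return a 'Full Name <email>' identity for a CVS/Subversion username."""
    entry = author_map.get(username)
    if entry is None:
        # Fail loudly rather than invent an attribution.
        raise KeyError(f"no identity mapping for user {username!r}")
    name, email = entry
    return f"{name} <{email}>"


print(dvcs_identity("fred"))
```

Refusing to guess at an unmapped username is deliberate in this sketch: attribution is the one thing a translation must never get wrong.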

The rule that a good translation minimizes the reader’s cognitive friction – actually hides the degree to which it is a translation, except in a discreet note up front – is parallel to good practice in literary translation as well. But when you start applying it seriously, you will do things that make hackers twitchy.

Here’s one: When I’m reviewing old changeset comments during a repository lift, I fix obvious typos. Just one less thing for future hackers to trip over in the history when they’re trying to get serious analysis done. Intent, not wording!

Here’s another: On some translations where I was able to deduce the right referent for missing release tags, I silently added them…and attributed them to the project lead or release manager, back-dating them to the proper changeset. I chose to treat the repo as a slightly incomplete or damaged representation of the good practice its developers were trying to achieve, rather than authentically leaving the omissions in place.

One reason to take this attitude is that older automated repo-conversion tools such as cvs2svn really did have a tendency to damage that representation – often you just can’t tell whether the absence of a tag or slight garbling of other metadata was the result of a human slipup or a bug in a conversion tool.

The place where free translation raises the most issues is massaging comments into the form DVCSes prefer. Modern systems have log-summary tools that strongly encourage formatting change comments not as running paragraphs but as one standalone summary line optionally followed by a blank line and running paragraphs.

Most comments are one-liners anyway. Many of the multiline comments can be turned into summary-line-plus form by inserting whitespace after the first sentence. But in extreme cases (about 0.5% of the time) I find I have to write entire summary sentences myself. (Yeah, that’ll send some of my readers to their fainting couches!)
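The mechanical part of that massaging can be sketched as follows. The sentence-splitting rule is deliberately naive (it will stumble on abbreviations such as "e.g."), and the 72-character limit and the function name are assumptions, so the output still wants a human eye.

```python
import re

# First sentence: the shortest run of text ending in ".", "!" or "?" that is
# followed by whitespace and then more text.
FIRST_SENTENCE = re.compile(r"^(.+?[.!?])\s+(\S.*)$", re.DOTALL)


def to_summary_form(comment, max_summary=72):
    """Split a running-paragraph comment into summary line, blank line, body.

    Returns the comment unchanged if it is a one-liner or already has a blank
    second line. Returns None if no usable first sentence is found, which
    flags the comment for a hand-written summary.
    """
    text = comment.strip()
    lines = text.splitlines()
    if len(lines) < 2 or lines[1].strip() == "":
        return text
    match = FIRST_SENTENCE.match(text)
    if match is None:
        return None
    # Fold any line breaks inside the first sentence into single spaces.
    summary = " ".join(match.group(1).split())
    if len(summary) > max_summary:
        return None
    return summary + "\n\n" + match.group(2)


print(to_summary_form("Fix crash in parser. The tokenizer\nmishandled empty input."))
```

The None return corresponds to the extreme cases, where no amount of inserted whitespace produces a decent summary line and someone has to write one.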

The gods of Thou Shalt Not Rewrite History should not go entirely unappeased. I think it is a good idea to embed some indication of what you’ve changed in the repo history. Which is why the very first commit comment in the git translation of the Hercules repo now reads like this:

Initial repository setup

{You are actually looking at the result of two history lifts. This
project started out in CVS. In January of 2009 it was moved to
Subversion. On October 24 2011 it was moved to git, with the comment
history edited to turn Subversion commit references into references by
content and/or commit date, and massage comments into git form with a
summary line. Here and elsewhere, comment portions in curly braces
without preceding $ were added at conversion time.}

I’ve tried to hold the number of curly-brace comments to a minimum. I still haven’t decided whether I should sign my name in this one.

UPDATE: A commenter persuades me of two things:

1. The square brackets often used to mark ellipses and editorial comments may be a better convention – doubled, to avoid ambiguity with square brackets used in code snippets that might appear in comments. So: [[Note from the translator: ...]]

2. For the sake of future browsing tools, inserted commit IDs should be in a uniform format, like this: “[[2011-10-25T15:11:09Z/fred@foonly.com]]”. This is an ISO8601 timestamp followed by an RFC822 mail address – unambiguous and easy to machine-parse. The brackets indicate a translation note and are not intended to be part of the ID format.
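To illustrate the "easy to machine-parse" claim, here is a small Python sketch that writes and reads IDs in this format. The function names and the regular expression are assumptions; in particular the address pattern is a loose approximation, not a full RFC822 parser.

```python
import re
from datetime import datetime, timezone

# Double-bracketed translator's note holding timestamp, slash, mail address.
# The address part is matched loosely: anything@anything without whitespace,
# slashes, or closing brackets.
COMMIT_ID = re.compile(
    r"\[\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z)/([^\s/\]]+@[^\s/\]]+)\]\]"
)

STAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def format_id(when, email):
    """Render a timezone-aware datetime and an address as a bracketed ID."""
    stamp = when.astimezone(timezone.utc).strftime(STAMP_FORMAT)
    return f"[[{stamp}/{email}]]"


def find_ids(comment):
    """Return (UTC datetime, email) pairs for every bracketed ID in a comment."""
    results = []
    for stamp, email in COMMIT_ID.findall(comment):
        when = datetime.strptime(stamp, STAMP_FORMAT).replace(tzinfo=timezone.utc)
        results.append((when, email))
    return results


note = format_id(datetime(2011, 10, 25, 15, 11, 9, tzinfo=timezone.utc), "fred@foonly.com")
print(note)
print(find_ids(f"Reverted {note}, it introduced a crash bug"))
```

A browsing tool could use something like find_ids to turn each bracketed note back into a lookup by committer and time in whatever DVCS currently holds the history.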