Ugliest…repository…conversion…ever

Blogging has been light lately because I’ve been up to my ears in reposurgeon’s most serious challenge ever. Read on for a description of the ugliest heap of version-control rubble you are ever likely to encounter, what I’m doing to fix it, and why you do in fact care – because I’m rescuing the history of one of the defining artifacts of the hacker culture.

Imagine a version-control history going back to 1985 – yes, twenty-nine years of continuous development by no fewer than 579 people. Imagine geologic strata left by no fewer than five version-control systems – RCS, CVS, Arch, bzr, and git. The older portions of the history are a mess, with incomplete changeset coalescence in the formerly-CVS parts and crap like paths prefixed with “=” to mark RCS masters of deleted files. There are hundreds of dead tags and dozens of dead branches. Comments and changelogs are rife with commit-reference cookies that no longer make sense in the view through more modern version-control systems.

Your present view of the history is a sort of two-headed monster. The official master is in bzr, but because of some strange deficiencies in bzr’s export tools (which won’t be fixed because bzr is moribund) you have to work from a poor-quality read-only git mirror that gets automatically rebuilt from the bzr history every 15 minutes. But you can’t entirely ignore the bzr master; you have to write custom code to data-mine it for bzr-related metadata that you need for fixing references in your conversion.

Because bzr is moribund, your mission is to produce a full standalone git conversion that doesn’t suck. Criteria for “not sucking” include (a) complete changeset coalescence in the RCS and CVS parts, (b) fixing up CVS and bzr commit references so a human being browsing through git can actually follow them, and (c) making sense out of the mess that is RCS deletions in the oldest part of the history.
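For readers who haven’t run into the term: RCS and CVS record commits one file at a time, so “changeset coalescence” means regrouping those per-file commits into the multi-file changesets a modern VCS expects. The usual heuristic is to merge per-file commits that share an author and log message and fall close together in time. What follows is only a minimal sketch of that idea, not reposurgeon’s actual algorithm; the `FileCommit` and `Changeset` types, the `coalesce` function, and the 300-second window are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FileCommit:
    """One per-file commit, as RCS/CVS would record it (hypothetical type)."""
    path: str
    author: str
    message: str
    timestamp: int  # seconds since the epoch


@dataclass
class Changeset:
    """A group of per-file commits treated as one logical change."""
    author: str
    message: str
    start: int
    end: int
    paths: list = field(default_factory=list)


def coalesce(file_commits, window=300):
    """Group per-file commits into changesets.

    Two commits land in the same changeset when they share author and
    log message, the gap to the previous commit in that changeset is at
    most `window` seconds, and the file is not already in the changeset.
    The window size is an illustrative assumption, not a canonical value.
    """
    changesets = []
    latest = {}  # (author, message) -> most recent changeset for that key
    for fc in sorted(file_commits, key=lambda c: c.timestamp):
        key = (fc.author, fc.message)
        cs = latest.get(key)
        if (cs is not None
                and fc.timestamp - cs.end <= window
                and fc.path not in cs.paths):
            cs.paths.append(fc.path)
            cs.end = fc.timestamp
        else:
            cs = Changeset(fc.author, fc.message,
                           fc.timestamp, fc.timestamp, [fc.path])
            latest[key] = cs
            changesets.append(cs)
    return changesets


if __name__ == "__main__":
    history = [
        FileCommit("a.c", "esr", "fix", 100),
        FileCommit("b.c", "esr", "fix", 160),
        FileCommit("a.c", "rms", "other", 130),
        FileCommit("a.c", "esr", "fix", 5000),
    ]
    for cs in coalesce(history):
        print(cs.author, cs.message, cs.paths)
```

“Incomplete” coalescence is what you get when a heuristic like this was applied with too narrow a window, or not at all, so that one logical change still shows up as several single-file commits.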

Also, because the main repo is such a disaster area, there is at least one satellite repo for a Mac OS X port that really wants to be a branch of the main repo, but isn’t. (Instead it’s a two-tailed mutant clone of a nine-year-old version of the main repo.) You’ve been asked to pull off a cross-repository history graft so that after conversion day it will look as though the whole nine years of OS X port history has been a branch in this repo from the beginning.

Just to put the cherry on top, your customers – the project dev group – are a notoriously crusty lot who, on the whole, do not go out of their way to be helpful. If not for a perhaps surprising degree of support from the project lead, the full git conversion wouldn’t be happening at all. Fortunately, the lead groks that it is important in order to lower the barrier to entry for new talent.

I have been working hard on this conversion for eight solid weeks. Supporting it has required that I write several major new features in reposurgeon, including a macro facility, large extensions to the selection-set sublanguage, and facilities for generic search-and-replace on both metadata and blobs.

Experiments and debugging are a pain in the ass because the repository is so big and gnarly that a single full conversion run takes around ten hours. The lift script is over 800 lines of complex reposurgeon commands – and that’s not counting the six auxiliary scripts used to audit and generate parts of it, nor an included file of mechanically-generated commands that is over two thousand lines long.

You might very well wonder what could make a repository conversion worth that kind of investment of time and effort. That’s a good question, and one of those for which you either have enough cultural context that a one-word answer will suffice or else hundreds of words of explanation wouldn’t be enough.

The one word is: Emacs.