If RCS can stand it, why can’t your system?

In the past, I’ve written software for a lot of different reasons besides pure utility. Sometimes I’ve been making an aesthetic statement, sometimes I’ve hacked to perpetuate a tribal in-joke, and at least once I have written a substantial piece of code exactly because the domain experts solemnly swore that job was impossible to automate (wrong, bwahahaha).

Here’s a new one. Today I released a program that is ugly and only marginally useful, but specifically designed to shame other hackers into doing the right thing.

Those of you who have been following the saga of reposurgeon on this blog will be aware that it relies on the rise of git fast-import streams as a universal history-interchange format for VCSes (version-control systems).
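For those who haven’t met one, a fast-import stream is a simple flat-text serialization of a repository’s history: blobs, commits, and branch operations, in order. A minimal (synthetic) stream looks like this:

```
blob
mark :1
data 13
hello, world

commit refs/heads/master
mark :2
committer J. Random Hacker <jrh@example.com> 1289000000 +0000
data 17
Initial revision
M 100644 :1 README
```

Anything that can read and write this format can exchange history with anything else that does.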

Alas, support for this is still spotty. On git, where the format was invented, it’s effectively perfect. bzr comes closest to getting it right elsewhere, with official import and export plugins, though there’s a weird asymmetry between export and import. Mercurial has a fairly solid third-party exporter, but its third-party importer is crap. Everywhere else the situation is worse; Subversion is typical in that it has a proof-of-concept third-party exporter that loses some metadata, and no importer at all.

And this is ridiculous, actually. It’s already generally understood that writing exporters to the stream format is dead easy – the problem there seems to be political, in that VCS devteams are perhaps a bit reluctant to support tools that make migration off their systems easier. But having written one myself for reposurgeon, I know that a high-quality importer (which encourages migration towards your VCS) is not all that difficult either. Thus there’s no excuse, either technical or political, for any self-respecting VCS not to have an importer.

I decided to prove this point with code. So I dusted off the oldest, cruftiest version-control system still in anything resembling general use – Walter Tichy’s Revision Control System (RCS). And I wrote a lossless importer for it. Took me less than a week, mostly repurposing code I’d written for reposurgeon. The hardest part was mapping import-stream branches to RCS’s screwy branch-numbering system.
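To give a flavor of the problem: RCS numbers trunk revisions 1.1, 1.2, 1.3 and so on, while a branch rooted at (say) 1.2 gets revisions numbered 1.2.1.1, 1.2.1.2, and a second branch at the same point starts at 1.2.2.1. An importer has to allocate those dotted numbers itself as it maps the stream’s flat branch names. A trivial sketch of the allocation rule (function names invented for illustration, not taken from rcs-fast-import):

```python
def branch_prefix(root_rev, branch_serial):
    """RCS branch prefix for the branch_serial-th branch rooted at
    trunk revision root_rev: branch_prefix("1.2", 1) == "1.2.1"."""
    return "%s.%d" % (root_rev, branch_serial)

def branch_revision(root_rev, branch_serial, commit_serial):
    """Revision number of the commit_serial-th commit on that branch:
    branch_revision("1.2", 1, 2) == "1.2.1.2"."""
    return "%s.%d" % (branch_prefix(root_rev, branch_serial), commit_serial)
```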

To appreciate how silly this was on any practical level, you need to know that RCS doesn’t have changesets (the ability to associate the same comment and metadata with a change to multiple files). I cheat by embedding changeset-oriented data as RFC822 headers in RCS comment fields. An exporter could be written to invert this and completely recover the contents of the import stream, and I’ve been communicating with the author of rcs-fast-export.rb; it may actually do this soon.
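The header names below are invented for illustration (the real header set in rcs-fast-import may differ), but the shape of the trick is this: RFC822-style headers, then a blank line, then the original comment text, all stored in the RCS comment field:

```
Event-Mark: :2
Committer: J. Random Hacker <jrh@example.com> 1289000000 +0000
Parent-Marks: :1

The original commit comment goes here, unmodified, so an exporter
can strip the headers and reconstruct the stream exactly.
```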

There is one circumstance under which rcs-fast-import might be useful: if you wanted to break a project repo into multiple pieces, blowing it apart into constituent RCS files and re-exporting separately from the cliques might be a way to do it. But mainly I wrote this as a proof of principle. If crufty old RCS can bear the semantic weight of an import stream, there is simply no excuse left for VCSes that claim to be modern production-quality tools to be behindhand on this. None.

At this point there is an inevitable question burning in the minds of those of you who are moderately clued in about ancient VCSes. And that is: “What about SCCS?”

Ah, yes. The only VCS even cruftier and more ancient than RCS. Those of you really clued in about ancient version-control systems will have guessed the answer: the two are so similar that making rcs-fast-import speak SCCS, if anyone ever wants that, would be pretty trivial (in particular, they have the same branching semantics, which is the hard part). Actually, the code is already factored to support this; out of 841 lines, only 36 are the plugin class that exercises the RCS command set, and an SCCS plugin wouldn’t be more than a few lines longer.

But I targeted RCS partly because it’s still in actual use; some wiki engines employ it as a page-versioning backend because it’s fast and lightweight, and they neither want nor need changesets. In truth, if you have a directory full of documents each of which you want to treat as an atomic unit, RCS still has utility (I use it that way). SCCS, on the other hand, survives, if at all, in a handful of creakingly ancient legacy installations.

(What made the difference? RCS was open-source from the get-go. SCCS wasn’t. We all know how that dance goes.)

Yes, 841 lines. 574 of them, 68%, stripped out of reposurgeon. Less than a week of work, crammed in around corners while I was busy with other things. It’s not complicated or tricky code; the trick is in having the insight that it’s possible. And it’s a living rebuke to every modern VCS that hasn’t gotten its fast-import act together yet.

One entertaining side-effect of this project is that I figured out, in detail, how CVS could have been written to not suck.

Those of you into VCS archeology will know that CVS was a layer over an RCS file store, a layer that tried to provide changesets. It was notoriously failure-prone in some important corner cases. This is what eventually motivated the development of Subversion by a group of former CVS maintainers.

Well…here I am, writing rcs-fast-import to make RCS hold the data for losslessly reconstructing import-stream changesets…and at some point I found myself doing a double-take because I realized I had solved CVS’s problems. Here’s how I explained it to Giuseppe Bilotta, the author of rcs-fast-export:

Incidentally, a side effect of writing the importer was that I figured
out how CVS should have been written so it wouldn’t have sucked :-) It
had a tendency to break horribly near deletes, renames and copies;
this is because the naive way to implement these (which they used)
involved deleting, copying, and renaming RCS master files.

In fact, I figured out last night, while building my importer so it
would round-trip, that you can have reliable behavior from changesets
layered over RCS only if you *never delete a master*, not even to
rename it. I know what the right other rules are, but it’s nearly
twenty years too late for that to matter.

Sigh. If I had looked at this problem in 1985 I could have saved
the world a lot of grief.

Giuseppe said:

> I’m very curious to hear about your solution about tracking these
> operations. I mean, git doesn’t track them because it only cares about
> contents and trees, but how would you do that in a file-based vcs?

Here’s how I explained it:

On delete, don’t delete the master. Enter a commit that’s an empty file
and set the state to “Deleted”. (The second part isn’t strictly necessary,
but will be handy if you need to do forensics because other parts of
your repo metadata have been corrupted.)

On rename, don’t delete the master. Copy it to the new name, so the
renamed file has the history from before the rename, but also *leave
that history in the original*. Give the original a new empty commit
as you did for delete, with a state of ‘Renamed’. If your RCS has
a properties extension, give that commit a rename-target property
naming what it was renamed to, and give the same commit on the
master copy a renamed-from property referencing the source.

On copy, check out the file to be copied and start a new master with
no history. If your RCS has a properties extension, give that commit
a copied-target property naming what it was copied to, and give the
initial commit of the copy a copied-from property referencing the
source.

On every commit, write a record to a journal file that looks like a
git-fast-import commit record, except that the <ref> parts are RCS
revision numbers.

You’re done. It may take you a bit to think through all the
retrieval cases, but they’re all covered by one indirection through
the journal file.

Don’t like the journal file? No problem, you just write a
sequence-numbered tag to all files for each commit. This would be
slower, though.

There are optimizations possible. Strictly speaking, if you have
an intact chain of rename properties you can get away with not
copying history to the target of a rename.

The key point is that once a revision has been appended to a specific
master, you *never delete it*. Ever. That simple rule and some pointer
chasing gives you a revision store that is robust in the presence of D,
R, and C operations. Nothing else does.

Not by coincidence, modern DVCSes use the same strategy.

He did raise one interesting point:

> Considering the debates still going on the “proper” way to handle
> renames and copies, I’m not sure it would have been accepted ;-)

But it turns out that has an answer, too:

The above scheme gives you git-like semantics. For bzr-like semantics,
add one more wrinkle: a unique identity cookie assigned at master-creation
time that follows it through renames. Your storage management doesn’t
care about this cookie; only the merge algorithm cares.

All this is so simple [and actually implemented in rcs-fast-import] that
I’m now quite surprised the CVS people got it wrong. I could ship a
daemon that implemented these rules in a week, and could have done
so in 1985 if I’d known it was important.
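In sketch form, that wrinkle is tiny. Again the names here are invented, and stock RCS would have to keep the cookie in a sidecar file (or the description) rather than a real property:

```python
import json, os, uuid

COOKIES = "identity.json"   # invented sidecar store for identity cookies

def _load():
    return json.load(open(COOKIES)) if os.path.exists(COOKIES) else {}

def _save(table):
    with open(COOKIES, "w") as fp:
        json.dump(table, fp)

def create_master(path):
    "Mint a permanent identity cookie when a master is first created."
    table = _load()
    table[path] = str(uuid.uuid4())
    _save(table)

def rename(old, new):
    "The cookie follows the file through renames; storage never reads it."
    table = _load()
    table[new] = table.get(old) or str(uuid.uuid4())
    _save(table)
```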

Sigh…sometimes the hardest part is knowing what to spend your thinking time on. I console myself with the thought that, after all, I have gotten this right at some of the times when it mattered.