blog_post_tests/20120123114047.blog

Another bite of the reposturgeon
<p>Five weeks ago I wrote that <a href="http://esr.ibiblio.org/?p=4004">direct Subversion support in reposurgeon is coming soon</a>. I&#8217;m waiting on one final acceptance test before I ship an official 2.0; in the meantime, for those of you kinky enough to find the details exciting, here is a description of why this feature has required such a protracted and epic struggle, with (perhaps entertaining) rambling through the ontology of version control systems and at least one lesson about good software engineering practice.</p>
<p><span id="more-4071"></span></p>
<p>Scene-setting: reposurgeon is a command interpreter for performing tricky editing operations on version-control histories.  It gets a lot of its power from the fact that it knows almost nothing about individual version-control systems other than how to turn a repository into a git-style import stream, and an import stream back into a repository. And it expects to call helper programs to do those things &#8211; in the git case, git fast-export and git fast-import.</p>
<p>What looks like editing of repositories is actually editing of import streams. By leaving serialization and deserialization to be somebody else&#8217;s problem, reposurgeon avoids getting entangled in a lot of low-level hairiness about how individual VCSes store and represent files and versions.</p>
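<p>For readers who have never looked inside an import stream, here is a minimal sketch of one: a single blob followed by a single commit that places it at a path. The function name and its parameters are hypothetical, invented for illustration; only the stream commands themselves (blob, mark, data, commit, committer, M) come from the git fast-import format. This is not code from reposurgeon.</p>

```python
def one_commit_stream(path, content, message, committer, when,
                      ref="refs/heads/master"):
    """Return the bytes of an import stream that adds one file in one commit.

    content is the file body as bytes; committer is "Name <email>";
    when is a raw timestamp such as "1327315247 -0500".
    """
    msg = message.encode("utf-8")
    lines = [
        b"blob",                          # file content is declared first...
        b"mark :1",                       # ...and given a mark for later reference
        b"data %d" % len(content),        # byte count, then the raw bytes
        content,
        b"commit " + ref.encode("utf-8"),
        b"mark :2",
        b"committer %s %s" % (committer.encode("utf-8"), when.encode("utf-8")),
        b"data %d" % len(msg),            # the commit message is also a data section
        msg,
        b"M 100644 :1 " + path.encode("utf-8"),  # attach blob :1 at this path
        b"",                              # produces the final newline
    ]
    return b"\n".join(lines)
```

<p>Note that the stream names only a file path and a blob; nothing in it says how any particular VCS stores either. Output of this shape should be acceptable as standard input to git fast-import run inside an empty repository, though I have only sketched it here.</p>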
<p>The benefit of this strategy is that reposurgeon gets to concentrate on the interesting part: high-level surgical operations on the repository&#8217;s metadata and changeset DAG. The cost is that reposurgeon has a lot of trouble editing any metadata that won&#8217;t fit into the import-stream representation.  In practice, because import streams do a pretty good job of capturing the right abstractions, reposurgeon can win big.</p>
<p>Well, that&#8217;s how the model looked two months ago.  There have been some changes since.  The reposurgeon 1.x model that I&#8217;ve just described fails on Subversion because the pre-existing tools for exporting a Subversion repository to an import stream are either weak or broken or both. But no blame attaches; it turns out that these tools suck because the problem is really quite difficult.  Two months, people &#8211; <em>two months</em> of my <em>concentrated</em> attention.</p>
<p>The symptoms of the problem are these: </p>
<p>1. Subversion doesn&#8217;t have a native exporter to import-stream format. I have pretty good zorch with the Subversion developers (I&#8217;m a past code contributor with commit access, though never a core dev) and I&#8217;ve campaigned hard for them to write an official exporter, but it has never happened.</p>
<p>2. Many of the third-party export tools out there only handle linear repositories (no branching). There&#8217;s a strong tendency for people writing these to get to the point where they need to do the mapping from Subversion branches to gitspace branches, pull up short, leave an embarrassed &#8220;to be done&#8221; comment in the code or README, and disappear never to be heard from again.</p>
<p>3. There are a few export tools that do support branchy repositories; the git project&#8217;s own git-svn is probably overall the least bad of these. These require the user to pre-declare the repository&#8217;s branch structure rather than deducing it.  Goddess help you if you skip those declarations or get them wrong.  </p>
<p>4. Even if you do the right thing, the tools often don&#8217;t. They are brittle, slow, and lossy. </p>
<p>Example: in Subversion it is possible to commit a changeset that modifies files in multiple branches. This is a bad idea and people seldom do it on purpose, but it can happen by accident too &#8211; every sufficiently large and old Subversion repository has a few such accidents in it, and they confuse conversion tools horribly.</p>
<p>Example: The right way to create a Subversion branch is &#8220;svn copy&#8221;; the wrong way (which not infrequently happens by finger error) is an ordinary directory copy followed by an svn add of the directory. If you do this, later checkouts will look right but the internal information linking the new branch to the rest of the repository will be missing.  When you try to convert such a repo (with anything other than reposurgeon), you&#8217;ll end up with a detached branch floating in midair.</p>
<p>Example: In Subversion it is possible to delete a branch or rename it, then immediately create another branch with the same name. But if you feed such a repository to a conversion tool, the result is almost certain to unpleasantly surprise you.</p>
<p>I could go on at book length about more symptoms. Underlying them, Subversion&#8217;s model of how version control works is tricky and complicated, with edge cases that sneak up on you. It allows combinations of operations that are rather perverse (cross-branch mixed commits being one of the easier cases in point to understand). </p>
<p>The import-stream model is much simpler &#8211; there are fewer combinations of primitive operations and thus less to go wrong there. This is good, but moving content and metadata from one to the other in full generality is a stone bitch. My criticisms of the pre-existing tools may seem harsh, but having grappled with this problem myself I have nothing but sympathy for the people who failed at it previously.  The difficulties are very like what hardware people call an impedance mismatch.</p>
<p>My first real step to solving this problem was to let Subversion itself do as much of the work as possible.  It has no import-stream exporter, but it does have the ability to serialize a repository into a dumpfile not totally dissimilar from an import stream, and relatively easy to parse.  What reposurgeon does is parse this dump file.</p>
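<p>To give a feel for why the dumpfile is relatively easy to parse, here is a minimal sketch of a record reader. It assumes the usual layout: each record is a block of "Key: value" header lines, a blank line, and then exactly Content-length bytes of body if that header is present. The function names are hypothetical and this is a great deal simpler than reposurgeon's real parser; it ignores property parsing, text deltas, and everything else that makes the job hard.</p>

```python
import io


def read_records(stream):
    """Yield (headers, body) pairs from a binary Subversion dump stream."""
    while True:
        line = stream.readline()
        if not line:
            return                      # end of file
        if not line.strip():
            continue                    # blank separator between records
        headers = {}
        while line.strip():             # header block ends at a blank line or EOF
            key, _, value = line.decode("utf-8").partition(":")
            headers[key.strip()] = value.strip()
            line = stream.readline()
        # The body (properties plus file text) is length-prefixed, not delimited.
        length = int(headers.get("Content-length", 0))
        body = stream.read(length)
        yield headers, body


def group_revisions(stream):
    """Collect node records under the revision record that precedes them."""
    revisions = []
    for headers, _body in read_records(stream):
        if "Revision-number" in headers:
            revisions.append({"rev": int(headers["Revision-number"]), "nodes": []})
        elif "Node-path" in headers and revisions:
            revisions[-1]["nodes"].append(headers)
    return revisions


def parse_dump_bytes(data):
    """Convenience wrapper for parsing a dump already held in memory."""
    return group_revisions(io.BytesIO(data))
```

<p>The copy-source headers (Node-copyfrom-rev and Node-copyfrom-path) survive in each node's header dictionary, which is the raw material any branch analysis has to work from.</p>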
<p>Actually, in a very early version of the code I <em>didn&#8217;t</em> parse the dumpfile; instead, I used Subversion&#8217;s own client tools to mine information from each repository. There turned out to be two problems with this: (1) piecing all the data together from different tool reports is complicated, and (2) the Subversion tools are horrifyingly slow. I finally gave up on this approach when I discovered that mining a 3K-commit repository this way took <em>eight hours</em>.</p>
<p>In retrospect I should have started with a Subversion dumpfile parser sooner; I was distracted from that approach by the prospect of building a general history-replaying framework that could be applied with minor changes to mining other repository types as well.  I had to give up on that objective to make real progress with Subversion, though the replay framework still lives, unused, inside of reposurgeon.  It might get used for something else someday, if it hasn&#8217;t bit-rotted first.</p>
<p>Those of you familiar enough with Subversion might be wondering, at this point, why I didn&#8217;t use Subversion&#8217;s client API rather than writing a dumpfile parser. Three reasons: (1) complexity, (2) stability, and (3) documentation.  </p>
<p>(1) Holding the markup features and semantics of a relatively simple plain-text dump format in your head is generally easier than remembering all the ins and outs of a complicated API accessing the deserialized version of the same data. It&#8217;s more concrete; you can visualize things.</p>
<p>(2) APIs change.  Subversion&#8217;s client API has changed faster than the dumpfile format, and that can be expected to continue.  By parsing the dumpfile directly I avoid a whole class of completely artifactual version-skew problems with the API.  (Actually, APIs &#8211; there are two competing ones for Python.)</p>
<p>(3) The APIs are thinly and poorly documented.  So was the dumpfile format.  But because it&#8217;s easy to see all the way down to the bottom of the latter, I liked my chances of coping with the poor documentation of the dumpfile better.</p>
<p>One of the things I ended up doing as a side effect of this project, actually, was writing much more complete and detailed documentation of the Subversion dumpfile format than had existed before &#8211; I plied the Subversion developers with questions in order to do this.  The results now live in the Subversion repo.</p>
<p>If you take no other lesson from this essay, heed this one: Should you ever find yourself in a similar situation (exploratory parsing of a poorly-documented textual format), stop.  Stop coding immediately.  <em>Document the format first.</em> Check your conjectures with the host program&#8217;s developers, make sure you know what&#8217;s going on, push the resulting document upstream, and get it accepted.  </p>
<p>This may sound like a lot of work, but I <em>guarantee</em> you that a few days of pre-documenting will save you weeks of arduous debugging time. Writing that documentation is not just a worthy service to other programmers in the future, it&#8217;s an implicit specification of what <em>your</em> parser has to do that will save you from flailing around in the dark.</p>
<p>Once I fully grokked the dumpfile format, syntax and semantics both, I could tackle the actual meat of the problem &#8211; mapping Subversion&#8217;s ontology to the ontology of import streams.  This is where &#8220;impedance mismatch&#8221; starts to be a relevant concept.  It&#8217;s where the pre-existing tools fall down badly.</p>
<p>The differences between these ontologies cluster around two large ones: (1) Subversion has flows, while import streams do not, and (2) the treatment of branching is quite different.  </p>
<p>A &#8220;flow&#8221; is what internal Subversion documentation rather confusingly calls a &#8220;node&#8221; &#8211; a time series of changes to a single file or directory considered as a unit for purposes of change tracking.  If you create a file, modify it several times, delete it, and then again create a file with the same name,  that will be two different flows that happen to have the same path.</p>
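<p>The distinction between a flow and a path can be made concrete with a small sketch. It assumes a simplified operation list of (revision, action, path) tuples with only three actions, "add", "change" and "delete"; that data model and the function name are mine, invented for illustration, and are not how Subversion or reposurgeon represent things internally.</p>

```python
def split_into_flows(ops):
    """Group operations into flows.

    ops is an iterable of (revision, action, path) in revision order.
    A new flow starts at every "add"; a "delete" closes the flow that is
    currently live at that path.
    """
    flows = []        # every flow ever created, in creation order
    live = {}         # path -> the flow currently occupying that path
    for rev, action, path in ops:
        if action == "add":
            flow = {"path": path, "revs": [rev]}
            flows.append(flow)
            live[path] = flow
        elif path in live:
            live[path]["revs"].append(rev)
            if action == "delete":
                del live[path]          # the path is now free for a new flow
    return flows
```

<p>Fed the create, modify, delete, re-create sequence described above, this returns two flows with the same path. An import stream would see only one path whose content changes, vanishes, and reappears.</p>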
<p>In the import-stream world there are no flows &#8211; just file paths pointing to  blobs of content. The practical difference is that the semantics of some legal Subversion operations &#8211; notably directory deletes and renames &#8211; are difficult to translate into the language of import streams.  In fact import streams don&#8217;t have any notion of &#8220;directories&#8221; in themselves; they&#8217;re expected to be automatically created when the creation of a file requires it, and to be garbage-collected when they become empty due to file deletions.</p>
<p>(I&#8217;m not actually very clear about <em>why</em> Subversion has flows; the obvious guess would be to help in deductions about history-sensitive merging, but that&#8217;s something Subversion has never actually handled very well. On the other hand, it&#8217;s only fair to note that <em>nobody</em> was handling it well when Subversion was designed.)</p>
<p>While the existence of flows in Subversion mainly just produces a few odd edge cases that you have to be careful of, the other major difference &#8211; in the semantics of branching &#8211; is a much bigger deal.  Its effects are pervasive.</p>
<p>In Subversion, a branch is nothing more or less than a copy of a source directory, made with &#8220;svn copy&#8221; so it preserves an invisible link back to the source directory and the revision when the copy was done. All branches are always visible from the top level of the repository.  Some branches are used to represent tags (release states of the code) and are never touched by commits  after the copy; others represent lines of development and <em>are</em> changed by later commits. A branch copy is an operation that stays visible in the commit history.</p>
<p>Conventionally there is a &#8220;trunk&#8221; branch directly under the repo root that represents the main line of development, tags live under a &#8220;tags&#8221; subdirectory, and branches live under a &#8220;branches&#8221; subdirectory &#8211; but nothing enforces these conventions.  Hello, cross-branch mixed commits! </p>
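<p>Because branches are just directories, spotting a mixed commit comes down to asking which branch directory each touched path lives under. Here is a minimal sketch that assumes the conventional trunk/branches/tags layout and nothing else; the function names are hypothetical and this is not reposurgeon's own logic.</p>

```python
def branch_of(path):
    """Return the conventional branch directory containing path, or None."""
    parts = path.strip("/").split("/")
    if parts[0] == "trunk":
        return "trunk"
    if parts[0] in ("branches", "tags") and len(parts) >= 2:
        return "/".join(parts[:2])      # e.g. "branches/stable"
    return None                         # outside the conventional layout


def branches_touched(paths):
    """Set of conventional branch directories touched by one commit's paths."""
    return {b for b in map(branch_of, paths) if b is not None}


def is_mixed_commit(paths):
    """True if a single commit modifies files in more than one branch."""
    return len(branches_touched(paths)) > 1
```

<p>A commit touching both trunk/a.c and branches/stable/a.c is flagged; one touching only files under trunk is not. In an import stream the first case has no direct equivalent, since each commit belongs to exactly one branch.</p>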
<p>In git (and other version-control systems that speak import streams) the model is completely different.  Branches are always used for lines of development, never for tags &#8211; the import-stream model has real annotated tags instead.  Only one branch is available for modification at any given time &#8211; there&#8217;s no possibility of a cross-branch mixed commit.  Branch creations don&#8217;t show up as operations in the commit history. </p>
<p>It took quite a bit of time and thought to figure out how to map smoothly from the Subversion branch model to the import-stream one. One of my requirements was that the user should <em>not</em> have to declare the branch structure!  You&#8217;ll be able to read the detailed rules on reposurgeon 2.0&#8217;s manual page; the short version is that if trunk is present, then trunk, branches/*, and tags/* are treated as candidate branches, <em>and so is every other directory immediately under the repository root</em>.  But: a candidate branch is turned into a tag if there are no commits after the copy that created it.  </p>
<p>If trunk is not present, no branch analysis is done &#8211; that is, the repo is translated as a straight linear sequence of commits.  There&#8217;s an option to force this behavior if your repo&#8217;s Subversion branch structure is so weird that the above rules would mangle it.  In that case you&#8217;ll need to do your own post-conversion surgery.</p>
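<p>The short version of the rules can be sketched in a few lines. This is an illustration of the rules as stated here, not the analyzer itself, and it leans on simplifying assumptions of my own: every path lies inside some directory under the root, the first revision touching a candidate is taken to be the copy that created it, and trunk is always treated as a branch. The function names and the input shape are hypothetical.</p>

```python
def candidate_branch(path):
    """Map a repository path to its candidate branch directory, or None."""
    parts = [p for p in path.split("/") if p]
    if not parts:
        return None
    if parts[0] in ("branches", "tags"):
        # branches/* and tags/*: the candidate is the second-level directory
        return "/".join(parts[:2]) if len(parts) > 1 else None
    # trunk, and every other directory immediately under the repository root
    return parts[0]


def analyze(commits):
    """Classify candidate branches.

    commits is a list of (revision, [touched paths]) in ascending order.
    Returns None when there is no trunk (translate the history as linear);
    otherwise a dict mapping each candidate to "branch" or "tag".
    """
    revisions = {}
    for rev, paths in commits:
        for path in paths:
            name = candidate_branch(path)
            if name is not None:
                revisions.setdefault(name, set()).add(rev)
    if "trunk" not in revisions:
        return None
    result = {}
    for name, revs in revisions.items():
        # A candidate never touched after its creating revision becomes a tag.
        if name == "trunk" or len(revs) > 1:
            result[name] = "branch"
        else:
            result[name] = "tag"
    return result
```

<p>The hard part, which this sketch skips entirely, is working out from the copy operations where each branch came from and attaching it to the right parent commit.</p>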
<p>The rules I ended up with are simple, but implementing them is not. Analysis of the Subversion copy operations is the trickiest and most bug-prone part of the dumpfile analyzer. I had a pretty fair idea going in how hairy this part was going to be, which is why I approached the Network UPS Tools project and asked to convert their Subversion repo for them. </p>
<p>As a past contributor, I knew that the NUT Subversion repo is large, complex in branch structure, and old enough to have begun life as a CVS repo.  That last part matters because some of the ugliest translation problems lurking in the back history of Subversion projects are strange Subversion operation sequences (including combinations of branch copy operations) generated by cvs2svn.</p>
<p>So: I explained to the NUT crew what I&#8217;m doing.  I told them they&#8217;d get a better quality conversion to git than any of the existing tools will deliver &#8230; eventually.  I was up front about the conversion code being a beta that would probably break nineteen different ways on their repo before I was done, and that&#8217;s sort of the point.  Fortunately, I got support from the project lead (thank you, Arnaud Quette!) and active cooperation from the project&#8217;s internal advocate for a git switchover (thank you, Charles Lepple!).  </p>
<p>Setting up this real-world test turned out to be a Good Thing.  Charles Lepple has pointed out more than a dozen bugs that turned out to be due to my code not handling strange cases in the Subversion metadata well enough, cases that a smaller and younger and less grungy repository history might never have exhibited.  I fixed another one this morning. There is a realistic chance it will have been the last one&#8230;but maybe not.</p>
<p>I&#8217;ll ship reposurgeon 2.0 when Charles and Arnaud sign off on the NUT-UPS conversion.  Besides stomping all the branch-analysis bugs we can find, there&#8217;s one more feature to make work.  I want to add a surgical primitive that can find and perform merges back to trunk for Subversion branch tips that are in a mergeable state. This would not actually be a Subversion-specific feature, but applicable to any import stream.</p>
<p>If I shipped 2.0 today, reposurgeon would already blow every other utility for lifting Subversion repositories clean out of the water. I&#8217;m not satisfied with that; I want it to be <em>bulletproof</em>. But for those of you itching to get your hands on the beta: </p>
<p>git@gitorious.org:reposurgeon/reposurgeon.git</p>
<p>No warranties express or implied, etc.  I have documented it all, however. Throw it at the gnarliest repo you can find and let me know if you spot bugs.</p>