Lessons learned from reposurgeon
<p>OK, I&#8217;m officially coming out of my cave now, after what amounted to a two-week coding orgy. I&#8217;ve shipped <a href="http://www.catb.org/esr/reposurgeon/">reposurgeon 0.5</a>; the code looks and feels pretty solid, the <a href="http://www.catb.org/esr/reposurgeon/reposurgeon.html">documentation is written</a>, the test suite is in place, and I&#8217;ve got working repo-rebuild support for two systems, one of which is not git. </p>
<p>The rest is cleanup and polishing. Likely the next release or the one after will be 1.0. It&#8217;s time for an after-action report. As usual, I learned a few things from this project. Some are worth sharing.</p>
<p><span id="more-2727"></span></p>
<p>The basic concept was to edit repositories by (1) serializing them to git-import streams with export tools, (2) deserializing that representation into a bunch of Python objects in the front end of reposurgeon, (3) hacking the objects, (4) serializing them to a modified git-import stream with the back end of reposurgeon, and (5) feeding that to importers to recreate a repository.</p>
<p>Did it work? Oh, yes, indeed it did. By decoupling reposurgeon from the internals of any particular VCS I was able to write a program that, for its power and range, is <em>ridiculously</em> short and simple. 2665 lines of Python can perform complex editing operations on git and bzr, and will add more VCSes to its range as fast as they grow importers. (Next one will almost certainly be hg; its exporter works, but its importer is still somewhat flaky.)</p>
<p>In fact I exceeded my planned feature-list &#8211; found myself implementing a couple of extra operations (topological cut and commit split), simply because the representation-as-Python-objects was so easy to work with that it suggested possibilities.</p>
<p>Another thumping success was the interface pattern &#8211; a line-oriented interpreter with the property that command-line arguments are simply treated as lines to be passed to it, with one magic cookie &#8220;-&#8221; meaning &#8220;go into interactive mode and take commands until EOF&#8221;. This proved extremely flexible and easy to script, especially after I wrote the &#8220;script&#8221; command (execute command lines from a named file). </p>
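The interface pattern fits in a few lines. A sketch of the idea, assuming a hypothetical <code>execute()</code> dispatcher standing in for the real command machinery:

```python
import sys

def execute(line):
    """Stand-in for the real command dispatcher."""
    print("executing:", line)

def main(argv):
    for arg in argv:
        if arg == "-":
            # Magic cookie: drop into interactive mode, taking
            # commands from stdin until EOF.
            for line in sys.stdin:
                execute(line.rstrip("\n"))
        else:
            # Every other argument is itself a command line.
            execute(arg)

if __name__ == "__main__":
    main(sys.argv[1:])
```

The payoff of this design is that batch use, interactive use, and scripted use are all the same code path, which is exactly why bolting on regression tests was so cheap.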
<p>With this interface in place, adding stuff to the regression-test suite was almost absurdly easy. Which meant that I wrote a lot of tests. Which is good!</p>
<p>What&#8217;s the biggest drawback? Speed, or rather lack of it. On large repos, waiting for that repo read to finish is going to take a while. Can&#8217;t be helped, really; first, the exporter has to serialize the repo, then reposurgeon has to read and digest that whole input stream. I profiled, and as is normal with code like this it spends most of its time either on disk I/O or in the string library. There just aren&#8217;t any well-defined optimization targets sticking up to be flattened.</p>
<p>The performance issue raises an obvious question: was Python a mistake here? I love the language, but there is no denying that it is a high-overhead, poor-performing language even relative to others in its class, let alone relative to a compiled language like C.</p>
<p>On the other hand&#8230;no way this would have been a 16-day project in C. I&#8217;d have been lucky to get it done in sixteen <em>weeks</em> in C, probably with 14 of them spent chasing resource-management bugs and a grievously high maintenance burden down the road. Python I can fire and, more or less, forget &#8211; no core-dump heisenbugs for me, baby.</p>
<p>You pays your money and you takes your choice. I think Python was right here &#8211; heck, if I&#8217;d had to do it in C I might not have been brave enough to try it at all, and my other scripting languages are now rusty enough that I would have lost a lot of project time to relearning them. But a slowish language is a relatively easy tradeoff here because this is not a program I expect to be in daily production use by anyone; it&#8217;s a coping mechanism for unusual situations.</p>
<p>Jay Maynard happened to be staying at my place during the first frenzied week of hacking. He actually argued semi-seriously that it&#8217;s a <em>good thing</em> reposurgeon is slow, as repo surgery is such a fraught and potentially shady proposition that it <em>should</em> be difficult and give you plenty of time to reflect. Less chance of fingers getting ahead of brain that way, maybe.</p>
<p>Most fun part of the project? That&#8217;s a toss-up, I think. One candidate was the graph-coloring algorithm I wrote to check whether breaking a given parent/child link would actually accomplish a topological cut of the commit graph. Another candidate is the bit of reduction algebra that I wrote to simplify sequences of file operations after a commit deletion. Both held challenges pleasing to my ex-mathematicianly self.</p>
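The topological-cut check can be sketched compactly: delete the candidate parent/child edge, flood-fill a &#8220;color&#8221; outward from each endpoint treating edges as undirected, and the cut is clean only if no commit ends up colored from both sides. The adjacency-dict encoding below is an illustrative assumption, not reposurgeon's internal representation:

```python
from collections import deque

def is_topological_cut(parents, parent, child):
    """True if removing the parent->child link splits the commit graph.

    `parents` maps each commit to the list of its parent commits.
    """
    # Build an undirected neighbor map, minus the candidate edge.
    neighbors = {c: set() for c in parents}
    for c, ps in parents.items():
        for p in ps:
            if (p, c) == (parent, child):
                continue
            neighbors[c].add(p)
            neighbors[p].add(c)

    def flood(start):
        # Breadth-first "coloring" of everything reachable from start.
        seen, queue = {start}, deque([start])
        while queue:
            for n in neighbors[queue.popleft()]:
                if n not in seen:
                    seen.add(n)
                    queue.append(n)
        return seen

    # A clean cut: the two endpoints' components never touch.
    return not (flood(parent) & flood(child))
```

On a linear chain, cutting any link separates the history; in a diamond-shaped merge, cutting one leg does not, because the other leg still connects the halves.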
<p>The least fun part was being dependent on flaky import-export tools. And I found out that, even though it&#8217;s the first non-git system I have support for in reposurgeon, I don&#8217;t like bzr. I don&#8217;t like it at <em>all</em>.</p>
<p>In your bog-standard generic DVCS &#8211; of which git and hg are the two leading examples &#8211; the unit of work (the thing that&#8217;s passed around when you clone and push and pull) is an entire repository that can contain multiple branches and merges and looks like a general DAG (directed acyclic graph). Sync operations between repositories are merge operations aimed at coming up with a reconciled history DAG that both sides of the merge can agree on.</p>
<p>(Yes, I love thinking about this stuff. Shows, doesn&#8217;t it?)</p>
<p>bzr has a conceptual problem. It can&#8217;t decide whether it wants its unit of work to be a repo or a branch. The way it wants you to work is to create a branch which is copied from and tied back to a branch in a remote repo. Your branch <em>isn&#8217;t</em> the entire project history, just one strictly linear segment of it that you occasionally push and merge to other peoples&#8217; branches (or collections of branches).</p>
<p>Maybe this model has virtues I&#8217;m oblivious to, but it seems both restrictive and overcomplicated to me. Here&#8217;s where it bites reposurgeon: the bzr exporter can only export branches, but the importer creates an (implicitly multi-branch) repo with your branch living in a subdirectory named &#8216;trunk&#8217;. </p>
<p>It seems deeply wrong to me that the import and export operations aren&#8217;t lossless inversions of each other as they are in git (well, except for cryptosigned tags) and will be in hg when it gets its importing act together. I&#8217;m now trying to persuade the relevant bzr dev that they&#8217;ve missed the point of import streams and should be exporting entire repo graphs rather than just branches. We&#8217;ll see how that goes.</p>
<p>Late nights, monomaniacal concentration, and code code code. Ahhh. Not the life for everyone, but it suits me fine.</p>