blog_post_tests/20131205142311.blog

Heads up: the reposturgeon is mutating!
<p>A few days ago I released reposurgeon 2.43. Since then I&#8217;ve been finishing up yet another conversion of an ancient repository &#8211; groff, this time, from CVS to git at the maintainer&#8217;s request. In the process, some ugly features and irregularities in the reposurgeon command language annoyed me enough that I began fixing them.</p>
<p>This, then, is a reposurgeon 3.0 release warning. If you&#8217;ve been using 2.43 or earlier versions, be aware that there are already significant non-backwards-compatible changes to the language in the repository head version, and there may be more before I ship.  Explanation follows, embedded in more general thoughts about the art of language design.</p>
<p><span id="more-5135"></span></p>
<p>First, a justification.  Most computer languages (including domain-specific languages like reposurgeon&#8217;s) incur high costs when they change incompatibly. It&#8217;s a bad thing when a program breaks halfway through its expected lifetime &#8211; or worse, when its behavior changes in subtle ways without visibly breaking. Responsible language maintainers don&#8217;t make such changes at all if they can help it, and never do so casually.</p>
<p>But reposurgeon has an unusual usage pattern. Lift procedures written in reposurgeon are generally written once, used for a repository conversion, then discarded.  This means that users are exposed to incompatibility problems only if they change versions while a conversion is in progress.  This is usually easy to avoid, and when it can&#8217;t be avoided the lift recipes are generally short and relatively easy to verify.</p>
<p>Thus, the costs from reposurgeon compatibility breakage are unusually low, and I have correspondingly more freedom to experiment than most language designers.  Still, conservatism about breaking compatibility sometimes does deter me, because I don&#8217;t want to casually obsolesce the knowledge of reposurgeon in my users&#8217; heads. Making them re-learn the language at every release would be rude and obtrusive of me.</p>
<p>That conservatism has a downside beyond just slowing the evolution of the language, however.  It can sometimes lead to design decisions, made to preserve compatibility, that produce warts on the language and that you come to regret later.  Over time these pile up as a kind of technical debt that eventually has to be discharged. That discharge is what&#8217;s happening to reposurgeon now.</p>
<p>Now I&#8217;ll stop speaking abstractly and point at some actual ugly spots.  The early design of reposurgeon&#8217;s language was strongly influenced by the sorts of things you can easily do in Python&#8217;s Cmd class for building line-oriented interpreters. What Cmd wants you to do is write command handler methods that are chosen based on the first whitespace-separated token on the line, and get the rest of the line as an argument. Thus, when reposurgeon interprets this:</p>
<pre language="Python">
read foobar.fi random extra text
</pre>
<p>what actually happens is that it&#8217;s turned into a method call to </p>
<pre language="Python">
do_read("foobar.fi random extra text")
</pre>
<p>and how you parse that text input in do_read() is up to you. Which is why, in the original reposurgeon design, I used the simplest possible syntax. If you said</p>
<pre language="Python">
read foobar.svn
delete /nasty content/ obliterate
write foobar2.fi
</pre>
<p>this was interpreted as &#8220;read and parse the Subversion stream dump in the file foobar.svn, delete every commit for which the change comment includes the string &#8216;nasty content&#8217;, then write out the resulting history as a fast-import stream to the file foobar2.fi&#8221;.</p>
<p>Looks innocent enough, yes?  But there&#8217;s a problem lurking here.  I first bumped into it when I wanted to specify an optional behavior for stream writes.  In some circumstances you want some extra metainformation appended to each change comment as it goes out, a fossil identification (like, say, a Subversion commit number) retained from the source version control system.  The obvious syntax for this would look like this:</p>
<pre language="Python">
write fossilize foobar2.fi
</pre>
<p>or, possibly, with the &#8216;fossilize&#8217; command modifier after the filename rather than before it.  But there&#8217;s a problem; &#8220;write&#8221; by itself on a line means &#8220;stream the currently selected history to standard output&#8221;, just as &#8220;read&#8221; means &#8220;read a history dump from standard input&#8221;.  So, if I write</p>
<pre language="Python">
write fossilize
</pre>
<p>what do I mean?  Is this &#8220;write a fossilized stream to standard output&#8221;, or &#8220;write an unfossilized stream to the file &#8216;fossilize&#8217;&#8221;? Ugh&#8230;</p>
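<p>The ambiguity is easy to reproduce with a toy interpreter. What follows is a minimal sketch built on Python&#8217;s standard <code>cmd.Cmd</code>, not reposurgeon&#8217;s actual code; the class name <code>LiftShell</code>, the <code>calls</code> log, and the &#8220;first token is the filename&#8221; rule are all assumptions made for illustration.</p>

```python
import cmd


class LiftShell(cmd.Cmd):
    """Toy line-oriented interpreter (hypothetical, for illustration only)."""

    def __init__(self):
        super().__init__()
        self.calls = []  # record of what each handler decided to do

    def do_write(self, line):
        # Cmd strips the command word and hands us the rest as one string.
        tokens = line.split()
        # Naive positional rule: the first token, if any, is the output file.
        target = tokens[0] if tokens else "<stdout>"
        self.calls.append(("write", target, tokens[1:]))


shell = LiftShell()
shell.onecmd("write foobar2.fi")
shell.onecmd("write")
shell.onecmd("write fossilize")  # a modifier, or a file named "fossilize"?
print(shell.calls)
```

<p>With purely positional tokens the handler has no way to tell a modifier from a filename, so in this sketch the last command quietly targets a file called &#8220;fossilize&#8221;.</p>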
<p>What the universe was trying to tell me is that my Cmd-friendly token-oriented syntax wasn&#8217;t rich enough for my semantic domain. What I needed to do was take the complexity hit in my command language parser to allow it to look at this</p>
<pre language="Python">
write --fossilize foobar2.fi
</pre>
<p>and say &#8220;aha, --fossilize begins with two dashes, so it&#8217;s an option rather than a command argument&#8221;.  The handler would be called more or less like this:</p>
<pre language="Python">
do_write("foobar2.fi", options=["--fossilize"])
</pre>
<p>I chose at the time not to do this because I wanted to keep the implementation simplicity of just treating whitespace-separated tokens on the command line as positional arguments.  What I did instead was introduce a &#8220;set&#8221; command (and a dual &#8220;clear&#8221; command) to manipulate global option flags.  So the fossilized write came to look like this:</p>
<pre language="Python">
set fossilize
write foobar2.fi
clear fossilize
</pre>
<p>That was my first mistake. Those of you with experience at this sort of design will readily anticipate what came of opening this door &#8211; an ugly profusion of global option flags.  By the time I shipped 2.43 there were <em>seven</em> of them.</p>
<p>What&#8217;s wrong with this is that global options don&#8217;t naturally have the same lifetime as the operations they&#8217;re modifying.  You can get unexpected behavior in later operations due to persistent global state.  That&#8217;s bad design; it&#8217;s a wart on the language.</p>
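<p>To make the lifetime mismatch concrete, here is a small sketch. The <code>Session</code> class and its methods are hypothetical stand-ins for interpreter state, not reposurgeon internals; the only thing borrowed from the real tool is the &#8220;fossilize&#8221; flag name.</p>

```python
class Session:
    """Hypothetical interpreter state holding global option flags."""

    def __init__(self):
        self.flags = set()

    def do_set(self, name):
        self.flags.add(name)

    def do_clear(self, name):
        self.flags.discard(name)

    def do_write(self, filename):
        # The write consults whatever global state happens to be live.
        mode = "fossilized" if "fossilize" in self.flags else "plain"
        return f"{filename}: {mode}"


s = Session()
s.do_set("fossilize")
first = s.do_write("foobar2.fi")
# No "clear fossilize" here, so the flag leaks into the next write.
second = s.do_write("other.fi")
print(first)
print(second)
```

<p>The second write comes out fossilized even though nothing on its own line asked for that, which is exactly the kind of surprise persistent global state invites.</p>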
<p>Eventually I ended up having to write my own command parser anyway, for a different reason.  There&#8217;s a &#8220;list&#8221; command in the language that generates summary listings of events in a history. I needed to be able to save reports from it to a file for later inspection.  But I ran into the modifier-syntax problem again. How is the do_list() handler supposed to know which tokens in the line passed to it are target filenames?</p>
<p>Command shells like reposurgeon have faced this problem before.  Nobody has ever improved on the Unix solution to the problem, which is to have an output redirection syntax.  Here&#8217;s a reminder of how that works:</p>
<pre language="Python">
ls foo          # Give me a directory listing of foo on standard output
ls >bar         # Send a listing of the current directory to file bar
ls foo >bar     # Send a listing of foo to the file bar
ls >bar foo     # same as above - ls never sees the ">bar"
</pre>
<p>In reposurgeon-2.9 I bit the bullet and implemented redirection parsing in a general way.  I found almost all the commands that could be described as report generators and used my new parser to make them support output redirection with &#8220;&gt;&#8221;.  A few commands that took file inputs got re-jiggered to use &#8220;&lt;&#8221; instead.</p>
<p>For example, there&#8217;s an &#8220;authors read&#8221; command that reads text files mapping local Subversion- and CVS-style usernames to DVCS-style IDs. Before 2.9, the command to apply an author map looked like this:</p>
<pre language="Python">
authors read foo.map
</pre>
<p>That changed to </p>
<pre language="Python">
authors read &lt;foo.map
</pre>
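<p>The general idea of redirection parsing can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the parser reposurgeon ships: the function name <code>split_redirects</code> is made up, the filename must be glued to the redirect character, and quoting is ignored.</p>

```python
def split_redirects(line):
    """Pull <infile and >outfile tokens out of a command line.

    Returns (remaining_text, infile, outfile). Simplified on purpose:
    no quoting, and the filename must directly follow "<" or ">".
    """
    infile = outfile = None
    rest = []
    for token in line.split():
        if token.startswith("<") and len(token) > 1:
            infile = token[1:]
        elif token.startswith(">") and len(token) > 1:
            outfile = token[1:]
        else:
            rest.append(token)
    return " ".join(rest), infile, outfile


print(split_redirects("authors read <foo.map"))
print(split_redirects("list >report.txt"))
```

<p>As with the Unix shell, the command handler only ever sees the remaining text; the redirect targets are peeled off before it runs.</p>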
<p>But notice that I said &#8220;<em>almost</em> all&#8221;.  To be completely consistent, the expected syntax of my first example should have changed to look like this:</p>
<pre language="Python">
read &lt;foobar.svn
delete /nasty content/ obliterate
write >foobar2.fi
</pre>
<p>That is, read and write should have changed to always require redirection rather than ever taking filenames as arguments.  But when I got to that point, I retained I/O filename arguments for those commands only, also supporting the new syntax but not decommissioning the old.</p>
<p>That was my second mistake. Technical debt piling up&#8230;but, you see, I thought I was being kind to my users.  The other commands I had changed to require redirection were rarely used; &#8220;read&#8221; and &#8220;write&#8221;, on the other hand, pretty much have to occur in every lift script. Breaking my users&#8217; mental model of them seemed like the single most disruptive change I could possibly make.  Put plainly, I chickened out.  </p>
<p>Now we fast-forward to 2.42 and the groff conversion, during which the technical debt finally piled high enough to topple over.</p>
<p>There&#8217;s a reposurgeon command &#8220;unite&#8221; that&#8217;s used to merge multiple repositories into one.  I won&#8217;t go into the full algorithm it uses except to note that if you give it two repositories that are linear, and the root of one of them was committed later than the tip of the other, the obvious graft occurs &#8211; the later root commit is made the child of the earlier tip commit.  I needed this during the groff conversion.</p>
<p>Every time you do a unite you have a namespace-management problem. The repositories you are gluing together may have collisions in their branch and tag names &#8211; in fact they almost certainly have one collision, on the default branch name &#8220;master&#8221;. The unite primitive needs to do some disambiguation.</p>
<p>The policy it had before 2.43 was very simple; every tag and branch name gets either prefixed or suffixed with the name of the repo it came from. Thus, if you merge two repos named &#8220;early&#8221; and &#8220;late&#8221;, you end up with  two tags named &#8220;master-early&#8221; and &#8220;master-late&#8221;.</p>
<p>This turns out to be dumb and heavyhanded when applied to two linear repos with &#8220;master&#8221; as the <em>only</em> collision.  The natural thing to do in that case is to leave all the (non-colliding) names alone, rename the early tip branch to &#8220;early-master&#8221; and leave the late repo&#8217;s &#8220;master&#8221; branch named &#8220;master&#8221;.</p>
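<p>The difference between the two naming policies can be sketched as follows. Both functions are hypothetical illustrations rather than the real unite code, and applying the &#8220;natural&#8221; rule to every colliding name (not just &#8220;master&#8221;) is an assumption of this sketch.</p>

```python
def rename_all(repos):
    """Old policy: suffix every name with the repo it came from."""
    return {repo: {name: f"{name}-{repo}" for name in names}
            for repo, names in repos.items()}


def rename_natural(early, late, repos):
    """Natural policy for two linear repos (assumed generalization):
    only colliding names in the earlier repo are changed."""
    collisions = set(repos[early]) & set(repos[late])
    return {
        early: {name: (f"{early}-{name}" if name in collisions else name)
                for name in repos[early]},
        late: {name: name for name in repos[late]},
    }


repos = {"early": ["master", "v1.0"], "late": ["master", "v2.0"]}
old_names = rename_all(repos)
new_names = rename_natural("early", "late", repos)
print(old_names)
print(new_names)
```

<p>Under the old policy every name is touched; under the natural one only the early repo&#8217;s &#8220;master&#8221; changes, becoming &#8220;early-master&#8221;.</p>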
<p>I decided I wanted to implement this as a policy option for unite &#8211; and then ran smack dab into the modifier-syntax problem again.  Here&#8217;s what a unite command looks like (actual example from recent work):</p>
<pre language="Python">
unite groff-old.fi groff-new.fi
</pre>
<p>Aarrgh!  Redirection sequence won&#8217;t save me this time. Any token I could put in that line as a policy switch would look like a third repository name.  Dammit, I need a real modifier syntax and I need it <em>now</em>.</p>
<p>After reflecting on the matter, I once again copied Unix tradition and added a new syntax rule: tokens beginning with &#8220;--&#8221; are extracted from the command line and put in a separate option set also available to the command handler. Because why invent a new syntactic style when your audience already knows one that will suit?  It&#8217;s good interface engineering to re-use classic notations.</p>
<p>I mentioned near the beginning of this rant that this is what I should have done to the parser much sooner.  Now my new unite policy can be invoked something like this:</p>
<pre language="Python">
unite --natural groff-old.fi groff-new.fi
</pre>
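<p>Option extraction of this kind takes very little code. The sketch below assumes plain whitespace tokenization and a made-up function name, <code>extract_options</code>; it illustrates the rule rather than reproducing the reposurgeon parser.</p>

```python
def extract_options(line):
    """Separate tokens beginning with "--" from positional arguments."""
    options, args = set(), []
    for token in line.split():
        if token.startswith("--"):
            options.add(token)
        else:
            args.append(token)
    return args, options


args, options = extract_options("--natural groff-old.fi groff-new.fi")
print(args, options)

# The earlier ambiguity disappears: a lone option is never a filename.
lone_args, lone_options = extract_options("--fossilize")
print(lone_args, lone_options)
```

<p>The handler gets its repository names as positional arguments and checks the option set separately, so a policy switch can no longer be mistaken for a third repository.</p>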
<p>OK, so I implemented option extraction in my command parser. Then it hit me:  if I&#8217;m prepared to accept a compatibility break, <em>I can get rid of most or all of those ugly global flags</em> &#8211; I can turn them into options for the read and write commands.  Cue angelic choirs singing hosannahs&#8230;</p>
<p>Momentary aside: This is not exceptional. This is what designing domain-specific languages is like all the time.  You run into these same sorts of tradeoffs over and over again.  The interplay between domain semantics and expressive syntax, the anxieties about breaking compatibility, even the subtle sweetness of finding creative ways to re-use classic tropes from previous DSLs&#8230;I love this stuff.  This is my absolute favorite kind of design problem.</p>
<p>So, I gathered up my shovels and rakes and other implements of destruction and went off to abolish global flags, re-tool the read &#038; write syntax, and otherwise strive valiantly for truth, justice, and the American way.  And that&#8217;s when I received my just comeuppance.  I collided head-on with a kluge I had put in place to preserve the old, pre-redirection syntax of read and write.</p>
<p>Since 2.9 the code had supported two different syntaxes:</p>
<pre language="Python">
read foobar.fi       # Old
read &lt;foobar.fi      # New
</pre>
<p>The problem was that the easiest way to do this had been to look at the argument line before the redirection parser sees it, and prepend &#8220;&lt;&#8221; if it doesn&#8217;t already begin with one.  But that means that if I type &#8220;read --fossilize foobar.fi&#8221; the read handler will get this: &#8220;&lt;--fossilize foobar.fi&#8221;. With the &#8220;&lt;&#8221; in entirely the wrong place!</p>
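<p>A compatibility shim of that shape, and the way it misfires, might look roughly like this. The function name <code>legacy_read_fixup</code> is hypothetical and the body is a guess at the general shape of such a kluge, not the code that was actually in reposurgeon.</p>

```python
def legacy_read_fixup(line):
    """Treat a bare filename argument as if it had been redirected."""
    line = line.strip()
    # An empty argument line is left alone: it means "use standard input".
    if line and not line.startswith("<"):
        line = "<" + line
    return line


old_style = legacy_read_fixup("foobar.fi")               # old syntax, rewritten
new_style = legacy_read_fixup("<foobar.fi")              # new syntax, untouched
broken = legacy_read_fixup("--fossilize foobar.fi")      # "<" lands on the option
print(old_style)
print(new_style)
print(broken)
```

<p>The shim knows nothing about options, so it attaches the redirect to whatever token comes first.</p>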
<p>Friends, when this sort of thing happens to you, here is what you will do if you are foolish.  You will compound your kluge with another kluge, groveling through the string with some kind of rule like &#8220;insert &lt; before the first token that does not begin with --&#8221;.  And that kluge will, as surely as politicians lie, come back around to bite you in the ass at some future date.</p>
<p>If you are wise, you will recognize that the time has come to commit a compatibility break, and repent of your error in not doing so sooner.  That is why in the repository tip version of reposurgeon the old pre-redirection syntax of &#8220;read&#8221; and &#8220;write&#8221; is now dead; the command interpreter will throw an error if you try it.</p>
<p>(Minor complication: &#8220;read foo&#8221; still does something useful if foo is a directory rather than a file.  But that&#8217;s OK because we have unambiguous option syntax now.)</p>
<p>But another thing about this kind of design is that once you&#8217;ve accepted you need to do a particular compatibility break, it becomes a propitious time for others. Because one big break is usually easier to cope with than a bunch of smaller ones spread over time.</p>
<p>That means it&#8217;s open season until 3.0 ships on changes in command names and syntactic elements.  If you have used reposurgeon, and there is something you consider a wart on the design, now is the time to tell me about it.</p>