Python speed optimization in the real world
<p>I shipped reposurgeon 2.29 a few minutes ago. The main improvement in this version is speed &#8211; it now reads in and analyzes Subversion repositories at a clip of more than 11,000 commits per minute. This is, in case you are in any doubt, <em>ridiculously</em> fast &#8211; faster than the native Subversion tools do it, and for certain far faster than any of the rival conversion utilities can manage. It&#8217;s well over an order of magnitude faster than when I began seriously tuning for speed three weeks ago. I&#8217;ve learned some interesting lessons along the way.</p>
<p><span id="more-4861"></span></p>
<p>The impetus for this tune-up was the Battle for Wesnoth repository. The project&#8217;s senior devs finally decided to move from Subversion to git recently. I wasn&#8217;t actively involved in the decision myself, since I&#8217;ve been semi-retired from Wesnoth for a while, but I supported it and was naturally the person they turned to to do the conversion. Doing surgical runs on that repository rubbed my nose in the fact that code with good enough performance on a repository 500 or 5000 commits long won&#8217;t necessarily cut it on a repository with over 56000 commits. Two-hour waits for the topological-analysis phase of each load to finish were kicking my ass &#8211; I decided that some serious optimization effort was a far better idea than twiddling my thumbs.</p>
<p>First I&#8217;ll talk about some things that didn&#8217;t work.</p>
<p><a href="http://pypy.org/">pypy</a>, which is alleged to use fancy JIT compilation techniques to speed up a lot of Python programs, failed miserably on this one. My pypy runs were 20%-30% <em>slower</em> than plain Python. The pypy site warns that pypy&#8217;s optimization methods can be defeated by tricky, complex code, and perhaps that accounts for it; reposurgeon is nothing if not algorithmically dense.</p>
<p><a href="http://www.cython.org/">cython</a> didn&#8217;t emulate pypy&#8217;s comic pratfall, but didn&#8217;t deliver any speed gains distinguishable from noise either. I wasn&#8217;t very surprised by this; what it can compile is mainly control structure. which I didn&#8217;t expect to be a substantial component of the runtime compared to (for example) string-bashing during stream-file parsing.</p>
<p>My grandest (and perhaps nuttiest) plan was to translate the program into a Lisp dialect with a decent compiler. Why Lisp? Well&#8230;I needed (a) a language with unlimited-extent types that (b) could be compiled to machine-code for speed, and (c) minimized the semantic distance from Python to ease translation (that last point is why you Haskell and ML fans should refrain from even drawing breath to ask your obvious question; instead, go read <a href="http://norvig.com/python-lisp.html">this</a>). After some research I found Steel Bank Common Lisp (SBCL) and began reading up on what I&#8217;d need to do to translate Python to it.</p>
<p>The learning process was interesting. Lisp was my second language; I loved it and was already expert in it by 1980, well before I learned C. But since 1982 the only Lisp programs I&#8217;ve written have been Emacs modes. I&#8217;ve done a whole hell of a lot of those, including some of the most widely used ones like GDB and VC, but semantically Emacs Lisp is a sort of living fossil coelacanth from the 1970s, dynamic scoping and all. Common Lisp, and more generally the evolution of Lisp implementations with decent alien type bindings, passed me by. And by the time Lisp got good enough for standalone production use in modern environments I already had Python in hand.</p>
<p>So, for me, reading the SBCL and Common Lisp documentation was a strange mixture of learning a new language and returning to very old roots. Yay for lexical scoping! I recoded about 6% of reposurgeon in SBCL, then hit a couple of walls. One of the lesser walls was a missing feature in Common Lisp corresponding to the __str__ special method in Python. Lisp types don&#8217;t know how to print themselves, and as it turns out reposurgeon relies on this capability in various and subtle ways. Another problem was that I couldn&#8217;t easily see how to duplicate Python&#8217;s subprocess-control interface &#8211; at all, let alone portably across Common Lisp implementations.</p>
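<p>To make that dependence concrete, here is roughly the Python idiom involved &#8211; an illustrative sketch with hypothetical class and field names, not reposurgeon&#8217;s actual code: each event type carries a __str__ method that emits its own fast-import representation, so serializing a repository is little more than calling str() on every event in order.</p>
<pre>
# Illustrative sketch only -- class and field names are hypothetical.
# The point: the object knows how to print itself as stream text.
class Blob:
    def __init__(self, mark, data):
        self.mark = mark      # e.g. ":1"
        self.data = data      # raw blob content as a string

    def __str__(self):
        # Emit this blob in git fast-import format.
        return "blob\nmark %s\ndata %d\n%s\n" % (self.mark, len(self.data), self.data)

# Writing a whole stream then reduces to something like:
#     out.write("".join(str(event) for event in repo.events))
</pre>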
<p>But the big problem was CLOS, the Common Lisp Object System. I like most of the rest of Common Lisp now that I&#8217;ve studied it. OK, it&#8217;s a bit baroque and heavyweight and I can see where it&#8217;s had a couple of kitchen sinks pitched in &#8211; if I were choosing a language on purely esthetic grounds I&#8217;d prefer Scheme. But I could get comfortable with it, except for CLOS. </p>
<p>But me no buts about multimethods and the power of generics &#8211; I get that, OK? I see why it was done the way it was done, but the brute fact remains that CLOS is an ugly pile of ugly. More to the point in this particular context, CLOS objects are quite unlike Python objects (which are in many ways more like CL defstructs). It was the impedance mismatch between Python and CLOS objects that really sank my translation attempt, which I had originally hoped could be done without seriously messing with the architecture of the Python code. Alas, that was not to be. Which refocused me on algorithmic methods of improving the Python code.</p>
<p>Now I&#8217;ll talk about what did work.</p>
<p>What worked, ultimately, was finding operations that have instruction costs O(n**2) in the number of commits and squashing them. At this point a shout-out goes to Julien &#8220;FrnchFrgg&#8221; Rivaud, a very capable hacker trying to use reposurgeon for some work on the Blender repository. He got interested in the speed problem (the Blender repo is also quite large) and was substantially helpful with both patches and advice. Working together, we memoized some expensive operations and eliminated others, often by incrementally computing reverse-lookup pointers when linking objects together in order to avoid having to traverse the entire repository later on.</p>
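<p>Here is the shape of the memoization half of that work, as a minimal sketch &#8211; the class and method names are hypothetical, not reposurgeon&#8217;s actual API: an expensive derived lookup is computed once, cached, and invalidated whenever the underlying event list mutates.</p>
<pre>
# Minimal memoization sketch; names are hypothetical.
class Repository:
    def __init__(self):
        self.events = []          # commits, tags, resets, in stream order
        self._mark_index = None   # memoized mark -> event table

    def find(self, mark):
        "Look up an event by mark, building the index lazily on first use."
        if self._mark_index is None:
            self._mark_index = {e.mark: e for e in self.events
                                if getattr(e, "mark", None)}
        return self._mark_index.get(mark)

    def addEvent(self, event):
        self.events.append(event)
        self._mark_index = None   # any mutation invalidates the memo
</pre>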
<p>Even just finding all the O(n**2) operations isn&#8217;t necessarily easy in a language as terse and high-level as Python; they can hide in very innocuous-looking code and method calls. The biggest bad boy in this case turned out to be child-node computation. Fast import streams express &#8220;is a child of&#8221; directly; for obvious reasons, a repository analysis often has to look at all the children of a given parent. This operation blows up quite badly on very large repositories even if you memoize it; the only way to make it fast is to precompute all the reverse lookups and update them when you update the forward ones.</p>
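<p>The cure, in sketch form (hypothetical names again): carry the reverse pointers on the objects themselves and update them at the same moment the forward parent links change, so asking for a commit&#8217;s children never touches the rest of the repository.</p>
<pre>
# Sketch of incrementally maintained parent/child links; hypothetical names.
class Commit:
    def __init__(self, mark):
        self.mark = mark
        self._parents = []
        self._children = []           # reverse pointers, kept in sync below

    def setParents(self, parents):
        # Detach from the old parents, attach to the new ones,
        # updating both directions of the link in one step.
        for p in self._parents:
            p._children.remove(self)
        self._parents = list(parents)
        for p in self._parents:
            p._children.append(self)

    def children(self):
        return self._children         # O(1): no repository traversal
</pre>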
<p>Another time sink (the last one to get solved) was identifying all tags and resets attached to a particular commit. The brute-force method (look through all tags for any with a from member matching the commit&#8217;s mark) is expensive mainly because to look through all tags you have to look through all the events in the stream &#8211; and that&#8217;s expensive when there are 56K of them. Again, the solution was to give each commit a list of back-pointers to the tags that reference it and make sure all the mutation operations update it properly.</p>
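<p>That fix follows the same pattern as the child lookup, roughly like this (again a hedged sketch, not the real API): the attach and detach operations on a tag keep the target commit&#8217;s back-pointer list consistent, so answering &#8220;which tags reference this commit?&#8221; is just reading a list.</p>
<pre>
# Sketch of tag back-pointers maintained by the mutation operations.
class Commit:
    def __init__(self, mark):
        self.mark = mark
        self.attachments = []     # tags and resets that point at this commit

class Tag:
    def __init__(self, name, target=None):
        self.name = name
        self.target = None
        if target is not None:
            self.attach(target)

    def attach(self, commit):
        "Point this tag at a commit, keeping the back-pointer list in sync."
        if self.target is not None:
            self.target.attachments.remove(self)
        self.target = commit
        commit.attachments.append(self)

    def detach(self):
        if self.target is not None:
            self.target.attachments.remove(self)
            self.target = None
</pre>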
<p>It all came good in the end. In the last benchmarking run before I shipped 2.29 it processed 56424 commits in 303 seconds. That&#8217;s 186 commits per second, 11160 per minute. That&#8217;s good enough that I plan to lay off serious speed-tuning efforts; the gain probably wouldn&#8217;t be worth the increased code complexity.</p>
<p>UPDATE: A week later, after more speed-tuning mainly by Julien (because it was still slow on the very large repo he&#8217;s working with), analysis speed is up to 282 commits/sec (16920 per minute) and a curious thing has occurred. pypy now produces an actual speedup, to around 338 commits/sec (20280 per minute). We don&#8217;t know why, but apparently the algorithmic optimizations somehow gave pypy&#8217;s JIT better traction. This is particularly odd because the density of the code actually increased.</p>