A low-performance mystery: the adventure continues
<p>The mystery I described two posts back has actually been mostly solved (I think) but I&#8217;m having a great deal of fun trying to make cvs-fast-export run <em>even faster</em>, and my regulars are not only kibitzing with glee but have even thrown money at me so I can upgrade my PC and run tests on a machine that doesn&#8217;t resemble (as one of them put it) a slack-jawed yokel at a hot-dog-eating contest. </p>
<p>Hey, a 2.66GHz Intel Core 2 Duo with 4GB was hot shit when I bought it, and because I avoid bloatware (my window manager is i3) it has been sufficient unto my needs up to now. I&#8217;m a cheap bastard when it comes to hardware; I tend to hold onto it until I actually <em>need</em> to upgrade. This is me loftily ignoring the snarking from the peanut gallery, except from the people who actually donated money to the Help Stamp Out CVS In Your Lifetime hardware fund. </p>
<p>(For the rest of you, the PayPal and Gratipay buttons should be clearly visible to your immediate right. Just sayin&#8217;&#8230;)</p>
<p>Ahem. Where was I? Yes. The major mystery &#8211; the unexplained slowdown in stage 3 of the threaded version &#8211; appears to have been solved. The culprit was a glibc feature: if you link with thread support, glibc detects the use of threads and wraps stdio operations in thread locks to make them safe. Which slows them down. </p>
<p><span id="more-6379"></span></p>
<p>A workaround exists and has been applied. With that in place the threaded performance numbers are now roughly comparable to the unthreaded ones &#8211; a bit slower, but nothing that isn&#8217;t readily explicable by normal cache- and disk-contention issues on hardware that is admittedly weak for this job. Sometime soon I&#8217;ll upgrade to some beastly hexacore monster with lots of RAM by <em>today&#8217;s</em> standards, and then we&#8217;ll see what we&#8217;ll see.</p>
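<p>I won&#8217;t reproduce the actual patch here, but the standard shape of this kind of workaround (a sketch, not the cvs-fast-export code itself) is to take each stream&#8217;s stdio lock once with flockfile() and then do the hot-loop I/O with the POSIX *_unlocked variants, which skip the per-call lock/unlock:</p>

```c
#include <stdio.h>

/* Sketch of the usual cure for glibc's per-call stdio locking:
 * grab each stream's lock once, then use getc_unlocked() and
 * putc_unlocked() inside the loop.  The caller must guarantee
 * that no other thread touches these FILEs meanwhile. */
long copy_unlocked(FILE *in, FILE *out)
{
    long n = 0;
    int c;

    flockfile(in);
    flockfile(out);
    while ((c = getc_unlocked(in)) != EOF) {
        putc_unlocked(c, out);
        n++;
    }
    funlockfile(out);
    funlockfile(in);
    return n;
}
```

<p>The point is that the locking cost is paid once per stream rather than once per character, which matters a great deal for character-at-a-time parsing.</p>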
<p>But the quest to make cvs-fast-export faster, ever faster, continues. Besides the practical utility, I&#8217;m quite enjoying doing down-and-dirty systemsy stuff on something that isn&#8217;t GPSD. And, my regulars are having so much fun looking over my shoulder and offering suggestions that I&#8217;d kind of hate to end the party.</p>
<p>Here&#8217;s where things are, currently:</p>
<p>1. On the input side, I appear to have found a bug &#8211; or at least some behavior severely enough misdocumented that it&#8217;s tantamount to a bug &#8211; in the way flex-generated scanners handle EOF. The flex maintainer has acknowledged that something odd is going on and discussion has begun on flex-help.</p>
<p>A fix for this problem &#8211; which is presently limiting cvs-fast-export to character-by-character input when parsing master files, and preventing use of some flex speedups for non-interactive scanners &#8211; is probably the best near-term bet for significant performance gains.</p>
<p>If you feel like working on it, grab a copy of the code, delete the custom YY_INPUT macro from lex.l, rebuild, and watch what happens. Here&#8217;s a <a href="http://sourceforge.net/p/flex/mailman/message/32929761/">more detailed description</a> of the problem.</p>
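<p>For context &#8211; and this is a hypothetical reconstruction, not the actual contents of lex.l &#8211; a custom YY_INPUT that feeds flex one character per call has this general shape:</p>

```c
/* Hypothetical sketch of a character-at-a-time YY_INPUT of the kind
 * the post describes; the real macro lives in cvs-fast-export's lex.l. */
#define YY_INPUT(buf, result, max_size) \
    do { \
        int c = getc(yyin); \
        (result) = (c == EOF) ? YY_NULL : ((buf)[0] = (char)c, 1); \
    } while (0)
```

<p>Deleting the macro makes flex fall back to its default input routine, which reads the master file in large blocks and unlocks the non-interactive-scanner fast paths &#8211; and that block-reading mode is apparently where the EOF misbehavior bites.</p>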
<p>Diagnosing this more precisely would actually be a good project for somebody who wants to help out but is wary of getting involved with the black magic in the CVS-analysis stuff. Whatever is wrong here is well separated from that.</p>
<p>2. I have given up for the moment on trying to eliminate the shared counter for making blob IDs. It turns out that the assumption those are small sequential numbers is important to the stage 3 logic &#8211; it&#8217;s used as an array index into a map of external marks.</p>
<p>I may revisit this after I&#8217;ve collected the optimizations with a higher expected payoff, like the flex fix and tuning.</p>
<p>3. A couple of my commenters are advocating a refinement of the current design that delegates writing the revision snapshots in stage 1 to a separate thread accessed by a bounded queue. This is clever, but it doesn&#8217;t attack the root of the problem, which is that currently the snapshots have to be written to disk, persist, and then be copied to standard output as blobs when the stream is being generated.</p>
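<p>The bounded-queue idea itself is simple enough to sketch (again, hypothetical names, not anyone&#8217;s patch): parser threads enqueue snapshot-write jobs, a single writer thread drains them, and the parsers only stall when the queue fills.</p>

```c
#include <pthread.h>

#define QCAP 64   /* illustrative capacity */

/* Minimal bounded queue of the kind the commenters propose. */
struct bounded_queue {
    void *slot[QCAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
};

void bq_init(struct bounded_queue *q)
{
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_full, NULL);
    pthread_cond_init(&q->not_empty, NULL);
}

void bq_put(struct bounded_queue *q, void *item)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QCAP)               /* full: producer waits */
        pthread_cond_wait(&q->not_full, &q->lock);
    q->slot[q->tail] = item;
    q->tail = (q->tail + 1) % QCAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

void *bq_get(struct bounded_queue *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)                  /* empty: consumer waits */
        pthread_cond_wait(&q->not_empty, &q->lock);
    void *item = q->slot[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return item;
}
```

<p>Note that this only overlaps the disk I/O with parsing; the snapshots still get written to disk and read back later, which is exactly the cost the more radical refactor below aims at.</p>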
<p>If I can refactor the main loop in generate() into a setup()/make-next-snapshot()/wrapup() sequence, something much more radical is possible. It might be that snapshot generation could be deferred until <em>after</em> the big black-magic branch merge in stage 2, which only needs each CVS revision&#8217;s metadata rather than its contents. Then make-next-snapshot() could be called iteratively on each master to generate snapshots just in time to be shipped on stdout as blobs.</p>
<p>This would be huge, possibly cutting runtime by 40%. There would be significant complications, though. A major one is that a naive implementation would have a huge working set, containing the most recently generated revision of every master file. There would have to be some sort of LRU scheme to hold that size down.</p>
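<p>In miniature, &#8220;some sort of LRU scheme&#8221; could look like this (entirely hypothetical; names, sizes, and the nonzero-key assumption are mine): stamp each cached revision on use, and evict the stalest slot when a new one is needed.</p>

```c
#define CACHE_SLOTS 4   /* tiny for illustration; real code would size by RAM */

/* Sketch of least-recently-used eviction over cached revision contents.
 * Keys are assumed nonzero; a zero key marks an empty slot. */
struct lru_cache {
    int key[CACHE_SLOTS];     /* e.g. an index identifying a master file */
    long stamp[CACHE_SLOTS];  /* tick of last use */
    long tick;
};

/* Touch key k and return the slot holding it, evicting the
 * least-recently-used entry if k is not resident. */
static int lru_touch(struct lru_cache *c, int k)
{
    int victim = 0;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (c->key[i] == k) {             /* hit: refresh the stamp */
            c->stamp[i] = ++c->tick;
            return i;
        }
        if (c->stamp[i] < c->stamp[victim])
            victim = i;                   /* remember the stalest slot */
    }
    c->key[victim] = k;                   /* miss: evict; real code would
                                             free the old revision here */
    c->stamp[victim] = ++c->tick;
    return victim;
}
```

<p>The design question is exactly the one the post raises: how small the resident set can be made before re-deriving evicted revisions starts costing more than the I/O it saves.</p>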
<p>4. I have a to-do item to run more serious profiling with cachegrind and iostat, but I&#8217;ve been putting that off because (a) there are obvious wins to be had (like anything that reduces I/O traffic), and (b) those numbers will be more interesting on the new monster machine I&#8217;ll be shopping for shortly.</p>