A low-performance mystery

OK, I’ll admit it. I’m stumped by a software-engineering problem.

This is not a thing that happens often, but I’m in waters relatively unknown to me. I’ve been assiduously avoiding multi-threaded programming for a long time, because solving deadlock, starvation, and insidious data-corruption-by-concurrency problems isn’t really my idea of fun. Other than one minor brush with it handling PPS signals in GPSD, I’ve managed before this to avoid any thread-entanglement at all.

But I’m still trying to make cvs-fast-export run faster. About a week ago an Aussie hacker named David Leonard landed a brilliant patch series in my mailbox. Familiar story: has a huge, gnarly CVS repo that needs converting, got tired of watching it grind for days, went in to speed it up, found a way. In fact he applied a technique I’d never heard of (Bloom filtering) to flatten the worst hot spot in the code, an O(n**3) pass used to compute parent/child links in the export code. But it still needed to be faster.
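For anyone who hasn’t met the technique: a Bloom filter is just a bit array plus a couple of hash functions, giving a set-membership test that can return false positives but never false negatives – so a “definitely not present” answer lets you skip an expensive exact comparison entirely. A minimal sketch of the idea (illustrative C with made-up names, not David’s actual patch):

    #include <stddef.h>
    #include <stdint.h>

    #define BLOOM_BITS (1u << 20)                 /* size of the bit array */

    typedef struct {
        uint8_t bits[BLOOM_BITS / 8];
    } bloom_t;

    /* FNV-1a with a caller-chosen seed, so we get two different hashes */
    static uint32_t fnv1a(const void *key, size_t len, uint32_t seed)
    {
        const uint8_t *p = key;
        uint32_t h = seed;
        while (len--) {
            h ^= *p++;
            h *= 16777619u;
        }
        return h;
    }

    static void bloom_add(bloom_t *b, const void *key, size_t len)
    {
        uint32_t i1 = fnv1a(key, len, 2166136261u) % BLOOM_BITS;
        uint32_t i2 = fnv1a(key, len, 0x9e3779b9u) % BLOOM_BITS;
        b->bits[i1 / 8] |= 1u << (i1 % 8);
        b->bits[i2 / 8] |= 1u << (i2 % 8);
    }

    /* Returns 0 only if the key was definitely never added. */
    static int bloom_maybe_contains(const bloom_t *b, const void *key, size_t len)
    {
        uint32_t i1 = fnv1a(key, len, 2166136261u) % BLOOM_BITS;
        uint32_t i2 = fnv1a(key, len, 0x9e3779b9u) % BLOOM_BITS;
        return (b->bits[i1 / 8] & (1u << (i1 % 8)))
            && (b->bits[i2 / 8] & (1u << (i2 % 8)));
    }

As I understand the win, a cheap “these two definitely aren’t related” test like that in the inner loop lets most candidate pairs be rejected before the full comparison ever runs.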

After some discussion we decided to tackle parallelizing the code in the first stage of analysis. This works – separately – on each of the input CVS masters, digesting them into in-core revision lists and generating whole-file snapshots for each CVS delta; later these will become the blobs in the fast-export stream. Then there’s a second stage that merges these per-file revision lists, and a third stage that exports the merged result.

Here’s more detail, because you’ll need it to understand the rest. Each CVS master consists of a sequence of deltas (sequences of add-line and delete-line operations) summing up to a sequence of whole-file states (snapshots – eventually these will become blobs in the translated fast-import stream). Each delta has an author, a revision date, and a revision number (like 1.3 or 1.2.1.1). Implicitly they form a tree. At the top of the file is a tag table mapping names to revision numbers, and some other relatively unimportant metadata.

The goal of stage 1 is to digest each CVS master into an in-core tree of metadata and a sequence of whole-file snapshots, with unique IDs in the tree indexing the snapshots. The entire collection of masters is made into a linked list of these trees; this is passed to stage 2, where black magic that nobody understands happens.
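Concretely, the stage-1 output has roughly this shape (field names are illustrative, not the real cvs-fast-export declarations):

    #include <time.h>

    /* Illustrative sketch only; the real structures differ in detail. */
    typedef struct delta {
        char          *author;
        time_t         date;
        char          *revision;     /* e.g. "1.3" or "1.2.1.1" */
        unsigned long  snapshot;     /* unique ID indexing the whole-file snapshot */
        struct delta  *parent;       /* the implicit tree, made explicit */
        struct delta  *children;
        struct delta  *sibling;
    } delta_t;

    typedef struct master {
        char          *filename;     /* the ,v file this tree was digested from */
        delta_t       *root;         /* root of the per-file revision tree */
        struct master *next;         /* the linked list handed to stage 2 */
    } master_t;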

This first stage seems like a good target for parallelization because the analysis of each master consists of lumps of I/O separated by irregular stretches of compute-intensive data-shuffling in core. In theory, if the program were properly parallelized, it would seldom actually block on an I/O operation; instead while any one thread was waiting on I/O, the data shuffling for other masters would continue. The program would get faster – possibly much faster, depending on the time distribution of I/O demand.

Well, that’s the theory, anyway. Here’s what actually happened…

First, I had to make the parser for the master file format re-entrant. I did that, and then documented the method.

Then I had to examine every single other bit of global state and figure out if it needed to be encapsulated into a local-context data structure for each first-stage parse or could remain shared. Doing this made me happy; I hate globals, they make a program’s dataflow harder to grok. Re-entrant structures are prettier.
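The general shape of that refactoring, with hypothetical names (the real parser state is much richer than this):

    #include <stdio.h>

    struct delta;                     /* per-file revision tree, as in the earlier sketch */

    /* Before: state like this lived in file-scope globals, which means
     * only one parse can be in flight at a time.  After: the same state
     * lives in a context the caller owns, one per first-stage parse. */
    typedef struct parse_ctx {
        FILE         *in;             /* the CVS master being read */
        int           lineno;         /* current line, for error messages */
        struct delta *tree;           /* revision tree under construction */
    } parse_ctx_t;

    int parse_master(const char *path, parse_ctx_t *ctx);   /* hypothetical signature */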

Once I thought I had the data structures properly refactored, I had to introduce actual threading. OK, time to actually learn the pthreads API. Did that. Wrote a multi-threading tasklet scheduler; I’d never done one before and it took me about four tries to get it right, but I knew I was going in a good direction because the design kept getting simpler. The current version is less than 70 LOC, and that’s for the dispatcher loop and the worker-subthread code both. I am proud of this code – it’s pretty, it works, and it’s tight.
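For flavor, here’s the general shape of such a dispatcher – a simplified sketch with a busy count standing in for the real per-slot bookkeeping, not the code that actually shipped:

    #include <pthread.h>
    #include <stdio.h>

    #define NWORKERS 4

    static pthread_mutex_t wakeup_mutex = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  wakeup_cond  = PTHREAD_COND_INITIALIZER;
    static int             busy;              /* live worker threads */

    static pthread_mutex_t output_mutex = PTHREAD_MUTEX_INITIALIZER;
    /* ...the shared list of parsed masters lives behind output_mutex... */

    static void *worker(void *arg)
    {
        const char *path = arg;

        printf("parsing %s\n", path);         /* stand-in for the real per-master parse */

        pthread_mutex_lock(&output_mutex);    /* publish the result atomically */
        /* append this master's revision tree to the shared list here */
        pthread_mutex_unlock(&output_mutex);

        pthread_mutex_lock(&wakeup_mutex);    /* tell the dispatcher a slot is free */
        busy--;
        pthread_cond_signal(&wakeup_cond);
        pthread_mutex_unlock(&wakeup_mutex);
        return NULL;
    }

    void dispatch(char **paths, int npaths)
    {
        for (int i = 0; i < npaths; i++) {
            pthread_mutex_lock(&wakeup_mutex);
            while (busy >= NWORKERS)          /* wait for a free slot */
                pthread_cond_wait(&wakeup_cond, &wakeup_mutex);
            busy++;
            pthread_mutex_unlock(&wakeup_mutex);

            pthread_t tid;
            pthread_create(&tid, NULL, worker, paths[i]);
            pthread_detach(tid);
        }

        pthread_mutex_lock(&wakeup_mutex);    /* drain the last workers */
        while (busy > 0)
            pthread_cond_wait(&wakeup_cond, &wakeup_mutex);
        pthread_mutex_unlock(&wakeup_mutex);
    }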

While this was going on, my Aussie friend was writing a huge complex map/reduce-based scheduler to do the same job. He realized he’d succumbed to overengineering and gave it up just about the time I got mine working.

Not all was wine and roses. Of course I missed some shared state the first time through; multithreaded operation revealed this by sporadically crashing. I misunderstood the documentation for pthreads condition variables and had a few headaches in consequence until David clued me in. But these were solvable problems.
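For anyone else heading into pthreads for the first time, the usual condition-variable gotcha is worth spelling out (whether or not it’s exactly the one that bit me): the wait has to sit in a loop that rechecks the predicate while holding the mutex that guards it, because wakeups can be spurious and a signal sent when nobody is waiting is simply lost. The canonical pattern, with illustrative names:

    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
    static int done;                  /* the predicate the waiter checks */

    void wait_for_done(void)
    {
        pthread_mutex_lock(&m);
        while (!done)                 /* a loop, never a bare "if" */
            pthread_cond_wait(&c, &m);
        pthread_mutex_unlock(&m);
    }

    void mark_done(void)
    {
        pthread_mutex_lock(&m);       /* flip the predicate under the same mutex */
        done = 1;
        pthread_cond_signal(&c);
        pthread_mutex_unlock(&m);
    }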

I got the code to the point where I could choose between straight sequential and N-threaded operation by flipping a command-line switch, and it was solid either way. Felt a warm glow of accomplishment. Then I profiled it – and got a deeply unpleasant shock. The threaded version was slower. Seriously slower. Like, less than half the throughput of naive one-master-at-a-time sequential parsing.

I know what the book says to suspect in this situation – mutex contention overhead. But I knew from the beginning this was unlikely. To explain why, I have to be more specific about what the contention points are.

The scheduler manages an array of worker threads. There’s one slot mutex for each worker thread slot, asserted when it’s active (that is, there’s a live thread processing a master in it). There’s one wakeup mutex associated with a condition variable that each worker thread uses to wake up the manager loop when it’s done, so another master-parsing thread can be scheduled into the vacant slot. And there’s another output mutex to make updates to the list of parsed masters atomic.

These mutexes, the ones for the scheduler itself, don’t change state very often. The history of a CVS master parse causes only these events: slot mutex up, output mutex up, output mutex down, slot mutex down, wakeup mutex up, signal, wakeup mutex down. The critical regions are small. This just isn’t enough traffic to generate noticeable overhead.

There are just two real mutexes that handle contention among the masters. One guards a counter so that the code can issue sequential blob IDs common to all threads; that one gets hit every time a delta gets turned into a revision blob. The other guards a common table of named tags. Neither showed up as a hotspot in profiling.

Only the counter mutex seemed even remotely likely to be getting enough hits to reduce throughput by half, so I replaced it with an atomic fetch-and-increment instruction (the comment reads /* the sexy lockless method */). That worked, but…no joy as far as increased performance went.
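The replacement looks roughly like this (a sketch using the GCC/Clang __sync builtin with illustrative names; C11 atomic_fetch_add would do the same job):

    /* Sketch of the lockless blob-ID counter. */
    static unsigned long blob_serial;       /* shared across worker threads */

    unsigned long next_blob_id(void)
    {
        return __sync_fetch_and_add(&blob_serial, 1);   /* the sexy lockless method */
    }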

(Yes, I know, I could be replacing the guard mutex with an equally sexy lockless read-copy-update operation on the output linked list. Not worth it; there’s only one of these per master file, and at their fastest they’re on the order of 50 milliseconds apart).

Time to seriously profile. David had clued me in about Linux perf and I knew about gprof of old; armed with those tools, I went in and tried to figure out where the hell all those cycles were going in the threaded version. Some things stood out…

One: Mutex calls were not – repeat not – showing up in the top-ten lists.

Two: The first stage, which should have been sped up by multithreading, was running a factor of two slower than the straight sequential one-master-at-a-time version.

Three: Stage two, the merge magic, was running about the same speed as before.

Four: Here’s where it gets not just puzzling but outright weird. Stage three, the report generator – which is about as far as it’s possible to get from any mutex and still be in the same program – was running two to four times slower, consistently.

Very, very strange.

I noticed that the threaded version has a much bigger memory footprint. That’s as expected; it’s holding data for multiple masters at the same time. Could I be seeing heavy swap overhead?

Turns out not. I profiled on a much smaller repo, enough smaller for the working set to fit in physical core without swapping. Same strange statistics – Stage 1 slower, Stage 3 much slower than the unthreaded code. Factors of about 2 and 4 respectively, same as for the larger repo.

(For those of you curious, the large repo is groff – 2300 deltas. The small one is robotfindskitten, 269 deltas. These are my standard benchmark repos.)

Now I’m fresh out of ideas. But I have noticed something separately interesting. According to the profiles, both of them are spending upwards of 40% of their time in one small routine, a copy operation in the delta-assembly code. Having no better plan, I decide to try to flatten this hot spot. Maybe something will shake loose?

I start putting in const and restrict qualifiers, and replace some array indexing with pointer walks. I normally don’t bother with this kind of micro-optimization; usually, when you have a speed problem you need a better algorithm rather than microtweaks. But in this case I think I already have good algorithms – I’m trying to help the compiler optimizer do a better job of reducing them to fast machine instructions, and hoping for insight into the slowdown along the way.
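The flavor of the change (illustrative, not the actual delta-assembly routine): restrict tells the compiler the source and destination can’t alias, const lets it cache loads, and walking pointers avoids re-deriving an address from an index on every iteration.

    #include <stddef.h>

    /* Copy an array of line pointers; a toy stand-in for the hot routine. */
    void copy_lines(char **restrict dst, char *const *restrict src, size_t n)
    {
        while (n--)
            *dst++ = *src++;          /* copies the pointers, not the line text */
    }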

This works rather better than I expected. A few hours of work later, Stage 1 is running at rough speed parity in the threaded and unthreaded versions. It ought to be faster, but at least the threaded version is no longer absurdly slow.

Stage 3 of threaded execution is still absurdly slow.

And that’s where things stand now. I wrote this partly to clarify my own thoughts and partly to invite comment from people with more experience optimizing threaded code than I have.

Does anything jump out of this fact pattern and say boo? If you were in my shoes, what forensics would you try?

Code is here: https://gitorious.org/cvs-fast-export/

Benchmark repos can be fetched with cvssync as follows:

cvssync cvs.sourceforge.net:/cvsroot/rfk robotfindskitten

cvssync anonymous@cvs.savannah.gnu.org:/sources/groff groff