A low-performance mystery, part deux
<p>Well, the good news is, I get to feel wizardly this morning.  Following sensible advice from a couple of my regulars, I rebuilt my dispatcher to use threads allocated at start time and looping until the list of masters is exhausted.</p>
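<p>For readers who haven&#8217;t seen this pattern, here is a minimal sketch of a dispatcher of that general shape, assuming POSIX threads. It is not the actual code from my tool: <code>master_t</code>, <code>work_queue_t</code>, <code>process_master()</code>, <code>dispatch()</code> and <code>MAX_WORKERS</code> are all hypothetical names invented for illustration. Each worker is created once at start time and keeps claiming the next unprocessed master under a single mutex until none are left.</p>

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_WORKERS 16          /* hypothetical upper bound on pool size */

/* Hypothetical work item standing in for one master file. */
typedef struct {
    const char *name;
    int processed;              /* set to 1 by the worker that handles it */
} master_t;

/* Shared state: the list of masters plus a cursor to the next unclaimed one. */
typedef struct {
    master_t *masters;
    size_t count;
    size_t next;                /* index of the next unclaimed master */
    pthread_mutex_t lock;       /* guards 'next' only */
} work_queue_t;

/* Placeholder for the real per-master work (parsing, delta assembly, ...). */
static void process_master(master_t *m)
{
    m->processed = 1;
}

/* Worker body: loop claiming masters until the list is exhausted. */
static void *worker(void *arg)
{
    work_queue_t *q = arg;

    for (;;) {
        size_t i;

        /* Hold the lock only long enough to claim an index. */
        pthread_mutex_lock(&q->lock);
        i = q->next;
        if (i < q->count)
            q->next++;
        pthread_mutex_unlock(&q->lock);

        if (i >= q->count)
            break;              /* nothing left to do */
        process_master(&q->masters[i]);
    }
    return NULL;
}

/*
 * Start up to 'nthreads' workers, wait for all of them, and return how
 * many were actually started (0 means no work was done).
 */
int dispatch(master_t *masters, size_t count, int nthreads)
{
    pthread_t tids[MAX_WORKERS];
    work_queue_t q;
    int started = 0;

    assert(nthreads > 0 && nthreads <= MAX_WORKERS);

    q.masters = masters;
    q.count = count;
    q.next = 0;
    if (pthread_mutex_init(&q.lock, NULL) != 0)
        return 0;

    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&tids[started], NULL, worker, &q) != 0)
            break;              /* run with however many threads we got */
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    pthread_mutex_destroy(&q.lock);
    return started;
}
```

<p>Build with <code>-pthread</code> and call <code>dispatch(masters, count, nthreads)</code> on an array of masters. The appeal of this design is that the only shared mutable state is the cursor, so one mutex suffices. Note that nothing here addresses cache or disk locality.</p>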
<p>78 LOC.  Fewer mutexes.  And it worked correctly <em>first time I ran it</em>. W00t &#8211; looks like I&#8217;ve got the <s>hang</s> non-hang of this threads thing.</p>
<p>The bad news is, threaded performance is still atrocious in exactly the same way. Looks like thread-spawn overhead wasn&#8217;t a significant contributor.</p>
<p>In truth, I was expecting this result.  I think my regulars were right to attribute this problem to cache- and locality-busting on every level from processor L1 down to the disks.  I believe I&#8217;m starting to get a feel for this problem from watching the performance variations over many runs.</p>
<p>I&#8217;ll profile, but I&#8217;m sure I&#8217;m going to see cache misses go way up in the threaded version, and if I can find a way to meter the degree of disk thrashing I won&#8217;t be even a bit surprised to see that either.</p>
<p>The bottom line here seems to be that if I want better threaded performance out of this puppy I&#8217;m going to have to at least reduce its working set a lot.  Trouble is, I&#8217;m highly doubtful &#8211; given what it has to do during delta assembly &#8211; that this is actually possible.  The CVS snapshots and deltas it has to snarf into memory to do the job are intrinsically both large and of unpredictably variable size.</p>
<p>Maybe I&#8217;ll have an inspiration, but&#8230; Keith Packard, who originally wrote that code, is a damn fine systems hacker who is very aware of performance issues; if he couldn&#8217;t write it with a low footprint in the first place, I don&#8217;t think my odds of second-guessing him successfully are very good.</p>
<p>Ah well.  It&#8217;s been a learning experience. At least now I can say of multi-threaded application designs &#8220;Run! Flee! Save yourselves!&#8221; from a position of having demonstrated a bit of wizardry at them myself.</p>
<p>UPDATE: One of my regulars found a minor bug in the mutex handling that cost some performance.  Alas, fixing it didn&#8217;t have any impact above the noise level of my profiling.  Also, I managed to unify the threaded and non-threaded dispatchers; the code specific to threading is now down to about 30 LOC.</p>