11 lines
2.4 KiB
Plaintext
11 lines
2.4 KiB
Plaintext
A low-performance mystery, part deux
|
|
<p>Well, the good news is, I get to feel wizardly this morning. Following sensible advice from a couple of my regulars, I rebuilt my dispatcher to use threads allocated at start time and looping until the list of masters is exhausted.</p>
|
|
<p>78 LOC. Fewer mutexes. And it worked correctly <em>first time I ran it</em>. W00t – looks like I’ve got the <s>hang</s> non-hang of this threads thing.</p>
|
|
<p>The bad news is, threaded performance is still atrocious in exactly the same way. Looks like thread-spawn overhead wasn’t a significant contributor.</p>
|
|
<p>In truth, I was expecting this result. I think my regulars were right to attribute this problem to cache- and locality-busting on every level from processor L1 down to the disks. I believe I’m starting to get a feel for this problem from watching the performance variations over many runs.</p>
|
|
<p>I’ll profile, but I’m sure I’m going to see cache misses go way up in the threaded version, and if I can find a way to meter the degree of disk thrashing I won’t be even a bit surprised to see that either.</p>
|
|
<p>The bottom line here seems to be that if I want better threaded performance out of this puppy I’m going to have to at least reduce its working set a lot. Trouble is, I’m highly doubtful – given what it has to do during delta assembly – that this is actually possible. The CVS snapshots and deltas it has to snarf into memory to do the job are intrinsically both large and of unpredictably variable size.</p>
|
|
<p>Maybe I’ll have an inspiration, but…Keith Packard, who originally wrote that code, is a damn fine systems hacker who is very aware of performance issues; if he couldn’t write it with a low footprint in the first place, I don’t judge my odds of second-guessing him successfully are very good.</p>
|
|
<p>Ah well. It’s been a learning experience. At least now I can say of multi-threaded application designs “Run! Flee! Save yourselves!” from a position of having demonstrated a bit of wizardry at them myself.</p>
|
|
<p>UPDATE: One of my regulars found a minor bug in the mutex handling that cost some performance. Alas, fixing this didn’t have any impact above the noise level of my profiling. Also, I managed to unify the threaded and non-threaded dispatchers; the LOC specific to threading is now down to about 30.</p>
|