A low-performance mystery, part deux

Well, the good news is, I get to feel wizardly this morning. Following sensible advice from a couple of my regulars, I rebuilt my dispatcher to use threads allocated at start time and looping until the list of masters is exhausted.

78 LOC. Fewer mutexes. And it worked correctly first time I ran it. W00t – looks like I’ve got the hang non-hang of this threads thing.

The bad news is, threaded performance is still atrocious in exactly the same way. Looks like thread-spawn overhead wasn’t a significant contributor.

In truth, I was expecting this result. I think my regulars were right to attribute this problem to cache- and locality-busting on every level from processor L1 down to the disks. I believe I’m starting to get a feel for this problem from watching the performance variations over many runs.

I’ll profile, but I’m sure I’m going to see cache misses go way up in the threaded version, and if I can find a way to meter the degree of disk thrashing I won’t be even a bit surprised to see that either.

The bottom line here seems to be that if I want better threaded performance out of this puppy I’m going to have to at least reduce its working set a lot. Trouble is, I’m highly doubtful – given what it has to do during delta assembly – that this is actually possible. The CVS snapshots and deltas it has to snarf into memory to do the job are intrinsically both large and of unpredictably variable size.

Maybe I’ll have an inspiration, but…Keith Packard, who originally wrote that code, is a damn fine systems hacker who is very aware of performance issues; if he couldn’t write it with a low footprint in the first place, I don’t judge my odds of second-guessing him successfully are very good.

Ah well. It’s been a learning experience. At least now I can say of multi-threaded application designs “Run! Flee! Save yourselves!” from a position of having demonstrated a bit of wizardry at them myself.

UPDATE: One of my regulars found a minor bug in the mutex handling that cost some performance. Alas, fixing this didn’t have any impact above the noise level of my profiling. Also, I managed to unify the threaded and non-threaded dispatchers; the LOC specific to threading is now down to about 30.