A low-performance mystery

OK, I’ll admit it. I’m stumped by a software-engineering problem.

This is not a thing that happens often, but I’m in waters relatively unknown to me. I’ve been assiduously avoiding multi-threaded programming for a long time, because solving deadlock, starvation, and insidious data-corruption-by-concurrency problems isn’t really my idea of fun. Other than one minor brush with it handling PPS signals in GPSD, I’ve managed before this to avoid any thread-entanglement at all.

But I’m still trying to make cvs-fast-export run faster. About a week ago an Aussie hacker named David Leonard landed a brilliant patch series in my mailbox. Familiar story: has a huge, gnarly CVS repo that needs converting, got tired of watching it grind for days, went in to speed it up, found a way. In fact he applied a technique I’d never heard of (Bloom filtering) to flatten the worst hot spot in the code, an O(n**3) pass used to compute parent/child links in the export code. But it still needed to be faster.
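For anyone who hasn’t met the technique: a Bloom filter is just a bit array plus a couple of hash functions, giving a set-membership test that can return false positives but never false negatives – so a “definitely not present” answer lets you skip an expensive exact comparison entirely. A minimal sketch of the idea (illustrative C with made-up names, not David’s actual patch):

    #include <stddef.h>
    #include <stdint.h>

    #define BLOOM_BITS (1u << 20)                 /* size of the bit array */

    typedef struct {
        uint8_t bits[BLOOM_BITS / 8];
    } bloom_t;

    /* FNV-1a with a caller-chosen seed, so we get two different hashes */
    static uint32_t fnv1a(const void *key, size_t len, uint32_t seed)
    {
        const uint8_t *p = key;
        uint32_t h = seed;
        while (len--) {
            h ^= *p++;
            h *= 16777619u;
        }
        return h;
    }

    static void bloom_add(bloom_t *b, const void *key, size_t len)
    {
        uint32_t i1 = fnv1a(key, len, 2166136261u) % BLOOM_BITS;
        uint32_t i2 = fnv1a(key, len, 0x9e3779b9u) % BLOOM_BITS;
        b->bits[i1 / 8] |= 1u << (i1 % 8);
        b->bits[i2 / 8] |= 1u << (i2 % 8);
    }

    /* Returns 0 only if the key was definitely never added. */
    static int bloom_maybe_contains(const bloom_t *b, const void *key, size_t len)
    {
        uint32_t i1 = fnv1a(key, len, 2166136261u) % BLOOM_BITS;
        uint32_t i2 = fnv1a(key, len, 0x9e3779b9u) % BLOOM_BITS;
        return (b->bits[i1 / 8] & (1u << (i1 % 8)))
            && (b->bits[i2 / 8] & (1u << (i2 % 8)));
    }

As I understand the win, a cheap “these two definitely aren’t related” test like that in the inner loop lets most candidate pairs be rejected before the full comparison ever runs.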

After some discussion we decided to tackle parallelizing the code in the first stage of analysis. This works – separately – on each of the input CVS masters, digesting them into in-core revision lists and generating whole-file snapshots for each CVS delta; later these will become the blobs in the fast-export stream. Then there’s a second stage that merges these per-file revision lists, and a third stage that exports the merged result.

Here’s more detail, because you’ll need it to understand the rest. Each CVS master consists of a sequence of deltas (sequences of add-line and delete-line operations) summing up to a sequence of whole-file states (snapshots – eventually these will become blobs in the translated fast-import stream). Each delta has an author, a revision date, and a revision number (like 1.3 or 1.2.1.1). Implicitly they form a tree. At the top of the file is a tag table mapping names to revision numbers, and some other relatively unimportant metadata.

The goal of stage 1 is to digest each CVS master into an in-core tree of metadata and a sequence of whole-file snapshots, with unique IDs in the tree indexing the snapshots. The entire collection of masters is made into a linked list of these trees; this is passed to stage 2, where black magic that nobody understands happens.
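Concretely, the stage-1 output has roughly this shape (field names are illustrative, not the real cvs-fast-export declarations):

    #include <time.h>

    /* Illustrative sketch only; the real structures differ in detail. */
    typedef struct delta {
        char          *author;
        time_t         date;
        char          *revision;     /* e.g. "1.3" or "1.2.1.1" */
        unsigned long  snapshot;     /* unique ID indexing the whole-file snapshot */
        struct delta  *parent;       /* the implicit tree, made explicit */
        struct delta  *children;
        struct delta  *sibling;
    } delta_t;

    typedef struct master {
        char          *filename;     /* the ,v file this tree was digested from */
        delta_t       *root;         /* root of the per-file revision tree */
        struct master *next;         /* the linked list handed to stage 2 */
    } master_t;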

This first stage seems like a good target for parallelization because the analysis of each master consists of lumps of I/O separated by irregular stretches of compute-intensive data-shuffling in core. In theory, if the program were properly parallelized, it would seldom actually block on an I/O operation; instead while any one thread was waiting on I/O, the data shuffling for other masters would continue. The program would get faster – possibly much faster, depending on the time distribution of I/O demand.

Well, that’s the theory, anyway. Here’s what actually happened…

First, I had to make the parser for the master file format re-entrant. I did that, and then documented the method.

Then I had to examine every single other bit of global state and figure out if it needed to be encapsulated into a local-context data structure for each first-stage parse or could remain shared. Doing this made me happy; I hate globals, they make a program’s dataflow harder to grok. Re-entrant structures are prettier.
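The general shape of that refactoring, with hypothetical names (the real parser state is much richer than this):

    #include <stdio.h>

    struct delta;                     /* per-file revision tree, as in the earlier sketch */

    /* Before: state like this lived in file-scope globals, which means
     * only one parse can be in flight at a time.  After: the same state
     * lives in a context the caller owns, one per first-stage parse. */
    typedef struct parse_ctx {
        FILE         *in;             /* the CVS master being read */
        int           lineno;         /* current line, for error messages */
        struct delta *tree;           /* revision tree under construction */
    } parse_ctx_t;

    int parse_master(const char *path, parse_ctx_t *ctx);   /* hypothetical signature */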

Once I thought I had the data structures properly refactored, I had to introduce actual threading. OK, time to actually learn the pthreads API. Did that. Wrote a multi-threading tasklet scheduler; I’d never done one before and it took me about four tries to get it right, but I knew I was going in a good direction because the design kept getting simpler. The current version is less than 70 LOC, and that’s for the dispatcher loop and the worker-subthread code both. I am proud of this code – it’s pretty, it works, and it’s tight.
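For flavor, here’s the general shape of such a dispatcher – a simplified sketch with a busy count standing in for the real per-slot bookkeeping, not the code that actually shipped:

    #include <pthread.h>
    #include <stdio.h>

    #define NWORKERS 4

    static pthread_mutex_t wakeup_mutex = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  wakeup_cond  = PTHREAD_COND_INITIALIZER;
    static int             busy;              /* live worker threads */

    static pthread_mutex_t output_mutex = PTHREAD_MUTEX_INITIALIZER;
    /* ...the shared list of parsed masters lives behind output_mutex... */

    static void *worker(void *arg)
    {
        const char *path = arg;

        printf("parsing %s\n", path);         /* stand-in for the real per-master parse */

        pthread_mutex_lock(&output_mutex);    /* publish the result atomically */
        /* append this master's revision tree to the shared list here */
        pthread_mutex_unlock(&output_mutex);

        pthread_mutex_lock(&wakeup_mutex);    /* tell the dispatcher a slot is free */
        busy--;
        pthread_cond_signal(&wakeup_cond);
        pthread_mutex_unlock(&wakeup_mutex);
        return NULL;
    }

    void dispatch(char **paths, int npaths)
    {
        for (int i = 0; i < npaths; i++) {
            pthread_mutex_lock(&wakeup_mutex);
            while (busy >= NWORKERS)          /* wait for a free slot */
                pthread_cond_wait(&wakeup_cond, &wakeup_mutex);
            busy++;
            pthread_mutex_unlock(&wakeup_mutex);

            pthread_t tid;
            pthread_create(&tid, NULL, worker, paths[i]);
            pthread_detach(tid);
        }

        pthread_mutex_lock(&wakeup_mutex);    /* drain the last workers */
        while (busy > 0)
            pthread_cond_wait(&wakeup_cond, &wakeup_mutex);
        pthread_mutex_unlock(&wakeup_mutex);
    }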

While this was going on, my Aussie friend was writing a huge complex map/reduce-based scheduler to do the same job. He realized he’d succumbed to overengineering and gave it up just about the time I got mine working.

Not all was wine and roses. Of course I missed some shared state the first time through; multithreaded operation revealed this by sporadically crashing. I misunderstood the documentation for pthreads condition variables and had a few headaches in consequence until David clued me in. But these were solvable problems.
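For anyone else heading into pthreads for the first time, the usual condition-variable gotcha is worth spelling out (whether or not it’s exactly the one that bit me): the wait has to sit in a loop that rechecks the predicate while holding the mutex that guards it, because wakeups can be spurious and a signal sent when nobody is waiting is simply lost. The canonical pattern, with illustrative names:

    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
    static int done;                  /* the predicate the waiter checks */

    void wait_for_done(void)
    {
        pthread_mutex_lock(&m);
        while (!done)                 /* a loop, never a bare "if" */
            pthread_cond_wait(&c, &m);
        pthread_mutex_unlock(&m);
    }

    void mark_done(void)
    {
        pthread_mutex_lock(&m);       /* flip the predicate under the same mutex */
        done = 1;
        pthread_cond_signal(&c);
        pthread_mutex_unlock(&m);
    }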

I got the code to the point where I could choose between straight sequential and N-threaded operation by flipping a command-line switch, and it was solid either way. Felt a warm glow of accomplishment. Then I profiled it – and got a deeply unpleasant shock. The threaded version was slower. Seriously slower. Like, less than half the throughput of naive one-master-at-a-time sequential parsing.

I know what the book says to suspect in this situation – mutex contention overhead. But I knew from the beginning this was unlikely. To explain why, I have to be more specific about what the contention points are.

The scheduler manages an array of worker threads. There’s one slot mutex for each worker thread slot, asserted when it’s active (that is, there’s a live thread processing a master in it). There’s one wakeup mutex associated with a condition variable that each worker thread uses to wake up the manager loop when it’s done, so another master-parsing thread can be scheduled into the vacant slot. And there’s another output mutex to make updates to the list of parsed masters atomic.

These mutexes, the ones for the scheduler itself, don’t change state very often. The history of a CVS master parse causes only these events: slot mutex up, output mutex up, output mutex down, slot mutex down, wakeup mutex up, signal, wakeup mutex down. The critical regions are small. This just isn’t enough traffic to generate noticeable overhead.

There are just two real mutexes that handle contention among the masters. One guards a counter so that the code can issue sequential blob IDs common to all threads; that one gets hit every time a delta gets turned into a revision blob. The other guards a common table of named tags. Neither showed up as a hotspot in profiling.

Only the counter mutex seemed even remotely likely to be getting enough hits to reduce throughput by half, so I replaced it with an atomic fetch-and-increment instruction (the comment reads /* the sexy lockless method */). That worked, but…no joy as far as increased performance went.
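The replacement looks roughly like this (a sketch using the GCC/Clang __sync builtin with illustrative names; C11 atomic_fetch_add would do the same job):

    /* Sketch of the lockless blob-ID counter. */
    static unsigned long blob_serial;       /* shared across worker threads */

    unsigned long next_blob_id(void)
    {
        return __sync_fetch_and_add(&blob_serial, 1);   /* the sexy lockless method */
    }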

(Yes, I know, I could be replacing the guard mutex with an equally sexy lockless read-copy-update operation on the output linked list. Not worth it; there’s only one of these per master file, and at their fastest they’re on the order of 50 milliseconds apart).

Time to seriously profile. David had clued me in about Linux perf and I knew about gprof of old; armed with those tools, I went in and tried to figure out where the hell all those cycles were going in the threaded version. Some things stood out…

One: Mutex calls were not – repeat not – showing up in the top-ten lists.

Two: The first stage, which should have been sped up by multithreading, was running a factor of two slower than the straight sequential one-master-at-a-time version.

Three: Stage two, the merge magic, was running about the same speed as before.

Four: Here’s where it gets not just puzzling but outright weird. Stage three, the report generator – which is about as far as it’s possible to get from any mutex and still be in the same program – was running two to four times slower, consistently.

Very, very strange.

I noticed that the threaded version has a much bigger memory footprint. That’s as expected; it’s holding data for multiple masters at the same time. Could I be seeing heavy swap overhead?

Turns out not. I profiled on a much smaller repo, enough smaller for the working set to fit in physical core without swapping. Same strange statistics – Stage 1 slower, Stage 3 much slower than the unthreaded code. Factors of about 2 and 4 respectively, same as for the larger repo.

(For those of you curious, the large repo is groff – 2300 deltas. The small one is robotfindskitten, 269 deltas. These are my standard benchmark repos.)

Now I’m fresh out of ideas. But I have noticed something separately interesting. According to the profiles, both of them are spending upwards of 40% of their time in one small routine, a copy operation in the delta-assembly code. Having no better plan, I decide to try to flatten this hot spot. Maybe something will shake loose?

I start putting in const and restrict qualifiers, and replace some array indexing with pointer walks. I normally don’t bother with this kind of micro-optimization; usually, when you have a speed problem you need a better algorithm rather than microtweaks. But in this case I think I already have good algorithms – I’m trying to help the compiler optimizer do a better job of reducing them to fast machine instructions, and hoping for insight into the slowdown along the way.
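The flavor of the change (illustrative, not the actual delta-assembly routine): restrict tells the compiler the source and destination can’t alias, const lets it cache loads, and walking pointers avoids re-deriving an address from an index on every iteration.

    #include <stddef.h>

    /* Copy an array of line pointers; a toy stand-in for the hot routine. */
    void copy_lines(char **restrict dst, char *const *restrict src, size_t n)
    {
        while (n--)
            *dst++ = *src++;          /* copies the pointers, not the line text */
    }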

This works rather better than I expected. A few hours of work later, Stage 1 is running at rough speed parity in the threaded and unthreaded versions. It ought to be faster, but at least the threaded version is no longer absurdly slow.

Stage 3 of threaded execution is still absurdly slow.

And that’s where things stand now. I wrote this partly to clarify my own thoughts and partly to invite comment from people with more experience optimizing threaded code than I have.

Does anything jump out of this fact pattern and say boo? If you were in my shoes, what forensics would you try?

Code is here: https://gitorious.org/cvs-fast-export/

Benchmark repos can be fetched with cvssync as follows:

cvssync cvs.sourceforge.net:/cvsroot/rfk robotfindskitten

cvssync anonymous@cvs.savannah.gnu.org:/sources/groff groff