Analysis of scaling problems in build systems
My post "SCons is full of win today" triggered some interesting feedback on scaling problems in SCons. In response to anecdotal assertions that SCons is unusably slow on large projects, I argued that build systems in general must scale poorly if they are to enforce correctness. Subsequently, I received a pointer to a very well-executed empirical study of SCons performance, to which I replied in the same fashion.
In this post, I intend to conduct a more detailed analysis of algorithmic requirements and complexity in an idealized build system, and demonstrate the implied scaling laws more rigorously. I will also investigate tradeoffs between correctness and performance using the same explanatory framework.
An idealized build system begins work with four inputs: a set of objects, a rule forest, a set of rule generators, and a build state. I will describe each in turn.
An object is any input to or product of the build – typically a source or object file in a compiled language, but also quite possibly a document master in some markup format, or a rendered version of such a document, or even the content of a timestamped entry in some database. Each object has two interesting properties: its unique name and a version stamp. The version stamp may be a timestamp or an implied content hash.
A rule forest is a set of rules that connect objects to each other. Each rule enumerates a set of source objects, a set of target objects, and a procedure for generating the latter from the former. A build system begins with a set of explicit rules (which may be empty).
A rule generator is a procedure for inspecting objects to add rules to the forest. For example, we typically want to inspect C source files and add for each one a rule making the corresponding compiled object dependent on the C source and any header files it includes that are in the set of source objects.
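As a rough illustration of a rule generator for the C case just described, here is a minimal Python sketch. The function name c_rule_generator, the representation of the object set as a dict from name to file text, and the (sources, target) tuple format are all assumptions made for this example. It only looks at direct, quoted #include lines and does not follow includes transitively.

```python
import re
from typing import Dict, List, Set, Tuple

# Matches lines such as:  #include "util.h"   (angle-bracket includes are ignored)
INCLUDE_RE = re.compile(r'^\s*#\s*include\s+"([^"]+)"', re.MULTILINE)


def c_rule_generator(objects: Dict[str, str]) -> List[Tuple[Set[str], str]]:
    """Return one (sources, target) rule per C source file.

    `objects` maps object names to file text (a hypothetical stand-in
    for the object set). Headers that are not themselves in the object
    set are left out of the rule, as the text specifies.
    """
    rules = []
    for name, text in objects.items():
        if not name.endswith(".c"):
            continue
        headers = {h for h in INCLUDE_RE.findall(text) if h in objects}
        rules.append(({name} | headers, name[:-2] + ".o"))
    return rules
```

Each header reference costs one membership test against the object set, which is the per-reference name lookup that shows up later in the complexity analysis.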
A build state is a boolean relation on the set of objects. The value of the relation is true if the right-hand object is out-of-date with respect to the left-hand object. In some systems, the build state is implied by comparing object timestamps. In others, the build system records a mapping of source-object names to version stamps for each derived object, and considers a derived object to be out of date with respect to any source name for which the current version stamp fails to match the recorded one.
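The recorded-stamp variant of the build state can be sketched in a few lines of Python. Everything named here (Rule, version_stamp, out_of_date, and the nested-dict layout of the recorded stamps) is hypothetical and chosen only for illustration. SHA-256 is one possible content hash, not one the text prescribes.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet


@dataclass(frozen=True)
class Rule:
    """One rule: source names, target names, and a build procedure."""
    sources: FrozenSet[str]
    targets: FrozenSet[str]
    procedure: Callable[[], None]


def version_stamp(content: bytes) -> str:
    """A content-hash version stamp (assumed here to be SHA-256)."""
    return hashlib.sha256(content).hexdigest()


def out_of_date(source: str, target: str,
                current: Dict[str, str],
                recorded: Dict[str, Dict[str, str]]) -> bool:
    """Build-state relation: is `target` stale with respect to `source`?

    `current[name]` is the stamp of each object now; `recorded[target][source]`
    is the stamp `source` had when `target` was last built. A target with
    no recorded stamp for a source counts as out of date.
    """
    return recorded.get(target, {}).get(source) != current[source]
```

With timestamps instead of hashes, the recorded mapping disappears and the relation reduces to comparing the two objects' modification times.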
We say that a build system guarantees correctness if it guarantees that the build state will contain no ‘true’ entries on termination of the build. It guarantees efficiency if it never does excess work (corresponding to a DAG edge for which the corresponding value of the build state is false but the rule on the downstream side fires anyway).
The stages of an idealized build look like this:
1. Scan the object set to generate implied rules.
2. Stitch the rule forest into a DAG expressing the entire dependency structure of the system. Each DAG node is identified by and with an object name.
3. For each selected build target, recursively build it. The recursion looks like this: scan the set of immediate ancestor nodes A(x) of each target x; if the set is empty, you’re done. Otherwise, for each node y in A(x), first recursively build y, then check the build state to see if x is out-of-date with respect to y. If so, rebuild x. The dependency graph must be acyclic for this recursion to have a well-defined termination state.
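The third stage can be sketched as a depth-first traversal. This is one reading of the recursion, in which a target's ancestors are brought up to date first and the target is then rebuilt if it is stale with respect to any of them. The names build, ancestors, and stale are assumptions for this example; the build state is modeled as a set of (source, target) pairs for which the relation is true, and "rebuilding" just records the name.

```python
from typing import Dict, Iterable, List, Set, Tuple


def build(targets: Iterable[str],
          ancestors: Dict[str, List[str]],
          stale: Set[Tuple[str, str]]) -> List[str]:
    """Return the names of rebuilt objects, in the order their rules fired.

    `ancestors[x]` is A(x), the immediate ancestors of x in the DAG.
    `stale` holds the (source, target) pairs that are out of date.
    The graph is assumed to be acyclic.
    """
    visited: Set[str] = set()
    rebuilt: List[str] = []
    rebuilt_set: Set[str] = set()

    def visit(x: str) -> None:
        if x in visited:          # each node is handled once
            return
        visited.add(x)
        sources = ancestors.get(x, ())
        for y in sources:         # bring every ancestor up to date first
            visit(y)
        # x must be rebuilt if it was already stale with respect to some
        # ancestor, or if an ancestor was rebuilt during this run.
        if any((y, x) in stale or y in rebuilt_set for y in sources):
            rebuilt.append(x)     # stand-in for running the rule's procedure
            rebuilt_set.add(x)

    for t in targets:
        visit(t)
    return rebuilt
```

Because each node is visited once and each edge is examined once, the traversal is linear in the size of the DAG.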
Now that we understand the sequence of events, let’s consider the algorithmic complexity of the steps in the process.
Scanning to detect and record implied rules will be O(n) in the total size of the objects. Each object will also need a name lookup for each reference to another object (such as an #include) that it contains. That makes the scan O(n log n) in the number of objects, though a naive implementation could be O(n**2).
Stitching the rule forest into a DAG will also be minimally O(n log n) in the number of rules, because every object name occurring as a source will need to be checked against every object name occurring as a target to see if that target should be added to the source’s ancestor list. In typical builds where most source files are C sources and thus have one dependent which is a .o file, this implies O(n log n) in the number of source files. (Note: I originally estimated this as O(n**2), thinking of the naive algorithm.)
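To make the stitching step concrete, here is a minimal Python sketch that avoids the naive all-pairs comparison by indexing target names first. The function name stitch and the (sources, targets) rule format are assumptions for illustration. Note that a Python dict is a hash table, so each lookup takes constant expected time; the log n factor in the estimate above corresponds to a sorted or tree-structured index instead.

```python
from typing import Dict, List, Set, Tuple


def stitch(rules: List[Tuple[Set[str], Set[str]]]) -> Dict[int, Set[int]]:
    """Connect rules into a DAG, identified by rule index.

    Returns, for each rule, the set of rules that produce one of its
    sources and so must run before it.
    """
    # Pass 1: index every target name by the rule that produces it.
    producer: Dict[str, int] = {}
    for i, (_, targets) in enumerate(rules):
        for t in targets:
            producer[t] = i

    # Pass 2: one index lookup per source reference, instead of
    # comparing every source name against every target name.
    upstream: Dict[int, Set[int]] = {i: set() for i in range(len(rules))}
    for i, (sources, _) in enumerate(rules):
        for s in sources:
            if s in producer:
                upstream[i].add(producer[s])
    return upstream
```

For example, with rules for main.o, util.o, and a program linked from both, only the link rule ends up with upstream edges.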
The recursive build may have a slightly tricky order but as a graph traversal should be expected to be O(n) in the number of DAG nodes.
We notice two things immediately. First, the dominating cost term is that of assembling the dependency DAG from the rule forest. Second – and perhaps a bit counterintuitively – the build system overhead will be relatively insensitive to whether we’re doing a clean build or many up-to-date derived objects already exist. (Total build time will be shorter in the latter case, of course.)
Now we have a cost model for an SCons-like build system that guarantees build correctness. As I originally posted this, I thought that a complexity of O(n**2) in the number of objects was empirically confirmed by Eric Melski’s performance plot – but, as it turns out, this curve could fit O(n log n) as well.
But Melski also shows that other build systems – in particular those using bare makefiles and two-phase systems like autotools that use makefile generators – achieve O(n) performance. What does this mean?
Our complexity analysis shows us two things: If you want to pull build overhead below O(n log n), you need to not incur the cost of stitching up the dependency DAG on each build, and you need to also not pay for implicit-dependency scanning on each build.
Handcrafted makefiles don’t do implicit-dependency scanning, avoiding O(n log n) overhead. They do have to stitch up the entire DAG, but on projects large enough to be an issue much of the overhead for that is dodged by partitioning into recursive makefiles. The build process stitches up DAGs for the rules in each makefile, but this is a sum of superlinear costs for much smaller ns.
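To put rough numbers on the partitioning argument, the sketch below compares one stitching pass over n rules with the sum of k passes over n/k rules each. The cost functions, the equal-sized split, and the names used are all assumptions for illustration; real constant factors are ignored.

```python
import math


def nlogn_cost(n: int) -> float:
    """Idealized n log n stitching cost, in arbitrary units."""
    return n * math.log2(n) if n > 1 else 0.0


def quadratic_cost(n: int) -> float:
    """Idealized cost of the naive all-pairs stitching algorithm."""
    return float(n * n)


def partitioned(cost, n: int, k: int) -> float:
    """Total cost when n rules are split evenly across k makefiles."""
    return k * cost(n // k)


n, k = 1024, 32
print(nlogn_cost(n), partitioned(nlogn_cost, n, k))          # 10240.0 5120.0
print(quadratic_cost(n), partitioned(quadratic_cost, n, k))  # 1048576.0 32768.0
```

Under the quadratic model, partitioning saves a factor of k. Under the n log n model it only shrinks the logarithmic factor, so the saving is real but much smaller.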
The problems with bare makefiles and recursive makefiles are well understood. You get performance, but you trade that for much higher odds that your rule forest is failing to describe the actual dependencies correctly. This is especially true when dependencies cross boundaries between makefiles. The symptoms include excess work during actual builds and (much more dangerously) failure to correctly rebuild stale dependents.
Two-phase build systems such as autotools and CMake attempt to recover correctness by bringing back implicit-dependency scanning, but also keep performance by segregating the O(n log n) cost of implicit-dependency scanning into a configuration pre-phase that generates makefiles. This sharply reduces the likelihood of error by reducing the number of dependencies that have to be hand-maintained. But it is still possible for a build to be incorrect if (for example) a source change introduces a new implicit dependency or deletes an old one.
Another well-known problem with two-phase systems is that build recipes are difficult to debug. When make throws an error, you get a message with context in the generated makefile, not whatever master description it was made from.
My major conclusion is that it is not possible to design a build system with better than O(n log n) performance in the number of objects without sacrificing correctness. If the build system does not assemble a complete dependency DAG, some dependencies may exist that are never checked during traversal. Belt-and-suspenders techniques to avoid this (for example in recursive makefiles) tend to force redundant builds of interior objects such as libraries, sacrificing efficiency.
A minor conclusion, but interesting considering the case that drove me to think about this, is this: SCons is just as bad as, but not necessarily worse than, it has to be.
UPDATE: I originally misestimated the cost of DAG building as O(n**2). This weakens the minor conclusion slightly; it is possible that SCons is worse than it has to be, if there is a naive quadratic-time algorithm being used for lookup somewhere. Since it’s implemented in Python, however, it is almost certainly the case that the lookup is done through a Python hash with sub-quadratic cost.
Objections to the above analysis focused on exploiting parallelism are intelligent but don’t address quite the same case I was after. SCons and other build systems on its historical backtrail (such as waf and autotools) will all try to parallelize if you ask them to; unless there are very large differences in how well they do task partitioning, Amdahl’s-Law-like constraints pretty much guarantee that this can’t change relative performance much. And in cases where the dependency notation tends to underconstrain the build (I’m looking at you, makefiles!) attempting to parallelize is quite dangerous to build correctness.