Sunday, May 26, 2013

Three types of build-system dependency

Summary: There are three types of dependencies you might want to express in a build system, all of which are supported by Shake.

A build system, at its heart, is a system which runs commands in an order satisfying user-specified dependencies. But what kind of dependencies can be expressed? This post describes three different types of dependency, only one of which is available in Make, but all of which are available in both Shake and Ninja.

Feature 1: Static dependencies (available in every build system)

The most basic form of dependency is a static dependency, where a rule produces an output from some inputs:

-- In Make --
result.tar : file1 file2
    tar -cf result.tar file1 file2

-- In Shake --
"result.tar" *> \out -> do
    let deps = ["file1","file2"]
    need deps
    cmd "tar -cf" [out] deps

This rule says that the file result.tar depends on the inputs file1 and file2, and provides a command to build result.tar. Whenever file1 or file2 change, the command will be run, and result.tar will be built.

Static dependencies occur in almost every build rule, and are supported by all build tools, including Make and Shake.

Feature 2: Dynamic dependencies (available in Shake, Ninja, Redo and tup)

A more advanced dependency is where the list of dependencies itself depends on the results of previous dependencies. Imagine we want to build result.tar from the list of files stored in list.txt. The dependencies of result.tar cannot be specified statically, but depend on the contents of list.txt. In Shake we can write:

"result.tar" *> \out -> do
    need ["list.txt"]
    contents <- readFileLines "list.txt"
    need contents
    cmd "tar -cf" [out] contents

This rule describes how to build result.tar. We depend on (need) the file list.txt. We read each line from list.txt into the variable contents - being a list of the files that should go into result.tar. Next, we depend on all the files in contents, and finally call the tar program. If either list.txt changes, or any of the files listed by list.txt change, then result.tar will be rebuilt.

This feature is necessary in almost every build system, yet is shockingly lacking from most build tools - I am only aware of it being available in Shake, Ninja, Redo and tup. As a common example, in Make you might write:

result.o : result.c result_header1.h result_header2.h
    gcc ...

The file result.o depends on both the C source file result.c and all headers that file includes. But listing the headers both in result.c with #include directives, and in the Makefile, is a brittle form of duplication. A better approach is for the build system to run gcc -M result.c and extract the includes from there. In Shake we can write:

"result.o" *> \out -> do
    let src = "result.c"
    Stdout stdout <- cmd "gcc -MM" [src]
    need $ src : drop 2 (words stdout)
    cmd "gcc -o" [out] "-c" [src]

My experience is that about a quarter of rules require some kind of additional dependency based on previous dependencies. While you can hack round some of the issues in Make, and people have become disturbingly adept at doing so, the result often only approximates the dependencies - building either too much or too little.

Feature 3: Multiple outputs from one rule (available in Shake and Ninja)

The final feature is producing multiple outputs from one command, and is used far more rarely (perhaps one or two rules in a complex build system) - but when needed, is essential. Some programs, such as GHC, can produce two outputs with one command - compiling Foo.hs produces both Foo.o and Foo.hi. As a first approximation, the .o file depends on the entire contents of the source file, while the .hi file depends only on the type signatures. A single ghc invocation needs to do all the work to produce both, but often the .hi file will be left unchanged. In Shake we can write:

["Foo.hi","Foo.o"] *>> \_ -> do
    need ["Foo.hs"]
    cmd "gcc -c Foo.hs"

While it is often possible to construct a series of dependencies to approximate a single rule producing multiple outputs, it only works in some cases, and is brittle. The only build systems I am aware of which support multiple outputs are Shake and Ninja.

Essential features

My standard advice when people ask about writing a build system is "don't". If some existing build system (e.g. ghc --make or Cabal) is capable of building your project, use that instead. Custom build systems are necessary for many complex projects, but many projects are not complex. If you have decided your project is complex, you should use a build tool that can express complex dependencies, both for writing the initial system and to provide the flexibility to make the inevitable changes required.

Looking only at dependency features, I would consider it unwise to start a complex build system using a tool other than Shake or Ninja, or perhaps either Redo or tup (if you accept the absence of multiple outputs from one rule).

Weak dependency specification in build tools, particularly Make, has left its mark on many programs. I recently talked to some OCaml hackers complaining that their tools were not "Make friendly" because they produced multiple output files. I wonder what lengths other tools have gone to in order to cope with weak dependency specification...

Update: The relative power of tup was reported as a comment, and it appears to have the necessary power, but I haven't yet checked. Following further research into Ninja I suspect it may not be as powerful as originally stated and may not have Feature 2, but am not yet sure. Make gained support for multiple outputs after this post was published.