Sunday, May 26, 2013

Three types of build-system dependency

Summary: There are three types of dependencies you might want to express in a build system, all of which are supported by Shake.

A build system, at its heart, is a system which runs commands in an order satisfying user-specified dependencies. But what kind of dependencies can be expressed? This post describes three different types of dependency, only one of which is available in Make, but all of which are available in both Shake and Ninja.

Feature 1: Static dependencies (available in every build system)

The most basic form of dependency is a static dependency, where a rule produces an output from some inputs:

-- In Make --
result.tar : file1 file2
    tar -cf result.tar file1 file2

-- In Shake --
"result.tar" *> \out -> do
    let deps = ["file1","file2"]
    need deps
    cmd "tar -cf" [out] deps

This rule says that the file result.tar depends on the inputs file1 and file2, and provides a command to build result.tar. Whenever file1 or file2 change, the command will be run, and result.tar will be built.

Static dependencies occur in almost every build rule, and are supported by all build tools, including Make and Shake.

Feature 2: Dynamic dependencies (available in Shake, Ninja, Redo and tup)

A more advanced dependency is where the list of dependencies itself depends on the results of previous dependencies. Imagine we want to build result.tar from the list of files stored in list.txt. The dependencies of result.tar cannot be specified statically, but depend on the contents of list.txt. In Shake we can write:

"result.tar" *> \out -> do
    need ["list.txt"]
    contents <- readFileLines "list.txt"
    need contents
    cmd "tar -cf" [out] contents

This rule describes how to build result.tar. We depend on (need) the file list.txt. We read each line from list.txt into the variable contents - being a list of the files that should go into result.tar. Next, we depend on all the files in contents, and finally call the tar program. If either list.txt changes, or any of the files listed by list.txt change, then result.tar will be rebuilt.

This feature is necessary in almost every build system, yet is shockingly lacking from most build tools - I am only aware of it being available in Shake, Ninja, Redo and tup. As a common example, in Make you might write:

result.o : result.c result_header1.h result_header2.h
    gcc ...

The file result.o depends on both the C source file result.c and all headers that file includes. But listing the headers both in result.c with #include directives, and in the Makefile, is a brittle form of duplication. A better approach is for the build system to run gcc -M result.c and extract the includes from there. In Shake we can write:

"result.o" *> \out -> do
    let src = "result.c"
    Stdout stdout <- cmd "gcc -MM" [src]
    need $ src : drop 2 (words stdout)
    cmd "gcc -o" [out] "-c" [src]

My experience is that about a quarter of rules require some kind of additional dependency based on previous dependencies. While you can hack round some of the issues in Make, and people have become disturbingly adept at doing so, the result often only approximates the dependencies - building either too much or too little.

Feature 3: Multiple outputs from one rule (available in Shake and Ninja)

The final feature is producing multiple outputs from one command, and is used far more rarely (perhaps one or two rules in a complex build system) - but when needed, is essential. Some programs, such as GHC, can produce two outputs with one command - compiling Foo.hs produces both Foo.o and Foo.hi. As a first approximation, the .o file depends on the entire contents of the source file, while the .hi file depends only on the type signatures. A single ghc invocation needs to do all the work to produce both, but often the .hi file will be left unchanged. In Shake we can write:

["Foo.hi","Foo.o"] *>> \_ -> do
    need ["Foo.hs"]
    cmd "gcc -c Foo.hs"

While it is often possible to construct a series of dependencies to approximate a single rule producing multiple outputs, it only works in some cases, and is brittle. The only build systems I am aware of which support multiple outputs are Shake and Ninja.

Essential features

My standard advice when people ask about writing a build system is "don't". If some existing build system (e.g. ghc --make or Cabal) is capable of building your project, use that instead. Custom build systems are necessary for many complex projects, but many projects are not complex. If you have decided your project is complex, you should use a build tool that can express complex dependencies, both for writing the initial system and to provide the flexibility to make the inevitable changes required.

Looking only at dependency features, I would consider it unwise to start a complex build system using a tool other than Shake or Ninja, or perhaps either Redo or tup (if you accept the absence of multiple outputs from one rule).

Weak dependency specification in build tools, particularly Make, has left its mark on many programs. I recently talked to some OCaml hackers complaining that their tools were not "Make friendly" because they produced multiple output files. I wonder what lengths other tools have gone to in order to cope with weak dependency specification...

Update: The relative power of tup was reported as a comment, and it appears to have the necessary power, but I haven't yet checked. Following further research into Ninja I suspect it may not be as powerful as originally stated and may not have Feature 2, but am not yet sure. Make gained support for multiple outputs after this post was published.

20 comments:

  1. As part of "Don't do it" it is worth noting that there are several things that are 'must-have' in build systems that many people omit in home-grow ones (DESTDIR, prefix, cross-compilation, CFLAGS/FFLAGS/LDFLAGS/... to name a few) because they tend to be useful for power users, packagers etc. not necessary programmers.

    ReplyDelete
  2. Anonymous12:23 AM

    I think make does support multiple outputs:

    https://www.gnu.org/software/make/manual/make.html#Multiple-Targets

    ReplyDelete
  3. Somewhat off topic, is there a way to tell `gcc` to not complain about missing header files when generating dependencies using `-MM`?

    For example, if `temp.c` depends on a dynamically generated header, `local.h`, then if that header hasn't been generated at least once, then the command `gcc -MM temp.c` will error instead of returning a list of dependencies.

    Also, note that `gcc`'s man file says: "If there are many included files then the rule is split into several lines using \-newline.", which may cause errors using the simple parser in this example.

    This is all just nitpicking, though. Shake is awesome.

    ReplyDelete
  4. How then should a team chose between shake and ninja?

    Building "too much" becomes important in very large builds, and the ninja team appear to obsess about minimal incremental build timing as well as correctness.

    Does shake?

    ReplyDelete
  5. Maciej: I fully agree!

    Anon: The multiple targets in Make is just syntax sugar, so Foo.o Foo.hi : Foo.hs would run the action twice producing the target files separately.

    Noah: In practice I always do header files as per http://community.haskell.org/~ndm/downloads/paper-shake_before_building-10_sep_2012.pdf, S6.4, which handles the 'local.h' problem perfectly. Interesting about the \ issue, fortunately should be easy to remedy if anyone wants to use that approach for real.

    Macker: Shake is also obsessed about minimal incremental build changes and correctness. Some evidence for incremental time:

    * I rewrote the storage layer so incremental changes don't involve writing the metadata (http://neilmitchell.blogspot.co.uk/2012/06/shake-storage-layer.html).
    * The inner build algorithm is highly optimised to reduce things like lock contention in incremental builds (http://community.haskell.org/~ndm/downloads/paper-shake_before_building-10_sep_2012.pdf, S4.3)
    * The file modification time checks are highly optimised and keep at bytes out of structures (https://github.com/ndmitchell/shake/blob/master/Development/Shake/FileTime.hs).

    The end result is that for an no-op build, running ghc --make to build a 30 module build system takes longer than running a 30,000 object build system. Checking modification time from the disk dominates, and that is highly optimised.

    I have yet to benchmark Shake against Ninja. There are a few reasons I think Ninja might win (better on-disk serialisation) and a few reasons I think Shake might win (rules are statically compiled), so will have to try it to find out.

    For correctness, I have lots of tests, including random stress tests, small tests targeting specific areas and a huge build system that relies on it.

    As for picking Ninja or Shake, I would obviously go for Shake, for reasons I will delve into fully in a future blog post. One thing to note is that you cannot go for Ninja directly, but have to use a system like CMake or Gyp on top, or your own custom generator - in that sense Ninja is a build engine, whereas Shake is a complete build tool.

    ReplyDelete
  6. Anonymous3:58 PM

    Neil: Regarding multiple targets in GNU Makefiles, I tested it, and I read the documentation, and you're wrong in both theory and practice - it only runs the rule once.

    ReplyDelete
  7. Anon: I haven't tried it, but reading the linked section of the docs it does say it is syntactic sugar for two separate rules, which would imply being run twice. Any chance you can post the makefile you used to test it, so I can try it myself?

    ReplyDelete
  8. Anon: Ah, if you do:

    foo bar : baz
    echo 2 > bar
    echo 2 > foo

    Then try and build both foo and bar, it will then run the rule once to build foo, but the second time round it will not run since it will detect bar is newer than baz. That might give the appearance of one rule producing multiple outputs, but if foo was Foo.o and bar was Foo.hi, and the compilation only changed the .o file not the .hi file, then you would end up running the rule twice.

    ReplyDelete
  9. Seems I'm a bit late to the party, but IIUC tup also handles the first 2 cases:
    http://gittup.org/tup/

    ReplyDelete
  10. Evan M8:19 PM

    Maciej, Ninja's manual discusses whether it's appropriate for your project. Briefly, most other build systems (including Shake) are more user-friendly. Ninja is most appropriate when you have both very large (>10k files) projects that already have build files written (and would like to autogenerate their output).

    http://martine.github.io/ninja/manual.html#_using_ninja_for_your_project

    ReplyDelete
  11. Nimrod: I suspect you are right, so have updated the blog, but I can't guarantee without giving it a go - which I will do in future. I actually think Ninja might not have the power I attributed to it (but it can do the example attached, just not some follow up ones), so it is a bit subtle. I think I need to do something a bit more formal at some point so I can be sure.

    ReplyDelete
  12. Hello Neil,

    What would you recommend for something like a Java build, where multiple output files (*.class) are produced from multiple input files (*.java) in a single pass of the compiler? For example the compiler should be invoked if any one of the input .java files is newer than its corresponding .class file.

    I've been experimenting with implementing these in Shake but got a bit stuck, I would appreciate any help and advice you can offer.

    Neil Bartlett, njbartlett@gmail.com

    ReplyDelete
  13. Hi Neil Bartlett, can you ask that question on Stack Overflow with the tag shake-build-system? It seems a great question that would be generally useful to have written down. I don't have time to answer it right now (just heading to bed), but if it remains unanswered in the morning I'll certainly have a go at figuring it out. [If you don't want to post it on StackOverflow, I'll post it there and answer it there anyway, since it is the "best" place for people to find answers like this]

    ReplyDelete
  14. Thanks Neil, I didn't think of using StackOverflow! Here's the question.

    ReplyDelete
  15. A build system, at its heart, is a system which runs commands in an order satisfying user-specified dependencies.
    I'm sorry if I will appear to quibble. The very definition is wrong. A command is a function with side effects. A build system must only run functions without side effects. That's how it can optimize its job (do not forget about having several processors). If 2 build functions interfere via side effects, this is the abundant source of incorrect building results.

    ReplyDelete
  16. @beroal: I am treating the creation of the output files and the reading the input files as side effects, you are saying they are input/output to a function. I think we're both saying the same thing though.

    I prefer to think of them as arbitrary side-effecting commands that are tamed with explicitly listing inputs/outputs since most of them are command line tools and it helps think about the ways things go wrong and the problems that arise. If I didn't let you do IO, but always required applicative combinators and everything in Haskell world I would probably think of them more like functions.

    ReplyDelete
  17. Anonymous3:24 AM

    You might want to add SCons to your inventory. It's not Haskell-specific, but it solidly supports your 3 features.

    ReplyDelete
  18. Eric: I'm somewhat familiar with SCons, and I'm pretty sure it doesn't support feature 2, and I don't think it supports feature 3. However, I've recently been trying to nail down exactly what does and does not support a particular feature with specific test cases - my work is at https://github.com/ndmitchell/build-shootout. I'd really love SCons entries to that project (or any build system in fact), which should figure out the answer definitively.

    ReplyDelete
  19. Anonymous2:39 PM

    $ man tup|grep 'output multiple' -C1
    The outputs section specifies the files that will be written to by the command. Only one command can write to a specific file, but a single command can
    output multiple files (such as how a bison command will output both a .c and .h file). The output can use any %-flags except %o. Once a file is specified
    in an output section, it is put into the tup database. Any following rules can use that file as an input, even if it doesn't exist in the filesystem yet.

    ReplyDelete
  20. Anonymous2:40 PM

    https://www.gnu.org/software/make/manual/make.html#Multiple-Targets is not about multiple outputs, just a shortcut for similar rules.

    The only way to do multiple outputs in gnumake is with pattern rules: http://www.cmcrossroads.com/article/rules-multiple-outputs-gnu-make?page=0%2C2

    ReplyDelete