Tuesday, April 01, 2014

Exceptional Testing

Summary: Failing properties should throw exceptions, not return False.

When testing properties in Haskell with QuickCheck we usually write a predicate that takes some arguments, and returns a boolean. For example:

import Test.QuickCheck
main = quickCheck $ \x -> exp (log (abs x)) == abs (x :: Double)

Here we are checking that for all positive Double values, applying log then exp is the identity. This statement is incorrect for Double due to floating point errors. Running main we get:

*** Failed! Falsifiable (after 6 tests and 2 shrinks):

QuickCheck is an incredibly powerful tool - it first found a failing example, and then automatically simplified it. However, it's left out some important information - what is the value of exp (log 3)? For my tests I usually define === and use it instead of ==:

a === b | a == b = True
        | otherwise = error $ show a ++ " /= " ++ show b

Now when running main we get:

*** Failed! Exception: '3.0000000000000004 /= 3.0'
(after 2 tests and 2 shrinks):

We can immediately see the magnitude of the error introduced, giving a kick-start to debugging.

Do we still need Bool?

When writing functions returning Bool there are three interesting cases:

  • The function returns True, the test passes, continue on.
  • The function returns False, the test fails, with no information beyond the input arguments.
  • The function throws an exception, the test fails, but with a human readable error message about the point of failure.

Of these, returning False is the least useful, and entirely subsumed by exceptions. It is likely there was additional information available before reducing to False, which has now been lost, and must first be recovered before debugging can start.

Given that the only interesting values are True and exception, we can switch to using the () type, where passing tests return () and failing tests throw an exception. However, working with exceptions in pure code is a bit ugly, so I typically define:

import Control.Monad
(===) :: (Show a, Eq a) => a -> a -> IO ()
a === b = when (a /= b) $ error $ show a ++ " /= " ++ show b

Now === is an action, and passing tests do nothing, while failing tests raise an error. This definition forces all tests to end up in IO, which is terrible for "real" Haskell code, where pure and IO computations should be segregated. However, for tests, as long as the test is repeatable (doesn't store some temporary state) then I don't worry.
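To see the whole pattern in miniature with nothing beyond base (runTest is my own illustrative helper, not part of any library), a failing === raises an ErrorCall whose message a test driver can catch and display:

```haskell
import Control.Exception
import Control.Monad

-- failing tests raise an error carrying both offending values
(===) :: (Show a, Eq a) => a -> a -> IO ()
a === b = when (a /= b) $ error $ show a ++ " /= " ++ show b

-- run one test, turning a failure into a displayable message
runTest :: IO () -> IO String
runTest act = do
    r <- try act
    return $ case r of
        Left (ErrorCall msg) -> "*** Failed! Exception: " ++ msg
        Right ()             -> "+++ OK"
```

Running runTest (3.0000000000000004 === (3.0 :: Double)) yields the same style of message QuickCheck printed above.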

To test the IO () returning property with QuickCheck we can define:

import System.IO.Unsafe (unsafePerformIO)

instance Testable () where
    property () = property True
instance Testable a => Testable (IO a) where
    property = property . unsafePerformIO

Now QuickCheck can work with this === definition, but also any IO property. I have found that testing IO properties with QuickCheck is very valuable, even without using ===.

Non-QuickCheck tests

Whilst I have argued for exceptions in the context of QuickCheck tests, I use the same exception pattern for non-parameterised/non-QuickCheck assertions. As an example, taken from Shake:

test = do
    dropDirectory1 "aaa/bbb" === "bbb"
    dropDirectory1 "aaa/" === ""
    dropDirectory1 "aaa" === ""
    dropDirectory1 "" === ""

Using IO () properties allows for trivial sequencing. As a result, the tests are concise, and the effort to add additional tests is low.

(===) in QuickCheck

In QuickCheck-2.7 the === operator was introduced (which caused name clashes in both Shake and Hoogle - new versions of both have been released). The === operator uses the Property data type in QuickCheck to pass both the Bool value and additional information together. I prefer my definition of === because it's simpler, doesn't rely on QuickCheck and makes it clear how to pass additional information beyond just the arguments to ===. As an example, we could also return the log value:

\x -> when (exp (log (abs x)) /= abs x) $
          error $ show (x,log $ abs x, exp $ log $ abs x)

However, the QuickCheck === is a great improvement over ==, and should be a very popular addition.

Thursday, March 13, 2014


Summary: HLint now has a real API.

HLint is a tool for making suggestions about your Haskell source code, available as a standalone tool (via cabal install hlint) and from several Haskell environments including the FP Complete IDE, Eclipse FP and IHaskell. Until now, HLint has provided an API which took a list of command line arguments and essentially ran the hlint executable, but in-process. With the release of hlint-1.8.59 I've now included a "real" API as Language.Haskell.HLint2. Some features include:

  • You can create your own hint as a Haskell function, using the HLint facilities to apply it to your code.
  • More information is provided about the resulting suggestions, including source spans.
  • The API accepts haskell-src-exts modules, and provides easy ways to parse strings, meaning files are not required and you can use existing syntax trees.
  • Some of the utilities used inside HLint are now available, including liberal name resolution and lenient parsing.

I recommend all users of the old API migrate to the new one, as I intend to phase out the old API in the near future (please let me know if that will cause you any problems). I plan to evolve the new API in response to user feedback, and then in a little while delete the old API, rename HLint2 to HLint, and bump the version number. I welcome feedback on the API, and am always curious to see where HLint gets used.

Monday, February 24, 2014

The Build System Shootout

Summary: I am collecting examples to show how build systems relate in terms of power. Contributions welcome.

A while back I tried to classify the relative power of various build systems. I did that by reasoned argument, which, unsurprisingly, meant I didn't get it right. In an attempt to properly clarify the relative power of various build systems I've started the Build System Shootout. Compared to the Computer Language Shootout, this Shootout attempts to answer whether a build system is capable of expressing a particular dependency structure, but does not measure performance. The following build systems have at least seven entries:

  • Make (GNU version), the de facto standard build system.
  • Ninja, as used by Chrome.
  • Shake, my Haskell build system.
  • tup, contributed by Andrew Wagner.
  • fabricate, with help from the fabricate mailing list.

The project currently consists of twelve examples, each of which has pseudo-code for the equivalent untracked straight-line shell script and a set of test cases that are used to decide if a fragment correctly expresses the build example.
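For a flavour of what an entry looks like (the fragment below is my own illustration, not one of the Shootout's examples), the simplest dependency structure is a single compile step, which as a Make rule reads:

```make
# hypothetical entry: rebuild main.o only when main.c changes
main.o: main.c
	gcc -c main.c -o main.o
```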

I welcome contributions, including:

  • Examples in different build systems (e.g. Scons, Ant, Waf).
  • New implementations for existing build systems (e.g. parallelism for fabricate).
  • New test cases (provided they show something interesting) (e.g. dependencies by hash not modification time).
  • Corrections of my egregious errors.

Monday, January 06, 2014

Optimising Haskell for a tight inner loop

Summary: I walk through optimising a Haskell string splitter to get a nice tight inner loop. We look at the Haskell code, the generated Core, C-- and assembly. We get down to 6 assembly instructions per input character.

Let's start with some simple code:

break (`elem` " \r\n$") src

This code scans a string looking for a space, newline or $ and returns the string before and the string after. Our goal is to make this code faster - by the end we'll get down to 6 assembly instructions per input character. Before making things go faster we should write test cases (so we don't break anything), profile (so we are optimising the right thing) and write benchmarks (to check our changes make things go faster). To write this post, I did all those steps, but the post is only going to look at the generated Core, C-- and assembly - and be guided by guesses about what should go faster. The complete code is available online, along with the Core/C--/assembly for each step as produced by GHC 7.6.3.

Version 1

To turn our example into a complete program, we write:

module InnerLoop(innerLoop) where

innerLoop :: FilePath -> IO (String, String)
innerLoop file = do
    src <- readFile file
    return $ break test src

test x = x `elem` " \r\n$"

We can save this code as InnerLoop.hs and compile it with:

ghc -c -O2 InnerLoop.hs -ddump-simpl -ddump-cmm -ddump-asm > log.txt

The full output of log.txt is available here. It contains the GHC Core (which looks a bit like Haskell), then the C-- (which looks a bit like C) and finally the assembly code (which looks exactly like assembly). When optimising we usually look at the Core, then at the C--, then at the assembly - stopping whenever our profiling says we are done. Let's take a look at the inner loop in Core (with some light editing):

innerLoop_3 = GHC.CString.unpackCString# " \r\n\$"

test_1 = \ (x :: GHC.Types.Char) ->
    GHC.List.elem @ GHC.Types.Char GHC.Classes.$fEqChar x innerLoop_3

innerLoop_2 =
    case GHC.List.$wbreak @ GHC.Types.Char test_1 x of _
        (# a, b #) -> (a, b)

The best way to read the Core is by looking for what you can understand, and ignoring the rest - it contains a lot of boring detail. We can see that a lot of things are fully qualified, e.g. GHC.List.elem. Some things have also been a bit mangled, e.g. $wbreak, which is roughly break. The interesting thing here is that break is being passed test_1. Looking at test_1 (which will be called on each character), we can see we are passing $fEqChar - a dictionary containing the functions for comparing characters for equality - to the elem function. For each character we are going to end up looping through a 4 element list (innerLoop_3) and each comparison will be going through a higher order function. Clearly we need to improve our test function.

Version 2

We can unroll the elem in test to give:

test x = x == ' ' || x == '\r' || x == '\n' || x == '$'

Compiling again and looking at the Core we see:

test_2 =
  \ (x :: GHC.Types.Char) ->
    case x of _ { GHC.Types.C# c ->
    case c of _ {
      __DEFAULT -> GHC.Types.False;
      '\n' -> GHC.Types.True;
      '\r' -> GHC.Types.True;
      ' ' -> GHC.Types.True;
      '$' -> GHC.Types.True

Now for each character we extract the raw character (pattern matching against C#) then test it against the possibilities. GHC has optimised our repeated ==/|| into a nice case expression. It looks quite nice. Now the bottleneck is the break function.

Version 3

The break function is working on a String, which is stored as a linked list of characters. To get better performance we can move to ByteString, writing:

innerLoop :: FilePath -> IO (ByteString, ByteString)
innerLoop file = do
    src <- BS.readFile file
    return $ BS.break test src

For many people this is the reasonable-performance version they should stick with. However, let's look at the Core once more:

go = \ (a :: Addr#) (i :: Int#) (w :: State# RealWorld) ->
    case i >=# len of _ {
      GHC.Types.False ->
        case readWord8OffAddr# @ GHC.Prim.RealWorld a 0 w
        of _ { (# w, c #) ->
        case chr# (word2Int# c) of _ {
          __DEFAULT -> go (plusAddr# a 1) (i +# 1) w;
          '\n' -> (# w, GHC.Types.I# i #);
          '\r' -> (# w, GHC.Types.I# i #);
          ' ' -> (# w, GHC.Types.I# i #);
          '$' -> (# w, GHC.Types.I# i #)
      GHC.Types.True -> (# w, l_a1J9 #)

The first thing that should strike you is the large number of # symbols. In Core, a # means you are doing strict primitive operations on unboxed values, so if the optimiser has managed to get down to # that is good. You'll also notice values of type State# RealWorld which I've renamed w - these are an encoding of the IO monad, but have zero runtime cost, and can be ignored. Looking at the rest of the code, we have a loop with a pointer to the current character (a :: Addr#) and an index of how far through the buffer we are (i :: Int#). At each character we first test if the index exceeds the length, and if it doesn't, read a character and match it against the options. If it doesn't match we continue by adding 1 to the address and 1 to the index. Of course, having to loop over two values is a bit unfortunate.

Version 4

A ByteString carries an explicit length so it knows when it has come to the end of the buffer, which means the loop must keep comparing the index against that length (and, for efficiency reasons, also maintaining it). Looking to C for inspiration, typically strings are terminated by a \0 character, which allows looping without comparing against a length (assuming the source file does not contain \0). We can define our own null-terminated ByteString type with a break operation:

newtype ByteString0 = BS0 ByteString

readFile0 :: FilePath -> IO ByteString0
readFile0 x = do
    src <- BS.readFile x
    return $ BS0 $ src `BS.snoc` '\0'

We define a newtype wrapper around ByteString so we gain some type safety. We also define a readFile0 that reads a file as a ByteString0, by explicitly calling snoc with \0. We can now define our own break0 function (this is the only big chunk of Haskell in this article):

break0 :: (Char -> Bool) -> ByteString0 -> (ByteString, ByteString0)
break0 f (BS0 bs) = (BS.unsafeTake i bs, BS0 $ BS.unsafeDrop i bs)
    where
        i = Internal.inlinePerformIO $ BS.unsafeUseAsCString bs $ \ptr -> do
            let start = castPtr ptr :: Ptr Word8
            let end = go start
            return $! end `minusPtr` start

        go s | c == '\0' || f c = s
             | otherwise = go $ inc s
            where c = chr s

chr :: Ptr Word8 -> Char
chr x = Internal.w2c $ Internal.inlinePerformIO $ peek x

inc :: Ptr Word8 -> Ptr Word8
inc x = x `plusPtr` 1

We define break0 by finding the position at which the condition stops being true (i) and calling unsafeTake/unsafeDrop to slice out the relevant pieces. Because we know the second part is still null terminated we can rewrap in ByteString0. To find the index, we mostly use code copied from the bytestring library and modified. We convert the ByteString to a Ptr CChar using unsafeUseAsCString which just lets us look at the internals of the ByteString. We then loop over the pointer with go until we get to the first character that passes f and find how far we travelled. The function go looks at the current character using chr, and if it's \0 (the end) or the function f passes, returns the address at this point. Otherwise it increments the pointer. We use chr to peek at the pointer directly, and inlinePerformIO to do so purely and fast - since we know these buffers are never modified, the inlinePerformIO is morally defensible (we could have put chr in IO but that breaks a future optimisation we'll need to do).
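The shape of this loop can be exercised with nothing beyond base. The sketch below is my own (all names are mine, and it uses withCString to supply the trailing \0 rather than snoc), but it follows the same plan: peek a byte, test it, and either return the distance travelled or advance the pointer:

```haskell
import Data.Word (Word8)
import Foreign.C.String (withCString)
import Foreign.Ptr (Ptr, castPtr, minusPtr, plusPtr)
import Foreign.Storable (peek)

-- walk the buffer until the predicate succeeds; the predicate must
-- accept '\0' so we cannot run off the end of the string
breakIndex :: (Char -> Bool) -> Ptr Word8 -> IO Int
breakIndex f start = go start
  where
    go p = do
        w <- peek p
        if f (toEnum (fromIntegral w))
            then return (p `minusPtr` start)
            else go (p `plusPtr` 1)

-- find the first space (or terminator) in a null-terminated buffer
demo :: IO Int
demo = withCString "aaa bbb" $ \cs ->
    breakIndex (\c -> c == '\0' || c == ' ') (castPtr cs)
```

demo returns 3, the index of the space, without ever consulting a length - withCString guarantees the terminator, mirroring readFile0.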

Compiling to Core we see:

go = \ (x :: GHC.Prim.Addr#) ->
    case readWord8OffAddr# @ RealWorld x 0 realWorld#
    of _ { (# _, c #) ->
    case GHC.Prim.chr# (GHC.Prim.word2Int# c) of _ {
      __DEFAULT -> go (GHC.Prim.plusAddr# x 1);
      '\NUL' -> GHC.Ptr.Ptr @ GHC.Word.Word8 x;
      '\n' -> GHC.Ptr.Ptr @ GHC.Word.Word8 x;
      '\r' -> GHC.Ptr.Ptr @ GHC.Word.Word8 x;
      ' ' -> GHC.Ptr.Ptr @ GHC.Word.Word8 x;
      '$' -> GHC.Ptr.Ptr @ GHC.Word.Word8 x

Now we have a Core inner loop to be proud of. We loop round with a single pointer, peek at a byte, and compare it to our options. Time to look onwards to the C--, where I've included just the inner loop:

        Hp = Hp + 8;
        if (Hp > HpLim) goto c1Tx;
        _s1RN::I32 = %MO_UU_Conv_W8_W32(I8[I32[Sp + 0]]);
        _s1T5::I32 = _s1RN::I32;
        _s1T6::I32 = _s1T5::I32;
        if (_s1T6::I32 < 13) goto c1TG;
        if (_s1T6::I32 < 32) goto c1TH;
        if (_s1T6::I32 < 36) goto c1TI;
        if (_s1T6::I32 != 36) goto c1TJ;
        _s1T4::I32 = I32[Sp + 0] + 1;
        I32[Sp + 0] = _s1T4::I32;
        Hp = Hp - 8;
        jump InnerLoop.$wgo_info; // []

Reading the code, we first mess around with Hp, then pull a value out of the array and into _s1RN, then do some comparisons, and if they don't match jump to c1TJ, mess around with Hp again and jump back to start again.

There are three obvious problems with the code: 1) we mess around with Hp; 2) we are doing too many tests to get to the default case; 3) there is a jump in the middle of the loop.

Version 5

Let's start with the Hp variable. Hp is the heap pointer, which says how much heap GHC is using - if the heap gets above a certain limit, it triggers a garbage collection. The Hp = Hp + 8 reserves 8 bytes of heap for this function, Hp > HpLim checks if we need to garbage collect, and Hp = Hp - 8 at the bottom of the loop gives back that heap space. Why do we allocate 8 bytes, only to give it back at the end? The reason is that in the return path after the loop we do allocation. It's a long standing performance issue that GHC doesn't push the heap test down to the exit path, but we can fix it ourselves. Looking at the Core, we saw:

case GHC.Prim.chr# (GHC.Prim.word2Int# c) of _ {
  __DEFAULT -> go (GHC.Prim.plusAddr# x 1);
  '\NUL' -> GHC.Ptr.Ptr @ GHC.Word.Word8 x;

The expression GHC.Ptr.Ptr @ GHC.Word8 x is allocating a constructor around the pointer to return. Looking at the Ptr type we discover:

data Ptr a = Ptr Addr#

So Ptr is simply a constructor wrapping our address. To avoid the Ptr in the inner loop, we can switch to returning Addr# from go:

i = Internal.inlinePerformIO $ BS.unsafeUseAsCString bs $ \ptr -> do
    let start = castPtr ptr :: Ptr Word8
    let end = go start
    return $! Ptr end `minusPtr` start

go s@(Ptr a) | c == '\0' || f c = a
             | otherwise = go $ inc s
    where c = chr s

We also add back the Ptr around end to call minusPtr. Looking at the Core we now see a very simple return path:

case GHC.Prim.chr# (GHC.Prim.word2Int# ipv1_a1D0) of _ {
  __DEFAULT -> InnerLoop.$wgo (GHC.Prim.plusAddr# ww_s1PR 1);
  '\NUL' -> ww_s1PR;

And dropping down to C-- we see:

     _s1Ry::I32 = %MO_UU_Conv_W8_W32(I8[I32[Sp + 0]]);
     _s1SP::I32 = _s1Ry::I32;
     _s1SQ::I32 = _s1SP::I32;
     if (_s1SQ::I32 < 13) goto c1Tn;
     if (_s1SQ::I32 < 32) goto c1To;
     if (_s1SQ::I32 < 36) goto c1Tp;
     if (_s1SQ::I32 != 36) goto c1Tq;
     R1 = I32[Sp + 0];
     Sp = Sp + 4;
     jump (I32[Sp + 0]); // [R1]
     _s1SO::I32 = I32[Sp + 0] + 1;
     I32[Sp + 0] = _s1SO::I32;
     jump InnerLoop.$wgo_info; // []

Not a single mention of Hp. We still have a lot more tests than we'd like though.

Version 6

The current code to check for our 5 terminating characters compares each character one by one. This entire example is based on lexing Ninja source files, so we know that most characters will be alphanumeric. Using this information, we can instead test if the character is less than or equal to $, if it is we can test for the different possibilities, otherwise continue on the fast path. We can write:

test x = x <= '$' && (x == ' ' || x == '\r' || x == '\n' || x == '$')
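Since the reordering is meant to be a pure refactoring, a quick check of my own (not from the original post) confirms the new predicate agrees with the Version 2 one over the whole byte range:

```haskell
-- the Version 2 predicate and the reordered Version 6 predicate
testOld, testNew :: Char -> Bool
testOld x = x == ' ' || x == '\r' || x == '\n' || x == '$'
testNew x = x <= '$' && (x == ' ' || x == '\r' || x == '\n' || x == '$')

-- every byte a Word8 can hold gives the same answer under both
agree :: Bool
agree = all (\c -> testOld c == testNew c) ['\0' .. '\255']
```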

Now looking at the Core we see:

go = \ (ww_s1Qt :: GHC.Prim.Addr#) ->
    case GHC.Prim.readWord8OffAddr#
           @ GHC.Prim.RealWorld ww_s1Qt 0 GHC.Prim.realWorld#
    of _ { (# _, ipv1_a1Dr #) ->
    case GHC.Prim.chr# (GHC.Prim.word2Int# ipv1_a1Dr) of wild_XH {
      __DEFAULT ->
        case GHC.Prim.leChar# wild_XH '$' of _ {
          GHC.Types.False -> go (GHC.Prim.plusAddr# ww_s1Qt 1);
          GHC.Types.True ->
            case wild_XH of _ {
              __DEFAULT -> go (GHC.Prim.plusAddr# ww_s1Qt 1);
              '\n' -> ww_s1Qt;
              '\r' -> ww_s1Qt;
              ' ' -> ww_s1Qt;
              '$' -> ww_s1Qt
      '\NUL' -> ww_s1Qt

The code looks reasonable, but the final \NUL indicates that the code first checks if the character is \NUL (or \0) and only then does our fast < $ test.

Version 7

To perform our < $ test before checking for \0 we need to modify go. We require that the argument predicate must return True on \0 (otherwise we'll run off the end of the string) and can then write:

go s@(Ptr a) | f c = a
             | otherwise = go $ inc s
    where c = chr s

test x = x <= '$' &&
    (x == ' ' || x == '\r' || x == '\n' || x == '$' || x == '\0')

The Core reads:

InnerLoop.$wgo =
  \ (ww_s1Qq :: GHC.Prim.Addr#) ->
    case GHC.Prim.readWord8OffAddr#
           @ GHC.Prim.RealWorld ww_s1Qq 0 GHC.Prim.realWorld#
    of _ { (# _, ipv1_a1Dr #) ->
    let {
      c1_a1uU [Dmd=Just L] :: GHC.Prim.Char#
      [LclId, Str=DmdType]
      c1_a1uU = GHC.Prim.chr# (GHC.Prim.word2Int# ipv1_a1Dr) } in
    case GHC.Prim.leChar# c1_a1uU '$' of _ {
      GHC.Types.False -> InnerLoop.$wgo (GHC.Prim.plusAddr# ww_s1Qq 1);
      GHC.Types.True ->
        case c1_a1uU of _ {
          __DEFAULT -> InnerLoop.$wgo (GHC.Prim.plusAddr# ww_s1Qq 1);
          '\NUL' -> ww_s1Qq;
          '\n' -> ww_s1Qq;
          '\r' -> ww_s1Qq;
          ' ' -> ww_s1Qq;
          '$' -> ww_s1Qq

The C-- reads:

     _s1Se::I32 = %MO_UU_Conv_W8_W32(I8[I32[Sp + 0]]);
     _s1Sh::I32 = _s1Se::I32;
     _s1Sg::I32 = _s1Sh::I32;
     _c1TZ::I32 = _s1Sg::I32 <= 36;
     if (_c1TZ::I32 >= 1) goto c1Ui;
     _s1Ty::I32 = I32[Sp + 0] + 1;
     I32[Sp + 0] = _s1Ty::I32;
     jump InnerLoop.$wgo_info; // []

And the assembly reads:

    movl 0(%ebp),%eax
    movzbl (%eax),%eax
    cmpl $36,%eax
    jbe _c1Ui
    incl 0(%ebp)
    jmp InnerLoop.$wgo_info

We have ended up with a fairly small 6 instruction loop.

Version 8

We've now exhausted my Haskell bag of tricks, and have to stop. But the assembly code could still be improved. In each loop we read the contents of the memory at %ebp into %eax, and increment the contents of the memory at %ebp at the end - we're manipulating the value on the top of the stack (which is pointed to by %ebp). We could instead cache that value in %ebx, and write:

    movzbl (%ebx),%eax
    cmpl $36,%eax
    jbe _c1Ui
    incl %ebx
    jmp _c1Uf

One fewer instruction, two fewer memory accesses. I tried the LLVM backend (using LLVM 2.8 at -O3), but it generated significantly worse assembly code. I don't know how to optimise any further without dropping down to the C FFI, but I'm sure one day GHC/LLVM will automatically produce the shorter assembly code.

Update: It appears on 64bit x86 GHC already produces the minimal assembly.

Saturday, January 04, 2014

Running Makefiles with Shake

Summary: Shake can interpret some Makefiles, just like Make.

Since version 0.10 the Shake build system also installs a shake executable that can interpret a Makefile, just like you would use with make. To try it out, just type shake where you would normally type make (i.e. in a directory containing a Makefile) and Shake will attempt to perform the build. I say attempt, as there are plenty of caveats, and it is unlikely that a randomly chosen Makefile will work.

The current state

  • shake can interpret some Makefile examples. For example, the first four examples from this Makefile tutorial can be executed by shake. There are plenty of missing features, too many to list. It is likely that some tutorials or simple projects will work, but most "real" projects will not.
  • shake is sufficiently compatible with make (for the subset of features it supports) that I actually run the GNU Make testsuite using shake (see test-make.sh). I currently pass 16 tests from 8 categories, out of 83 tests in 39 categories. One of the tests I pass is even failed by the MinGW version of make.
  • The command line arguments to shake are mostly a superset of those to make. In fact, all Shake programs that use shakeWithArgs support most of these options - including things like --jobs, --keep-going, --directory and --always-make.

The original plan

My original plan had 3 steps (spoiler alert: I never even finished step 1):

Step 1: I wanted to interpret many common Makefile patterns, so some simpler projects could use either make or shake interchangeably. I never expected it to work for complex projects like GHC or the Linux Kernel, but did hope it would work for the average one-author open source project.

Step 2: I wanted to add need as a command that pulled in additional dependencies. One of the weaknesses of make (fixed by Shake) is that you have to specify all dependencies up front, before you have generated other dependencies. I wanted to extend the Makefile syntax to support:

foo.tar : foo.txt
    cat foo.txt | need
    tar -cf foo.tar `cat foo.txt`

Here cat foo.txt | need would add dependencies on all the files listed by foo.txt and build them if necessary, before continuing.
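For comparison, Shake's own Haskell syntax can already express this rule. The sketch below is mine and untested here, assuming Shake's *>, need, readFileLines and cmd:

```haskell
"foo.tar" *> \out -> do
    need ["foo.txt"]                  -- the static dependency
    files <- readFileLines "foo.txt"  -- read the list of files to pack
    need files                        -- the dynamic dependencies, like `cat foo.txt | need`
    cmd "tar -cf" [out] files
```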

Step 3: I wanted to write a converter that took an extended Makefile and produced a Shakefile.hs Haskell program that used the standard Shake library, giving users a seamless migration path.

What now?

This feature has no active users, and I don't immediately see who would choose to use it. There are advantages of using shake over make, primarily profiling and progress prediction, but they aren't killer advantages. Make is a large and complex program, and while you could replicate it in Haskell on top of Shake, you probably don't want to. As an example, if a rule for a file does not update the file, Shake will not rerun that rule in subsequent builds but Make will. Arguably the Make behaviour is a bug, but like so many misfeatures, someone will have relied on it somewhere. Replicating Make bug-for-bug would be a lot of effort, end up very ugly, and if you don't have complete compatibility, big projects are unlikely to be able to switch.

Despite my pessimism around this feature, I don't intend to remove it. If it gains users I would be pleasantly surprised. If there is some small feature someone thinks it should support (ideally to gain additional users!) I'm happy to implement it. If people want to send pull requests, or even take over development of this part, I'd be very happy. Shake has gained a number of useful features while implementing Make compatibility, so even if the code is never used in anger, it was still of benefit.

A quick tour of the code

The code for the Make compatibility is all in the repo under Development/Make. It's relatively short and might serve as a nice example for intermediate level Haskell programmers. There are 5 files:

  • Type defines the type Makefile, which is the parsed representation of a Makefile (58 lines).
  • Parse takes a file and parses it, producing a Makefile value (71 lines).
  • Env provides a representation of the Make environment, describes which variables are assigned to which values, and evaluates expressions in the environment (40 lines).
  • Rules defines a new Shake rule type, mostly like the normal Shake file rule, but having slightly different semantics - if a rule fails to generate a file, that isn't an error (59 lines).
  • All puts all the pieces together - parsing a Makefile, evaluating the expressions within it while building up an environment, and then translating to Shake rules (135 lines).

The code is fairly well segregated and serves as a reasonable template for interpreting alternative input build specifications under Shake. The Development/Ninja source code follows much the same pattern.

Friday, January 03, 2014

Announcing Shake 0.11

Summary: Shake 0.11 is out, upgrading should be pretty easy.

Shake is a Haskell library for writing build tools, and I've just released version 0.11. There are a few incompatible changes, and a few new features.

Incompatible changes:

  • Everyone should move to the cmd and command functions for running external programs. I've deprecated all of the other functions (including sys, system' and systemCwd).
  • newCache is now keyed on anything, and takes an action. To obtain the same behaviour as before, make sure you need the file argument in the function passed to newCache.
  • If you are writing your own types of Rule (fairly rare) you need to import Development.Shake.Rule.
  • The storedValue function of the Rule class now gets given the current ShakeOptions.

The move to cmd and command will require the most effort to upgrade, but none of these functions have been removed, so you can upgrade first and convert over afterwards. The other changes are likely to hit very few people, and can be adapted to very easily. If you have any difficulties, email the mailing list.

Major new features in 0.11:

  • There is tracking support, to check you have not missed any dependencies. There are support functions for tracking (e.g. trackUse, trackChange) and on Windows there is support for using tracker.exe to call these functions.
  • The Shake database is now more compact, about 40% smaller for some examples.
  • If you have a few rules that overlap (e.g. one for foo.txt and one for *.txt) you can use the alternatives function to combine them.
  • The Ninja parser is faster (4x faster than Ninja itself to parse the Chromium Ninja files on one machine).
  • The continuous integration tests include a test that Shake can build the Ninja binary from Ninja build files faster than Ninja can (it is typically about 10% faster to build from scratch).

The biggest features since 0.10:

  • Support for Unicode filenames.
  • Addition of newThrottle to support throttled resources, where you can only perform n operations per minute.
  • Progress prediction.
  • Shake installs an executable shake which can build either Ninja projects (almost all) or Makefile projects (barely any).

A complete list of changes since the beginning is available here.

Thursday, December 12, 2013

Progress Reporting in Shake

Summary: Shake can predict how long your build will take. The prediction makes use of stateful applicative streams.

The Shake build system can predict how long your build will take.

The progress message is displayed in the titlebar of the console window (on Windows, Linux and Mac), for example 3m12s (82%) to indicate that the build is 82% complete and is predicted to take a further 3 minutes and 12 seconds. If you are running Windows 7 or higher and place the shake-progress utility somewhere on your %PATH% then the progress will also be displayed in the taskbar progress indicator.

Progress reporting is great for end users of Shake based build systems. They may not care that Shake is fast (since they probably have no frame of comparison for your particular build), or that it is robust, or that it is fully featured etc. But everyone wants to know if they should grab a coffee or just engage in some quick sword-fighting.


Predicting the future is tricky. Shake is a very dynamic build system, so unlike Make, while running it can both discover new things to do, and decide that things it previously intended to do are unnecessary. It also runs rules in parallel, and each rule takes a different amount of time. All these factors make the progress a "best guess", rather than a statement of fact. In particular:

  • The first run (or after deleting the Shake database) will have worse predictions, as Shake has no idea how long each rule takes to execute.
  • In the first few seconds of a build, the predicted progress may vary more dramatically, as Shake is still determining how much is left to do and how fast things are going.
  • If you have a small number of long executing rules (e.g. taking minutes) then progress changes will be very granular, as no attempt is made at predicting in-progress rules.

These limitations aside, the feature has changed the way I work, and is valuable to many people. One user remarked that "showing the estimated build time remaining in the terminal window's title is genius".

Turning the feature on

Using Shake 0.10.2 or above (from March 2013), you can pass the command line flag --progress or set the ShakeOptions field shakeProgress to progressSimple. Using the shakeProgress field you can customise how the progress messages are calculated and displayed, for example you can generate annotations for the TeamCity continuous-integration tool.

Collecting the data

The primary goal for this feature was to have zero overhead for people not using it, and to not complicate the internals of the build system. I achieved that goal by making progress generation take the internal Shake state every 5 seconds (configurable, using shakeProgress) and generate summary statistics from the state. The result is the Progress data type (to follow the rest of this post, you do not need to understand the actual fields below):

data Progress = Progress
    {isFailure :: Maybe String -- ^ Starts out 'Nothing', becomes 'Just' a target name if a rule fails.
    ,countSkipped :: Int -- ^ Number of rules which were required, but were already in a valid state.
    ,countBuilt :: Int -- ^ Number of rules which have been built in this run.
    ,countUnknown :: Int -- ^ Number of rules which have been built previously, but are not yet known to be required.
    ,countTodo :: Int -- ^ Number of rules which are currently required (ignoring dependencies that do not change), but not built.
    ,timeSkipped :: Double -- ^ Time spent building 'countSkipped' rules in previous runs.
    ,timeBuilt :: Double -- ^ Time spent building 'countBuilt' rules.
    ,timeUnknown :: Double -- ^ Time spent building 'countUnknown' rules in previous runs.
    ,timeTodo :: (Double,Int) -- ^ Time spent building 'countTodo' rules in previous runs, plus the number which have no known time (have never been built before).
    }

This structure counts the number of rules in each state, and where available, provides execution times. It reflects the information easily computable by the internals of Shake.

Computing the progress

To compute the progress I first spent a long time measuring real builds, and the Progress values, and trying to detect some correlation. The results were disappointing, and I got nowhere - the progress jumps around too much (I tried Kalman filters etc, but they didn't help). I then thought hard and decided that the best predictor of completion time is:

(number-of-rules-to-build * time-to-build-each-rule) / work-rate

So you take:

  • The number of things you have to do.
  • The time you expect each one to take.
  • How fast you are running rules, which you would expect to be roughly the -j parallelism argument.

This measure does not attempt to predict if rules are skipped or discovered, but these are unpredictable events, so ignoring them is sensible (and likely why the data-driven approach fell down). These measures can be computed by:

  • number-of-rules-to-build is known, and stored as countTodo.
  • time-to-build-each-rule is influenced by timeTodo, comprising the time to build some of the rules, and the number of rules that have never been built before. We can therefore compute the time-to-build-each-rule from the times we have, and predicted-time-to-build for the times we are missing.
  • predicted-time-to-build can be computed by taking the time things took to build this run (timeBuilt) and dividing by the number of things that rebuilt (countBuilt). Given the available data, there are a number of plausible alternatives for calculating predicted-time-to-build.
  • work-rate is roughly the parallelism flag, but for laptops, may be skewed by things like whether the machine was using battery power. We compute the work rate by tracking how long we have been building, and how much timeBuilt time we have managed to execute.
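As a rough illustration, the calculation above can be sketched in Haskell. This is a simplified model, not the real Shake code: the cut-down Progress type, the function names, and the externally supplied workRate are all hypothetical.

```haskell
-- A simplified model of the remaining-time calculation. The cut-down
-- Progress type and function names are hypothetical, and workRate is
-- assumed to be supplied externally (the real code tracks it statefully).
data Progress = Progress
    { countBuilt :: Int           -- rules built this run
    , timeBuilt  :: Double        -- time spent building them
    , timeTodo   :: (Double, Int) -- (known todo time, rules never built before)
    }

-- predicted-time-to-build: time spent this run divided by rules rebuilt
predictedPerRule :: Progress -> Double
predictedPerRule p
    | countBuilt p == 0 = 1  -- arbitrary guess before any data is available
    | otherwise = timeBuilt p / fromIntegral (countBuilt p)

-- remaining seconds: known todo time, plus a guess for never-built rules,
-- all divided by the work rate
remainingSeconds :: Double -> Progress -> Double
remainingSeconds workRate p =
    (known + fromIntegral unknown * predictedPerRule p) / workRate
    where (known, unknown) = timeTodo p
```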

Using these definitions, given the Progress structure and a Double for the number of seconds the build has been running, we can define:

progressMessage :: Progress -> Double -> String

Stateful progress

The formulae above give a reasonable approximation of progress time, but they have a number of weaknesses:

  • At the beginning we have completed no work, so timeBuilt is zero, and therefore work-rate is zero.
  • If we are running a long-running rule (e.g. 15s) then the work-rate keeps going down, as timeBuilt only increases at the end, but the time building keeps going up.
  • Builds often go in phases: rules run at the beginning are typically fast code generators, those in the middle are typically compilers, and those at the end are slow linkers. The average over all rules is therefore often an underestimate by the end.

Many of these weaknesses are due to taking the current state and computing the progress, without incorporating some information from previous progress calculations. We need to redefine our progressMessage function as:

progressMessage :: ProgressState -> Progress -> (ProgressState, String)

Now we take some ProgressState (which can include the time spent building), and in computing the message, we produce some new state, which will be used to produce the next message. For example, ProgressState can include some decaying value of predicted-time-to-build, where the new value is a weighted average between the old value and the new value, allowing new values to dominate.
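One minimal sketch of such a decaying estimate, where the function name and the weight parameter are hypothetical rather than Shake's actual code:

```haskell
-- A decaying weighted average: the stored estimate moves a fraction f
-- towards each new sample, so recent samples dominate older ones.
-- The name and exact formulation are hypothetical, for illustration only.
decayTowards :: Double -> Double -> Double -> Double
decayTowards f old new = old + f * (new - old)
```

Folding decayTowards 0.5 over the samples 10, 10, 10 starting from 0 gives 5, 7.5, 8.75: the estimate converges on the recent samples while damping out noise.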

Streamed progress

My first stateful progress predictor used an explicit state, but it was hard to thread around, and didn't feel very Haskelly. It also meant that minor local edits to the definitions required global modifications to the state, which significantly hampered tweaking the definitions. The solution was to introduce an abstraction. I played with a number of variants, but eventually settled on:

newtype Stream i a = Stream {runStream :: i -> (a, Stream i a)}

A Stream can be supplied with an input value of type i and produces an output value a and a new Stream. In our case, we will have a value of type Stream Progress String, and provide new Progress values to get our progress messages. Any stateful information can be captured in the closure. (This definition is unrelated to other Stream definitions in Haskell, which are usually just infinite lists.)
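To make the shape concrete, here is a toy Stream: a running total of its inputs, where the state lives in the closure. The totals stream and the runList driver are hypothetical examples, not part of Shake.

```haskell
newtype Stream i a = Stream { runStream :: i -> (a, Stream i a) }

-- A toy stream that outputs the running total of its inputs; the state
-- (the total so far) is captured in the closure.
totals :: Double -> Stream Double Double
totals acc = Stream $ \i -> let acc' = acc + i in (acc', totals acc')

-- Hypothetical helper: drive a stream over a list of inputs,
-- collecting the outputs.
runList :: Stream i a -> [i] -> [a]
runList _ [] = []
runList s (i:is) = let (a, s') = runStream s i in a : runList s' is
```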

I define four primitive operations on Stream, namely:

  • pure from the Applicative class, which produces a new Stream which, regardless of the input, always produces the same output.
  • <*> from the Applicative class, which applies a Stream of functions to a Stream of values to produce a Stream of results.
  • idStream :: Stream i i which is the stream which always echoes its input as its output. This primitive is the only way to get at the input.
  • foldStream :: (a -> b -> a) -> a -> Stream i b -> Stream i a, which is like scanl but for streams, keeping an accumulating a value. This primitive is the only one to keep any state between successive stream outputs.
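These primitives are short to define. The following is one plausible implementation, consistent with the signatures above but not necessarily Shake's exact code; the runList driver is a hypothetical helper for demonstration.

```haskell
newtype Stream i a = Stream { runStream :: i -> (a, Stream i a) }

instance Functor (Stream i) where
    fmap f (Stream g) = Stream $ \i ->
        let (a, s) = g i in (f a, fmap f s)

instance Applicative (Stream i) where
    -- ignore the input, always produce the same output
    pure a = Stream $ \_ -> (a, pure a)
    -- feed the input to both streams, apply the function to the value
    Stream f <*> Stream x = Stream $ \i ->
        let (f', sf) = f i
            (x', sx) = x i
        in (f' x', sf <*> sx)

-- always echo the input; the only way to get at the input
idStream :: Stream i i
idStream = Stream $ \i -> (i, idStream)

-- like scanl but for streams; the only primitive keeping state
foldStream :: (a -> b -> a) -> a -> Stream i b -> Stream i a
foldStream step z (Stream g) = Stream $ \i ->
    let (b, s) = g i
        z' = step z b
    in (z', foldStream step z' s)

-- hypothetical helper: drive a stream over a list of inputs
runList :: Stream i a -> [i] -> [a]
runList _ [] = []
runList s (i:is) = let (a, s') = runStream s i in a : runList s' is
```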

On top of these primitives I define a number of useful auxiliaries. As an example, oldStream produces a stream where each value is a pair of the previous value and the new value. We can define oldStream as:

oldStream :: a -> Stream i a -> Stream i (a,a)
oldStream old = foldStream (\(_,prev) new -> (prev,new)) (old,old)
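Running oldStream over some inputs shows the pairing behaviour. The sketch below repeats the definitions so it stands alone; foldStream is one plausible implementation of the primitive, and runList is a hypothetical driver.

```haskell
newtype Stream i a = Stream { runStream :: i -> (a, Stream i a) }

-- echo the input
idStream :: Stream i i
idStream = Stream $ \i -> (i, idStream)

-- like scanl for streams (one plausible implementation)
foldStream :: (a -> b -> a) -> a -> Stream i b -> Stream i a
foldStream step z (Stream g) = Stream $ \i ->
    let (b, s) = g i
        z' = step z b
    in (z', foldStream step z' s)

-- pair each output with the previous output, seeded with 'old'
oldStream :: a -> Stream i a -> Stream i (a, a)
oldStream old = foldStream (\(_, prev) new -> (prev, new)) (old, old)

-- hypothetical helper: drive a stream over a list of inputs
runList :: Stream i a -> [i] -> [a]
runList _ [] = []
runList s (i:is) = let (a, s') = runStream s i in a : runList s' is
```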

Using these primitives we can build up complex definitions such as:

decay :: Double -> Stream i Double -> Stream i Double -> Stream i Double
decay f a b = foldStream step 0 $ (,) <$> oldStream 0 a <*> oldStream 0 b
    where step r ((a,a'),(b,b')) = ((r*b) + f*(a'-a)) / (b + f*(b'-b))

Here we are computing a / b, but using f to decay the old value, where 1 is no decay. This definition maintains a lot of state between steps of the stream, but thanks to the Stream abstraction, we can focus on the mathematics, instead of being lost in a sea of state.
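We can check the claim that 1 means no decay: with f = 1 the definition collapses to a plain a / b. The sketch below feeds decay a cumulative-sum stream and a counting stream, so the output is a running average of the inputs. The stream implementation, the sumStream/countStream examples and the runList driver are hypothetical, consistent with the signatures above.

```haskell
newtype Stream i a = Stream { runStream :: i -> (a, Stream i a) }

instance Functor (Stream i) where
    fmap f (Stream g) = Stream $ \i -> let (a, s) = g i in (f a, fmap f s)

instance Applicative (Stream i) where
    pure a = Stream $ \_ -> (a, pure a)
    Stream f <*> Stream x = Stream $ \i ->
        let (f', sf) = f i
            (x', sx) = x i
        in (f' x', sf <*> sx)

idStream :: Stream i i
idStream = Stream $ \i -> (i, idStream)

foldStream :: (a -> b -> a) -> a -> Stream i b -> Stream i a
foldStream step z (Stream g) = Stream $ \i ->
    let (b, s) = g i
        z' = step z b
    in (z', foldStream step z' s)

oldStream :: a -> Stream i a -> Stream i (a, a)
oldStream old = foldStream (\(_, prev) new -> (prev, new)) (old, old)

-- the decay combinator from the post
decay :: Double -> Stream i Double -> Stream i Double -> Stream i Double
decay f a b = foldStream step 0 $ (,) <$> oldStream 0 a <*> oldStream 0 b
    where step r ((a, a'), (b, b')) = ((r*b) + f*(a'-a)) / (b + f*(b'-b))

-- hypothetical examples: cumulative sum of the inputs, and a counter
sumStream, countStream :: Stream Double Double
sumStream = foldStream (+) 0 idStream
countStream = foldStream (\n _ -> n + 1) 0 idStream

-- hypothetical helper: drive a stream over a list of inputs
runList :: Stream i a -> [i] -> [a]
runList _ [] = []
runList s (i:is) = let (a, s') = runStream s i in a : runList s' is
```

With f = 1 the output is exactly sum / count, the running average; larger values of f (such as the decay 10 used for unknownRuleTime) weight recent samples more heavily.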

Using these functions we can start to calculate the progress, for example:

unknownRuleTime = decay 10 (timeBuilt <$> progress) (fromInt . countBuilt <$> progress)

Here unknownRuleTime is the time spent building, divided by the number of rules that have been built, decaying at a fairly rapid rate, meaning more recent rules will have a greater impact on unknownRuleTime.

The entire progress generator takes 37 lines, and has some of the highest comment densities in Shake. The complete Stream abstraction and progress prediction code can be found in the git repo. I would be surprised if someone has not already discovered Stream, given it a name, and investigated its properties - any pointers most welcome.

Update: I have renamed Stream to Mealy, following the machines package.