Tuesday, February 26, 2013

Shake Links

Summary: Shake is now hosted on GitHub and there is a mailing list.

I've just converted Shake over to Git, hosted on GitHub, and created a mailing list. The full set of Shake related links is now:

  • Download the Haskell package from Hackage and install it using Cabal.
  • Documentation, including examples and a list of all the available functions.
  • Mailing list for any questions/thoughts on Shake.
    • Questions may also be asked on StackOverflow with the tag shake-build-system.
    • Bugs may also be reported on the GitHub issue tracker, but if you aren't sure, use the mailing list.
  • Source code in a git repo, stored at GitHub.

If you want to talk to me about Shake you can now write an email to the mailing list, create a question on StackOverflow, or open a GitHub issue (or still contact me by personal email). I'm available on any means of communication, so pick what suits you best - I'm expecting most users will start with a StackOverflow question if it's a "how do I..." question, or the mailing list if it's more of a discussion/suggestion.

I'm very interested in hearing tales of people who have successfully used Shake, so if Shake works for you, perhaps write it up on a blog, or just send an email to the mailing list. I'm also interested in what doesn't work, what was difficult, what documentation was lacking etc.

Why the sudden Shake activism? Some readers may have noticed a sudden surge in Shake related activity. The reason is that I was reminded of the problems with bad build systems, after using only shake/ghc for many years. I had to work with a team for several hours with a build system written in SCons. The "null build" (with nothing to do) took 38 seconds. The build was structured as three separate SCons build systems, one which compiled a code generator, one which ran the code generator, then another that compiled the result. The build could only run single threaded because it made use of the current directory. A project that was not particularly complex was suffering at the hands of their build system. Some of these are limitations in SCons, some are weaknesses in their use of SCons, but the build system was hindering them, not helping them. I wasted a lot of my time waiting for rebuilds that should have taken seconds. Developers deserve better.

Monday, February 25, 2013

Chasing a Space Leak in Shake

Summary: Shake v0.3 had a serious space leak that went undiagnosed for over a year. This post describes how I tracked it down.

Introduction to Space Leaks

One of the downsides of a lazy language is the possibility of space leaks. A space leak is when your program uses more memory than it should, typically because lazy evaluation is holding on to things that if evaluated would disappear. As an example, let's build up a set:

myset = Set.delete dead $ Set.fromList [alive, dead]

Consider the value myset. We first create a set with alive and dead, then delete dead. Assuming the values dead and alive are used nowhere else in the program, which values does the garbage collector consider to be alive?

In a strict language only alive would be considered alive. But in a lazy language, as we have not yet evaluated myset, both alive and dead are kept alive. You can think of myset being represented in memory as Thunk (Set.delete dead) (Value (Set.fromList [alive,dead])), instead of just Value (Set.fromList [alive]). You are storing the list of operations that will be applied to the set, rather than the result of those operations. These operations reference dead, so we cannot garbage collect it. Once we have identified the problem, we can apply strictness operations to force the value, for example:

evaluate myset

The evaluate function forces evaluation of the first part of myset. If our set data type is strict, that will force evaluation of the entire myset value. After evaluate there will be no references to dead, and it will be garbage collected. In other situations bang patterns, strict fields or seq can be used for eliminating space leaks.
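The example above can be made concrete with a small sketch. It assumes Data.Set from the containers package and uses Int elements for simplicity; mkSet and forceSet are illustrative names, not anything from Shake.

```haskell
import Control.Exception (evaluate)
import qualified Data.Set as Set

-- Build the set from the example. The result is returned as an
-- unevaluated thunk that still references both arguments.
mkSet :: Int -> Int -> Set.Set Int
mkSet alive dead = Set.delete dead $ Set.fromList [alive, dead]

-- Force the set to weak head normal form. Data.Set is strict in its
-- structure and elements, so once this returns the pending delete has
-- been performed and the result no longer references the deleted element.
forceSet :: Set.Set Int -> IO (Set.Set Int)
forceSet = evaluate
```

With a lazy container (for example an ordinary list), evaluate would only force the first constructor, and something stronger would be needed to remove every reference.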

Finding the Space Leak in Shake

The space leak in Shake was found over a period of a few weeks, ultimately with me waking up at 4am on New Year's Day with knowledge of where an exclamation mark should be placed. However, I will present the account somewhat more methodically...

Step 1: Admitting you have a problem

The first step to fixing a space leak is the realisation that a space leak exists. My suspicion is that many (most?) large Haskell programs have space leaks, but they often go unnoticed. My first report of "memory issues" with Shake was a few months before I tracked it down. Shake is a library that makes use of user supplied rules, and in this example, the user supplied rules were far from trivial. Complicating the problem is the fact that Shake stores a list of all rules it has run in memory, which naturally grows over time. What finally convinced me that there was a problem was when several clean builds failed by exhausting memory, using more than 2Gb on a 32-bit system.

Step 2: Seeing the problem with your own eyes

After suspecting a space leak, the first thing to do is measure memory usage, which I did using Process Explorer. The graph to the left shows a clear increase in memory. The memory doesn't grow all at once, but in steps - typical of a garbage collected language. The time between memory growth is not evenly spaced, which is reasonable for a program that runs different user rules all of which take different amounts of time. I reran the program several times, and while memory always increased, the shape of the graph varied quite considerably - as expected when running a non-deterministic program such as Shake. While I still didn't have any real information on what caused the memory leak, I could at least observe the memory leak myself.

Step 3: Hit it with your tools

Before thinking too hard, it is worth applying whatever tools are to hand. In the case of space leaks, the GHC manual describes how to use the heap profiling tools. My standard pattern is to compile with -rtsopts -prof -auto-all -caf-all, run with +RTS -h and view with hp2ps -c. Since the test case was huge (over an hour), I always terminated it early with Ctrl-C, which produces an invalid .hp file. Fortunately, you can delete everything after the last END_SAMPLE to recover a profile that hp2ps can understand.
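The recovery step for an interrupted profile can be automated. The following is a minimal sketch, assuming the .hp file is line oriented and that each completed sample ends with a line starting with END_SAMPLE; trimProfile is a hypothetical helper, not part of GHC or hp2ps.

```haskell
import Data.List (isPrefixOf)

-- Keep everything up to and including the last END_SAMPLE line,
-- dropping any partially written sample after it.
-- If the input contains no END_SAMPLE line at all, the result is empty.
trimProfile :: String -> String
trimProfile src =
    unlines (reverse (dropWhile (not . isEndSample) (reverse (lines src))))
  where
    isEndSample = ("END_SAMPLE" `isPrefixOf`)
```

One way to use it is to read the truncated profile with readFile, apply trimProfile, and write the result to a new file before running hp2ps on it.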

The results on the right show steadily increasing memory allocated to PINNED values, of type ARR_WORDS, which I suspect are bytestrings. However, the total memory usage as reported by GHC and Process Explorer far exceeded that shown in the profile. It seemed like there was something missing from the profile. I upgraded to the latest GHC version but memory still seemed to be missing. The tools were not giving me any insight - perhaps due to the high degree of parallelism or other complexities in the build system.

Step 4: Simplify

The next step was to produce a simpler example - one I could run to completion in a feasible amount of time. My standard approach is to take the full test case and remove things, checking the problem has not gone away, until nothing more can be removed - leaving a minimal test case. There are two reasons why that was infeasible in this instance: 1) With a program which naturally consumes memory over time, it is not clear if the space leak has disappeared or merely become smaller as things are removed. 2) The first reduction steps would have each taken over an hour.

After deciding reduction was infeasible, I decided to try and produce a test case from scratch (which is usually a terrible idea). The original example used many different types of build rule, in many different ways - it was entirely possible that only one particular variant led to the memory leak. I decided to start with a small example, then (if necessary) try adding features until the space leak returned. Since I had been meaning to produce a benchmark for Shake at some point, I figured I could write a benchmark test which would hopefully show the bug, and even if not, be useful elsewhere.

For benchmarking, generating a random build tree is not so useful, so I attempted to define a regular shaped but interesting dependency pattern. The pattern I settled on was parameterised by breadth and depth: every node at depth n depends on every node at depth n+1, up to the given depth, with breadth nodes at each level. To accurately benchmark I used the file rules, as these are by far the most common.
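The shape of that dependency pattern can be sketched as a pure function. This is only an illustration of the graph, not the benchmark code in Shake: the Node type, the 1-based numbering and the function names are assumptions made for the example.

```haskell
-- A node is identified by its depth and its index within that level.
type Node = (Int, Int)

-- All nodes in the graph: `breadth` nodes at each of `depth` levels.
nodes :: Int -> Int -> [Node]
nodes breadth depth = [(d, i) | d <- [1 .. depth], i <- [1 .. breadth]]

-- A node at depth d depends on every node at depth d+1.
-- Nodes at the deepest level have no dependencies.
deps :: Int -> Int -> Node -> [Node]
deps breadth depth (d, _)
    | d < depth = [(d + 1, i) | i <- [1 .. breadth]]
    | otherwise = []

-- Total number of dependency edges: breadth * breadth * (depth - 1).
edgeCount :: Int -> Int -> Int
edgeCount breadth depth =
    length (concatMap (deps breadth depth) (nodes breadth depth))
```

Under these assumptions a 100x100 graph has 10,000 nodes and 990,000 dependency edges, which gives a sense of why even the reduced test case exercises the build system heavily.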

I found that with breadth=1000 depth=1000 I was able to exhaust the memory. In order to try and simplify the test case I tried passing flags to the test case to turn off certain features and try and make the problem easier to investigate. I was able to turn off multithreading (-j1), profile reports (--no-report), always build from scratch (--clean) and to build rules in a deterministic order (--deterministic). The final command was:

$ shake benchmark breadth=1000 depth=1000 --deterministic -j1 --clean --no-report

This command works using a released version of Shake if you install the test program (install with --flags=testprog).

With the knowledge that 1000x1000 exceeded available memory I started reducing the numbers so the benchmark would complete in a reasonable timeframe (< 15s), but use an unreasonable amount of memory (> 500Mb). I found that 100x100 gave a reasonable balance.

Step 5: Hit it with your tools (harder)

With a simple example, I once again turned to the heap profiling tools available in GHC. Using +RTS -h I still saw memory unaccounted for, as before. The two obvious guesses are that GHC knows about the memory, but has decided to not show it in the heap profile, or that GHC does not know about the memory (for example, allocated on a different heap). Using +RTS -s I saw that GHC was aware of the additional memory, suggesting GHC had at least some knowledge of the memory. With nothing else to try, I ran through the list of heap profiling flags trying each in turn.

The magic flags turned out to be -xt -hy, producing the graph on the left. The profile shows STACK takes up the majority of the space, and starts to give memory usage in about the right ballpark. I concluded that the space leak must include stack values.

Step 6: Hypothesize and test

A stack in GHC is created by a thread, and a thread is typically created by forkIO. There are three reasons stacks could take too much memory - too large, too many, too long:

  • The stacks might be becoming too large. I added -K1K to limit all threads to 1Kb of stack. The small test case ran to completion, without any threads exceeding their stack, so the stacks were not growing too large.
  • There might be too many threads active in the program. I added logging of every call to forkIO, and every time a thread finished. I found I was churning through 1000's of threads, but at most 60 were alive at any one time.
  • The threads might be keeping their stacks alive after they had finished for too long. Having eliminated the other possibilities, this seemed likely.
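The thread counting in the second bullet can be approximated with a wrapper around thread creation. This is a sketch rather than the logging actually added to Shake: Counts, forkCounted and waitIdle are hypothetical names, and it uses forkFinally from Control.Concurrent so the count is decremented even if the thread throws.

```haskell
import Control.Concurrent (ThreadId, forkFinally, threadDelay)
import Data.IORef (IORef, newIORef, atomicModifyIORef', readIORef)

-- Running totals: threads ever started, threads currently alive,
-- and the highest number alive at any one time.
data Counts = Counts { started :: !Int, alive :: !Int, peak :: !Int }
    deriving (Eq, Show)

newCounts :: IO (IORef Counts)
newCounts = newIORef (Counts 0 0 0)

-- Fork a thread, recording its start now and its end when it finishes.
forkCounted :: IORef Counts -> IO () -> IO ThreadId
forkCounted ref act = do
    atomicModifyIORef' ref $ \(Counts s a p) ->
        (Counts (s + 1) (a + 1) (max p (a + 1)), ())
    forkFinally act $ \_ ->
        atomicModifyIORef' ref $ \c -> (c { alive = alive c - 1 }, ())

-- Poll until no counted threads remain alive, then return the totals.
waitIdle :: IORef Counts -> IO Counts
waitIdle ref = do
    c <- readIORef ref
    if alive c == 0 then return c else threadDelay 1000 >> waitIdle ref
```

Comparing started against peak is what separates "churning through many threads" from "too many threads alive at once".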

Looking for places threads were referenced, and thus potentially kept alive, I found a set I was building up containing ThreadIds. The set is used so that if one Shake thread throws an exception all other threads are killed immediately. When a thread started it was added to the set. When a thread finished it was removed. Under most circumstances the set was never evaluated, but simply thrown away when the build finished. This situation corresponds to the example at the start of this post, and was causing a space leak of ThreadId.

I guessed that if the ThreadId is kept alive then the stack is also. As a result, leaking a small amount of memory for a thread identifier was amplified by also leaking the stack for that thread. GHC HQ could certainly modify the runtime system so that the ThreadId did not keep the stack alive, but holding on to completed ThreadIds is rarely a good idea, so I imagine it is not worth it. For more details, including a profile showing that leaking ThreadIds does leak stacks, see this blog post on the GHC scheduler.

Step 7: Fix the problem

Having identified the likely culprit, I simply needed to force evaluation of the set at each step. The set was held within a larger data structure:

data S = S {threads :: Set.HashSet ThreadId, ...}

To always force the set, I added an exclamation mark, turning it into a strict record:

data S = S {threads :: !(Set.HashSet ThreadId), ...}

Since I had worked for days to find somewhere to insert a single character, I also added a comment, just in case I ever thought about removing the exclamation mark.
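The effect of the strict field can be sketched in isolation. This is not Shake's code: it swaps in Data.Set from containers instead of a HashSet, keeps the state in an IORef, and the function names are made up for the example. Note that a strict field only helps if the record itself is evaluated at each step, which is why the sketch uses modifyIORef' rather than the lazy modifyIORef.

```haskell
import Control.Concurrent (ThreadId, myThreadId)
import Data.IORef (IORef, newIORef, modifyIORef', readIORef)
import qualified Data.Set as Set

-- The bang forces the set whenever an S value is evaluated, so no
-- chain of pending insert/delete thunks can build up inside the field.
data S = S { threads :: !(Set.Set ThreadId) }

-- Record a thread starting or finishing. modifyIORef' evaluates the new
-- S, and the strict field in turn evaluates the updated set.
started, finished :: IORef S -> ThreadId -> IO ()
started ref t = modifyIORef' ref $ \s -> s { threads = Set.insert t (threads s) }
finished ref t = modifyIORef' ref $ \s -> s { threads = Set.delete t (threads s) }

-- Number of threads currently recorded as running.
liveThreads :: IORef S -> IO Int
liveThreads ref = Set.size . threads <$> readIORef ref

-- Add and remove the current thread, then report how many remain.
demo :: IO Int
demo = do
    ref <- newIORef (S Set.empty)
    t <- myThreadId
    started ref t
    finished ref t
    liveThreads ref
```

Without the bang, each update would wrap the previous set in another thunk, and every ThreadId ever inserted would stay reachable until the set was finally inspected.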

Step 8: Check the problem went away

To check the problem had gone away I reran my reduced test case, producing the graph on the right, and showing far less memory usage in Process Explorer. I then reran the full build system and saw a typical reduction in memory of around 1Gb - a most welcome improvement!

Shake-0.4 and later contain that important exclamation mark, so an upgrade is strongly advised.

Does your Haskell program need an additional exclamation mark? Here are some you can use: !!!!!!!!!!!!!!

Sunday, February 17, 2013

Finite Resources in Shake

Summary: Management of finite resources is an important part of any modern build system, only properly available in Shake and Ninja.

I've just released Shake 0.9, a build system library, with a few bug fixes and a bunch of new features (the change log has a complete list). This release contains an incompatible change which makes the Resource feature easier to use, so I thought I'd describe the motivation and use of Resources in Shake. A full upgrade guide is at the bottom of this post.

What are Resources for?

When you run -j10 (shakeThreads=10) you are asking the build system to limit computation so it uses no more than ten CPU resources at a time. The CPU is certainly a precious resource, but there are other resource limitations a build system may need to obey:

  • Some APIs are global in nature: if you run two programs that access the Excel API at the same time, things start to fail.
  • Many people have large numbers of CPUs, but only one slow rotating hard drive. If you run ten hard-drive thrashing linkers simultaneously the computer is likely to grind to a halt.
  • Some proprietary software requires licenses, a fixed number of which can be purchased and managed using a license manager. As an example, the Kansas Lava team only have access to 48 licenses for modelsim.

Resources in other build systems

I know of two approaches used by other build systems to obey resource constraints:

  • Limit the number of CPUs to hit your target - for example, the Lava build system could cap the number of CPUs to the number of licenses. People with 24 CPUs might ask the build system to use only 8, so the linkers do not make their machines unusable (and even then, a link heavy rebuild may still harm interactive performance). This solution wastes CPU resources, leaving CPUs that could be building your code idling.
  • Add locks to suspend jobs that are competing for the shared resource. For example any rule using Excel could take the Excel lock, either a mutex/MVar in some build systems, or creating a file to serve as the lock in make based build systems. Locking can be made to work, but is tricky if you have to fake locks using the file system, and still squanders CPU resources - instead of blocking the CPU should be running another rule.

The one exception is the Ninja build system which has a concept of "pools", which properly model finite resources.
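For contrast, the lock-based approach from the list above can be sketched with a quantity semaphore from base. This illustrates the blocking style only, not how Shake or Ninja implement resources; withLock and maxConcurrent are hypothetical names, and the short threadDelay stands in for real work.

```haskell
import Control.Concurrent (forkFinally, threadDelay)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.QSem (QSem, newQSem, waitQSem, signalQSem)
import Control.Exception (bracket_)
import Control.Monad (forM, forM_)
import Data.IORef (newIORef, atomicModifyIORef', readIORef)

-- Run an action while holding one unit of the semaphore.
-- A job that cannot get a unit blocks here, doing no useful work.
withLock :: QSem -> IO a -> IO a
withLock sem = bracket_ (waitQSem sem) (signalQSem sem)

-- Start `jobs` threads that all compete for a resource of the given
-- quantity, and return the largest number that held it at once.
maxConcurrent :: Int -> Int -> IO Int
maxConcurrent quantity jobs = do
    sem <- newQSem quantity
    active <- newIORef (0 :: Int)
    peak <- newIORef (0 :: Int)
    dones <- forM [1 .. jobs] $ \_ -> do
        done <- newEmptyMVar
        _ <- forkFinally
            (withLock sem $ do
                n <- atomicModifyIORef' active (\a -> (a + 1, a + 1))
                atomicModifyIORef' peak (\p -> (max p n, ()))
                threadDelay 1000
                atomicModifyIORef' active (\a -> (a - 1, ())))
            (\_ -> putMVar done ())
        return done
    forM_ dones takeMVar
    readIORef peak
```

The constraint is respected, but every job waiting in withLock occupies a build thread that could have been running an unrelated rule, which is the waste described above.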

Resources in Shake

In Shake the Resource type represents a finite resource, which multiple build rules can use. Resource values are created with newResource and used with withResource. As an example, only one set of calls to the Excel API can occur at one time, therefore Excel is a finite resource of quantity 1. You can write:

shake shakeOptions{shakeThreads=2} $ do
    want ["a.xls","b.xls"]
    excel <- newResource "Excel" 1
    "*.xls" *> \out ->
        withResource excel 1 $
            system' "excel" [out,...]

Now we will never run two copies of excel simultaneously. Moreover, it will never block waiting for excel if there are other rules that could be run.

For most programming languages the compiler is CPU bound but the linker is disk bound. Running 8 linkers will often cause an 8 CPU system to grind to a halt. We can limit ourselves to 4 linkers with:

disk <- newResource "Disk" 4
want [show i <.> "exe" | i <- [1..100]]
"*.exe" *> \out ->
    withResource disk 1 $
        system' "ld" ["-o",out,...]
"*.o" *> \out ->
    system' "cl" ["-o",out,...]

Now we can use 7 or 8 CPUs while still leaving the computer responsive enough to browse the web.

Software licenses are another finite resource and can be managed in the same way. For a complete example see the Kansas Lava test program, which uses Shake.

Porting from Shake 0.8

In Shake 0.9 the newResource function has been renamed to newResourceIO - rename newResource to newResourceIO everywhere and your code will work again.

However, you may have noticed that newResourceIO (as it is now called) forces you to create the resource before calling the shake function, meaning that often the creation and use of the resource are far apart. I have introduced a function newResource which runs in the Rules monad, allowing you to create a resource and then use it nearby. Moving the creation and use of resources closer together makes it much easier to check your resource constraints are met.

The only other breaking change is that shakeVersion has become a String rather than an Int, allowing you to store more precise information about the version (for example, your build system might want to encode the GHC version and the version of the build system in the string).

Updated 18 Feb 2013: Ninja also supports finite resources.

Thursday, February 07, 2013

A Nofib build system using Shake

Last February I had a few discussions with David Terei about replacing the Nofib benchmark suite build system with something based on Shake. Naturally, like all Make to Shake conversions, I thought this was an excellent idea. In further discussions with Simon Peyton Jones and Simon Marlow I refined the build system to match the useful features of the old Make based system.

I have now put the build system online: Shake build system for Nofib.

Unfortunately, I don't think much has happened with the build system in the last year. I invite people to play with it and use it for whatever purpose they want. The rest of this post is divided into two sections - the first section is how to use the build system (for people running nofib) the second half is how it works and how it was written (for people writing build systems with Shake).

Running the nofib suite

Grab the nofib suite, take the above file and put it in the root directory. You can then run:

$ runhaskell Nofib -j2 imaginary --way="-O2 -fvia-C" --run
... build related output ...
Build completed
Running imaginary/bernouilli...501ms
Running imaginary/digits-of-e1...513ms
... more tests ...
Running imaginary/wheel-sieve2...238ms
Running imaginary/x2n1...27ms

The -j2 is the parallelism to use when building (not when testing). The --way is which flags to pass to the compiler (so it should be trivial to experiment with LLVM vs not). The "imaginary" is which section to run, but naming an individual test such as "x2n1", or leaving it blank to run everything, also works. The --run says run the tests afterwards, defaulting to norm, but --run=fast/--run=slow also works.

The system uses whatever GHC is first on your path, and stores the output under that GHC version, but if you want two GHCs built with differing bits, you can pass --compiler=ghc-with-a-tweak to do the build under a separate directory.

A few more examples, assuming the build system is compiled as runner, with the nofib command and the new equivalent:

Run a single test:

$ cd nofib/imaginary/exp3_8
$ make

runner exp3_8 --run

Run a test with some extra GHC flags:

$ cd nofib/imaginary/exp3_8
$ make EXTRA_HC_OPTS=-fllvm

runner exp3_8 --way="-O1 -fllvm" --run

Run a test with a different GHC:

$ cd nofib/imaginary/exp3_8
$ make HC=ghc-7.0.4

runner exp3_8 --compiler=ghc-7.0.4 --run

Just build a test, don't run it:

$ make NoFibRuns=0


Writing the build system

This build system is well commented, so I encourage people to read the code. It makes use of CmdArgs for command line parsing (lines 60-114) combined with Shake for dependency based tracking (lines 187-281). The total system is 358 lines long, and the rest is mostly pieces for running tests.

When writing the build system I had no idea what the old Makefile system did - I read what docs I could find, took a brief stab at reading the Makefile, but failed to even run it on the laptop I was working on. Therefore, I wrote the new system from scratch, and then looked back to see what the Makefiles did that the new system did not.

As is common in Make based build systems, the Makefile stored a combination of build rules (how to compile files, line 187), static configuration data (which targets to build, line 31) and dynamic configuration data (what inputs to give when testing, line 289). The build rules and static configuration data can be easily moved into the new build system, either as function rules or top-level constants. The dynamic configuration can be moved over as top-level constants, but that means the build system requires modifying more than it should, and it is harder to track the dependencies on this data. Instead I followed an approach I've used in several build systems, by writing a converter which sucks the dynamic data out of the Makefile and produces a much simpler config file. The converter is in the build system, so the config files are generated from the Makefiles, and updated appropriately. To actually run the tests, I query the config files. The hope is that in future everyone will decide to delete the Makefiles, and the config files can be checked in as source files, and the translation aspect can be deleted.
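The converter idea can be sketched as a small pure function. This is not the converter from the Nofib build system: it assumes the interesting data lives in simple single-line assignments of the form NAME = value (or +=, :=, ?=), ignores continuation lines and Make functions, and all function names are hypothetical.

```haskell
import Data.Char (isAlphaNum, isSpace)
import Data.List (dropWhileEnd)

-- Strip leading and trailing whitespace.
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

-- A variable name made only of letters, digits and underscores.
validName :: String -> Bool
validName n = not (null n) && all (\c -> isAlphaNum c || c == '_') n

-- Recognise a simple assignment, returning the variable name and value.
-- Comments, rules and recipe lines fail the name check and give Nothing.
parseAssign :: String -> Maybe (String, String)
parseAssign line =
    case break (== '=') line of
        (lhs, '=' : rhs) ->
            let name = trim (dropWhileEnd (`elem` "+:?") (trim lhs))
            in if validName name then Just (name, trim rhs) else Nothing
        _ -> Nothing

-- Pull the wanted variables out of a Makefile's text, in file order.
extractConfig :: [String] -> String -> [(String, String)]
extractConfig wanted src =
    [(k, v) | Just (k, v) <- map parseAssign (lines src), k `elem` wanted]

-- Render the extracted data as a simple "KEY = value" config file.
renderConfig :: [(String, String)] -> String
renderConfig = unlines . map (\(k, v) -> k ++ " = " ++ v)
```

The list of wanted variables is whatever the tests need at run time; because the output is a plain key/value file, the build system can depend on it like any other generated file.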

When writing the build system I ran into a GHC bug where two linkers running simultaneously would fail. To work round this bug I created a resource representing the linker, with quantity 1 (line 152). Whenever I call the GHC linker I acquire this resource (line 215). Now Shake will run the rest of the build system with maximum parallelism, but ensure no linkers ever run in parallel, working round the bug.