Monday, February 25, 2013

Chasing a Space Leak in Shake

Summary: Shake v0.3 had a serious space leak that went undiagnosed for over a year, this post describes how I tracked it down.

Introduction to Space Leaks

One of the downsides of a lazy language is the possibility of space leaks. A space leak is when your program uses more memory than it should, typically because lazy evaluation is holding on to things that if evaluated would disappear. As an example, let's build up a set:

myset = Set.delete dead $ Set.fromList [alive, dead]

Consider the value myset. We first create a set with alive and dead, then delete dead. Assuming the values dead and alive are used nowhere else in the program, which values does the garbage collector consider to be alive?

In a strict language only alive would be considered alive. But in a lazy language, as we have not yet evaluated myset, both alive and dead are kept alive. You can think of myset being represented in memory as Thunk (Set.delete dead) (Value (Set.fromList [alive,dead])), instead of just Value (Set.fromList [alive]). You are storing the list of operations that will be applied to the set, rather than the result of those operations. These operations reference dead, so we cannot garbage collect it. Once we have identified the problem, we can apply strictness operations to force the value, for example:

evaluate myset

The evaluate function forces evaluation of the first part of myset. If our set data type is strict, that will force evaluation of the entire myset value. After evaluate there will be no references to dead, and it will be garbage collected. In other situations bang patterns, strict fields or seq can be used for eliminating space leaks.

Finding the Space Leak in Shake

The space leak in Shake was found over a period of a few weeks, ultimately with me waking up at 4am on New Years Day with knowledge of where an exclamation mark should be placed. However, I will present the account somewhat more methodically...

Step 1: Admitting you have a problem

The first step to fixing a space leak is the realisation that a space leak exists. My suspicion is that many (most?) large Haskell programs have space leaks, but they often go unnoticed. My first report of "memory issues" with Shake was a few months before I tracked it down. Shake is a library that makes use of user supplied rules, and in this example, the user supplied rules were far from trivial. Complicating the problem is the fact that Shake stores a list of all rules it has run in memory, which naturally grows over time. What finally convinced me that there was a problem was when several clean builds failed by exhausting memory, taking greater than 2Gb on a 32bit system.

Step 2: Seeing the problem with your own eyes

After suspecting a space leak, the first thing to do is measure memory usage, which I did using Process Explorer. The graph to the left shows a clear increase in memory. The memory doesn't grow all at once, but in steps - typical of a garbage collected language. The time between memory growth is not evenly spaced, which is reasonable for a program that runs different user rules all of which take different amounts of time. I reran the program several times, and while memory always increased, the shape of the graph varied quite considerably - as expected when running a non-deterministic program such as Shake. While I still didn't have any real information on what caused the memory leak, I could at least observe the memory leak myself.

Step 3: Hit it with your tools

Before thinking too hard, it is worth applying whatever tools are to hand. In the case of space leaks, the GHC manual describes how to use the heap profiling tools. My standard pattern is to compile with -rtsopts -prof -auto-all -caf-all, run with +RTS -h and view with hp2ps -c. Since the test case was huge (over an hour), I always terminated it early with Ctrl-C, which produces an invalid .hp file. Fortunately, you can delete everything after the last END_SAMPLE to recover a profile that hp2ps can understand.

The results on the right show steadily increasing memory allocated to PINNED values, of type ARR_WORDS, which I suspect are bytestrings. However, the total memory usage as reported by GHC and Process Explorer far exceeded that shown in the profile. It seemed like there was something missing from the profile. I upgraded to the latest GHC version but memory still seemed to be missing. The tools were not giving me any insight - perhaps due to the high degree of parallelism or other complexities in the build system.

Step 4: Simplify

The next step was to produce a simpler example - one I could run to completion in a feasible amount of time. My standard approach is to take the full test case and remove things, checking the problem has not gone away, until nothing more can be removed - leaving a minimal test case. There are two reasons why that was infeasible in this instance: 1) With a program which naturally consumes memory over time, it is not clear if the space leak has disappeared or merely become smaller as things are removed. 2) The first reduction steps would have each taken over an hour.

After deciding reduction was infeasible, I decided to try and produce a test case from scratch (which is usually a terrible idea). The original example used many different types of build rule, in many different ways - it was entirely possible that only one particular variant led to the memory leak. I decided to start with a small example, then (if necessary) try adding features until the space leak returned. Since I had been meaning to produce a benchmark for Shake at some point, I figured I could write a benchmark test which would hopefully show the bug, and even if not, be useful elsewhere.

For benchmarking, generating a random build tree is not so useful, so I attempted to define a regular shaped but interesting dependency pattern. The pattern I settled on was parameterised by breadth and depth, where every node a depth n depended on every node at depth n+1, to a limit of the given depth, with a given number of nodes at each level given by breadth. To accurately benchmark I used the file rules, as these are by far the most common.

I found that with breadth=1000 depth=1000 I was able to exhaust the memory. In order to try and simplify the test case I tried passing flags to the test case to turn off certain features and try and make the problem easier to investigate. I was able to turn off multithreading (-j1), profile reports (--no-report), always build from scratch (--clean) and to build rules in a deterministic order (--deterministic). The final command was:

$ shake benchmark breadth=1000 depth=1000 --deterministic -j1 --clean --no-report

This command works using a released version of Shake if you install the test program (install with --flags=testprog).

With the knowledge that 1000x1000 exceeded available memory I start reducing the numbers so the benchmark would complete in a reasonable timeframe (< 15s), but use an unreasonable amount of memory (> 500Mb). I found that 100x100 gave a reasonable balance.

Step 5: Hit it with your tools (harder)

With a simple example, I once again turned to the heap profiling tools available in GHC. Using +RTS -h I still saw memory unaccounted for, as before. The two obvious guesses are that GHC knows about the memory, but has decided to not show it in the heap profile, or that GHC does not know about the memory (for example, allocated on a different heap). Using +RTS -s I saw that GHC was aware of the additional memory, suggesting GHC had at least some knowledge of the memory. With nothing else to try, I ran through the list of heap profiling flags trying each in turn.

The magic flags turned out to be -xt -hy, producing the graph on the left. The profile shows STACK takes up the majority of the space, and starts to give memory usage in about the right ballpark. I concluded that the space leak must include stack values.

Step 6: Hypothesize and test

A stack in GHC is created by a thread, and a thread is typically created by forkIO. There are three reasons stacks could take too much memory - too large, too many, too long:

  • The stacks might be becoming too large. I added -K1K to limit all threads to 1Kb of stack. The small test case ran to completion, without any threads exceeding their stack, so the stacks were not growing too large.
  • There might be too many threads active in the program. I added logging of every call to forkIO, and every time a thread finished. I found I was churning through 1000's of threads, but at most 60 were alive at any one time.
  • The threads might be keeping their stacks alive after they had finished for too long. Having eliminated the other possibilities, this seemed likely.

Looking for places threads were referenced, and thus potentially kept alive, I found a set I was building up containing ThreadIds. The set is used so that if one Shake thread throws an exception all other threads are killed immediately. When a thread started it was added to the set. When a thread finished it was removed. Under most circumstances the set was never evaluated, but simply thrown away when the build finished. This situation correponds to the example at the start of this post, and was causing a space leak of ThreadId.

I guessed that if the ThreadId is kept alive then the stack is also. As a result, leaking a small amount of memory for a thread identifier was amplified by also leaking the stack for that thread. GHC HQ could certainly modify the runtime system so that the ThreadId did not keep the stack alive, but holding on to completed ThreadIds is rarely a good idea, so I imagine it is not worth it. For more details, including a profiling showing that leaking ThreadIds does leak stacks, see this blog post on the GHC scheduler.

Step 7: Fix the problem

Having identified the likely culprit, I simply needed to force evaluation of the set at each step. The set was held within a larger data structure:

data S = S {threads :: Set.HashSet ThreadId, ...}

To always force the set, I added an exclamation mark, turning it into a strict record:

data S = S {threads :: !(Set.HashSet ThreadId), ...}

Since I had worked for days to find somewhere to insert a single character, I also added a comment, just in case I ever thought about removing the exclamation mark.

Step 8: Check the problem went away

To check the problem had gone away I reran my reduced test case, producing the graph on the right, and showing far less memory usage in Process Explorer. I then reran the full build system and saw a typical reduction in memory of around 1Gb - a most welcome improvement!

Shake-0.4 and later contain that important exclamation mark, so an upgrade is strongly advised.

Does your Haskell program need an additional exclamation mark? Here are some you can use: !!!!!!!!!!!!!!


Unknown said...

An entertaining and informative read. Thanks for sharing!

Unknown said...

The way you outlined your general approach is very useful. It helped me chase a leak in my own code. Thank you!

Raoul Duke said...

This recount is absolutely classic.

Tom Ellis said...

Did you ever go back to this bug to find out whether it can be diagnosed with the technique in

Neil Mitchell said...

Tom: No, I didn't. I reintroduced the specific bug above, but it didn't reappear - I think the problem is that I only evaluate the call stack on the case of exceptions, and those are common in my test suite, but only for smaller examples. I would need a big example, which runs for a while churning threads, and than reached an error - not something I tend to test. Furthermore, at that point I have an exception, so it's not clear I'd get the "right" exception out anyway.