Monday, March 18, 2019

GHC Rebuild Times - Shake profiling

Summary: GHC rebuild times are slow. Using the Shake profiler and Hadrian we can find out why.

I just checked out GHC, and using Hadrian (the Shake-based build system), built GHC on Windows using:

hadrian\build.stack.bat -j --flavour=quickest --integer-simple --configure --profile

Namely use stack to install dependencies and build Hadrian itself, then compile as quick as I can get it, on all CPUs (8 on my machine), run configure for me and a profile report.html output. After compiling Hadrian, 40m54s later I had a built GHC. That's not a quick process! Why did it take so long? If I bought a better machine would it go faster? How might we optimise GHC? These questions and more can be answered with Shake profiling.

Shake has had profiling for years, but in the recently-released Shake 0.17.7 I've overhauled it. The profile report is generated as a web page, and the generated output in the new version is smaller (2x smaller), loads faster (100x or more) and is more intuitive (not really a numeric thing). In the rest of this post I'll pepper some screenshots from the Shake profiler without thoughts about what it could mean. I've also uploaded the profile so you can play around with it:

Hadrian Profile

Summary Page

The first page you see when opening the report is the summary.

This page gives us some basic stats. There was 1 run of the build system. It ran 3,645 traced actions (e.g. command line calls or other expensive actions) and there were 15,809 rules run (where a rule is something with dependency information - somewhere between one third to two thirds of those are likely to be files in typical build systems).

Turning to performance, the entire build, on one CPU would take 2h26m. The build on my 8 CPU machine took 40m45s, with on average 3.58 commands going at once (quite a bit less than the 8 I wanted). The critical path is about 37m14s, so that's the lower bound with infinite CPUs, so buying a machine with more CPUs won't really help (faster CPUs and a faster disk probably would though).

OK, so I'm now unhappy that GHC doesn't execute enough in parallel. So let's see what it does manage to do in parallel by switching to the Command Plot.

Command Plot

We now see a graph of what was executing at each point during the build. We see spikes in a hideous light blue for GHC, showing that when GHC gets going, it can get near to the 8 CPUs we requested. However, we see lots of periods of time with only 1 task executing. In most cases these are either sh at the start (which I happen to know is configure), or cabal-configure (which is more obviously configure). However, there are also Haskell blips where we get down to 1 CPU. I'm now increasingly convinced that the biggest problem Hadrian has (performance wise) is lack of parallelism. To confirm that, let's switch to the Parallelizability tab.

Parallelizability

This next tab predicts how long it will take to build Hadrian at various different thread counts. The green line is if there were no dependencies, the blue line is with the dependencies we have, and the yellow line is the difference. As we can see, at 8 CPU's the difference is about 16m - I could have had my GHC a whole 16m faster if we could parallelise GHC more. At the same time, it looks like the sweet spot for compiling GHC is currently around 6 CPUs - more doesn't make a huge amount of difference. How sad. To investigate let's jump to the Rules tab.

Rules

Now we've moved on from pretty graphs to tables of rules. The most interesting columns for performance work are Time (how long something took), ETime (how long it took if you only pay for the fraction of the computer you are using) and WTime (how long you were the only thing running for). The first is most relevant if you want to take less CPU, the second two if you aren't hitting the parallelism you are hoping for. Since we aren't hitting the parallelism, we can sort by WTime.

For WTime, if we eliminated that step, the total build would improve by that amount of time. Looking at the first two entries, which are the initial configure and then configure of the base library, we see a total of 8m38s. If we could get rid of configure, or speed it up, or interleave it with other operations, we could save almost 10 minutes off a GHC build. Using the search bar we can figure out how much configure costs us in total.

Now we have used the search bar to filter to only rules that run the command cabal-configure or sh, and we've named them all in the group configure (so it sums them up for us). We see we spend 15m18s configuring, and would go about 10m faster if we didn't - so it's expensive, and serialises the build a lot. I hate configure.

Slow Stage0 Compilation

Ignoring configure, the next slow things are building the stage0 compiler, so let's focus in on that.

While we can use the search bar for entering JavasScript expressions, we can equally just enter substrings. Let's delve into /compiler/ and sort by Time. We quickly see a bunch of stage0 and stage1 compiles, with HsInstances.o and DynFlags.o right at the top. Those files take a really long time to compile, and end up serialising the build quite a bit. However, it's a bit odd that we see stage0, but a lot less of stage1. Let's compare the two stages:

Now we're using a regular expression to pull out the .o compiles in compiler, and group them by their stage. We can see that both involve 1,527 compiles, but that stage0 takes 43m, while stage1 is under 18m. Why? Either we're using different sets of flags (e.g. compiling stage0 with higher optimisations or warnings), or GHC HEAD (the output of stage0 which we use to compile stage1) is significantly faster than GHC 8.6.3 (which I used to compile stage0). I've no idea, but tracking down the difference could save at least 7 minutes on the wall clock time of building GHC.

Conclusion

Compiling GHC is slow, but the biggest problem is it doesn't parallelise well. Using Shake profiling we've found that configure and stage0 are the culprits. There's lots to be done, and hopefully a Summer of Code project too.

10 comments:

Andreas said...

For quickest stage0 should be compiled with -O, while stage2 will be compiled using -O0.

A factor of 3 between the two doesn't seem unreasonable at a glance.

I wondered a few times if we could reduce serialization by scheduling tasks different. Half the compiler depends on DynFlags for example and it takes quite a while to build.

Unknown said...

Seeing that DynFlags takes over a minute with `-O2` but only 12 seconds with `-O0`, I think we probably could get by with deactivating optimizations just for `DynFlags`.

Sebastian Graf said...

I'm not sure we can do the same for HsInstances, though. Seems quite vital to GHC performance.

Sebastian Graf said...

Maybe we should just profile and adjust the responsible optimisations instead.

Sebastian Graf said...

https://gitlab.haskell.org/ghc/ghc/issues/16463

Axman6 said...

To find a good optimal set of optimisations, something like acovea (https://github.com/Acovea/libacovea) might be useful - would be interesting to find the set which optimises both compile time and compile time from the produced compiler.

Josef said...

Have you thought about generating a graph to visualize the dependencies? Such a graph could be very useful to identify bottlenecks in the dependency graph that inhibits parallelization.

Neil Mitchell said...

Josef: Shake used to generate a GraphViz graph. Turned out it was nearly no use, because by the time the project gets big enough to require it, the graph is too huge to actually use. I suspect some graph visualisation might work, but it's not GraphViz.

Jens Petersen said...

Thanks for this, very interesting.

Do you mean "rebuild" or "build"?
Sounds like you are building rather than rebuilding? :)

I assume the rebuild times are rather better. :)

Are the current results comparable to the old buildsystem?

Neil Mitchell said...

@Jens: You're right, this is building. The rebuild times are almost instant (sub second), but unfortunately my system is currently broken and I can't give accurate measurements.

I don't have any useful comparison to the old system.