Summary: GHC rebuild times are slow. Using the Shake profiler and Hadrian we can find out why.
I just checked out GHC, and using Hadrian (the Shake-based build system), built GHC on Windows using:
hadrian\build.stack.bat -j --flavour=quickest --integer-simple --configure --profile
Namely use stack
to install dependencies and build Hadrian itself, then compile as quick as I can get it, on all CPUs (8 on my machine), run configure
for me and a profile report.html
output. After compiling Hadrian, 40m54s later I had a built GHC. That's not a quick process! Why did it take so long? If I bought a better machine would it go faster? How might we optimise GHC? These questions and more can be answered with Shake profiling.
Shake has had profiling for years, but in the recently-released Shake 0.17.7 I've overhauled it. The profile report is generated as a web page, and the generated output in the new version is smaller (2x smaller), loads faster (100x or more) and is more intuitive (not really a numeric thing). In the rest of this post I'll pepper some screenshots from the Shake profiler without thoughts about what it could mean. I've also uploaded the profile so you can play around with it:
Summary Page
The first page you see when opening the report is the summary.
This page gives us some basic stats. There was 1 run of the build system. It ran 3,645 traced actions (e.g. command line calls or other expensive actions) and there were 15,809 rules run (where a rule is something with dependency information - somewhere between one third to two thirds of those are likely to be files in typical build systems).
Turning to performance, the entire build, on one CPU would take 2h26m. The build on my 8 CPU machine took 40m45s, with on average 3.58 commands going at once (quite a bit less than the 8 I wanted). The critical path is about 37m14s, so that's the lower bound with infinite CPUs, so buying a machine with more CPUs won't really help (faster CPUs and a faster disk probably would though).
OK, so I'm now unhappy that GHC doesn't execute enough in parallel. So let's see what it does manage to do in parallel by switching to the Command Plot.
Command Plot
We now see a graph of what was executing at each point during the build. We see spikes in a hideous light blue for GHC, showing that when GHC gets going, it can get near to the 8 CPUs we requested. However, we see lots of periods of time with only 1 task executing. In most cases these are either sh
at the start (which I happen to know is configure
), or cabal-configure
(which is more obviously configure
). However, there are also Haskell blips where we get down to 1 CPU. I'm now increasingly convinced that the biggest problem Hadrian has (performance wise) is lack of parallelism. To confirm that, let's switch to the Parallelizability tab.
Parallelizability
This next tab predicts how long it will take to build Hadrian at various different thread counts. The green line is if there were no dependencies, the blue line is with the dependencies we have, and the yellow line is the difference. As we can see, at 8 CPU's the difference is about 16m - I could have had my GHC a whole 16m faster if we could parallelise GHC more. At the same time, it looks like the sweet spot for compiling GHC is currently around 6 CPUs - more doesn't make a huge amount of difference. How sad. To investigate let's jump to the Rules tab.
Rules
Now we've moved on from pretty graphs to tables of rules. The most interesting columns for performance work are Time (how long something took), ETime (how long it took if you only pay for the fraction of the computer you are using) and WTime (how long you were the only thing running for). The first is most relevant if you want to take less CPU, the second two if you aren't hitting the parallelism you are hoping for. Since we aren't hitting the parallelism, we can sort by WTime.
For WTime, if we eliminated that step, the total build would improve by that amount of time. Looking at the first two entries, which are the initial configure
and then configure
of the base
library, we see a total of 8m38s. If we could get rid of configure, or speed it up, or interleave it with other operations, we could save almost 10 minutes off a GHC build. Using the search bar we can figure out how much configure costs us in total.
Now we have used the search bar to filter to only rules that run the command cabal-configure
or sh
, and we've named them all in the group configure
(so it sums them up for us). We see we spend 15m18s configuring, and would go about 10m faster if we didn't - so it's expensive, and serialises the build a lot. I hate configure
.
Slow Stage0 Compilation
Ignoring configure
, the next slow things are building the stage0
compiler, so let's focus in on that.
While we can use the search bar for entering JavasScript expressions, we can equally just enter substrings. Let's delve into /compiler/
and sort by Time. We quickly see a bunch of stage0
and stage1
compiles, with HsInstances.o
and DynFlags.o
right at the top. Those files take a really long time to compile, and end up serialising the build quite a bit. However, it's a bit odd that we see stage0
, but a lot less of stage1
. Let's compare the two stages:
Now we're using a regular expression to pull out the .o
compiles in compiler
, and group them by their stage. We can see that both involve 1,527 compiles, but that stage0
takes 43m, while stage1
is under 18m. Why? Either we're using different sets of flags (e.g. compiling stage0
with higher optimisations or warnings), or GHC HEAD (the output of stage0
which we use to compile stage1
) is significantly faster than GHC 8.6.3 (which I used to compile stage0
). I've no idea, but tracking down the difference could save at least 7 minutes on the wall clock time of building GHC.
Conclusion
Compiling GHC is slow, but the biggest problem is it doesn't parallelise well. Using Shake profiling we've found that configure
and stage0
are the culprits. There's lots to be done, and hopefully a Summer of Code project too.
For quickest stage0 should be compiled with -O, while stage2 will be compiled using -O0.
ReplyDeleteA factor of 3 between the two doesn't seem unreasonable at a glance.
I wondered a few times if we could reduce serialization by scheduling tasks different. Half the compiler depends on DynFlags for example and it takes quite a while to build.
Seeing that DynFlags takes over a minute with `-O2` but only 12 seconds with `-O0`, I think we probably could get by with deactivating optimizations just for `DynFlags`.
ReplyDeleteI'm not sure we can do the same for HsInstances, though. Seems quite vital to GHC performance.
ReplyDeleteMaybe we should just profile and adjust the responsible optimisations instead.
ReplyDeletehttps://gitlab.haskell.org/ghc/ghc/issues/16463
ReplyDeleteTo find a good optimal set of optimisations, something like acovea (https://github.com/Acovea/libacovea) might be useful - would be interesting to find the set which optimises both compile time and compile time from the produced compiler.
ReplyDeleteHave you thought about generating a graph to visualize the dependencies? Such a graph could be very useful to identify bottlenecks in the dependency graph that inhibits parallelization.
ReplyDeleteJosef: Shake used to generate a GraphViz graph. Turned out it was nearly no use, because by the time the project gets big enough to require it, the graph is too huge to actually use. I suspect some graph visualisation might work, but it's not GraphViz.
ReplyDeleteThanks for this, very interesting.
ReplyDeleteDo you mean "rebuild" or "build"?
Sounds like you are building rather than rebuilding? :)
I assume the rebuild times are rather better. :)
Are the current results comparable to the old buildsystem?
@Jens: You're right, this is building. The rebuild times are almost instant (sub second), but unfortunately my system is currently broken and I can't give accurate measurements.
ReplyDeleteI don't have any useful comparison to the old system.