Sunday, February 17, 2019

Quadratic "deriving Generic" Compile Times

Summary: For large data types, deriving Generic can take a long time to compile.

I was building GHC using Hadrian, and part of that process involves compiling Cabal multiple times - once to build Hadrian itself, and then once per GHC stage thereafter (typically 3 stages). Compiling Cabal takes quite a long time, but one thing I spotted was that compiling the LicenseId file alone was taking 30 seconds! Given that it seemed to be on the critical path, and usually gets compiled 4 times, that seemed worth investigating.

Looking at the file, it defines an enumeration with 354 license types in it. Pulling apart the file, I found that with only the data type and deriving Generic it was taking about 9 seconds without optimisation or 22 seconds with optimisation, using GHC 8.6.3. I then wrote a test program that generated and compiled instances of the program:

{-# LANGUAGE DeriveGeneric #-}
module Sample where
import GHC.Generics
data Sample = Ctor1 | Ctor2 | ...
   deriving Generic

The resulting graph shows that compilation time is quadratic in the number of constructors:


This phenomenon could either be because the deriving Generic itself is quadratic, or more likely that the output generated by deriving Generic provokes quadratic behaviour in some subsequent part of the compiler. I'm going to report this as a GHC bug.

Update: There's already GHC bug 5642 which explains that it might just be a fundamental property of how Generic instances look.

For Cabal the fix might be somewhat simpler - the deriving Generic is only used to define a Binary instance, which can be written in 2 lines for an Enum anyway - discussion on the issue tracker. Hopefully this change will speed up compiling Cabal, and thus speed up compiling GHC.

As I watched the GHC output go by I noticed that both Parser and DynFlags in GHC itself were also very slow, with the latter taking nearly 2 minutes. I'm sure there's more scope for optimising GHC builds, so Andrey Mokhov and I have proposed a Google Summer of Code project to tackle other low-hanging fruit.

The quick program I wrote for generating the data in this blog post is below:

import Control.Monad
import Data.List
import System.Time.Extra
import System.Process.Extra
import Numeric.Extra

main = do
    forM_ [1..] $ \n -> do
        writeFile "Sample.hs" $ unlines $
            ["{-# LANGUAGE DeriveGeneric #-}"
            ,"module Sample where"
            ,"import GHC.Generics"
            ,"data Sample = " ++ intercalate " | " ["Ctor" ++ show i | i <- [1..n]]
            ,"   deriving Generic"
            ]
        xs <- forM [0..2] $ \o ->
            fmap fst $ duration $ systemOutput_ $ "ghc -fforce-recomp Sample.hs -O" ++ show o
        putStrLn $ unwords $ show n : map (showDP 2) xs

Tuesday, February 05, 2019

Announcing ghc-lib

On behalf of Digital Asset I'm delighted to announce ghc-lib, a repackaging of the GHC API to allow it to be used on different GHC versions. The GHC API allows you use the GHC compiler as a library, so you can parse, analyze and compile Haskell code.

The GHC API comes pre-installed with GHC, and is tied to that GHC version - if you are using GHC 8.6.3, you get version 8.6.3 of the API, and can't change it. The ghc-lib package solves that problem, letting you mix and match versions of the GHC compiler and GHC API. Why might you want that?

  • Imagine you are writing a tool to work with several versions of the GHC compiler. The GHC API changes significantly between each version, so doing this would require writing a lot of C preprocessor code to support it. An alternative is to use one version of ghc-lib which works across multiple versions of GHC.
  • Imagine you are modifying the GHC API or want features from GHC HEAD. With ghc-lib you can depend on the revised GHC API, without upgrading the compiler used to build everything, speeding up iteration.

While ghc-lib provides the full GHC API, it doesn't contain a runtime system, nor does it create a package database. That means you can't run code produced by ghc-lib (no runtime), and compiling off-the-shelf code is very hard (no package database containing the base library). What you can do:

The package ghc-lib is released on Hackage, and can be used like any normal package, e.g. cabal install ghc-lib. Since ghc-lib conflicts perfectly with the GHC API and template-haskell, you may wish to ghc-pkg hide ghc-lib and use the language extension PackageImports to do import "ghc-lib" GHC. There will be two release streams within the ghc-lib name:

  • Version 8.8.1 will be the version of ghc-lib produced against the released GHC 8.8.1, once GHC 8.8.1 is released. There is no release against GHC 8.6.3 because we had to make changes to GHC to enable ghc-lib, which were only upstreamed in the last few months.
  • Version 0.20190204 is the version of ghc-lib using GHC HEAD on the date 2019-02-04.

We've been developing and using ghc-lib internally at Digital Asset for the last six months. The use of ghc-lib has enabled us to track GHC HEAD for certain projects, and develop improvements to GHC itself, and then integrate them without requiring us to rebuild all Haskell source code on every step. Smoothing out that development loop has been a massive productivity boon to us.

While this is Digital Asset's first open source project in a while, we have been making lots of contributions behind the scenes - it's no coincidence several of my recent posts involve my colleague Shayne. In particular, our engineers have been behind several GHC proposals.

While I'm announcing the project, much of the work has been done by Shayne Fletcher and Jussi Maki, but neither of them have active blogs just yet!