Sunday, February 17, 2019

Quadratic "deriving Generic" Compile Times

Summary: For large data types, deriving Generic can take a long time to compile.

I was building GHC using Hadrian, and part of that process involves compiling Cabal multiple times - once to build Hadrian itself, and then once per GHC stage thereafter (typically 3 stages). Compiling Cabal takes quite a long time, but one thing I spotted was that compiling the LicenseId file alone was taking 30 seconds! Given that it seemed to be on the critical path, and usually gets compiled 4 times, that seemed worth investigating.

Looking at the file, it defines an enumeration with 354 license types in it. Pulling apart the file, I found that with only the data type and deriving Generic it was taking about 9 seconds without optimisation or 22 seconds with optimisation, using GHC 8.6.3. I then wrote a test program that generated and compiled instances of the program:

{-# LANGUAGE DeriveGeneric #-}
module Sample where
import GHC.Generics
data Sample = Ctor1 | Ctor2 | ...
   deriving Generic

The resulting graph shows that compilation time is quadratic in the number of constructors:


This phenomenon could either be because the deriving Generic itself is quadratic, or more likely that the output generated by deriving Generic provokes quadratic behaviour in some subsequent part of the compiler. I'm going to report this as a GHC bug.

Update: There's already GHC bug 5642 which explains that it might just be a fundamental property of how Generic instances look.

For Cabal the fix might be somewhat simpler - the deriving Generic is only used to define a Binary instance, which can be written in 2 lines for an Enum anyway - discussion on the issue tracker. Hopefully this change will speed up compiling Cabal, and thus speed up compiling GHC.

As I watched the GHC output go by I noticed that both Parser and DynFlags in GHC itself were also very slow, with the latter taking nearly 2 minutes. I'm sure there's more scope for optimising GHC builds, so Andrey Mokhov and I have proposed a Google Summer of Code project to tackle other low-hanging fruit.

The quick program I wrote for generating the data in this blog post is below:

import Control.Monad
import Data.List
import System.Time.Extra
import System.Process.Extra
import Numeric.Extra

main = do
    forM_ [1..] $ \n -> do
        writeFile "Sample.hs" $ unlines $
            ["{-# LANGUAGE DeriveGeneric #-}"
            ,"module Sample where"
            ,"import GHC.Generics"
            ,"data Sample = " ++ intercalate " | " ["Ctor" ++ show i | i <- [1..n]]
            ,"   deriving Generic"
            ]
        xs <- forM [0..2] $ \o ->
            fmap fst $ duration $ systemOutput_ $ "ghc -fforce-recomp Sample.hs -O" ++ show o
        putStrLn $ unwords $ show n : map (showDP 2) xs

No comments: