HLint is a tool for suggesting improvements to Haskell code. Recently I've put some effort in to optimisation and HLint is now over 20 times faster. The standard benchmark (running HLint over the source code of HLint) has gone from 30 seconds to just over 1 second. This blog post is the story of that optimisation, the dead ends I encountered, and the steps I took. I've deliberately included reasonable chunks of code in this post, so interested readers can see the whole story - less technical readers should feel free to skip them. The results of the optimisation are all available on Hackage, as new versions of hlint, uniplate and derive.
Before I start, I'd like to share my five guiding principles of optimisation:
- Tests - make sure you have tests, so you don't break anything while optimising.
- Profile - if you don't profile first, you are almost certainly optimising the wrong thing.
- Necessity - only start the optimisation process if something is running too slowly.
- Benchmark - use a sensible and representative test case to benchmark, to make sure you optimise the right thing.
- Optimising - to make a function faster, either call it less, or write it better.
Below are the steps in the optimisation, along with their speed improvement.
Special support for Rational in Uniplate.Data, 30s to 10s

HLint uses Uniplate extensively. HLint works over the large abstract syntax trees of the haskell-src-exts (HSE) library, so a generics library is essential. There are two main variants of Uniplate - Data builds on the Scrap Your Boilerplate (SYB) instances, and Direct requires special instances. HLint uses the Data variant, as it requires no instances to be written.
One of the advantages of Uniplate is that it generally outperforms most generics libraries. In particular, the variant written on top of Data instances is often many times faster than using SYB directly. The reason for outperforming SYB is documented in my PhD thesis. The essential idea is that Uniplate builds a "hit table", a mapping noting which types can be contained within which other types - e.g. there is potentially an Int inside Either String [Int], but there isn't an Int inside String. By consulting this mapping while traversing a value, Uniplate is able to skip large portions, leading to an efficiency improvement.
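The containment relation behind the hit table can be probed with the generic machinery in Data.Data alone. The following sketch (my illustration, not Uniplate's actual implementation) lists the immediate child types of a value - transitively closing this relation over all constructors is the raw material for such a hit table:

```haskell
import Data.Data (Data, gmapQ)
import Data.Typeable (TypeRep, typeOf)

-- The types of the immediate children of a value. For a hit table you
-- would close this relation transitively over every constructor.
childTypes :: Data a => a -> [TypeRep]
childTypes = gmapQ typeOf

main :: IO ()
main = do
    -- Right holds one child of type [Int]
    print (childTypes (Right [1, 2, 3 :: Int] :: Either String [Int]))
    -- A non-empty String is a cons cell: a Char and a [Char] tail
    print (childTypes "abc")
```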
When computing the hit table it is necessary for Uniplate to create dummy values of each type, which it then traverses using SYB functions. To create dummy values Uniplate uses undefined. Unfortunately, given the definition data Foo = Foo !Int, the value Foo undefined will be forced due to the strictness annotation, and the code will raise an error - as described in bug 243. Uniplate 1.2 had a special case for Rational, which is the only type with strict components contained within HSE. Uniplate 1.3 fixed this problem more generally by catching the exception and turning off the hit table facility for such types. Unfortunately this caused Uniplate 1.3 to turn off the hit table for HSE, making HLint run three times slower.
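The failure mode is easy to reproduce with base alone. A minimal sketch of the strict-field problem, together with the catch-the-exception approach that Uniplate 1.3 adopted:

```haskell
import Control.Exception (SomeException, evaluate, try)

data Foo = Foo !Int

main :: IO ()
main = do
    -- Evaluating Foo undefined to weak head normal form forces the
    -- strict Int field, so the undefined dummy value raises an exception.
    r <- try (evaluate (Foo undefined)) :: IO (Either SomeException Foo)
    putStrLn $ case r of
        Left _  -> "strict field forced the dummy value"
        Right _ -> "dummy value survived"
```

Without the strictness annotation on the field, the Right branch would be taken instead, which is why undefined works as a dummy for most HSE types.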
The fix was simple and pragmatic, but not general. In Uniplate 1.4 I reinstated the special case for Rational, so now HLint makes use of the hit table, and goes three times faster. A more general solution would be to manufacture dummy values for certain types (it's usually an Int or Integer that is a strict component), or to create concrete dummy values using SYB. It's interesting to observe that if HLint used SYB as its generics library, it would not be using the hit table trick, and would run three times slower.
Use Uniplate.Direct, 10s to 5s, reverted

Uniplate also provides a Direct version, which performs better, but requires instances to be written. In order to further improve the speed of HLint I decided to try the Direct version. The biggest hurdle to using the Direct version is that many instances need to be generated; in the case of HLint it required 677. The first step was to modify the Derive tool to generate these instances (which Derive 2.1 now does), and to write a small script to decide which instances were necessary. With these instances in place, the time dropped to 5 seconds.
Unfortunately, the downside was that compilation time skyrocketed, and the instances are very specific to a particular version of HSE. While these problems are not insurmountable, I did not consider the benefit to be worthwhile, so reverted the changes. It's worth pointing out that most users of Uniplate won't require so many instances to use the Direct versions, and that a program can be switched between Direct and Data versions without any code changes (just a simple import). I also considered the possibility of discovering which Uniplate instances dominated and using the Direct method only for those - but I did not investigate further.
Add and optimise eqExpShell, 10s to 8s

The next step was to profile. I compiled and ran HLint with the following options:
ghc --make -O1 -prof -auto-all -o hlint
hlint src +RTS -p
Looking at the profiling output, I saw that the function unify took up over 60% of the execution time:
unify :: NameMatch -> Exp -> Exp -> Maybe [(String,Exp)]
unify nm (Do xs) (Do ys) | length xs == length ys = concatZipWithM (unifyStmt nm) xs ys
unify nm (Lambda _ xs x) (Lambda _ ys y) | length xs == length ys = liftM2 (++) (unify nm x y) (concatZipWithM unifyPat xs ys)
unify nm x y | isParen x || isParen y = unify nm (fromParen x) (fromParen y)
unify nm (Var (fromNamed -> v)) y | isUnifyVar v = Just [(v,y)]
unify nm (Var x) (Var y) | nm x y = Just []
unify nm x y | ((==) `on` descend (const $ toNamed "_")) x y = concatZipWithM (unify nm) (children x) (children y)
unify nm _ _ = Nothing
The function unify is an essential part of the rule matching in HLint. It attempts to compare a rule to a target expression, and if successful returns the correct substitution. Each rule is compared to every expression in all files, which means that unify is called millions of times in a normal HLint run. Looking closely, my first suspicion was the guard on the second line from the bottom - the call to descend and (==). This line compares the outer shell of two expressions, ignoring any inner expressions. It first uses the Uniplate descend function to insert a dummy value as each subexpression, then compares for equality. To test my hypothesis that this method was indeed the culprit I extracted it to a separate function, and modified unify to call it:
eqExpShell :: Exp_ -> Exp_ -> Bool
eqExpShell = (==) `on` descend (const $ toNamed "_")
I reran the profiling, and now all the time was being spent in eqExpShell. My first thought was to expand out the function to not use Uniplate. I quickly rejected that idea - there are expressions within statements and patterns, and to follow all the intricacies of HSE would be fragile and verbose.
The first optimisation I tried was to replace toNamed "_", the dummy expression, with something simpler. The toNamed call expands to many constructors, so instead I used Do an [] (an is a dummy source location), which is the simplest expression HSE provides. This change had a noticeable speed improvement.
eqExpShell :: Exp_ -> Exp_ -> Bool
eqExpShell = (==) `on` descend (const $ Do an [])
My second thought was to add a quick test, so that if the outer constructors were not equal then the expensive test was not tried. Determining the outer constructor of a value can be done by calling show then only looking at the first word (assuming a sensible Show instance, which HSE has).
eqExpShell :: Exp_ -> Exp_ -> Bool
eqExpShell x y =
    ((==) `on` constr) x y &&
    ((==) `on` descend (const $ Do an [])) x y
    where constr = takeWhile (not . isSpace) . show
This change had a bigger speed improvement. I found that of the 1.5 million times eqExpShell was called, the quick equality test rejected 1 million cases.
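The show-based quick test is easy to try in isolation. A sketch of the trick, which assumes a derived-style Show instance that prints the constructor name first:

```haskell
import Data.Char (isSpace)

-- Outer constructor name, read off the Show output. Fragile for types
-- with custom Show instances, but fine for derived ones like HSE's.
constr :: Show a => a -> String
constr = takeWhile (not . isSpace) . show

main :: IO ()
main = do
    putStrLn (constr (Just (1 :: Int)))        -- Just
    putStrLn (constr (Nothing :: Maybe Int))   -- Nothing
    -- Same shell, different contents: the quick test does not reject
    print (constr (Just 'a') == constr (Just 'b'))  -- True
```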
My next thought was to try replacing constr with the SYB function toConstr. There was no noticeable performance impact, but the code is neater, and doesn't rely on the Show instance, so I stuck with it. After all these changes HLint was 2 seconds faster, but eqExpShell was still the biggest culprit on the profiling report.
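toConstr comes from Data.Data in base, so the neater version needs no Show instance at all. A small sketch of the comparison it enables:

```haskell
import Data.Data (toConstr)

main :: IO ()
main = do
    -- Same outer constructor, different contents: constructors compare equal.
    print (toConstr (Just 'x') == toConstr (Just 'y'))
    -- Different outer constructors: the quick test rejects immediately.
    print (toConstr (Just 'x') == toConstr (Nothing :: Maybe Char))
```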
Write eqExpShell entirely in SYB, 8s to 8s, reverted

My next thought was to rewrite eqExpShell entirely in SYB functions, not using the Eq instance at all. The advantages of this approach would be that I can simply disregard all subexpressions, I only walk the expression once, and I can skip source position annotations entirely. Starting from the geq function in SYB, I came up with:
data Box = forall a . Data a => Box a

eqExpShellSYB :: Exp_ -> Exp_ -> Bool
eqExpShellSYB = f
    where
        f :: (Data a, Data b) => a -> b -> Bool
        f x y = toConstr x == toConstr y &&
                and (zipWith g (gmapQ Box x) (gmapQ Box y))
        g (Box x) (Box y) = tx == typeAnn || tx == typeExp || f x y
            where tx = typeOf x
                  typeAnn = typeOf an
                  typeExp = typeOf ex
Unfortunately, this code takes exactly the same time as the previous version, despite being significantly more complex. My guess is that toConstr is not as fast as the Eq instance, and that this additional overhead negates all the other savings. I decided to revert back to the simpler version.
Call eqExpShell less, 8s to 4s

Having failed to optimise eqExpShell further, I then thought about how to call it less. I added a trace and found that of the 1.5 million calls, 1.3 million times at least one of the constructors was an App. Application is very common in Haskell programs, so this is not particularly surprising. By looking back at the code for unify I found several other constructors were already handled, so I added a special case for App.
unify nm (App _ x1 x2) (App _ y1 y2) = liftM2 (++) (unify nm x1 y1) (unify nm x2 y2)
It is easy to show that if an App (or any specially handled constructor) is passed as either argument to eqExpShell then the result will be False, since if both shells had been equal a previous case would have matched. Taking advantage of this observation, I rewrote the line with the generic match as:
unify nm x y | isOther x && isOther y && eqExpShell x y = concatZipWithM (unify nm) (children x) (children y)
    where
        isOther Do{} = False
        isOther Lambda{} = False
        isOther Var{} = False
        isOther App{} = False
        isOther _ = True
With this change the eqExpShell function was called substantially less; it disappeared from the profile, and the speed improved to 4 seconds.
Fix Uniplate bug, 4s to 1.3s

The next step was to rerun the profiling. However, the results were very confusing - almost 70% of the execution time was attributed to three CAFs, while I could see no obvious culprit. I reran the profiler with the -caf-all flag to more precisely report the location of the CAFs, and was again confused - the optimiser had rearranged the functions to make -caf-all useless. I then reran the profiler with optimisation turned off using -O0 and looked again. This time the profiling clearly showed Uniplate as the source of the CAFs. The hit table that Uniplate creates is stored inside a CAF, so was an obvious candidate.
Turning back to Uniplate, I attempted to reproduce the bug outside HLint. I enhanced the benchmarking suite by adding a method to find all the Strings inside very small HSE values. The standard Uniplate benchmarks test the performance of running code; I had neglected to check the creation of the hit table, assuming it to be negligible.
testN "Module" $ Module ssi Nothing [] [] []
testN "Decl" $ FunBind ssi []
testN "Exp" $ Do ssi []
testN "Pat" $ PWildCard ssi
testN :: Biplate a String => String -> a -> IO ()
testN msg x = do
    t <- timer $ evaluate $ length (universeBi x :: [String])
    putStrLn $ "HSE for " ++ msg ++ " takes " ++ dp2 t ++ "s"
My initial worry was that at 2 decimal places I was likely to see 0.00 for all values. However, that turned out not to be a problem! What I saw was:
HSE for Module takes 0.54
HSE for Decl takes 0.86
HSE for Exp takes 2.54
HSE for Pat takes 0.32
These results surprised me. In particular, the hit table from Exp to String is a subset of the Module one, so should always be faster to compute. The computation of the hit table is reasonably complex, and I was unable to isolate the problem. The tricky part of the hit table is that it is necessary to take the fixed point of the transitive closure of reachability - I had tried to keep track of recursive types and reach a fixed point with the minimal number of recomputations. I clearly failed, and probably had an exponential aspect to the algorithm that under certain circumstances caused ridiculously bad behaviour.
Rather than try to diagnose the bug, I decided instead to rethink the approach, and simplify the design. In particular, the fixed point of the transitive closure is now written as:
fixEq trans (populate box)
Where populate finds the immediate children of a constructor, trans takes the transitive closure based on the current state, and fixEq takes the fixed point. Using this new simpler design I was also able to compute which types were contained within which other types recursively, and cache the result, meaning that computing the hit table for Module now also computes the hit table for Exp, in a way that lets the result be reused when asking about Exp directly. After rewriting the code I reran the benchmark:
HSE for Module takes 0.03
HSE for Decl takes 0.00
HSE for Exp takes 0.00
HSE for Pat takes 0.00
I had hoped that the rewrite would fix the performance problems, and it did. I have not diagnosed the original performance bug, but at 0.03 seconds I was satisfied. I have now released Uniplate 1.5 with the revised hit table code. With this change, the time for HLint drops to 1.3 seconds and all the CAF's went away from the profile.
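A plausible definition of fixEq (my sketch; Uniplate's actual code may differ) iterates a function until its result stops changing. Here it drives a toy transitive closure over a made-up reachability relation between type names:

```haskell
-- Iterate until a fixed point: the shape used for the transitive closure.
fixEq :: Eq a => (a -> a) -> a -> a
fixEq f x = let x2 = f x in if x == x2 then x else fixEq f x2

-- One closure step: add every pair reachable via one more edge.
-- (Illustrative only; a real implementation would use sets, not lists.)
step :: [(String, String)] -> [(String, String)] -> [(String, String)]
step edges cur = foldr add cur [(a, c) | (a, b) <- cur, (b', c) <- edges, b == b']
    where add e es = if e `elem` es then es else e : es

main :: IO ()
main = do
    let edges = [("Module", "Decl"), ("Decl", "Exp"), ("Exp", "Exp")]
    -- The closure adds ("Module", "Exp"), giving 4 pairs in total.
    print (length (fixEq (step edges) edges))
```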
Sharing computation, 1.3s to 1.2s

At this point, I was happy to finish, but decided to profile just one last time. The top function in the list was matchIdea:
matchIdea :: NameMatch -> Setting -> Exp_ -> Maybe Exp_
matchIdea nm MatchExp{lhs=lhs,rhs=rhs,side=side} x = do
    u <- unify nm (fmap (const an) lhs) (fmap (const an) x)
    u <- check u
    guard $ checkSide side u
    let rhs2 = subst u rhs
    guard $ checkDot lhs rhs2
    return $ unqualify nm $ dotContract $ performEval rhs2
This function first strips both the rule (lhs) and the expression (x) of their source position information, to ensure equality works correctly. However, both the rules and the expressions are reused multiple times, so I moved the fmap calls backwards so that each rule/expression is only ever stripped of its source position information once. With this change the runtime was reduced to 1.2 seconds.
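The same move-the-work-outwards trick applies anywhere a normalisation step is repeated inside a matching loop. An illustrative sketch (not HLint's actual code), using case-insensitive matching as a stand-in for stripping source positions:

```haskell
import Data.Char (toLower)

-- Before: each rule/candidate pair is normalised on every comparison.
matchesSlow :: [String] -> [String] -> Int
matchesSlow rules xs =
    length [() | r <- rules, x <- xs, map toLower r == map toLower x]

-- After: normalise each side once up front, mirroring how the fmap
-- stripping was hoisted out of matchIdea's hot path.
matchesFast :: [String] -> [String] -> Int
matchesFast rules xs =
    length [() | r <- rules', x <- xs', r == x]
    where rules' = map (map toLower) rules
          xs'    = map (map toLower) xs

main :: IO ()
main = do
    let rules = ["Map", "Filter"]
        xs    = ["map", "fold", "FILTER", "filter"]
    print (matchesSlow rules xs)  -- both versions agree on the count
    print (matchesFast rules xs)
```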
Final Results

After all that effort, I reran the profile results and got:

COST CENTRE      MODULE           %time  %alloc
parseString      HSE.All           16.2    17.7
matchIdea        Hint.Match        14.4     1.6
eqExpShellSlow   HSE.Eq             9.2    11.4
listDecl         Hint.List          6.1     4.1
lambdaHint       Hint.Lambda        5.1     5.6
bracketHint      Hint.Bracket       4.1     6.4
hasS             Hint.Extensions    4.1     5.0
monadHint        Hint.Monad         4.1     2.9
~=               HSE.Match          4.1     2.5
isParen          HSE.Util           3.1     0.0
Now the largest contributor to the HLint runtime is the parsing of Haskell files. There are no obvious candidates for easy optimisations, and the code runs sufficiently fast for my purposes.
Conclusions

There is still some scope for optimisation of HLint, but I leave that for future work. One possible avenue for exploration would be turning on selected packages of hints, to see which ones take the most time - profiling on a different measure.
In optimising HLint I've found two issues in Uniplate, the first of which I was aware of, and the second of which came as a total surprise. These optimisations to Uniplate will benefit everyone who uses it. I have achieved the goal of optimising HLint, simply by following the profile reports, and as a result HLint is now substantially faster than I had ever expected.
Footnote: I didn't actually profile first, as I knew that a performance regression was caused by the upgrade to Uniplate 1.3, so knew where to start looking. Generally, I would start with a profile.