Saturday, January 23, 2010

Optimising HLint

HLint is a tool for suggesting improvements to Haskell code. Recently I've put some effort into optimisation and HLint is now over 20 times faster. The standard benchmark (running HLint over the source code of HLint) has gone from 30 seconds to just over 1 second. This blog post is the story of that optimisation, the dead ends I encountered, and the steps I took. I've deliberately included reasonable chunks of code, so interested readers can see the whole story - less technical readers should feel free to skip them. The results of the optimisation are all available on Hackage, as new versions of hlint, uniplate and derive.

Before I start, I'd like to share my five guiding principles of optimisation:

  • Tests - make sure you have tests, so you don't break anything while optimising.

  • Profile - if you don't profile first, you are almost certainly optimising the wrong thing.

  • Necessity - only start the optimisation process if something is running too slowly.

  • Benchmark - use a sensible and representative test case to benchmark, to make sure you optimise the right thing.

  • Optimising - to make a function faster, either call it less, or write it better.

Below are the steps in the optimisation, along with their speed improvement.

Special support for Rational in Uniplate.Data, 30s to 10s

HLint uses Uniplate extensively. HLint works over large abstract syntax trees, from the library haskell-src-exts (HSE), so a generics library is essential. There are two main variants of Uniplate - Data builds on the Scrap Your Boilerplate (SYB) instances, and Direct requires special instances. HLint uses the Data variant, as it requires no instances to be written.

One of the advantages of Uniplate is that it generally outperforms most generics libraries. In particular, the variant written on top of Data instances is often many times faster than using SYB directly. The reason for outperforming SYB is documented in my PhD thesis. The essential idea is that Uniplate builds a "hit table", a mapping noting which types can be contained within which other types - e.g. that there is potentially an Int inside Either String [Int], but there isn't an Int inside String. By consulting this mapping while traversing a value, Uniplate is able to skip large portions, leading to an efficiency improvement.
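To make the contrast concrete, here is a naive SYB-style query written against base's Data.Data (my own sketch, not Uniplate's implementation). It visits every node of the value, whereas the hit table lets Uniplate prune whole subtrees - such as the String below - that can never contain an Int:

```haskell
import Data.Data (Data, gmapQ)
import Data.Typeable (cast)

-- A naive generic query: test every node for Int, then recurse into all
-- children. A hit table would skip the String subtree entirely, since a
-- String can never contain an Int.
allInts :: Data a => a -> [Int]
allInts x = here ++ concat (gmapQ allInts x)
    where here = maybe [] (:[]) (cast x)
```

Calling allInts (("abc", [1,2,3]) :: (String, [Int])) returns [1,2,3], but only after also walking every cons cell and character of "abc" - the work the hit table avoids.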

When computing the hit table it is necessary for Uniplate to create dummy values of each type, which it then traverses using SYB functions. To create dummy values Uniplate uses undefined. Unfortunately, given the definition data Foo = Foo !Int, the value Foo undefined will be forced due to the strictness annotation, and the code will raise an error - as described in bug 243. Uniplate 1.2 had a special case for Rational, which is the only type with strict components contained within HSE. Uniplate 1.3 fixed this problem more generally by catching the exception and turning off the hit table facility for such types. Unfortunately this caused Uniplate 1.3 to turn off the hit table for HSE, making HLint run three times slower.
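The failure mode is easy to reproduce with base alone - this sketch (my own, not Uniplate's code) shows that merely evaluating Foo undefined to weak head normal form triggers the error:

```haskell
import Control.Exception (SomeException, evaluate, try)

data Foo = Foo !Int

-- Matching on the constructor forces the value to WHNF, and the
-- strictness annotation then forces the undefined field.
shell :: Foo -> Bool
shell Foo{} = True

probe :: IO (Either SomeException Bool)
probe = try (evaluate (shell (Foo undefined)))
```

With the strict field, probe returns Left (the error from undefined); with a lazy field (data Foo = Foo Int) it would return Right True, because the field is never demanded.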

The fix was simple and pragmatic, but not general. In Uniplate 1.4 I reinstated the special case for Rational, so now HLint makes use of the hit table, and goes three times faster. A more general solution would be to manufacture dummy values for certain types (it's usually an Int or Integer that is a strict component), or to create concrete dummy values using SYB. It's interesting to observe that if HLint used SYB as its generics library, it would not be using the hit table trick, and would run three times slower.

Use Uniplate.Direct, 10s to 5s, reverted

Uniplate also provides a Direct version, which performs better but requires instances to be written. In order to further improve the speed of HLint I decided to try the Direct version. The biggest hurdle to using the Direct version is that many instances need to be generated - in the case of HLint, 677 of them. The first step was to modify the Derive tool to generate these instances (which Derive 2.1 now does), and to write a small script to decide which instances were necessary. With these instances in place, the time dropped to 5 seconds.

Unfortunately, the downside was that compilation time skyrocketed, and the instances are very specific to a particular version of HSE. While these problems are not insurmountable, I did not consider the benefit to be worthwhile, so reverted the changes. It's worth pointing out that most users of Uniplate won't require so many instances to use the Direct versions, and that a program can be switched between Direct and Data versions without any code changes (just a simple import). I also considered the possibility of discovering which Uniplate instances dominated and using the Direct method only for those - but I did not investigate further.

Add and optimise eqExpShell, 10s to 8s

The next step was to profile. I compiled and ran HLint with the following options:

ghc --make -O1 -prof -auto-all -o hlint
hlint src +RTS -p

Looking at the profiling output, I saw that the function unify took up over 60% of the execution time:

unify :: NameMatch -> Exp -> Exp -> Maybe [(String,Exp)]
unify nm (Do xs) (Do ys) | length xs == length ys = concatZipWithM (unifyStmt nm) xs ys
unify nm (Lambda _ xs x) (Lambda _ ys y) | length xs == length ys = liftM2 (++) (unify nm x y) (concatZipWithM unifyPat xs ys)
unify nm x y | isParen x || isParen y = unify nm (fromParen x) (fromParen y)
unify nm (Var (fromNamed -> v)) y | isUnifyVar v = Just [(v,y)]
unify nm (Var x) (Var y) | nm x y = Just []
unify nm x y | ((==) `on` descend (const $ toNamed "_")) x y = concatZipWithM (unify nm) (children x) (children y)
unify nm _ _ = Nothing

The function unify is an essential part of the rule matching in HLint, and attempts to compare a rule to a target expression, and if successful returns the correct substitution. Each rule is compared to every expression in all files, which means that unify is called millions of times in a normal HLint run. Looking closely, my first suspicion was the second line from the bottom in the guard - the call to descend and (==). This line compares the outer shell of two expressions, ignoring any inner expressions. It first uses the Uniplate descend function to insert a dummy value as each subexpression, then compares for equality. To test my hypothesis that this method was indeed the culprit I extracted it to a separate function, and modified unify to call it:

eqExpShell :: Exp_ -> Exp_ -> Bool
eqExpShell = (==) `on` descend (const $ toNamed "_")

I reran the profiling, and now all the time was being spent in eqExpShell. My first thought was to expand out the function to not use Uniplate. I quickly rejected that idea - there are expressions within statements and patterns, and to follow all the intricacies of HSE would be fragile and verbose.

The first optimisation I tried was to replace toNamed "_", the dummy expression, with something simpler. The toNamed call expands to many constructors, so instead I used Do an [] (an is a dummy source location), which is the simplest expression HSE provides. This change had a noticeable speed improvement.

eqExpShell :: Exp_ -> Exp_ -> Bool
eqExpShell = (==) `on` descend (const $ Do an [])

My second thought was to add a quick test, so that if the outer constructors were not equal then the expensive test was not tried. Determining the outer constructor of a value can be done by calling show then only looking at the first word (assuming a sensible Show instance, which HSE has).

eqExpShell :: Exp_ -> Exp_ -> Bool
eqExpShell x y =
    ((==) `on` constr) x y &&
    ((==) `on` descend (const $ Do an [])) x y
    where constr = takeWhile (not . isSpace) . show

This change had a bigger speed improvement. I found that of the 1.5 million times eqExpShell was called, the quick equality test rejected 1 million cases.

My next thought was to try replacing constr with the SYB function toConstr. There was no noticeable performance impact, but the code is neater and doesn't rely on the Show instance, so I stuck with it. After all these changes HLint was 2 seconds faster, but eqExpShell was still the biggest culprit on the profiling report.
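As a small illustration of the swap (my own example, not HLint's code), the Show-based trick and toConstr agree on the outer constructor:

```haskell
import Data.Char (isSpace)
import Data.Data (Data, toConstr)

-- Show-based: render the value and take the first word.
constrByShow :: Show a => a -> String
constrByShow = takeWhile (not . isSpace) . show

-- SYB-based: ask for the constructor representation directly,
-- needing only a Data instance rather than a sensible Show.
sameConstr :: Data a => a -> a -> Bool
sameConstr x y = toConstr x == toConstr y
```

For example, sameConstr (Just 1) (Just 2) is True, sameConstr (Just 1) Nothing is False, and constrByShow (Just 1) is "Just".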

Write eqExpShell entirely in SYB, 8s to 8s, reverted

My next thought was to rewrite eqExpShell entirely in SYB functions, not using the Eq instance at all. The advantages of this approach would be that I can simply disregard all subexpressions, I only walk the expression once, and I can skip source position annotations entirely. Starting from the geq function in SYB, I came up with:

data Box = forall a . Data a => Box a

eqExpShellSYB :: Exp_ -> Exp_ -> Bool
eqExpShellSYB = f
    where
        f :: (Data a, Data b) => a -> b -> Bool
        f x y = toConstr x == toConstr y &&
                and (zipWith g (gmapQ Box x) (gmapQ Box y))

        g (Box x) (Box y) = tx == typeAnn || tx == typeExp || f x y
            where tx = typeOf x

        typeAnn = typeOf an
        typeExp = typeOf ex

Unfortunately, this code takes exactly the same time as the previous version, despite being significantly more complex. My guess is that toConstr is not as fast as the Eq instance, and that this additional overhead negates all the other savings. I decided to revert back to the simpler version.

Call eqExpShell less, 8s to 4s

Having failed to optimise eqExpShell further, I then thought about how to call it less. I added a trace and found that of the 1.5 million calls, in 1.3 million times at least one of the constructors was an App. Application is very common in Haskell programs, so this is not particularly surprising. By looking back at the code for unify I found several other constructors were already handled, so I added a special case for App.

unify nm (App _ x1 x2) (App _ y1 y2) = liftM2 (++) (unify nm x1 y1) (unify nm x2 y2)

It is easy to show that if an App (or any specially handled constructor) is passed as either argument to eqExpShell then the result will be False - if both shells had been equal, a previous case would already have matched. Taking advantage of this observation, I rewrote the line with the generic match as:

unify nm x y | isOther x && isOther y && eqExpShell x y
             = concatZipWithM (unify nm) (children x) (children y)

isOther Do{} = False
isOther Lambda{} = False
isOther Var{} = False
isOther App{} = False
isOther _ = True

With this change eqExpShell was called substantially less often - it disappeared from the profile entirely - and the time improved to 4 seconds.

Fix Uniplate bug, 4s to 1.3s

The next step was to rerun the profiling. However, the results were very confusing - almost 70% of the execution time was attributed to three CAFs, while I could see no obvious culprit. I reran the profiler with the -caf-all flag to report the location of CAFs more precisely, and was again confused - the optimiser had rearranged the functions to make -caf-all useless. I then reran the profiler with optimisation turned off using -O0 and looked again. This time the profiling clearly showed Uniplate as the source of the CAFs. The hit table that Uniplate creates is stored inside a CAF, so it was an obvious candidate.

Turning back to Uniplate, I attempted to reproduce the bug outside HLint. I enhanced the benchmarking suite by adding a test to find all the Strings inside very small HSE values. The standard Uniplate benchmarks test the performance of running code, and I had neglected to check the creation of the hit table, assuming it to be negligible.

testN "Module" $ Module ssi Nothing [] [] []
testN "Decl" $ FunBind ssi []
testN "Exp" $ Do ssi []
testN "Pat" $ PWildCard ssi

testN :: Biplate a String => String -> a -> IO ()
testN msg x = do
    t <- timer $ evaluate $ length (universeBi x :: [String])
    putStrLn $ "HSE for " ++ msg ++ " takes " ++ dp2 t ++ "s"

My initial worry was that at 2 decimal places I was likely to see 0.00 for all values. However, that turned out not to be a problem! What I saw was:

HSE for Module takes 0.54
HSE for Decl takes 0.86
HSE for Exp takes 2.54
HSE for Pat takes 0.32

These results surprised me. In particular, the hit table from Exp to String is a subset of the Module one, so should always be faster. The computation of the hit table is reasonably complex, and I was unable to isolate the problem. The tricky part of the hit table is that it is necessary to take the fixed point of the transitive closure of reachability - I had tried to keep track of recursive types and reach a fixed point with the minimal number of recomputations. I clearly failed, and probably had an exponential aspect to the algorithm that under certain circumstances caused ridiculously bad behaviour.

Rather than try to diagnose the bug, I decided instead to rethink the approach, and simplify the design. In particular, the fixed point of the transitive closure is now written as:

fixEq trans (populate box)

Where populate finds the immediate children of a constructor, trans takes the transitive closure based on the current state, and fixEq takes the fixed point. With this new, simpler design I was also able to compute which types were recursively contained within which other types, and cache the result - so computing the hit table for Module not only computes the hit table for Exp, but does so in a way that the result can be reused when asking about Exp directly. After rewriting the code I reran the benchmark:

HSE for Module takes 0.03
HSE for Decl takes 0.00
HSE for Exp takes 0.00
HSE for Pat takes 0.00

I had hoped that the rewrite would fix the performance problems, and it did. I have not diagnosed the original performance bug, but at 0.03 seconds I was satisfied. I have now released Uniplate 1.5 with the revised hit table code. With this change, the time for HLint drops to 1.3 seconds and all the CAFs disappeared from the profile.
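The fixed-point scheme above can be sketched with simplified stand-in types (strings here, purely for illustration - not Uniplate's internal representation):

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

-- Which types are (so far) known to be reachable from which.
type Reach = Map.Map String (Set.Set String)

-- One step of transitive closure: each type additionally reaches
-- everything reachable from the types it already reaches.
trans :: Reach -> Reach
trans r = Map.map step r
    where step s = Set.unions
              (s : [Map.findWithDefault Set.empty t r | t <- Set.toList s])

-- Iterate a transformation until it stops changing.
fixEq :: Eq a => (a -> a) -> a -> a
fixEq f x = let x' = f x in if x' == x then x' else fixEq f x'
```

Given an initial map from populate (say Module reaches Decl, Decl reaches Exp, Exp reaches Exp and String), fixEq trans yields the full closure, with Module reaching Decl, Exp and String.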

Sharing computation, 1.3s to 1.2s

At this point, I was happy to finish, but decided to profile just one last time. The top function in the list was matchIdea:

matchIdea :: NameMatch -> Setting -> Exp_ -> Maybe Exp_
matchIdea nm MatchExp{lhs=lhs,rhs=rhs,side=side} x = do
    u <- unify nm (fmap (const an) lhs) (fmap (const an) x)
    u <- check u
    guard $ checkSide side u
    let rhs2 = subst u rhs
    guard $ checkDot lhs rhs2
    return $ unqualify nm $ dotContract $ performEval rhs2

This function first strips both the rule (lhs) and the expression (x) of their source position information, to ensure equality works correctly. However, both the rules and expressions are reused multiple times, so I moved the fmap calls earlier, ensuring each rule/expression is only ever stripped of source position information once. With this change the runtime was reduced to 1.2 seconds.
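The transformation is ordinary let-floating - this toy stand-in (my own, using lowercasing in place of position stripping) shows the shape of the change:

```haskell
import Data.Char (toLower)

-- Before: the rule is normalised afresh for every candidate expression.
matchAllBefore :: String -> [String] -> [Bool]
matchAllBefore rule exprs = [map toLower rule == map toLower e | e <- exprs]

-- After: hoist the normalisation of the rule out of the loop, so it is
-- computed once and shared across all comparisons.
matchAllAfter :: String -> [String] -> [Bool]
matchAllAfter rule exprs = let rule' = map toLower rule
                           in [rule' == map toLower e | e <- exprs]
```

Both versions give the same answers; the second simply does the shared work once per rule rather than once per comparison.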

Final Results

After all that effort, I reran the profile results and got:

COST CENTRE     MODULE           %time  %alloc

parseString     HSE.All           16.2    17.7
matchIdea       Hint.Match        14.4     1.6
eqExpShellSlow  HSE.Eq             9.2    11.4
listDecl        Hint.List          6.1     4.1
lambdaHint      Hint.Lambda        5.1     5.6
bracketHint     Hint.Bracket       4.1     6.4
hasS            Hint.Extensions    4.1     5.0
monadHint       Hint.Monad         4.1     2.9
~=              HSE.Match          4.1     2.5
isParen         HSE.Util           3.1     0.0

Now the largest contributor to the HLint runtime is the parsing of Haskell files. There are no obvious candidates for easy optimisations, and the code runs sufficiently fast for my purposes.


There is still some scope for optimisation of HLint, but I leave that for future work. One possible avenue for exploration would be turning on selected packages of hints, to see which one takes the most time - profiling on a different measure.

In optimising HLint I've found two issues in Uniplate, the first of which I was aware of, and the second of which came as a total surprise. These optimisations to Uniplate will benefit everyone who uses it. I have achieved the goal of optimising HLint, simply by following the profile reports, and as a result HLint is now substantially faster than I had ever expected.

Footnote: I didn't actually profile first, as I knew that a performance regression was caused by the upgrade to Uniplate 1.3, so knew where to start looking. Generally, I would start with a profile.


meteficha said...

Great post, thanks a lot for sharing your thoughts! And, of course, for sharing your libraries :).


Syzygies said...

Great read!

I don't know if this would be easier as an extension of hlint or of hoogle, but I've been wondering lately if a tool could answer "I want to write a function with these pieces and perhaps some standard library glue?"

My hunch is that often one of the shorter combinations producing the right type is exactly what someone learning Haskell is looking for. "Theorems for free" and all, Haskell is smart enough genetic material that one can just shake the bag and see what recombines, doing far better than a monkey at a keyboard.

malcolm said...

I have a general comment on strategies for optimisation. Your unify function has many clauses, using both pattern matches and somewhat expensive guards, before reaching the "default" case. I don't know for sure in your code, but in other applications with a similar structure, I have frequently found the default case to be the most common path, which necessitates evaluating all of the preceding left-hand-sides.

On such occasions, it can be very beneficial to write a cheap conditional that will identify the large majority of these default cases, and use it in a guard as the first clause of many.

So for instance,

unify a b | toConstr a /= toConstr b = Nothing

as the first clause might instantly avoid lots of pattern-matching/guard evaluation in the common case, whilst admittedly adding slightly to the evaluation work required in the other less common cases.

Neil Mitchell said...

Syzygies: I think that is a really difficult challenge, and I'm not sure if it could be made to work automatically. I think Djinn is the closest thing to what you describe, and might be a good place to start.

Malcolm: I think with the guards/pattern matching combination as shown in the blog post, as long as isOther, isParen and eqExpShell are inlined, it will be a straight case on x/case on y in the default case. It does rely on the simplifier doing quite a lot of lifting, but I think it's optimal.

However, the code in the blog post isn't quite the true story (there are some additional rules to catch $ and .) and for those I think adding a default test to avoid them would be useful (unless SpecConstr does a particularly awesome job).