Monday, June 22, 2020

The HLint Match Engine

Summary: HLint has a match engine which powers most of the rules.

The Haskell linter HLint has two forms of lint - some are built in, written in Haskell code over the GHC AST (e.g. unused extension detection), but 700+ hints are written using a matching engine. As an example, we can replace map f (map g xs) with map (f . g) xs. Doing so might be more efficient, but importantly for HLint, it's often clearer. That rule is defined in HLint as:

- hint: {lhs: map f (map g x), rhs: map (f . g) x}

All single-letter variables are wildcard matches, so the above rule will match:

map isDigit (map toUpper "test")

And suggest:

map (isDigit . toUpper) "test"

However, Haskell programmers are uniquely creative in specifying functions - with liberal use of the $ and . operators, infix application and so on. The HLint matching engine in HLint v3.1.4 would match this rule to all of the following (I'm using sort as a convenient function; replacing it with foo below would not change any matches):

  • map f . map g
  • sort . map f . map g . sort
  • concatMap (map f . map g)
  • map f (map (g xs) xs)
  • f `map` (g `map` xs)
  • map f $ map g xs
  • map f (map g $ xs)
  • map f (map (\x -> g x) xs)
  • Data.List.map f (Prelude.map g xs)
  • map f ((sort . map g) xs)

That's a large variety of ways to write a nested map. In this post I'll explain how HLint matches everything above, and the bug that caused it to also match the final line (which isn't a legitimate match), which was fixed in HLint v3.1.5.

Eta-contraction

Given a hint comprising an lhs and rhs, the first thing HLint does is determine if it can eta-contract the hint, producing a version without the final argument. If it can do so for both sides, it generates a completely fresh hint. In the case of map f (map g x) it generates:

- hint: {lhs: map f . map g, rhs: map (f . g)}

For the examples above, the first three match with this eta-contracted version, and the rest match with the original form. Now that we've generated two hints, it's important that the matching isn't so fuzzy that both match some expression, as that would generate twice as many warnings as appropriate.
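To give a flavour of how eta-contraction can work, here's a minimal sketch over a deliberately tiny expression type - HLint really operates on the GHC AST, so everything below is illustrative only:

-- A tiny illustrative expression type, reused by the sketches below.
data Expr = Var String | App Expr Expr
    deriving (Eq, Show)

free :: Expr -> [String]
free (Var x) = [x]
free (App f x) = free f ++ free x

-- Contract the final argument v out of an expression:
--   h v      ==>  h       provided v does not occur in h
--   f (g v)  ==>  f . g   recursively, composing as we go
etaContract :: String -> Expr -> Maybe Expr
etaContract v (App f (Var x))
    | x == v, v `notElem` free f = Just f
etaContract v (App f e)
    | v `notElem` free f = comp f <$> etaContract v e
    where comp a b = App (App (Var ".") a) b
etaContract _ _ = Nothing

Applied to the Expr encoding of map f (map g x) with the variable x, this yields map f . map g, as above.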

Root matching

The next step is root matching, which happens only when trying to match at the root of the expression. If we have (foo . bar) x then it would be reasonable for that to match bar x, despite the fact that bar x is not a subexpression. We overcome that by transforming the expression to foo (bar x), unifying only on bar x, and recording that we need to add back foo . at the start of the replacement.
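As a sketch using the Expr type from above, treating f . g as the (.) variable applied to f and g, the split might look like:

-- View (foo . bar) x as foo (bar x): unify only on "bar x", and keep a
-- function that re-attaches foo to whatever replacement is produced.
splitRoot :: Expr -> Maybe (Expr -> Expr, Expr)
splitRoot (App (App (App (Var ".") foo) bar) x) = Just (App foo, App bar x)
splitRoot _ = Nothing

The returned function is the recorded "add back foo ." step, applied to the substituted replacement at the end.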

Expression matching

After splitting off any extra prefix, HLint tries to unify the single-letter variables with expressions, and build a substitution table with type Maybe [(String, Expr)]. The substitution is Nothing to denote the expressions are incompatible, or Just a mapping of variables to the expression they matched. If two expressions have the same structure, we descend into all child terms and match further. If they don't have the same structure, but are similar in a number of ways, we adjust the source expression and continue.
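A sketch of that unification, again over the illustrative Expr type:

-- Single-letter variables in the hint act as wildcards; everything else
-- must match structurally, descending into child terms.
unify :: Expr -> Expr -> Maybe [(String, Expr)]
unify (Var v) e
    | length v == 1 = Just [(v, e)]
unify (Var a) (Var b)
    | a == b = Just []
unify (App f1 x1) (App f2 x2) = (++) <$> unify f1 f2 <*> unify x1 x2
unify _ _ = Nothing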

Examples of adjustments include expanding out $, removing infix application such as f `map` x and ignoring redundant brackets. We translate (f . g) x to f (g x), but not at the root - otherwise we might match both the eta-expanded and non-eta-expanded variants. We also re-associate (.) where needed, e.g. for expressions like sort . map f . map g . sort the bracketing means we have sort . (map f . (map g . sort)). We can see that map f . map g is not a subexpression of that expression, but given that . is associative, we can adjust the source.
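A few of those adjustments, as sketches on the same Expr type:

adjust :: Expr -> Maybe Expr
-- expand $:  f $ x  ==>  f x
adjust (App (App (Var "$") f) x) = Just (App f x)
-- flatten composition (not at the root):  (f . g) x  ==>  f (g x)
adjust (App (App (App (Var ".") f) g) x) = Just (App f (App g x))
-- re-associate composition:  a . (b . c)  ==>  (a . b) . c
adjust (App (App (Var ".") a) (App (App (Var ".") b) c)) =
    Just (App (App (Var ".") (App (App (Var ".") a) b)) c)
adjust _ = Nothing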

When we get down to a terminal name like map, we use the scope information HLint knows to determine if the two maps are equivalent. I'm not going to talk about that too much, as it's slated to be rewritten in a future version of HLint, and is currently both slow and a bit approximate.

Substitution validity

Once we have a substitution, we see if there are any variables which map to multiple distinct expressions. If so, the substitution is invalid, and we don't match. However, in our example above, there are no duplicate variables so any matching substitution must be valid.
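Sketched over our association list representation:

-- A substitution is valid if duplicated variables always map to
-- identical expressions.
valid :: [(String, Expr)] -> Bool
valid sub = and [e1 == e2 | (v1, e1) <- sub, (v2, e2) <- sub, v1 == v2]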

Side conditions

Next we check any side conditions - e.g. we could decide that the above hint only makes sense if x is atomic - i.e. does not need brackets in any circumstance. We could have expressed that with side: isAtom x, and any such conditions are checked in a fairly straightforward manner.
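A sketch of what such a condition can look like on the tiny Expr type (real HLint also counts literals, tuples, lists and so on as atomic):

isAtom :: Expr -> Bool
isAtom (Var _) = True
isAtom _ = False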

Substitutions

Finally, we substitute the variables into the provided replacement. When doing the replacement, we keep track of the free variables, and if the resulting expression has more free variables than it started with, we assume the hint doesn't apply cleanly. As an example, consider the hint \x -> a <$> b x to fmap a . b. It looks like a perfectly reasonable hint, but what if we apply it to the expression \x -> f <$> g x x? Now b matches g x, but we are throwing away the \x binding and x is left dangling, so we reject it.
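Applying the substitution itself is a straightforward traversal; a sketch, after which the free variables of the result are compared against those of the original expression:

import Data.Maybe (fromMaybe)

-- Replace bound variables by what they matched, leaving everything
-- else untouched. (In reality only locally bound variables matter for
-- the free-variable check, not global names such as fmap.)
subst :: [(String, Expr)] -> Expr -> Expr
subst sub (Var v) = fromMaybe (Var v) (lookup v sub)
subst sub (App f x) = App (subst sub f) (subst sub x)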

When performing the substitution, we use knowledge of the AST we want, and the brackets required to parse that expression, to ensure we insert the right brackets, but not too many.

Bug #1055

Hopefully all the above sounds quite reasonable. Unfortunately, at some point, the root-matching lost the check that it really was at the root, and started applying the translation to terms such as sort . in map f ((sort . map g) xs). Having generated the sort ., it decided that, since it wasn't at the root, there was nowhere for the prefix to go, so promptly threw it away. Oops. HLint v3.1.5 fixes the bug in two distinct ways (for defence in depth):

  1. It checks the root boolean before doing the root matching rule.
  2. If it would have to throw away any extra expression, it fails, as throwing away that expression is certain to lead to a correctness bug.

Conclusion

The matching engine of HLint is relatively complex, but I always assumed it would one day be replaced with a finite-state-machine scanner that could match n hints against an expression in O(size-of-expression), rather than the current O(n * size-of-expression). However, it's never been the bottleneck, so I've stuck with the more direct version.

I'm glad HLint has a simple external lint format. It allows easy contributions and makes hint authoring accessible to everyone. For large projects it's easy to define your own hints to capture common coding patterns. When using languages whose linter does not have an external matching language (e.g. Rust's Clippy) I certainly miss the easy customization.

Tuesday, June 09, 2020

Hoogle Searching Overview

Summary: Hoogle 5 has three interesting parts, a pipeline, database and search algorithm.

The Haskell search engine Hoogle has gone through five major designs, the first four of which are described in these slides from TFP 2011. Hoogle version 5 was designed to be a complete rewrite which simplified the design and allowed it to scale to all of Hackage. All versions of Hoogle have had some preprocessing step which consumes Haskell definitions, and writes out a data file. They then have the search phase which uses that data file to perform searches. In this post I'll go through three parts -- what the data file looks like, how we generate it, and how we search it. When we consider these three parts, the evolution of Hoogle can be seen as:

  • Versions 1-3 produced fairly simple data files, then did an expensive search on top. They failed to scale to large sizes.
  • Version 4 produced a very elaborate data file, aiming to search quickly on top. It failed because producing the data file required a lot of babysitting and a long time, so it was updated very rarely (yearly). Also, searching a complex data file ends up with a lot of corner cases with terrible complexity (e.g. a -> a -> a -> a -> a would kill the server).
  • Version 5 generates very simple data files, then does O(n) searching with a small constant multiplier on top. The files are updated daily and automatically, and search time is kept very consistent.

Version 5 data file

By version 5 I had realised that deserialising the data file was both time consuming and memory hungry. Therefore, in version 5, the data file consists of chunks of data that can be memory-mapped into Vector and ByteString chunks backed by ForeignPtr storage. The OS figures out which bits of the data file should be paged in, and there is very little overhead or complexity on the Haskell side. There is a small index structure at the start of the data file which says where these interesting data structures live, and gives them identity using types. For example, to store information about name search we have three definitions:

data NamesSize a where NamesSize :: NamesSize Int
data NamesItems a where NamesItems :: NamesItems (V.Vector TargetId)
data NamesText a where NamesText :: NamesText BS.ByteString

Namely, in the data file we have NamesSize which is an Int, NamesItems which is a Vector TargetId, and NamesText which is a ByteString. The NamesSize is the maximum number of results that can be returned from any non-empty search (used to reduce memory allocation for the result structure), the NamesText is a big string with \0 separators between each entry, and the NamesItems are the identifiers of the result for each name, with as many entries as there are \0 separators.

The current data file is 183Mb for all of Stackage, of which 78% is information about items (documentation, enough information to render them, where the links go etc - we GZip this information). There are 21 distinct storage types, most involved with type search.
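As a sketch of the memory-mapping approach (using the mmap package; the file name, offset and length below are made up):

import qualified Data.ByteString as BS
import System.IO.MMap (mmapFileByteString)

-- Map a slice of the data file straight into a ByteString; the offset
-- and length would come from the index at the start of the file.
loadNamesText :: IO BS.ByteString
loadNamesText = mmapFileByteString "hoogle.hoo" (Just (4096, 10485760))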

Generating the data file

Generating the data file is done in four phases.

Phase 0 downloads the inputs, primarily a .tar.gz file containing all .cabal files, and another containing all the Haddock Hoogle outputs. These .tar.gz files are never unpacked, but streamed through and analysed using conduit.

Phase 1 reads through all the .cabal files to get metadata about each package - the author, tags, whether it's in Stackage etc. It stores this information in a Map. This phase takes about 7s and uses 100Mb of memory.

Phase 2 reads through every definition in every Haddock Hoogle output (the .txt files --hoogle generates). It loads each entry, parses it, processes it, and writes most of the data to the data file, assigning it a TargetId. That TargetId is the position of the item in the data file, so it's unique, and can be used to grab the relevant item when we need to display it while searching. During this time we collect the unique deduplicated type signatures and names, along with their TargetId values. This phase takes about 1m45s and uses about 900Mb of memory by the end. The most important part of phase 2 is not introducing a space leak, since doing so makes memory soar to many Gb.

Phase 3 processes the name and type maps and writes out the information used for searching. This phase takes about 20s and consumes an additional 250Mb over the previous phase.

Since generating the data file takes only a few minutes, there is a job that updates it at 8pm every night. The job takes about 15 minutes in total, because it checks out a new version of Hoogle from GitHub, builds it, downloads all the data files, generates a data file, runs the tests, and then restarts the servers.

Searching

Hoogle version 5 works on the principle that it's OK to be O(n) if the constant is small. For textual search, we have a big flat ByteString, and give that to some C code that quickly looks for the substring we enter, favouring complete and case-matching matches. Such a loop is super simple, and at the size of data we are working with (about 10Mb), plenty fast enough.
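Expressed in Haskell rather than C (a sketch only; the real loop also favours complete and case-matching hits, and the element type here stands in for TargetId):

import qualified Data.ByteString.Char8 as BS
import qualified Data.Vector as V

-- Pair each \0-separated name with its identifier by position, keeping
-- those that contain the query.
searchNames :: BS.ByteString -> BS.ByteString -> V.Vector a -> [a]
searchNames query names items =
    [ items V.! i
    | (i, name) <- zip [0 ..] (BS.split '\0' names)
    , query `BS.isInfixOf` name ]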

Type search is inspired by the same principle. We deduplicate types, then for each type, we produce an 18 byte fingerprint. There are about 150K distinct type signatures in Stackage, so that results in about 2.5Mb of fingerprints. For every type search we scan all those fingerprints and figure out the top 100 matches, then do a more expensive search on the full type for those top 100, producing a ranking. For a long time (a few years) I hadn't even bothered doing the second phase of more precise matching, and it still gave reasonable results. (In fact, I never implemented the second phase, but happily Matt Noonan contributed it.)

A type fingerprint is made up of three parts:

  • 1 byte being the arity of the function. a -> b -> c would have arity 3.
  • 1 byte being the number of constructors/variables in the type signature. Maybe a -> a would have a value of 3.
  • The three rarest names in the function. E.g. A -> B -> C -> D would compare how frequent each of A, B, C and D were in the index of functions, and record the 3 rarest. Each name is given a 32 bit value (where 0 is the most common and 2^32 is the rarest).

The idea of arity and number of constructors/variables is to try and get an approximate shape fit to the type being searched for. The idea of the rarest names is an attempt to take advantage of the fact that if you are searching for ShakeOptions -> [a] -> [a] then you probably didn't write ShakeOptions by accident -- it provides a lot of signal. Therefore, filtering down to functions that mention ShakeOptions probably gives a good starting point.
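Putting the three components together (a sketch; rarity is assumed to map a name to its 32-bit score, with bigger meaning rarer, and the exact 18-byte packing is not shown):

import Data.List (sortBy)
import Data.Ord (comparing, Down(..))
import Data.Word (Word8, Word32)

fingerprint :: (String -> Word32) -> Int -> [String] -> (Word8, Word8, [Word32])
fingerprint rarity arity names =
    ( fromIntegral arity                 -- shape: the arity byte
    , fromIntegral (length names)        -- shape: constructor/variable count
    , take 3 $ sortBy (comparing Down) $ map rarity names )  -- 3 rarest names

For A -> B -> C -> D that would be fingerprint rarity 4 ["A", "B", "C", "D"], keeping whichever three of the four names score rarest.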

Once we have the top 100 matches, we can then start considering whether type classes are satisfied, whether type aliases can be expanded, what the shape of the actual function is etc. By operating on a small and bounded number of types we can do much more expensive comparisons than if we had to apply them to every possible candidate.

Conclusion

Hoogle 5 is far from perfect, but the performance is good, the scale can keep up with the growth of Haskell packages, and the simplicity has kept maintenance low. The technique of operations which are O(n) but with a small constant is one I've applied in other projects since, and I think is an approach often overlooked.

Sunday, June 07, 2020

Surprising IO: How I got a benchmark wrong

Summary: IO evaluation caught me off guard when trying to write some benchmarks.

I once needed to know a quick back-of-the-envelope timing of a pure operation, so hacked something up quickly rather than going via criterion. The code I wrote was:

import Control.Exception (evaluate)
import Control.Monad (replicateM_)
import System.Time.Extra (duration) -- from the extra package

main = do
    (t, _) <- duration $ replicateM_ 100 $ action myInput
    print $ t / 100

{-# NOINLINE action #-}
action x = do
    evaluate $ myOperation x
    return ()

Reading from top to bottom, it times running action 100 times and prints the average per run. I deliberately engineered the code so that GHC couldn't optimise it such that myOperation was run only once. As examples of the defensive steps I took:

  • The action function is marked NOINLINE. If action was inlined then myOperation x could be floated up and only run once.
  • The myInput is given as an argument to action, ensuring it can't be applied to myOperation at compile time.
  • The action is in IO, so it has to be rerun each time.

Alas, GHC still had one trick up its sleeve, and it wasn't even an optimisation - merely the definition of evaluation. The replicateM_ function takes action myInput, which is evaluated once to produce a value of type IO (), and then runs that IO () 100 times. Unfortunately, in my benchmark myOperation x is actually evaluated in the process of creating the IO (), not when running the IO (). The fix was simple:

action x = do
    _ <- return ()
    evaluate $ myOperation x
    return ()

Which roughly desugars to:

return () >>= \_ -> evaluate (myOperation x)

Now the IO produced has a lambda inside it, and my benchmark runs 100 times. However, at -O2 GHC used to manage to break this code once more, by lifting myOperation x out of the lambda, producing:

let y = myOperation x in return () >>= \_ -> evaluate y

Now myOperation runs just once again. I finally managed to defeat GHC by lifting the input into IO, giving:

action x = do
    evaluate . myOperation =<< x
    return ()

Now the input x is itself in IO, so myOperation can't be hoisted.
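One way to check which variants really rerun the operation is to make evaluation observable, e.g. with Debug.Trace - a sketch, where myOperation is a stand-in:

import Control.Exception (evaluate)
import Control.Monad (replicateM_)
import Debug.Trace (trace)

-- A stand-in operation that announces every evaluation.
myOperation :: Int -> Int
myOperation x = trace "evaluated!" (x * 2)
{-# NOINLINE myOperation #-}

{-# NOINLINE action #-}
action :: IO Int -> IO ()
action x = do
    evaluate . myOperation =<< x
    return ()

main :: IO ()
main = replicateM_ 100 $ action (return 42)

If everything works, "evaluated!" is printed 100 times; with the broken variants it shows up just once.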

I originally wrote this post a very long time ago, and back then GHC did lift myOperation out from below the lambda. But nowadays it doesn't seem to do so (quite possibly because doing so might cause a space leak). However, there's nothing that promises GHC won't learn this trick again in the future.