Monday, July 27, 2020

Which packages does Hoogle search?

Summary: Hoogle searches packages on Stackage.

Haskell (as of 27 July 2020) has 14,490 packages in the Hackage package repository. Hoogle (the Haskell API search engine) searches 2,463 packages. This post explains which packages are searched, why some packages are excluded, and thus, how you can ensure your package is searched.

The first filter is that Hoogle only searches packages on Stackage. Hoogle indexes any package which is either in the latest Stackage nightly or Stackage LTS, but always indexes the latest version that is on Hackage. If you want a Hoogle search that perfectly matches a given Stackage release, I recommend using the Stackage Hoogle search available from any snapshot page. There are two reasons for restricting to only packages on Stackage:

  • I want Hoogle results to be useful. The fact that the package currently builds with a recent GHC used by Stackage is a positive sign that the package is maintained and might actually work for a user who wants to use it. Most of the packages on Hackage probably don't build with the latest GHC, and thus aren't useful search results.
  • Indexing time and memory usage is proportional to the number of packages, and somewhat the size of those packages. By dropping over 10 thousand packages we can index more frequently and on more constrained virtual machines. With the recent release of Hoogle 5.0.18 the technical limitations on size were removed to enable indexing all of Hackage - but there is still no intention to do so.

There are 2,426 packages in Stackage Nightly, and 2,508 in Stackage LTS, with most overlapping. There are 2,580 distinct packages between these two sources, the Haskell Platform and a few custom packages Hoogle knows about (e.g. GHC).

Of the 2,580 eligible packages, 77 are executables only, so don't have any libraries to search, leaving 2,503 packages.

Of the remaining packages 2,503, 40 are missing documentation on Hackage, taking us down to 2,463. As for why a package might not have documentation:

  • Some are missing documentation because they are very GHC internal and are mentioned but not on Hackage, e.g. ghc-heap.
  • Some are Windows only and won't generate documentation on the Linux Hackage servers, e.g. Win32-notify.
  • Some have dependencies not installed on the Hackage servers, e.g. rocksdb-query.
  • Some have documentation that appears to have been generated without generating a corresponding Hoogle data file, e.g. array-memoize.
  • Some are just missing docs entirely on Hackage for no good reason I can see, e.g. bytestring-builder.

The Hoogle database is generated and deployed once per day, automatically. Occasionally a test failure or dependency outage will cause generation to fail, but I get alerted, and usually it doesn't get stale by more than a few days. If you add your package to Stackage and it doesn't show up on Hoogle within a few days, raise an issue.

Wednesday, July 15, 2020

Managing Haskell Extensions

Summary: You can divide extensions into yes, no and maybe, and then use HLint to enforce that.

I've worked in multiple moderately sized multinational teams of Haskell programmers. One debate that almost always comes up is which extensions to enable. It's important to have some consistency, so that everyone is using similar dialects of Haskell and can share/review code easily. The way I've solved this debate in the past is by, as a team, dividing the extensions into three categories:

  • Always-on. For example, ScopedTypeVariables is just how Haskell should work and should be enabled everywhere. We turn these extensions on globally in Cabal with default-extensions, and then write an HLint rule to ban turning the extension on manually. In one quick stroke, a large amount of {-# LANGUAGE #-} boilerplate at the top of each file disappears.
  • Always-off, not enabled and mostly banned. For example, TransformListComp steals the group keyword and never got much traction. These extensions can be similarly banned by HLint, or you can default unmentioned extensions to being disabled. If you really need to turn on one of these extensions, you need to both turn it on in the file, and add an HLint exception. Such an edit can trigger wider code review, and serves as a good signal that something needs looking at carefully.
  • Sometimes-on, written at the top of the file as {-# LANGUAGE #-} pragmas when needed. These are extensions which show up sometimes, but not that often. A typical file might have zero to three of them. The reasons for making an extension sometimes-on fall into three categories:
    • The extension has a harmful compile-time impact, e.g. CPP or TemplateHaskell. It's better to only turn these extensions on if they are needed, but they are needed fairly often.
    • The extension can break some code, e.g. OverloadedStrings, so enabling it everywhere would cause compile failures. Generally, we work to minimize such cases, aiming to fix all code to compile with most extensions turned on.
    • The extension is used rarely within the code base and is a signal to the reader that something unusual is going on. Depending on the code base that might be things like RankNTypes or GADTs. But for certain code bases, those extensions will be very common, so it very much varies by code base.

The features that are often most debated are the syntax features - e.g. BlockSyntax or LambdaCase. Most projects should either use these extensions commonly (always-on), or never (banned). They provide some syntactic convenience, but if used rarely, tend to mostly confuse things.

Using this approach every large team I've worked on has had one initial debate to classify extensions, then every few months someone will suggest moving an extension from one pile to another. However, it's pretty much entirely silenced the issue from normal discussion thereafter, leaving us to focus on actual coding.

Tuesday, July 07, 2020

How I Interview

Summary: In previous companies I had a lot of freedom to design an interview. This article describes what I came up with.

Over the years, I've interviewed hundreds of candidates for software engineering jobs (at least 500, probably quite a bit more). I've interviewed for many companies, for teams I was setting up, for teams I was managing, for teams I worked in, and for different teams at the same company. In most places, I've been free to set the majority of the interview. This post describes why and how I designed my interview process. I'm making this post now because where I currently work has a pre-existing interview process, so I won't be following the process below anymore.

I have always run my interviews as a complete assessment of a candidate, aiming to form a complete answer. Sometimes I did that as a phone screen, and sometimes as part of a set of interviews, but I never relied on other people to cover different aspects of a candidate. (Well, I did once, and it went badly...)

When interviewing, there are three questions I want to answer for myself, in order of importance.

Will they be happy here?

If the candidate joined, would they be happy? If people aren't happy, it won't be a pleasant experience, and likely, they won't be very successful. Whether they are happy is the most important criteria because an employee who can't do the job but is happy can be trained or can use their skills for other purposes. But an employee who is unhappy will just drag the morale of the whole team down.

To figure out whether a candidate would be happy, I explain the job (including any office hours/environment/location) and discuss it in contrast to their previous experience. The best person to judge if they would be happy are the candidate themselves - so I ask that question. The tricky part is that it's an interview setting, so they have prepared saying "Yes, that sounds good" to every question. I try and alleviate that by building a rapport with the candidate first, being honest about my experiences, and trying to discuss what they like in the abstract first. If I'm not convinced they are being truthful or properly thinking it through, I ask deeper questions, for example how they like to split their day etc.

A great sign is when a candidate, during the interview, concludes for themselves that this job just isn't what they were looking for. I've had that happen 5 times during the actual interview, and 2 times as an email afterwards. It isn't awkward, and has saved some candidates an inappropriate job (at least 2 would have likely been offered a job otherwise).

While I'm trying to find out if the candidate will be happy, at the same time, I'm also attempting to persuade the candidate that they want to join. It's a hard balance and being open and honest is the only way I have managed it. Assuming I am happy where I work, I can use my enthusiasm to convince the candidate it's a great place, but also give them a sense of what I do.

Can they do the job?

There are two ways I used to figure out if someone can do the job. Firstly, I discuss their background, coding preferences etc. Do the things they've done in the past match the kind of things required in the job. Have they got experience with the non-technical portions of the job, or domain expertise. Most of these aspects are on their CV, so it involves talking about their CV, past projects, what worked well etc.

Secondly, I give them a short technical problem. My standard problem can be solved in under a minute in a single line of code by the best candidates. The problem is not complex, and has no trick-question or clever-approach element. The result can then be used as a springboard to talk about algorithmic efficiency, runtime implementation, parallelism, testing, verification etc. However, my experience is that candidates who struggle at the initial problem go on to struggle with any of the extensions, and candidates that do well at the initial question continue to do well on the extensions. The correlation has been so complete that over time I have started to use the extensions more for candidates who did adequately but not great on the initial problem.

My approach of an incredibly simple problem does not seem to be standard or adopted elsewhere. One reason might be that if it was used at scale, the ability to cheat would be immense (I actually have 2 backup questions for people I've interviewed previously).

Given such a simple question, there have been times when 5 candidates in a row ace the question, and I wonder if the question is just too simple. But usually then the next 5 candidates all struggle terribly and I decide it still has value.

Will I be happy with them doing the job?

The final thing I wonder is would I be happy with them being a part of the team/company. The usual answer is yes. However, if the candidate displays nasty characteristics (belittling, angry, racist, sexist, lying) then it's a no. This question definitely isn't code for "culture fit" or "would I go for a beer with them", but specific negative traits. Generally I answer this question based on whether I see these characteristics reflected in the interactions I have with the candidate, not specific questions. I've never actually had a candidate who was successful at the above questions, and yet failed at this question. I think approximately 5-10 candidates have failed on this question.

Sunday, July 05, 2020

Automatic UI's for Command Lines with cmdargs

Summary: Run cmdargs-browser hlint and you can fill out arguments easily.

The Haskell command line parsing library cmdargs contains a data type that represents a command line. I always thought it would be a neat trick to transform that into a web page, to make it easier to explore command line options interactively - similar to how the custom-written wget::gui wraps wget.

I wrote a demo to do just that, named cmdargs-browser. Given any program that uses cmdargs (e.g. hlint), you can install cmdargs-browser (with cabal install cmdargs-browser) and run:

cmdargs-browser hlint

And it will pop up:

As we can see, the HLint modes are listed on the left (you can use lint, grep or test), the possible options on the right (e.g. normal arguments and --color) and the command line it produces at the bottom. As you change mode or add/remove flags, the command line updates. If you hit OK it then runs the program with the command line. The help is included next to the argument, and if you make a mistake (e.g. write foo for the --color flag) it tells you immediately. It could be more polished (e.g. browse buttons for file selections, better styling) but the basic concepts works well.

Technical implementation

I wanted every cmdargs-using program to support this automatic UI, but also didn't want to increase the dependency footprint or compile-time overhead for cmdargs. I didn't want to tie cmdargs to this particular approach to a UI - I wanted a flexible mechanism that anyone could use for other purposes.

To that end, I built out a Helper module that is included in cmdargs. That API provides the full power and capabilities on which cmdargs-browser is written. The Helper module is only 350 lines.

If you run cmdargs with either $CMDARGS_HELPER or $CMDARGS_HELPER_HLINT set (in the case of HLint) then cmdargs will run the command line you specify, passing over the explicit Mode data type on the stdin. That Mode data type includes functions, and using a simplistic communication channel on the stdin/stdout, the helper process can invoke those functions. As an example, when cmdargs-browser wants to validate the --color flag, it does so by calling a function in Mode, that secretly talks back to hlint to validate it.

At the end, the helper program can choose to either give an error message (to stop the program, e.g. if you press Cancel), or give some command lines to use to run the program.

Future plans

This demo was a cool project, which may turn out to be useful for some, but I have no intention to develop it further. I think something along these lines should be universally available for all command line tools, and built into all command line parsing libraries.

Historical context

All the code that makes this approach work was written over seven years ago. Specifically, it was my hacking project in the hospital while waiting for my son to be born. Having a little baby is a hectic time of life, so I never got round to telling anyone about its existence.

This weekend I resurrected the code and published an updated version to Hackage, deliberately making as few changes as possible. The three necessary changes were:

  1. jQuery deprecated the live function replacing it with on, meaning the code didn't work.
  2. I had originally put an upper bound of 0.4 for the transformers library. Deleting the upper bound made it work.
  3. Hackage now requires that all your uploaded .cabal files declare that they require a version of 1.10 or above of Cabal itself, even if they don't.

Overall, to recover a project that is over 7 years old, it was surprisingly little effort.

Wednesday, July 01, 2020

A Rust self-ownership lifetime trick (that doesn't work)

Summary: I came up with a clever trick to encode lifetimes of allocated values in Rust. It doesn't work.

Let's imagine we are using Rust to implement some kind of container that can allocate values, and a special value can be associated with the container. It's a bug if the allocated value gets freed while it is the special value of a container. We might hope to use lifetimes to encode that relationship:

struct Value<'v> {...}
struct Container {...}

impl Container {
    fn alloc<'v>(&'v self) -> Value<'v> {...}
    fn set_special<'v>(&'v self, x: Value<'v>) {...}
}

Here we have a Container (which has no lifetime arguments), and a Value<'v> (where 'v ties it to the right container). Within our container we can implement alloc and set_special. In both cases, we take &'v self and then work with a Value<'v>, which ensures that the lifetime of the Container and Value match. (We ignore details of how to implement these functions - it's possible but requires unsafe).

Unfortunately, the following code compiles:

fn set_cheat<'v1, 'v2>(to: &'v1 Container, x: Value<'v2>) {
    to.set_special(x);
}

The Rust compiler has taken advantage of the fact that Container can be reborrowed, and that Value is variant, and rewritten the code to:

fn set_cheat<'v1, 'v2>(to: &'v1 Container, x: Value<'v2>) {
    'v3: {
        let x : Value<'v3> = x; // Value is variant, 'v2 : 'v3
        let to : &'v3 Container = &*to;
        to.set_special(x);
    }
}

The code with lifetime annotations doesn't actually compile, it's just what the compiler did under the hood. But we can stop Value being variant by making it contain PhantomData<Cell<&'v ()>>, since lifetimes under Cell are invariant. Now the above code no longer compiles. Unfortunately, there is a closely related variant which does compile:

fn set_cheat_alloc<'v1, 'v2>(to: &'v1 Container, from: &'v2 Container) {
    let x = from.alloc();
    to.set_special(x);
}

While Value isn't variant, &Container is, so the compiler has rewritten this code as:

fn set_cheat<'v1, 'v2>(to: &'v1 Container, from: &'v2 Container) {
    'v3: {
        let from = &'v3 Container = &*from;
        let x : Value<'v3> = from.alloc();
        let to : &'v3 Container = &*to;
        to.set_special(x);
    }
}

Since lifetimes on & are always variant, I don't think there is a trick to make this work safely. Much of the information in this post was gleaned from this StackOverflow question.

Monday, June 22, 2020

The HLint Match Engine

Summary: HLint has a match engine which powers most of the rules.

The Haskell linter HLint has two forms of lint - some are built in written in Haskell code over the GHC AST (e.g. unused extension detection), but 700+ hints are written using a matching engine. As an example, we can replace map f (map g xs) with map (f . g) xs. Doing so might be more efficient, but importantly for HLint, it's often clearer. That rule is defined in HLint as:

- hint: {lhs: map f (map g x), rhs: map (f . g) x}

All single-letter variables are wildcard matches, so the above rule will match:

map isDigit (map toUpper "test")

And suggest:

map (isDigit . toUpper) "test"

However, Haskell programmers are uniquely creative in specifying functions - with a huge variety of $ and . operators, infix operators etc. The HLint matching engine in HLint v3.1.4 would match this rule to all of the following (I'm using sort as a convenient function, replacing it with foo below would not change any matches):

  • map f . map g
  • sort . map f . map g . sort
  • concatMap (map f . map g)
  • map f (map (g xs) xs)
  • f `map` (g `map` xs)
  • map f $ map g xs
  • map f (map g $ xs)
  • map f (map (\x -> g x) xs)
  • Data.List.map f (Prelude.map g xs)
  • map f ((sort . map g) xs)

That's a large variety of ways to write a nested map. In this post I'll explain how HLint matches everything above, and the bug that used to cause it to match even the final line (which isn't a legitimate match) which was fixed in HLint v3.1.5.

Eta-contraction

Given a hint comprising of lhs and rhs, the first thing HLint does is determine if it can eta-contract the hint, producing a version without the final argument. If it can do so for both sides, it generates a completely fresh hint. In the case of map f (map g x) in generates:

- hint: {lhs: map f . map g, rhs: map (f . g)}

For the examples above, the first three match with this eta-contracted version, and the rest match with the original form. Now we've generated two hints, it's important that we don't perform sufficiently fuzzy matching that both match some expression, as that would generate twice as many warnings as appropriate.

Root matching

The next step is root matching, which happens only when trying to match at the root of some match. If we have (foo . bar) x then it would be reasonable for that to match bar x, despite the fact that bar x is not a subexpression. We overcome that by transforming the expression to foo (bar x), unifying only on bar x, and recording that we need to add back foo . at the start of the replacement.

Expression matching

After splitting off any extra prefix, HLint tries to unify the single-letter variables with expressions, and build a substitution table with type Maybe [(String, Expr)]. The substitution is Nothing to denote the expressions are incompatible, or Just a mapping of variables to the expression they matched. If two expressions have the same structure, we descend into all child terms and match further. If they don't have the same structure, but are similar in a number of ways, we adjust the source expression and continue.

Examples of adjustments include expanding out $, removing infix application such as f `map` x and ignoring redundant brackets. We translate (f . g) x to f (g x), but not at the root - otherwise we might match both the eta-expanded and non-eta-expanded variants. We also re-associate (.) where needed, e.g. for expressions like sort . map f . map g . sort the bracketing means we have sort . (map f . (map g . sort)). We can see that map f . map g is not a subexpression of that expression, but given that . is associative, we can adjust the source.

When we get down to a terminal name like map, we use the scope information HLint knows to determine if the two map's are equivalent. I'm not going to talk about that too much, as it's slated to be rewritten in a future version of HLint, and is currently both slow and a bit approximate.

Substitution validity

Once we have a substitution, we see if there are any variables which map to multiple distinct expressions. If so, the substitution is invalid, and we don't match. However, in our example above, there are no duplicate variables so any matching substitution must be valid.

Side conditions

Next we check any side conditions - e.g. we could decide that the above hint only makes sense if x is atomic - i.e. does not need brackets in any circumstance. We could have expressed that with side: isAtom x, and any such conditions are checked in a fairly straightforward manner.

Substitutions

Finally, we substitute the variables into the provided replacement. When doing the replacement, we keep track of the free variables, and if the resulting expression has more free variables than it started with, we assume the hint doesn't apply cleanly. As an example, consider the hint \x -> a <$> b x to fmap a . b. It looks a perfectly reasonable hint, but what if we apply it to the expression \x -> f <$> g x x. Now b matches g x, but we are throwing away the \x binding and x is now dangling, so we reject it.

When performing the substitution, we used knowledge of the AST we want, and the brackets required to parse that expression, to ensure we insert the right brackets, but not too many.

Bug #1055

Hopefully all the above sounds quite reasonable. Unfortunately, at some point, the root-matching lost the check that it really was at the root, and started applying the translation to terms such as sort . in map f ((sort . map g) xs). Having generated the sort ., it decided since it wasn't at the root, there was nowhere for it to go, so promptly threw it away. Oops. HLint v3.1.5 fixes the bug in two distinct ways (for defence in depth):

  1. It checks the root boolean before doing the root matching rule.
  2. If it would have to throw away any extra expression, it fails, as throwing away that expression is certain to lead to a correctness bug.

Conclusion

The matching engine of HLint is relatively complex, but I always assumed one day would be replaced with a finite-state-machine scanner that could match n hints against an expression in O(size-of-expression), rather than the current O(n * size-of-expression). However, it's never been the bottleneck, so I've kept with the more direct version.

I'm glad HLint has a simple external lint format. It allows easy contributions and makes hint authoring accessible to everyone. For large projects it's easy to define your own hints to capture common coding patterns. When using languages whose linter does not have an external matching language (e.g. Rust's Clippy) I certainly miss the easy customization.

Tuesday, June 09, 2020

Hoogle Searching Overview

Summary: Hoogle 5 has three interesting parts, a pipeline, database and search algorithm.

The Haskell search engine Hoogle has gone through five major designs, the first four of which are described in these slides from TFP 2011. Hoogle version 5 was designed to be a complete rewrite which simplified the design and allowed it to scale to all of Hackage. All versions of Hoogle have had some preprocessing step which consumes Haskell definitions, and writes out a data file. They then have the search phase which uses that data file to perform searches. In this post I'll go through three parts -- what the data file looks like, how we generate it, and how we search it. When we consider these three parts, the evolution of Hoogle can be seen as:

  • Versions 1-3, produce fairly simple data files, then do an expensive search on top. Fails to scale to large sizes.
  • Version 4, produce a very elaborate data files, aiming to search quickly on top. Failed because producing the data file required a lot of babysitting and a long time, so was updated very rarely (yearly). Also, searching a complex data file ends up with a lot of corner cases which have terrible complexity (e.g. a -> a -> a -> a -> a would kill the server).
  • Version 5, generate very simple data files, then do O(n) but small-constant multiplier searching on top. Update the files daily and automatically. Make search time very consistent.

Version 5 data file

By version 5 I had realised that deserialising the data file was both time consuming and memory hungry. Therefore, in version 5, the data file consists of chunks of data that can be memory-mapped into Vector and ByteString chunks using a ForeignPtr underlying storage. The OS figures out which bits of the data file should be paged in, and there is very little overhead or complexity on the Haskell side. There is a small index structure at the start of the data file which says where these interesting data structures live, and gives them identity using types. For example, to store information about name search we have three definitions:

data NamesSize a where NamesSize :: NamesSize Int
data NamesItems a where NamesItems :: NamesItems (V.Vector TargetId)
data NamesText a where NamesText :: NamesText BS.ByteString

Namely, in the data file we have NamesSize which is an Int, NamesItems which is a Vector TargetId, and NamesText which is a ByteString. The NamesSize is the maximum number of results that can be returned from any non-empty search (used to reduce memory allocation for the result structure), the NamesText is a big string with \0 separators between each entry, and the NamesItems are the identifiers of the result for each name, with as many entries as there are \0 separators.

The current data file is 183Mb for all of Stackage, of which 78% of that is the information about items (documentation, enough information to render them, where the links go etc - we then GZip this information). There are 21 distinct storage types, most involved with type search.

Generating the data file

Generating the data file is done in four phases.

Phase 0 downloads the inputs, primarily a .tar.gz file containing all .cabal files, and another containing all the Haddock Hoogle outputs. These .tar.gz files are never unpacked, but streamed through and analysed using conduit.

Phase 1 reads through all the .cabal files to get metadata about each package - the author, tags, whether it's in Stackage etc. It stores this information in a Map. This phase takes about 7s and uses 100Mb of memory.

Phase 2 reads through every definition in every Haddock Hoogle output (the .txt files --hoogle generates). It loads the entry, parses it, processes it, and writes most of the data to the data file, assigning it a TargetId. That TargetId is the position of the item in the data file, so it's unique, and can be used to grab the relevant item when we need to display it while searching. During this time we collect the unique deduplicated type signatures and names, along with the TargetId values. This phase takes about 1m45s and has about 900Mb of memory at the end. The most important part of phase 2 is not to introduce a space leak, since then memory soars to many Gb.

Phase 3 processes the name and type maps and writes out the information used for searching. This phase takes about 20s and consumes an additional 250Mb over the previous phase.

Since generating the data file takes only a few minutes, there is a nightly job that updates the data file at 8pm every night. The job takes about 15 minutes in total, because it checks out a new version of Hoogle from GitHub, builds it, downloads all the data files, generates a data file, runs the tests, and then restarts the servers.

Searching

Hoogle version 5 works on the principle that it's OK to be O(n) if the constant is small. For textual search, we have a big flat ByteString, and give that to some C code that quickly looks for the substring we enter, favouring complete and case-matching matches. Such a loop is super simple, and at the size of data we are working with (about 10Mb), plenty fast enough.

Type search is inspired by the same principle. We deduplicate types, then for each type, we produce an 18 byte fingerprint. There are about 150K distinct type signatures in Stackage, so that results in about 2.5Mb of fingerprints. For every type search we scan all those fingerprints and figure out the top 100 matches, then do a more expensive search on the full type for those top 100, producing a ranking. For a long time (a few years) I hadn't even bothered doing the second phase of more precise matching, and it still gave reasonable results. (In fact, I never implemented the second phase, but happily Matt Noonan contributed it.)

A type fingerprint is made up of three parts:

  • 1 byte being the arity of the function. a -> b -> c would have arity 3.
  • 1 byte being the number of constructors/variables in the type signature. Maybe a -> a would have a value of 3.
  • The three rarest names in the function. E.g. A -> B -> C -> D would compare how frequent each of A, B, C and D were in the index of functions, and record the 3 rarest. Each name is given a 32 bit value (where 0 is the most common and 2^32 is the rarest).

The idea of arity and number of constructors/variables is to try and get an approximate shape fit to the type being search for. The idea of the rarest names is an attempt to take advantage that if you are searching for ShakeOptions -> [a] -> [a] then you probably didn't write ShakeOptions by accident -- it provides a lot of signal. Therefore, filtering down to functions that mention ShakeOptions probably gives a good starting point.

Once we have the top 100 matches, we can then start considering whether type classes are satisfied, whether type aliases can be expanded, what the shape of the actual function is etc. By operating on a small and bounded number of types we can do much more expensive comparisons than if we had to apply them to every possible candidate.

Conclusion

Hoogle 5 is far from perfect, but the performance is good, the scale can keep up with the growth of Haskell packages, and the simplicity has kept maintenance low. The technique of operations which are O(n) but with a small constant is one I've applied in other projects since, and I think is an approach often overlooked.

Sunday, June 07, 2020

Surprising IO: How I got a benchmark wrong

Summary: IO evaluation caught me off guard when trying to write some benchmarks.

I once needed to know a quick back-of-the-envelope timing of a pure operation, so hacked something up quickly rather than going via criterion. The code I wrote was:

main = do
    (t, _) <- duration $ replicateM_ 100 $ action myInput
    print $ t / 100

{-# NOINLINE action #-}
action x = do
    evaluate $ myOperation x
    return ()

Reading from top to bottom, it takes the time of running action 100 times and prints it out. I deliberately engineered the code so that GHC couldn't optimise it so myOperation was run only once. As examples of the defensive steps I took:

  • The action function is marked NOINLINE. If action was inlined then myOperation x could be floated up and only run once.
  • The myInput is given as an argument to action, ensuring it can't be applied to myOperation at compile time.
  • The action is in IO so the it has to be rerun each time.

Alas, GHC still had one trick up its sleeve, and it wasn't even an optimisation - merely the definition of evaluation. The replicateM_ function takes action myInput, which is evaluated once to produce a value of type IO (), and then runs that IO () 100 times. Unfortunately, in my benchmark myOperation x is actually evaluated in the process of creating the IO (), not when running the IO (). The fix was simple:

action x = do
    _ <- return ()
    evaluate $ myOperation x
    return ()

Which roughly desugars to to:

return () >>= \_ -> evaluate (myOperation x)

Now the IO produced has a lambda inside it, and my benchmark runs 100 times. However, at -O2 GHC used to manage to break this code once more, by lifting myOperation x out of the lambda, producing:

let y = myOperation x in return () >>= \_ -> evaluate y

Now myOperation runs just once again. I finally managed to defeat GHC by lifting the input into IO, giving:

action x = do
    evaluate . myOperation =<< x
    return ()

Now the input x is itself in IO, so myOperation can't be hoisted.

I originally wrote this post a very long time ago, and back then GHC did lift myOperation out from below the lambda. But nowadays it doesn't seem to do so (quite possibly because doing so might cause a space leak). However, there's nothing that promises GHC won't learn this trick again in the future.

Sunday, May 31, 2020

HLint --cross was accidentally quadratic

Summary: HLint --cross was accidentally quadratic in the number of files.

One of my favourite blogs is Accidentally Quadratic, so when the Haskell linter HLint suffered such a fate, I felt duty bound to write it up. Most HLint hints work module-at-a-time (or smaller scopes), but there is one hint that can process multiple modules simultaneously - the duplication hint. If you write a sufficiently large repeated fragment in two modules, and pass --cross, then this hint will detect the duplication. The actual application of hints is HLint is governed by:

applyHintsReal :: [Setting] -> Hint -> [ModuleEx] -> [Idea]

Given a list of settings, a list of hints (which gets merged to a single composite Hint) and a list of modules, produce a list of ideas to suggest. Usually this function is called in parallel with a single module at a time, but when --cross is passed, all the modules being analysed get given in one go.

In HLint 3, applyHintsReal became quadratic in the number of modules. When you have 1 module, 1^2 = 1, and everything works fine, but --cross suffers a lot. The bug was simple. Given a Haskell list comprehension:

[(a,b) | a <- xs, b <- xs]

When given the list xs of [1,2] you get back the pairs [(1,1),(1,2),(2,1),(2,2)] - the cross product, which is quadratic in the size of xs. The real HLint code didn't look much different:

[ generateHints m m'
| m <- ms
, ...
, (nm',m') <- mns'
, ...
]
where
    mns' = map (\x -> (scopeCreate (GHC.unLoc $ ghcModule x), x)) ms

We map over ms to create mns' containing each module with some extra information. In the list comprehension we loop over each module ms to get m, then for each m in ms, loop over mns' to get m'. That means you take the cross-product of the modules, which is quadratic.

How did this bug come about? HLint used to work against haskell-src-exts (HSE), but now works against the GHC parser. We migrated the hints one by one, changing HLint to thread through both ASTs, and then each hint could pick which AST to use. The patch that introduced this behaviour left ms as the HSE AST, and made mns' the GHC AST. It should have zipped these two together, so for each module you have the HSE and GHC AST, but accidentally took the cross-product.

How did we spot it? Iustin Pop filed a bug report noting that each hint was repeated once per file being checked and performance had got significantly worse, hypothesising it was O(n^2). Iustin was right!

How did we fix it? By the time the bug was spotted, the HSE AST had been removed entirely, and both m and m' were the same type, so deleting one of the loops was easy. The fix is out in HLint version 3.1.4.

Should I be using --cross? If you haven't heard about --cross in HLint, I don't necessarily suggest you start experimenting with it. The duplicate detection hints are pretty dubious and I think most people would be better suited with a real duplicate detection tool. I've had good experiences with Simian in the past.

Wednesday, May 27, 2020

Fixing Space Leaks in Ghcide

Summary: A performance investigation uncovered a memory leak in unordered-containers and performance issues with Ghcide.

Over the bank holiday weekend, I decided to devote some time to a possible Shake build system performance issue in Ghcide Haskell IDE. As I started investigating (and mostly failed) I discovered a space leak which I eventually figured out, solved, and then (as a happy little accident) got a performance improvement anyway. This post is a tale of what I saw, how I tackled the problem, and how I went forward. As I'm writing the post, not all the threads have concluded. I wrote lots of code during the weekend, but most was only to experiment and has been thrown away - I've mostly left the code to the links. Hopefully the chaotic nature of development shines through.

Shake thread-pool performance

I started with a Shake PR claiming that simplifying the Shake thread pool could result in a performance improvement. Faster and simpler seems like a dream combination. Taking a closer look, simpler seemed like it was simpler because it supported less features (e.g. ability to kill all threads when one has an exception, some fairness/scheduling properties). But some of those features (e.g. better scheduling) were in the pursuit of speed, so if a simpler scheduler was 30% faster (the cost of losing randomised scheduling), that might not matter.

The first step was to write a benchmark. It's very hard to synthesise a benchmark that measures the right thing, but spawning 200K short tasks into the thread pool seemed a plausible start. As promised on the PR, the simpler version did indeed run faster. But interestingly, the simplifications weren't really responsible for the speed difference - switching from forkIO to forkOn explained nearly all the difference. I'm not that familiar with forkOn, so decided to micro-benchmark it - how long does it take to spawn off 1M threads with the two methods. I found two surprising results:

  • The performance of forkOn was quadratic! A GHC bug explains why - it doesn't look too hard to fix, but relying on forkOn is unusual, so its unclear if the fix is worth it.
  • The performance of forkIO was highly inconsistent. Often it took in the region of 1 second. Sometimes it was massively faster, around 0.1s. A StackOverflow question didn't shed much light on why, but did show that by using the PVar concurrency primitive it could be 10x faster. There is a GHC bug tracking the issue, and it seems as though the thread gets created them immediately switches away. There is a suggestion from Simon Peyton Jones of a heuristic that might help, but the issue remains unsolved.

My desire to switch the Shake thread-pool to a quadratic primitive which is explicitly discouraged is low. Trying to microbenchmark with primitives that have inconsistent performance is no fun. The hint towards PVar is super interesting, and I may follow up on it in future, but given the remarks in the GHC tickets I wonder if PVar is merely avoiding one small allocation, and avoiding an allocation avoids a context switch, so it's not a real signal.

At this point I decided to zoom out and try benchmarking all of Ghcide.

Benchmarking Ghcide

The thread about the Shake thread pool pointed at a benchmarking approach of making hover requests. I concluded that making a hover request with no file changes would benchmark the part of Shake I thought the improved thread-pool was most likely to benefit. I used the Shake source code as a test bed, and opened a file with 100 transitive imports, then did a hover over the listToMaybe function. I know that will require Shake validating that everything is up to date, and then doing a little bit of hover computation.

I knew I was going to be running Ghcide a lot, and the Cabal/Stack build steps are frustratingly slow. In particular, every time around Stack wanted to unregister the Ghcide package. Therefore, I wrote a simple .bat file that compiled Ghcide and my benchmark using ghc --make. So I could experiment quickly with changes to Shake, I pulled in all of Shake as source, not as a separate library, with an include path. I have run that benchmark 100's of times, so the fact it is both simple (no arguments) and as fast as I could get has easily paid off.

For the benchmark itself, I first went down the route of looking at the replay functionality in lsp-test. Sadly, that code doesn't link to anything that explains how to generate traces. After asking on the haskell-ide-engine IRC I got pointed at both the existing functionality of resCaptureFile. I also got pointed at the vastly improved version in a PR, which doesn't fail if two messages race with each other. Configuring that and running it on my benchmark in the IDE told me that the number of messages involved was tiny - pretty much an initialisation and then a bunch of hovers. Coding those directly in lsp-test was trivial, and so I wrote a benchmark. The essence was:

doc <- openDoc "src/Test.hs" "haskell"
(t, _) <- duration $ replicateM_ 100 $
    getHover doc $ Position 127 43
print t

Open a document. Send 100 hover requests. Print the time taken.

Profiling Ghcide

Now I could run 100 hovers, I wanted to use the GHC profiling mechanisms. Importantly, the 100 hover requests dominates the loading by a huge margin, so profiles would focus on the right thing. I ran a profile, but it was empty. Turns out the way lsp-test invokes the binary it is testing means it kills it too aggressively to allow GHC to write out profiling information. I changed the benchmark to send a shutdown request at the end, then sleep, and changed Ghcide to abort on a shutdown, so it could write the profiling information.

Once I had the profiling information, I was thoroughly uniformed. 10% went in file modification checking, which could be eliminated. 10% seemed to go to hash table manipulations, which seemed on the high side, but not too significant (turned out I was totally wrong, read to the end!). Maybe 40% went in the Shake monad, but profiling exaggerates those costs significantly, so it's unclear what the truth is. Nothing else stood out, but earlier testing when profiling forkIO operations had shown they weren't counted well, so that didn't mean much.

Prodding Ghcide

In the absence of profiling data, I started changing things and measuring the performance. I tried a bunch of things that made no difference, but some things did have an impact on the time to do 100 hovers:

  • Running normally: 9.77s. The baseline.
  • Switching to forkOn: 10.65s. Suggestive that either Ghcide has changed, or the project is different, or platform differences mean that forkOn isn't as advantageous.
  • Using only one Shake thread: 13.65s. This change had been suggested in one ticket, but made my benchmark worse.
  • Avoid spawning threads for things I think will be cheap: 7.49s. A useful trick, and maybe one that will be of benefit in future, but for such a significant change a 25% performance reduction seemed poor.
  • Avoid doing any Shake invalidation: 0.31s. An absolute lower bound if Shake cheats and does nothing.

With all that, I was a bit dejected - performance investigation reveals nothing of note was not a great conclusion from a days work. I think that other changes to Ghcide to run Shake less and cache data more will probably make this benchmark even less important, so the conclusion worsens - performance investigation of nothing of note reveals nothing of note. How sad.

But in my benchmark I did notice something - a steadily increasing memory size in process explorer. Such issues are pretty serious in an interactive program, and we'd fixed several issues recently, but clearly there were more. Time to change gears.

Space leak detection

Using the benchmark I observed a space leak. But the program is huge, and manual code inspection usually needs a 10 line code fragment to have a change. So I started modifying the program to do less, and continued until the program did as little as it could, but still leaked space. After I fixed a space leak, I zoomed out and saw if the space leak persisted, and then had another go.

The first investigation took me into the Shake Database module. I found that if I ran the Shake script to make everything up to date, but did no actions inside, then there was a space leak. Gradually commenting out lines (over the course of several hours) eventually took me to:

step <- pure $ case v of
    Just (_, Loaded r) -> incStep $ fromStepResult r
    _ -> Step 1

This code increments a step counter on each run. In normal Shake this counter is written to disk each time, which forces the value. In Ghcide we use Shake in memory, and nothing ever forced the counter. The change was simple - replace pure with evaluate. This fix has been applied to Shake HEAD.

Space leak detection 2

The next space leak took me to the Shake database reset function, which moves all Shake keys from Ready to Loaded when a new run starts. I determined that if you didn't run this function, the leak went away. I found a few places I should have put strictness annotations, and a function that mutated an array lazily. I reran the code, but the problem persisted. I eventually realised that if you don't call reset then none of the user rules run either, which was really what was fixing the problem - but I committed the improvements I'd made even though they don't fix any space leaks.

By this point I was moderately convinced that Shake wasn't to blame, so turned my attention to the user rules in Ghcide. I stubbed them out, and the leak went away, so that looked plausible. There were 8 types of rules that did meaningful work during the hover operation (things like GetModificationTime, DoesFileExist, FilesOfInterest). I picked a few in turn, and found they all leaked memory, so picked the simple DoesFileExist and looked at what it did.

For running DoesFileExist I wrote a very quick "bailout" version of the rule, equivalent to the "doing nothing" case, then progressively enabled more bits of the rule before bailing out, to see what caused the leak. The bailout looked like:

Just v <- getValues state key file
let bailout = Just $ RunResult ChangedNothing old $ A v

I progressively enabled more and more of the rule, but even with the whole rule enabled, the leak didn't recur. At that point, I realised I'd introduced a syntax error and that all my measurements for the last hour had been using a stale binary. Oops. I span up a copy of Ghcid, so I could see syntax errors more easily, and repeated the measurements. Again, the leak didn't recur. Very frustrating.

At that point I had two pieces of code, one which leaked and one which didn't, and the only difference was the unused bailout value I'd been keeping at the top to make it easier to quickly give up half-way through the function. Strange though it seemed, the inescapable conclusion was that getValues must somehow be fixing the space leak.

If getValues fixes a leak, it is a likely guess that setValues is causing the leak. I modified setValues to also call getValues and the problem went away. But, after hours of staring, I couldn't figure out why. The code of setValues read:

setValues state key file val = modifyVar_ state $ \vals -> do
    evaluate $ HMap.insert (file, Key key) (fmap toDyn val) vals

Namely, modify a strict HashMap from unordered-containers, forcing the result. After much trial and error I determined that a "fix" was to add:

case HMap.lookup k res of
    Nothing -> pure ()
    Just v -> void $ evaluate v

It's necessary to insert into the strict HashMap, then do a lookup, then evaluate the result that comes back, or there is a space leak. I duly raised a PR to Ghcide with the unsatisfying comment:

I'm completely lost, but I do have a fix.

It's nice to fix bugs. It's better to have some clue why a fix works.

Space leak in HashMap

My only conclusion was that HashMap must have a space leak. I took a brief look at the code, but it was 20+ lines and nothing stood out. I wrote a benchmark that inserted billions of values at 1000 random keys, but it didn't leak space. I puzzled it over in my brain, and then about a day later inspiration struck. One of the cases was to deal with collisions in the HashMap. Most HashMap's don't have any collisions, so a bug hiding there could survive a very long time. I wrote a benchmark with colliding keys, and lo and behold, it leaked space. Concretely, it leaked 1Gb/s, and brought my machine to its knees. The benchmark inserted three keys all with the same hash, then modified one key repeatedly. I posted the bug to the unordered-containers library.

I also looked at the code, figured out why the space leak was occurring, and a potential fix. However, the fix requires duplicating some code, and its likely the same bug exists in several other code paths too. The Lazy vs Strict approach of HashMap being dealt with as an outer layer doesn't quite work for the functions in question. I took a look at the PR queue for unordered-containers and saw 29 requests, with the recent few having no comments on them. That's a bad sign and suggested that spending time preparing a PR might be in vain, so I didn't.

Aside: Maintainers get busy. It's no reflection negatively on the people who have invested lots of time on this library, and I thank them for their effort! Given 1,489 packages on Hackage depend on it, I think it could benefit from additional bandwidth from someone.

Hash collisions in Ghcide

While hash collisions leading to space leaks is bad, having hash collisions at all is also bad. I augmented the code in Ghcide to print out hash collisions, and saw collisions between ("Path.hs", Key GetModificationTime) and ("Path.hs", Key DoesFileExist). Prodding a bit further I saw that the Hashable instance for Key only consulted its argument value, and given most key types are simple data Foo = Foo constructions, they all had the same hash. The solution was to mix in the type information stored by Key. I changed to the definition:

hashWithSalt salt (Key key) = hashWithSalt salt (typeOf key) `xor` hashWithSalt salt key

Unfortunately, that now gave hash collisions with different paths at the same key. I looked into the hashing for the path part (which is really an lsp-haskell-types NormalizedFilePath) and saw that it used an optimised hashing scheme, precomputing the hash, and returning it with hash. I also looked at the hashable library and realised the authors of lsp-haskell-types hadn't implemented hashWithSalt. If you don't do that, a generic instance is constructed which deeply walks the data structure, completely defeating the hash optimisation. A quick PR fixes that.

I also found that for tuples, the types are combined by using the salt argument. Therefore, to hash the pair of path information and Key, the Key hashWithSalt gets called with the hash of the path as its salt. However, looking at the definition above, you can imagine that both hashWithSalt of a type and hashWithSalt of a key expand to something like:

hashWithSalt salt (Key key) = salt `xor` hash (typeOf key) `xor` (salt `xor` 0)

Since xor is associative and commutative, those two salt values cancel out! While I wasn't seeing complete cancellation, I was seeing quite a degree of collision, so I changed to:

hashWithSalt salt (Key key) = hashWithSalt salt (typeOf key, key)

With that fix in Ghcide, all collisions went away, and all space leaks left with them. I had taken this implementation of hash combining from Shake, and while it's not likely to be a problem in the setting its used there, I've fixed it in Shake too.

Benchmarking Ghcide

With the hash collisions reduced, and the number of traversals when computing a hash reduced, I wondered what the impact was on performance. A rerun of the original benchmark showed the time had reduced to 9.10s - giving a speed up of about 5%. Not huge, but welcome.

Several days later we're left with less space leaks, more performance, and hopefully a better IDE experience for Haskell programmers. I failed in what I set out to do, but found some other bugs along the way, leading to 9 PRs/commits and 4 ongoing issues. I'd like to thank everyone in the Haskell IDE team for following along, making suggestions, confirming suspicions, and generally working as a great team. Merging the Haskell IDE efforts continues to go well, both in terms of code output, and team friendliness.

Saturday, May 23, 2020

Shake 0.19 - changes to process execution

Summary: The new version of Shake has some tweaks to how stdin works with cmd.

I've just released Shake 0.19, see the full change log. Most of the interesting changes in this release are around the cmd/command functions, which let you easily run command lines. As an example, Shake has always allowed:

cmd "gcc -c" [source] "-o" [output]

This snippet compiles a source file using gcc. The cmd function is variadic, and treats strings as space-separated arguments, and lists as literal arguments. It's overloaded by return type, so can work in the IO monad (entirely outside Shake) or the Shake Action monad (inside Shake). You can capture results and pass in options, e.g. to get the standard error and run in a different directory, you can do:

Stderr err <- cmd "gcc -c" [source] "-o" [output] (Cwd "src")

Shake is a dynamic build system with advanced dependency tracking features that let's you write your rules in Haskell. It just so happens that running commands is very common in build systems, so while not really part of a build system, it's a part of Shake that has had a lot of work done on it. Since the command is both ergonomic and featureful, I've taken to using the module Develoment.Shake.Command in non-Shake related projects.

Recent cmd changes

The first API breaking change only impacts users of the file access tracing. The resulting type is now polymorphic, and if you opt to for the FSATrace ByteString, you'll get your results a few milliseconds faster. Even if you stick with FSATrace FilePath, you'll get your results faster than the previous version. Performance of tracing happened to matter for a project I've been working on :-).

The other changes in this release are to process groups and the standard input. In Shake 0.18.3, changes were made to switch to create_group=True in the process library, as that improves the ability to cancel actions and clean up sub-processes properly. Unfortunately, on Linux that caused processes that read from standard input to hang. The correlation between these events, and the exact circumstances that triggered it, took a long time to track down - thanks to Gergő Érdi for some excellent bisection work. Most processes that are run in a build system should not access the standard input, and the only reports have come from docker (don't use -i) and ffmpeg (pass -nostdin), but hanging is a very impolite way to fail. In older versions of Shake we inherited the Shake stdin to the child (unless you specified the stdin explicitly with Stdin), but now we create a new pipe with no contents. There are now options NoProcessGroup and InheritStdin which let you change these settings independently. I suspect a handful of commands will need flag tweaks to stop reading the stdin, but they will probably fail saying the stdin is inaccessible, so debugging it should be relatively easy.

In another tale of cmd not working how you might hope, in Shake 0.15.2 we changed cmd to close file handles when spawning a process. Unfortunately, that step is O(n) in the number of potential handles on your system, where n is RLIMIT_NOFILE and can be quite big, so we switched back in 0.18.4. Since 0.18.4 you can pass CloseFileHandles if you definitely want handles to be closed. It's been argued that fork is a bad design, and this performance vs safety trade-off seems another point in favour of that viewpoint.

The amount of work that has gone into processes, especially around timeout and cross-platform differences, has been huge. I see 264 commits to these files, but the debugging time associated with them has been many weeks!

Other changes

This release contains other little tweaks that might be useful:

  • Time spent in the batch function is better accounted for in profiles.
  • Finally deleted the stuff that has been deprecated since 2014, particularly the *> operator. I think a six year deprecation cycle seems more than fair for a pre-1.0 library.
  • Optimised modification time on Linux.

Wednesday, May 20, 2020

GHC Unproposals

Summary: Four improvements to Haskell I'm not going to raise as GHC proposals.

Writing a GHC proposal is a lot of hard work. It requires you to fully flesh out your ideas, and then defend them robustly. That process can take many months. Here are four short proposals that I won't be raising, but think would be of benefit (if you raise a proposal for one of them, I'll buy you a beer next time we are physically co-located).

Use : for types

In Haskell we use : for list-cons and :: for types. That's the wrong way around - the use of types is increasing, the use of lists is decreasing, and type theory has always used :. This switch has been joke-proposed before. We actually switched these operators in DAML, and it worked very nicely. Having written code in both styles, I now write Haskell on paper with : for types instead of ::. Nearly all other languages use : for types, even Python. It's sad when Python takes the more academically pure approach than Haskell.

Is it practical: Maybe. The compiler diff is quite small, so providing it as an option has very little technical cost. The problem is it bifurcates the language - example code will either work with : for types or :: for types. It's hard to write documentation, text books etc. If available, I would switch my code.

Make recursive let explicit

Currently you can write let x = x + 1 and it means loop forever at runtime because x is defined in terms of itself. You probably meant to refer to the enclosing x, but you don't get a type error, and often don't even get a good runtime error message, just a hang. In do bindings, to avoid the implicit reference to self, it's common to write x <- pure $ x + 1. That can impose a runtime cost, and obscure the true intent.

In languages like OCaml there are two different forms of let - one which allows variables to be defined and used in a single let (spelled let rec) and one which doesn't (spelled let). Interestingly, this distinction is important in GHC Core, which has two different keywords, and a source let desugars differently based on whether it is recursive. I think Haskell should add letrec as a separate keyword and make normal let non-recursive. Most recursive bindings are done under a where, and these would continue to allow full recursion, so most code wouldn't need changing.

Is it practical: The simplest version of this proposal would be to add letrec as a keyword equivalent to let and add a warning on recursive let. Whether it's practical to go the full way and redefine the semantics of let to mean non-recursive binding depends on how strong the adoption of letrec was, but given that I suspect recursive let is less common, it seems like it could work. Making Haskell a superset of GHC Core is definitely an attractive route to pursue.

Allow trailing variables in bind

When writing Haskell code, I often have do blocks that I'm in the middle of fleshing out, e.g.:

do fileName <- getLine
   src <- readFile fileName

My next line will be to print the file or similar, but this entire do block, and every sub-part within it, is constantly a parse error until I put in that final line. When the IDE has a parse error, it can't really help me as much as I'd like. The reason for the error is that <- can't be the final line of a do block. I think we should relax that restriction, probably under a language extension that only IDE's turn on. It's not necessarily clear what such a construct should mean, but in many ways that isn't the important bit, merely that such a construct results in a valid Haskell program, and allows more interactive feedback.

Is it practical: Yes, just add a language extension - since it doesn't actually enable any new power it's unlikely to cause problems. Fleshing out the semantics, and whether it applies to let x = y statements in a do block too, is left as an exercise for the submitter. An alternative would be to not change the language, but make GHC emit the error slightly later on, much like -fdefer-type-errors, which still works for IDEs (either way needs a GHC proposal).

Add an exporting keyword

Currently the top of every Haskell file duplicates all the identifiers that are exported - unless you just export everything (which you shouldn't). That approach duplicates logic, makes refactorings like renamings more effort, and makes it hard to immediately know if the function you are working on is exposed. It would be much nicer if you could just declare things that were exported inline, e.g. with a pub keyword - so pub myfunc :: a -> a both defines and exports myfunc. Rust has taken this approach and it works out quite well, modulo some mistakes. The currently Haskell design has been found a bit wanting, with constructs like pattern Foo in the export list to differentiate when multiple names Foo might be in scope, when attaching the visibility to the identifier would be much easier.

Is it practical: Perhaps, provided someone doesn't try and take the proposal too far. It would be super tempting to differentiate between exports to of the package, and exports that are only inside this package (what Rust clumsily calls pub(crate)). And there are other things in the module system that could be improved. And maybe we should export submodules. I suspect everyone will want to pile more things into this design, to the point it breaks, but a simple exporting keyword would probably be viable.

Friday, May 15, 2020

File Access Tracing

Summary: It is useful to trace files accessed by a command. Shake and FSATrace provide some tools to do that.

When writing a build system, it's useful to see which files a command accesses. In the Shake build system, we use that information for linting, an auto-deps feature and a forward build mode. What we'd like is a primitive which when applied to a command execution:

  1. Reports which files are read/written.
  2. Reports the start and end time for when the files were accessed.
  3. Reports what file metadata is accessed, e.g. modification times and directory listing.
  4. Lets us pause a file access (so the dependencies can be built) or deny a file access (so dependency violations can be rejected early).
  5. Is computationally cheap.
  6. Doesn't require us to write/maintain too much low-level code.
  7. Works on all major OSs (Linux, Mac, Windows).
  8. Doesn't require sudo or elevated privilege levels.

While there are lots of approaches to tracing that get some of those features, it is currently impossible to get them all. Therefore, Shake has to make compromises. The first fours bullet points are about features -- we give up on 2 (timestamps) and 4 (pause/deny); 1 (read/writes) is essential, and we make 3 (metadata) optional, using the imperfect information when its available and tolerating its absence. The last four bullet points are about how it works -- we demand 7 (compatibility) and 8 (no sudo) because Shake must be easily available to its users. We strive for 5 (cheap) and 6 (easy), but are willing to compromise a bit on both.

Shake abstracts the result behind the cmd function with the FSATrace return type. As an example I ran in GHCi:

traced :: [FSATrace] <- cmd "gcc -c main.c"
print traced

Which compiles main.c with gcc, and on my machine prints 71 entries, including:

[ FSARead "C:\\ghc\\ghc-8.6.3\\mingw\\bin\\gcc.exe"
, FSARead "C:\\Neil\\temp\\main.c"
, FSAWrite "C:\\Users\\ndmit_000\\AppData\\Local\\Temp\\ccAadCiR.s"
, FSARead "C:\\ghc\\ghc-8.6.3\\mingw\\bin\\as.exe"
, FSARead "C:\\Users\\ndmit_000\\AppData\\Local\\Temp\\ccAadCiR.s"
, FSAWrite "C:\\Neil\\temp\\main.o"
, ...
]

Most of the remaining entries are dlls that gcc.exe uses, typically from the Windows directory. I've reordered the list to show the flow more clearly. First the process reads gcc.exe (so it can execute it), which reads main.c and writes a temporary file ccAadCiR.s. It then reads as.exe (the assembler) so it can run it, which in turn reads ccAadCiR.s and writes main.o.

Under the hood, Shake currently uses FSATrace, but that is an implementation detail -- in particular the BigBro library might one day also be supported. In order to understand the limitations of the above API, it's useful to understand the different approaches to file system tracing, and which ones FSATrace uses.

Syscall tracing On Linux, ptrace allows tracing every system call made, examining the arguments, and thus recording the files accessed. Moreover, by tracing the stat system call even file queries can be recorded. The syscall tracking approach can be made complete, but because every syscall must be hooked, can end up imposing high overhead. This approach is used by BigBro as well as numerous other debugging and instrumentation tools.

Library preload On both Linux and Mac most programs use a dynamically linked C library to make file accesses. By using LD_LIBRARY_PRELOAD it is possible to inject a different library into the program memory which intercepts the relevant C library calls, recording which files are read and written. This approach is simpler than hooking syscalls, but only works if all syscall access is made through the C library. While normally true, that isn't the case for Go programs (syscalls are invoked directly) or statically linked programs (the C library cannot be replaced).

While the technique works on a Mac, from Mac OS X 1.10 onwards system binaries can't be traced due to System Integrity Protection. As an example, the C compiler is typically installed as a system binary. It is possible to disable System Integrity Protection (but not recommended by Apple); or to use non-system binaries (e.g. those supplied by Nix); or to copy the system binary to a temporary directory (which works provided the binary does not afterwards invoke another system binary). The library preload mechanism is implemented by FSATrace and the copying system binaries trick on Mac is implemented in Shake.

File system tracing An alternative approach is to implement a custom file system and have that report which files are accessed. One such implementation for Linux is TracedFS, which is unfortunately not yet complete. Such an approach can track all accesses, but may require administrator privileges to mount a file system.

Custom Linux tracing On Linux, thanks to the open-source nature of the kernel, there are many custom file systems (e.g FUSE) and tracing mechanisms (e.g. eBPF), many of which can be used/configured/extended to perform some kind of system tracing. Unfortunately, most of these are restricted to Linux only.

Custom Mac tracing BuildXL uses a Mac sandbox based on KAuth combined with TrustedBSD Mandatory Access Control (MAC) to both detect which files are accessed and also block access to specific files. The approach is based on internal Mac OS X details which have been reversed engineered, some of which are deprecated and scheduled for removal.

Windows Kernel API hooking On Windows it is possible to hook the Kernel API, which can be used to detect when any files are accessed. Implementing such a hook is difficult, particularly around 32bit v 64bit differences, as custom assembly language trampolines must be used. Furthermore, some antivirus products (incorrectly) detect such programs as viruses. Windows kernel hooking is available in both FSATrace and BigBro (sharing the same source code), although without support for 32bit processes that spawn 64bit processes.

Current State

Shake currently uses FSATrace, meaning it uses library preloading on Linux/Mac and kernel hooking on Windows. The biggest practical limitations vary by OS:

  • On Linux it can't trace into Go programs (or other programs that use system calls directly) and statically linked binaries. Integrating BigBro as an alternative would address these issues.
  • On Mac it can't trace into system binaries called from other system binaries, most commonly the system C/C++ compiler. Using your own C/C++ installation, via Homebrew or Nix, is a workaround.
  • On Windows it can't trace 64bit programs spawned by 32bit programs. In most cases the 32bit binaries can easily be replaced by 64bit binaries. The only problem I've seen was caused by a five year-old version of sh hiding out in my C:\bin directory, which was easily remedied with a newer version. The code to fix this issue is available, but scares me too much to try integrating.

Overall, the tracing available in Shake has a simple API, is very useful for Shake, and has been repurposed in other build systems. But I do dearly wish such functionality could be both powerful and standardised!

Sunday, May 03, 2020

HLint 3.0

Summary: HLint 3.0 uses the GHC parser.

In June 2019 I posted about our intention to move HLint to the GHC parser. Since then a small group of us have been hard at work making the conversion -- first by parsing with both GHC and haskell-src-exts, and finally, with the newly released HLint 3.0, parsing only with GHC. As of now, if your code can be parsed with GHC, it can probably be parsed with HLint. As new GHC releases come out, with new features and new forms of syntax, HLint will follow along closely.

The change list for this release records 51 separate items, which is about as many as the last nine HLint releases combined. Of those changes, 11 are breaking changes (the ones marked with *). That count omits all the hint conversions, which (hopefully!) aren't user visible. The main API breaks are in the Language.Haskell.HLint API, which has switched from haskell-src-exts types to GHC ones. You can now take a GHC syntax tree and apply HLint to it, or (as before) give HLint the source and have it do the parsing for you. We also took the opportunity to simplify the API while we were at it -- but the underlying functionality remains much the same. We also deleted a small number of command line flags that were no longer useful, and were never used very much. If you have difficulty converting to the new API or relied on some removed functionality, raise a bug.

What was especially nice about this conversion process, and the development of HLint in general, is that it is increasingly becoming a real team, where my role is more reviewer than coder. There have been 21 distinct contributors since the start of the GHC conversion, but I'd like to particularly call out a few major pieces of work that have been completed:

  • Shayne Fletcher is responsible for the ghc-lib-parser library that makes it feasible to use a single GHC API across multiple GHC versions. Without that, using the GHC library would be at least double the work (and just wouldn't be feasible). Shayne also did a lot of the conversion, mapping many of the rule types from haskell-src-exts to GHC. These API's are surprisingly different given they have the same underlying representation.
  • Georgi Lyubenov did most of the conversions that Shayne didn't do.
  • Joseph C. Sible has made sure that while efforts were focused on a complete rewrite of the code base, the underlying hints have continued to improve, removing incorrect hints, adding useful additional hints.
  • Ziyang Liu has focused on the refactoring side of HLint. The initial refactoring work was completed by Matthew Pickering as part of GSoC 2015. Since then it's had mild attention at best. Ziyang has stepped into the gap, importantly adding tests, CI and improving the refactorings in lots of places. It now feels like a real part of HLint.

We hope you enjoy HLint 3.0 and beyond!

PS. You may spot we're already on HLint 3.0.2 on Hackage - thanks to Ryan Scott for already finding a few bugs.

Tuesday, April 28, 2020

Writing a fast interpreter

Summary: Interpretation by closure is a lot faster than I expected.

Let's imagine we have an imperative language (expressions, assignments, loops etc.) and we want to write an interpreter for it. What styles of interpreter are there? And how fast do they perform? I was curious, so I wrote a demo with some benchmarks. The full code, in Rust, is available here.

First, let's get a taste of the the mini-language we want to interpret:

x = 100;
y = 12;
while x != 0 {
    y = y + 1 + y + 3;
    x = x - 1;
}

We can immediately translate x and y from being named variables to indices in an array, namely 0 and 1. Once we've done that there are four interpretation techniques that spring to mind (and in brackets their performance on my benchmark):

  1. Interpret the AST directly (2.1s).
  2. Compile from the AST to closures (1.4s).
  3. Compile from the AST to a stream of instructions (1.5s).
  4. Encode those instructions as bytes (1.5s).
  5. Compile to assembly or JIT - I didn't try this approach (it's a lot more work).

All of these are vastly slower than my version written directly in Rust (which takes a mere 0.003s) -- but my benchmark didn't have any real operations in it, so this comparison will be the absolute worst case.

Let's go through the approaches.

Style 1: AST evaluation

One option is to directly interpret the AST. Given a vector named slots representing the variables by index, we need to change the slots as we go. A fragment of the interpreter might look like:

fn f(x: &Expr, slots: &mut Vec<i64>) -> i64 {
    match x {
        Expr::Lit(i) => *i,
        Expr::Var(u) => slots[*u],
        Expr::Add(x, y) => f(x, slots) + f(y, slots),
        ...
    }
}

It's as simple as the options come. Given the expression and the slots we need to write to, we do whatever the instruction tells us. But the low simplicity leads to low performance.

Style 2: Conversion to closure

Instead of traversing the AST at runtime, we can traverse it once, and produce a closure/function that performs the action when run (e.g. see this blog post). Given that we access the slots at runtime, we make them an argument to the closure. In Rust, the type of our closure is:

type Compiled = Box<dyn Fn(&mut Vec<i64>) -> i64>;

Here we are defining a Fn (a closure -- function plus captured data) that goes from the slots to a result. Because these functions vary in how much data they capture, we have to wrap them in Box. With that type we can now define our evaluation function:

fn f(x: &Expr) -> Compiled {
    match x {
        Expr::Lit(i) => {
            let i = *i;
            box move |_| i
        }
        Expr::Var(u) => {
            let u = *u;
            box move |slots| slots[u]
        }
        Expr::Add(x, y) => {
            let x = compile(x);
            let y = compile(y);
            box move |slots| x(slots) + y(slots)
        }
        ...
    }
}

Instead of taking the AST (compile-time information) and the slot data (runtime information) we use the compile-time information to produce a function that can then be applied to the run-time information. We trade matching on the AST for an indirect function call at runtime. Rust is able to turn tail calls on dynamic functions into jumps and the processor is able to accurately predict the jumps/calls, leading to reasonable performance.

One large advantage of the closure approach is that adding specialised variants, e.g. compiling a nested Add differently, can be done locally and with no additional runtime cost.

Style 3: Fixed sized instructions

Instead of interpreting an AST, or jumping via indirect functions, we can define a set of instructions and interpret an array of them them using a stack of intermediate values. We are effectively virtualising a CPU, including program counter. We can define a bytecode with instructions such as:

enum Bytecode {
    Assign(u32),  // assign the value at the top of the stack to a slot
    Var(u32),     // push the value in slot to the top of the stack
    Lit(i32),     // push a literal on the stack
    Add,          // Add the top two items on the stack
    Jump(u32),
    ...
}

We now interpret these instructions:

let mut pc = 0;
let mut slots = vec![0; 10];
let mut stack = Stack::new();

loop {
    match xs[pc] {
        Assign(x) => slots[x as usize] = stack.pop(),
        Var(x) => stack.push(slots[x as usize]),
        Lit(i) => stack.push(i as i64),
        Add => {
            let x = stack.pop();
            let y = stack.pop();
            stack.push(x + y)
        }
        Jump(pc2) => pc = pc2 as usize - 1,
        ...
    }
    pc = pc + 1;
}

Most of these operations work against the stack. I found that if I used checked array accesses on the stack (the default in Rust) it went about the same speed as AST interpretation. Moving to unchecked access made it similar in performance (slightly worse) than the closure version.

The bytecode approach is much harder to implement, requiring a compiler to the bytecode. It's also much harder to add specialised variants for certain combinations of instructions. To get good performance via the branch predictor probably requires further tricks beyond what I've shown here (e.g. direct threading).

There are advantages to a bytecode though -- it's easier to capture all the program state, which is useful for garbage collection and other operations.

Style 4: Byte encoded instructions

Instead of having a Rust enum to represent the values, we can instead use bytes, so instead of:

Lit(38)
Lit(4)
Add
Assign(0)

We would have a series of bytes [0,38,0,4,1,2,0] (where 0 = Lit, 1 = Add, 2 = Assign). This approach gives a more compact bytecode, and might have an impact on the instruction cache, but in my benchmarks performed the same as style 3.

Monday, March 16, 2020

The <- pure pattern

Summary: Sometimes <- pure makes a lot of sense, avoiding some common bugs.

In Haskell, in a monadic do block, you can use either <- to bind monadic values, or let to bind pure values. You can also use pure or return to wrap a value with the monad, meaning the following are mostly equivalent:

let x = myExpression
x <- pure myExpression

The one place they aren't fully equivalent is when myExpression contains x within it, for example:

let x = x + 1
x <- pure (x + 1)

With the let formulation you get an infinite loop which never terminates, whereas with the <- pure pattern you take the previously defined x and add 1 to it. To solve the infinite loop, the usual solution with let is to rename the variable on the left, e.g.:

let x2 = x + 1

And now make sure you use x2 everywhere from now on. However, x remains in scope, with a more convenient name, and the same type, but probably shouldn't be used. Given a sequence of such bindings, you often end up with:

let x2 = x + 1
let x3 = x2 + 1
let x4 = x3 + 1
...

Given a large number of unchecked indicies that must be strictly incrementing, bugs usually creep in, especially when refactoring. The unused variable warning will sometime catch mistakes, but not if a variable is legitimately used twice, but one of those instances is incorrect.

Given the potential errors, when a variable x is morally "changing" in a way that the old x is not longer useful, I find it much simpler to write:

x <- pure myExpression

The compiler now statically ensures we haven't fallen into the traps of an infinite loop (which is obvious and frustrating to track down) or using the wrong data (which is much harder to track down, and often very subtly wrong).

What I really want: What I actually think Haskell should have done is made let non-recursive, and had a special letrec keyword for recursive bindings (leaving where be recursive by default). This distinction is present in GHC Core, and would mean let was much safer.

What HLint does: HLint is very aware of the <- pure pattern, but also aware that a lot of beginners should be guided towards let. If any variable is defined more than once on the LHS of an <- then it leaves the do alone, otherwise it will suggest let for those where it fits.

Warnings: In the presence of mdo or do rec both formulations might end up being the same. If the left is a refutable pattern you change between error and fail, which might be quite different. Let bindings might be generalised. This pattern gives a warning about shadowed variables with -Wall.