Sunday, November 15, 2020

Data types for build system dependencies

Summary: Monadic and early cut-off? Use a sequence of sets.

In the Build Systems a la Carte paper we talk about the expressive power of various types of build systems. We deliberately simplify away parallelism and implementation concerns, but those details matter. In this post I'm going to discuss some of those details, specifically the representation of dependencies.

Applicative build systems

In an applicative build system like Make, all dependencies for a target are known before you start executing the associated action. That means the dependencies have no ordering, so are best represented as a set. However, because they can be calculated from the target, they don't usually need to be stored separately. The dependencies can also be evaluated in parallel. To build a target you evaluate the dependencies to values, then evaluate the action.

Early cut-off is when an action is skipped because none of its dependencies have changed value, even if some dependencies might have required recomputing. This optimisation can be incredibly important for build systems with generated code - potentially seconds vs hours of build time. To obtain early cut-off in applicative systems, after evaluating the dependencies you compare them to the previous results, and only run the action if there were changes.
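
As a rough sketch (not the code of any real build system), the applicative check boils down to: the dependencies form a set, and the action only reruns if any freshly computed dependency value differs from the one stored last time.

import qualified Data.Map as Map
import qualified Data.Set as Set

-- Decide whether a target's action must rerun: recompute the value of every
-- dependency in the (unordered) set and compare against the values stored
-- from the previous build. Any difference forces a rerun.
needsRerun :: (Ord k, Eq v) => Set.Set k -> (k -> v) -> Map.Map k v -> Bool
needsRerun deps current previous = Map.fromSet current deps /= previous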

Monadic build systems

In monadic build systems like Shake, the representation of dependencies is more complex. If you have an alternative mechanism of detecting whether a rule is dirty (e.g. reverse dependencies) you don't need to record the dependencies at all. If the key is dirty, you start executing the action, and that will request the dependencies it needs. The action can then suspend, calculate the dependencies, and continue.

If you want early cut-off in a monadic build system, you need to rerun the dependencies in advance, and if they all have the same result, skip rerunning the action. Importantly, you probably want to rerun the dependencies in the same order that the action originally requested them -- otherwise you might pay a severe and unnecessary time penalty. As an example, let's consider an action:

opt <- need "is_optimised"
object <- if opt then need "foo.optimised" else need "foo.unoptimised"
link object

This rule is monadic, as whether you need the optimised or unoptimised dependency depends on the result of calculating some is_optimised property. If on the first run is_optimised is True, then we build foo.optimised. On the second run, if is_optimised is False, it is important we don't build foo.optimised as that might take a seriously long time and be entirely redundant. Therefore, when checking for early cut-off it's important that we rebuild the dependencies in the order the previous action requested them, and stop at the first difference we encounter.

(If you have unlimited resources, e.g. remote execution, it might be profitable to evaluate everything in parallel - but we're assuming that isn't the case here.)

Provided a rule performs identically between runs (i.e. is deterministic and hasn't been changed), everything that we request to check for early cut-off will still be needed for real, and we won't have wasted any work. For all these reasons, it is important to store dependencies as a sequence (e.g. a list/vector).
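
A minimal sketch of that sequence-based check (this is not Shake's actual implementation) might look like:

-- Replay the stored dependencies in the order the action originally requested
-- them, stopping at the first changed value. Nothing means early cut-off
-- applied and the action was skipped.
checkEarlyCutoff :: Eq v => [(k, v)] -> (k -> IO v) -> IO a -> IO (Maybe a)
checkEarlyCutoff deps rebuild action = go deps
  where
    go [] = return Nothing                 -- every dependency was unchanged
    go ((key, old) : rest) = do
        new <- rebuild key
        if new == old
            then go rest                   -- unchanged, keep checking in order
            else Just <$> action           -- first difference: rerun the action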

Monadic build systems plus parallelism

Applicative build systems naturally request all their dependencies in parallel, but monadic build systems are naturally one dependency at a time. To regain parallelism, in build systems like Shake the primitive dependency requesting mechanism takes a set of dependencies that are computed in parallel. While requesting dependencies individually or in bulk gives the same result, requesting in bulk gives significantly more parallelism. (In Shake we use lists to track correspondence between requests and results, but it's morally a set.)

As we saw previously, it is still important that for early cut-off you reproduce the dependencies much like they were in the action. That means you request dependencies in the order they were requested, and when they were requested in bulk, they are also checked in bulk. Now we have a sequence of sets to represent dependencies, where the elements of the sets can be checked in parallel, but the sequence must be checked in order.

Monadic build systems plus explicit parallelism

What if we add an explicit parallelism operator to a monadic build system, something like parallel :: [Action a] -> Action [a] to run arbitrary actions in parallel (which is what Shake provides)? Now, instead of a sequence of sets, we have a tree of parallelism. As before, it's important when replaying that the dependencies are requested in order, but also that as much as possible is requested in parallel.

What Shake does

Shake is a monadic build system with early cut-off, parallelism and explicit parallelism. When building up dependencies it uses a tree representation. The full data type is:

data Traces
    = None
    | One Trace
    | Sequence Traces Traces
    | Parallel [Traces]

Sequenced dependencies are represented with Sequence and the traces captured by parallelism use Parallel. Importantly, constructing Traces values is nicely O(1) in all cases. (Shake v0.19.1 used a different representation and repeatedly normalised it, which could have awful time complexity - potentially O(2^n) in pathological cases.)

While these traces store complete information, actually evaluating that trace when checking for rebuilds would be complicated. Instead, we flatten that representation to [[Trace]] for writing to the Shake database. The outer list is a sequence, the inner list is morally a set. We have the invariant that no Trace value will occur multiple times, since if you depend on something once, and then again, the second dependency was irrelevant. To flatten Parallel computations we take the first required dependency in each parallel action, merge them together, and then repeat for the subsequent actions. If you run code like:

parallel [
    need ["a"] >> parallel [need ["b"], need ["c"]],
    need ["d"]
]

It will get flattened to appear as though you wrote need ["a","d"] >> need ["b","c"]. When checking, it will delay the evaluation of b and c until after d completes, even though that is unnecessary. But simplifying traces at the cost of marginally less rebuild parallelism for those who use explicit parallelism (which is not many) seems like the right trade-off for Shake.
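
For illustration only (this is not Shake's real code, and it skips the deduplication of repeated Trace values), the flattening scheme described above can be sketched as:

type Trace = String   -- placeholder; Shake's real Trace records far more

data Traces
    = None
    | One Trace
    | Sequence Traces Traces
    | Parallel [Traces]

-- Flatten the tree to a sequence of sets: sequences concatenate, while for
-- parallel branches we take the first step of every branch, run those
-- together, then repeat with whatever remains of each branch.
flatten :: Traces -> [[Trace]]
flatten None = []
flatten (One t) = [[t]]
flatten (Sequence a b) = flatten a ++ flatten b
flatten (Parallel xs) = merge (map flatten xs)
  where
    merge branches = case filter (not . null) branches of
        [] -> []
        bs -> concatMap head bs : merge (map tail bs)

Applied to the example above, this sketch produces [["a","d"],["b","c"]], matching the behaviour described.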

Conclusion

Applicative build systems should use sets for their dependencies. Monadic build systems should use sets, but if they support early cut-off, should use sequences of sets.

Monday, November 09, 2020

Turing Incomplete Languages

Summary: Some languages ban recursion to ensure programs "terminate". That's technically true, but usually irrelevant.

In my career there have been three instances where I've worked on a programming language that went through the evolution:

  1. Ban recursion and unbounded loops. Proclaim the language is "Turing incomplete" and that all programs terminate.
  2. Declare that Turing incomplete programs are simpler. Have non-technical people conflate "terminates quickly" with "terminates eventually".
  3. Realise lacking recursion makes things incredibly clunky to express, turning simple problems into brain teasers.
  4. Add recursion.
  5. Realise that everything is better.

Before I left university, this process would have sounded ridiculous. In fact, even after these steps happened twice I was convinced it was the kind of thing that would never happen again. Now I've got three instances, it seems worth writing a blog post so for case number four I have something to refer to.

A language without recursion or unbounded loops

First, let's consider a small, simple, statement-orientated, first-order programming language. How might we write a non-terminating program? There are two easy ways. First, write a loop - while (true) {}. Second, write recursion - void f() { f() }. We can ban both of those, leaving only bounded iteration of the form for x in xs { .. } or similar. Now the language is Turing incomplete and all programs terminate.

The lack of recursion makes programs harder to write, but we can always use an explicit stack with unbounded loops.

The lack of unbounded loops isn't a problem provided we have an upper bound on how many steps our program might take. For example, we know QuickSort has worst-case complexity O(n^2), so if we can write for x in range(0, n^2) { .. } then the loop provides enough steps that the algorithm always finishes before the bound runs out.

But what if our programming language doesn't even provide a range function? We can synthesise it by realising that in a linear amount of code we can produce exponentially large values. As an example:

double xs = xs ++ xs -- Double the length of a list
repeated x = double (double (double (double (double (double (double (double (double (double [x])))))))))

The function repeated 1 makes 10 calls to double, and creates a list of length 2^10 (1024). A mere 263 more calls to double and we'll have a list long enough to contain each atom in the universe. With some tweaks we can cause doubling to stop at a given bound, and generate numbers in sequence, giving us range to any bound we pick.

We now have a menu of three techniques that lets us write almost any program we want (a sketch combining the first two follows the list):

  1. We can encode recursion using an explicit stack.
  2. We can change unbounded loops into loops with a conservative upper bound.
  3. We can generate structures of exponential size with a linear amount of code.
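
As an illustrative sketch of the first two techniques combined - assuming the number of nodes n is known up front - here is a tree sum written without recursion, driven by an explicit stack inside a loop with a conservative bound:

import Data.List (foldl')

data Tree = Leaf | Node Tree Int Tree

-- A tree with n nodes causes at most 2*n + 1 stack pops, so that many
-- iterations of the bounded loop always suffices.
sumTree :: Int -> Tree -> Int
sumTree n tree = fst $ foldl' step (0, [tree]) [1 .. 2 * n + 1]
  where
    step (acc, [])                _ = (acc, [])               -- stack exhausted
    step (acc, Leaf : rest)       _ = (acc, rest)             -- nothing to add
    step (acc, Node l x r : rest) _ = (acc + x, l : r : rest) -- push children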

The consequences

Firstly, we still don't have a Turing complete language. The code will terminate. But there is no guarantee on how long it will take to terminate. Programs that take a million years to finish technically terminate, but probably can't be run on an actual computer. For most of the domains I've seen Turing incompleteness raised, a runtime of seconds would be desirable. Turing incompleteness doesn't help at all.

Secondly, after encoding the program in a tortured mess of logic puzzles, the code is much harder to read. While there are three general purpose techniques to encode the logic, there are usually other considerations that cause each instance to be solved differently. I've written tree traversals, sorts and parsers in such restricted languages - the result is always a lot of comments and at least one level of unnecessary indirection.

Finally, code written in such a complex style often performs significantly worse. Consider QuickSort - the standard implementation takes O(n^2) time worst case, but O(n log n) time average case, and O(log n) space (for the stack). If you take the approach of building an O(n^2) list before you start to encode a while loop, you end up with O(n^2) space and time. Moreover, while in normal QuickSort the time complexity is counting the number of cheap comparisons, in an encoded version the time complexity relates to allocations, which can be much more expensive as a constant factor.

The solution

Most languages with the standard complement of if/for etc which are Turing incomplete do not gain any benefit from this restriction. One exception is in domains where you are proving properties or doing analysis, as two examples:

  1. Dependently typed languages such as Idris, which typically have much more sophisticated termination checkers than just banning recursion and unbounded loops.
  2. Resource bounded languages such as Hume, which allow better analysis and implementation techniques by restricting how expressive the language is.

Such languages tend to be a rarity in industry. In all the Turing incomplete programming languages I've experienced, recursion was later added, programs were simplified, and programming in the language became easier.

While most languages I've worked on made this evolution in private, one language, DAML from Digital Asset, did so in public. In 2016 they wrote:

DAML was intentionally designed not to be Turing-complete. While Turing-complete languages can model any business domain, what they gain in flexibility they lose in analysability.

Whereas in 2020 their user manual says:

If there is no explicit iterator, you can use recursion. Let’s try to write a function that reverses a list, for example.

Note that while I used to work at Digital Asset, these posts both predate and postdate my time there.

Tuesday, September 22, 2020

Don't use Ghcide anymore (directly)

Summary: I recommend people use the Haskell Language Server IDE.

Just over a year ago, I recommended people looking for a Haskell IDE experience to give Ghcide a try. A few months later the Haskell IDE Engine and Ghcide teams agreed to work together on Haskell Language Server - using Ghcide as the core library, with the plugins/installer experience from the Haskell IDE Engine (by that stage we were already both using the same Haskell setup and LSP libraries). At that time Alan Zimmerman said to me:

"We will have succeeded in joining forces when you (Neil) start recommending people use Haskell Language Server."

I'm delighted to say that time has come. For the last few months I've been both using and recommending Haskell Language Server for all Haskell IDE users. Moreover, for VS Code users, I recommend simply installing the Haskell extension which downloads the right version automatically. The experience of Haskell Language Server is better than either the Haskell IDE Engine or Ghcide individually, and is improving rapidly. The teams have merged seamlessly, and can now be regarded as a single team, producing one IDE experience.

There's still lots of work to be done. And for those people developing the IDE, Ghcide remains an important part of the puzzle - but it's now a developer-orientated piece rather than a user-orientated piece. Users should follow the README at Haskell Language Server and report bugs against Haskell Language Server.

Monday, August 31, 2020

Interviewing while biased

Interviewing usually involves some level of subjectivity. I once struggled to decide about a candidate, and after some period of reflection, the only cause I can see is that I was biased against the candidate. That wasn't a happy realisation, but even so, it's one I think worth sharing.

Over the years, I've interviewed hundreds of candidates for software engineering jobs (I reckon somewhere in the 500-1000 mark). I've interviewed for many companies, for teams I was managing, for teams I worked in, and for other teams at the same company. In most places, I've been free to set the majority of the interview. I have a standard pattern, with a standard technical question, to which I have heard a lot of answers. The quality of the answers falls into one of three categories:

  • About 40% give excellent, quick, effortless answers. These candidates pass the technical portion.
  • About 50% are confused and make nearly no progress even with lots of hints. These candidates fail.
  • About 10% struggle a bit but get to the answer.

Candidates in the final bucket are by far the hardest to make a decision on. Not answering a question effortlessly doesn't mean you aren't a good candidate - it might mean it's not something you are used to, you got interview nerves or a million other factors that go into someone's performance. It makes the process far more subjective.

Many years ago, I interviewed one candidate over the phone. It was their first interview with the company, so I had to decide whether we should take the step of transporting them to the office for an in-person interview, which has some level of cost associated with it. Arranging an in-person interview would also mean holding a job open for them, which would mean pausing further recruitment. The candidate had a fairly strong accent, but a perfect grasp of English. Their performance fell squarely into the final bucket.

For all candidates, I make a decision, and write down a paragraph or so explaining how they performed. My initial decision was to not go any further in interviewing the candidate. But after writing down the paragraph, I found it hard to justify my decision. I'd written other paragraphs that weren't too dissimilar, but had decided to continue onwards. I wondered about changing my decision, but felt rather hesitant - I had a sneaking suspicion that this candidate "just wouldn't work out". Had I spotted something subtle I had forgotten to write down? Had their answers about their motivation given me a subconscious red-flag? I didn't know, but for the first time I can remember, decided to hold off overnight before sending my internal interview report.

One day later, I still had a feeling of unease. But still didn't have anything to pin it on. In the absence of a reason to reject them, I decided the only fair thing to do was get them onsite for further interviews. Their onsite interviews went fine, I went on to hire them, they worked for me for over a year, and were a model employee. If I saw red-flags, they were false-flags, but more likely, I saw nothing.

However, I still wonder what caused me to decide "no" initially. Unfortunately, the only thing I can hypothesise is that their accent was the cause. I had previously worked alongside someone with a similar accent, who turned out to be thoroughly incompetent. I seem to have projected some aspects of that behaviour onto an entirely unrelated candidate. That's a pretty depressing realisation to make.

To try and reduce the chance of this situation repeating, I now write down the interview description first, and then the decision last. I also remember this story, and how my biases nearly caused me to screw up someone's career.

Monday, July 27, 2020

Which packages does Hoogle search?

Summary: Hoogle searches packages on Stackage.

Haskell (as of 27 July 2020) has 14,490 packages in the Hackage package repository. Hoogle (the Haskell API search engine) searches 2,463 packages. This post explains which packages are searched, why some packages are excluded, and thus, how you can ensure your package is searched.

The first filter is that Hoogle only searches packages on Stackage. Hoogle indexes any package which is either in the latest Stackage nightly or Stackage LTS, but always indexes the latest version that is on Hackage. If you want a Hoogle search that perfectly matches a given Stackage release, I recommend using the Stackage Hoogle search available from any snapshot page. There are two reasons for restricting to only packages on Stackage:

  • I want Hoogle results to be useful. The fact that the package currently builds with a recent GHC used by Stackage is a positive sign that the package is maintained and might actually work for a user who wants to use it. Most of the packages on Hackage probably don't build with the latest GHC, and thus aren't useful search results.
  • Indexing time and memory usage are proportional to the number of packages, and somewhat to the size of those packages. By dropping over 10 thousand packages we can index more frequently and on more constrained virtual machines. With the recent release of Hoogle 5.0.18 the technical limitations on size were removed to enable indexing all of Hackage - but there is still no intention to do so.

There are 2,426 packages in Stackage Nightly, and 2,508 in Stackage LTS, with most overlapping. There are 2,580 distinct packages between these two sources, the Haskell Platform and a few custom packages Hoogle knows about (e.g. GHC).

Of the 2,580 eligible packages, 77 are executables only, so don't have any libraries to search, leaving 2,503 packages.

Of the remaining 2,503 packages, 40 are missing documentation on Hackage, taking us down to 2,463. As for why a package might not have documentation:

  • Some are missing documentation because they are very GHC internal and are mentioned but not on Hackage, e.g. ghc-heap.
  • Some are Windows only and won't generate documentation on the Linux Hackage servers, e.g. Win32-notify.
  • Some have dependencies not installed on the Hackage servers, e.g. rocksdb-query.
  • Some have documentation that appears to have been generated without generating a corresponding Hoogle data file, e.g. array-memoize.
  • Some are just missing docs entirely on Hackage for no good reason I can see, e.g. bytestring-builder.

The Hoogle database is generated and deployed once per day, automatically. Occasionally a test failure or dependency outage will cause generation to fail, but I get alerted, and usually it doesn't get stale by more than a few days. If you add your package to Stackage and it doesn't show up on Hoogle within a few days, raise an issue.

Wednesday, July 15, 2020

Managing Haskell Extensions

Summary: You can divide extensions into yes, no and maybe, and then use HLint to enforce that.

I've worked in multiple moderately sized multinational teams of Haskell programmers. One debate that almost always comes up is which extensions to enable. It's important to have some consistency, so that everyone is using similar dialects of Haskell and can share/review code easily. The way I've solved this debate in the past is by, as a team, dividing the extensions into three categories:

  • Always-on. For example, ScopedTypeVariables is just how Haskell should work and should be enabled everywhere. We turn these extensions on globally in Cabal with default-extensions, and then write an HLint rule to ban turning the extension on manually. In one quick stroke, a large amount of {-# LANGUAGE #-} boilerplate at the top of each file disappears.
  • Always-off, not enabled and mostly banned. For example, TransformListComp steals the group keyword and never got much traction. These extensions can be similarly banned by HLint, or you can default unmentioned extensions to being disabled. If you really need to turn on one of these extensions, you need to both turn it on in the file, and add an HLint exception. Such an edit can trigger wider code review, and serves as a good signal that something needs looking at carefully.
  • Sometimes-on, written at the top of the file as {-# LANGUAGE #-} pragmas when needed. These are extensions which show up sometimes, but not that often. A typical file might have zero to three of them. The reasons for making an extension sometimes-on fall into three categories:
    • The extension has a harmful compile-time impact, e.g. CPP or TemplateHaskell. It's better to only turn these extensions on if they are needed, but they are needed fairly often.
    • The extension can break some code, e.g. OverloadedStrings, so enabling it everywhere would cause compile failures. Generally, we work to minimize such cases, aiming to fix all code to compile with most extensions turned on.
    • The extension is used rarely within the code base and is a signal to the reader that something unusual is going on. Depending on the code base that might be things like RankNTypes or GADTs. But for certain code bases, those extensions will be very common, so it very much varies by code base.

The features that are often most debated are the syntax features - e.g. BlockArguments or LambdaCase. Most projects should either use these extensions commonly (always-on), or never (banned). They provide some syntactic convenience, but if used rarely, tend to mostly confuse things.

Using this approach, every large team I've worked on has had one initial debate to classify extensions, and then every few months someone suggests moving an extension from one pile to another. Beyond that, it has pretty much silenced the issue in normal discussion, leaving us to focus on actual coding.

Tuesday, July 07, 2020

How I Interview

Summary: In previous companies I had a lot of freedom to design an interview. This article describes what I came up with.

Over the years, I've interviewed hundreds of candidates for software engineering jobs (at least 500, probably quite a bit more). I've interviewed for many companies, for teams I was setting up, for teams I was managing, for teams I worked in, and for different teams at the same company. In most places, I've been free to set the majority of the interview. This post describes why and how I designed my interview process. I'm making this post now because where I currently work has a pre-existing interview process, so I won't be following the process below anymore.

I have always run my interviews as a complete assessment of a candidate, aiming to form a complete answer. Sometimes I did that as a phone screen, and sometimes as part of a set of interviews, but I never relied on other people to cover different aspects of a candidate. (Well, I did once, and it went badly...)

When interviewing, there are three questions I want to answer for myself, in order of importance.

Will they be happy here?

If the candidate joined, would they be happy? If people aren't happy, it won't be a pleasant experience, and likely, they won't be very successful. Whether they are happy is the most important criterion because an employee who can't do the job but is happy can be trained or can use their skills for other purposes. But an employee who is unhappy will just drag the morale of the whole team down.

To figure out whether a candidate would be happy, I explain the job (including any office hours/environment/location) and discuss it in contrast to their previous experience. The best person to judge if they would be happy is the candidate themselves - so I ask that question. The tricky part is that it's an interview setting, so they have come prepared to say "Yes, that sounds good" to every question. I try and alleviate that by building a rapport with the candidate first, being honest about my experiences, and trying to discuss what they like in the abstract first. If I'm not convinced they are being truthful or properly thinking it through, I ask deeper questions, for example how they like to split their day etc.

A great sign is when a candidate, during the interview, concludes for themselves that this job just isn't what they were looking for. I've had that happen 5 times during the actual interview, and 2 times as an email afterwards. It isn't awkward, and has saved some candidates an inappropriate job (at least 2 would have likely been offered a job otherwise).

While I'm trying to find out if the candidate will be happy, at the same time, I'm also attempting to persuade the candidate that they want to join. It's a hard balance and being open and honest is the only way I have managed it. Assuming I am happy where I work, I can use my enthusiasm to convince the candidate it's a great place, but also give them a sense of what I do.

Can they do the job?

There are two ways I used to figure out if someone can do the job. Firstly, I discuss their background, coding preferences etc. Do the things they've done in the past match the kind of things required in the job? Have they got experience with the non-technical portions of the job, or domain expertise? Most of these aspects are on their CV, so it involves talking about their CV, past projects, what worked well etc.

Secondly, I give them a short technical problem. My standard problem can be solved in under a minute in a single line of code by the best candidates. The problem is not complex, and has no trick-question or clever-approach element. The result can then be used as a springboard to talk about algorithmic efficiency, runtime implementation, parallelism, testing, verification etc. However, my experience is that candidates who struggle at the initial problem go on to struggle with any of the extensions, and candidates that do well at the initial question continue to do well on the extensions. The correlation has been so complete that over time I have started to use the extensions more for candidates who did adequately but not great on the initial problem.

My approach of an incredibly simple problem does not seem to be standard or adopted elsewhere. One reason might be that if it was used at scale, the ability to cheat would be immense (I actually have 2 backup questions for people I've interviewed previously).

Given such a simple question, there have been times when 5 candidates in a row ace the question, and I wonder if the question is just too simple. But usually then the next 5 candidates all struggle terribly and I decide it still has value.

Will I be happy with them doing the job?

The final thing I wonder is would I be happy with them being a part of the team/company. The usual answer is yes. However, if the candidate displays nasty characteristics (belittling, angry, racist, sexist, lying) then it's a no. This question definitely isn't code for "culture fit" or "would I go for a beer with them", but specific negative traits. Generally I answer this question based on whether I see these characteristics reflected in the interactions I have with the candidate, not specific questions. I've never actually had a candidate who was successful at the above questions, and yet failed at this question. I think approximately 5-10 candidates have failed on this question.

Sunday, July 05, 2020

Automatic UI's for Command Lines with cmdargs

Summary: Run cmdargs-browser hlint and you can fill out arguments easily.

The Haskell command line parsing library cmdargs contains a data type that represents a command line. I always thought it would be a neat trick to transform that into a web page, to make it easier to explore command line options interactively - similar to how the custom-written wget::gui wraps wget.

I wrote a demo to do just that, named cmdargs-browser. Given any program that uses cmdargs (e.g. hlint), you can install cmdargs-browser (with cabal install cmdargs-browser) and run:

cmdargs-browser hlint

And it will pop up:

As we can see, the HLint modes are listed on the left (you can use lint, grep or test), the possible options on the right (e.g. normal arguments and --color) and the command line it produces at the bottom. As you change mode or add/remove flags, the command line updates. If you hit OK it then runs the program with the command line. The help is included next to the argument, and if you make a mistake (e.g. write foo for the --color flag) it tells you immediately. It could be more polished (e.g. browse buttons for file selections, better styling) but the basic concept works well.

Technical implementation

I wanted every cmdargs-using program to support this automatic UI, but also didn't want to increase the dependency footprint or compile-time overhead for cmdargs. I didn't want to tie cmdargs to this particular approach to a UI - I wanted a flexible mechanism that anyone could use for other purposes.

To that end, I built out a Helper module that is included in cmdargs. That API provides all the power and capability needed to write cmdargs-browser. The Helper module is only 350 lines.

If you run cmdargs with either $CMDARGS_HELPER or $CMDARGS_HELPER_HLINT set (in the case of HLint) then cmdargs will run the command line you specify, passing the explicit Mode data type over stdin. That Mode data type includes functions, and using a simplistic communication channel over stdin/stdout, the helper process can invoke those functions. As an example, when cmdargs-browser wants to validate the --color flag, it does so by calling a function in the Mode, which secretly talks back to hlint to validate it.

At the end, the helper program can choose to either give an error message (to stop the program, e.g. if you press Cancel), or give some command lines to use to run the program.

Future plans

This demo was a cool project, which may turn out to be useful for some, but I have no intention to develop it further. I think something along these lines should be universally available for all command line tools, and built into all command line parsing libraries.

Historical context

All the code that makes this approach work was written over seven years ago. Specifically, it was my hacking project in the hospital while waiting for my son to be born. Having a little baby is a hectic time of life, so I never got round to telling anyone about its existence.

This weekend I resurrected the code and published an updated version to Hackage, deliberately making as few changes as possible. The three necessary changes were:

  1. jQuery deprecated the live function, replacing it with on, meaning the code didn't work.
  2. I had originally put an upper bound of 0.4 for the transformers library. Deleting the upper bound made it work.
  3. Hackage now requires that all your uploaded .cabal files declare that they require a version of 1.10 or above of Cabal itself, even if they don't.

Overall, recovering a project that is over 7 years old took surprisingly little effort.

Wednesday, July 01, 2020

A Rust self-ownership lifetime trick (that doesn't work)

Summary: I came up with a clever trick to encode lifetimes of allocated values in Rust. It doesn't work.

Let's imagine we are using Rust to implement some kind of container that can allocate values, and a special value can be associated with the container. It's a bug if the allocated value gets freed while it is the special value of a container. We might hope to use lifetimes to encode that relationship:

struct Value<'v> {...}
struct Container {...}

impl Container {
    fn alloc<'v>(&'v self) -> Value<'v> {...}
    fn set_special<'v>(&'v self, x: Value<'v>) {...}
}

Here we have a Container (which has no lifetime arguments), and a Value<'v> (where 'v ties it to the right container). Within our container we can implement alloc and set_special. In both cases, we take &'v self and then work with a Value<'v>, which ensures that the lifetime of the Container and Value match. (We ignore details of how to implement these functions - it's possible but requires unsafe).

Unfortunately, the following code compiles:

fn set_cheat<'v1, 'v2>(to: &'v1 Container, x: Value<'v2>) {
    to.set_special(x);
}

The Rust compiler has taken advantage of the fact that Container can be reborrowed, and that Value is variant, and rewritten the code to:

fn set_cheat<'v1, 'v2>(to: &'v1 Container, x: Value<'v2>) {
    'v3: {
        let x : Value<'v3> = x; // Value is variant, 'v2 : 'v3
        let to : &'v3 Container = &*to;
        to.set_special(x);
    }
}

The code with lifetime annotations doesn't actually compile; it's just what the compiler did under the hood. But we can stop Value being variant by making it contain PhantomData<Cell<&'v ()>>, since lifetimes under Cell are invariant. Now the above code no longer compiles. Unfortunately, there is a closely related variant which does compile:

fn set_cheat_alloc<'v1, 'v2>(to: &'v1 Container, from: &'v2 Container) {
    let x = from.alloc();
    to.set_special(x);
}

While Value isn't variant, &Container is, so the compiler has rewritten this code as:

fn set_cheat_alloc<'v1, 'v2>(to: &'v1 Container, from: &'v2 Container) {
    'v3: {
        let from : &'v3 Container = &*from;
        let x : Value<'v3> = from.alloc();
        let to : &'v3 Container = &*to;
        to.set_special(x);
    }
}

Since lifetimes on & are always variant, I don't think there is a trick to make this work safely. Much of the information in this post was gleaned from this StackOverflow question.

Monday, June 22, 2020

The HLint Match Engine

Summary: HLint has a match engine which powers most of the rules.

The Haskell linter HLint has two forms of lint - some are built in, written in Haskell code over the GHC AST (e.g. unused extension detection), but 700+ hints are written using a matching engine. As an example, we can replace map f (map g xs) with map (f . g) xs. Doing so might be more efficient, but importantly for HLint, it's often clearer. That rule is defined in HLint as:

- hint: {lhs: map f (map g x), rhs: map (f . g) x}

All single-letter variables are wildcard matches, so the above rule will match:

map isDigit (map toUpper "test")

And suggest:

map (isDigit . toUpper) "test"

However, Haskell programmers are uniquely creative in specifying functions - with a huge variety of $ and . operators, infix operators etc. The HLint matching engine in HLint v3.1.4 would match this rule to all of the following (I'm using sort as a convenient function, replacing it with foo below would not change any matches):

  • map f . map g
  • sort . map f . map g . sort
  • concatMap (map f . map g)
  • map f (map (g xs) xs)
  • f `map` (g `map` xs)
  • map f $ map g xs
  • map f (map g $ xs)
  • map f (map (\x -> g x) xs)
  • Data.List.map f (Prelude.map g xs)
  • map f ((sort . map g) xs)

That's a large variety of ways to write a nested map. In this post I'll explain how HLint matches everything above, and the bug that used to cause it to match even the final line (which isn't a legitimate match) which was fixed in HLint v3.1.5.

Eta-contraction

Given a hint comprising an lhs and rhs, the first thing HLint does is determine if it can eta-contract the hint, producing a version without the final argument. If it can do so for both sides, it generates a completely fresh hint. In the case of map f (map g x) it generates:

- hint: {lhs: map f . map g, rhs: map (f . g)}

For the examples above, the first three match with this eta-contracted version, and the rest match with the original form. Now we've generated two hints, it's important that we don't perform sufficiently fuzzy matching that both match some expression, as that would generate twice as many warnings as appropriate.

Root matching

The next step is root matching, which happens only when trying to match at the root of the expression being matched. If we have (foo . bar) x then it would be reasonable for that to match bar x, despite the fact that bar x is not a subexpression. We overcome that by transforming the expression to foo (bar x), unifying only on bar x, and recording that we need to add back foo . at the start of the replacement.

Expression matching

After splitting off any extra prefix, HLint tries to unify the single-letter variables with expressions, and build a substitution table with type Maybe [(String, Expr)]. The substitution is Nothing to denote the expressions are incompatible, or Just a mapping of variables to the expression they matched. If two expressions have the same structure, we descend into all child terms and match further. If they don't have the same structure, but are similar in a number of ways, we adjust the source expression and continue.

Examples of adjustments include expanding out $, removing infix application such as f `map` x and ignoring redundant brackets. We translate (f . g) x to f (g x), but not at the root - otherwise we might match both the eta-expanded and non-eta-expanded variants. We also re-associate (.) where needed, e.g. for expressions like sort . map f . map g . sort the bracketing means we have sort . (map f . (map g . sort)). We can see that map f . map g is not a subexpression of that expression, but given that . is associative, we can adjust the source.

When we get down to a terminal name like map, we use the scope information HLint knows to determine if the two maps are equivalent. I'm not going to talk about that too much, as it's slated to be rewritten in a future version of HLint, and is currently both slow and a bit approximate.

Substitution validity

Once we have a substitution, we see if there are any variables which map to multiple distinct expressions. If so, the substitution is invalid, and we don't match. However, in our example above, there are no duplicate variables so any matching substitution must be valid.
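
As a toy illustration (HLint's real AST and matcher are far richer), the substitution building and the validity check can be sketched as:

-- A toy expression type: single-letter variables in the template are
-- wildcards; everything else must match structurally.
data Expr = Var String | App Expr Expr
    deriving Eq

unify :: Expr -> Expr -> Maybe [(String, Expr)]
unify (Var v) e
    | length v == 1 = Just [(v, e)]       -- wildcard: record the binding
unify (Var v) (Var w)
    | v == w = Just []                    -- identical concrete names
unify (App f x) (App g y) = (++) <$> unify f g <*> unify x y
unify _ _ = Nothing                       -- incompatible structure

-- A substitution is valid if no variable maps to two distinct expressions.
valid :: [(String, Expr)] -> Bool
valid sub = and [ e == e' | (v, e) <- sub, (v', e') <- sub, v == v' ]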

Side conditions

Next we check any side conditions - e.g. we could decide that the above hint only makes sense if x is atomic - i.e. does not need brackets in any circumstance. We could have expressed that with side: isAtom x, and any such conditions are checked in a fairly straightforward manner.

Substitutions

Finally, we substitute the variables into the provided replacement. When doing the replacement, we keep track of the free variables, and if the resulting expression has more free variables than it started with, we assume the hint doesn't apply cleanly. As an example, consider the hint rewriting \x -> a <$> b x to fmap a . b. It looks like a perfectly reasonable hint, but what if we apply it to the expression \x -> f <$> g x x? Now b matches g x, but we are throwing away the \x binding and x is now dangling, so we reject it.

When performing the substitution, we use knowledge of the AST we want, and the brackets required to parse that expression, to ensure we insert the right brackets, but not too many.

Bug #1055

Hopefully all the above sounds quite reasonable. Unfortunately, at some point, the root-matching lost the check that it really was at the root, and started applying the translation to terms such as sort . in map f ((sort . map g) xs). Having generated the sort ., it decided since it wasn't at the root, there was nowhere for it to go, so promptly threw it away. Oops. HLint v3.1.5 fixes the bug in two distinct ways (for defence in depth):

  1. It checks the root boolean before doing the root matching rule.
  2. If it would have to throw away any extra expression, it fails, as throwing away that expression is certain to lead to a correctness bug.

Conclusion

The matching engine of HLint is relatively complex, but I always assumed it would one day be replaced with a finite-state-machine scanner that could match n hints against an expression in O(size-of-expression), rather than the current O(n * size-of-expression). However, it's never been the bottleneck, so I've kept with the more direct version.

I'm glad HLint has a simple external lint format. It allows easy contributions and makes hint authoring accessible to everyone. For large projects it's easy to define your own hints to capture common coding patterns. When using languages whose linter does not have an external matching language (e.g. Rust's Clippy) I certainly miss the easy customization.

Tuesday, June 09, 2020

Hoogle Searching Overview

Summary: Hoogle 5 has three interesting parts, a pipeline, database and search algorithm.

The Haskell search engine Hoogle has gone through five major designs, the first four of which are described in these slides from TFP 2011. Hoogle version 5 was designed to be a complete rewrite which simplified the design and allowed it to scale to all of Hackage. All versions of Hoogle have had some preprocessing step which consumes Haskell definitions, and writes out a data file. They then have the search phase which uses that data file to perform searches. In this post I'll go through three parts -- what the data file looks like, how we generate it, and how we search it. When we consider these three parts, the evolution of Hoogle can be seen as:

  • Versions 1-3, produce fairly simple data files, then do an expensive search on top. Fails to scale to large sizes.
  • Version 4, produce a very elaborate data file, aiming to search quickly on top. Failed because producing the data file required a lot of babysitting and a long time, so it was updated very rarely (yearly). Also, searching a complex data file ends up with a lot of corner cases which have terrible complexity (e.g. a -> a -> a -> a -> a would kill the server).
  • Version 5, generate very simple data files, then do O(n) but small-constant multiplier searching on top. Update the files daily and automatically. Make search time very consistent.

Version 5 data file

By version 5 I had realised that deserialising the data file was both time consuming and memory hungry. Therefore, in version 5, the data file consists of chunks of data that can be memory-mapped into Vector and ByteString chunks using a ForeignPtr underlying storage. The OS figures out which bits of the data file should be paged in, and there is very little overhead or complexity on the Haskell side. There is a small index structure at the start of the data file which says where these interesting data structures live, and gives them identity using types. For example, to store information about name search we have three definitions:

data NamesSize a where NamesSize :: NamesSize Int
data NamesItems a where NamesItems :: NamesItems (V.Vector TargetId)
data NamesText a where NamesText :: NamesText BS.ByteString

Namely, in the data file we have NamesSize which is an Int, NamesItems which is a Vector TargetId, and NamesText which is a ByteString. The NamesSize is the maximum number of results that can be returned from any non-empty search (used to reduce memory allocation for the result structure), the NamesText is a big string with \0 separators between each entry, and the NamesItems are the identifiers of the result for each name, with as many entries as there are \0 separators.
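
As an illustration of how those three pieces fit together (Hoogle's real search hands the ByteString over to C code and ranks complete and case-matching hits higher, neither of which this sketch attempts):

import qualified Data.ByteString.Char8 as BS
import qualified Data.Vector as V

type TargetId = Int   -- placeholder; Hoogle has its own TargetId type

-- Given the NamesText blob (entries separated by \0) and the parallel
-- NamesItems vector, return the TargetId of every entry containing the query.
searchNames :: BS.ByteString -> V.Vector TargetId -> BS.ByteString -> [TargetId]
searchNames namesText namesItems query =
    [ namesItems V.! i
    | (i, name) <- zip [0 ..] (BS.split '\0' namesText)
    , query `BS.isInfixOf` name
    ]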

The current data file is 183Mb for all of Stackage, of which 78% of that is the information about items (documentation, enough information to render them, where the links go etc - we then GZip this information). There are 21 distinct storage types, most involved with type search.

Generating the data file

Generating the data file is done in four phases.

Phase 0 downloads the inputs, primarily a .tar.gz file containing all .cabal files, and another containing all the Haddock Hoogle outputs. These .tar.gz files are never unpacked, but streamed through and analysed using conduit.

Phase 1 reads through all the .cabal files to get metadata about each package - the author, tags, whether it's in Stackage etc. It stores this information in a Map. This phase takes about 7s and uses 100Mb of memory.

Phase 2 reads through every definition in every Haddock Hoogle output (the .txt files --hoogle generates). It loads the entry, parses it, processes it, and writes most of the data to the data file, assigning it a TargetId. That TargetId is the position of the item in the data file, so it's unique, and can be used to grab the relevant item when we need to display it while searching. During this time we collect the unique deduplicated type signatures and names, along with the TargetId values. This phase takes about 1m45s and has about 900Mb of memory at the end. The most important part of phase 2 is not to introduce a space leak, since then memory soars to many Gb.

Phase 3 processes the name and type maps and writes out the information used for searching. This phase takes about 20s and consumes an additional 250Mb over the previous phase.

Since generating the data file takes only a few minutes, there is a nightly job that updates the data file at 8pm every night. The job takes about 15 minutes in total, because it checks out a new version of Hoogle from GitHub, builds it, downloads all the data files, generates a data file, runs the tests, and then restarts the servers.

Searching

Hoogle version 5 works on the principle that it's OK to be O(n) if the constant is small. For textual search, we have a big flat ByteString, and give that to some C code that quickly looks for the substring we enter, favouring complete and case-matching matches. Such a loop is super simple, and at the size of data we are working with (about 10Mb), plenty fast enough.

Type search is inspired by the same principle. We deduplicate types, then for each type, we produce an 18 byte fingerprint. There are about 150K distinct type signatures in Stackage, so that results in about 2.5Mb of fingerprints. For every type search we scan all those fingerprints and figure out the top 100 matches, then do a more expensive search on the full type for those top 100, producing a ranking. For a long time (a few years) I hadn't even bothered doing the second phase of more precise matching, and it still gave reasonable results. (In fact, I never implemented the second phase, but happily Matt Noonan contributed it.)

A type fingerprint is made up of three parts:

  • 1 byte being the arity of the function. a -> b -> c would have arity 3.
  • 1 byte being the number of constructors/variables in the type signature. Maybe a -> a would have a value of 3.
  • The three rarest names in the function. E.g. A -> B -> C -> D would compare how frequent each of A, B, C and D were in the index of functions, and record the 3 rarest. Each name is given a 32 bit value (where 0 is the most common and 2^32 is the rarest).

The idea of arity and number of constructors/variables is to try and get an approximate shape fit to the type being searched for. The idea of the rarest names is an attempt to take advantage of the fact that if you are searching for ShakeOptions -> [a] -> [a] then you probably didn't write ShakeOptions by accident -- it provides a lot of signal. Therefore, filtering down to functions that mention ShakeOptions probably gives a good starting point.
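
Schematically (this is not Hoogle's real code, and it ignores the exact 18-byte packing), computing a fingerprint might look like:

import Data.List (sortOn)
import qualified Data.Map as Map
import Data.Ord (Down(..))
import Data.Word (Word32, Word8)

-- One fingerprint per unique type signature: the arity, the number of
-- constructors/variables, and the up-to-three rarest names, where rarity is
-- looked up in a precomputed table (0 = most common, larger = rarer).
data Fingerprint = Fingerprint
    { fpArity  :: Word8
    , fpTerms  :: Word8
    , fpRarest :: [Word32]
    } deriving (Eq, Show)

fingerprint :: Map.Map String Word32 -> Int -> [String] -> Fingerprint
fingerprint rarity arity names = Fingerprint
    (fromIntegral arity)
    (fromIntegral (length names))
    (take 3 $ sortOn Down $ map rarityOf names)
  where
    rarityOf n = Map.findWithDefault 0 n rarity   -- unknown names count as common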

Once we have the top 100 matches, we can then start considering whether type classes are satisfied, whether type aliases can be expanded, what the shape of the actual function is etc. By operating on a small and bounded number of types we can do much more expensive comparisons than if we had to apply them to every possible candidate.

Conclusion

Hoogle 5 is far from perfect, but the performance is good, the scale can keep up with the growth of Haskell packages, and the simplicity has kept maintenance low. The technique of operations which are O(n) but with a small constant is one I've applied in other projects since, and I think is an approach often overlooked.

Sunday, June 07, 2020

Surprising IO: How I got a benchmark wrong

Summary: IO evaluation caught me off guard when trying to write some benchmarks.

I once needed to know a quick back-of-the-envelope timing of a pure operation, so hacked something up quickly rather than going via criterion. The code I wrote was:

main = do
    (t, _) <- duration $ replicateM_ 100 $ action myInput
    print $ t / 100

{-# NOINLINE action #-}
action x = do
    evaluate $ myOperation x
    return ()

Reading from top to bottom, it takes the time of running action 100 times and prints it out. I deliberately engineered the code so that GHC couldn't optimise it to the point where myOperation was run only once. As examples of the defensive steps I took:

  • The action function is marked NOINLINE. If action was inlined then myOperation x could be floated up and only run once.
  • The myInput is given as an argument to action, ensuring it can't be applied to myOperation at compile time.
  • The action is in IO so it has to be rerun each time.

Alas, GHC still had one trick up its sleeve, and it wasn't even an optimisation - merely the definition of evaluation. The replicateM_ function takes action myInput, which is evaluated once to produce a value of type IO (), and then runs that IO () 100 times. Unfortunately, in my benchmark myOperation x is actually evaluated in the process of creating the IO (), not when running the IO (). The fix was simple:

action x = do
    _ <- return ()
    evaluate $ myOperation x
    return ()

Which roughly desugars to:

return () >>= \_ -> evaluate (myOperation x)

Now the IO produced has a lambda inside it, and my benchmark runs 100 times. However, at -O2 GHC used to manage to break this code once more, by lifting myOperation x out of the lambda, producing:

let y = myOperation x in return () >>= \_ -> evaluate y

Now myOperation runs just once again. I finally managed to defeat GHC by lifting the input into IO, giving:

action x = do
    evaluate . myOperation =<< x
    return ()

Now the input x is itself in IO, so myOperation can't be hoisted.

I originally wrote this post a very long time ago, and back then GHC did lift myOperation out from below the lambda. But nowadays it doesn't seem to do so (quite possibly because doing so might cause a space leak). However, there's nothing that promises GHC won't learn this trick again in the future.

Sunday, May 31, 2020

HLint --cross was accidentally quadratic

Summary: HLint --cross was accidentally quadratic in the number of files.

One of my favourite blogs is Accidentally Quadratic, so when the Haskell linter HLint suffered such a fate, I felt duty bound to write it up. Most HLint hints work module-at-a-time (or smaller scopes), but there is one hint that can process multiple modules simultaneously - the duplication hint. If you write a sufficiently large repeated fragment in two modules, and pass --cross, then this hint will detect the duplication. The actual application of hints in HLint is governed by:

applyHintsReal :: [Setting] -> Hint -> [ModuleEx] -> [Idea]

Given a list of settings, a list of hints (which gets merged to a single composite Hint) and a list of modules, produce a list of ideas to suggest. Usually this function is called in parallel with a single module at a time, but when --cross is passed, all the modules being analysed get given in one go.

In HLint 3, applyHintsReal became quadratic in the number of modules. When you have 1 module, 1^2 = 1, and everything works fine, but --cross suffers a lot. The bug was simple. Given a Haskell list comprehension:

[(a,b) | a <- xs, b <- xs]

When given the list xs of [1,2] you get back the pairs [(1,1),(1,2),(2,1),(2,2)] - the cross product, which is quadratic in the size of xs. The real HLint code didn't look much different:

[ generateHints m m'
| m <- ms
, ...
, (nm',m') <- mns'
, ...
]
where
    mns' = map (\x -> (scopeCreate (GHC.unLoc $ ghcModule x), x)) ms

We map over ms to create mns' containing each module with some extra information. In the list comprehension we loop over ms to get each m, then for each m, we loop over mns' to get m'. That means you take the cross-product of the modules, which is quadratic.

How did this bug come about? HLint used to work against haskell-src-exts (HSE), but now works against the GHC parser. We migrated the hints one by one, changing HLint to thread through both ASTs, and then each hint could pick which AST to use. The patch that introduced this behaviour left ms as the HSE AST, and made mns' the GHC AST. It should have zipped these two together, so for each module you have the HSE and GHC AST, but accidentally took the cross-product.
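
Schematically, with toy values standing in for HLint's real modules and scope information, the difference between the accidental and intended shapes is:

-- ms is the list of modules; mns' pairs each module with its extra scope
-- information (annotate plays the role of scopeCreate).
ms :: [Int]
ms = [1 .. 4]

mns' :: [(String, Int)]
mns' = map (\x -> (annotate x, x)) ms
  where annotate = show

-- The accidental shape: two independent generators, so a quadratic cross product.
quadratic :: [(Int, Int)]
quadratic = [ (m, m') | m <- ms, (_, m') <- mns' ]

-- The intended shape: zip the lists so each module is paired with itself once.
linear :: [(Int, Int)]
linear = [ (m, m') | (m, (_, m')) <- zip ms mns' ]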

How did we spot it? Iustin Pop filed a bug report noting that each hint was repeated once per file being checked and performance had got significantly worse, hypothesising it was O(n^2). Iustin was right!

How did we fix it? By the time the bug was spotted, the HSE AST had been removed entirely, and both m and m' were the same type, so deleting one of the loops was easy. The fix is out in HLint version 3.1.4.

Should I be using --cross? If you haven't heard about --cross in HLint, I don't necessarily suggest you start experimenting with it. The duplicate detection hints are pretty dubious and I think most people would be better suited with a real duplicate detection tool. I've had good experiences with Simian in the past.

Wednesday, May 27, 2020

Fixing Space Leaks in Ghcide

Summary: A performance investigation uncovered a memory leak in unordered-containers and performance issues with Ghcide.

Over the bank holiday weekend, I decided to devote some time to a possible Shake build system performance issue in Ghcide Haskell IDE. As I started investigating (and mostly failed) I discovered a space leak which I eventually figured out, solved, and then (as a happy little accident) got a performance improvement anyway. This post is a tale of what I saw, how I tackled the problem, and how I went forward. As I'm writing the post, not all the threads have concluded. I wrote lots of code during the weekend, but most was only to experiment and has been thrown away - I've mostly left the code to the links. Hopefully the chaotic nature of development shines through.

Shake thread-pool performance

I started with a Shake PR claiming that simplifying the Shake thread pool could result in a performance improvement. Faster and simpler seems like a dream combination. Taking a closer look, the simpler version seemed simpler because it supported fewer features (e.g. ability to kill all threads when one has an exception, some fairness/scheduling properties). But some of those features (e.g. better scheduling) were in the pursuit of speed, so if a simpler scheduler was 30% faster (the cost of losing randomised scheduling), that might not matter.

The first step was to write a benchmark. It's very hard to synthesise a benchmark that measures the right thing, but spawning 200K short tasks into the thread pool seemed a plausible start. As promised on the PR, the simpler version did indeed run faster. But interestingly, the simplifications weren't really responsible for the speed difference - switching from forkIO to forkOn explained nearly all the difference. I'm not that familiar with forkOn, so decided to micro-benchmark it - how long does it take to spawn off 1M threads with the two methods. I found two surprising results:

  • The performance of forkOn was quadratic! A GHC bug explains why - it doesn't look too hard to fix, but relying on forkOn is unusual, so it's unclear if the fix is worth it.
  • The performance of forkIO was highly inconsistent. Often it took in the region of 1 second. Sometimes it was massively faster, around 0.1s. A StackOverflow question didn't shed much light on why, but did show that by using the PVar concurrency primitive it could be 10x faster. There is a GHC bug tracking the issue, and it seems as though the thread gets created then immediately switches away. There is a suggestion from Simon Peyton Jones of a heuristic that might help, but the issue remains unsolved.
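
For reference, here is a minimal sketch of the kind of micro-benchmark involved (not the exact code I ran) - it just times spawning 1M trivial threads with each primitive:

import Control.Concurrent (forkIO, forkOn)
import Control.Monad (replicateM_)
import Data.Time.Clock (diffUTCTime, getCurrentTime)

-- Time an IO action and print the label with the elapsed time.
time :: String -> IO () -> IO ()
time label act = do
    start <- getCurrentTime
    act
    end <- getCurrentTime
    putStrLn $ label ++ ": " ++ show (diffUTCTime end start)

main :: IO ()
main = do
    let n = 1000000
    time "forkIO" $ replicateM_ n $ forkIO $ pure ()
    time "forkOn" $ replicateM_ n $ forkOn 0 $ pure ()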

My desire to switch the Shake thread-pool to a quadratic primitive which is explicitly discouraged is low. Trying to microbenchmark with primitives that have inconsistent performance is no fun. The hint towards PVar is super interesting, and I may follow up on it in future, but given the remarks in the GHC tickets I wonder if PVar is merely avoiding one small allocation, and avoiding an allocation avoids a context switch, so it's not a real signal.

At this point I decided to zoom out and try benchmarking all of Ghcide.

Benchmarking Ghcide

The thread about the Shake thread pool pointed at a benchmarking approach of making hover requests. I concluded that making a hover request with no file changes would benchmark the part of Shake I thought the improved thread-pool was most likely to benefit. I used the Shake source code as a test bed, and opened a file with 100 transitive imports, then did a hover over the listToMaybe function. I knew that would require Shake validating that everything is up to date, and then doing a little bit of hover computation.

I knew I was going to be running Ghcide a lot, and the Cabal/Stack build steps are frustratingly slow. In particular, each time around, Stack wanted to unregister the Ghcide package. Therefore, I wrote a simple .bat file that compiled Ghcide and my benchmark using ghc --make. So I could experiment quickly with changes to Shake, I pulled in all of Shake as source, not as a separate library, with an include path. I have run that benchmark hundreds of times, so the fact it is both simple (no arguments) and as fast as I could get has easily paid off.

For the benchmark itself, I first went down the route of looking at the replay functionality in lsp-test. Sadly, that code doesn't link to anything that explains how to generate traces. After asking on the haskell-ide-engine IRC I got pointed at both the existing resCaptureFile functionality and a vastly improved version in a PR, which doesn't fail if two messages race with each other. Configuring that and running it on my benchmark in the IDE told me that the number of messages involved was tiny - pretty much an initialisation and then a bunch of hovers. Coding those directly in lsp-test was trivial, and so I wrote a benchmark. The essence was:

doc <- openDoc "src/Test.hs" "haskell"
(t, _) <- duration $ replicateM_ 100 $
    getHover doc $ Position 127 43
print t

Open a document. Send 100 hover requests. Print the time taken.

Profiling Ghcide

Now that I could run 100 hovers, I wanted to use the GHC profiling mechanisms. Importantly, the 100 hover requests dominate the loading by a huge margin, so profiles would focus on the right thing. I ran a profile, but it was empty. It turns out the way lsp-test invokes the binary it is testing means it kills it too aggressively to allow GHC to write out profiling information. I changed the benchmark to send a shutdown request at the end, then sleep, and changed Ghcide to abort on a shutdown, so it could write the profiling information.

Once I had the profiling information, I was thoroughly uninformed. 10% went in file modification checking, which could be eliminated. 10% seemed to go to hash table manipulations, which seemed on the high side, but not too significant (turned out I was totally wrong, read to the end!). Maybe 40% went in the Shake monad, but profiling exaggerates those costs significantly, so it's unclear what the truth is. Nothing else stood out, but earlier testing when profiling forkIO operations had shown they weren't counted well, so that didn't mean much.

Prodding Ghcide

In the absence of profiling data, I started changing things and measuring the performance. I tried a bunch of things that made no difference, but some things did have an impact on the time to do 100 hovers:

  • Running normally: 9.77s. The baseline.
  • Switching to forkOn: 10.65s. Suggestive that either Ghcide has changed, or the project is different, or platform differences mean that forkOn isn't as advantageous.
  • Using only one Shake thread: 13.65s. This change had been suggested in one ticket, but made my benchmark worse.
  • Avoid spawning threads for things I think will be cheap: 7.49s. A useful trick, and maybe one that will be of benefit in future, but a roughly 25% reduction in run time seemed a poor return for such a significant change.
  • Avoid doing any Shake invalidation: 0.31s. An absolute lower bound if Shake cheats and does nothing.

With all that, I was a bit dejected - "performance investigation reveals nothing of note" was not a great conclusion from a day's work. I think that other changes to Ghcide to run Shake less and cache data more will probably make this benchmark even less important, so the conclusion worsens - performance investigation of nothing of note reveals nothing of note. How sad.

But in my benchmark I did notice something - a steadily increasing memory size in process explorer. Such issues are pretty serious in an interactive program, and we'd fixed several issues recently, but clearly there were more. Time to change gears.

Space leak detection

Using the benchmark I observed a space leak. But the program is huge, and manual code inspection usually needs a 10-line code fragment to have a chance. So I started modifying the program to do less, and continued until the program did as little as it could, but still leaked space. After I fixed a space leak, I zoomed out, checked whether the space leak persisted, and then had another go.

The first investigation took me into the Shake Database module. I found that if I ran the Shake script to make everything up to date, but did no actions inside, then there was a space leak. Gradually commenting out lines (over the course of several hours) eventually took me to:

step <- pure $ case v of
    Just (_, Loaded r) -> incStep $ fromStepResult r
    _ -> Step 1

This code increments a step counter on each run. In normal Shake this counter is written to disk each time, which forces the value. In Ghcide we use Shake in memory, and nothing ever forced the counter. The change was simple - replace pure with evaluate. This fix has been applied to Shake HEAD.
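
The underlying pattern is easy to reproduce outside Ghcide. As a toy illustration (my own sketch, not the real code), accumulating a counter via pure builds a chain of thunks, while evaluate keeps it flat:

import Control.Exception (evaluate)
import Control.Monad (foldM)

-- With pure, the accumulator becomes a nest of unevaluated (+1) thunks;
-- with evaluate, each step is forced as soon as it is produced.
leaky, strict :: Int -> IO Int
leaky  n = foldM (\acc _ -> pure     (acc + 1)) 0 [1 .. n]
strict n = foldM (\acc _ -> evaluate (acc + 1)) 0 [1 .. n]

main :: IO ()
main = do
    print =<< strict 1000000  -- runs in constant space
    print =<< leaky  1000000  -- retains ~1M thunks until the print forces them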

Space leak detection 2

The next space leak took me to the Shake database reset function, which moves all Shake keys from Ready to Loaded when a new run starts. I determined that if you didn't run this function, the leak went away. I found a few places I should have put strictness annotations, and a function that mutated an array lazily. I reran the code, but the problem persisted. I eventually realised that if you don't call reset then none of the user rules run either, which was really what was fixing the problem - but I committed the improvements I'd made even though they don't fix any space leaks.

By this point I was moderately convinced that Shake wasn't to blame, so turned my attention to the user rules in Ghcide. I stubbed them out, and the leak went away, so that looked plausible. There were 8 types of rules that did meaningful work during the hover operation (things like GetModificationTime, DoesFileExist, FilesOfInterest). I picked a few in turn, and found they all leaked memory, so picked the simple DoesFileExist and looked at what it did.

For running DoesFileExist I wrote a very quick "bailout" version of the rule, equivalent to the "doing nothing" case, then progressively enabled more bits of the rule before bailing out, to see what caused the leak. The bailout looked like:

Just v <- getValues state key file
let bailout = Just $ RunResult ChangedNothing old $ A v

I progressively enabled more and more of the rule, but even with the whole rule enabled, the leak didn't recur. At that point, I realised I'd introduced a syntax error and that all my measurements for the last hour had been using a stale binary. Oops. I spun up a copy of Ghcid, so I could see syntax errors more easily, and repeated the measurements. Again, the leak didn't recur. Very frustrating.

At that point I had two pieces of code, one which leaked and one which didn't, and the only difference was the unused bailout value I'd been keeping at the top to make it easier to quickly give up half-way through the function. Strange though it seemed, the inescapable conclusion was that getValues must somehow be fixing the space leak.

If getValues fixes a leak, it is a likely guess that setValues is causing the leak. I modified setValues to also call getValues and the problem went away. But, after hours of staring, I couldn't figure out why. The code of setValues read:

setValues state key file val = modifyVar_ state $ \vals -> do
    evaluate $ HMap.insert (file, Key key) (fmap toDyn val) vals

Namely, modify a strict HashMap from unordered-containers, forcing the result. After much trial and error I determined that a "fix" was to add:

case HMap.lookup k res of
    Nothing -> pure ()
    Just v -> void $ evaluate v

It's necessary to insert into the strict HashMap, then do a lookup, then evaluate the result that comes back, or there is a space leak. I duly raised a PR to Ghcide with the unsatisfying comment:

I'm completely lost, but I do have a fix.

It's nice to fix bugs. It's better to have some clue why a fix works.

Space leak in HashMap

My only conclusion was that HashMap must have a space leak. I took a brief look at the code, but it was 20+ lines and nothing stood out. I wrote a benchmark that inserted billions of values at 1000 random keys, but it didn't leak space. I puzzled it over in my brain, and then about a day later inspiration struck. One of the cases was to deal with collisions in the HashMap. Most HashMaps don't have any collisions, so a bug hiding there could survive a very long time. I wrote a benchmark with colliding keys, and lo and behold, it leaked space. Concretely, it leaked 1Gb/s, and brought my machine to its knees. The benchmark inserted three keys all with the same hash, then modified one key repeatedly. I posted the bug to the unordered-containers library.
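
Roughly, the benchmark looked like the sketch below (the key type and iteration count are my reconstruction, not the exact code): a Hashable instance that always collides, three keys inserted, then one of them repeatedly overwritten with the map forced each time.

import Control.Monad (foldM_)
import Data.Hashable (Hashable (..))
import qualified Data.HashMap.Strict as HMap

-- A key type whose hash deliberately collides for every value.
newtype Collide = Collide Int deriving (Eq, Show)

instance Hashable Collide where
    hashWithSalt _ _ = 0

main :: IO ()
main = do
    let m0 = HMap.fromList [(Collide i, 0 :: Int) | i <- [1 .. 3]]
    -- Repeatedly overwrite one colliding key, forcing the map to WHNF each
    -- time; with the buggy unordered-containers this retained the old values.
    foldM_ (\m i -> pure $! HMap.insert (Collide 1) i m) m0 [1 .. 10000000 :: Int]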

I also looked at the code, figured out why the space leak was occurring, and found a potential fix. However, the fix requires duplicating some code, and it's likely the same bug exists in several other code paths too. The Lazy vs Strict approach of HashMap being dealt with as an outer layer doesn't quite work for the functions in question. I took a look at the PR queue for unordered-containers and saw 29 requests, with the recent few having no comments on them. That's a bad sign and suggested that spending time preparing a PR might be in vain, so I didn't.

Aside: Maintainers get busy. It's not a negative reflection on the people who have invested lots of time in this library, and I thank them for their effort! Given 1,489 packages on Hackage depend on it, I think it could benefit from additional bandwidth from someone.

Hash collisions in Ghcide

While hash collisions leading to space leaks is bad, having hash collisions at all is also bad. I augmented the code in Ghcide to print out hash collisions, and saw collisions between ("Path.hs", Key GetModificationTime) and ("Path.hs", Key DoesFileExist). Prodding a bit further I saw that the Hashable instance for Key only consulted its argument value, and given most key types are simple data Foo = Foo constructions, they all had the same hash. The solution was to mix in the type information stored by Key. I changed to the definition:

hashWithSalt salt (Key key) = hashWithSalt salt (typeOf key) `xor` hashWithSalt salt key

Unfortunately, that now gave hash collisions with different paths at the same key. I looked into the hashing for the path part (which is really an lsp-haskell-types NormalizedFilePath) and saw that it used an optimised hashing scheme, precomputing the hash, and returning it with hash. I also looked at the hashable library and realised the authors of lsp-haskell-types hadn't implemented hashWithSalt. If you don't do that, a generic instance is constructed which deeply walks the data structure, completely defeating the hash optimisation. A quick PR fixes that.
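
The fix amounts to making sure both hash and hashWithSalt use the stored value. A sketch of the shape of such an instance (the field layout here is my guess at the type, not the real definition):

import Data.Hashable (Hashable (..))

-- Hypothetical shape: the precomputed hash stored alongside the path.
data NormalizedFilePath = NormalizedFilePath !Int !FilePath

instance Hashable NormalizedFilePath where
    hash (NormalizedFilePath h _) = h
    -- Defining hashWithSalt explicitly avoids the generic fallback that walks
    -- the whole FilePath and defeats the precomputed hash.
    hashWithSalt salt (NormalizedFilePath h _) = hashWithSalt salt h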

I also found that for tuples, the components are combined through the salt argument. Therefore, to hash the pair of path information and Key, the Key hashWithSalt gets called with the hash of the path as its salt. However, looking at the definition above, you can imagine that both the hashWithSalt of the type and the hashWithSalt of the key expand to something like:

hashWithSalt salt (Key key) = (salt `xor` hash (typeOf key)) `xor` (salt `xor` hash key)

Since xor is associative and commutative, those two salt values cancel out! While I wasn't seeing complete cancellation, I was seeing quite a degree of collision, so I changed to:

hashWithSalt salt (Key key) = hashWithSalt salt (typeOf key, key)

With that fix in Ghcide, all collisions went away, and all space leaks left with them. I had taken this implementation of hash combining from Shake, and while it's not likely to be a problem in the setting it's used there, I've fixed it in Shake too.

Benchmarking Ghcide

With the hash collisions reduced, and the number of traversals when computing a hash reduced, I wondered what the impact was on performance. A rerun of the original benchmark showed the time had reduced to 9.10s - giving a speed up of about 5%. Not huge, but welcome.

Several days later we're left with fewer space leaks, more performance, and hopefully a better IDE experience for Haskell programmers. I failed in what I set out to do, but found some other bugs along the way, leading to 9 PRs/commits and 4 ongoing issues. I'd like to thank everyone in the Haskell IDE team for following along, making suggestions, confirming suspicions, and generally working as a great team. Merging the Haskell IDE efforts continues to go well, both in terms of code output, and team friendliness.

Saturday, May 23, 2020

Shake 0.19 - changes to process execution

Summary: The new version of Shake has some tweaks to how stdin works with cmd.

I've just released Shake 0.19, see the full change log. Most of the interesting changes in this release are around the cmd/command functions, which let you easily run command lines. As an example, Shake has always allowed:

cmd "gcc -c" [source] "-o" [output]

This snippet compiles a source file using gcc. The cmd function is variadic, and treats strings as space-separated arguments, and lists as literal arguments. It's overloaded by return type, so can work in the IO monad (entirely outside Shake) or the Shake Action monad (inside Shake). You can capture results and pass in options, e.g. to get the standard error and run in a different directory, you can do:

Stderr err <- cmd "gcc -c" [source] "-o" [output] (Cwd "src")

Shake is a dynamic build system with advanced dependency tracking features that lets you write your rules in Haskell. It just so happens that running commands is very common in build systems, so while not really part of a build system, it's a part of Shake that has had a lot of work done on it. Since cmd is both ergonomic and featureful, I've taken to using the module Development.Shake.Command in non-Shake related projects.

Recent cmd changes

The first API breaking change only impacts users of the file access tracing. The resulting type is now polymorphic, and if you opt for FSATrace ByteString, you'll get your results a few milliseconds faster. Even if you stick with FSATrace FilePath, you'll get your results faster than the previous version. Performance of tracing happened to matter for a project I've been working on :-).

The other changes in this release are to process groups and the standard input. In Shake 0.18.3, changes were made to switch to create_group=True in the process library, as that improves the ability to cancel actions and clean up sub-processes properly. Unfortunately, on Linux that caused processes that read from standard input to hang. The correlation between these events, and the exact circumstances that triggered it, took a long time to track down - thanks to Gergő Érdi for some excellent bisection work. Most processes that are run in a build system should not access the standard input, and the only reports have come from docker (don't use -i) and ffmpeg (pass -nostdin), but hanging is a very impolite way to fail. In older versions of Shake the child inherited Shake's stdin (unless you specified the stdin explicitly with Stdin), but now we create a new pipe with no contents. There are now options NoProcessGroup and InheritStdin which let you change these settings independently. I suspect a handful of commands will need flag tweaks to stop reading the stdin, but they will probably fail saying the stdin is inaccessible, so debugging it should be relatively easy.
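
For the rare command that genuinely should read Shake's stdin, the new options let you restore the old behaviour. A minimal sketch of how that might look inside a rule (the tool being run is made up):

import Development.Shake

-- A hypothetical rule whose command really does read from stdin: opt back in
-- to inheriting Shake's stdin and out of the separate process group.
interactiveRule :: Rules ()
interactiveRule =
    "interactive.log" %> \out -> do
        Stdout res <- cmd "my-interactive-tool --batch" InheritStdin NoProcessGroup
        writeFile' out res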

In another tale of cmd not working how you might hope, in Shake 0.15.2 we changed cmd to close file handles when spawning a process. Unfortunately, that step is O(n) in the number of potential handles on your system, where n is RLIMIT_NOFILE and can be quite big, so we switched back in 0.18.4. Since 0.18.4 you can pass CloseFileHandles if you definitely want handles to be closed. It's been argued that fork is a bad design, and this performance vs safety trade-off seems another point in favour of that viewpoint.
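
If you do want the old close-everything behaviour for a particular command, it's now opt-in. A tiny sketch (the command line is just a stand-in):

import Development.Shake.Command

main :: IO ()
main = cmd "gcc --version" CloseFileHandles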

The amount of work that has gone into processes, especially around timeout and cross-platform differences, has been huge. I see 264 commits to these files, but the debugging time associated with them has been many weeks!

Other changes

This release contains other little tweaks that might be useful:

  • Time spent in the batch function is better accounted for in profiles.
  • Finally deleted the stuff that has been deprecated since 2014, particularly the *> operator. I think a six year deprecation cycle seems more than fair for a pre-1.0 library.
  • Optimised modification time on Linux.

Wednesday, May 20, 2020

GHC Unproposals

Summary: Four improvements to Haskell I'm not going to raise as GHC proposals.

Writing a GHC proposal is a lot of hard work. It requires you to fully flesh out your ideas, and then defend them robustly. That process can take many months. Here are four short proposals that I won't be raising, but think would be of benefit (if you raise a proposal for one of them, I'll buy you a beer next time we are physically co-located).

Use : for types

In Haskell we use : for list-cons and :: for types. That's the wrong way around - the use of types is increasing, the use of lists is decreasing, and type theory has always used :. This switch has been joke-proposed before. We actually switched these operators in DAML, and it worked very nicely. Having written code in both styles, I now write Haskell on paper with : for types instead of ::. Nearly all other languages use : for types, even Python. It's sad when Python takes a more academically pure approach than Haskell.

Is it practical: Maybe. The compiler diff is quite small, so providing it as an option has very little technical cost. The problem is it bifurcates the language - example code will either work with : for types or :: for types. It's hard to write documentation, textbooks, etc. If available, I would switch my code.

Make recursive let explicit

Currently you can write let x = x + 1 and it means loop forever at runtime because x is defined in terms of itself. You probably meant to refer to the enclosing x, but you don't get a type error, and often don't even get a good runtime error message, just a hang. In do bindings, to avoid the implicit reference to self, it's common to write x <- pure $ x + 1. That can impose a runtime cost, and obscure the true intent.
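
To make the pitfall concrete, here is a small example of my own (not from any proposal):

-- This compiles, but loops forever when called: the x on the right-hand side
-- refers to the x being bound, not to the function argument.
broken :: Int -> Int
broken x = let x = x + 1 in x

-- The common do-notation workaround, which shadows deliberately via pure:
step :: Int -> IO Int
step x = do
    x <- pure $ x + 1   -- refers to the argument x, at a small runtime cost
    pure x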

In languages like OCaml there are two different forms of let - one which allows a variable to be used in its own definition (spelled let rec) and one which doesn't (spelled let). Interestingly, this distinction is important in GHC Core, which has two different keywords, and a source let desugars differently based on whether it is recursive. I think Haskell should add letrec as a separate keyword and make normal let non-recursive. Most recursive bindings are done under a where, and these would continue to allow full recursion, so most code wouldn't need changing.

Is it practical: The simplest version of this proposal would be to add letrec as a keyword equivalent to let and add a warning on recursive let. Whether it's practical to go the full way and redefine the semantics of let to mean non-recursive binding depends on how strong the adoption of letrec was, but given that I suspect recursive let is less common, it seems like it could work. Making Haskell a superset of GHC Core is definitely an attractive route to pursue.

Allow trailing variables in bind

When writing Haskell code, I often have do blocks that I'm in the middle of fleshing out, e.g.:

do fileName <- getLine
   src <- readFile fileName

My next line will be to print the file or similar, but this entire do block, and every sub-part within it, is constantly a parse error until I put in that final line. When the IDE has a parse error, it can't really help me as much as I'd like. The reason for the error is that <- can't be the final line of a do block. I think we should relax that restriction, probably under a language extension that only IDE's turn on. It's not necessarily clear what such a construct should mean, but in many ways that isn't the important bit, merely that such a construct results in a valid Haskell program, and allows more interactive feedback.

Is it practical: Yes, just add a language extension - since it doesn't actually enable any new power it's unlikely to cause problems. Fleshing out the semantics, and whether it applies to let x = y statements in a do block too, is left as an exercise for the submitter. An alternative would be to not change the language, but make GHC emit the error slightly later on, much like -fdefer-type-errors, which still works for IDEs (either way needs a GHC proposal).

Add an exporting keyword

Currently the top of every Haskell file duplicates all the identifiers that are exported - unless you just export everything (which you shouldn't). That approach duplicates logic, makes refactorings like renamings more effort, and makes it hard to immediately know if the function you are working on is exposed. It would be much nicer if you could just declare things that were exported inline, e.g. with a pub keyword - so pub myfunc :: a -> a both defines and exports myfunc. Rust has taken this approach and it works out quite well, modulo some mistakes. The current Haskell design has been found a bit wanting, with constructs like pattern Foo in the export list to differentiate when multiple names Foo might be in scope, when attaching the visibility to the identifier would be much easier.

Is it practical: Perhaps, provided someone doesn't try and take the proposal too far. It would be super tempting to differentiate between exports out of the package, and exports that are only visible inside the package (what Rust clumsily calls pub(crate)). And there are other things in the module system that could be improved. And maybe we should export submodules. I suspect everyone will want to pile more things into this design, to the point it breaks, but a simple exporting keyword would probably be viable.