Neil Mitchell's Blog (Haskell etc)

<h2>Working on build systems full-time at Meta (2022-05-04)</h2>
<p>
<em>Summary: I joined Meta 2.5 years ago to work on build systems. I’m enjoying it.</em>
</p>
<p>
I joined <a href="https://about.facebook.com/">Meta</a> over two years ago when an opportunity arose to work on build systems full time. I started the <a href="https://ndmitchell.com/#shake_01_oct_2010">Shake build system at Standard Chartered</a> over 10 years ago, and then wrote an <a href="https://shakebuild.com/">open source version</a> a few years later. Since then, I’ve always been dabbling in build systems, at both <a href="https://neilmitchell.blogspot.com/2021/09/reflecting-on-shake-build-system.html">moderate</a> and <a href="https://neilmitchell.blogspot.com/2021/09/small-project-build-systems.html">small</a> scale. I really enjoyed writing the <a href="https://ndmitchell.com/#shake_21_apr_2020">Build Systems a la Carte paper</a>, and as a result, started to appreciate some of the <a href="https://bazel.build/">Bazel</a> and <a href="https://buck.build/">Buck</a> design decisions. I was involved in the <a href="https://www.tweag.io/posts/2019-10-09-bazel-cabal-stack.html">Bazel work at Digital Asset</a>, and after that decided that there was still lots of work to be done on build systems. I did some work on <a href="https://shakebuild.com/cloud">Cloud Shake</a>, but the fact that I wasn’t working on it every day, and that I wasn’t personally using it, made it hard to productionize. A former colleague now at Meta reached out and invited me for breakfast — one thing led to another, and I ended up at Meta working on build systems full time.
</p>
<h2>What I’ve learnt about build systems</h2>
<p>
The biggest challenge at Meta is the scale. When I joined they already used the <a href="https://buck.build/">Buck build system</a>, which had been developed at Meta. Looking at the first milliseconds after a user starts an incremental build is illustrative:
</p>
<ul>
<li>With Shake, it starts the process, loads the database into memory, walks the entire graph calling <code>stat</code> on each input and runs any build actions.</li>
<li>With Buck, it connects to a running daemon, talks to a running file watcher (<a href="https://facebook.github.io/watchman/">Watchman</a> in the case of Buck) and uses reverse dependencies to jump to the required actions.</li>
</ul>
<p>
For Shake, on repos with 100K files, that process might take ~0.5s, but it is <em>O(n)</em>. If you increase to 10M files, it takes 50s, and your users will revolt. With Buck, the overhead is proportional to the number of <em>changed</em> files, which is usually a handful.
</p>
<p>
While Shake is clearly infeasible at the scale of Meta, Buck was also starting to show its age, and I’ve been working with others to <a href="https://developers.facebook.com/blog/post/2021/07/01/future-of-buck/">significantly improve Buck</a>, borrowing lessons from everywhere, including Shake. Buck also addresses problems that Shake doesn’t, such as how to cope with multi-configuration builds (e.g. building for x86 and ARM simultaneously), having a separate file and target namespace and effective use of remote execution and caching.
</p>
<p>
We expect that the new version of Buck will be released open source soon, at which point I’ll definitely be talking more about the design and engineering trade-offs behind it.
</p>
<h2>What's different moving from finance to tech</h2>
<p>
My career to date has been in finance, so working at Meta is a very different world. Below are a few things that stand out (I believe most of these are common to other big tech companies too, but Meta is my first one).
</p>
<p>
<strong>Engineering career ladder:</strong> In finance the promotion path for a good engineer is to become a manager of engineers, then a manager of managers, and so on up. In my previous few roles I was indeed managing teams, which included setting technical direction and doing coding. At Meta, managers look after people, and help set the team direction. Engineers look after code and services, and set the technical direction. But importantly, you can be promoted as an engineer, without gaining direct reports, and the opportunities and compensation are equivalent to those for managers. There are definitely aspects of management that I like (e.g. mentoring, career growth, starting collaborations), and happily all of these are things engineers can still engage in.
</p>
<p>
<strong>Programmer-centric culture:</strong> In finance the company is often built around traders and sales people. In tech, the company is built around programmers, which is visible in the culture. There are hardware vending machines, free food, free ice cream, minimal approvals. They’ve done a very good job of providing a relaxing and welcoming environment (with open plan offices, but I don’t mind that aspect). The one complaint I had was that Meta used to have a pretty poor work from home policy, but that policy has since been completely rewritten and is now very good.
</p>
<p>
<strong>Reduced hierarchy:</strong> I think this may be more true of Meta than of other tech companies, but there is very minimal hierarchy. Programmers are all just programmers, not “senior” or “junior”. I don’t have the power to tell anyone what to do, but in a slightly odd way, my manager doesn’t have that power either. If I want someone to tackle a bug, I have to justify that it is a worthwhile thing to do. One consequence of that is that the ability to form relationships and influence people is much more important. Another consequence that I didn’t foresee is that working with people in different teams is very similar to working with people in your team, since exactly the same skills apply. I can message any engineer at Meta about random ideas and possible collaborations, and everyone is happy to talk.
</p>
<p>
<strong>Migration is harder:</strong> In previous places I worked, if we needed 100 files moved to a new version of a library, someone got told to do it, and they went away and spent a lot of time doing it. At Meta that’s a lot harder — firstly, it’s probably 100K files due to the larger scale, and secondly, telling someone they must do something is a lot less effective. That means there is a greater focus on automation (automatically editing the files), compatibility (doesn’t require editing the files) and benefits (ensuring that moving to the new version of the library will make your life better). All those are definitely better ways to tackle the problem, but sometimes, work must be done that is tedious and time consuming, and that is harder to make happen.
</p>
<p>
<strong>Open source:</strong> The process for open sourcing an internal library or tool in the developer infrastructure space is very smooth. The team I work in has open sourced the <a href="https://github.com/facebookexperimental/starlark-rust">Starlark programming language</a> (<a href="https://developers.facebook.com/blog/post/2021/04/08/rust-starlark-library/">taking over maintenance from Google</a>), the <a href="https://github.com/facebookincubator/gazebo">Gazebo Rust utility library</a> and a <a href="https://github.com/facebookincubator/gazebo_lint">Rust linter</a>, plus we have a few more projects in the pipeline. As I write code in the internal Meta monorepo, it gets sync’d to GitHub a few minutes later. It’s also easy to contribute to open source projects, e.g. Meta engineers have contributed to my projects such as <a href="https://neilmitchell.blogspot.com/2019/05/hoogle-xss-vulnerability.html">Hoogle</a> (before I even considered joining Meta).
</p>
<p>
<strong>Hiring:</strong> Meta hires a lot of engineers (e.g. <a href="https://www.cnbc.com/2020/01/21/facebook-is-creating-1000-new-jobs-in-the-uk.html">1,000 additional people in London</a>). That means that interviews are more like a production line, with a desire to have a repeatable process, where candidates are assigned teams after the interviews, rather than interviewing with a team. There are upsides and downsides to that—if I interview a strong candidate, it’s really sad to know that I probably won’t get to work closely with them. It also means that the interview process is determined centrally, so I can’t <a href="https://neilmitchell.blogspot.com/2020/07/how-i-interview.html">follow my preferences</a>. But it does mean that if a friend is looking for a job there’s often something available for them (you can find <a href="https://drive.google.com/file/d/1l2AhKj3C8yGmw_Rvo1CP8w2TgQKCCwY2/view">details on compilers and programming here</a> and a <a href="https://www.metacareers.com/jobs">full list of jobs here</a>), and a repeatable process is good for fairness.
</p>
<p>
Overall I’m certainly very happy to be working on build systems. The build system is the thing that stands between a user and trying out their changes, so anything I can do to make that process better benefits all developers. I’m very excited to share what I’ve been working on more widely in the near future!
</p>
<p>
(Disclosure: This blog post had to go through Meta internal review, because it’s an employee talking about Meta, but other than typos, came out unchanged.)
</p>

<h2>Huge Project Build Systems (2021-09-16)</h2>
<p><em>Summary: Shake won't scale to millions of files; this post describes what would be required to make it do so.</em></p>
<p>While Shake has compiled projects with hundreds of thousands of files, it's never scaled out to millions of files, and it would be unlikely to work well at that size. The most popular build systems that operate at that scale are <a href="https://buck.build/">Buck</a> (from Facebook) and <a href="https://bazel.build/">Bazel</a> (from Google). In this post I go through the changes that would need to be made to make Shake scale.</p>
<p>The first issue is covered in my <a href="https://neilmitchell.blogspot.com/2021/09/reflecting-on-shake-build-system.html">previous post</a>, that Shake doesn't know if you change the build rules themselves. As you scale up, it becomes much more important that if you change the rules, everything is properly tracked. As the number of people involved in a project increases, the rate at which the build system changes will also increase. Both Buck and Bazel solve this problem using a deterministic Python-based configuration language called <a href="https://developers.facebook.com/blog/post/2021/04/08/rust-starlark-library/">Starlark</a>. If Shake stopped being a Haskell DSL, I'd argue that it stops being Shake and becomes something different, so it's unclear what could be done there.</p>
<p>The next issue is that every time Shake is invoked, it checks the modification time of every file, and then walks the entire dependency graph. That works fine at 10K files, but as you move to 1M files, it takes too long. The solution is two-fold, first be informed which files have changed using notification APIs (e.g. the <a href="https://facebook.github.io/watchman/">Watchman tool</a>), and then use reverse dependencies to only explore the portion of the graph that has changed. Happily, Pepe already has a patch <a href="https://github.com/ndmitchell/shake/pull/802">adding reverse dependencies to Shake</a>, so that isn't too infeasible.</p>
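<p>As a rough sketch of the reverse-dependency idea (illustrative types, not the actual patch): once a file watcher reports which files changed, a traversal of the reverse edges visits only the affected portion of the graph, so the cost is proportional to the size of the change rather than the size of the repo.</p>

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

-- Map each file to the things that depend on it (the reverse edges).
type RevDeps = Map.Map FilePath [FilePath]

-- Starting from the files a watcher reported as changed, collect
-- everything transitively reachable via reverse dependencies.
dirtySet :: RevDeps -> [FilePath] -> Set.Set FilePath
dirtySet rev = go Set.empty
  where
    go seen [] = seen
    go seen (x:xs)
      | x `Set.member` seen = go seen xs
      | otherwise = go (Set.insert x seen) (Map.findWithDefault [] x rev ++ xs)
```

<p>Only the keys in the resulting set need re-checking; everything else is known clean without a single <code>stat</code> call.</p>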
<p>The final issue is that Shake was designed as a single-machine build system, not for sharing results between users. When I first wrote Shake, I didn't have access to servers, and AWS was brand new. Now, over a decade later, servers are easy to obtain and large scale build systems need to share results, so that if one user builds a file, no one else needs to. Within the realm of multi-user build systems, there are two basic operations - sharing results and offloading commands.</p>
<p>Shake, with its new <a href="https://shakebuild.com/cloud">cloud features</a>, is able to share results between users using a shared drive. It works, and big companies are using it for real, but I'd consider it fairly experimental. For execution, Shake is unable to run actions remotely, so can't make use of something like Bazel's <a href="https://bazel.build/remote-execution-services.html">remote execution API</a>. Since dependencies are specified at the rule level, and remote execution operates at the command level, there is a bit of a mismatch, and it's unclear what that might look like in Shake.</p>
<p>While Shake won't work at <em>huge</em> scales, it is still quite an effective build system at quite large scales. But, given the limitations, I imagine it will never get to the scale of Buck/Bazel. At the same time, Buck/Bazel lack dynamic dependencies, which makes them unable to express <a href="https://github.com/ghc-proposals/ghc-proposals/pull/245#issuecomment-890962688">rules such as Haskell</a> effectively.</p>
<p>Happily, I am involved with a new build system, the <a href="https://developers.facebook.com/blog/post/2021/07/01/future-of-buck/">next generation of Buck</a>. I joined Facebook two years ago, and since that time have been focused on this project. It's written in Rust, configured with Starlark (I've spent a lot of time working on an open-source <a href="https://github.com/facebookexperimental/starlark-rust">Starlark interpreter</a> in Rust), and should work at huge scales. It's not yet open source, but it will be - we are targeting early next year.</p>
<p>I think Shake is still a build system with a lot to offer, and continue to maintain and improve it. For people who want to scale beyond the range of Shake, I'd definitely recommend using the next generation of Buck, once it is available.</p>
<h2>Small Project Build Systems (2021-09-15)</h2>
<p><em>Summary: Forward build systems might work better for small projects.</em></p>
<p><a href="https://neilmitchell.blogspot.com/2021/09/reflecting-on-shake-build-system.html">Yesterday's post</a> talked about how Shake is a good medium sized build system - but what about smaller projects? Is the Shake model right for them? Shake can be considered a <em>backwards build system</em>. Each rule says how to produce a file, given some input files (which are dependencies) and an action. Almost all build systems (e.g. Make, Buck, Bazel, CMake, SCons, Ninja) fit this model, which is analysed in the <a href="https://ndmitchell.com/#shake_21_apr_2020">Build Systems a la Carte paper</a>. While this model works, it has two disadvantages:</p>
<ul>
<li>You have to explicitly list dependencies, or infer them from include files etc. That means either dependencies are insufficient (you probably forgot some), or they are excessive (you added some you don't need). Usually both.</li>
<li>You have to think backwards. When you ask someone how to build an executable from a C file, no one talks about linking first, but to program a build system you have to.</li>
</ul>
<p>The alternative to a backwards build system is a <em>forwards build system</em>, of which <a href="https://github.com/kgaughan/memoize.py">Memoize</a> was the first. You just write out the commands in order, and dependency tracing figures out if they have changed. To compile a C program it can be as simple as:</p>
<pre><code>gcc -c util.c
gcc -c main.c
gcc -o main main.o util.o
</code></pre>
<p>That build script is incredibly simple - so simple it could also be treated as a shell script.</p>
<p>A few years ago I wrote such a system, called <a href="https://github.com/ndmitchell/rattle">Rattle</a>, and wrote a paper about it at <a href="https://ndmitchell.com/#rattle_18_nov_2020">OOPSLA 2020</a> with my co-authors Sarah Spall and Sam Tobin-Hochstadt. Sarah gave a talk about Rattle <a href="https://www.youtube.com/watch?v=WRLfQo-IJTg">at OOPSLA</a>, and I gave a talk <a href="https://ndmitchell.com/#rattle_25_jun_2021">at Build Meetup 2021</a>. We were able to compile projects like NodeJS faster than the NodeJS build system (which uses Make), showing the idea might be feasible.</p>
<p>If forward build systems are so great, why do I think they are most suitable for small projects? There are four reasons, the first three of which have mitigations, but the final one sets a limit on the size at which forward build systems are suitable.</p>
<ol>
<li>Forward build systems rely on tracing which files are dependencies of a command. Doing that quickly in a cross-platform manner is a nightmare. There are tricks like hooking system calls etc, but it presents a significant engineering hurdle, especially on macOS, which makes this task harder with every release.</li>
<li>Forward build systems are immature. The earliest examples no longer work. Rattle is a relatively practical research system - it could evolve into a production system - but it's not there yet. And compared to the alternatives, Rattle is probably one of the closest to production, in large part because it builds off a lot of common infrastructure from Shake which is much more mature.</li>
<li>Forward build systems lack parallelism, since if you want to express parallelism, you need to think about dependencies once more, and it's easy to go wrong. Rattle mostly solves the parallelism problem by automatically inferring when it is safe to parallelise, which is how we were able to remain competitive with Make.</li>
</ol>
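<p>The parallelism inference can be modelled with a simple hazard check (a simplified model, not Rattle's actual code): two traced commands are safe to run in parallel when neither writes a file that the other reads or writes.</p>

```haskell
import qualified Data.Set as Set

-- The files a traced command was observed to read and write.
data Trace = Trace { trReads :: Set.Set FilePath, trWrites :: Set.Set FilePath }

-- Safe to run in parallel: neither command writes anything the other touches.
independent :: Trace -> Trace -> Bool
independent a b =
  Set.null (trWrites a `Set.intersection` (trReads b `Set.union` trWrites b)) &&
  Set.null (trWrites b `Set.intersection` trReads a)
```

<p>In the C example above, the two compile steps are independent of each other, but the link step reads both object files, so it must wait for them.</p>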
<p>And finally, the biggest issue is that forward build systems are not compositional, while backward build systems are. If you want to write a 1 million build rule system, in a backwards system, each rule looks like any other. Whereas in a forward build system, assuming you need to give an order, writing down that order in a compositional way is hard - in fact, whenever I've tried it, you start expressing the dependencies between entries and end up with a backwards build system.</p>
<p>Happily, people are continuing to research forward build systems. Rattle adds parallelism, <a href="https://blogs.ncl.ac.uk/andreymokhov/stroll/">Stroll</a> removes the requirement for an order, <a href="https://sites.science.oregonstate.edu/~roundyd/fac/">Fac</a> allows some dependencies and infers the remaining ones, and <a href="https://arxiv.org/pdf/2108.12469.pdf">LaForge</a> finds greater incrementality. Perhaps all those ideas can be combined, along with a lot of engineering, to produce a practical forward build system.</p>
<p>Rattle has shown a well engineered forward build system would be feasible for small projects. It's unclear how much larger the concept might be able to scale, probably never to millions of files, but for small projects it might provide a significantly lower effort path to writing build systems.</p>
<h2>Reflecting on the Shake Build System (2021-09-14)</h2>
<p><em>Summary: As a medium-sized build system, Shake has some good bits and some bad bits.</em></p>
<p>I first developed the <a href="https://shakebuild.com/">Shake build system</a> at Standard Chartered in 2008, <a href="https://shakebuild.com/faq#whats-the-history-of-shake">rewriting an open source version</a> in my spare time in 2011. I wrote a paper on Shake for <a href="https://ndmitchell.com/#shake_10_sep_2012">ICFP 2012</a> and then clarified some of the details in a <a href="https://ndmitchell.com/#shake_21_apr_2020">JFP 2020 paper</a>. Looking back, over a decade later, this post discusses what went well and what could be improved.</p>
<p>The first thing to note is that Shake is a medium sized build system. If you have either 1 source file or 1 million source files, Shake probably isn't a good fit. In this post I'm going to go through how Shake does as a medium-sized build system, and two other posts reflect on what I think a <a href="https://neilmitchell.blogspot.com/2021/09/small-project-build-systems.html">small build system</a> or <a href="https://neilmitchell.blogspot.com/2021/09/huge-project-build-systems.html">huge build system</a> might look like.</p>
<p>The most important thing Shake got right was adding monadic/dynamic dependencies. Most build systems start with a static graph, and then, realising that can't express the real world, start hacking in an unprincipled manner. The resulting system becomes a bunch of special cases. Shake embraced dynamic dependencies. That makes some things harder (no static cycle detection, less obvious parallelism, must store dependency edges), but all those make Shake itself harder to write, while dynamic dependencies make Shake easier to use. I hope that eventually <em>all</em> build systems gain dynamic dependencies.</p>
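<p>The essence of a monadic dependency, in the style of the Build Systems a la Carte paper (a sketch of the model, not Shake's internals), is that a task receives a fetch callback and can choose its later dependencies based on the values of earlier ones:</p>

```haskell
{-# LANGUAGE RankNTypes #-}

-- A build task that asks for its dependencies via a callback, so the
-- set of dependencies can depend on values fetched along the way.
newtype Task k v = Task { runTask :: forall m. Monad m => (k -> m v) -> m v }

-- A dynamic dependency: which file is needed next depends on a value
-- fetched first, which a static (applicative) graph cannot express.
example :: Task String String
example = Task $ \fetch -> do
  opt <- fetch "is_optimised"
  if opt == "yes" then fetch "foo.optimised" else fetch "foo.unoptimised"
```

<p>A static graph has to list both <code>foo.optimised</code> and <code>foo.unoptimised</code> as dependencies up front; the monadic version only ever fetches the one it needs.</p>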
<p>In addition to dynamic dependencies, Shake has early cut-off, meaning files that rebuild but don't change don't invalidate rules that depend upon them. This feature is increasingly becoming standard in build systems, which is great to see.</p>
<p>Shake is written as a Haskell DSL, which means users of Shake are writing a Haskell program that happens to heavily leverage the Shake library. That choice was a double-edged sword. There are some significant advantages:</p>
<ul>
<li>I didn't need to invent a special purpose language. That means I get to reuse existing tooling, existing learning materials, and existing libraries.</li>
<li>Since Shake is just a library, it <a href="https://hackage.haskell.org/package/shake/docs/Development-Shake.html">can be documented</a> with the normal tools like <a href="https://www.haskell.org/haddock/">Haddock</a>.</li>
<li>Users can extend Shake using Haskell, and publish libraries building on top of Shake.</li>
<li>The modelling of monadic dependencies in Haskell is pretty elegant, given a dedicated syntax for expressing monadic computations (the <code>do</code> keyword).</li>
<li>Projects like <a href="https://github.com/haskell/haskell-language-server">Haskell Language Server</a> can build on top of Shake in fairly fundamental ways. See our recent <a href="https://ndmitchell.com/#hls_04_sep_2020">IFL 2020 paper</a> for the benefits that brought.</li>
</ul>
<p>But there are also some downsides:</p>
<ul>
<li>Most significantly, the audience of Shake is somewhat limited by the fact that Shake users probably have to learn some Haskell. While the <a href="https://shakebuild.com/manual">user manual</a> aims to teach enough Haskell to write Shake without really knowing Haskell, it's still a barrier.</li>
<li>Haskell has some significant weaknesses, e.g. it has two competing package managers, countless distribution mechanisms, and none of these things are consistent for long periods of time. Haskell has a poor on-ramp, and thus so does Shake.</li>
</ul>
<p>The choice of an embedded DSL for Shake also leads to the issue that Shake doesn't know when a rule has changed, since a rule is opaque Haskell code. As a consequence, if you modify a command line in a <code>.hs</code> file, Shake is unaware and won't rebuild the necessary files. There are a bunch of techniques for dealing with this limitation (see the Shake functions <code>shakeVersion</code>, <code>versioned</code>), but none are pleasant, and it remains an easy mistake to make. A potential way out is to build a system which reads configuration files written in a language other than Haskell and interprets them, which I <a href="https://ndmitchell.com/#shake_09_oct_2015">gave a talk about</a>, and I've seen deployed in practice. But it's something where each user ends up rolling their own.</p>
<p>Another limitation is that Shake is (deliberately) quite low-level. It gives you a way to depend on a file, and a way to run a command line. It doesn't give you a way to express a C++ library. The hope from the beginning was that Shake would be language neutral, and that libraries would arise that built on top of Shake providing access to standard libraries. If you were writing a Python/C++/Ruby build script, you'd simply import those libraries, mix them together, and have a working build system. Some libraries have gone in that direction: <a href="https://hackage.haskell.org/package/shake-language-c"><code>shake-language-c</code></a> and <a href="https://github.com/jfeltz/shake-cpp"><code>shake-cpp</code></a> provide C++ rules, and <a href="https://hackage.haskell.org/package/avr-shake"><code>avr-shake</code></a> lets you work with <a href="https://www.obdev.at/products/crosspack/index.html">AVR Crosspack</a>. Unfortunately, there aren't enough libraries to just plug together a build system. I think a fundamental problem is that it's not immediately obvious <em>how</em> such libraries would compose, and without that composability, it's hard to build the libraries that would rely on composability.</p>
<p>Finally, the most surprising aspect about developing Shake is that a large part of the effort has gone into writing an ergonomic and cross-platform process executor. The result is found at <a href="https://hackage.haskell.org/package/shake/docs/Development-Shake-Command.html"><code>Development.Shake.Command</code></a>, and can be used outside Shake, letting users write:</p>
<pre><code class="language-haskell">cmd "gcc -c" [src]
</code></pre>
<p>This example invokes <code>gcc</code>, ensuring that <code>src</code> is properly escaped if it has spaces or other special characters. A significant amount of the engineering work in Shake has gone into that facility, when it's totally orthogonal to the build system itself.</p>
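<p>As a flavour of what "properly escaped" involves, here is a minimal POSIX-shell quoting function (an illustrative sketch only; the real <code>Development.Shake.Command</code> also has to deal with Windows quoting rules, environment variables, pipes and more):</p>

```haskell
-- Quote one argument for a POSIX shell: pass simple names through
-- untouched, otherwise single-quote, escaping embedded single quotes.
quoteArg :: String -> String
quoteArg s
  | not (null s) && all safe s = s
  | otherwise = "'" ++ concatMap esc s ++ "'"
  where
    safe c = c `elem` ['a'..'z'] ++ ['A'..'Z'] ++ ['0'..'9'] ++ "._-/"
    esc '\'' = "'\\''"  -- close the quote, emit an escaped quote, reopen
    esc c = [c]
```

<p>Getting this right on every platform, for every odd filename, is exactly the sort of detail that quietly consumes engineering time.</p>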
<p>In the next two parts of this series, I'll go through why I think Shake isn't a great build system for tiny projects (and what might be), followed by why Shake isn't great for absolutely huge projects (and what would need to be fixed).</p>
<h2>Recording video (2021-01-17)</h2>
<p><em>Summary: Use OBS, Camo and Audacity.</em></p>
<p>I recently needed to record a presentation which had slides and my face combined, using a Mac. Based on suggestions from friends and searching the web, I came up with a recipe that worked reasonably well. I'm writing this down to both share that recipe, and so I can reuse the recipe next time.</p>
<p><strong>Slide design:</strong> I used a slide template which had a vertical rectangle hole at the bottom left so I could overlay a picture of my video. It took a while to find a slide design that looked plausible, and make sure callouts/quotes etc didn't overlap into this area.</p>
<p><strong>Camera:</strong> The best camera you have is probably the one on your phone. To hook up my iPhone to my Mac I used a <a href="https://www.apple.com/uk/shop/product/MX0K2ZM/A/usb-c-to-lightning-cable-1m">£20 lightning to USB-C cable</a> (next day shipping from Apple) along with the software <a href="https://reincubate.com/camo/">Camo</a>. I found Camo delightfully easy to use. I paid £5 per month to disable the logo and because I wanted to try out the portrait mode to blur my background - but that mode kept randomly blurring and unblurring things in the background, so I didn't use it. Camo is useful, but I record videos infrequently, and £5/month is way too steep. I'm not a fan of software subscriptions, so I'll remember to cancel Camo. Because it is subscription based, and subscribing/cancelling is a hassle, I'll probably just suck up the logo next time.</p>
<p><strong>Composition:</strong> To put it all together I used <a href="https://obsproject.com/">OBS Studio</a>. The lack of an undo feature is a bit annoying (click carefully), but otherwise everything was pretty smooth. I put my slide deck (in Keynote) on one monitor, and then had OBS grab the slide contents from it. I <em>didn't</em> use presentation mode in Keynote as that takes over all the screen, so I just used the slide editing view, with OBS cropping to the slide contents. One annoyance of slide editing view is that spelling mistakes (and variable names etc.) have red dotted underlines, so I had to go through every slide and make sure the spellings were ignored. Grabbing the video from Camo into OBS was very easy.</p>
<p><strong>Camera angle:</strong> To get the best camera angle I used a <a href="https://www.amazon.co.uk/gp/product/B07VTKKCLZ/">lighting plus phone stand</a> (which contains an impressive array of stands, clips, extensions etc) I'd already bought to position the camera right in front of me. Unfortunately, putting the camera right in front of me made it hard to see the screen, which is what I use to present from. It was awkward, and I had to make a real effort to ensure I kept looking into the camera - using my reflection on the back of the shiny iPhone to make sure I kept in the right position. Even then, watching the video after, you can see my eyes dart to the screen to read the next slide. There must be something better out there - or maybe it's only a problem if you're thinking about it and most people won't notice.</p>
<p><strong>Recording:</strong> For actual recording there are two approaches - record perfectly in one take (which may take many tries, or accepting a lower quality) or repeatedly record each section and edit it together after. I decided to go for a single take, which meant that if a few slides through I stumbled then I restarted. Looking at my output directory, I see 15 real takes, with a combined total of about an hour runtime, for a 20 minute talk. I did two complete run throughs, one before I noticed that spelling mistakes were underlined in dotted red.</p>
<p><strong>Conversion to MP4:</strong> OBS records files as <code>.mkv</code>, so I used <a href="https://www.videolan.org/vlc/index.en-GB.html">VLC</a> to preview them. When I was happy with the result, I converted the file to <code>.mp4</code> using the OBS feature "Remux recordings".</p>
<p><strong>Audio post processing:</strong> After listening to the audio, there was a clear background hum, I suspect from the fan of the laptop. I removed that using <a href="https://www.audacityteam.org/">Audacity</a>. Getting Audacity to open a <code>.mp4</code> file was a bit of an uphill struggle, following <a href="https://manual.audacityteam.org/man/installing_ffmpeg_for_mac.html">this guide</a>. I then cleaned up the audio using <a href="https://www.techsmith.com/blog/not-late-reduce-audio-noise-recordings-free/">this guide</a>, saved it as <code>.wav</code>, and reintegrated it with the video using <a href="https://ffmpeg.org/">ffmpeg</a> and <a href="https://superuser.com/a/277667/240196">this guide</a>. I was amazed and impressed how well Audacity was able to clean up the audio with no manual adjustment.</p>
<p><strong>Sharing:</strong> I shared the resulting video via <a href="https://www.dropbox.com/">DropBox</a>. However, when sharing via DropBox I noticed that the audio quality was significantly degraded in the DropBox preview on the iOS app. Be sure to download the file to assess whether the audio quality is adequate (it was fine when downloaded).</p>
<h2>Data types for build system dependencies (2020-11-15)</h2>
<p><em>Summary: Monadic and early cut-off? Use a sequence of sets.</em></p>
<p>In the <a href="https://ndmitchell.com/#shake_21_apr_2020">Build Systems a la Carte paper</a> we talk about the expressive power of various types of build systems. We deliberately simplify away parallelism and implementation concerns, but those details matter. In this post I'm going to discuss some of those details, specifically the representation of dependencies.</p>
<h2>Applicative build systems</h2>
<p>In an applicative build system like Make, all dependencies for a target are known before you start executing the associated action. That means the dependencies have no ordering, so are best represented as a set. However, because they can be calculated from the target, they don't usually need to be stored separately. The dependencies can also be evaluated in parallel. To build a target you evaluate the dependencies to values, then evaluate the action.</p>
<p>Early cut-off is when an action is skipped because none of its dependencies have changed <em>value</em>, even if some dependencies might have required recomputing. This optimisation can be incredibly important for build systems with generated code - potentially <a href="https://ndmitchell.com/#shake_10_sep_2012">seconds vs hours of build time</a>. To obtain early cut-off in applicative systems, after evaluating the dependencies you compare them to the previous results, and only run the action if there were changes.</p>
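<p>To make that concrete, here is a minimal sketch of the applicative check, where <code>Key</code> and <code>Hash</code> are hypothetical stand-ins for whatever a real build system records:</p>
<pre><code class="language-haskell">import Data.List (sort)

type Key = String
type Hash = Int

-- Dependencies form a set: order is irrelevant, so compare them sorted.
-- The action is skipped only when no dependency's value hash changed.
skipAction :: [(Key, Hash)] -> [(Key, Hash)] -> Bool
skipAction old new = sort old == sort new
</code></pre>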
<h2>Monadic build systems</h2>
<p>In monadic <a href="https://shakebuild.com/">build systems like Shake</a>, the representation of dependencies is more complex. If you have an alternative mechanism of detecting whether a rule is dirty (e.g. reverse dependencies) you don't need to record the dependencies at all. If the key is dirty, you start executing the action, and that will request the dependencies it needs. The action can then suspend, calculate the dependencies, and continue.</p>
<p>If you want early cut-off in a monadic build system, you need to rerun the dependencies in advance, and if they all have the same result, skip rerunning the action. Importantly, you probably want to rerun the dependencies in the <em>same order</em> that the action originally requested them -- otherwise you might pay a severe and unnecessary time penalty. As an example, let's consider an action:</p>
<pre><code class="language-haskell">opt <- need "is_optimised"
object <- if opt then need "foo.optimised" else need "foo.unoptimised"
link object
</code></pre>
<p>This rule is monadic, as whether you need the optimised or unoptimised dependency depends on the result of calculating some <code>is_optimised</code> property. If on the first run <code>is_optimised</code> is <code>True</code>, then we build <code>foo.optimised</code>. On the second run, if <code>is_optimised</code> is <code>False</code>, it is important we <em>don't</em> build <code>foo.optimised</code> as that might take a seriously long time and be entirely redundant. Therefore, it's important when checking for early cut-off we build in the order that the previous action requested the dependencies, and stop on the first difference we encounter.</p>
<p>(If you have unlimited resources, e.g. <a href="https://docs.bazel.build/versions/master/remote-execution.html">remote execution</a>, it might be profitable to evaluate everything in parallel - but we're assuming that isn't the case here.)</p>
<p>Provided a rule performs identically between runs (i.e. is deterministic and hasn't been changed), everything that we request to check for early cut-off will still be needed for real, and we won't have wasted any work. For all these reasons, it is important to store dependencies as a sequence (e.g. a list/vector).</p>
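<p>A rough sketch of that check, where <code>Key</code>, <code>Hash</code> and <code>fetch</code> are hypothetical stand-ins rather than Shake's real API:</p>
<pre><code class="language-haskell">type Key = String
type Hash = Int

-- Check recorded dependencies in the order they were originally requested,
-- stopping at the first one whose value has changed
checkInOrder :: Monad m => (Key -> m Hash) -> [(Key, Hash)] -> m Bool
checkInOrder _ [] = pure True
checkInOrder fetch ((k, h) : rest) = do
    h' <- fetch k  -- bring k up to date and hash its value
    if h' == h then checkInOrder fetch rest else pure False
</code></pre>
<p>Because the check stops at the first difference, <code>fetch</code> is never called on dependencies after the one that changed - which is exactly how we avoid building <code>foo.optimised</code> unnecessarily.</p>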
<h2>Monadic build systems plus parallelism</h2>
<p>Applicative build systems naturally request all their dependencies in parallel, but monadic build systems are naturally one dependency at a time. To regain parallelism, in build systems like Shake the primitive dependency requesting mechanism takes a <em>set</em> of dependencies that are computed in parallel. While requesting dependencies individually or in bulk gives the same result, requesting in bulk gives significantly more parallelism. (In Shake we use lists to track correspondence between requests and results, but it's morally a set.)</p>
<p>As we saw previously, it is still important that for early cut-off you reproduce the dependencies much like they were in the action. That means you request dependencies in the order they were requested, and when they were requested in bulk, they are also checked in bulk. Now we have a sequence of sets to represent dependencies, where the elements of the sets can be checked in parallel, but the sequence must be checked in order.</p>
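<p>Concretely, the dependency record becomes a sequence of sets. A minimal sketch, again with hypothetical <code>Key</code>/<code>Hash</code> types:</p>
<pre><code class="language-haskell">type Key = String
type Hash = Int

-- Outer list: checked in order. Inner lists: each can be checked in parallel.
type Deps = [[(Key, Hash)]]

-- True when no recorded dependency has changed; thanks to laziness,
-- later groups are never inspected once a changed group is found
clean :: (Key -> Hash) -> Deps -> Bool
clean current = all (all (\(k, h) -> current k == h))
</code></pre>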
<h2>Monadic build systems plus explicit parallelism</h2>
<p>What if we add an explicit parallelism operator to a monadic build system, something like <code>parallel :: [Action a] -> Action [a]</code>, to run arbitrary actions in parallel (which is what Shake provides)? Now, instead of a sequence of sets, we have a <em>tree</em> of parallelism. As before it's important when replaying that the dependencies are requested in order, but also that as much is requested in parallel as possible.</p>
<h2>What Shake does</h2>
<p>Shake is a monadic build system with early cut-off, parallelism and explicit parallelism. When building up dependencies it uses a tree representation. The full data type is:</p>
<pre><code class="language-haskell">data Traces
= None
| One Trace
| Sequence Traces Traces
| Parallel [Traces]
</code></pre>
<p>Sequenced dependencies are represented with <code>Sequence</code> and the traces captured by parallelism use <code>Parallel</code>. Importantly, constructing <code>Traces</code> values is nicely <em>O(1)</em> in all cases. (Shake v0.19.1 used a different representation and repeatedly normalised it, which could have awful time complexity - potentially <em>O(2^n)</em> in pathological cases.)</p>
<p>While these traces store complete information, actually evaluating that trace when checking for rebuilds would be complicated. Instead, we flatten that representation to <code>[[Trace]]</code> for writing to the Shake database. The outer list is a sequence, the inner list is morally a set. We have the invariant that no <code>Trace</code> value will occur multiple times, since if you depend on something once, and then again, the second dependency was irrelevant. To flatten <code>Parallel</code> computations we take the first required dependency in each parallel action, merge them together, and then repeat for the subsequent actions. If you run code like:</p>
<pre><code class="language-haskell">parallel [
need ["a"] >> parallel [need ["b"], need ["c"]],
need ["d"]
]
</code></pre>
<p>It will get flattened to appear as though you wrote <code>need ["a","d"] >> need ["b","c"]</code>. When checking, it will delay the evaluation of <code>b</code> and <code>c</code> until after <code>d</code> completes, even though that is unnecessary. But simplifying traces at the cost of marginally less rebuild parallelism for those who use explicit parallelism (which is not many) seems like the right trade-off for Shake.</p>
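<p>That flattening is short to sketch (this is not Shake's actual implementation, just code consistent with the description above), using <code>transpose</code> to merge the n-th step of each parallel branch into one set:</p>
<pre><code class="language-haskell">import Data.List (transpose)

type Trace = String  -- stand-in for the real trace record

data Traces
    = None
    | One Trace
    | Sequence Traces Traces
    | Parallel [Traces]

flatten :: Traces -> [[Trace]]
flatten None = []
flatten (One t) = [[t]]
flatten (Sequence a b) = flatten a ++ flatten b
flatten (Parallel ts) = map concat (transpose (map flatten ts))
</code></pre>
<p>On the trace of the example above, <code>flatten</code> produces <code>[["a","d"],["b","c"]]</code>.</p>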
<h2>Conclusion</h2>
<p>Applicative build systems should use sets for their dependencies. Monadic build systems should use sets, but if they support early cut-off, should use sequences of sets.</p>
Neil Mitchellhttp://www.blogger.com/profile/13084722756124486154noreply@blogger.com0tag:blogger.com,1999:blog-7094652.post-39867716672547371692020-11-09T11:43:00.000+00:002020-11-09T11:43:52.461+00:00Turing Incomplete Languages<p><em>Summary: Some languages ban recursion to ensure programs "terminate". That's technically true, but usually irrelevant.</em></p>
<p>In my career there have been three instances where I've worked on a programming language that went through the evolution:</p>
<ol>
<li>Ban recursion and unbounded loops. Proclaim the language is "Turing incomplete" and that all programs terminate.</li>
<li>Declare that Turing incomplete programs are simpler. Have non-technical people conflate "terminates quickly" with "terminates eventually".</li>
<li>Realise lacking recursion makes things incredibly clunky to express, turning simple problems into brain teasers.</li>
<li>Add recursion.</li>
<li>Realise that everything is better.</li>
</ol>
<p>Before I left university, this process would have sounded ridiculous. In fact, even after these steps happened <em>twice</em> I was convinced it was the kind of thing that would never happen again. Now I've got three instances, it seems worth writing a blog post so for case number four I have something to refer to.</p>
<h2>A language without recursion or unbounded loops</h2>
<p>First, let's consider a small, simple statement-orientated first-order programming language. How might we write a non-terminating program? There are two easy ways. First, write a loop - <code>while (true) {}</code>. Second, write recursion, <code>void f() { f() }</code>. We can ban both of those, leaving only bounded iteration of the form <code>for x in xs { .. }</code> or similar. Now the language is Turing incomplete and all programs terminate.</p>
<p>The lack of recursion makes programs harder to write, but we can always <a href="https://learn1.open.ac.uk/mod/oublog/viewpost.php?post=162710">use an explicit stack</a> with unbounded loops.</p>
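<p>As an illustration (in Haskell, though the pattern is more usually applied in imperative languages, and <code>until</code> is standing in for a <code>while</code> loop), here is a tree sum driven by an explicit stack, with no user-written recursion:</p>
<pre><code class="language-haskell">data Tree = Leaf Int | Node Tree Tree

-- Keep a stack of subtrees still to visit, popping one per loop iteration
sumTree :: Tree -> Int
sumTree t = snd (until (null . fst) step ([t], 0))
    where
        step (Leaf n : rest, acc) = (rest, acc + n)
        step (Node l r : rest, acc) = (l : r : rest, acc)
        step ([], acc) = ([], acc)  -- unreachable: the loop stops on an empty stack
</code></pre>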
<p>The lack of unbounded loops isn't a problem provided we have an upper bound on how many steps our program might take. For example, we know <a href="https://en.wikipedia.org/wiki/Quicksort">QuickSort</a> has worst-case complexity <em>O(n^2)</em>, so if we can write <code>for x in range(0, n^2) { .. }</code> then we'll have enough steps in our program such that we never reach the bound.</p>
<p>But what if our programming language doesn't even provide a <code>range</code> function? We can synthesise it by realising that in a linear amount of code we can produce exponentially large values. As an example:</p>
<pre><code class="language-haskell">double xs = xs ++ xs -- Double the length of a list
repeated x = double (double (double (double (double (double (double (double (double (double [x])))))))))
</code></pre>
<p>The function <code>repeated 1</code> makes 10 calls to double, and creates a list of length 2^10 (1024). A mere 263 more calls to <code>double</code> and we'll have a list long enough to contain each atom in the universe. With some tweaks we can cause doubling to stop at a given bound, and generate numbers in sequence, giving us <code>range</code> to any bound we pick.</p>
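<p>One way to realise those tweaks (a sketch only - note that the bound-cutting <code>filter</code> is itself just a bounded loop over the list) is to append a shifted copy at each doubling, so the numbers come out in sequence:</p>
<pre><code class="language-haskell">grow :: [Int] -> [Int]
grow xs = xs ++ map (+ length xs) xs  -- [0..k-1] becomes [0..2*k-1]

-- Ten applications of grow to [0] give [0..1023]; filtering cuts at the bound
range :: Int -> [Int]  -- valid for bounds up to 2^10
range n = filter (< n) (grow (grow (grow (grow (grow (grow (grow (grow (grow (grow [0]))))))))))
</code></pre>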
<p>We now have a menu of three techniques that lets us write almost any program we want:</p>
<ol>
<li>We can encode recursion using an explicit stack.</li>
<li>We can change unbounded loops into loops with a conservative upper bound.</li>
<li>We can generate structures of exponential size with a linear amount of code.</li>
</ol>
<h2>The consequences</h2>
<p>Firstly, we still don't have a Turing complete language. The code will terminate. But there is no guarantee on how long it will take to terminate. Programs that take a million years to finish technically terminate, but probably can't be run on an actual computer. For most of the domains I've seen Turing incompleteness raised, a runtime of seconds would be desirable. Turing incompleteness doesn't help at all.</p>
<p>Secondly, after encoding the program in a tortured mess of logic puzzles, the code is much harder to read. While there are three general purpose techniques to encode the logic, there are usually other considerations that cause each instance to be solved differently. I've written tree traversals, sorts and parsers in such restricted languages - the result is always a lot of comments and at least one level of unnecessary indirection.</p>
<p>Finally, code written in such a complex style often performs significantly worse. Consider QuickSort - the standard implementation takes <em>O(n^2)</em> time worst case, but <em>O(n log n)</em> time average case, and <em>O(log n)</em> space (for the stack). If you take the approach of building an <em>O(n^2)</em> list before you start to encode a <code>while</code> loop, you end up with <em>O(n^2)</em> space and time. Moreover, while in normal QuickSort the time complexity is counting the number of cheap comparisons, in an encoded version the time complexity relates to allocations, which can be much more expensive as a constant factor.</p>
<h2>The solution</h2>
<p>Most languages with the standard complement of <code>if</code>/<code>for</code> etc. that are Turing incomplete do not gain any benefit from this restriction. One exception is in domains where you are proving properties or doing analysis, as in these two examples:</p>
<ol>
<li>Dependently typed languages such as <a href="https://www.idris-lang.org/">Idris</a>, which typically have much more <a href="http://docs.idris-lang.org/en/latest/tutorial/theorems.html#totality-checking">sophisticated termination checkers</a> than just banning recursion and unbounded loops.</li>
<li>Resource bounded languages such as <a href="https://www.macs.hw.ac.uk/~greg/hume/">Hume</a>, which allow better analysis and implementation techniques by restricting how expressive the language is.</li>
</ol>
<p>Such languages tend to be a rarity in industry. In all the Turing incomplete programming languages I've experienced, recursion was later added, programs were simplified, and programming in the language became easier.</p>
<p>While most languages I've worked on made this evolution in private, one language, <a href="https://daml.com">DAML</a> from <a href="https://www.digitalasset.com/">Digital Asset</a>, did so in public. <a href="https://blog.digitalasset.com/blog/introducing-the-digital-asset-modeling-language-a-powerful-alternative-to-smart-contracts-for-financial-institutions">In 2016 they wrote</a>:</p>
<blockquote>
<p>DAML was intentionally designed not to be Turing-complete. While Turing-complete languages can model any business domain, what they gain in flexibility they lose in analysability.</p>
</blockquote>
<p>Whereas <a href="https://docs.daml.com/1.6.0/daml/intro/9_Functional101.html#recursion">in 2020 their user manual says</a>:</p>
<blockquote>
<p>If there is no explicit iterator, you can use recursion. Let’s try to write a function that reverses a list, for example.</p>
</blockquote>
<p>Note that while I used to work at Digital Asset, these posts both predate and postdate my time there.</p>
Neil Mitchellhttp://www.blogger.com/profile/13084722756124486154noreply@blogger.com19tag:blogger.com,1999:blog-7094652.post-91441107429643589912020-09-22T10:16:00.000+01:002020-09-22T10:16:30.760+01:00Don't use Ghcide anymore (directly)<p><em>Summary: I recommend people use the Haskell Language Server IDE.</em></p>
<p>Just over a year ago, I recommended people looking for a Haskell IDE experience to <a href="https://ndmitchell.com/#ghcide_07_sep_2019">give Ghcide a try</a>. A few months later the Haskell IDE Engine and Ghcide teams <a href="https://neilmitchell.blogspot.com/2020/01/one-haskell-ide-to-rule-them-all.html">agreed to work together</a> on <a href="https://github.com/haskell/haskell-language-server">Haskell Language Server</a> - using <a href="https://github.com/haskell/ghcide">Ghcide as a library</a> as the core, with the plugins/installer experience from the <a href="https://github.com/haskell/haskell-ide-engine">Haskell IDE Engine</a> (by that stage we were already both using the same <a href="https://github.com/mpickering/hie-bios">Haskell setup</a> and <a href="https://github.com/alanz/haskell-lsp">LSP</a> libraries). At that time <a href="https://github.com/alanz">Alan Zimmerman</a> said to me:</p>
<blockquote>
<p>"We will have succeeded in joining forces when you (Neil) start recommending people use Haskell Language Server."</p>
</blockquote>
<p>I'm delighted to say that time has come. For the last few months I've been both using and recommending <a href="https://github.com/haskell/haskell-language-server">Haskell Language Server</a> for all Haskell IDE users. Moreover, for VS Code users, I recommend simply installing the <a href="https://marketplace.visualstudio.com/items?itemName=haskell.haskell">Haskell extension</a> which downloads the right version automatically. The experience of Haskell Language Server is better than either the Haskell IDE Engine or Ghcide individually, and is improving rapidly. The teams have merged seamlessly, and can now be regarded as a single team, producing one IDE experience.</p>
<p>There's still <a href="https://github.com/haskell/haskell-language-server/issues">lots of work to be done</a>. And for those people developing the IDE, Ghcide remains an important part of the puzzle - but it's now a developer-orientated piece rather than a user-orientated piece. Users should follow the <a href="https://github.com/haskell/haskell-language-server#readme">README at Haskell Language Server</a> and report bugs <a href="https://github.com/haskell/haskell-language-server/issues">against Haskell Language Server</a>.</p>
Neil Mitchellhttp://www.blogger.com/profile/13084722756124486154noreply@blogger.com4tag:blogger.com,1999:blog-7094652.post-54465594325107875262020-08-31T11:29:00.001+01:002020-08-31T13:26:23.547+01:00Interviewing while biased<p>Interviewing usually involves some level of subjectivity. I once struggled to decide about a candidate, and after some period of reflection, the only cause I can see is that I was biased against the candidate. That wasn't a happy realisation, but even so, it's one I think worth sharing.</p>
<p>Over the years, I've interviewed hundreds of candidates for software engineering jobs (I reckon somewhere in the 500-1000 mark). I've interviewed for many companies, for teams I was managing, for teams I worked in, and for other teams at the same company. In most places, I've been free to set the majority of the interview. I have <a href="https://neilmitchell.blogspot.com/2020/07/how-i-interview.html">a standard pattern</a>, with a standard technical question, to which I have heard a lot of answers. The quality of the answers falls into one of three categories:</p>
<ul>
<li>About 40% give excellent, quick, effortless answers. These candidates pass the technical portion.</li>
<li>About 50% are confused and make nearly no progress even with lots of hints. These candidates fail.</li>
<li>About 10% struggle a bit but get to the answer.</li>
</ul>
<p>Candidates in the final bucket are by far the hardest to make a decision on. Not answering a question effortlessly doesn't mean you aren't a good candidate - it might mean it's not something you are used to, that you had interview nerves, or any of a million other factors that go into someone's performance. It makes the process far more subjective.</p>
<p>Many years ago, I interviewed one candidate over the phone. It was their first interview with the company, so I had to decide whether we should take the step of transporting them to the office for an in-person interview, which has some level of cost associated with it. Arranging an in-person interview would also mean holding a job open for them, which would mean pausing further recruitment. The candidate had a fairly strong accent, but a perfect grasp of English. Their performance fell squarely into the final bucket.</p>
<p>For all candidates, I make a decision, and write down a paragraph or so explaining how they performed. My initial decision was to not go any further in interviewing the candidate. But after writing down the paragraph, I found it hard to justify my decision. I'd written other paragraphs that weren't too dissimilar, but had a decision to continue onwards. I wondered about changing my decision, but felt rather hesitant - I had a sneaking suspicion that this candidate "just wouldn't work out". Had I spotted something subtle I had forgotten to write down? Had their answers about their motivation given me a subconscious red-flag? I didn't know, but for the first time I can remember, decided to wait on sending my internal interview report overnight.</p>
<p>One day later, I still had a feeling of unease. But still didn't have anything to pin it on. In the absence of a reason to reject them, I decided the only fair thing to do was get them onsite for further interviews. Their onsite interviews went fine, I went on to hire them, they worked for me for over a year, and were a model employee. If I saw red-flags, they were false-flags, but more likely, I saw nothing.</p>
<p>However, I still wonder what caused me to decide "no" initially. Unfortunately, the only thing I can hypothesise is that their accent was the cause. I had previously worked alongside someone with a similar accent, who turned out to be thoroughly incompetent. I seem to have projected some aspects of that behaviour onto an entirely unrelated candidate. That's a pretty depressing realisation to make.</p>
<p>To try and reduce the chance of this situation repeating, I now write down the interview description first, and then the decision last. I also remember this story, and how my biases nearly caused me to screw up someone's career.</p>
Neil Mitchellhttp://www.blogger.com/profile/13084722756124486154noreply@blogger.com2tag:blogger.com,1999:blog-7094652.post-36725964742758843412020-07-27T14:56:00.003+01:002020-07-27T14:57:38.478+01:00Which packages does Hoogle search?<p><em>Summary: Hoogle searches packages on Stackage.</em></p><p>Haskell (as of 27 July 2020) has 14,490 packages in the <a href="https://hackage.haskell.org/">Hackage package repository</a>. <a href="https://hoogle.haskell.org/">Hoogle</a> (the Haskell API search engine) searches 2,463 packages. This post explains which packages are searched, why some packages are excluded, and thus, how you can ensure your package is searched.</p><p>The first filter is that Hoogle only searches packages on <a href="https://www.stackage.org/">Stackage</a>. Hoogle indexes any package which is either in the latest Stackage nightly or Stackage LTS, but always indexes the latest version that is on Hackage. If you want a Hoogle search that perfectly matches a given Stackage release, I recommend using the Stackage Hoogle search available from <a href="https://www.stackage.org/nightly">any snapshot page</a>. There are two reasons for restricting to only packages on Stackage:</p><ul><li>I want Hoogle results to be useful. The fact that the package currently builds with a recent GHC used by Stackage is a positive sign that the package is maintained and might actually work for a user who wants to use it. Most of the packages on Hackage probably don't build with the latest GHC, and thus aren't useful search results.</li>
<li>Indexing time and memory usage is proportional to the number of packages, and somewhat the size of those packages. By dropping over 10 thousand packages we can index more frequently and on more constrained virtual machines. With the recent release of <a href="https://hackage.haskell.org/package/hoogle">Hoogle 5.0.18</a> the technical limitations on size were removed to enable indexing all of Hackage - but there is still no intention to do so.</li>
</ul><p>There are 2,426 packages in Stackage Nightly, and 2,508 in Stackage LTS, with most overlapping. There are 2,580 distinct packages between these two sources, the <a href="https://www.haskell.org/platform/">Haskell Platform</a> and a few custom packages Hoogle knows about (e.g. GHC).</p><p>Of the 2,580 eligible packages, 77 are executables only, so don't have any libraries to search, leaving 2,503 packages.</p><p>Of the remaining 2,503 packages, 40 are missing documentation on Hackage, taking us down to 2,463. As for why a package might not have documentation:</p><ul><li>Some are missing documentation because they are very GHC internal and are mentioned but not on Hackage, e.g. ghc-heap.</li>
<li>Some are Windows only and won't generate documentation on the Linux Hackage servers, e.g. <a href="https://hackage.haskell.org/package/Win32-notify">Win32-notify</a>.</li>
<li>Some have dependencies not installed on the Hackage servers, e.g. <a href="https://hackage.haskell.org/package/rocksdb-query">rocksdb-query</a>.</li>
<li>Some have documentation that appears to have been generated without generating a corresponding Hoogle data file, e.g. <a href="https://hackage.haskell.org/package/array-memoize">array-memoize</a>.</li>
<li>Some are just missing docs entirely on Hackage for no good reason I can see, e.g. <a href="https://hackage.haskell.org/package/bytestring-builder">bytestring-builder</a>.</li>
</ul><p>The Hoogle database is generated and deployed once per day, automatically. Occasionally a test failure or dependency outage will cause generation to fail, but I get alerted, and usually it doesn't get stale by more than a few days. If you <a href="https://github.com/commercialhaskell/stackage#add-your-package">add your package to Stackage</a> and it doesn't show up on Hoogle within a few days, <a href="https://github.com/ndmitchell/hoogle/issues">raise an issue</a>.</p>Neil Mitchellhttp://www.blogger.com/profile/13084722756124486154noreply@blogger.com0tag:blogger.com,1999:blog-7094652.post-79302577297262211492020-07-15T19:17:00.000+01:002020-07-15T19:17:32.569+01:00Managing Haskell Extensions<p><em>Summary: You can divide extensions into yes, no and maybe, and then use HLint to enforce that.</em></p><p>I've worked in multiple moderately sized multinational teams of Haskell programmers. One debate that almost always comes up is which extensions to enable. It's important to have some consistency, so that everyone is using similar dialects of Haskell and can share/review code easily. The way I've solved this debate in the past is by, as a team, dividing the extensions into three categories:</p><ul><li><strong>Always-on</strong>. For example, <code>ScopedTypeVariables</code> is just how Haskell should work and should be enabled everywhere. We turn these extensions on globally <a href="https://github.com/digital-asset/ghcide/blob/cbafcf29f4157e86e0522d87bf99cb2aeff1d853/ghcide.cabal#L186-L199">in Cabal with <code>default-extensions</code></a>, and then write an <a href="https://github.com/digital-asset/ghcide/blob/cbafcf29f4157e86e0522d87bf99cb2aeff1d853/.hlint.yaml#L54-L73">HLint rule</a> to ban turning the extension on manually. In one quick stroke, a large amount of <code>{-# LANGUAGE #-}</code> boilerplate at the top of each file disappears.</li>
<li><strong>Always-off</strong>, not enabled and mostly banned. For example, <code>TransformListComp</code> steals the <code>group</code> keyword and never got much traction. These extensions can be similarly banned by HLint, or you can <a href="https://github.com/digital-asset/ghcide/blob/cbafcf29f4157e86e0522d87bf99cb2aeff1d853/.hlint.yaml#L51-L52">default unmentioned extensions to being disabled</a>. If you really need to turn on one of these extensions, you need to both turn it on in the file, <em>and</em> add an HLint exception. Such an edit can trigger wider code review, and serves as a good signal that something needs looking at carefully.</li>
<li><strong>Sometimes-on</strong>, written at the top of the file as <code>{-# LANGUAGE #-}</code> pragmas when needed. These are extensions which show up sometimes, but not that often. A typical file might have zero to three of them. The reasons for making an extension sometimes-on fall into three categories:<br />
<ul><li>The extension has a harmful compile-time impact, e.g. <code>CPP</code> or <code>TemplateHaskell</code>. It's better to only turn these extensions on if they are needed, but they are needed fairly often.</li>
<li>The extension can break some code, e.g. <code>OverloadedStrings</code>, so enabling it everywhere would cause compile failures. Generally, we work to minimize such cases, aiming to fix all code to compile with most extensions turned on.</li>
<li>The extension is used rarely within the code base and is a signal to the reader that something unusual is going on. Depending on the code base that might be things like <code>RankNTypes</code> or <code>GADTs</code>. But for certain code bases, those extensions will be very common, so it very much varies by code base.</li>
</ul></li>
</ul><p>The features that are often most debated are the syntax features - e.g. <code>BlockArguments</code> or <code>LambdaCase</code>. Most projects should either use these extensions commonly (always-on), or never (banned). They provide some syntactic convenience, but if used rarely, tend to mostly confuse things.</p><p>Using this approach, every large team I've worked on has had one initial debate to classify extensions, then every few months someone will suggest moving an extension from one pile to another. However, it's pretty much entirely silenced the issue from normal discussion thereafter, leaving us to focus on actual coding.</p>Neil Mitchellhttp://www.blogger.com/profile/13084722756124486154noreply@blogger.com0tag:blogger.com,1999:blog-7094652.post-37762855249485281632020-07-07T22:21:00.002+01:002020-07-07T22:21:54.144+01:00How I Interview<p><em>Summary: In previous companies I had a lot of freedom to design an interview. This article describes what I came up with.</em></p><p>Over the years, I've interviewed hundreds of candidates for software engineering jobs (at least 500, probably quite a bit more). I've interviewed for many companies, for teams I was setting up, for teams I was managing, for teams I worked in, and for different teams at the same company. In most places, I've been free to set the majority of the interview. This post describes why and how I designed my interview process. I'm making this post now because where I currently work has a pre-existing interview process, so I won't be following the process below anymore.</p><p>I have always run my interviews as a complete assessment of a candidate, aiming to form a complete answer. Sometimes I did that as a phone screen, and sometimes as part of a set of interviews, but I never relied on other people to cover different aspects of a candidate. 
(Well, I did once, and it went badly...)</p><p>When interviewing, there are three questions I want to answer for myself, in order of importance.</p><h2>Will they be happy here?</h2><p>If the candidate joined, would they be happy? If people aren't happy, it won't be a pleasant experience, and likely, they won't be very successful. Whether they are happy is the most important criterion because an employee who can't do the job but is happy can be trained or can use their skills for other purposes. But an employee who is unhappy will just drag the morale of the whole team down.</p><p>To figure out whether a candidate would be happy, I explain the job (including any office hours/environment/location) and discuss it in contrast to their previous experience. The best person to judge if they would be happy is the candidate themselves - so I ask that question. The tricky part is that it's an interview setting, so they have prepared saying "Yes, that sounds good" to every question. I try and alleviate that by building a rapport with the candidate first, being honest about my experiences, and trying to discuss what they like in the abstract first. If I'm not convinced they are being truthful or properly thinking it through, I ask deeper questions, for example how they like to split their day etc.</p><p>A great sign is when a candidate, during the interview, concludes <em>for themselves</em> that this job just isn't what they were looking for. I've had that happen 5 times during the actual interview, and 2 times as an email afterwards. It isn't awkward, and has saved some candidates an inappropriate job (at least 2 would have likely been offered a job otherwise).</p><p>While I'm trying to find out if the candidate will be happy, at the same time, I'm also attempting to persuade the candidate that they want to join. It's a hard balance and being open and honest is the only way I have managed it. 
Assuming I am happy where I work, I can use my enthusiasm to convince the candidate it's a great place, but also give them a sense of what I do.</p><h2>Can they do the job?</h2><p>There are two ways I used to figure out if someone can do the job. Firstly, I discuss their background, coding preferences etc. Do the things they've done in the past match the kind of things required in the job. Have they got experience with the non-technical portions of the job, or domain expertise. Most of these aspects are on their CV, so it involves talking about their CV, past projects, what worked well etc.</p><p>Secondly, I give them a short technical problem. My standard problem can be solved in under a minute in a single line of code by the best candidates. The problem is not complex, and has no trick-question or clever-approach element. The result can then be used as a springboard to talk about algorithmic efficiency, runtime implementation, parallelism, testing, verification etc. However, my experience is that candidates who struggle at the initial problem go on to struggle with any of the extensions, and candidates that do well at the initial question continue to do well on the extensions. The correlation has been so complete that over time I have started to use the extensions more for candidates who did adequately but not great on the initial problem.</p><p>My approach of an incredibly simple problem does not seem to be standard or adopted elsewhere. One reason might be that if it was used at scale, the ability to cheat would be immense (I actually have 2 backup questions for people I've interviewed previously).</p><p>Given such a simple question, there have been times when 5 candidates in a row ace the question, and I wonder if the question is just too simple. 
But then the next 5 candidates usually all struggle terribly, and I decide it still has value.</p><h2>Will I be happy with them doing the job?</h2><p>The final thing I wonder is whether I would be happy with them being a part of the team/company. The usual answer is yes. However, if the candidate displays nasty characteristics (belittling, angry, racist, sexist, lying) then it's a no. This question definitely isn't code for "culture fit" or "would I go for a beer with them", but is about specific negative traits. Generally I answer this question based on whether I see these characteristics reflected in the interactions I have with the candidate, not specific questions. I've never actually had a candidate who was successful at the above questions, and yet failed at this question. I think approximately 5-10 candidates have failed on this question.</p>Neil Mitchellhttp://www.blogger.com/profile/13084722756124486154noreply@blogger.com0tag:blogger.com,1999:blog-7094652.post-59807977363118506292020-07-05T10:49:00.000+01:002020-07-05T11:03:35.765+01:00Automatic UI's for Command Lines with cmdargs<p><em>Summary: Run <code>cmdargs-browser hlint</code> and you can fill out arguments easily.</em></p><p>The Haskell <a href="https://hackage.haskell.org/package/cmdargs">command line parsing library <code>cmdargs</code></a> contains a <a href="http://hackage.haskell.org/package/cmdargs/docs/System-Console-CmdArgs-Explicit.html#t:Mode">data type that represents a command line</a>. I always thought it would be a neat trick to transform that into a web page, to make it easier to explore command line options interactively - similar to how the custom-written <a href="http://www.martin-achern.de/wgetgui/">wget::gui</a> wraps <a href="https://www.gnu.org/software/wget/"><code>wget</code></a>.</p><p>I wrote a demo to do just that, named <a href="http://hackage.haskell.org/package/cmdargs-browser"><code>cmdargs-browser</code></a>. Given any program that uses <code>cmdargs</code> (e.g. 
<a href="https://github.com/ndmitchell/hlint"><code>hlint</code></a>), you can install <code>cmdargs-browser</code> (with <code>cabal install cmdargs-browser</code>) and run:</p><pre><code>cmdargs-browser hlint
</code></pre><p>And it will pop up:</p><p><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_LtD6aZXRCPNqGDKzoH_FDT_Ju8Wjr4zx0KE0g0D9FKDufqkP6bkAhlxFuhbaIuCnjQOdz7bUIB4jwlFSMGKU4JCmS7lNb71u2cOVMg46-s1ncJ_xiQJ82lTkqYEjjUlAESnx/s1600/cmdargs-browser.png" data-original-width="776" data-original-height="686" /></p><p>As we can see, the HLint modes are listed on the left (you can use <code>lint</code>, <code>grep</code> or <code>test</code>), the possible options on the right (e.g. normal arguments and <code>--color</code>) and the command line it produces at the bottom. As you change mode or add/remove flags, the command line updates. If you hit <code>OK</code> it then runs the program with the command line. The help is included next to the argument, and if you make a mistake (e.g. write <code>foo</code> for the <code>--color</code> flag) it tells you immediately. It could be more polished (e.g. browse buttons for file selections, better styling) but the basic concept works well.</p><h2>Technical implementation</h2><p>I wanted every <code>cmdargs</code>-using program to support this automatic UI, but also didn't want to increase the dependency footprint or compile-time overhead for <code>cmdargs</code>. I didn't want to tie <code>cmdargs</code> to this particular approach to a UI - I wanted a flexible mechanism that anyone could use for other purposes.</p><p>To that end, I built out a <a href="https://hackage.haskell.org/package/cmdargs-0.10.20/docs/System-Console-CmdArgs-Helper.html"><code>Helper</code> module</a> that is included in <code>cmdargs</code>. That API provides the full power and capabilities on which <code>cmdargs-browser</code> is written. 
The <code>Helper</code> module is only 350 lines.</p><p>If you run <code>cmdargs</code> with either <code>$CMDARGS_HELPER</code> or <code>$CMDARGS_HELPER_HLINT</code> set (in the case of HLint) then <code>cmdargs</code> will run the command line you specify, passing the explicit <code>Mode</code> data type over stdin. That <code>Mode</code> data type includes functions, and using a simplistic communication channel over stdin/stdout, the helper process can invoke those functions. As an example, when <code>cmdargs-browser</code> wants to validate the <code>--color</code> flag, it does so by calling a function in <code>Mode</code> that secretly talks back to <code>hlint</code> to validate it.</p><p>At the end, the helper program can choose to either give an error message (to stop the program, e.g. if you press Cancel), or give some command lines to use to run the program.</p><h2>Future plans</h2><p>This demo was a cool project, which may turn out to be useful for some, but I have no intention to develop it further. I think something along these lines should be universally available for all command line tools, and built into all command line parsing libraries.</p><h2>Historical context</h2><p>All the code that makes this approach work was written over seven years ago. Specifically, it was my hacking project in the hospital while <a href="https://ndmitchell.com/elements/henry-photo-big.jpg">waiting for my son to be born</a>. Having a little baby is a hectic time of life, so I never got round to telling anyone about its existence.</p><p>This weekend I resurrected the code and published an updated version to Hackage, deliberately making as few changes as possible. The three necessary changes were:</p><ol><li>jQuery deprecated the <code>live</code> function, replacing it with <code>on</code>, meaning the code didn't work.</li>
<li>I had originally put an upper bound of <code>0.4</code> for the <code>transformers</code> library. Deleting the upper bound made it work.</li>
<li>Hackage now requires that all your uploaded <code>.cabal</code> files declare that they require a version of 1.10 or above of Cabal itself, even if they don't.</li>
</ol><p>Overall, recovering a project that is over 7 years old took surprisingly little effort.</p>Neil Mitchellhttp://www.blogger.com/profile/13084722756124486154noreply@blogger.com3tag:blogger.com,1999:blog-7094652.post-26680812570158860392020-07-01T09:50:00.000+01:002020-07-01T09:50:36.523+01:00A Rust self-ownership lifetime trick (that doesn't work)<p><em>Summary: I came up with a clever trick to encode lifetimes of allocated values in Rust. It doesn't work.</em></p><p>Let's imagine we are using Rust to implement some kind of container that can allocate values, and a special value can be associated with the container. It's a bug if the allocated value gets freed while it is the special value of a container. We might hope to use lifetimes to encode that relationship:</p><pre><code class="language-rust">struct Value<'v> {...}
struct Container {...}
impl Container {
fn alloc<'v>(&'v self) -> Value<'v> {...}
fn set_special<'v>(&'v self, x: Value<'v>) {...}
}
</code></pre><p>Here we have a <code>Container</code> (which has no lifetime arguments), and a <code>Value<'v></code> (where <code>'v</code> ties it to the right container). Within our container we can implement <code>alloc</code> and <code>set_special</code>. In both cases, we take <code>&'v self</code> and then work with a <code>Value<'v></code>, which ensures that the lifetime of the <code>Container</code> and <code>Value</code> match. (We ignore details of how to implement these functions - it's possible but requires <code>unsafe</code>).</p><p>Unfortunately, the following code compiles:</p><pre><code class="language-rust">fn set_cheat<'v1, 'v2>(to: &'v1 Container, x: Value<'v2>) {
to.set_special(x);
}
</code></pre><p>The Rust compiler has taken advantage of the fact that <code>Container</code> can be reborrowed, and that <a href="https://doc.rust-lang.org/nomicon/subtyping.html"><code>Value</code> is variant</a>, and rewritten the code to:</p><pre><code class="language-rust">fn set_cheat<'v1, 'v2>(to: &'v1 Container, x: Value<'v2>) {
'v3: {
let x : Value<'v3> = x; // Value is variant, 'v2 : 'v3
let to : &'v3 Container = &*to;
to.set_special(x);
}
}
</code></pre><p>The code with lifetime annotations doesn't actually compile; it's just what the compiler did under the hood. But we can stop <code>Value</code> being variant by making it contain <code>PhantomData<Cell<&'v ()>></code>, since lifetimes under <code>Cell</code> are invariant. Now the above code no longer compiles. Unfortunately, there is a closely related variant which does compile:</p><pre><code class="language-rust">fn set_cheat_alloc<'v1, 'v2>(to: &'v1 Container, from: &'v2 Container) {
let x = from.alloc();
to.set_special(x);
}
</code></pre><p>While <code>Value</code> isn't variant, <code>&Container</code> is, so the compiler has rewritten this code as:</p><pre><code class="language-rust">fn set_cheat_alloc<'v1, 'v2>(to: &'v1 Container, from: &'v2 Container) {
'v3: {
let from : &'v3 Container = &*from;
let x : Value<'v3> = from.alloc();
let to : &'v3 Container = &*to;
to.set_special(x);
}
}
</code></pre><p>Since lifetimes on <code>&</code> are always variant, I don't think there is a trick to make this work safely. Much of the information in this post was gleaned from <a href="https://stackoverflow.com/a/62661100/160673">this StackOverflow question</a>.</p>Neil Mitchellhttp://www.blogger.com/profile/13084722756124486154noreply@blogger.com0tag:blogger.com,1999:blog-7094652.post-88190818336519864992020-06-22T13:15:00.001+01:002020-06-22T21:22:02.361+01:00The HLint Match Engine<p><em>Summary: HLint has a match engine which powers most of the rules.</em></p>
<p>The <a href="https://github.com/ndmitchell/hlint">Haskell linter HLint</a> has two forms of lint - some are built in, written in Haskell code over the GHC AST (e.g. <a href="https://github.com/ndmitchell/hlint/blob/fe9c3ad66968840fcd61a9b967ffd9a89ff2970d/src/Hint/Extensions.hs">unused extension detection</a>), but 700+ hints are written using a matching engine. As an example, we can replace <code>map f (map g xs)</code> with <code>map (f . g) xs</code>. Doing so might be more efficient, but importantly for HLint, it's often clearer. That rule <a href="https://github.com/ndmitchell/hlint/blob/fe9c3ad66968840fcd61a9b967ffd9a89ff2970d/data/hlint.yaml#L129">is defined in HLint</a> as:</p>
<pre><code class="language-yaml">- hint: {lhs: map f (map g x), rhs: map (f . g) x}
</code></pre>
<p>All single-letter variables are wildcard matches, so the above rule will match:</p>
<pre><code class="language-haskell">map isDigit (map toUpper "test")
</code></pre>
<p>And suggest:</p>
<pre><code class="language-haskell">map (isDigit . toUpper) "test"
</code></pre>
<p>However, Haskell programmers are uniquely creative in specifying functions - with a huge variety of <code>$</code> and <code>.</code> operators, infix operators etc. The <a href="https://github.com/ndmitchell/hlint/blob/fe9c3ad66968840fcd61a9b967ffd9a89ff2970d/src/GHC/Util/Unify.hs">HLint matching engine</a> in HLint v3.1.4 would match this rule to all of the following (I'm using <code>sort</code> as a convenient function; replacing it with <code>foo</code> below would not change any matches):</p>
<ul>
<li><code>map f . map g</code></li>
<li><code>sort . map f . map g . sort</code></li>
<li><code>concatMap (map f . map g)</code></li>
<li><code>map f (map (g xs) xs)</code></li>
<li><code>f `map` (g `map` xs)</code></li>
<li><code>map f $ map g xs</code></li>
<li><code>map f (map g $ xs)</code></li>
<li><code>map f (map (\x -> g x) xs)</code></li>
<li><code>Data.List.map f (Prelude.map g xs)</code></li>
<li><code>map f ((sort . map g) xs)</code></li>
</ul>
<p>That's a large variety of ways to write a nested <code>map</code>. In this post I'll explain <em>how</em> HLint matches everything above, and the bug, fixed in HLint v3.1.5, that used to cause it to match even the final line (which <em>isn't</em> a legitimate match).</p>
<p><strong>Eta-contraction</strong></p>
<p>Given a hint comprising <code>lhs</code> and <code>rhs</code>, the first thing HLint does is determine if it can eta-contract the hint, producing a version without the final argument. If it can do so for both sides, it generates a completely fresh hint. In the case of <code>map f (map g x)</code> it generates:</p>
<pre><code class="language-yaml">- hint: {lhs: map f . map g, rhs: map (f . g)}
</code></pre>
<p>For the examples above, the first three match with this eta-contracted version, and the rest match with the original form. Now that we've generated two hints, it's important that we don't perform sufficiently fuzzy matching that <em>both</em> match some expression, as that would generate twice as many warnings as appropriate.</p>
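<p>The transformation can be sketched over a toy expression type (a hypothetical simplification; the real code works over the GHC AST). A side of a hint whose final argument is a variable used nowhere else can drop that argument, introducing composition where needed:</p>
<pre><code class="language-haskell">-- Toy AST: a name or an application (hypothetical; far simpler than GHC's)
data Expr = Var String | App Expr Expr deriving (Eq, Show)

free :: Expr -> [String]
free (Var v) = [v]
free (App a b) = free a ++ free b

-- Drop a final argument that is an otherwise-unused variable:
--   a x      becomes  a
--   a (b x)  becomes  a . b
etaContract :: Expr -> Maybe Expr
etaContract (App a (Var x)) | x `notElem` free a = Just a
etaContract (App a (App b (Var x)))
  | x `notElem` (free a ++ free b) = Just (App (App (Var ".") a) b)
etaContract _ = Nothing
</code></pre>
<p>Applied to <code>map f (map g x)</code> this yields <code>map f . map g</code>, and applied to <code>map (f . g) x</code> it yields <code>map (f . g)</code>.</p>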
<p><strong>Root matching</strong></p>
<p>The next step is root matching, which happens only when trying to match at the root of the expression being matched. If we have <code>(foo . bar) x</code> then it would be reasonable for that to match <code>bar x</code>, despite the fact that <code>bar x</code> is not a subexpression. We overcome that by transforming the expression to <code>foo (bar x)</code>, unifying only on <code>bar x</code>, and recording that we need to add back <code>foo .</code> at the start of the replacement.</p>
<p><strong>Expression matching</strong></p>
<p>After splitting off any extra prefix, HLint tries to unify the single-letter variables with expressions, and build a substitution table with type <code>Maybe [(String, Expr)]</code>. The substitution is <code>Nothing</code> to denote the expressions are incompatible, or <code>Just</code> a mapping of variables to the expression they matched. If two expressions have the same structure, we descend into all child terms and match further. If they don't have the same structure, but are similar in a number of ways, we adjust the source expression and continue.</p>
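<p>The core of the unification can be sketched with a toy AST (a hypothetical simplification; HLint's real matcher works over the GHC AST and handles many more cases). Single-letter names in the pattern bind anything, other names must match exactly, and applications are matched structurally:</p>
<pre><code class="language-haskell">-- Toy AST: a name or an application (hypothetical; not HLint's real type)
data Expr = Var String | App Expr Expr deriving (Eq, Show)

-- Unify a hint pattern against a source expression
match :: Expr -> Expr -> Maybe [(String, Expr)]
match (Var v) e
  | length v == 1 = Just [(v, e)]    -- single letter: wildcard binding
match (Var v) (Var v')
  | v == v' = Just []                -- concrete names must coincide
match (App f x) (App g y) = (++) <$> match f g <*> match x y
match _ _ = Nothing                  -- different shapes: incompatible
</code></pre>
<p>Matching the pattern <code>map f (map g x)</code> against <code>map isDigit (map toUpper xs)</code> produces the substitution binding <code>f</code>, <code>g</code> and <code>x</code> to <code>isDigit</code>, <code>toUpper</code> and <code>xs</code>.</p>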
<p>Examples of adjustments include expanding out <code>$</code>, removing infix application such as <code>f `map` x</code> and ignoring redundant brackets. We translate <code>(f . g) x</code> to <code>f (g x)</code>, but <em>not</em> at the root - otherwise we might match both the eta-expanded and non-eta-expanded variants. We also re-associate <code>(.)</code> where needed, e.g. for expressions like <code>sort . map f . map g . sort</code> the bracketing means we have <code>sort . (map f . (map g . sort))</code>. We can see that <code>map f . map g</code> is not a subexpression of that expression, but given that <code>.</code> is associative, we can adjust the source.</p>
<p>When we get down to a terminal name like <code>map</code>, we use the scope information HLint knows to determine if the two <code>map</code>s are equivalent. I'm not going to talk about that too much, as it's <a href="https://github.com/ndmitchell/hlint/issues/1001">slated to be rewritten</a> in a future version of HLint, and is currently both slow and a bit approximate.</p>
<p><strong>Substitution validity</strong></p>
<p>Once we have a substitution, we see if there are any variables which map to multiple distinct expressions. If so, the substitution is invalid, and we don't match. However, in our example above, there are no duplicate variables so any matching substitution must be valid.</p>
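<p>The check itself is small: group the bindings by variable and require each group to agree on the expression. A sketch (expressions shown as strings for brevity; hypothetical, not HLint's real code):</p>
<pre><code class="language-haskell">import Data.Function (on)
import Data.List (groupBy, nub, sortOn)

-- Collapse a substitution to one binding per variable, failing if any
-- variable matched two distinct expressions
validSubst :: [(String, String)] -> Maybe [(String, String)]
validSubst sub
  | all ((== 1) . length . nub . map snd) groups = Just (map head groups)
  | otherwise = Nothing
  where groups = groupBy ((==) `on` fst) (sortOn fst sub)
</code></pre>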
<p><strong>Side conditions</strong></p>
<p>Next we check any side conditions - e.g. we could decide that the above hint only makes sense if <code>x</code> is atomic - i.e. does not need brackets in any circumstance. We could have expressed that with <code>side: isAtom x</code>, and any such conditions are checked in a fairly straightforward manner.</p>
<p><strong>Substitutions</strong></p>
<p>Finally, we substitute the variables into the provided replacement. When doing the replacement, we keep track of the free variables, and if the resulting expression has more free variables than it started with, we assume the hint doesn't apply cleanly. As an example, consider the hint <code>\x -> a <$> b x</code> to <code>fmap a . b</code>. It looks like a perfectly reasonable hint, but what if we apply it to the expression <code>\x -> f <$> g x x</code>? Now <code>b</code> matches <code>g x</code>, but we are throwing away the <code>\x</code> binding and <code>x</code> is now dangling, so we reject it.</p>
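<p>The dangling-variable check can be sketched by computing free variables over a toy AST with lambdas (a hypothetical simplification of the real GHC-AST code), rejecting any rewrite whose result mentions variables the original did not have free:</p>
<pre><code class="language-haskell">import Data.List (nub, (\\))

-- Toy AST with lambdas, so binding can be modelled (hypothetical)
data Expr = Var String | App Expr Expr | Lam String Expr deriving (Eq, Show)

free :: Expr -> [String]
free (Var x) = [x]
free (App a b) = nub (free a ++ free b)
free (Lam x e) = free e \\ [x]

-- True if the rewrite result has free variables the original lacked,
-- like the dangling x above
introducesFree :: Expr -> Expr -> Bool
introducesFree original result = not (null (free result \\ free original))
</code></pre>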
<p>When performing the substitution, we use knowledge of the AST we want, and the brackets required to parse that expression, to ensure we insert the right brackets, but not too many.</p>
<p><strong>Bug <a href="https://github.com/ndmitchell/hlint/issues/1055">#1055</a></strong></p>
<p>Hopefully all the above sounds quite reasonable. Unfortunately, at some point, the root-matching lost the check that it really was at the root, and started applying the translation to terms such as <code>sort .</code> in <code>map f ((sort . map g) xs)</code>. Having generated the <code>sort .</code>, it decided that since it wasn't at the root, there was nowhere for it to go, so promptly threw it away. Oops. HLint v3.1.5 fixes the bug in two distinct ways (for defence in depth):</p>
<ol>
<li>It checks the root boolean before doing the root matching rule.</li>
<li>If it would have to throw away any extra expression, it fails, as throwing away that expression is certain to lead to a correctness bug.</li>
</ol>
<p><strong>Conclusion</strong></p>
<p>The matching engine of HLint is relatively complex, but I always assumed it would one day be replaced with a finite-state-machine scanner that could match <em>n</em> hints against an expression in <em>O(size-of-expression)</em>, rather than the current <em>O(n * size-of-expression)</em>. However, it's never been the bottleneck, so I've stuck with the more direct version.</p>
<p>I'm glad HLint has a simple external lint format. It allows easy contributions and makes hint authoring accessible to everyone. For large projects it's easy to define your own hints to capture common coding patterns. When using languages whose linter does not have an external matching language (e.g. <a href="https://github.com/rust-lang/rust-clippy">Rust's Clippy</a>) I certainly miss the easy customization.</p>
Neil Mitchellhttp://www.blogger.com/profile/13084722756124486154noreply@blogger.com0tag:blogger.com,1999:blog-7094652.post-14199635172509115242020-06-09T12:31:00.000+01:002020-06-09T13:57:30.954+01:00Hoogle Searching Overview<p><em>Summary: Hoogle 5 has three interesting parts, a pipeline, database and search algorithm.</em></p><p>The Haskell search engine <a href="https://hoogle.haskell.org">Hoogle</a> has gone through five major designs, the first four of which are described in <a href="https://ndmitchell.com/downloads/slides-hoogle_finding_functions_from_types-16_may_2011.pdf">these slides from TFP 2011</a>. <a href="https://neilmitchell.blogspot.com/2015/01/hoogle-5-is-coming.html">Hoogle version 5</a> was designed to be a complete rewrite which simplified the design and allowed it to scale to all of <a href="https://hackage.haskell.org">Hackage</a>. All versions of Hoogle have had some preprocessing step which consumes Haskell definitions, and writes out a data file. They then have the search phase which uses that data file to perform searches. In this post I'll go through three parts -- what the data file looks like, how we generate it, and how we search it. When we consider these three parts, the evolution of Hoogle can be seen as:</p><ul><li>Versions 1-3, produce fairly simple data files, then do an expensive search on top. Fails to scale to large sizes.</li>
<li>Version 4, produce a very elaborate data file, aiming to search quickly on top. Failed because producing the data file required a lot of babysitting and a long time, so it was updated very rarely (yearly). Also, searching a complex data file ends up with a lot of corner cases which have terrible complexity (e.g. <code>a -> a -> a -> a -> a</code> would kill the server).</li>
<li>Version 5, generate very simple data files, then do <em>O(n)</em> but small-constant multiplier searching on top. Update the files daily and automatically. Make search time very consistent.</li>
</ul><h2>Version 5 data file</h2><p>By version 5 I had realised that deserialising the data file was both time consuming and memory hungry. Therefore, in version 5, the data file consists of chunks of data that can be memory-mapped into <code>Vector</code> and <code>ByteString</code> chunks using a <code>ForeignPtr</code> underlying storage. The OS figures out which bits of the data file should be paged in, and there is very little overhead or complexity on the Haskell side. There is a small index structure at the start of the data file which says where these interesting data structures live, and gives them identity using types. For example, to store information about name search we have three definitions:</p><pre><code class="language-haskell">data NamesSize a where NamesSize :: NamesSize Int
data NamesItems a where NamesItems :: NamesItems (V.Vector TargetId)
data NamesText a where NamesText :: NamesText BS.ByteString
</code></pre><p>Namely, in the data file we have <code>NamesSize</code> which is an <code>Int</code>, <code>NamesItems</code> which is a <code>Vector TargetId</code>, and <code>NamesText</code> which is a <code>ByteString</code>. The <code>NamesSize</code> is the maximum number of results that can be returned from any non-empty search (used to reduce memory allocation for the result structure), the <code>NamesText</code> is a big string with <code>\0</code> separators between each entry, and the <code>NamesItems</code> are the identifiers of the result for each name, with as many entries as there are <code>\0</code> separators.</p><p>The current data file is 183Mb for all of Stackage, of which 78% of that is the information about items (documentation, enough information to render them, where the links go etc - we then GZip this information). There are 21 distinct storage types, most involved with type search.</p><h2>Generating the data file</h2><p>Generating the data file is done in four phases.</p><p>Phase 0 downloads the inputs, primarily a <code>.tar.gz</code> file containing all <code>.cabal</code> files, and another containing all the Haddock Hoogle outputs. These <code>.tar.gz</code> files are never unpacked, but streamed through and analysed <a href="https://neilmitchell.blogspot.com/2015/07/thoughts-on-conduits.html">using conduit</a>.</p><p>Phase 1 reads through all the <code>.cabal</code> files to get metadata about each package - the author, tags, whether it's in Stackage etc. It stores this information in a <code>Map</code>. This phase takes about 7s and uses 100Mb of memory.</p><p>Phase 2 reads through every definition in every Haddock Hoogle output (the <code>.txt</code> files <code>--hoogle</code> generates). It loads the entry, parses it, processes it, and writes most of the data to the data file, assigning it a <code>TargetId</code>. 
That <code>TargetId</code> is the position of the item in the data file, so it's unique, and can be used to grab the relevant item when we need to display it while searching. During this time we collect the unique deduplicated type signatures and names, along with the <code>TargetId</code> values. This phase takes about 1m45s and has about 900Mb of memory at the end. The most important part of phase 2 is <a href="https://neilmitchell.blogspot.com/2015/09/three-space-leaks.html">not to introduce a space leak</a>, since then memory soars to many Gb.</p><p>Phase 3 processes the name and type maps and writes out the information used for searching. This phase takes about 20s and consumes an additional 250Mb over the previous phase.</p><p>Since generating the data file takes only a few minutes, there is a nightly job that updates the data file at 8pm every night. The job takes about 15 minutes in total, because it checks out a new version of <a href="https://github.com/ndmitchell/hoogle">Hoogle from GitHub</a>, builds it, downloads all the data files, generates a data file, runs the tests, and then restarts the servers.</p><h2>Searching</h2><p>Hoogle version 5 works on the principle that it's OK to be <em>O(n)</em> if the constant is small. For textual search, we have a big flat <code>ByteString</code>, and give that to some C code that quickly looks for the substring we enter, favouring complete and case-matching matches. Such a loop is super simple, and at the size of data we are working with (about 10Mb), plenty fast enough.</p><p>Type search is inspired by the same principle. We deduplicate types, then for each type, we produce an 18 byte fingerprint. There are about 150K distinct type signatures in Stackage, so that results in about 2.5Mb of fingerprints. For every type search we scan all those fingerprints and figure out the top 100 matches, then do a more expensive search on the full type for those top 100, producing a ranking. 
For a long time (a few years) I hadn't even bothered doing the second phase of more precise matching, and it still gave reasonable results. (In fact, I <em>never</em> implemented the second phase, but happily <a href="https://github.com/matt-noonan">Matt Noonan</a> <a href="https://github.com/ndmitchell/hoogle/commits?author=matt-noonan">contributed it</a>.)</p><p>A type fingerprint is made up of three parts:</p><ul><li>1 byte being the arity of the function. <code>a -> b -> c</code> would have arity 3.</li>
<li>1 byte being the number of constructors/variables in the type signature. <code>Maybe a -> a</code> would have a value of 3.</li>
<li>The three rarest names in the function. E.g. <code>A -> B -> C -> D</code> would compare how frequent each of <code>A</code>, <code>B</code>, <code>C</code> and <code>D</code> were in the index of functions, and record the 3 rarest. Each name is given a 32 bit value (where 0 is the most common and 2^32-1 is the rarest).</li>
</ul><p>The idea of arity and number of constructors/variables is to try and get an approximate shape fit to the type being search for. The idea of the rarest names is an attempt to take advantage that if you are searching for <code>ShakeOptions -> [a] -> [a]</code> then you probably didn't write <code>ShakeOptions</code> by accident -- it provides a lot of signal. Therefore, filtering down to functions that mention <code>ShakeOptions</code> probably gives a good starting point.</p><p>Once we have the top 100 matches, we can then start considering whether type classes are satisfied, whether type aliases can be expanded, what the shape of the actual function is etc. By operating on a small and bounded number of types we can do much more expensive comparisons than if we had to apply them to every possible candidate.</p><h2>Conclusion</h2><p>Hoogle 5 is far from perfect, but the performance is good, the scale can keep up with the growth of Haskell packages, and the simplicity has kept maintenance low. The technique of operations which are <em>O(n)</em> but with a small constant is one I've applied in other projects since, and I think is an approach often overlooked.</p>Neil Mitchellhttp://www.blogger.com/profile/13084722756124486154noreply@blogger.com2tag:blogger.com,1999:blog-7094652.post-27473102822634539812020-06-07T12:03:00.000+01:002020-06-07T12:03:15.191+01:00Surprising IO: How I got a benchmark wrong<p><em>Summary: IO evaluation caught me off guard when trying to write some benchmarks.</em></p><p>I once needed to know a quick back-of-the-envelope timing of a pure operation, so hacked something up quickly rather than going via <a href="https://hackage.haskell.org/package/criterion"><code>criterion</code></a>. The code I wrote was:</p><pre><code class="language-haskell">main = do
(t, _) <- duration $ replicateM_ 100 $ action myInput
print $ t / 100
{-# NOINLINE action #-}
action x = do
evaluate $ myOperation x
return ()
</code></pre><p>Reading from top to bottom, it takes the time of running <code>action</code> 100 times and prints it out. I deliberately engineered the code so that GHC couldn't optimise it such that <code>myOperation</code> was run only once. As examples of the defensive steps I took:</p><ul><li>The <code>action</code> function is marked <code>NOINLINE</code>. If <code>action</code> was inlined then <code>myOperation x</code> could be floated up and only run once.</li>
<li>The <code>myInput</code> is given as an argument to <code>action</code>, ensuring it can't be applied to <code>myOperation</code> at compile time.</li>
<li>The <code>action</code> is in <code>IO</code> so it has to be rerun each time.</li>
</ul><p>Alas, GHC still had one trick up its sleeve, and it wasn't even an optimisation - merely the definition of evaluation. The <code>replicateM_</code> function takes <code>action myInput</code>, which is evaluated once to produce a value of type <code>IO ()</code>, and then runs that <code>IO ()</code> 100 times. Unfortunately, in my benchmark <code>myOperation x</code> is actually evaluated in the process of creating the <code>IO ()</code>, not when running the <code>IO ()</code>. The fix was simple:</p><pre><code class="language-haskell">action x = do
_ <- return ()
evaluate $ myOperation x
return ()
</code></pre><p>Which roughly desugars to:</p><pre><code class="language-haskell">return () >>= \_ -> evaluate (myOperation x)
</code></pre><p>Now the <code>IO</code> produced has a lambda inside it, and my benchmark runs 100 times. However, at <code>-O2</code> GHC used to manage to break this code once more, by lifting <code>myOperation x</code> out of the lambda, producing:</p><pre><code class="language-haskell">let y = myOperation x in return () >>= \_ -> evaluate y
</code></pre><p>Now <code>myOperation</code> runs just once again. I finally managed to defeat GHC by lifting the input into <code>IO</code>, giving:</p><pre><code>action x = do
evaluate . myOperation =<< x
return ()
</code></pre><p>Now the input <code>x</code> is itself in <code>IO</code>, so <code>myOperation</code> can't be hoisted.</p><p>I originally wrote this post a very long time ago, and back then GHC did lift <code>myOperation</code> out from below the lambda. But nowadays it doesn't seem to do so (quite possibly because doing so might cause a space leak). However, there's nothing that promises GHC won't learn this trick again in the future.</p>Neil Mitchellhttp://www.blogger.com/profile/13084722756124486154noreply@blogger.com0tag:blogger.com,1999:blog-7094652.post-66336347868584590702020-05-31T01:15:00.000+01:002020-05-31T01:15:56.744+01:00HLint --cross was accidentally quadratic<p><em>Summary: HLint <code>--cross</code> was accidentally quadratic in the number of files.</em></p>
<p>One of my favourite blogs is <a href="https://accidentallyquadratic.tumblr.com/">Accidentally Quadratic</a>, so when the <a href="https://github.com/ndmitchell/hlint">Haskell linter HLint</a> suffered such a fate, I felt duty-bound to write it up. Most HLint hints work module-at-a-time (or smaller scopes), but there is one hint that can process multiple modules simultaneously - the duplication hint. If you write a sufficiently large repeated fragment in two modules, and pass <code>--cross</code>, then this hint will detect the duplication. The actual application of hints in HLint is governed by:</p>
<pre><code class="language-haskell">applyHintsReal :: [Setting] -> Hint -> [ModuleEx] -> [Idea]
</code></pre>
<p>Given a list of settings, a list of hints (which gets merged to a single composite <code>Hint</code>) and a list of modules, produce a list of ideas to suggest. Usually this function is called in parallel with a single module at a time, but when <code>--cross</code> is passed, all the modules being analysed get given in one go.</p>
<p>In <a href="https://neilmitchell.blogspot.com/2020/05/hlint-30.html">HLint 3</a>, <code>applyHintsReal</code> became quadratic in the number of modules. When you have 1 module, 1^2 = 1, and everything works fine, but <code>--cross</code> suffers a lot. The bug was simple. Given a Haskell list comprehension:</p>
<pre><code class="language-haskell">[(a,b) | a <- xs, b <- xs]
</code></pre>
<p>When given the list <code>xs</code> of <code>[1,2]</code> you get back the pairs <code>[(1,1),(1,2),(2,1),(2,2)]</code> - the cross product, which is quadratic in the size of <code>xs</code>. The real HLint code didn't look much different:</p>
<pre><code class="language-haskell">[ generateHints m m'
| m <- ms
, ...
, (nm',m') <- mns'
, ...
]
where
    mns' = map (\x -> (scopeCreate (GHC.unLoc $ ghcModule x), x)) ms
</code></pre>
<p>We map over <code>ms</code> to create <code>mns'</code>, pairing each module with some extra information. In the list comprehension we loop over the modules <code>ms</code> to get <code>m</code>, then for each <code>m</code>, loop over <code>mns'</code> to get <code>m'</code>. That means you take the cross-product of the modules, which is quadratic.</p>
<p><strong>How did this bug come about?</strong> HLint used to work against <a href="https://hackage.haskell.org/package/haskell-src-exts"><code>haskell-src-exts</code> (HSE)</a>, but now works against the <a href="https://github.com/digital-asset/ghc-lib">GHC parser</a>. <a href="https://neilmitchell.blogspot.com/2019/06/hlints-path-to-ghc-parser.html">We migrated the hints one by one</a>, changing HLint to thread through both ASTs, and then each hint could pick which AST to use. The patch that <a href="https://github.com/ndmitchell/hlint/commit/0948430ef9b65097d3f1d05fdc66616e22e3e0c6">introduced this behaviour</a> left <code>ms</code> as the HSE AST, and made <code>mns'</code> the GHC AST. It should have zipped these two together, so for each module you have the HSE and GHC AST, but accidentally took the cross-product.</p>
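<p>The shape of the bug, and the intended fix, can be sketched with a tiny self-contained example (plain integers stand in for the modules, and <code>map negate</code> stands in for attaching the extra information - the names are invented):</p>
<pre><code class="language-haskell">-- Buggy shape: two generators over the same list give |ms|^2 pairs,
-- the accidental cross-product.
pairsCross :: [Int] -> [(Int, Int)]
pairsCross ms = [ (m, m') | m <- ms, m' <- map negate ms ]

-- Intended shape: zip the original list with the annotated one, giving
-- |ms| pairs - each module with its own annotation, at linear cost.
pairsZip :: [Int] -> [(Int, Int)]
pairsZip ms = [ (m, m') | (m, m') <- zip ms (map negate ms) ]
</code></pre>
<p>With a single module both shapes produce one pair, which is why the bug was invisible in the common case.</p>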
<p><strong>How did we spot it?</strong> <a href="https://k1024.org/">Iustin Pop</a> filed <a href="https://github.com/ndmitchell/hlint/issues/1018">a bug report</a> noting that each hint was repeated once per file being checked and performance had got significantly worse, hypothesising it was <em>O(n^2)</em>. Iustin was right!</p>
<p><strong>How did we fix it?</strong> By the time the bug was spotted, the HSE AST had been removed entirely, and both <code>m</code> and <code>m'</code> were the same type, so <a href="https://github.com/ndmitchell/hlint/commit/31665ab581eafd6792b1b229d75bea493e17780f">deleting one of the loops</a> was easy. The fix is out in HLint version 3.1.4.</p>
<p><strong>Should I be using <code>--cross</code>?</strong> If you haven't heard about <code>--cross</code> in HLint, I don't necessarily suggest you start experimenting with it. The duplicate detection hints are <a href="https://github.com/ndmitchell/hlint/issues/1009#issuecomment-630103050">pretty dubious</a> and I think most people would be better served by a real duplicate detection tool. I've had good experiences with <a href="https://www.harukizaemon.com/simian/">Simian</a> in the past.</p>
Neil Mitchellhttp://www.blogger.com/profile/13084722756124486154noreply@blogger.com1tag:blogger.com,1999:blog-7094652.post-78939983312937017822020-05-27T21:42:00.000+01:002020-05-27T21:42:11.830+01:00Fixing Space Leaks in Ghcide<p><em>Summary: A performance investigation uncovered a memory leak in unordered-containers and performance issues with Ghcide.</em></p><p>Over the bank holiday weekend, I decided to devote some time to a possible <a href="https://shakebuild.com/">Shake build system</a> performance issue in <a href="https://github.com/digital-asset/ghcide">Ghcide Haskell IDE</a>. As I started investigating (and mostly failed) I discovered a space leak which I eventually figured out, solved, and then (as a happy little accident) got a performance improvement anyway. This post is a tale of what I saw, how I tackled the problem, and how I went forward. As I'm writing the post, not all the threads have concluded. I wrote lots of code during the weekend, but most was only to experiment and has been thrown away - I've mostly left the code to the links. Hopefully the chaotic nature of development shines through.</p><p><strong>Shake thread-pool performance</strong></p><p>I started with <a href="https://github.com/ndmitchell/shake/pull/751">a Shake PR</a> claiming that simplifying the Shake thread pool could result in a performance improvement. Faster and simpler seems like a dream combination. Taking a closer look, simpler seemed like it was simpler because it supported fewer features (e.g. ability to kill all threads when one has an exception, some fairness/scheduling properties). But some of those features (e.g. better scheduling) were in the pursuit of speed, so if a simpler scheduler was 30% faster (the cost of losing randomised scheduling), that might not matter.</p><p>The first step was to <a href="https://github.com/ndmitchell/shake/pull/751#issuecomment-632634439">write a benchmark</a>. 
It's very hard to synthesise a benchmark that measures the right thing, but spawning 200K short tasks into the thread pool seemed a plausible start. As promised on the PR, the simpler version did indeed run faster. But interestingly, the simplifications weren't really responsible for the speed difference - switching from <code>forkIO</code> to <code>forkOn</code> explained nearly all the difference. I'm not that familiar with <code>forkOn</code>, so I decided to micro-benchmark it - how long does it take to spawn off 1M threads with the two methods. I found two surprising results:</p><ul><li>The performance of <code>forkOn</code> was quadratic! A <a href="https://gitlab.haskell.org/ghc/ghc/issues/18221">GHC bug</a> explains why - it doesn't look too hard to fix, but relying on <code>forkOn</code> is unusual, so it's unclear if the fix is worth it.</li>
<li>The performance of <code>forkIO</code> was highly inconsistent. Often it took in the region of 1 second. Sometimes it was massively faster, around 0.1s. A <a href="https://stackoverflow.com/questions/61971292/ghc-forkio-bimodal-performance">StackOverflow question</a> didn't shed much light on <em>why</em>, but did show that by using the <a href="https://hackage.haskell.org/package/pvar/docs/Data-Primitive-PVar.html#t:PVar"><code>PVar</code> concurrency primitive</a> it could be 10x faster. There is a <a href="https://gitlab.haskell.org/ghc/ghc/issues/18224">GHC bug tracking the issue</a>, and it seems as though the thread gets created, then immediately switches away. There is a suggestion from Simon Peyton Jones of a heuristic that might help, but the issue remains unsolved.</li>
</ul><p>My desire to switch the Shake thread-pool to a primitive that is both quadratic and explicitly discouraged is low. Trying to microbenchmark with primitives that have inconsistent performance is no fun. The hint towards <code>PVar</code> is super interesting, and I may follow up on it in future, but given the remarks in the GHC tickets I wonder if <code>PVar</code> is merely avoiding one small allocation, and avoiding an allocation avoids a context switch, so it's not a real signal.</p><p>At this point I decided to zoom out and try benchmarking all of Ghcide.</p><p><strong>Benchmarking Ghcide</strong></p><p>The thread about the Shake thread pool pointed at <a href="https://github.com/digital-asset/ghcide/issues/503">a benchmarking approach</a> of making hover requests. I concluded that making a hover request with no file changes would benchmark the part of Shake I thought the improved thread-pool was most likely to benefit. I used the Shake source code as a test bed, and opened a file with 100 transitive imports, then did a hover over the <code>listToMaybe</code> function. I know that will require Shake validating that everything is up to date, and then doing a little bit of hover computation.</p><p>I knew I was going to be running Ghcide a lot, and the Cabal/Stack <code>build</code> steps are frustratingly slow. In particular, every time around, Stack wanted to unregister the Ghcide package. Therefore, I wrote a simple <code>.bat</code> file that <a href="https://gist.github.com/ndmitchell/11467985dbf1855e62035fa97248a585#file-test-bat">compiled Ghcide and my benchmark</a> using <code>ghc --make</code>. So that I could experiment quickly with changes to Shake, I pulled in all of Shake as source, not as a separate library, with an include path. 
I have run that benchmark hundreds of times, so the fact that it is both simple (no arguments) and as fast as I could make it has easily paid off.</p><p>For the benchmark itself, I first went down the route of looking at the <a href="https://hackage.haskell.org/package/lsp-test/docs/Language-Haskell-LSP-Test-Replay.html#v:replaySession">replay functionality</a> in <a href="https://hackage.haskell.org/package/lsp-test">lsp-test</a>. Sadly, that code doesn't link to anything that explains how to <em>generate</em> traces. After asking on the <a href="https://webchat.freenode.net/?channels=haskell-ide-engine">haskell-ide-engine IRC</a> I got pointed at both the existing functionality of <a href="https://hackage.haskell.org/package/haskell-lsp-0.22.0.0/docs/Language-Haskell-LSP-Core.html#v:resCaptureFile"><code>resCaptureFile</code></a> and the <a href="https://github.com/alanz/haskell-lsp/pull/247/files">vastly improved version in a PR</a>, which doesn't fail if two messages race with each other. Configuring that and running it on my benchmark in the IDE told me that the number of messages involved was tiny - pretty much an initialisation and then a bunch of hovers. Coding those directly in <code>lsp-test</code> was trivial, and so <a href="https://gist.github.com/ndmitchell/11467985dbf1855e62035fa97248a585#file-benchmark-hs">I wrote a benchmark</a>. The essence was:</p><pre><code class="language-haskell">doc <- openDoc "src/Test.hs" "haskell"
(t, _) <- duration $ replicateM_ 100 $
    getHover doc $ Position 127 43
print t
</code></pre><p>Open a document. Send 100 hover requests. Print the time taken.</p><p><strong>Profiling Ghcide</strong></p><p>Now that I could run 100 hovers, I wanted to use the GHC profiling mechanisms. Importantly, the 100 hover requests dominate the loading by a huge margin, so profiles would focus on the right thing. I ran a profile, but it was empty. Turns out the way <code>lsp-test</code> invokes the binary it is testing means it kills it too aggressively to allow GHC to write out profiling information. I changed the benchmark to send a shutdown request at the end, then sleep, and changed Ghcide to abort on a shutdown, so it could write the profiling information.</p><p>Once I had the profiling information, I was thoroughly uninformed. 10% went in file modification checking, which <a href="https://github.com/digital-asset/ghcide/issues/583">could be eliminated</a>. 10% seemed to go to hash table manipulations, which seemed on the high side, but not too significant (turned out I was totally wrong, read to the end!). Maybe 40% went in the Shake monad, but profiling exaggerates those costs significantly, so it's unclear what the truth is. Nothing else stood out, but earlier testing when profiling <code>forkIO</code> operations had shown they weren't counted well, so that didn't mean much.</p><p><strong>Prodding Ghcide</strong></p><p>In the absence of profiling data, I started changing things and measuring the performance. I tried a bunch of things that made no difference, but some things did have an impact on the time to do 100 hovers:</p><ul><li>Running normally: 9.77s. The baseline.</li>
<li>Switching to <code>forkOn</code>: 10.65s. Suggestive that either Ghcide has changed, or the project is different, or platform differences mean that <code>forkOn</code> isn't as advantageous.</li>
<li>Using only one Shake thread: 13.65s. This change had been suggested in one ticket, but made my benchmark worse.</li>
<li>Avoid spawning threads for things I think will be cheap: 7.49s. A useful trick, and maybe one that will be of benefit in future, but for such a significant change a 25% performance reduction seemed poor.</li>
<li>Avoid doing any Shake invalidation: 0.31s. An absolute lower bound if Shake cheats and does nothing.</li>
</ul><p>With all that, I was a bit dejected - "performance investigation reveals nothing of note" was not a great conclusion from a day's work. I think that other changes to Ghcide to <a href="https://github.com/digital-asset/ghcide/pull/554">run Shake less</a> and <a href="https://github.com/wz1000/ghcide/tree/hiedb">cache data more</a> will probably make this benchmark even less important, so the conclusion worsens - performance investigation of nothing of note reveals nothing of note. How sad.</p><p>But in my benchmark I did notice something - a steadily increasing memory size in process explorer. Such issues are pretty serious in an interactive program, and <a href="https://github.com/digital-asset/ghcide/pull/557">we'd fixed several issues recently</a>, but clearly there were more. Time to change gears.</p><p><strong>Space leak detection</strong></p><p>Using the benchmark I observed a space leak. But the program is huge, and manual code inspection usually needs a 10 line code fragment to have a chance. So I started modifying the program to do less, and continued until the program did as little as it could, but still leaked space. After I fixed a space leak, I zoomed out and checked whether the space leak persisted, and then had another go.</p><p>The first investigation took me into the Shake Database module. I found that if I ran the Shake script to make everything up to date, but did no actions inside, then there was a space leak. Gradually commenting out lines (over the course of several hours) eventually took me to:</p><pre><code class="language-haskell">step <- pure $ case v of
    Just (_, Loaded r) -> incStep $ fromStepResult r
    _ -> Step 1
</code></pre><p>This code increments a step counter on each run. In normal Shake this counter is written to disk each time, which forces the value. In Ghcide we use Shake in memory, and nothing ever forced the counter. The change was simple - replace <code>pure</code> with <code>evaluate</code>. This fix has been <a href="https://github.com/ndmitchell/shake/commit/8da74bab4a2466b52f8ddc50b75a56139eecb273">applied to Shake HEAD</a>.</p><p><strong>Space leak detection 2</strong></p><p>The next space leak took me to the Shake database <code>reset</code> function, which moves all Shake keys from <code>Ready</code> to <code>Loaded</code> when a new run starts. I determined that if you didn't run this function, the leak went away. I found a few places I should have <a href="https://github.com/ndmitchell/shake/commit/04b0fb349a5e8ff84c073f9751bcef11b3928570">put strictness annotations</a>, and a function that <a href="https://github.com/ndmitchell/shake/commit/ddf5e2d2020decc44f08c2d5482b8941c5c6d816">mutated an array lazily</a>. I reran the code, but the problem persisted. I eventually realised that if you don't call <code>reset</code> then none of the user rules run either, which was really what was fixing the problem - but I committed the improvements I'd made even though they don't fix any space leaks.</p><p>By this point I was moderately convinced that Shake wasn't to blame, so turned my attention to the user rules in Ghcide. I stubbed them out, and the leak went away, so that looked plausible. There were 8 types of rules that did meaningful work during the hover operation (things like <code>GetModificationTime</code>, <code>DoesFileExist</code>, <code>FilesOfInterest</code>). 
I picked a few in turn, and found they all leaked memory, so picked the simple <code>DoesFileExist</code> and looked at what it did.</p><p>For running <code>DoesFileExist</code> I wrote a very quick "bailout" version of the rule, equivalent to the "doing nothing" case, then progressively enabled more bits of the rule before bailing out, to see what caused the leak. The bailout looked like:</p><pre><code class="language-haskell">Just v <- getValues state key file
let bailout = Just $ RunResult ChangedNothing old $ A v
</code></pre><p>I progressively enabled more and more of the rule, but even with the whole rule enabled, the leak didn't recur. At that point, I realised I'd introduced a syntax error and that all my measurements for the last hour had been using a stale binary. Oops. I spun up a copy of <a href="https://github.com/ndmitchell/ghcid">Ghcid</a>, so I could see syntax errors more easily, and repeated the measurements. Again, the leak didn't recur. Very frustrating.</p><p>At that point I had two pieces of code, one which leaked and one which didn't, and the <em>only</em> difference was the unused <code>bailout</code> value I'd been keeping at the top to make it easier to quickly give up half-way through the function. Strange though it seemed, the inescapable conclusion was that <code>getValues</code> must somehow be fixing the space leak.</p><p>If <code>getValues</code> fixes a leak, it is a likely guess that <code>setValues</code> is causing the leak. I modified <code>setValues</code> to also call <code>getValues</code> and the problem went away. But, after hours of staring, I couldn't figure out why. The code of <code>setValues</code> read:</p><pre><code class="language-haskell">setValues state key file val = modifyVar_ state $ \vals -> do
    evaluate $ HMap.insert (file, Key key) (fmap toDyn val) vals
</code></pre><p>Namely, modify a strict <code>HashMap</code> from <a href="https://hackage.haskell.org/package/unordered-containers"><code>unordered-containers</code></a>, forcing the result. After much trial and error I determined that a "fix" was to add:</p><pre><code class="language-haskell">case HMap.lookup k res of
    Nothing -> pure ()
    Just v -> void $ evaluate v
</code></pre><p>It's necessary to insert into the strict <code>HashMap</code>, then do a <code>lookup</code>, then evaluate the result that comes back, or there is a space leak. I duly <a href="https://github.com/digital-asset/ghcide/pull/586">raised a PR to Ghcide</a> with the unsatisfying comment:</p><blockquote><p>I'm completely lost, but I do have a fix.</p></blockquote><p>It's nice to fix bugs. It's better to have some clue why a fix works.</p><p><strong>Space leak in <code>HashMap</code></strong></p><p>My only conclusion was that <code>HashMap</code> must have a space leak. I took a brief look at the code, but it was 20+ lines and nothing stood out. I wrote a benchmark that inserted billions of values at 1000 random keys, but it didn't leak space. I puzzled it over in my brain, and then about a day later inspiration struck. One of the cases was to deal with collisions in the <code>HashMap</code>. Most <code>HashMap</code>s don't have any collisions, so a bug hiding there could survive a very long time. I wrote a benchmark with colliding keys, and lo and behold, it leaked space. Concretely, it leaked 1Gb/s, and brought my machine to its knees. The benchmark inserted three keys all with the same hash, then modified one key repeatedly. I posted the <a href="https://github.com/tibbe/unordered-containers/issues/254">bug to the <code>unordered-containers</code> library</a>.</p><p>I also looked at the code, figured out why the space leak was occurring, and came up with a potential fix. However, the fix requires duplicating some code, and it's likely the same bug exists in several other code paths too. The <code>Lazy</code> vs <code>Strict</code> approach of <code>HashMap</code> being dealt with as an outer layer doesn't quite work for the functions in question. I took a look at the PR queue for <code>unordered-containers</code> and saw 29 requests, with the recent few having no comments on them. 
That's a bad sign, and suggested that spending time preparing a PR might be in vain, so I didn't.</p><p>Aside: Maintainers get busy. That's no negative reflection on the people who have invested lots of time in this library, and I thank them for their effort! Given that <a href="https://packdeps.haskellers.com/reverse/unordered-containers">1,489 packages on Hackage</a> depend on it, I think it could benefit from additional bandwidth from someone.</p><p><strong>Hash collisions in Ghcide</strong></p><p>While hash collisions that lead to space leaks are bad, having hash collisions at all is also bad. I augmented the code in Ghcide to print out hash collisions, and saw collisions between <code>("Path.hs", Key GetModificationTime)</code> and <code>("Path.hs", Key DoesFileExist)</code>. Prodding a bit further I saw that the <code>Hashable</code> instance for <code>Key</code> only consulted its argument value, and given most key types are simple <code>data Foo = Foo</code> constructions, they all had the same hash. The solution was to mix in the type information stored by <code>Key</code>. I changed to the definition:</p><pre><code class="language-haskell">hashWithSalt salt (Key key) = hashWithSalt salt (typeOf key) `xor` hashWithSalt salt key
</code></pre><p>Unfortunately, that now gave hash collisions with different paths at the same key. I looked into the hashing for the path part (which is really an <code>lsp-haskell-types</code> <code>NormalizedFilePath</code>) and saw that it used an optimised hashing scheme, precomputing the hash, and returning it with <code>hash</code>. I also looked at the <code>hashable</code> library and realised the authors of <code>lsp-haskell-types</code> hadn't implemented <code>hashWithSalt</code>. If you don't do that, a generic instance is constructed which deeply walks the data structure, completely defeating the <code>hash</code> optimisation. A <a href="https://github.com/alanz/haskell-lsp/pull/248">quick PR fixes that</a>.</p><p>I also found that for tuples, the types are combined by using the <code>salt</code> argument. Therefore, to hash the pair of path information and <code>Key</code>, the <code>Key</code> <code>hashWithSalt</code> gets called with the <code>hash</code> of the path as its salt. However, looking at the definition above, you can imagine that both <code>hashWithSalt</code> of a type and <code>hashWithSalt</code> of a key expand to something like:</p><pre><code class="language-haskell">hashWithSalt salt (Key key) = salt `xor` hash (typeOf key) `xor` (salt `xor` 0)
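-- Writing s for salt and t for hash (typeOf key), and using the xor laws
-- x `xor` x == 0 and x `xor` 0 == x, the right-hand side simplifies:
--   s `xor` t `xor` (s `xor` 0)  ==  t
-- so the two salt contributions cancel and the salt has no effect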
</code></pre><p>Since <code>xor</code> is associative and commutative, those two <code>salt</code> values cancel out! While I wasn't seeing complete cancellation, I was seeing quite a degree of collision, so I changed to:</p><pre><code class="language-haskell">hashWithSalt salt (Key key) = hashWithSalt salt (typeOf key, key)
</code></pre><p>With that <a href="https://github.com/digital-asset/ghcide/pull/588">fix in Ghcide</a>, all collisions went away, and all space leaks left with them. I had taken this implementation of hash combining from Shake, and while it's not likely to be a problem in the setting where it's used, <a href="https://github.com/ndmitchell/shake/commit/297c60fa0c6b0d4e98f61b9cdb1359a409cda901">I've fixed it in Shake too</a>.</p><p><strong>Benchmarking Ghcide</strong></p><p>With the hash collisions reduced, and the number of traversals when computing a hash reduced, I wondered what the impact was on performance. A rerun of the original benchmark showed the time had reduced to 9.10s - a speed-up of about 5%. Not huge, but welcome.</p><p>Several days later we're left with fewer space leaks, more performance, and hopefully a better IDE experience for Haskell programmers. I failed in what I set out to do, but found some other bugs along the way, leading to 9 PRs/commits and 4 ongoing issues. I'd like to thank everyone in the Haskell IDE team for following along, making suggestions, confirming suspicions, and generally working as a great team. <a href="https://neilmitchell.blogspot.com/2020/01/one-haskell-ide-to-rule-them-all.html">Merging the Haskell IDE efforts</a> continues to go well, both in terms of code output, and team friendliness.</p>Neil Mitchellhttp://www.blogger.com/profile/13084722756124486154noreply@blogger.com0tag:blogger.com,1999:blog-7094652.post-51055917498863959862020-05-23T20:29:00.001+01:002020-05-23T20:29:36.971+01:00Shake 0.19 - changes to process execution<p><em>Summary: The new version of Shake has some tweaks to how stdin works with <code>cmd</code>.</em></p><p>I've just released <a href="https://shakebuild.com/">Shake 0.19</a>, see the <a href="https://github.com/ndmitchell/shake/blob/master/CHANGES.txt">full change log</a>. 
Most of the interesting changes in this release are around the <code>cmd</code>/<code>command</code> functions, which let you easily run command lines. As an example, Shake has always allowed:</p><pre><code class="language-haskell">cmd "gcc -c" [source] "-o" [output]
</code></pre><p>This snippet compiles a source file using <code>gcc</code>. The <a href="https://hackage.haskell.org/package/shake/docs/Development-Shake-Command.html#v:cmd"><code>cmd</code> function</a> is variadic: it treats strings as space-separated arguments and lists as literal arguments. It's overloaded by return type, so it can work in the <code>IO</code> monad (entirely outside Shake) or the Shake <code>Action</code> monad (inside Shake). You can capture results and pass in options, e.g. to get the standard error and run in a different directory, you can do:</p><pre><code class="language-haskell">Stderr err <- cmd "gcc -c" [source] "-o" [output] (Cwd "src")
</code></pre><p>Shake is a dynamic build system with advanced dependency tracking features that lets you write your rules in Haskell. It just so happens that running commands is <em>very</em> common in build systems, so while not really part of a build system, it's a part of Shake that has had a lot of work done on it. Since <code>cmd</code> is both ergonomic and featureful, I've taken to using the <a href="https://hackage.haskell.org/package/shake/docs/Development-Shake-Command.html">module <code>Development.Shake.Command</code></a> in non-Shake related projects.</p><p><strong>Recent <code>cmd</code> changes</strong></p><p>The first API breaking change only impacts users of the <a href="https://neilmitchell.blogspot.com/2020/05/file-tracing.html">file access tracing</a>. The resulting type is now polymorphic, and if you opt for the <code>FSATrace ByteString</code>, you'll get your results a few milliseconds faster. Even if you stick with <code>FSATrace FilePath</code>, you'll get your results faster than the previous version. Performance of tracing happened to matter for a project I've been working on :-).</p><p>The other changes in this release are to process groups and the standard input. In Shake 0.18.3, changes were made to switch to <a href="https://hackage.haskell.org/package/process/docs/System-Process.html#t:CreateProcess"><code>create_group=True</code></a> in the <a href="https://hackage.haskell.org/package/process">process library</a>, as that improves the ability to cancel actions and clean up sub-processes properly. Unfortunately, on Linux that caused <a href="https://github.com/ndmitchell/shake/issues/748">processes that read from standard input to hang</a>. The correlation between these events, and the exact circumstances that triggered it, took a long time to track down - thanks to <a href="https://gergo.erdi.hu/">Gergő Érdi</a> for some <a href="https://github.com/ndmitchell/shake/issues/748#issuecomment-596450021">excellent bisection work</a>. 
Most processes that are run in a build system <em>should not</em> access the standard input, and the only reports have come from <code>docker</code> (don't use <code>-i</code>) and <code>ffmpeg</code> (pass <code>-nostdin</code>), but hanging is a very impolite way to fail. In older versions of Shake we inherited the Shake stdin to the child (unless you specified the stdin explicitly <a href="https://hackage.haskell.org/package/shake/docs/Development-Shake.html#v:Stdin">with <code>Stdin</code></a>), but now we create a new pipe with no contents. There are now options <a href="https://hackage.haskell.org/package/shake/docs/Development-Shake.html#v:NoProcessGroup"><code>NoProcessGroup</code></a> and <a href="https://hackage.haskell.org/package/shake/docs/Development-Shake.html#v:InheritStdin"><code>InheritStdin</code></a> which let you change these settings independently. I suspect a handful of commands will need flag tweaks to stop reading the stdin, but they will probably fail saying the stdin is inaccessible, so debugging it should be relatively easy.</p><p>In another tale of <code>cmd</code> not working how you might hope, in Shake 0.15.2 we changed <code>cmd</code> to <a href="https://hackage.haskell.org/package/process-1.6.9.0/docs/System-Process.html#v:close_fds">close file handles</a> when spawning a process. Unfortunately, that step is <em>O(n)</em> in the number of potential handles on your system, where <em>n</em> is <code>RLIMIT_NOFILE</code> and can be quite big, so we switched back in 0.18.4. Since 0.18.4 you can pass <code>CloseFileHandles</code> if you definitely want handles to be closed. It's been argued that <a href="https://www.microsoft.com/en-us/research/publication/a-fork-in-the-road/"><code>fork</code> is a bad design</a>, and this performance vs safety trade-off seems another point in favour of that viewpoint.</p><p>The amount of work that has gone into processes, especially around timeout and cross-platform differences, has been huge. 
I see 264 commits to these files, but the debugging time associated with them has been many weeks!</p><p><strong>Other changes</strong></p><p>This release contains other little tweaks that might be useful:</p><ul><li>Time spent in the <a href="https://hackage.haskell.org/package/shake/docs/Development-Shake.html#v:batch"><code>batch</code> function</a> is better accounted for in profiles.</li>
<li>Finally deleted the stuff that has been deprecated since 2014, particularly the <code>*></code> operator. I think a six year deprecation cycle seems more than fair for a pre-1.0 library.</li>
<li>Optimised modification time on Linux.</li>
</ul>Neil Mitchellhttp://www.blogger.com/profile/13084722756124486154noreply@blogger.com0tag:blogger.com,1999:blog-7094652.post-90024126236374192722020-05-20T15:20:00.000+01:002020-05-20T15:20:37.527+01:00GHC Unproposals<p><em>Summary: Four improvements to Haskell I'm not going to raise as GHC proposals.</em></p><p>Writing a <a href="https://github.com/ghc-proposals/ghc-proposals">GHC proposal</a> is a lot of hard work. It requires you to fully flesh out your ideas, and then defend them robustly. That process can take <a href="https://neilmitchell.blogspot.com/2018/12/ghc-from-bug-to-merge.html">many months</a>. Here are four short proposals that I won't be raising, but think would be of benefit (if you raise a proposal for one of them, I'll buy you a beer next time we are physically co-located).</p><p><strong>Use <code>:</code> for types</strong></p><p>In Haskell we use <code>:</code> for list-cons and <code>::</code> for types. That's the <a href="https://neilmitchell.blogspot.com/2018/11/counting-cost-of-colons-in-haskell.html">wrong way around</a> - the use of types is increasing, the use of lists is decreasing, and type theory has always used <code>:</code>. This switch has been <a href="https://github.com/ghc-proposals/ghc-proposals/pull/118">joke-proposed before</a>. We actually switched these operators <a href="https://medium.com/daml-driven/four-tweaks-to-improve-haskell-b1de9c87f816">in DAML</a>, and it worked very nicely. Having written code in both styles, I now write Haskell on paper with <code>:</code> for types instead of <code>::</code>. Nearly all other languages use <code>:</code> for types, <a href="https://docs.python.org/3/library/typing.html">even Python</a>. It's sad when Python takes the more academically pure approach than Haskell.</p><p><em>Is it practical</em>: Maybe. The compiler diff is quite small, so providing it as an option has very little technical cost. 
The problem is it bifurcates the language - example code will either work with <code>:</code> for types or <code>::</code> for types. It's hard to write documentation, text books etc. If available, I would switch my code.</p><p><strong>Make recursive <code>let</code> explicit</strong></p><p>Currently you can write <code>let x = x + 1</code> and it means loop forever at runtime because <code>x</code> is defined in terms of itself. You probably meant to refer to the enclosing <code>x</code>, but you don't get a type error, and often don't even get a good runtime error message, just a hang. In <code>do</code> bindings, to avoid the implicit reference to self, it's <a href="https://neilmitchell.blogspot.com/2020/03/the-pure-pattern.html">common to write</a> <code>x <- pure $ x + 1</code>. That can impose a runtime cost, and obscure the true intent.</p><p>In languages like OCaml there are two different forms of <code>let</code> - one which allows variables to be defined and used in a single <code>let</code> (spelled <code>let rec</code>) and one which doesn't (spelled <code>let</code>). Interestingly, this distinction is important in GHC Core, which has two different keywords, and a source <code>let</code> desugars differently based on whether it is recursive. I think Haskell should add <code>letrec</code> as a separate keyword and make normal <code>let</code> non-recursive. Most recursive bindings are done under a <code>where</code>, and these would continue to allow full recursion, so most code wouldn't need changing.</p><p><em>Is it practical</em>: The simplest version of this proposal would be to add <code>letrec</code> as a keyword equivalent to <code>let</code> and add a <a href="https://gitlab.haskell.org/ghc/ghc/issues/14527">warning on recursive <code>let</code></a>. 
Whether it's practical to go the full way and redefine the semantics of <code>let</code> to mean non-recursive binding depends on how strong the adoption of <code>letrec</code> was, but given that I suspect recursive <code>let</code> is less common, it seems like it could work. Making Haskell a superset of GHC Core is definitely an attractive route to pursue.</p><p><strong>Allow trailing variables in bind</strong></p><p>When writing Haskell code, I often have <code>do</code> blocks that I'm in the middle of fleshing out, e.g.:</p><pre><code>do fileName <- getLine
src <- readFile fileName
</code></pre><p>My next line will be to print the file or similar, but this entire <code>do</code> block, and every sub-part within it, is constantly a parse error until I put in that final line. When the IDE has a parse error, it can't really help me as much as I'd like. The reason for the error is that <code><-</code> can't be the final line of a <code>do</code> block. I think we should relax that restriction, probably under a language extension that only IDEs turn on. It's not necessarily clear <a href="https://twitter.com/ndm_haskell/status/1223215216739201030">what such a construct should mean</a>, but in many ways that isn't the important bit, merely that such a construct results in a valid Haskell program, and allows more interactive feedback.</p><p><em>Is it practical</em>: Yes, just add a language extension - since it doesn't actually enable any new power it's unlikely to cause problems. Fleshing out the semantics, and whether it applies to <code>let x = y</code> statements in a <code>do</code> block too, is left as an exercise for the submitter. An alternative would be to not change the language, but make GHC emit the error slightly later on, much like <code>-fdefer-type-errors</code>, which still works for IDEs (either way needs a GHC proposal).</p><p><strong>Add an exporting keyword</strong></p><p>Currently the top of every Haskell file duplicates all the identifiers that are exported - unless you just export everything (which you shouldn't). That approach duplicates logic, makes refactorings like renamings more effort, and makes it hard to immediately know if the function you are working on is exposed. It would be much nicer if you could just declare things that were exported inline, e.g. with a <code>pub</code> keyword - so <code>pub myfunc :: a -> a</code> both defines and exports <code>myfunc</code>. 
Rust has <a href="https://doc.rust-lang.org/reference/visibility-and-privacy.html">taken this approach</a> and it works out quite well, modulo <a href="https://twitter.com/ndm_haskell/status/1254015262451535874">some mistakes</a>. The current Haskell design has been found a bit wanting, with constructs like <code>pattern Foo</code> in the export list to differentiate when multiple names <code>Foo</code> might be in scope, when attaching the visibility to the identifier would be much easier.</p><p><em>Is it practical</em>: Perhaps, provided someone doesn't try to take the proposal too far. It would be super tempting to differentiate between exports out of the package, and exports that are only inside this package (what Rust clumsily calls <code>pub(crate)</code>). And there are other things in the module system that could be improved. And maybe we should export submodules. I suspect everyone will want to pile more things into this design, to the point it breaks, but a simple exporting keyword would probably be viable.</p>Neil Mitchellhttp://www.blogger.com/profile/13084722756124486154noreply@blogger.com2tag:blogger.com,1999:blog-7094652.post-88713405568813115832020-05-15T09:39:00.000+01:002020-05-15T11:42:55.254+01:00File Access Tracing<p><em>Summary: It is useful to trace files accessed by a command. Shake and FSATrace provide some tools to do that.</em></p><p>When writing a build system, it's useful to see which files a command accesses. In <a href="https://shakebuild.com/">the Shake build system</a>, we use that information for <a href="https://shakebuild.com/lint">linting</a>, an <a href="https://hackage.haskell.org/package/shake/docs/Development-Shake.html#v:AutoDeps">auto-deps feature</a> and a <a href="https://hackage.haskell.org/package/shake/docs/Development-Shake-Forward.html">forward build mode</a>. What we'd like is a primitive which, when applied to a command execution:</p><ol><li>Reports which files are read/written.</li>
<li>Reports the start and end time for when the files were accessed.</li>
<li>Reports what file metadata is accessed, e.g. modification times and directory listing.</li>
<li>Lets us pause a file access (so the dependencies can be built) or deny a file access (so dependency violations can be rejected early).</li>
<li>Is computationally cheap.</li>
<li>Doesn't require us to write/maintain too much low-level code.</li>
<li>Works on all major OSs (Linux, Mac, Windows).</li>
<li>Doesn't require <code>sudo</code> or elevated privilege levels.</li>
</ol><p>While there are lots of approaches to tracing that get some of those features, it is currently impossible to get them all. Therefore, Shake has to make compromises. The first four bullet points are about features -- we give up on 2 (timestamps) and 4 (pause/deny); 1 (read/writes) is essential, and we make 3 (metadata) optional, using the imperfect information when it's available and tolerating its absence. The last four bullet points are about how it works -- we demand 7 (compatibility) and 8 (no sudo) because Shake must be easily available to its users. We strive for 5 (cheap) and 6 (easy), but are willing to compromise a bit on both.</p><p>Shake abstracts the result behind the <a href="https://hackage.haskell.org/package/shake/docs/Development-Shake-Command.html#v:cmd"><code>cmd</code> function</a> with the <a href="https://hackage.haskell.org/package/shake/docs/Development-Shake.html#t:FSATrace"><code>FSATrace</code> return type</a>. As an example I ran in GHCi:</p><pre><code class="language-haskell">traced :: [FSATrace] <- cmd "gcc -c main.c"
print traced
</code></pre><p>Which compiles <code>main.c</code> with <code>gcc</code>, and on my machine prints 71 entries, including:</p><pre><code>[ FSARead "C:\\ghc\\ghc-8.6.3\\mingw\\bin\\gcc.exe"
, FSARead "C:\\Neil\\temp\\main.c"
, FSAWrite "C:\\Users\\ndmit_000\\AppData\\Local\\Temp\\ccAadCiR.s"
, FSARead "C:\\ghc\\ghc-8.6.3\\mingw\\bin\\as.exe"
, FSARead "C:\\Users\\ndmit_000\\AppData\\Local\\Temp\\ccAadCiR.s"
, FSAWrite "C:\\Neil\\temp\\main.o"
, ...
]
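-- Given such a trace, separating reads from writes is a couple of list
-- comprehensions over the constructors shown above (a sketch):
--   reads  = [x | FSARead  x <- traced]
--   writes = [x | FSAWrite x <- traced]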
</code></pre><p>Most of the remaining entries are DLLs that <code>gcc.exe</code> uses, typically from the Windows directory. I've reordered the list to show the flow more clearly. First the process reads <code>gcc.exe</code> (so it can execute it), which reads <code>main.c</code> and writes a temporary file <code>ccAadCiR.s</code>. It then reads <code>as.exe</code> (the assembler) so it can run it, which in turn reads <code>ccAadCiR.s</code> and writes <code>main.o</code>.</p><p>Under the hood, Shake currently uses <a href="https://github.com/jacereda/fsatrace">FSATrace</a>, but that is an implementation detail -- in particular the <a href="https://github.com/droundy/bigbro">BigBro</a> library might one day <a href="https://github.com/droundy/bigbro/issues/6">also be supported</a>. In order to understand the limitations of the above API, it's useful to understand the different approaches to file system tracing, and which ones FSATrace uses.</p><p><strong>Syscall tracing</strong> On Linux, <a href="https://www.linuxjournal.com/article/6100"><code>ptrace</code></a> allows tracing every system call made, examining the arguments, and thus recording the files accessed. Moreover, by tracing the <code>stat</code> system call even file queries can be recorded. The syscall tracing approach can be made complete, but because <em>every</em> syscall must be hooked, can end up imposing high overhead. This approach is used by BigBro as well as numerous other debugging and instrumentation tools.</p><p><strong>Library preload</strong> On both Linux and Mac most programs use a dynamically linked C library to make file accesses. By using <code>LD_PRELOAD</code> it is possible to inject a different library into the program memory which intercepts the relevant C library calls, recording which files are read and written. This approach is simpler than hooking syscalls, but only works if all syscall access is made through the C library. 
While normally true, that isn't the case for <a href="https://golang.org/">Go programs</a> (syscalls are invoked directly) or statically linked programs (the C library cannot be replaced).</p><p>While the technique works on a Mac, from Mac OS X 10.11 onwards system binaries can't be traced due to <a href="https://developer.apple.com/library/content/documentation/Security/Conceptual/System_Integrity_Protection_Guide/ConfiguringSystemIntegrityProtection/ConfiguringSystemIntegrityProtection.html">System Integrity Protection</a>. As an example, the C compiler is typically installed as a system binary. It is possible to disable System Integrity Protection (but not recommended by Apple); or to use non-system binaries (e.g. those supplied by <a href="https://nixos.org/nix/">Nix</a>); or to copy the system binary to a temporary directory (which works provided the binary does not afterwards invoke another system binary). The library preload mechanism is implemented by FSATrace and the copying system binaries trick on Mac is implemented in Shake.</p><p><strong>File system tracing</strong> An alternative approach is to implement a custom file system and have that report which files are accessed. One such implementation for Linux is <a href="https://github.com/jacereda/traced-fs">TracedFS</a>, which is unfortunately not yet complete. Such an approach can track all accesses, but may require administrator privileges to mount a file system.</p><p><strong>Custom Linux tracing</strong> On Linux, thanks to the open-source nature of the kernel, there are many custom file systems (e.g. <a href="https://github.com/libfuse/libfuse">FUSE</a>) and tracing mechanisms (e.g. <a href="http://www.brendangregg.com/ebpf.html">eBPF</a>), many of which can be used/configured/extended to perform some kind of system tracing. 
Unfortunately, most of these are restricted to Linux only.</p><p><strong>Custom Mac tracing</strong> <a href="https://github.com/Microsoft/BuildXL/">BuildXL</a> uses a <a href="https://github.com/Microsoft/BuildXL/blob/master/Documentation/Specs/Sandboxing.md#macos-sandboxing">Mac sandbox</a> based on <a href="https://flylib.com/books/en/3.126.1.140/1/">KAuth</a> combined with <a href="http://www.trustedbsd.org/mac.html">TrustedBSD Mandatory Access Control (MAC)</a> to both detect which files are accessed and also block access to specific files. The approach is based on internal Mac OS X details which have been reverse engineered, some of which are deprecated and scheduled for removal.</p><p><strong>Windows Kernel API hooking</strong> On Windows it is possible to hook the Kernel API, which can be used to detect when any files are accessed. Implementing such a hook <a href="https://github.com/jacereda/fsatrace/blob/master/src/win/patch.c">is difficult</a>, particularly around 32bit vs 64bit differences, as <a href="https://stackoverflow.com/questions/494284/createremotethread-32-64-and-or-64-32">custom assembly language trampolines must be used</a>. Furthermore, some antivirus products (incorrectly) detect such programs as viruses. Windows kernel hooking is available in both FSATrace and BigBro (sharing the same source code), although without support for 32bit processes that spawn 64bit processes.</p><p><strong>Current State</strong></p><p>Shake currently uses FSATrace, meaning it uses library preloading on Linux/Mac and kernel hooking on Windows. The biggest practical limitations vary by OS:</p><ul><li>On <strong>Linux</strong> it can't trace into Go programs (or other programs that use system calls directly) or statically linked binaries. Integrating BigBro as an alternative would address these issues.</li>
<li>On <strong>Mac</strong> it can't trace into system binaries called from other system binaries, most commonly the system C/C++ compiler. Using your own C/C++ installation, via <a href="https://brew.sh/">Homebrew</a> or Nix, is a workaround.</li>
<li>On <strong>Windows</strong> it can't trace 64bit programs spawned by 32bit programs. In most cases the 32bit binaries can easily be replaced by 64bit binaries. The only problem I've seen was caused by a five-year-old version of <code>sh</code> hiding out in my <code>C:\bin</code> directory, which was easily remedied with a newer version. The code to fix this issue <a href="https://github.com/rapid7/metasploit-payloads/blob/master/c/meterpreter/source/metsrv/base_inject.c">is available</a>, but scares me too much to try integrating.</li>
</ul><p>Overall, the tracing available in Shake has a simple API, is very useful for Shake, and has been repurposed in <a href="https://blogs.ncl.ac.uk/andreymokhov/stroll/">other build systems</a>. But I do dearly wish such functionality could be both powerful and standardised!</p>Neil Mitchellhttp://www.blogger.com/profile/13084722756124486154noreply@blogger.com2tag:blogger.com,1999:blog-7094652.post-77511142213549428302020-05-03T15:07:00.000+01:002020-05-03T15:07:28.408+01:00HLint 3.0<p><em>Summary: HLint 3.0 uses the GHC parser.</em></p><p>In <a href="https://neilmitchell.blogspot.com/2019/06/hlints-path-to-ghc-parser.html">June 2019</a> I posted about our intention to move HLint to the GHC parser. Since then a small group of us have been hard at work making the conversion -- first by parsing with both GHC and haskell-src-exts, and finally, with the newly released <a href="https://hackage.haskell.org/package/hlint">HLint 3.0</a>, parsing <em>only</em> with <a href="https://www.haskell.org/ghc/">GHC</a>. As of now, if your code can be parsed with GHC, it can probably be parsed with HLint. As new GHC releases come out, with new features and new forms of syntax, HLint will follow along closely.</p><p>The <a href="https://github.com/ndmitchell/hlint/blob/master/CHANGES.txt">change list for this release</a> records 51 separate items, which is about as many as the last nine HLint releases combined. Of those changes, 11 are breaking changes (the ones marked with <code>*</code>). That count omits all the hint conversions, which (hopefully!) aren't user visible. The main API breaks are in the <code>Language.Haskell.HLint</code> API, which has switched from <a href="https://hackage.haskell.org/package/haskell-src-exts">haskell-src-exts types</a> to GHC ones. You can now take a GHC syntax tree and apply HLint to it, or (as before) give HLint the source and have it do the parsing for you. 
We also took the opportunity to simplify the API while we were at it -- but the underlying functionality remains much the same. We also deleted a small number of command line flags that were no longer useful, and were never used very much. If you have difficulty converting to the new API or relied on some removed functionality, <a href="https://github.com/ndmitchell/hlint/issues">raise a bug</a>.</p><p>What was especially nice about this conversion process, and the development of HLint in general, is that it is increasingly becoming a real team, where my role is more reviewer than coder. There have been <a href="https://github.com/ndmitchell/hlint/graphs/contributors?from=2019-06-10&to=2020-05-13&type=c">21 distinct contributors</a> since the start of the GHC conversion, but I'd like to particularly call out a few major pieces of work that have been completed:</p><ul><li><a href="https://github.com/shayne-fletcher">Shayne Fletcher</a> is responsible for the <a href="https://hackage.haskell.org/package/ghc-lib-parser"><code>ghc-lib-parser</code> library</a> that makes it feasible to use a single GHC API across multiple GHC versions. Without that, using the GHC library would be at least double the work (and just wouldn't be feasible). Shayne also did a lot of the conversion, mapping many of the rule types from haskell-src-exts to GHC. These APIs are surprisingly different given they have the same underlying representation.</li>
<li><a href="https://github.com/googleson78">Georgi Lyubenov</a> did most of the conversions that Shayne didn't do.</li>
<li><a href="https://github.com/josephcsible">Joseph C. Sible</a> has made sure that while efforts were focused on a complete rewrite of the code base, the underlying hints have continued to improve, removing incorrect hints and adding useful additional ones.</li>
<li><a href="https://github.com/zliu41">Ziyang Liu</a> has focused on the refactoring side of HLint. The <a href="https://mpickering.github.io/gsoc2015.html">initial refactoring work</a> was completed by Matthew Pickering as part of GSoC 2015. Since then it's had mild attention at best. Ziyang has stepped into the gap, importantly adding tests and CI, and improving the refactorings in lots of places. It now feels like a real part of HLint.</li>
</ul><p>We hope you enjoy HLint 3.0 and beyond!</p><p>PS. You may spot we're already on <a href="https://hackage.haskell.org/package/hlint">HLint 3.0.2 on Hackage</a> - thanks to <a href="https://github.com/RyanGlScott">Ryan Scott</a> for already finding a few bugs.</p>Neil Mitchellhttp://www.blogger.com/profile/13084722756124486154noreply@blogger.com0tag:blogger.com,1999:blog-7094652.post-22874280984144072122020-04-28T21:48:00.000+01:002020-04-28T21:49:38.642+01:00Writing a fast interpreter<p><em>Summary: Interpretation by closure is a lot faster than I expected.</em></p><p>Let's imagine we have an imperative language (expressions, assignments, loops etc.) and we want to write an interpreter for it. What styles of interpreter are there? And how fast do they perform? I was curious, so I wrote a demo with some benchmarks. The full code, in Rust, <a href="https://gist.github.com/ndmitchell/084bb1a8dd30188492fcd0b8b8c70c6e">is available here</a>.</p><p>First, let's get a taste of the mini-language we want to interpret:</p><pre><code class="language-rust">x = 100;
y = 12;
while x != 0 {
    y = y + 1 + y + 3;
    x = x - 1;
}
</code></pre><p>We can immediately translate <code>x</code> and <code>y</code> from being named variables to indices in an array, namely <code>0</code> and <code>1</code>. Once we've done that there are a few interpretation techniques that spring to mind (with their performance on my benchmark in brackets):</p><ol><li>Interpret the AST directly (2.1s).</li>
<li>Compile from the AST to closures (1.4s).</li>
<li>Compile from the AST to a stream of instructions (1.5s).</li>
<li>Encode those instructions as bytes (1.5s).</li>
<li>Compile to assembly or JIT - I didn't try this approach (it's a lot more work).</li>
</ol><p>All of these are vastly slower than my version written directly in Rust (which takes a mere 0.003s) -- but my benchmark didn't have any real operations in it, so this comparison will be the absolute worst case.</p><p>Let's go through the approaches.</p><p><strong>Style 1: AST evaluation</strong></p><p>One option is to directly interpret the AST. Given a vector named <code>slots</code> representing the variables by index, we need to change the <code>slots</code> as we go. A fragment of the interpreter might look like:</p><pre><code class="language-rust">fn f(x: &Expr, slots: &mut Vec<i64>) -> i64 {
    match x {
        Expr::Lit(i) => *i,
        Expr::Var(u) => slots[*u],
        Expr::Add(x, y) => f(x, slots) + f(y, slots),
        ...
    }
}
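
// The `Expr` type these fragments match on is not shown in the post; it
// presumably looks something like this sketch:
enum Expr {
    Lit(i64),                  // integer literal
    Var(usize),                // variable, resolved to a slot index
    Add(Box<Expr>, Box<Expr>), // addition of two subexpressions
}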
</code></pre><p>It's as simple as the options come. Given the expression and the slots we need to write to, we do whatever the instruction tells us. But that simplicity leads to low performance.</p><p><strong>Style 2: Conversion to closure</strong></p><p>Instead of traversing the AST at runtime, we can traverse it once, and produce a closure/function that performs the action when run (e.g. see <a href="https://blog.cloudflare.com/building-fast-interpreters-in-rust/">this blog post</a>). Given that we access the slots at runtime, we make them an argument to the closure. In Rust, the type of our closure is:</p><pre><code class="language-rust">type Compiled = Box<dyn Fn(&mut Vec<i64>) -> i64>;
</code></pre><p>Here we are defining a <code>Fn</code> (a closure -- function plus captured data) that goes from the slots to a result. Because these functions vary in how much data they capture, we have to wrap them in <code>Box</code>. With that type we can now define our compilation function:</p><pre><code class="language-rust">fn compile(x: &Expr) -> Compiled {
    match x {
        Expr::Lit(i) => {
            let i = *i;
            Box::new(move |_| i)
        }
        Expr::Var(u) => {
            let u = *u;
            Box::new(move |slots| slots[u])
        }
        Expr::Add(x, y) => {
            let x = compile(x);
            let y = compile(y);
            Box::new(move |slots| x(slots) + y(slots))
        }
        ...
    }
}
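
// The same technique boiled down to a self-contained fragment, using
// hypothetical smart constructors `lit` and `add` (the `Compiled` alias
// is repeated here so the fragment stands alone):
type Compiled = Box<dyn Fn(&mut Vec<i64>) -> i64>;

fn lit(i: i64) -> Compiled {
    Box::new(move |_| i)
}

fn add(x: Compiled, y: Compiled) -> Compiled {
    Box::new(move |slots| x(slots) + y(slots))
}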
</code></pre><p>Instead of taking the AST (compile-time information) and the slot data (runtime information) we use the compile-time information to produce a function that can then be applied to the run-time information. We trade matching on the AST for an indirect function call at runtime. Rust is able to turn tail calls on dynamic functions into jumps and the processor is able to accurately predict the jumps/calls, leading to reasonable performance.</p><p>One large advantage of the closure approach is that adding specialised variants, e.g. compiling a nested <code>Add</code> differently, can be done locally and with no additional runtime cost.</p><p><strong>Style 3: Fixed-size instructions</strong></p><p>Instead of interpreting an AST, or jumping via indirect functions, we can define a set of instructions and interpret an array of them using a stack of intermediate values. We are effectively virtualising a CPU, including a program counter. We can define a bytecode with instructions such as:</p><pre><code class="language-rust">enum Bytecode {
    Assign(u32), // assign the value at the top of the stack to a slot
    Var(u32),    // push the value in slot to the top of the stack
    Lit(i32),    // push a literal on the stack
    Add,         // Add the top two items on the stack
    Jump(u32),
    ...
}
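
// For example, the assignment `y = y + 1 + y + 3` from the mini-language
// above might compile to (a sketch, evaluating left to right):
//   Var(1), Lit(1), Add, Var(1), Add, Lit(3), Add, Assign(1)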
</code></pre><p>We now interpret these instructions:</p><pre><code class="language-rust">let mut pc = 0;
let mut slots = vec![0; 10];
let mut stack = Stack::new();
loop {
    match xs[pc] {
        Assign(x) => slots[x as usize] = stack.pop(),
        Var(x) => stack.push(slots[x as usize]),
        Lit(i) => stack.push(i as i64),
        Add => {
            let x = stack.pop();
            let y = stack.pop();
            stack.push(x + y)
        }
        Jump(pc2) => pc = pc2 as usize - 1,
        ...
    }
    pc = pc + 1;
}
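
// As written, the loop above runs forever -- the real code needs a way to
// stop. A cut-down, self-contained variant with a hypothetical Halt
// instruction added, small enough to test:
#[derive(Clone, Copy)]
enum Op { Lit(i64), Add, Assign(usize), Halt }

fn run(xs: &[Op], slots: &mut Vec<i64>) {
    let mut pc = 0;
    let mut stack: Vec<i64> = Vec::new();
    loop {
        match xs[pc] {
            Op::Lit(i) => stack.push(i),
            Op::Add => {
                let x = stack.pop().unwrap();
                let y = stack.pop().unwrap();
                stack.push(x + y)
            }
            Op::Assign(s) => slots[s] = stack.pop().unwrap(),
            Op::Halt => return,
        }
        pc += 1;
    }
}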
</code></pre><p>Most of these operations work against the stack. I found that if I used checked array accesses on the stack (the default in Rust) it went about the same speed as AST interpretation. Moving to unchecked access made it similar in performance to the closure version (but slightly worse).</p><p>The bytecode approach is much harder to implement, requiring a compiler to the bytecode. It's also much harder to add specialised variants for certain combinations of instructions. To get good performance via the branch predictor probably requires further tricks beyond what I've shown here (e.g. <a href="http://www.cs.toronto.edu/~matz/dissertation/matzDissertation-latex2html/node6.html">direct threading</a>).</p><p>There are advantages to a bytecode though -- it's easier to capture all the program state, which is useful for garbage collection and other operations.</p><p><strong>Style 4: Byte-encoded instructions</strong></p><p>Instead of having a Rust <code>enum</code> to represent the instructions, we can instead use bytes, so instead of:</p><pre><code class="language-rust">Lit(38)
Lit(4)
Add
Assign(0)
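
// Dispatching directly on the bytes (a runnable sketch, assuming the
// encoding described in the text: 0 = Lit, 1 = Add, 2 = Assign, with
// each operand packed into a single byte):
fn run_bytes(code: &[u8], slots: &mut Vec<i64>) {
    let mut stack: Vec<i64> = Vec::new();
    let mut pc = 0;
    while pc < code.len() {
        match code[pc] {
            0 => {
                stack.push(code[pc + 1] as i64); // Lit
                pc += 2
            }
            1 => {
                let x = stack.pop().unwrap(); // Add
                let y = stack.pop().unwrap();
                stack.push(x + y);
                pc += 1
            }
            2 => {
                slots[code[pc + 1] as usize] = stack.pop().unwrap(); // Assign
                pc += 2
            }
            _ => panic!("unknown opcode"),
        }
    }
}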
</code></pre><p>We would have a series of bytes <code>[0,38,0,4,1,2,0]</code> (where <code>0</code> = <code>Lit</code>, <code>1</code> = <code>Add</code>, <code>2</code> = <code>Assign</code>). This approach gives a more compact bytecode, and might have an impact on the instruction cache, but in my benchmarks performed the same as style 3.</p>Neil Mitchellhttp://www.blogger.com/profile/13084722756124486154noreply@blogger.com5tag:blogger.com,1999:blog-7094652.post-32299935506115502492020-03-16T09:23:00.000+00:002020-03-16T20:17:29.882+00:00The <- pure pattern<p><em>Summary: Sometimes <code><- pure</code> makes a lot of sense, avoiding some common bugs.</em></p><p>In Haskell, in a monadic <code>do</code> block, you can use either <code><-</code> to bind monadic values, or <code>let</code> to bind pure values. You can also use <code>pure</code> or <code>return</code> to wrap a value with the monad, meaning the following are mostly equivalent:</p><pre><code>let x = myExpression
x <- pure myExpression
</code></pre><p>The one place they aren't fully equivalent is when <code>myExpression</code> contains <code>x</code> within it, for example:</p><pre><code>let x = x + 1
x <- pure (x + 1)
</code></pre><p>With the <code>let</code> formulation you get an infinite loop which never terminates, whereas with the <code><- pure</code> pattern you take the previously defined <code>x</code> and add <code>1</code> to it. To solve the infinite loop, the usual solution with <code>let</code> is to rename the variable on the left, e.g.:</p><pre><code>let x2 = x + 1
</code></pre><p>And now make sure you use <code>x2</code> everywhere from now on. However, <code>x</code> remains in scope, with a more convenient name, and the same type, but probably shouldn't be used. Given a sequence of such bindings, you often end up with:</p><pre><code>let x2 = x + 1
let x3 = x2 + 1
let x4 = x3 + 1
...
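-- for comparison, the <- pure formulation keeps a single name
-- throughout, leaving no index to get wrong:
--   x <- pure (x + 1)
--   x <- pure (x + 1)
--   x <- pure (x + 1)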
</code></pre><p>Given a large number of unchecked indices that must be strictly incrementing, bugs usually creep in, especially when refactoring. The unused variable warning will sometimes catch mistakes, but not if a variable is legitimately used twice and one of those instances is incorrect.</p><p>Given the potential errors, when a variable <code>x</code> is morally "changing" in a way that the old <code>x</code> is no longer useful, I find it much simpler to write:</p><pre><code>x <- pure myExpression
</code></pre><p>The compiler now statically ensures we haven't fallen into the traps of an infinite loop (which is obvious and frustrating to track down) or using the wrong data (which is much harder to track down, and often very subtly wrong).</p><p><strong>What I really want:</strong> What I actually think Haskell should have done is made <code>let</code> non-recursive, and had a special <code>letrec</code> keyword for recursive bindings (leaving <code>where</code> be recursive by default). This distinction is present in GHC Core, and would mean <code>let</code> was much safer.</p><p><strong>What HLint does:</strong> HLint is very aware of the <code><- pure</code> pattern, but also aware that a lot of beginners should be guided towards <code>let</code>. If any variable is defined more than once on the LHS of an <code><-</code> then it leaves the <code>do</code> alone, otherwise it will suggest <code>let</code> for those where it fits.</p><p><strong>Warnings:</strong> In the presence of <tt>mdo</tt> or <tt>do rec</tt> both formulations might end up being the same. If the left is a refutable pattern you change between error and fail, which might be quite different. Let bindings might be generalised. This pattern gives a warning about shadowed variables with <tt>-Wall</tt>.</p>Neil Mitchellhttp://www.blogger.com/profile/13084722756124486154noreply@blogger.com9