Monday, July 01, 2019

Thoughts for a Haskell IDE

Summary: We have been working on pieces for a Haskell IDE at Digital Asset.

At Digital Asset, we wrote the DAML programming language. The compiler builds on GHC, and one of the important tools for using DAML is an IDE. You can try the DAML IDE online or download it. Since we wrote the DAML IDE in Haskell, and DAML uses GHC under the hood, it's possible to take the work we did for the DAML IDE and turn them into pieces for a Haskell IDE. In the rest of this post I'll outline what we wrote, and how I think it can make a full Haskell IDE.

What has Digital Asset written?

We have written a Haskell library hie-core, which serves as the "core" of an IDE. It maintains state about which files are open. It generates diagnostics. It runs the parser and type checker. It doesn't figure out how to load your package, and it doesn't have integrations with things like HLint etc. In my view, it should never gain such features - it's deliberately a small core of an IDE, which can be extended with additional rules and handlers after-the-fact.

On the technical side, at the heart of the IDE is a key-value store, where keys are pairs of file names and stages (e.g. TypeCheck) and values are dependent on the stage. We use the Shake build system in memory-only mode to record dependencies between phases. As an example of a rule:

define $ \TypeCheck file -> do
    pm <- use_ GetParsedModule file
    deps <- use_ GetDependencies file
    tms <- uses_ TypeCheck (transitiveModuleDeps deps)
    packageState <- use_ GhcSession ""
    opt <- getIdeOptions
    liftIO $ Compile.typecheckModule opt packageState tms pm

To type check a file, we get the parse tree, the transitive dependencies, a GHC session, and then call a typecheckModule helper function. If any of these dependencies change (e.g. the source file changes) the relevant pieces will be rerun.

Building on top of Shake wasn't our first choice - we initially explored two painful dead ends. While Shake isn't perfect for what we want, it's about 90% of the way there, and having robust parallelism and many years of solid engineering is worth some minor compromises in a few places. Having all the features of Shake available has also been exceptionally helpful, allowing us to try out new things quickly.

What else is required for an IDE?

My hope is that hie-core can become the core of a future IDE - but what else is required?

  • Something to load up a GHC session with the right packages and dependencies in scope. For DAML, we have a custom controlled environment so it's very easy, but real Haskell needs a better solution. My hope is that hie-bios becomes the solution, since I think it has a great underlying design.
  • Some plugins to add features, such as the as-yet-unwritten hie-hlint and hie-ormolu. Since we add lots of features on top of hie-core to make the DAML IDE, we have a good story for extensions in hie-core. Importantly, because shake is designed to be extensible, these extensions can integrate with the full dependency graph.
  • Something to talk Language Server Protocol (LSP) to communicate with editors, for which we use the existing haskell-lsp.
  • An extension for your editor. We provide a VS Code extension as extension in hie-core, but it's a fairly boilerplate LSP implementation, and people have got it working for Emacs already.
  • Something to put it all together into a coherent project, generate it, distribute it etc. A project such as haskell-ide-engine might be the perfect place to bring everything together.

Can I try it now?

Yes - instructions here. I've been using hie-core as my primary Haskell development environment since ZuriHac two weeks ago, and I like it a lot. However, beware:

  • The IDE doesn't load all the relevant files, only the ones you have open.
  • Integration with things like stack doesn't work very well - I've been using hie-bios in "Direct" mode - giving it the flags to start ghci myself. See my integrations for shake and hlint.
  • Features like hs-boot files and Template Haskell need more work to be fully supported, although a bit of Template Haskell has been observed to work.

These issues are being discussed on the hie-bios issue tracker.

Hypothetical FAQ

Q: Is something like FRP better than Shake for describing dependencies? A: I think it's clear that an IDE should use some dependency/incremental computation/parallel rebuilding approach. Shake offers one of those, and is well tested, exception safe, performant etc. The mapping from Shake to what we really want is confined to a single module, so feel free to experiment with alternatives.

Q: Who has contributed? Many many people have contributed pieces, including the whole team at Digital Asset, in particular Tim Williams, David Millar-Durant, Neil Mitchell and Moritz Kiefer.

Q: What is the relationship to haskell-ide-engine? My hope is this piece can slot into the other great things that have been done to make IDE tooling better, specifically haskell-ide-engine. This post is intended to start that discussion.

Tuesday, June 18, 2019

The One PR Per Day Rule

Summary: The rough rule I use for teams I'm on is make at least one PR per day.

One of the principles I've used quite successfully in a number of teams I've been involved with is:

Make at least one Pull Request per day

This principle nicely captures a number of development practices I consider important.

  • Most things should be reflected in code. If you spend a day coding, improving documentation, writing tests etc. there is a natural reflection in the code. If you spend a day helping someone through some problems, that probably indicates there is better documentation to be written. If you spend a day doing dev-ops, that should probably be reflected with Terraform files or similar. Not everything that matters produces code (e.g. organising an office party, immigration paperwork, attending a conference), but most things do.

  • Work incrementally. If a piece of code takes more than one day, it's a good idea to split it into smaller pieces that can land incrementally. It's always possible that after a few days work you'll realise your overarching idea wasn't great, but if you've polished up some libraries and added tests along the way, that still produced value.

  • Work with autonomy. I'm a big fan of giving developers as much autonomy as possible - discuss the broad goals and then let them figure out the details. However, with such freedom, it's much easier for things to go off in the wrong direction. Seeing incremental pieces of code every day gives a fairly good direction indicator, and allows problems to surface before a massive time investment.

  • Write reviewable code. If you have 20K lines in one big blob, there's no realistic way to review it. By splitting code into smaller, manageable, independent units it's much easier to review. More importantly, the reviewer should be able to say "No, that's not a good idea" - doing that to a days work is sad, doing it to a whole months work is brutal.

  • Foster collaboration. In a rapidly moving project, it's important that everyone is benefiting from other peoples incremental improvements, as otherwise everyone solves the same problems. By getting the code merged every day it's much easier for different people to contribute to an area of the code base, avoiding the problem of others staying away from a piece of code that someone else is working on.

  • Get feedback. If the end user is able to test the results every day that's even better, as it means they can be involved in the feedback loop - potentially refining what they actually want.

The "rule" isn't really a rule, it's more a statement of culture and principles, but one I have found concise and simple to explain. While I like this as a statement of culture, I do not measure it, as that would create all the wrong incentives.

Monday, June 17, 2019

Shake from 10,000ft

Summary: A very high-level view of the engineering aspects of Shake.

The theory behind Shake is now well documented in the Build Systems a la Carte paper, but the engineering design of the system is not. This post is a high-level overview of Shake, from 10,000ft (the types are the types I'm thinking of in my mind - read the source code for the ground truth).

Data Storage

At it's heart, the Shake "database" (not really a database) stores a mapping from keys (k) to values (v). That mapping is stored on disk as a list of (k,v) pairs. When Shake computes a new value for k, it appends it to the end of the file. When Shake starts, it loads the list, keeping only the last value for any given k, and producing a Map k v. On disk, each entry is prefixed by its size, so if the file is truncated (e.g. machine shutdown), any trailing incomplete data can be discarded.

When operating in memory, Shake uses the data type Map k (Status v), with the definitions:

data Result v = Result
    {result :: v -- ^ the result associated with the Key
    ,built :: Step -- ^ when it was actually run
    ,changed :: Step -- ^ when it last changed
    ,depends :: [[Id]] -- ^ dependencies
    }

data Status v
    = Loaded (Result v)
    | Running (Either SomeException (Result v) -> IO ())
    | Error SomeException
    | Ready (Result v)

Data is loaded in the Loaded state. When someone demands a key it moves to Running - anyone who subsequently demands it will be added to the callback. After the run completes it becomes either Error or Ready.

Execution Model

Shake runs values in the Action monad, which is a combination of some global state (e.g. settings), per-rule state (e.g. dependencies), continuation monad with IO underlying everything. The execution model of Shake is that every Action computation is either blocked waiting for a Running to complete, or queued/executing with the thread pool. The thread pool has a list of things to do and runs them in a given priority order, respecting parallelism constraints. Most threads start up, do a bit of work, block on a Running and leave it for another item in the thread pool to continue them.

To pause a thread we use continuations, meaning the most important operation on Action (which isn't available to users!) is:

captureRAW :: ((Either SomeException a -> IO ()) -> IO ()) -> Action a

This function stops an Action in its tracks, resuming it when the continuation is called. One invariant of Shake, which is (sadly!) not tracked by the type system is that every continuation passed to captureRAW must be called exactly once.

The symmetry between Running and captureRAW is deliberate, and convenient.

To kick start the thread pool, the user specifies things to run with action - a Shake program completes when all those initial action calls have completed.

Rules

The final piece of the puzzle is what Shake actually does to build a key. The core of Shake is abstract over the k and v, but Shake ships with an outer layer of around ten additional rule types -- the most important of which is files. Taking an idealised (and inefficient) file rule, we can think of it as mapping from file paths to file contents. To provide such a rule, we first declare the type mapping:

type instance RuleResult FilePath = String

And then declare the rule. The rule says how to take the key (the filename), whether its dependencies have changed, the old value (the old contents), and produce a new value (the current contents). In addition, the rule must say if the contents have changed in a meaningful way, which causes anyone who depended on them to rebuild.

Shake programs typically call apply which builds a list of keys in parallel, moving all the keys to Loaded (or at least one to Error) before continuing.

Hidden Complexity

There's a number of pieces I haven't mentioned but which hide quite a lot of complexity:

  • Shake operates on any k/v pair, but serialising arbitrary values is hard, so Shake needs to build mapping and translation tables to make that work.
  • Many rules are defined in terms of pattern matches - e.g. **/*.c - that matching logic is tricky.
  • Many rules ultimately call command line programs, so a flexible command line execution API is required.
  • The rules that Shake ships with are highly optimised and have to operate in a variety of circumstances, e.g. with --skip flags etc, so have a lot of cases in them.
  • Shake goes to a lot of effort to make binary serialisation fast, as otherwise that turns into a bottleneck.
  • Exceptions, parallelism and continuations aren't natural bedfellows - the combination requires some care and attention.
  • There are lots of utility functions, UI concerns, profiling features etc.
  • There are lots of tests. Shake is 17K lines of code, of which 4.5K lines is tests.

The Picture Version

Stepping back, the picture diagram looks like:

For all gory details see the source code.

Thursday, June 13, 2019

HLint's path to the GHC parser

Summary: HLint is going to switch to the GHC parser over the next few months. The plan is below.

For some time, HLint has been accumulating a list of those files which are valid GHC Haskell but don't parse with haskell-src-exts. The list of differences keeps growing. While I have appreciated all the maintainers of haskell-src-exts, there have been a fair few of them recently, and the project has felt like it was in maintenance mode, rather than a vibrant project.

To solve this problem, I decided to switch to the GHC parser. However, the GHC parse tree changes significantly with each version of GHC, and HLint needs to support more than one version of GHC. The solution was ghc-lib - a decoupling of the GHC API, turning it into a reusable library. As of now, the latest haskell-src-exts maintainer has recommended people move to ghc-lib.

The plan for HLint is tracked in a GitHub issue. The first step was to switch so all files are parsed with both haskell-src-exts and ghc-lib - with a failure if either parser fails - that step has been completed and released (with much work from Shayne Fletcher, who is my partner in crime for this transition).

The next step was to abstract over the Language.Haskell.HLint3 API to produce a version that didn't fundamentally rely on the haskell-src-exts data types. That has led to the Language.Haskell.HLint4 API which makes things like parsed modules abstract, and removes functions that Aelve Codesearch showed weren't being used in practice (e.g. functions for approximate Scope resolution).

The next release will ship with a 0.1 breaking-change bump and HLint3 reexporting what is currently HLint4. If you think the HLint4 API does not include necessary functions, please let me know ASAP. After that release, we'll start changing hints one by one to use the GHC parse tree. Once that is complete, we will drop the dependency on haskell-src-exts and the project will be complete.

For command line users of HLint you should notice greater compatibility with GHC, but relatively little else.

Monday, May 20, 2019

Hoogle XSS Vulnerability

Summary: Hoogle 5.0.17.6 and below have an XSS vulnerability, fixed in later versions.

On Friday afternoon I got an email from Alexander Gugel with the subject line "Non-persistent XSS vulnerability on hoogle.haskell.org" - never a good thing to get. He had found that Hoogle was echoing the user search string back into the page, meaning that if you searched for %27"><marquee style you could make all the results scroll past in a disturbingly hypnotic manner. Oh dear!

Step 1: Fix the website

The first concern was to fix the website. While there aren't any cookies stored by Hoogle, and there are no logon forms or similar, the Project Zero blog has taught me that given the tiniest chink, everything can be broken. Fortunately, Alex emailed me using the email address on my webpage, described the problem, and provided a 3 line diff that escaped all the problematic variables. I applied this fix and pushed a new version to hoogle.haskell.org.

Step 2: Use the type system

Like any good Haskeller, my first thought on encountering a bug is to use the type system to prevent it by construction. The problem boils down to taking user input and splicing it into an HTML page. My initial fix was to introduce a type Taint:

newtype Taint a = Taint a

escapeUntaint :: Taint String -> String
escapeUntaint (Taint x) = escapeHTML x

The idea is that instead of the query parameters to the web page being String's that can be carelessly spliced into the output, they were Taint String values whose only real unwrapping function involves escaping any HTML they may contain. Furthermore, Taint can have instances for Monad etc, meaning you can work on tainted values, but the result will always remain tainted.

Using this approach uncovered no additional problems, but gave me much more confidence there weren't any I just hadn't found.

Step 3: Make a release

At this point I made a release of Hoogle 5.0.17.7. This version has no known XSS issues with it.

Step 4: Switch to blaze-html

While Taint is an effective tool for some domains, the real problem for Hoogle was that I was building up HTML values using String - making it way too easy to create invalid HTML, and providing an easy attack vector. The next change was to switch to blaze-html, which uses strong typing to ensure the HTML is always valid. Instead of having to call escapeHTML to turn bad String into good String, I instead used H.string to turn bad String into good Markup. For the rare case where there genuinely was String that contained HTML for good reasons I used H.preEscapedString, making the "don't escape" explicit and longer, and the "do escape" the default - a much safer default.

Step 5: Use Content Security Policy headers

There are a whole suite of headers that can be returned by the server to opt in to additional checking, known as CSP headers. These headers can ban inline script, detect XSS attacks, avoid confusion with MIME types, avoid http downgrade attacks and more. Thanks to Gary Verhaegen many of these are now applied to Hoogle, meaning that even if my code is wrong, the chances of it causing any damange (even just hypnotic scrolling) are much reduced.

Step 6: Relax

Hoogle 5.0.17.8 has all the security fixes listed and is deployed to hoogle.haskell.org. Hopefully no more security issues for a while!

Many thanks to Alexander Gugel for the responsible disclosure, and to Gary Verhaegen for his work on CSP headers.

Tuesday, May 14, 2019

Shake with Applicative Parallelism

Summary: Shake now does that Applicative trick from Haxl.

In Shake 0.17.9 and below, need xs >> need ys builds xs in parallel, then afterwards builds ys in parallel. The same is true of need xs *> need ys, where *> is the applicative equivalent of >>. From Shake 0.18 onwards both versions run everything in parallel. Hopefully that makes some Shake-based build systems go faster.

What change is being made?

If you make two calls to apply without any IO, monadic-bind or state operations in between then they will be executed as though you had made a single call to apply. As examples, need, askOracle and getDirectoryFiles are all calls to apply under the hood, so can be merged. However, note that the invariants are somewhat subtle. Something as simple as:

myNeed xs = do putNormal "Needing here"; need xs

Will not be merged with a preceeding need - the function putNormal queries the state (what is the verbosity level), does IO and contains a monadic bind.

Why are you making this change?

I am making the change for two reasons: 1) people have kept asking for it since Haxl does it; 2) the Hadrian build probably benefits from it. The downsides are relatively low (more complexity inside Shake, slightly slower Action operations) but the benfits are potentially large.

Why didn't you make this change sooner?

My previous reasoning for not making the change was:

Shake could follow the Haxl approach, but does not, mainly because they are targeting different problems. In Haxl, the operations are typically read-only, and any single step is likely to involve lots of operations. In contrast, with Shake the operations definitely change the file system, and there are typically only one or two per rule. Consequently, Shake opts for an explicit approach, rather than allow users to use *> (and then inevitably add a comment because its an unusual thing to do).

I stand by that comment - explicit grouping of need or explicit use of parallel is often better - all it takes is a sneaky >>= and the parallelism disappears. But if this change improves some build times, it's hard to argue strongly against.

Will it break any build systems?

Potentially, but unlikely, and those it will break were already on thin ice. As some examples:

  • If a depends on some state change from b (e.g. creating a directory), but doesn't have a dependency on it, then need [a] >> need [b] might have worked, while need [a,b] might not. The correct solution is for a to depend on b, if it does in fact depend on b, or at the very least use orderOnly.
  • If you use getDirectoryFiles on generated files (something the documentation says is a bad idea) then if merged with the thing that generates the files you will get incoherent results. The solution is to avoid using getDirectoryFiles on generated files.

Thanks to Pepe Iborra for encouraging, testing and troubleshooting this change.

Monday, April 29, 2019

foldr under the hood

Summary: The foldr function seems simple, but is actually very complex, with lots of layers. This post dives through the layers.

The foldr function takes a list and replaces all : (cons) and [] (nil) values with functions and a final value. It's available in the Haskell Prelude and described on Wikipedia. As some examples:

sum = foldr (+) 0
map f = foldr (\x xs -> f x : xs) []

But the simple foldr described on Wikipedia is many steps away from the one in the Haskell Prelude. In this post we'll peel back the layers, learning why foldr is a lot more complicated under the hood.

Layer 1: Wikipedia definition

The definition on Wikipedia is:

foldr :: (a -> b -> b) -> b -> [a] -> b
foldr f z []     = z
foldr f z (x:xs) = f x (foldr f z xs)

This recursive definition directly describes what foldr does. Given a list [1,2,3] we get f 1 (f 2 (f 3 z)).

Layer 2: Static argument transformation

The problem with this definition is that it is recursive, and GHC doesn't like to inline recursive functions, which prevents a lot of optimisation. Taking a look at sum, it's a real shame that operations like (+) are passed as opaque higher-order functions, rather than specialised to the machine instruction ADD. To solve that problem, GHC defines foldr as:

foldr f z = go
    where go []     = z
          go (x:xs) = f x (go xs)

The arguments f and z are constant in all sucessive calls, so they are lifted out with a manually applied static argument transformation.

Now the function foldr is no longer recursive (it merely has a where that is recursive), so foldr can be inlined, and now + can meet up with go and everything can be nicely optimised.

Layer 3: Inline later

We now have foldr that can be inlined. However, inlining foldr is not always a good idea. In particular, GHC has an optimisation called list fusion based on the idea that combinations of foldr and build can be merged, sometimes known as short-cut deforestation. The basic idea is that if we see foldr applied to build we can get rid of both (see this post for details). We remove foldr using the GHC rewrite rule:

{-# RULES "my foldr/build" forall g k z. foldr k z (build g) = g k z #-}

The most interesting thing about this rule (for this post at least!) is that it matches foldr by name. Once we've inlined foldr we have thrown away the name, and the rule can't fire anymore. Since this rule gives significant speedups, we really want it to fire, so GHC adds an extra pragma to foldr:

{-# INLINE [0] foldr #-}

This INLINE pragma says don't try and inline foldr until the final stage of the compiler, but in that final stage, be very keen to inline it.

Layer 4: More polymorphism

However, the foldr function in the Prelude is not the one from GHC.List, but actually a more general one that works for anything Foldable. Why limit yourself to folding over lists, when you can fold over other types like Set. So now foldr is generailsed from [] to t with:

foldr :: Foldable t => (a -> b -> b) -> b -> t a -> b

Where foldr on [] is GHC.List.foldr.

Layer 5: A default implementation

But foldr is actually in the type class Foldable, not just defined on the outside. Users defining Foldable can define only foldr and have all the other methods defined for them. But they can equally define only foldMap, and have an implicit version of foldr defined as:

foldr :: (a -> b -> b) -> b -> t a -> b
foldr f z t = appEndo (foldMap (Endo . f) t) z

Where Endo is defined as:

newtype Endo = Endo {appEndo :: a -> a}

instance Monoid (Endo a) where
    mempty = Endo id
    Endo a <> Endo b = Endo (a . b)

The function foldMap f is equivalent to mconcat . map f, so given a list [1,2,3] the steps are:

  • First apply map (Endo . f) to each element to get [Endo (f 1), Endo (f 2), Endo (f 3)].
  • Next apply mconcat to the list to get Endo (f 1) <> Endo (f 2) <> Endo (f 3).
  • Inline all the <> definitions to get Endo (f 1 . f 2 . f 3).
  • Apply the appEndo at the beginning and z at the end for (f 1 . f 2 . f 3) z.
  • Inline all the . to give f 1 (f 2 (f 3 z)), which is what we had at layer 1.

Layer 6: Optimising the default implementation

The real default implementation of foldr is:

foldr f z t = appEndo (foldMap (Endo #. f) t) z

Note that the . after Endo has become #.. Let's first explain why it's correct, then why it might be beneficial. The definition of #. is:

(#.) :: Coercible b c => (b -> c) -> (a -> b) -> (a -> c)
(#.) _ = coerce

Note that it has the same type as . (plus a Coercible constraint), but ignores it's first argument entirely. The coerce function transforms a value of type a into a value of type b with zero runtime cost, provided they have the same underlying representation. Since Endo is a newtype, that means Endo (f 1) and f 1 are implemented identically in the runtime, so coerce switches representation "for free". Note that the first argument to #. only serves to pin down the types, so if we'd passed an interesting function as the first argument it would have been ignored.

Of course, in normal circumstances, a newtype is free anyway, with no runtime cost. However, in this case we don't have a newtype, but a function application with a newtype. You can see the gory details in GHC ticket 7542, but at one point this impeeded other optimisations.

I tried a simulated version of foldr and found that if GHC can't tell that [] is the Foldable the code looks pretty bad, but if it can, at -O1 and above, the two implementations are 100% equivalent (to the point that common subexpression elimination makes them actually the same). It's possible this final layer is a vestigial optimisation, or perhaps it's still important in some circumstances.