Friday, October 18, 2019

Improving Rebindable Syntax

Summary: Rebindable syntax is powerful, but sometimes too flexible. I had some ideas on how to improve it.

In Haskell, when you write 1, GHC turns that into GHC.Num.fromInteger 1, knowing that the binding is GHC.Num.fromInteger :: Num a => Integer -> a. If you want to use a different fromInteger you can turn on the RebindableSyntax extension, which uses whichever fromInteger is in scope. While I was working at Digital Asset on DAML, we built a GHC-based compiler with a different standard library. That standard library eliminates the Char type, has a packed Text instead of String = [Char], doesn't have overloaded numeric literals, renames Monad to Action and other changes. To get that working, we leveraged RebindableSyntax along with a module DA.Internal.RebindableSyntax which is automatically imported into every module unqualified (via a GHC source plugin).

With RebindableSyntax you can get a long way building your own base library, but there were two unpleasant parts:

  • When using RebindableSyntax, if the user writes let fromInteger = undefined in 1 then they use the fromInteger they defined in the let, shadowing the global one. For users who turn on RebindableSyntax deliberately, that's what they want. However, if you want to replace the base library and make it feel "just as good", then you'd rather than fromInteger was always some specific library you point at.
  • In the process of building fresh base libraries, we had to follow all the (pretty complex!) layering choices that GHC has made about how the modules and packages form a directed acyclic graph. There are some modules where using integer literals would cause a module cycle to appear. The fact that a number of fully qualified names are hardcoded in GHC makes for a fairly tight coupling with the base libraries, that would be better avoided.

I had an idea to solve that, but it's not fully fleshed out, and as of now, it's not a problem I still suffer from. However, I thought it worth dumping my (potentially unimplementable, certainly incomplete) thoughts out for the world, in case someone wants to pick them up.

My idea to solve the problem was to add a flag to GHC such as -fbuiltins=base.Builtins which would specify where to get all builtins. You could expect the base library to gain a module Builtins which reexported everything like fromInteger. With that extension, using RebindableSyntax is then saying "whatever is in scope", and using -fbuiltins is saying "qualify everything with this name" - they start to become fairly similar ideas. I see that has having a few benefits:

  1. It becomes easier for someone to write a principled standard library which doesn't have String = [Char], or whatever choice wants making, in a way that provides a coherent experience. One example is DAML, but another is the foundation library, which uses RebindableSyntax in its example programs.
  2. The lowest level GHC base libraries can be restructured to use RebindableSyntax as a way to more easily manage the dependencies between them in the base libraries themselves, rather than a cross-cutting concern with the compiler and base libraries. (This benefit might be possible even today with what we already have. Some people might strongly disagree that it's a benefit.)
  3. Things like which integer library to use can become a library concern, rather than requiring compiler changes.
  4. Currently the code path for RebindableSyntax is always quite different from the normal syntax path. As a result, sometimes it's not quite right and needs patching.

The main obvious disadvantages (beyond potentially the whole thing not being feasible) are that it would cause the compiler to slow down, as currently these types are hard-wired into the compiler.

Sunday, October 13, 2019

Monads as Graphs

Summary: You can describe type classes like monads by the graphs they allow.

In the Build Systems a la Carte paper we described build systems in terms of the type class their dependencies could take. This post takes the other view point - trying to describe type classes (e.g. Functor, Applicative, Monad) by the graphs they permit.

Functor

The Functor class has one operation: given Functor m, we have fmap :: (a -> b) -> m a -> m b. Consequently, if we want to end up with an m b, we need to start with an m a and apply fmap to it, and can repeatedly apply multiple fmap calls. The kind of graph that produces looks like:

We've used circles for the values m a/m b etc and lines to represent the fmap that connects them. Functor supplies no operations to "merge" two circles, so our dependencies form a linear tree. Thinking as a build system, this represents Docker, where base images can be extended to form new images (ignoring the newer multi-stage builds).

Applicative

The Applicative class has two fundamental operations - pure :: a -> m a (which we ignore because its pretty simple) and liftA2 :: (a -> b -> c) -> m a -> m b -> m c (most people think of <*> as the other fundamental operation, but liftA2 is equivalent in power). Thinking from a graph perspective, we now have the ability to create a graph node that points at two children, and uses the function argument to liftA2 to merge them. Since Applicative is a superset of Functor, we still have the ability to point at one child if we want. Children can also be pointed at by multiple parents, which just corresponds to reusing a value. We can visualise that with:

The structure of an Applicative graph can be calculated before any values on the graph have been calculated, which can be more efficient for tasks like parsing or build systems. When viewed as a build system, this represents build systems like Make (ignoring dependencies on generated Makefiles) or Buck, where all dependencies are given up front.

Selective

The next type class we look at is Selective, which can be characterised by the operation ifS :: m Bool -> m a -> m a -> m a. From a graph perspective, Selective interrogates the value of the first node, and then selects either the second or third node. We can visualise that as:

We use two arrows with arrow heads to indicate that we must point at one of the nodes, but don't know which. Unlike before, we don't know exactly what the final graph structure will be until we have computed the value on the first node of ifS. However, we can statically over-approximate the graph by assuming both branches will be taken. In build system terms, this graph corresponds to something like Dune.

Monad

The final type class is Monad which can be characterised with the operation (>>=) :: m a -> (a -> m b) -> m b. From a graph perspective, Monad interrogates the value of the first node, and then does whatever it likes to produce a second node. It can point at some existing node, or create a brand new node using the information from the first. We can visualise that as:

The use of an arrow pointing nowhere seems a bit odd, but it represents the unlimited options that the Monad provides. Before we always knew all the possible structures of the graph in advance. Now we can't know anything beyond a monad-node at all. As a build system, this graph represents a system like Shake.

Monday, July 01, 2019

Thoughts for a Haskell IDE

Summary: We have been working on pieces for a Haskell IDE at Digital Asset.

At Digital Asset, we wrote the DAML programming language. The compiler builds on GHC, and one of the important tools for using DAML is an IDE. You can try the DAML IDE online or download it. Since we wrote the DAML IDE in Haskell, and DAML uses GHC under the hood, it's possible to take the work we did for the DAML IDE and turn them into pieces for a Haskell IDE. In the rest of this post I'll outline what we wrote, and how I think it can make a full Haskell IDE.

What has Digital Asset written?

We have written a Haskell library hie-core, which serves as the "core" of an IDE. It maintains state about which files are open. It generates diagnostics. It runs the parser and type checker. It doesn't figure out how to load your package, and it doesn't have integrations with things like HLint etc. In my view, it should never gain such features - it's deliberately a small core of an IDE, which can be extended with additional rules and handlers after-the-fact.

On the technical side, at the heart of the IDE is a key-value store, where keys are pairs of file names and stages (e.g. TypeCheck) and values are dependent on the stage. We use the Shake build system in memory-only mode to record dependencies between phases. As an example of a rule:

define $ \TypeCheck file -> do
    pm <- use_ GetParsedModule file
    deps <- use_ GetDependencies file
    tms <- uses_ TypeCheck (transitiveModuleDeps deps)
    packageState <- use_ GhcSession ""
    opt <- getIdeOptions
    liftIO $ Compile.typecheckModule opt packageState tms pm

To type check a file, we get the parse tree, the transitive dependencies, a GHC session, and then call a typecheckModule helper function. If any of these dependencies change (e.g. the source file changes) the relevant pieces will be rerun.

Building on top of Shake wasn't our first choice - we initially explored two painful dead ends. While Shake isn't perfect for what we want, it's about 90% of the way there, and having robust parallelism and many years of solid engineering is worth some minor compromises in a few places. Having all the features of Shake available has also been exceptionally helpful, allowing us to try out new things quickly.

What else is required for an IDE?

My hope is that hie-core can become the core of a future IDE - but what else is required?

  • Something to load up a GHC session with the right packages and dependencies in scope. For DAML, we have a custom controlled environment so it's very easy, but real Haskell needs a better solution. My hope is that hie-bios becomes the solution, since I think it has a great underlying design.
  • Some plugins to add features, such as the as-yet-unwritten hie-hlint and hie-ormolu. Since we add lots of features on top of hie-core to make the DAML IDE, we have a good story for extensions in hie-core. Importantly, because shake is designed to be extensible, these extensions can integrate with the full dependency graph.
  • Something to talk Language Server Protocol (LSP) to communicate with editors, for which we use the existing haskell-lsp.
  • An extension for your editor. We provide a VS Code extension as extension in hie-core, but it's a fairly boilerplate LSP implementation, and people have got it working for Emacs already.
  • Something to put it all together into a coherent project, generate it, distribute it etc. A project such as haskell-ide-engine might be the perfect place to bring everything together.

Can I try it now?

Yes - instructions here. I've been using hie-core as my primary Haskell development environment since ZuriHac two weeks ago, and I like it a lot. However, beware:

  • The IDE doesn't load all the relevant files, only the ones you have open.
  • Integration with things like stack doesn't work very well - I've been using hie-bios in "Direct" mode - giving it the flags to start ghci myself. See my integrations for shake and hlint.
  • Features like hs-boot files and Template Haskell need more work to be fully supported, although a bit of Template Haskell has been observed to work.

These issues are being discussed on the hie-bios issue tracker.

Hypothetical FAQ

Q: Is something like FRP better than Shake for describing dependencies? A: I think it's clear that an IDE should use some dependency/incremental computation/parallel rebuilding approach. Shake offers one of those, and is well tested, exception safe, performant etc. The mapping from Shake to what we really want is confined to a single module, so feel free to experiment with alternatives.

Q: Who has contributed? Many many people have contributed pieces, including the whole team at Digital Asset, in particular Tim Williams, David Millar-Durant, Neil Mitchell and Moritz Kiefer.

Q: What is the relationship to haskell-ide-engine? My hope is this piece can slot into the other great things that have been done to make IDE tooling better, specifically haskell-ide-engine. This post is intended to start that discussion.

Tuesday, June 18, 2019

The One PR Per Day Rule

Summary: The rough rule I use for teams I'm on is make at least one PR per day.

One of the principles I've used quite successfully in a number of teams I've been involved with is:

Make at least one Pull Request per day

This principle nicely captures a number of development practices I consider important.

  • Most things should be reflected in code. If you spend a day coding, improving documentation, writing tests etc. there is a natural reflection in the code. If you spend a day helping someone through some problems, that probably indicates there is better documentation to be written. If you spend a day doing dev-ops, that should probably be reflected with Terraform files or similar. Not everything that matters produces code (e.g. organising an office party, immigration paperwork, attending a conference), but most things do.

  • Work incrementally. If a piece of code takes more than one day, it's a good idea to split it into smaller pieces that can land incrementally. It's always possible that after a few days work you'll realise your overarching idea wasn't great, but if you've polished up some libraries and added tests along the way, that still produced value.

  • Work with autonomy. I'm a big fan of giving developers as much autonomy as possible - discuss the broad goals and then let them figure out the details. However, with such freedom, it's much easier for things to go off in the wrong direction. Seeing incremental pieces of code every day gives a fairly good direction indicator, and allows problems to surface before a massive time investment.

  • Write reviewable code. If you have 20K lines in one big blob, there's no realistic way to review it. By splitting code into smaller, manageable, independent units it's much easier to review. More importantly, the reviewer should be able to say "No, that's not a good idea" - doing that to a days work is sad, doing it to a whole months work is brutal.

  • Foster collaboration. In a rapidly moving project, it's important that everyone is benefiting from other peoples incremental improvements, as otherwise everyone solves the same problems. By getting the code merged every day it's much easier for different people to contribute to an area of the code base, avoiding the problem of others staying away from a piece of code that someone else is working on.

  • Get feedback. If the end user is able to test the results every day that's even better, as it means they can be involved in the feedback loop - potentially refining what they actually want.

The "rule" isn't really a rule, it's more a statement of culture and principles, but one I have found concise and simple to explain. While I like this as a statement of culture, I do not measure it, as that would create all the wrong incentives.

Monday, June 17, 2019

Shake from 10,000ft

Summary: A very high-level view of the engineering aspects of Shake.

The theory behind Shake is now well documented in the Build Systems a la Carte paper, but the engineering design of the system is not. This post is a high-level overview of Shake, from 10,000ft (the types are the types I'm thinking of in my mind - read the source code for the ground truth).

Data Storage

At it's heart, the Shake "database" (not really a database) stores a mapping from keys (k) to values (v). That mapping is stored on disk as a list of (k,v) pairs. When Shake computes a new value for k, it appends it to the end of the file. When Shake starts, it loads the list, keeping only the last value for any given k, and producing a Map k v. On disk, each entry is prefixed by its size, so if the file is truncated (e.g. machine shutdown), any trailing incomplete data can be discarded.

When operating in memory, Shake uses the data type Map k (Status v), with the definitions:

data Result v = Result
    {result :: v -- ^ the result associated with the Key
    ,built :: Step -- ^ when it was actually run
    ,changed :: Step -- ^ when it last changed
    ,depends :: [[Id]] -- ^ dependencies
    }

data Status v
    = Loaded (Result v)
    | Running (Either SomeException (Result v) -> IO ())
    | Error SomeException
    | Ready (Result v)

Data is loaded in the Loaded state. When someone demands a key it moves to Running - anyone who subsequently demands it will be added to the callback. After the run completes it becomes either Error or Ready.

Execution Model

Shake runs values in the Action monad, which is a combination of some global state (e.g. settings), per-rule state (e.g. dependencies), continuation monad with IO underlying everything. The execution model of Shake is that every Action computation is either blocked waiting for a Running to complete, or queued/executing with the thread pool. The thread pool has a list of things to do and runs them in a given priority order, respecting parallelism constraints. Most threads start up, do a bit of work, block on a Running and leave it for another item in the thread pool to continue them.

To pause a thread we use continuations, meaning the most important operation on Action (which isn't available to users!) is:

captureRAW :: ((Either SomeException a -> IO ()) -> IO ()) -> Action a

This function stops an Action in its tracks, resuming it when the continuation is called. One invariant of Shake, which is (sadly!) not tracked by the type system is that every continuation passed to captureRAW must be called exactly once.

The symmetry between Running and captureRAW is deliberate, and convenient.

To kick start the thread pool, the user specifies things to run with action - a Shake program completes when all those initial action calls have completed.

Rules

The final piece of the puzzle is what Shake actually does to build a key. The core of Shake is abstract over the k and v, but Shake ships with an outer layer of around ten additional rule types -- the most important of which is files. Taking an idealised (and inefficient) file rule, we can think of it as mapping from file paths to file contents. To provide such a rule, we first declare the type mapping:

type instance RuleResult FilePath = String

And then declare the rule. The rule says how to take the key (the filename), whether its dependencies have changed, the old value (the old contents), and produce a new value (the current contents). In addition, the rule must say if the contents have changed in a meaningful way, which causes anyone who depended on them to rebuild.

Shake programs typically call apply which builds a list of keys in parallel, moving all the keys to Loaded (or at least one to Error) before continuing.

Hidden Complexity

There's a number of pieces I haven't mentioned but which hide quite a lot of complexity:

  • Shake operates on any k/v pair, but serialising arbitrary values is hard, so Shake needs to build mapping and translation tables to make that work.
  • Many rules are defined in terms of pattern matches - e.g. **/*.c - that matching logic is tricky.
  • Many rules ultimately call command line programs, so a flexible command line execution API is required.
  • The rules that Shake ships with are highly optimised and have to operate in a variety of circumstances, e.g. with --skip flags etc, so have a lot of cases in them.
  • Shake goes to a lot of effort to make binary serialisation fast, as otherwise that turns into a bottleneck.
  • Exceptions, parallelism and continuations aren't natural bedfellows - the combination requires some care and attention.
  • There are lots of utility functions, UI concerns, profiling features etc.
  • There are lots of tests. Shake is 17K lines of code, of which 4.5K lines is tests.

The Picture Version

Stepping back, the picture diagram looks like:

For all gory details see the source code.

Thursday, June 13, 2019

HLint's path to the GHC parser

Summary: HLint is going to switch to the GHC parser over the next few months. The plan is below.

For some time, HLint has been accumulating a list of those files which are valid GHC Haskell but don't parse with haskell-src-exts. The list of differences keeps growing. While I have appreciated all the maintainers of haskell-src-exts, there have been a fair few of them recently, and the project has felt like it was in maintenance mode, rather than a vibrant project.

To solve this problem, I decided to switch to the GHC parser. However, the GHC parse tree changes significantly with each version of GHC, and HLint needs to support more than one version of GHC. The solution was ghc-lib - a decoupling of the GHC API, turning it into a reusable library. As of now, the latest haskell-src-exts maintainer has recommended people move to ghc-lib.

The plan for HLint is tracked in a GitHub issue. The first step was to switch so all files are parsed with both haskell-src-exts and ghc-lib - with a failure if either parser fails - that step has been completed and released (with much work from Shayne Fletcher, who is my partner in crime for this transition).

The next step was to abstract over the Language.Haskell.HLint3 API to produce a version that didn't fundamentally rely on the haskell-src-exts data types. That has led to the Language.Haskell.HLint4 API which makes things like parsed modules abstract, and removes functions that Aelve Codesearch showed weren't being used in practice (e.g. functions for approximate Scope resolution).

The next release will ship with a 0.1 breaking-change bump and HLint3 reexporting what is currently HLint4. If you think the HLint4 API does not include necessary functions, please let me know ASAP. After that release, we'll start changing hints one by one to use the GHC parse tree. Once that is complete, we will drop the dependency on haskell-src-exts and the project will be complete.

For command line users of HLint you should notice greater compatibility with GHC, but relatively little else.

Monday, May 20, 2019

Hoogle XSS Vulnerability

Summary: Hoogle 5.0.17.6 and below have an XSS vulnerability, fixed in later versions.

On Friday afternoon I got an email from Alexander Gugel with the subject line "Non-persistent XSS vulnerability on hoogle.haskell.org" - never a good thing to get. He had found that Hoogle was echoing the user search string back into the page, meaning that if you searched for %27"><marquee style you could make all the results scroll past in a disturbingly hypnotic manner. Oh dear!

Step 1: Fix the website

The first concern was to fix the website. While there aren't any cookies stored by Hoogle, and there are no logon forms or similar, the Project Zero blog has taught me that given the tiniest chink, everything can be broken. Fortunately, Alex emailed me using the email address on my webpage, described the problem, and provided a 3 line diff that escaped all the problematic variables. I applied this fix and pushed a new version to hoogle.haskell.org.

Step 2: Use the type system

Like any good Haskeller, my first thought on encountering a bug is to use the type system to prevent it by construction. The problem boils down to taking user input and splicing it into an HTML page. My initial fix was to introduce a type Taint:

newtype Taint a = Taint a

escapeUntaint :: Taint String -> String
escapeUntaint (Taint x) = escapeHTML x

The idea is that instead of the query parameters to the web page being String's that can be carelessly spliced into the output, they were Taint String values whose only real unwrapping function involves escaping any HTML they may contain. Furthermore, Taint can have instances for Monad etc, meaning you can work on tainted values, but the result will always remain tainted.

Using this approach uncovered no additional problems, but gave me much more confidence there weren't any I just hadn't found.

Step 3: Make a release

At this point I made a release of Hoogle 5.0.17.7. This version has no known XSS issues with it.

Step 4: Switch to blaze-html

While Taint is an effective tool for some domains, the real problem for Hoogle was that I was building up HTML values using String - making it way too easy to create invalid HTML, and providing an easy attack vector. The next change was to switch to blaze-html, which uses strong typing to ensure the HTML is always valid. Instead of having to call escapeHTML to turn bad String into good String, I instead used H.string to turn bad String into good Markup. For the rare case where there genuinely was String that contained HTML for good reasons I used H.preEscapedString, making the "don't escape" explicit and longer, and the "do escape" the default - a much safer default.

Step 5: Use Content Security Policy headers

There are a whole suite of headers that can be returned by the server to opt in to additional checking, known as CSP headers. These headers can ban inline script, detect XSS attacks, avoid confusion with MIME types, avoid http downgrade attacks and more. Thanks to Gary Verhaegen many of these are now applied to Hoogle, meaning that even if my code is wrong, the chances of it causing any damange (even just hypnotic scrolling) are much reduced.

Step 6: Relax

Hoogle 5.0.17.8 has all the security fixes listed and is deployed to hoogle.haskell.org. Hopefully no more security issues for a while!

Many thanks to Alexander Gugel for the responsible disclosure, and to Gary Verhaegen for his work on CSP headers.