Sunday, May 21, 2017

Proving fib equivalence

Summary: Using Idris I proved the exponential and linear time fib functions are equivalent.

The Haskell wiki proclaims that Implementing the Fibonacci sequence is considered the "Hello, world!" of Haskell programming. It also provides multiple versions of fib - an exponential version and a linear version. We can translate these into Idris fairly easily:

fibExp : Nat -> Nat
fibExp Z = 0
fibExp (S Z) = 1
fibExp (S (S n)) = fibExp (S n) + fibExp n

fibLin' : Nat -> Nat -> Nat -> Nat
fibLin' Z a b = b
fibLin' (S n) a b = fibLin' n (a + b) a

fibLin : Nat -> Nat
fibLin n = fibLin' n 1 0

We've made the intermediate go function in fibLin top-level, named it fibLib' and untupled the accumulator - but it's still fundamentally the same. Now we've got the power of Idris, it would be nice to prove that fibExp and fibLin are equivalent. To do that, it's first useful to think about why fibLib' works at all. In essence we're decrementing the first argument, and making the next two arguments be fib of increasingly large values. If we think more carefully we can come up with the invariant:

fibLinInvariant : (d : Nat) -> (u : Nat) ->
    fibLin' d (fibExp (1+u)) (fibExp u) = fibExp (d+u)

Namely that at all recursive calls we must have the fib of two successive values (u and u+1), and after d additional loops we end up with fib (d+u). Actually proving this invariant isn't too hard:

fibLinInvariant Z u = Refl
fibLinInvariant (S d) u =
    rewrite fibLinInvariant d (S u) in
    rewrite plusSuccRightSucc d u in
    Refl

Idris simplifies the base case away for us. For the recursive case we apply the induction hypothesis to leave us with:

fibExp (plus d (S u)) = fibExp (S (plus d u))

Ignoring the fibExp we just need to prove d+(u+1) = (d+u)+1. That statement is obviously true because + is associative, but in our particular case, we use plusSuccRightSucc which is the associative lemma where +1 is the special S constructor. After that, everything has been proven.

Armed with the fundamental correctness invariant for fibLib, we can prove the complete equivalence.

fibEq : (n : Nat) -> fibLin n = fibExp n
fibEq n =
    rewrite fibLinInvariant n 0 in
    rewrite plusZeroRightNeutral n in
    Refl

We appeal to the invariant linking fibLin' and finExp, do some minor arithmetic manipulation (x+0 = x), and are done. Note that depending on exactly how you define the fibLinInvariant you require different arithmetic lemmas, e.g. if you write d+u or u+d. Idris is equipped with everything required.

I was rather impressed that proving fib equivalence wasn't all that complicated, and really just requires figuring out what fundamental property makes fibLin actually work. In many ways the proof makes things clearer.

Tuesday, May 16, 2017

Idris reverse proofs

Summary: I proved O(n) reverse is equivalent to O(n^2) reverse.

Following on from my previous post I left the goal of given two reverse implementations:

reverse1 : List a -> List a
reverse1 [] = []
reverse1 (x :: xs) = reverse1 xs ++ [x]

rev2 : List a -> List a -> List a
rev2 acc [] = acc
rev2 acc (x :: xs) = rev2 (x :: acc) xs

reverse2 : List a -> List a
reverse2 = rev2 []

Proving the following laws hold:

proof_reverse1_reverse1 : (xs : List a) -> xs = reverse1 (reverse1 xs)
proof_reverse2_reverse2 : (xs : List a) -> xs = reverse2 (reverse2 xs)
proof_reverse1_reverse2 : (xs : List a) -> reverse1 xs = reverse2 xs

The complete proofs are available in my Idris github repo, and if you want to get the most out of the code, it's probably best to step through the types in Idris. In this post I'll talk through a few of the details.

Directly proving the reverse1 (reverse1 x) = x by induction is hard, since the function doesn't really follow so directly. Instead we need to define a helper lemma.

proof_reverse1_append : (xs : List a) -> (ys : List a) ->
    reverse1 (xs ++ ys) = reverse1 ys ++ reverse1 xs

Coming up with these helper lemmas is the magic of writing proofs - and is part intuition, part thinking about what might be useful and part thinking about what invariants are obeyed. With that lemma, you can prove by induction on the first argument (which is the obvious one to chose since ++ evaluates the first argument first). Proving the base case is trivial:

proof_reverse1_append [] ys =
    rewrite appendNilRightNeutral (reverse1 ys) in
    Refl

The proof immediately reduces to reverse1 ys == reverse1 ys ++ [] and we can invoke the standard proof that if ++ has [] on the right we can ignore it. After applying that rewrite, we're straight back to Refl.

The :: is not really any harder, we immediately get to reverse1 ys ++ (reverse1 xs ++ [x]), but bracketed the wrong way round, so have to apply appendAssociative to rebracket so it fits the inductive lemma. The proof looks like:

proof_reverse1_append (x :: xs) ys =
    rewrite appendAssociative (reverse1 ys) (reverse1 xs) [x] in
    rewrite proof_reverse1_append xs ys in Refl

Once we have the proof of how to move reverse1 over ++ we use it inductively to prove the original lemma:

proof_reverse1_reverse1 : (xs : List a) -> xs = reverse1 (reverse1 xs)
proof_reverse1_reverse1 [] = Refl
proof_reverse1_reverse1 (x :: xs) =
    rewrite proof_reverse1_append (reverse1 xs) [x] in
    rewrite proof_reverse1_reverse1 xs in
    Refl

For the remaining two proofs, I first decided to prove reverse1 x = reverse2 x and then prove the reverse2 (reverse2 x) = x in terms of that. To prove equivalence of the two functions I first needed information about the fundamental property that rev2 obeys:

proof_rev : (xs : List a) -> (ys : List a) ->
    rev2 xs ys = reverse2 ys ++ xs

Namely that the accumulator can be tacked on the end. The proof itself isn't very complicated, although it does require two calls to the inductive hypothesis (you basically expand out rev2 on both sides and show they are equivalent):

proof_rev xs [] = Refl
proof_rev xs (y :: ys) =
    rewrite proof_rev [y] ys in
    rewrite sym $ appendAssociative (rev2 [] ys) [y] xs in
    rewrite proof_rev (y :: xs) ys in
    Refl

The only slight complexity is that I needed to apply sym to switch the way the appendAssocative proof is applied. With that proof available, proving equivalence isn't that hard:

proof_reverse1_reverse2 : (xs : List a) -> reverse1 xs = reverse2 xs
proof_reverse1_reverse2 [] = Refl
proof_reverse1_reverse2 (x :: xs) =
    rewrite proof_rev [x] xs in
    rewrite proof_reverse1_reverse2 xs in
    Refl

In essence the proof_rev term shows that rev behaves like the O(n^2) reverse.

Finally, actually proving that reverse2 (reverse2 x) is true just involves replacing all the occurrences of reverse2 with reverse1, then applying the proof that the property holds for reverse1. Nothing complicated at all:

proof_reverse2_reverse2 : (xs : List a) -> xs = reverse2 (reverse2 xs)
proof_reverse2_reverse2 xs =
    rewrite sym $ proof_reverse1_reverse2 xs in
    rewrite sym $ proof_reverse1_reverse2 (reverse1 xs) in
    rewrite proof_reverse1_reverse1 xs in
    Refl

If you've got this far and are still hungry for more proof exercises I recommend Exercises on Generalizing the Induction Hypothesis which I have now worked through (solutions if you want to cheat).

Sunday, May 07, 2017

Proving stuff with Idris

Summary: I've been learning Idris to play around with writing simple proofs. It seems pretty cool.

The Idris programming language is a lot like Haskell, but with quite a few cleanups (better export syntax, more though out type classes), and also dependent types. I'm not a massive fan of dependent types as commonly presented - I want to write my code so it's easy to prove, not intertwine my beautiful code with lots of invariants. To try out Idris I've been proving some lemmas about list functions. I'm very much an Idris beginner, but thought I would share what I've done so far, partly to teach, partly to remember, and party to get feedback/criticism about all the beginner mistakes I've made.

Before proving stuff, you need to write some definitions. Idris comes with an append function (named ++ as you might expect), but to keep everything simple and out in the open I've defined:

append : List a -> List a -> List a
append [] ys = ys
append (x :: xs) ys = x :: append xs ys

Coming from Haskell the only real difference is that cons is :: and types are :. Given how Haskell code now looks, that's probably the right way around anyway. We can load that code in the Idris interactive environment and run it:

$ idris Main.idr
Main> append [1,2] [3]
[1,2,3] : List Integer

What we can also do, but can't easily do in Haskell, is start proving lemmas about it. We'd like to prove that append xs [] = xs, so we write a proof:

proof_append_nil : (xs : List a) -> append xs [] = xs
proof_append_nil xs = ?todo

We name the proof proof_append_nil and then say that given xs (of type List a) we can prove that append xs [] = xs. That's pretty direct and simple. Now we have to write the proof - we first write out the definition leaving ?todo where we want the proof to go. Now we can load the proof in Idris and type :t todo to get the bit of the proof that is required, and Idris says:

todo : append xs [] = xs

That's fair - we haven't proven anything yet. Since we know the proof will proceed by splitting apart the xs we can rewrite as:

proof_append_nil [] = ?todo_nil
proof_append_nil (x :: xs) = ?todo_cons

We can now ask for the types of the two remaining todo bits:

todo_nil : [] = []
todo_cons : x :: append xs [] = x :: xs

The first todo_nil is obviously true since the things on either side of the equality match, so we can replace it with Refl. The second statement looks like the inductive case, so we want to apply proof_append_nil to it. We can do that with rewrite proof_append_nil xs, which expands to append xs [] = xs, and rewrites the left-hand-side of the proof to be more like the right. Refined, we end up with:

proof_append_nil [] = Refl
proof_append_nil (x :: xs) = rewrite proof_append_nil xs in ?todo_cons

Reloading again and asking for the type of todo_cons we get:

todo_cons : x :: xs = x :: xs

These things are equal, so we replace todo_cons with Refl to get a complete proof of:

proof_append_nil : (xs : List a) -> append xs [] = xs
proof_append_nil [] = Refl
proof_append_nil (x :: xs) = rewrite proof_append_nil xs in Refl

Totality Checking

Proofs are only proofs if you have code that terminates. In Idris, seemingly sprinkling the statement:

%default total

At the top of the file turns on the totality checker, which ensures the proof is really true. With the statement turned on I don't get any warnings about totality issues, so we have proved append really does have [] as a right-identity.

Avoiding rewrite

The rewrite keyword seems like a very big hammer, and in our case we know exactly where we want to apply the rewrite. Namely at the end. In this case we could have equally written:

cong (proof_append_nil xs)

Not sure which would be generally preferred style in the Idris world, but as the proofs get more complex using rewrite certainly seems easier.

Differences from Haskell

Coming from a Haskell background, and sticking to simple things, the main differences in Idris were that modules don't have export lists (yay), lists are :: and types are : (yay), functions can only use functions defined before them (sad, but I guess a restriction to make dependent types work) and case/lambda both use => instead of -> (meh).

Next steps

For my first task in Idris I also defined two reverse functions:

reverse1 : List a -> List a
reverse1 [] = []
reverse1 (x :: xs) = reverse1 xs `append` [x]

rev2 : List a -> List a -> List a
rev2 acc [] = acc
rev2 acc (x :: xs) = rev2 (x :: acc) xs

reverse2 : List a -> List a
reverse2 = rev2 []

The first reverse1 function is reverse in O(n^2), the second is reverse in O(n). I then tried to prove the three lemmas:

proof_reverse1_reverse1 : (xs : List a) -> xs = reverse1 (reverse1 xs)
proof_reverse2_reverse2 : (xs : List a) -> xs = reverse2 (reverse2 xs)
proof_reverse1_reverse2 : (xs : List a) -> reverse1 xs = reverse2 xs

Namely that both reverse1 and reverse2 are self inverses, and then that they are equivalent. To prove these functions required a few helper lemmas, but only one additional Idris function/feature, namely:

sym : (left = right) -> right = left

A function which transforms a proof of equality from one way around to the other way around. I also required a bunch of helper lemmas, including:

proof_append_assoc : (xs : List a) -> (ys : List a) -> (zs : List a) -> append xs (append ys zs) = append (append xs ys) zs

Developing these proofs in Idris took me about ~2 hours, so were a nice introductory exercise (with the caveat that I've proven these lemmas before, although not in 4+ years). I'd invite anyone interested in learning this aspect of Idris to have a go, and I'll post my proofs sometime in the coming week.

Update: additional notes on this material can be found here from Reddit user chshersh.

Friday, April 28, 2017

HLint on Travis/Appveyor

Summary: Running HLint on your CI is now quick and easy.

I've always wanted to run HLint on my continuous integration servers (specifically Travis for Linux and Appveyor for Windows), to automatically detect code that could be improved. That has always been possible, and packages like lens and freer-effects have done so, but it was unpleasant for two reasons:

Setting up a custom HLint settings file and applying these settings was a lot of upfront work.
Building hlint on the CI server could be quite slow.

With HLint v2.0.4, both of these issues are addressed. I am now running HLint as standard for many of my projects. The two steps are outlined below.

Setting up custom HLint settings

Locally run hlint . --default > .hlint.yaml and it will generate a file which ignores all hints your project currently triggers. If you then run hlint . there will be no warnings, as the ignore settings will automatically be picked up. Check in .hlint.yaml.

Later, as a future step, you may wish to review your .hlint.yaml file and fix some of the warnings.

Running HLint on the server

There are now precompiled binaries at GitHub, along with scripts to download and execute them for each CI system. In both cases below, replace ARGUMENTS with your arguments to hlint, e.g. . to check the current directory.

On Travis, execute the following command:

wget https://raw.github.com/ndmitchell/hlint/master/misc/travis.sh -O - --quiet | sh -s ARGUMENTS

On Appveyor, add the following statements to .appveyor.yml:

- set PATH=C:\Program Files\Git\mingw64\bin;%PATH%
- curl -ohlint.bat -L https://raw.githubusercontent.com/ndmitchell/hlint/master/misc/appveyor.bat
- hlint ARGUMENTS

Since these are precompiled binaries the additional time required to run HLint should be minimal.

Thursday, April 06, 2017

HLint 2.0 - with YAML configuration

Summary: I've just released HLint 2.0, which lets you configure the rules with a YAML file.

I've just released HLint 2.0 to Hackage. HLint has always been configurable by writing a specially crafted Haskell file (e.g. to ignore certain hints or add new ones). With HLint 2.0 we now support YAML configuration files, and moreover encourage them over the Haskell format.

New Features of 2.0

Perhaps the most useful feature of this version is that HLint will search the current directory upwards for a .hlint.yaml file containing configuration settings. If a project includes a .hlint.yaml in the root then you won't need to pass it on the command line, and there's a reasonable chance any editor integration will pick it up as well. This idea was shamelessly stolen from Stylish Haskell.

HLint configuration files have also moved from Haskell syntax with special interpretation to YAML syntax. The Haskell syntax had become quite overloaded and was increasingly confused. The YAML syntax gives a fresh start with a chance to properly specify configuration directly rather than encoding it as Haskell expressions. The YAML configuration format enables a few features in this release, and going forward should enable more. To create a template .hlint.yaml file run hlint --default > .hlint.yaml. HLint continues to work without a configuration file.

In addition to a bunch of little hints, there is now a hint to suggest giving modules explicit export lists. I tend to always favour export lists, using module Foo(module Foo) where in the rare case they seem genuinely too much work. If you object to the hint, simply add to .hlint.yaml:

- ignore: {name: Use module export list}

On the other extreme, if you always want to require a complete and explicit export list (banning the trick above) do:

- warn: {name: Use explicit module export list}

Which enables the off-by-default hint requiring an explicit list. To see which off-by-default hints your program triggers pass the command line argument --show.

The biggest new hint is that you can now mark certain flags/extensions/imports/functions as being restricted - maintaining either a whitelist or blacklist and exceptions. As an example, HLint itself contains the restrictions:

- functions:
  - {name: unsafeInterleaveIO, within: Parallel}
  - {name: unsafePerformIO, within: [Util.exitMessageImpure, Test.Util.ref]}
  - {name: unsafeCoerce, within: []}

These unsafe functions can only be used in the modules/functions listed (or nowhere for unsafeCoerce), ensuring no code sneaks in unexpectedly. As another example:

- flags:
  - default: false
  - {name: [-fno-warn-incomplete-patterns, -fno-warn-overlapping-patterns]}

Here we have used default: false to say that any flags not explicitly allowed cannot be used. If someone accidentally checks in development code with {-# OPTIONS_GHC -w #-} then HLint will catch it. This restriction feature is particularly designed for code reviews.

What Might Have Broken

This release changes a lot of stuff. I am aware the following will no longer work:

The HLint2 API has been deleted - I don't think anyone was using it (the HLint and HLint3 ones remain unchanged). I'll continue to evolve the API in future releases.
The Haskell configuration file no longer has a way of importing other configuration files. At the same time I made it so the default configuration file and internal hints are always included, which I hopefully offsets most of the impact of ignoring hint imports.
I may have broken some other things by accident, but that's why there is a bug tracker.

At the moment the Haskell and YAML configuration files are both supported. However, I'll be phasing out the Haskell configuration in the future, and encourage people to move to the YAML. The conversion should be pretty simple.

Monday, April 03, 2017

Code Review Reviewed

Summary: I used to be mildly against code review on all merges. Now I'm for it. Code review is a good idea for knowledge sharing, not spotting bugs.

For my open-source projects, I accept pull requests from external contributors which I review, but my contributions are probably not routinely looked at by anyone else. I suspect most non-huge open-source projects follow the same pattern. For larger projects code review seems more common, but far from ubiquitous.

Until recently, I was in favour of occasional code review, e.g. towards the end of a project taking a look at the code. The problem with that approach is that it is hard to find the right time - at the end there is little appetite for large-scale refactoring and when the project is busy cranking out code there is no time and the code is still changing. While I was a fan of this idea in theory, it never worked in practice.

Some teams use code review on every merge to master, perhaps several times per person per day. That's something I used to disagree with, but I asked some smart people, and their explanations changed my opinion. The important realisation was thinking about what code review is for.

Checking the code works: NO. Code review is not about checking the code works and is bug free. I have tests, and leverage the type system, which is meant to convince me and other readers that my code is correct. For certain very sensitive bits a second set of eyes checking for bugs may be useful, but if all your code is that sensitive it will also be bug-ridden. However, code review may spot the absence or inadequacy of tests.
Checking simple rules and conventions: NO. Often projects have rules about which packages/extensions can be used, tab width and line length etc. While these checks can be carried out by a human, they are much better automated. Even when "code reviewing" students work at York University (aka marking) it quickly became clear that I was wasting time on trivialities, and thus wrote HLint to automate it.
High-level design: A BIT. Code review should think about the high-level design (e.g. should this be a web server or command line tool), but often the code is too detailed to properly elucidate these choices. High-level design should usually be done on a bug tracker or whiteboard before writing code.
Mid-level design: YES. Code review is often very good at challenging details around the mid-level design - e.g. does the choice of Text vs ByteString make sense, is there some abstraction that needs introducing. The author of some code is often heavily influenced by the journey they took to get to some point, by seeing code without that history a reviewer can often suggest a better approach.
Spreading knowledge: YES. Code review is great for spreading knowledge of the code to others in the team. Code review requires the code to be written in such a way that someone else can understand it, and ensures that there is someone else who is familiar with it. For larger projects that's invaluable when someone goes on holiday.
Team cohesion: YES Code review requires different members of the team to share their work with each other, and to have an appreciation of what other people are working through and the challenges it presents. It's all too easy to declare a problem "easy" without understanding the details, and code review removes that temptation.
Training: YES. Code review is also good for teaching new techniques and approaches to the submitter. As an example, some problems are trivial to solve with a generics library, yet some programmers aren't familiar with any - code review helps with ongoing training.

In contrast, I think the disadvantages of code review all revolve around the time required:

It takes other peoples time to do the review, when they could be doing something else.
It takes the submitters time waiting for a review - often the next step is hard to do without the previous step being agreed. As a result, I think code review work should be prioritised.
As a consequence of the two previous issues, code review changes the "unit of work", leading to more changes in a single bundle.

These issues are real, but I think the benefit of knowledge sharing alone outweighs the cost.

Friday, December 23, 2016

Fuzz testing Hexml with AFL

Summary: Hexml 0.1 could read past the end of the buffer for malformed documents. Fuzz testing detected that and I fixed it in Hexml 0.2.

I released Hexml, my fast DOM-based XML parser, and immediately Austin Seipp got suspicious. Here was a moderately large piece of C code, taking untrusted inputs, and poking around in the buffer with memcpy and memchr. He used American Fuzzy Lop (AFL) to fuzz test the Hexml C code, and came up with a number of issues, notably a buffer read overrun on the fragment:

<a b=:fallback

With a lot of help from Austin I setup AFL, fixed some issues with Hexml and with how AFL was being run, released Hexml 0.2 fixing these issues and incorporated AFL into my Travis CI builds.

If you want to actually follow all the steps on your computer, I recommend reading the original GitHub issue from Austin. Alternatively, checkout Hexml and run sh afl.sh.

Building and installing AFL

The first step was to build and install AFL from the tarball, including the LLVM pieces and libdislocator. The LLVM mode allows faster fuzzing, and the libdislocator library provides a library that makes all allocations next to a page boundary - ensuring that if there is a buffer read overrun it results in a segfault than AFL can detect.

An AFL test case

To run AFL you write a program that takes a filename as an argument and "processes" it. In my case that involves calling hexml_document_parse - the full version is online, but the salient bits are:

#include "hexml.c"
... other imports ...

int main(int argc, char** argv)
{
    __AFL_INIT();
    ... read file from argv[0] ...
    document *doc = hexml_document_parse(contents, length);
    hexml_document_free(doc);
    return 0;
}

Here I statically #include the hexml.c codebase and have a main function that calls __AFL_INIT (to make testing go faster), reads from the file, then parses/frees the document. If this code crashes, I want to know about it.

The original AFL driver code used __AFL_LOOP to speed things up further, but that results in a huge number of spurious failures, so I removed it.

Running AFL

To run AFL on my code requires compiling it with one AFL tool, then running it through another. The steps are:

AFL_HARDEN=1 afl-clang-fast -O2 -Icbits cbits/fuzz.c -o $PWD/hexml-fuzz
AFL_PRELOAD=/usr/local/lib/afl/libdislocator.so afl-fuzz -T hexml -x /usr/local/share/afl/dictionaries/xml.dict -i $PWD/xml -o $PWD/afl-results -- $PWD/hexml-fuzz @@

I compile with AFL_HARDEN to detect more bugs, producing hexml-fuzz. I run with libdislocator loaded so that my small buffer overrun turns into a fatal segfault. I give afl-fuzz a dictionary of common XML fragments and a few simple XML documents, then let it run over hexml-fuzz. The interactive UI shows bugs as they occur.

Fixing the bugs

Running AFL on Hexml 0.1 produced lots of bugs within a few seconds. Each bug produces an input file which I then ran through a debugger. While there were a few distinct bug locations, they all shared a common pattern. Hexml parses a NUL-terminated string, and in some cases I looked at a character that was potentially NUL and consumed it in the parsing. That might consume the final character, meaning that any further parsing was reading past the end of the string. I audited all such occurrences, fixed them, and reran AFL. Since then I have been unable to find an AFL bug despite lots of compute time.

Running on CI

I run all my code on Travis CI to ensure I don't introduce bugs, and to make accepting pull requests easier (I don't even need to build the code most of the time). Fortunately, running on Travis isn't too hard:

AFL_PRELOAD=/usr/local/lib/afl/libdislocator.so timeout 5m afl-fuzz -T hexml -x /usr/local/share/afl/dictionaries/xml.dict -i $PWD/xml -o $PWD/afl-results -- $PWD/hexml-fuzz @@ > /dev/null || true
cat afl-results/fuzzer_stats
grep "unique_crashes *: 0" afl-results/fuzzer_stats

I pipe the output of AFL to /dev/null since it's very long. I run for 5 minutes with timeout. After the timeout hits, I display the fuzzer_stats file and then grep for 0 crashes, failing if it isn't there.

Conclusions

Writing C code is hard, especially if it's performance orientated, and if it's not performance orientated you might want to consider a different language. Even if you don't want to use your code on untrusted input, sooner or later someone else will, and even tiny bugs can result in complete exploits. AFL does a remarkable job at detecting such issues and has made Hexml the better for it.

Monday, December 12, 2016

New XML Parser, Hexml

Summary: I've released a new Haskell library, Hexml, which is an incomplete-but-fast XML parser.

I've just released Hexml, a new C/Haskell library for DOM-style XML parsing that is fast, but incomplete. To unpack that a bit:

Hexml is an XML parser that you give a string representing an XML document, it parses that string, and returns either a parse error or a representation of that document. Once you have the document, you can get the child nodes/attributes, walk around the document, and extract the text.
Hexml is really a C library, which has been designed to be easy to wrap in Haskell, and then a Haskell wrapper on top. It should be easy to use Hexml directly from C if desired.
Hexml has been designed for speed. In the very limited benchmarks I've done it is typically just over 2x faster at parsing than Pugixml, where Pugixml is the gold standard for fast XML DOM parsers. In my uses it has turned XML parsing from a bottleneck to an irrelevance, so it works for me.
To gain that speed, Hexml cheats. Primarily it doesn't do entity expansion, so & remains as & in the output. It also doesn't handle CData sections (but that's because I'm lazy) and comment locations are not remembered. It also doesn't deal with most of the XML standard, ignoring the DOCTYPE stuff.

If you want a more robust version of Hexml then the Haskell pugixml binding on Hackage is a reasonable place to start, but be warned that it has memory issues, that can cause segfaults. It also requires C++ which makes use through GHCi more challenging.

Speed techniques

To make Hexml fast I first read the chapter on fast parsing with Pugixml, and stole all those techniques. After that, I introduced a number of my own.

I only work on UTF8, which for the bits of UTF8 I care about, is the same as ASCII - I don't need to do any character decoding.
Since I don't do entity expansion, all strings are available in the source document, so everything simply provides offsets into the input string. In the Haskell API I use constant-time bytestring slices into the source string to present a nice API.
The memory model for a document is an array of attributes, an array of nodes, and a root node from the list of nodes. To make sure that scanning a document is fast, each node describes their attributes and direct child nodes in terms of a start and length within the attribute and node arrays. For example, the root node might have attributes 1..5 in the attribute array, and direct children 4..19 in the node array. When scanning the child nodes there are no linked-list operations and everything is cache friendly.
To keep the memory compact for attributes, I just have an array and reallocate/copy as necessary. By always doubling the number of attributes on exhaustion I ensure a worst-case of 1-copy per attribute on average.
To keep the memory compact for nodes is a bit more complex, as the direct child nodes are not necessarily allocated consecutively, as child nodes may themselves have child nodes. The solution is to have an array of nodes, with contiguous allocation of used child nodes starting at the beginning. To ensure the child nodes are continguous I first put the nodes at the end of the array, then copy them after a child is complete -- in effect using the end of the array as a stack. By always doubling the number of nodes on exhaustion I ensure a worst-case of 2-copies per node on average.
When parsing the text in the body of a document, since I don't care about &, the only character that is of any interest is <. That allows me to process much of the document with the highly-optimised memchr.
I initially allocate a single buffer that contains the document, a small number of attributes and a small number of nodes, in a single call to malloc. If more attributes/nodes are required they allocate a fresh buffer and just ignore the initially provided one. That ensures that for small documents they don't pay for multiple malloc calls, at the cost of wasting the initial attribute/node allocation on larger documents (which are more memory heavy anyway - so it doesn't matter).
I'm pretty sure Hexml could be optimised further. Specifically, I have a recursive descent parser, and it should be a single function with goto. I also process some characters multiple times, mostly to ensure predictable abstraction barriers around the parsing functions, but that could be elimiated with a goto-based approach.

Installing the Haskell Network library on Windows

Summary: This post describes how to install the Haskell network library on Windows, again.

I recently bought a new computer, and tried to install GHC 8.0.1 then upgrade the network library using Cabal. As I have come to expect, it didn't work. Using Git Bash, I got the error:

$ cabal install network-2.6.3.1
Resolving dependencies...
Configuring network-2.6.3.1...
Failed to install network-2.6.3.1
Build log ( C:\Users\Neil\AppData\Roaming\cabal\logs\network-2.6.3.1.log ):
Configuring network-2.6.3.1...
configure: WARNING: unrecognized options: --with-compiler
checking for gcc... C:\ghc\GHC-80~1.1┼║
checking whether the C compiler works... no
configure: error: in `C:/Neil':
configure: error: C compiler cannot create executables
See `config.log' for more details
cabal: Leaving directory '.'
cabal.exe: Error: some packages failed to install:
old-time-1.1.0.3 failed during the configure step. The exception was:
ExitFailure 77

Running -v3 shows the CC variable is being set to C:\ghc\GHC-80~1.1┼║, which looks like a buffer corruption or encoding issue. I tried my previous solution, but it didn't work. My new solution is:

$ cabal unpack network-2.6.3.1
$ cd network-2.6.3.1
$ cabal configure
... fails with a similar error to above ...
$ sh ./configure
$ cabal build
$ cabal copy
$ cabal register

I had to repeat the same pattern for the latest version of old-time, and the same pattern worked.

Another way that works is to use Stack.

Tuesday, December 06, 2016

Undefined Behaviour in C

Summary: I tripped over undefined behaviour in C. It's annoying.

I've recently been writing some C code to parse XML quickly. While working on that project, I inadvertently wrote some code which is undefined according to the C language standard. The code compiled and ran fine using Visual Studio, but under gcc (even at -O0) it corrupted memory, sometimes leading to a segfault, but usually just leading to a wrong answer. The code in question was (see full code at GitHub):

d->nodes.nodes[0].nodes = parse_content(d);

To give some context, d is a structure that contains various pieces of state - what the string to be parsed is, how much we have parsed, along with a pointer to the output nodes. The parse_content function parses the bit inside an XML tag, returning the indicies in nodes which it used.

The complication comes from nodes not being a fixed size, but dynamically resized if the number of nodes exceeds the capacity. For big documents that means parse_content will reallocate d->nodes.nodes.

According to the C spec, the compiler can evaluate the LHS and RHS of an assignment in any order. Since gcc computes the location of d->nodes.nodes[0] before calling parse_content it uses the address of the node before reallocation. After reallocation the address will have changed, and the assignment will be made to the wrong location.

I spotted the bug by inserting printf statements, and in doing so, I had to rewrite the code to:

str content = parse_content(d);
d->nodes.nodes[0].nodes = content;

That fixes the issue, since now the evaluation order is strictly defined. As a simplified example of the same issue:

char* array;

char f() {
    array = malloc(42);
    return 'x';    
}

void test() {
    array = malloc(0);
    array[0] = f();
}

Here the line array[0] = f() might assign to either the result of malloc(0) or malloc(42), at the compilers discretion.

I manually checked if I had made any other such mistakes, and I couldn't find any. Naturally, I wanted to find a static checker that could detect such a mistake, so I tried a bunch of them. I wasn't very successful:

Visual Studio 2015 code analysis made me write assert after each malloc, but nothing further.
PVS Studio found nothing.
Clang undefined behaviour found nothing, and seemingly doesn't work on Windows.
GCC undefined behaviour found nothing, and seemingly doesn't work on Windows.
RV-Match hit a stack-overflow when running the program.

Wednesday, November 23, 2016

The Haskell.org Website Working Group (HWWG)

Haskell represents both a language and a user community - and moreover a fantastic community full of friends, fun, and deep technical debate. Unfortunately, in recent times the community has started to fracture, e.g. Cabal vs Stack, haskell.org vs haskell-lang.org. These divisions have risen above technical disagreements and at some points turned personal. The solution, shepherded by Simon Peyton Jones, and agreed to by both members of the haskell.org committee and the maintainers of haskell-lang.org, is to form the Haskell Website Working Group (HWWG). The charter of the group is at the bottom of this post.

The goal of the Haskell Website Working Group is to make sure the Haskell website caters to the needs of Haskell programmers, particularly beginners. In doing so we hope to either combine or differentiate haskell.org and haskell-lang.org, and give people clear recommendations of what "downloading Haskell" means. Concretely, we hope that either haskell-lang.org redirects to haskell.org, or that haskell-lang.org ends up being used for something very different from today.

The Haskell Website Working Group (HWWG)

Scope and goals

The HWWG is responsible for the design and content of the user-facing haskell.org web site, including tutorials, videos, news, resource, downloads, etc.
The HWWG is not responsible for:
- The infrastructure of haskell.org
- Toolchains, Hackage, compilers, etc
The HWWG focuses on serving users of Haskell, not suppliers of technology or libraries.
An explicit goal is to re-unite the haskell.org and haskell-lang.org web sites.

Expected mode of operation

HWWG is not responsible for actually doing everything! The web site is on github. Anyone can make a pull request. The general expectation is that uncontroversial changes will be accepted and committed without much debate.
If there is disagreement about a proposed change, it's up to the HWWG to engage in (open) debate, and to seek input as widely as time allows, but finally to make a decision.

Membership

Initial membership comprises of:

Neil Mitchell (chair)
Nicolas Wu
Andrew Cowie
Vincent Hanquez
Ryan Trinkle
Chris Done

It is expected the committee will change over time, but the mechanism has not yet been thought about.

Rules of engagement

Recognising that honestly-held judgements may differ, we will be scrupulously polite both in public and in private.
Recognising that Haskell has many users, and that different users have different needs and tastes, we want haskell.org to be inclusive rather than exclusive, providing a variety of alternative resources (toolchains, tutorials, books, etc) clearly signposted with their intended audiences.
Ultimately the haskell.org committee owns the haskell.org URL, but it delegates authority for the design and content of the web site to the HWWG. In extremis, if the haskell.org committee believes that the HWWG is mismanaging the web site, it can revoke that delegation.

Thursday, September 29, 2016

Full-time Haskell jobs in London, at Barclays

Update: I no longer work for Barclays and am no longer hiring.

Summary: I'm hiring 9 Haskell programmers. Email neil.d.mitchell AT barclays.com to apply.

I work for Barclays, in London, working on a brand new Haskell project. We're looking for nine additional Haskell programmers to come and join the team.

What we offer

A permanent job, writing Haskell, using all the tools you know and love – GHC/Cabal/Stack etc. In the first two weeks in my role I've already written parsers with attoparsec, both Haskell and HTML generators and am currently generating a binding to C with lots of Storable/Ptr stuff. Since it's a new project you will have the chance to help shape the project.

The project itself is to write a risk engine – something that lets the bank calculate the value of the trades it has made, and how things like changes in the market change their value. Risk engines are important to banks and include lots of varied tasks – talking to other systems, overall structuring, actual computation, streaming values, map/reduce.

We'll be following modern but lightweight development principles – including nightly builds, no check-ins to master which break the tests (enforced automatically) and good test coverage.

These positions have attractive salary levels.

What we require

We're looking for the best functional programmers out there, with a strong bias towards Haskell. We have a range of seniorities available to suit even the most experienced candidates. We don't have anything at the very junior end; instead we're looking for candidates that are already fluent and productive. That said, a number of very good Haskell programmers think of themselves as beginners even after many years, so if you're not sure, please get in touch.

We do not require any finance knowledge.

The role is in London, Canary Wharf, and physical presence in the office on a regular basis is required – permanent remote working is not an option.

How to apply

To apply, email neil.d.mitchell AT barclays.com with a copy of your CV. If you have any questions, email me.

The best way to assess technical ability is to look at code people have written. If you have any packages on Hackage or things on GitHub, please point me at the best projects. If your best code is not publicly available, please describe the Haskell projects you've been involved in.

Sunday, August 14, 2016

The Four Flaws of Haskell

Summary: Last year I made a list of four flaws with Haskell. Most have improved significantly over the last year.

No language/compiler ecosystem is without its flaws. A while ago I made a list of the flaws I thought might harm people using Haskell in an industrial setting. These are not flaws that impact beginners, or flaws that stop people from switching to Haskell, but those that might harm a big project. Naturally, everyone would come up with a different list, but here is mine.

Package Management: Installing a single consistent set of packages used across a large code base used to be difficult. Upgrading packages within that set was even worse. On Windows, anything that required a new network package was likely to end in tears. The MinGHC project solved the network issue. Stackage solved the consistent set of packages, and Stack made it even easier. I now consider Haskell package management a strength for large projects, not a risk.

IDE: The lack of an IDE certainly harms Haskell. There are a number of possibilities, but whenever I've tried them they have come up lacking. The fact that every Haskell programmer has an entrenched editor choice doesn't make it an easier task to solve. Fortunately, with Ghcid there is at least something near the minimum acceptable standard for everyone. At the same time various IDE projects have appeared, notably the Haskell IDE Engine and Intero. With Ghcid the lack of an IDE stops being a risk, and with the progress on other fronts I hope the Haskell IDE story continues to improve.

Space leaks: As Haskell programs get bigger, the chance of hitting a space leak increases, becoming an inevitability. While I am a big fan of laziness, space leaks are the big downside. Realising space leaks were on my flaws list, I started investigating methods for detecting space leaks, coming up with a simple detection method that works well. I've continued applying this method to other libraries and tools. I'll be giving a talk on space leaks at Haskell eXchange 2016. With these techniques space leaks don't go away, but they can be detected with ease and solved relatively simply - no longer a risk to Haskell projects.

Array/String libraries: When working with strings/arrays, the libraries that tend to get used are vector, bytestring, text and utf8-string. While each are individually nice projects, they don't work seamlessly together. The utf8-string provides UTF8 semantics for bytestring, which provides pinned byte arrays. The text package provides UTF16 encoded unpinned Char arrays. The vector package provides mutable and immutable vectors which can be either pinned or unpinned. I think the ideal situation would be a type that was either pinned or unpinned based on size, where the string was just a UTF8 encoded array with a newtype wrapping. Fortunately the foundation library provides exactly that. I'm not brave enough to claim a 0.0.1 package released yesterday has derisked Haskell projects, but things are travelling in the right direction.

It has certainly been possible to use Haskell for large projects for some time, but there were some risks. With the improvements over the last year the remaining risks have decreased markedly. In contrast, the risks of using an untyped or impure language remain significant, making Haskell a great choice for new projects.

Thursday, August 04, 2016

Upcoming talk: Writing build systems with Shake, 16 Aug 2016, London

Summary: I'm giving a talk on Shake.

I'm delighted to announce that I'll be giving a talk/hack session on Shake as part of the relatively new "Haskell Hacking London" meetup.

Title: Writing build systems with Shake

Date: Tuesday, August 16, 2016. 6:30 PM

Location: Pusher Office, 28 Scrutton Street, London

Abstract: Shake is a general purpose library for expressing build systems - forms of computation, with caching, dependencies and more besides. Like all the best stuff in Haskell, Shake is generic, with details such as "files" written on top of the generic library. Of course, the real world doesn't just have "files", but specifically has "C files that need to be compiled with gcc". In this hacking session we'll look at how to write Shake rules, what existing functions people have already layered on top of Shake for compiling with specific compilers, and consider which rules are missing. Hopefully by the end we'll have a rule that people can use out-of-the-box for compiling C++ and Haskell.

To put it another way, it's all about layering up. Haskell is a programming language. Shake is a Haskell library for dependencies, minimal recomputation, parallelism etc. Shake also provides as a layer on top (but inside the same library) to write rules about files, and ways to run command line tools. Shake doesn't yet provide a layer that compiles C files, but it does provide the tools with which you can write your own. The aim of this talk/hack session is to figure out what the next layer should be, and write it. It is definitely an attempt to move into the SCons territory of build systems, which knows how to build C etc. out of the box.

Monday, July 25, 2016

Maintainer required for Derive

Summary: I'm looking for a maintainer to take over Derive. Interested?

The Derive tool is a library and set of definitions for generating fragments of code in a formulaic way from a data type. It has a mechanism for guessing the pattern from a single example, plus a more manual way of writing out generators. It supports 35 generators out of the box, and is depended upon by 75 libraries.

The tool hasn't seen much love for some time, and I no longer use it myself. It requires somewhat regular maintenance to upgrade to new versions of GHC and haskell-src-exts. There are lots of directions the tool could be taken, more derivations, integration with the GHC Generic derivation system etc. There's a few generic pieces that could be broken off (translation between Template Haskell and haskell-src-exts, the guessing mechanism).

Anyone who is interested should comment on the GitHub ticket. In the absence of any volunteers I may continue to do the regular upgrade work, or may instead have it taken out of Stackage and stop upgrading it.

Monday, July 18, 2016

Why did Stack stop using Shake?

Summary: Stack originally used Shake. Now it doesn't. There are reasons for that.

The Stack tool originally used the Shake build system, as described on the page about Stack's origins. Recently Edward Yang asked why doesn't Stack still use Shake - a very interesting question. I've taken the information shared in that mailing list thread and written it up, complete with my comments and distortions/inferences.

Stack is all about building Haskell code, in ways that obey dependencies and perform minimal rebuilds. Already in Haskell the dependency story is somewhat muddied. GHC (as available through ghc --make) does advanced dependency tracking, including header includes and custom Template Haskell dependency directives. You can also run ghc in single-shot mode, compiling a file at a time, but the result is about 3x slower and GHC will still do some dependency tracking itself anyway. Layered on top of ghc --make is Cabal which is responsible for tracking dependencies with .cabal files, configured Cabal information and placing things into the GHC package database. Layered on top of that is Stack, which has multiple projects and needs to track information about which Stackage snapshot is active and shared build dependencies.

Shake is good at taking complex dependencies and hiding all the messy details. However, for Stack many of these messy details were the whole purpose of the project. When Michael Snoyman and Chris Done were originally writing Stack they didn't have much experience with Shake, and opted to go for simplicity and directly managing the pieces, which they viewed to be less risky.

Now that Stack is written, and works nicely, the question changes to if it is worth changing existing working code to make use of Shake. Interestingly, at the heart of Stack there is a "Shake-lite" - see Control.Concurrent.Execute. This piece could certainly be replaced by Shake, but what would the benefit be? Looking at it with my Shake implementers hat on, there are a few things that spring to mind:

This existing code is O(n^2) in lots of places. For the size of Stack projects, compared to the time required to compile Haskell, that probably doesn't matter.
Shake persists the dependencies, but the Stack code does not seem to. Would that be useful? Or is the information already persisted elsewhere? Would Shake persisting the information make stack builds which had nothing to do go faster? (The answer is almost certainly yes.)
Since the code is only used on one project it probably isn't as well tested as Shake, which has a lot of tests. On the other hand, it has a lot less features, so a lot less scope for bugs.
The code makes a lot of assumptions about the information fed to it. Shake doesn't make such assumptions, and thus invalid input is less likely to fail silently.
Shake has a lot of advanced dependency forms such as resources. Stack currently blocks when simultaneous configures are tried, whereas Shake would schedule other tasks to run.
Shake has features such as profiling that are not worth creating for a single project, but that when bundled in the library can be a useful free feature.

In some ways Stack as it stands avoids a lot of the best selling points about Shake:

If you have lots of complex interdependencies, Shake lets you manage
them nicely. That's not really the case for Stack, but is in large
heterogeneous build systems, e.g. the GHC build system.
If you are writing things quickly, Shake lets you manage
exceptions/retries/robustness quickly. For a project which has the
effort invested that Stack does, that's less important, but for things
like MinGHC (something Stack killed), it was critically important because no one cared enough to do all this nasty engineering.
If you are experimenting, Shake provides a lot of pieces (resources,
parallelism, storage) that help explore the problem space without
having to do lots of work at each iteration. That might mean Shake is
more of a benefit at the start of a project than in a mature project.

If you are writing a version of Stack from scratch, I'd certainly recommend thinking about using Shake. I suspect it probably does make sense for Stack to switch to Shake eventually, to simplify ongoing maintenance, but there's no real hurry.

Tuesday, July 05, 2016

More space leaks: Alex/Happy edition

Summary: Alex and Happy had three space leaks, now fixed.

Using the techniques described in my previous blog post I checked happy and alex for space leaks. As expected, both had space leaks. Three were clear and unambiguous space leaks, two were more nuanced. In this post I'll describe all five, starting with the obvious ones.

1: Happy - non-strict accumulating fold

Happy contains the code:

indexInto :: Eq a => Int -> a -> [a] -> Maybe Int
indexInto _ _ []                 = Nothing
indexInto i x (y:ys) | x == y    = Just i
                     | otherwise = indexInto (i+1) x ys

This code finds the index of an element in a list, always being first called with an initial argument of 0. However, as it stands, the first argument is a classic space leak - it chews through the input list, building up an equally long chain of +1 applications, which are only forced later.

The fix is simple, change the final line to:

let j = i + 1 in j `seq` indexInto j x ys

Or (preferably) switch to using the space-leak free Data.List.elemIndex. Fixed in a pull request.

2: Happy - sum using foldr

Happy also contained the code:

foldr (\(a,b) (c,d) -> (a+b,b+d)) (0,0) conflictList

The first issue is that the code is using foldr to produce a small atomic value, when foldl' would be a much better choice. Even after switching to foldl' we still have a space leak because foldl' only forces the outer-most value - namely just the pair, not the Int values inside. We want to force the elements inside the pair so are forced into the more painful construction:

foldl' (\(a,b) (c,d) ->
    let ac = a + c; bd = b + d
    in ac `seq` bd `seq` (ac,bd))
    (0,0) conflictList

Not as pleasant, but it does work. In some cases people may prefer to define the auxiliary:

let strict2 f !x !y = f x y
in foldr (\(a,b) (c,d) -> strict2 (,) (a+b) (b+d)) (0,0) conflictList

Fixed in a pull request.

3: Alex - lazy state in a State Monad

Alex features the code:

N $ \s n _ -> (s, addEdge n, ())

Here N roughly corresponds to a state monad with 2 fields, s and n. In this code n is a Map, which operates strictly, but the n itself is not forced until the end. We solve the problem by forcing the value before returning the triple:

N $ \s n _ -> let n' = addEdge n in n' `seq` (s, n', ())

Fixed in a pull request.

4: Alex - array freeze

Alex calls the Data.Array.MArray.freeze function, to convert an STUArray (unboxed mutable array in the ST monad) into a UArray (unboxed immutable array). Unfortunately the freeze call in the array library uses an amount of stack proportional to the size of the array. Not necessarily a space leak, but not ideal either. Looking at the code, it's also very inefficient, constructing and deconstructing lots of intermediate data. Fortunately under normal optimisation a rewrite rule fires for this type to replace the call with one to freezeSTUArray, which is much faster and has bounded stack, but is not directly exported.

Usually I diagnose space leaks under -O0, on the basis that any space leak problems at -O0 may eventually cause problems anyway if an optimisation opportunity is lost. In this particular case I had to -O1 that module.

5: Happy - complex fold

The final issue occurs in a function fold_lookahead, which when given lists of triples does an mconcat on all values that match in the first two components. Using the extra library that can be written as:

map (\((a,b),cs) -> (a,b,mconcat cs)) .
groupSort .
map (\(a,b,c) -> ((a,b),c))

We first turn the triple into a pair where the first two elements are the first component of the pair, call groupSort, then mconcat the result. However, in Happy this construction is encoded as a foldr doing an insertion sort on the first component, followed by a linear scan on the second component, then individual mappend calls. The foldr construction uses lots of stack (more than 1Mb), and also uses an O(n^2) algorithm instead of O(n log n).

Alas, the algorithms are not identical - the resulting list is typically in a different order. I don't believe this difference matters, and the tests all pass, but it does make the change more dangerous than the others. Fixed in a pull request.

The result

Thanks to Simon Marlow for reviewing and merging all the changes. After these changes Happy and Alex on the sample files I tested them with use < 1Kb of stack. In practice the space leaks discovered here are unlikely to materially impact any real workflows, but they probably go a bit faster.

Saturday, June 11, 2016

js-jquery 3.0.0 now out

The js-jquery Haskell library bundles the minified jQuery Javascript code into a Haskell package, so it can be depended upon by Cabal packages and incorporated into generated HTML pages. It does so in a way that doesn't require each Haskell package to bundle its own extra data files, and in a way that meets the licensing requirements of distributions such as Debian.

I've just released version 3.0.0, following on from jQuery 3.0.0 a few days ago. This release breaks compatibility with IE6-8, so if that's important to you, insert an upper bound on the package.

Friday, May 20, 2016

Another space leak: QuickCheck edition

Summary: QuickCheck had a space leak in property, now fixed (in HEAD).

Using the techniques described in my previous blog post I found another space leak, this time in QuickCheck, which has now been fixed. Using QuickCheck we can chose to "label" certain inputs, for example:

$ quickCheck $ \p -> label (if p > 0 then "+ve" else "-ve") True
+++ OK, passed 100 tests:
54% -ve
46% +ve

Here we label numbers based on their value, and at the end QuickCheck tells us how many were in each set. As you might expect, the underlying QuickCheck implementation contains a Map String Int to record how many tests get each label.

Unfortunately, the implementation in QuickCheck-2.8.1 has a space leak, meaning that the memory usage is proportional to the number of tests run. We can provoke such a space leak with:

quickCheckWithResult stdArgs{maxSuccess=10000} $
    \(p :: Double) -> label "foo" True

When running with ghc --make Main.hs -rtsopts && Main +RTS -K1K we get the error:

Main: Stack space overflow: current size 33624 bytes.

Using -K1K we have detected when we evaluate the space leak, at the end of the program, when trying to print out the summary statistics. The approach taken by QuickCheck for label is to generate a separate Map String Int per run, then at each step merge these Map values together using unionWith (+). As such, there are two likely culprits for the space leak:

Perhaps the Map is not evaluated, so in memory we have unionWith (+) x1 $ unionWith (+) x2 $ unionWith (+) x3 $ ....
Perhaps the values inside the Map are not evaluated, so in memory we have Map {"foo" = 1 + 1 + 1 + ...}.

QuickCheck avoids the first space leak by keeping its intermediate state in a record type with a strict field for the Map. QuickCheck suffers from the second problem. As usual, actually fixing the space leak is easy - just switch from importing Data.Map to Data.Map.Strict. The Strict module ensures that the computations passed to unionWith are forced before it returns, and the memory usage remains constant, not linear in the number of tests.

I detected this space leak because the Shake test suite runs with -K1K and when running one particular test on a Mac with GHC 8.0 in profiling mode it caused a stack overflow. I did not diagnose which of those factors was the ultimate cause (it may have even been the random seed at a particular point in time - only certain inputs call label).

Many space leaks are now easy to detect (using -K1K), moderate difficulty to debug (using the -xc technique or just by eye) and usually easy to fix.

Tuesday, April 19, 2016

New Shake with better wildcard patterns

Summary: The new version of Shake supports ** patterns for directory wildcards.

I've just released Shake 0.15.6. Don't be mislead by the 0.0.1 increment of the release, it's got over 50 entries in the changelog since the last release. There are quite a few bug fixes, documentation improvements and optimisations.

One of the most user visible features is the new wildcard patterns. In the previous version Shake supported // for matching any number of directories and * for matching within a path component, so to match all C source files in src you could write:

src//*.c

In the new version of Shake you can also write:

src/**/*.c

The // patterns remain supported, but I intend to encourage use of ** in new code if these patterns don't end up having any unforeseen problems. The advantages of the patterns in the new version are:

The ** patterns seem to be the defacto standard nowadays, being used by rsync, Ant, Gradle, Jenkins etc.
People would often write "src" </> "//*.c", which gives the unexpected result of //*.c. With ** you aren't overloading directories at the same time so everything works out as expected.
** patterns only match relative files, not absolute ones, which is what you usually want in a build system. If you want to match absolute files use */**.
The semantics of patterns were a bit confusing for things like /// - I've now given them precise semantics, but ** avoids this confusion.
I've optimised the pattern matching for both flavours, so there is more precomputation and less backtracking (in practice I don't think that makes any difference).
I've optimised directory traversal using a file pattern, so it doesn't list directories that can't possibly match, which gives a significant speedup.

For this release I've also improved the website at shakebuild.com with more documentation - hopefully it is useful.