Friday, April 28, 2017

HLint on Travis/Appveyor

Summary: Running HLint on your CI is now quick and easy.

I've always wanted to run HLint on my continuous integration servers (specifically Travis for Linux and Appveyor for Windows), to automatically detect code that could be improved. That has always been possible, and packages like lens and freer-effects have done so, but it was unpleasant for two reasons:

Setting up a custom HLint settings file and applying these settings was a lot of upfront work.
Building hlint on the CI server could be quite slow.

With HLint v2.0.4, both of these issues are addressed. I am now running HLint as standard for many of my projects. The two steps are outlined below.

Setting up custom HLint settings

Locally run hlint . --default > .hlint.yaml and it will generate a file which ignores all hints your project currently triggers. If you then run hlint . there will be no warnings, as the ignore settings will automatically be picked up. Check in .hlint.yaml.

Later, as a future step, you may wish to review your .hlint.yaml file and fix some of the warnings.

Running HLint on the server

There are now precompiled binaries at GitHub, along with scripts to download and execute them for each CI system. In both cases below, replace ARGUMENTS with your arguments to hlint, e.g. . to check the current directory.

On Travis, execute the following command:

wget https://raw.github.com/ndmitchell/hlint/master/misc/travis.sh -O - --quiet | sh -s ARGUMENTS

On Appveyor, add the following statements to .appveyor.yml:

- set PATH=C:\Program Files\Git\mingw64\bin;%PATH%
- curl -ohlint.bat -L https://raw.githubusercontent.com/ndmitchell/hlint/master/misc/appveyor.bat
- hlint ARGUMENTS

Since these are precompiled binaries the additional time required to run HLint should be minimal.

Thursday, April 06, 2017

HLint 2.0 - with YAML configuration

Summary: I've just released HLint 2.0, which lets you configure the rules with a YAML file.

I've just released HLint 2.0 to Hackage. HLint has always been configurable by writing a specially crafted Haskell file (e.g. to ignore certain hints or add new ones). With HLint 2.0 we now support YAML configuration files, and moreover encourage them over the Haskell format.

New Features of 2.0

Perhaps the most useful feature of this version is that HLint will search the current directory upwards for a .hlint.yaml file containing configuration settings. If a project includes a .hlint.yaml in the root then you won't need to pass it on the command line, and there's a reasonable chance any editor integration will pick it up as well. This idea was shamelessly stolen from Stylish Haskell.

HLint configuration files have also moved from Haskell syntax with special interpretation to YAML syntax. The Haskell syntax had become quite overloaded and was increasingly confused. The YAML syntax gives a fresh start with a chance to properly specify configuration directly rather than encoding it as Haskell expressions. The YAML configuration format enables a few features in this release, and going forward should enable more. To create a template .hlint.yaml file run hlint --default > .hlint.yaml. HLint continues to work without a configuration file.

In addition to a bunch of little hints, there is now a hint to suggest giving modules explicit export lists. I tend to always favour export lists, using module Foo(module Foo) where in the rare case they seem genuinely too much work. If you object to the hint, simply add to .hlint.yaml:

- ignore: {name: Use module export list}

On the other extreme, if you always want to require a complete and explicit export list (banning the trick above) do:

- warn: {name: Use explicit module export list}

Which enables the off-by-default hint requiring an explicit list. To see which off-by-default hints your program triggers pass the command line argument --show.

The biggest new hint is that you can now mark certain flags/extensions/imports/functions as being restricted - maintaining either a whitelist or blacklist and exceptions. As an example, HLint itself contains the restrictions:

- functions:
  - {name: unsafeInterleaveIO, within: Parallel}
  - {name: unsafePerformIO, within: [Util.exitMessageImpure, Test.Util.ref]}
  - {name: unsafeCoerce, within: []}

These unsafe functions can only be used in the modules/functions listed (or nowhere for unsafeCoerce), ensuring no code sneaks in unexpectedly. As another example:

- flags:
  - default: false
  - {name: [-fno-warn-incomplete-patterns, -fno-warn-overlapping-patterns]}

Here we have used default: false to say that any flags not explicitly allowed cannot be used. If someone accidentally checks in development code with {-# OPTIONS_GHC -w #-} then HLint will catch it. This restriction feature is particularly designed for code reviews.

What Might Have Broken

This release changes a lot of stuff. I am aware the following will no longer work:

The HLint2 API has been deleted - I don't think anyone was using it (the HLint and HLint3 ones remain unchanged). I'll continue to evolve the API in future releases.
The Haskell configuration file no longer has a way of importing other configuration files. At the same time I made it so the default configuration file and internal hints are always included, which I hopefully offsets most of the impact of ignoring hint imports.
I may have broken some other things by accident, but that's why there is a bug tracker.

At the moment the Haskell and YAML configuration files are both supported. However, I'll be phasing out the Haskell configuration in the future, and encourage people to move to the YAML. The conversion should be pretty simple.

Monday, April 03, 2017

Code Review Reviewed

Summary: I used to be mildly against code review on all merges. Now I'm for it. Code review is a good idea for knowledge sharing, not spotting bugs.

For my open-source projects, I accept pull requests from external contributors which I review, but my contributions are probably not routinely looked at by anyone else. I suspect most non-huge open-source projects follow the same pattern. For larger projects code review seems more common, but far from ubiquitous.

Until recently, I was in favour of occasional code review, e.g. towards the end of a project taking a look at the code. The problem with that approach is that it is hard to find the right time - at the end there is little appetite for large-scale refactoring and when the project is busy cranking out code there is no time and the code is still changing. While I was a fan of this idea in theory, it never worked in practice.

Some teams use code review on every merge to master, perhaps several times per person per day. That's something I used to disagree with, but I asked some smart people, and their explanations changed my opinion. The important realisation was thinking about what code review is for.

Checking the code works: NO. Code review is not about checking the code works and is bug free. I have tests, and leverage the type system, which is meant to convince me and other readers that my code is correct. For certain very sensitive bits a second set of eyes checking for bugs may be useful, but if all your code is that sensitive it will also be bug-ridden. However, code review may spot the absence or inadequacy of tests.
Checking simple rules and conventions: NO. Often projects have rules about which packages/extensions can be used, tab width and line length etc. While these checks can be carried out by a human, they are much better automated. Even when "code reviewing" students work at York University (aka marking) it quickly became clear that I was wasting time on trivialities, and thus wrote HLint to automate it.
High-level design: A BIT. Code review should think about the high-level design (e.g. should this be a web server or command line tool), but often the code is too detailed to properly elucidate these choices. High-level design should usually be done on a bug tracker or whiteboard before writing code.
Mid-level design: YES. Code review is often very good at challenging details around the mid-level design - e.g. does the choice of Text vs ByteString make sense, is there some abstraction that needs introducing. The author of some code is often heavily influenced by the journey they took to get to some point, by seeing code without that history a reviewer can often suggest a better approach.
Spreading knowledge: YES. Code review is great for spreading knowledge of the code to others in the team. Code review requires the code to be written in such a way that someone else can understand it, and ensures that there is someone else who is familiar with it. For larger projects that's invaluable when someone goes on holiday.
Team cohesion: YES Code review requires different members of the team to share their work with each other, and to have an appreciation of what other people are working through and the challenges it presents. It's all too easy to declare a problem "easy" without understanding the details, and code review removes that temptation.
Training: YES. Code review is also good for teaching new techniques and approaches to the submitter. As an example, some problems are trivial to solve with a generics library, yet some programmers aren't familiar with any - code review helps with ongoing training.

In contrast, I think the disadvantages of code review all revolve around the time required:

It takes other peoples time to do the review, when they could be doing something else.
It takes the submitters time waiting for a review - often the next step is hard to do without the previous step being agreed. As a result, I think code review work should be prioritised.
As a consequence of the two previous issues, code review changes the "unit of work", leading to more changes in a single bundle.

These issues are real, but I think the benefit of knowledge sharing alone outweighs the cost.

Friday, December 23, 2016

Fuzz testing Hexml with AFL

Summary: Hexml 0.1 could read past the end of the buffer for malformed documents. Fuzz testing detected that and I fixed it in Hexml 0.2.

I released Hexml, my fast DOM-based XML parser, and immediately Austin Seipp got suspicious. Here was a moderately large piece of C code, taking untrusted inputs, and poking around in the buffer with memcpy and memchr. He used American Fuzzy Lop (AFL) to fuzz test the Hexml C code, and came up with a number of issues, notably a buffer read overrun on the fragment:

<a b=:fallback

With a lot of help from Austin I setup AFL, fixed some issues with Hexml and with how AFL was being run, released Hexml 0.2 fixing these issues and incorporated AFL into my Travis CI builds.

If you want to actually follow all the steps on your computer, I recommend reading the original GitHub issue from Austin. Alternatively, checkout Hexml and run sh afl.sh.

Building and installing AFL

The first step was to build and install AFL from the tarball, including the LLVM pieces and libdislocator. The LLVM mode allows faster fuzzing, and the libdislocator library provides a library that makes all allocations next to a page boundary - ensuring that if there is a buffer read overrun it results in a segfault than AFL can detect.

An AFL test case

To run AFL you write a program that takes a filename as an argument and "processes" it. In my case that involves calling hexml_document_parse - the full version is online, but the salient bits are:

#include "hexml.c"
... other imports ...

int main(int argc, char** argv)
{
    __AFL_INIT();
    ... read file from argv[0] ...
    document *doc = hexml_document_parse(contents, length);
    hexml_document_free(doc);
    return 0;
}

Here I statically #include the hexml.c codebase and have a main function that calls __AFL_INIT (to make testing go faster), reads from the file, then parses/frees the document. If this code crashes, I want to know about it.

The original AFL driver code used __AFL_LOOP to speed things up further, but that results in a huge number of spurious failures, so I removed it.

Running AFL

To run AFL on my code requires compiling it with one AFL tool, then running it through another. The steps are:

AFL_HARDEN=1 afl-clang-fast -O2 -Icbits cbits/fuzz.c -o $PWD/hexml-fuzz
AFL_PRELOAD=/usr/local/lib/afl/libdislocator.so afl-fuzz -T hexml -x /usr/local/share/afl/dictionaries/xml.dict -i $PWD/xml -o $PWD/afl-results -- $PWD/hexml-fuzz @@

I compile with AFL_HARDEN to detect more bugs, producing hexml-fuzz. I run with libdislocator loaded so that my small buffer overrun turns into a fatal segfault. I give afl-fuzz a dictionary of common XML fragments and a few simple XML documents, then let it run over hexml-fuzz. The interactive UI shows bugs as they occur.

Fixing the bugs

Running AFL on Hexml 0.1 produced lots of bugs within a few seconds. Each bug produces an input file which I then ran through a debugger. While there were a few distinct bug locations, they all shared a common pattern. Hexml parses a NUL-terminated string, and in some cases I looked at a character that was potentially NUL and consumed it in the parsing. That might consume the final character, meaning that any further parsing was reading past the end of the string. I audited all such occurrences, fixed them, and reran AFL. Since then I have been unable to find an AFL bug despite lots of compute time.

Running on CI

I run all my code on Travis CI to ensure I don't introduce bugs, and to make accepting pull requests easier (I don't even need to build the code most of the time). Fortunately, running on Travis isn't too hard:

AFL_PRELOAD=/usr/local/lib/afl/libdislocator.so timeout 5m afl-fuzz -T hexml -x /usr/local/share/afl/dictionaries/xml.dict -i $PWD/xml -o $PWD/afl-results -- $PWD/hexml-fuzz @@ > /dev/null || true
cat afl-results/fuzzer_stats
grep "unique_crashes *: 0" afl-results/fuzzer_stats

I pipe the output of AFL to /dev/null since it's very long. I run for 5 minutes with timeout. After the timeout hits, I display the fuzzer_stats file and then grep for 0 crashes, failing if it isn't there.

Conclusions

Writing C code is hard, especially if it's performance orientated, and if it's not performance orientated you might want to consider a different language. Even if you don't want to use your code on untrusted input, sooner or later someone else will, and even tiny bugs can result in complete exploits. AFL does a remarkable job at detecting such issues and has made Hexml the better for it.

Monday, December 12, 2016

New XML Parser, Hexml

Summary: I've released a new Haskell library, Hexml, which is an incomplete-but-fast XML parser.

I've just released Hexml, a new C/Haskell library for DOM-style XML parsing that is fast, but incomplete. To unpack that a bit:

Hexml is an XML parser that you give a string representing an XML document, it parses that string, and returns either a parse error or a representation of that document. Once you have the document, you can get the child nodes/attributes, walk around the document, and extract the text.
Hexml is really a C library, which has been designed to be easy to wrap in Haskell, and then a Haskell wrapper on top. It should be easy to use Hexml directly from C if desired.
Hexml has been designed for speed. In the very limited benchmarks I've done it is typically just over 2x faster at parsing than Pugixml, where Pugixml is the gold standard for fast XML DOM parsers. In my uses it has turned XML parsing from a bottleneck to an irrelevance, so it works for me.
To gain that speed, Hexml cheats. Primarily it doesn't do entity expansion, so & remains as & in the output. It also doesn't handle CData sections (but that's because I'm lazy) and comment locations are not remembered. It also doesn't deal with most of the XML standard, ignoring the DOCTYPE stuff.

If you want a more robust version of Hexml then the Haskell pugixml binding on Hackage is a reasonable place to start, but be warned that it has memory issues, that can cause segfaults. It also requires C++ which makes use through GHCi more challenging.

Speed techniques

To make Hexml fast I first read the chapter on fast parsing with Pugixml, and stole all those techniques. After that, I introduced a number of my own.

I only work on UTF8, which for the bits of UTF8 I care about, is the same as ASCII - I don't need to do any character decoding.
Since I don't do entity expansion, all strings are available in the source document, so everything simply provides offsets into the input string. In the Haskell API I use constant-time bytestring slices into the source string to present a nice API.
The memory model for a document is an array of attributes, an array of nodes, and a root node from the list of nodes. To make sure that scanning a document is fast, each node describes their attributes and direct child nodes in terms of a start and length within the attribute and node arrays. For example, the root node might have attributes 1..5 in the attribute array, and direct children 4..19 in the node array. When scanning the child nodes there are no linked-list operations and everything is cache friendly.
To keep the memory compact for attributes, I just have an array and reallocate/copy as necessary. By always doubling the number of attributes on exhaustion I ensure a worst-case of 1-copy per attribute on average.
To keep the memory compact for nodes is a bit more complex, as the direct child nodes are not necessarily allocated consecutively, as child nodes may themselves have child nodes. The solution is to have an array of nodes, with contiguous allocation of used child nodes starting at the beginning. To ensure the child nodes are continguous I first put the nodes at the end of the array, then copy them after a child is complete -- in effect using the end of the array as a stack. By always doubling the number of nodes on exhaustion I ensure a worst-case of 2-copies per node on average.
When parsing the text in the body of a document, since I don't care about &, the only character that is of any interest is <. That allows me to process much of the document with the highly-optimised memchr.
I initially allocate a single buffer that contains the document, a small number of attributes and a small number of nodes, in a single call to malloc. If more attributes/nodes are required they allocate a fresh buffer and just ignore the initially provided one. That ensures that for small documents they don't pay for multiple malloc calls, at the cost of wasting the initial attribute/node allocation on larger documents (which are more memory heavy anyway - so it doesn't matter).
I'm pretty sure Hexml could be optimised further. Specifically, I have a recursive descent parser, and it should be a single function with goto. I also process some characters multiple times, mostly to ensure predictable abstraction barriers around the parsing functions, but that could be elimiated with a goto-based approach.

Installing the Haskell Network library on Windows

Summary: This post describes how to install the Haskell network library on Windows, again.

I recently bought a new computer, and tried to install GHC 8.0.1 then upgrade the network library using Cabal. As I have come to expect, it didn't work. Using Git Bash, I got the error:

$ cabal install network-2.6.3.1
Resolving dependencies...
Configuring network-2.6.3.1...
Failed to install network-2.6.3.1
Build log ( C:\Users\Neil\AppData\Roaming\cabal\logs\network-2.6.3.1.log ):
Configuring network-2.6.3.1...
configure: WARNING: unrecognized options: --with-compiler
checking for gcc... C:\ghc\GHC-80~1.1┼║
checking whether the C compiler works... no
configure: error: in `C:/Neil':
configure: error: C compiler cannot create executables
See `config.log' for more details
cabal: Leaving directory '.'
cabal.exe: Error: some packages failed to install:
old-time-1.1.0.3 failed during the configure step. The exception was:
ExitFailure 77

Running -v3 shows the CC variable is being set to C:\ghc\GHC-80~1.1┼║, which looks like a buffer corruption or encoding issue. I tried my previous solution, but it didn't work. My new solution is:

$ cabal unpack network-2.6.3.1
$ cd network-2.6.3.1
$ cabal configure
... fails with a similar error to above ...
$ sh ./configure
$ cabal build
$ cabal copy
$ cabal register

I had to repeat the same pattern for the latest version of old-time, and the same pattern worked.

Another way that works is to use Stack.

Tuesday, December 06, 2016

Undefined Behaviour in C

Summary: I tripped over undefined behaviour in C. It's annoying.

I've recently been writing some C code to parse XML quickly. While working on that project, I inadvertently wrote some code which is undefined according to the C language standard. The code compiled and ran fine using Visual Studio, but under gcc (even at -O0) it corrupted memory, sometimes leading to a segfault, but usually just leading to a wrong answer. The code in question was (see full code at GitHub):

d->nodes.nodes[0].nodes = parse_content(d);

To give some context, d is a structure that contains various pieces of state - what the string to be parsed is, how much we have parsed, along with a pointer to the output nodes. The parse_content function parses the bit inside an XML tag, returning the indicies in nodes which it used.

The complication comes from nodes not being a fixed size, but dynamically resized if the number of nodes exceeds the capacity. For big documents that means parse_content will reallocate d->nodes.nodes.

According to the C spec, the compiler can evaluate the LHS and RHS of an assignment in any order. Since gcc computes the location of d->nodes.nodes[0] before calling parse_content it uses the address of the node before reallocation. After reallocation the address will have changed, and the assignment will be made to the wrong location.

I spotted the bug by inserting printf statements, and in doing so, I had to rewrite the code to:

str content = parse_content(d);
d->nodes.nodes[0].nodes = content;

That fixes the issue, since now the evaluation order is strictly defined. As a simplified example of the same issue:

char* array;

char f() {
    array = malloc(42);
    return 'x';    
}

void test() {
    array = malloc(0);
    array[0] = f();
}

Here the line array[0] = f() might assign to either the result of malloc(0) or malloc(42), at the compilers discretion.

I manually checked if I had made any other such mistakes, and I couldn't find any. Naturally, I wanted to find a static checker that could detect such a mistake, so I tried a bunch of them. I wasn't very successful:

Visual Studio 2015 code analysis made me write assert after each malloc, but nothing further.
PVS Studio found nothing.
Clang undefined behaviour found nothing, and seemingly doesn't work on Windows.
GCC undefined behaviour found nothing, and seemingly doesn't work on Windows.
RV-Match hit a stack-overflow when running the program.

Wednesday, November 23, 2016

The Haskell.org Website Working Group (HWWG)

Haskell represents both a language and a user community - and moreover a fantastic community full of friends, fun, and deep technical debate. Unfortunately, in recent times the community has started to fracture, e.g. Cabal vs Stack, haskell.org vs haskell-lang.org. These divisions have risen above technical disagreements and at some points turned personal. The solution, shepherded by Simon Peyton Jones, and agreed to by both members of the haskell.org committee and the maintainers of haskell-lang.org, is to form the Haskell Website Working Group (HWWG). The charter of the group is at the bottom of this post.

The goal of the Haskell Website Working Group is to make sure the Haskell website caters to the needs of Haskell programmers, particularly beginners. In doing so we hope to either combine or differentiate haskell.org and haskell-lang.org, and give people clear recommendations of what "downloading Haskell" means. Concretely, we hope that either haskell-lang.org redirects to haskell.org, or that haskell-lang.org ends up being used for something very different from today.

The Haskell Website Working Group (HWWG)

Scope and goals

The HWWG is responsible for the design and content of the user-facing haskell.org web site, including tutorials, videos, news, resource, downloads, etc.
The HWWG is not responsible for:
- The infrastructure of haskell.org
- Toolchains, Hackage, compilers, etc
The HWWG focuses on serving users of Haskell, not suppliers of technology or libraries.
An explicit goal is to re-unite the haskell.org and haskell-lang.org web sites.

Expected mode of operation

HWWG is not responsible for actually doing everything! The web site is on github. Anyone can make a pull request. The general expectation is that uncontroversial changes will be accepted and committed without much debate.
If there is disagreement about a proposed change, it's up to the HWWG to engage in (open) debate, and to seek input as widely as time allows, but finally to make a decision.

Membership

Initial membership comprises of:

Neil Mitchell (chair)
Nicolas Wu
Andrew Cowie
Vincent Hanquez
Ryan Trinkle
Chris Done

It is expected the committee will change over time, but the mechanism has not yet been thought about.

Rules of engagement

Recognising that honestly-held judgements may differ, we will be scrupulously polite both in public and in private.
Recognising that Haskell has many users, and that different users have different needs and tastes, we want haskell.org to be inclusive rather than exclusive, providing a variety of alternative resources (toolchains, tutorials, books, etc) clearly signposted with their intended audiences.
Ultimately the haskell.org committee owns the haskell.org URL, but it delegates authority for the design and content of the web site to the HWWG. In extremis, if the haskell.org committee believes that the HWWG is mismanaging the web site, it can revoke that delegation.

Thursday, September 29, 2016

Full-time Haskell jobs in London, at Barclays

Update: I no longer work for Barclays and am no longer hiring.

Summary: I'm hiring 9 Haskell programmers. Email neil.d.mitchell AT barclays.com to apply.

I work for Barclays, in London, working on a brand new Haskell project. We're looking for nine additional Haskell programmers to come and join the team.

What we offer

A permanent job, writing Haskell, using all the tools you know and love – GHC/Cabal/Stack etc. In the first two weeks in my role I've already written parsers with attoparsec, both Haskell and HTML generators and am currently generating a binding to C with lots of Storable/Ptr stuff. Since it's a new project you will have the chance to help shape the project.

The project itself is to write a risk engine – something that lets the bank calculate the value of the trades it has made, and how things like changes in the market change their value. Risk engines are important to banks and include lots of varied tasks – talking to other systems, overall structuring, actual computation, streaming values, map/reduce.

We'll be following modern but lightweight development principles – including nightly builds, no check-ins to master which break the tests (enforced automatically) and good test coverage.

These positions have attractive salary levels.

What we require

We're looking for the best functional programmers out there, with a strong bias towards Haskell. We have a range of seniorities available to suit even the most experienced candidates. We don't have anything at the very junior end; instead we're looking for candidates that are already fluent and productive. That said, a number of very good Haskell programmers think of themselves as beginners even after many years, so if you're not sure, please get in touch.

We do not require any finance knowledge.

The role is in London, Canary Wharf, and physical presence in the office on a regular basis is required – permanent remote working is not an option.

How to apply

To apply, email neil.d.mitchell AT barclays.com with a copy of your CV. If you have any questions, email me.

The best way to assess technical ability is to look at code people have written. If you have any packages on Hackage or things on GitHub, please point me at the best projects. If your best code is not publicly available, please describe the Haskell projects you've been involved in.

Sunday, August 14, 2016

The Four Flaws of Haskell

Summary: Last year I made a list of four flaws with Haskell. Most have improved significantly over the last year.

No language/compiler ecosystem is without its flaws. A while ago I made a list of the flaws I thought might harm people using Haskell in an industrial setting. These are not flaws that impact beginners, or flaws that stop people from switching to Haskell, but those that might harm a big project. Naturally, everyone would come up with a different list, but here is mine.

Package Management: Installing a single consistent set of packages used across a large code base used to be difficult. Upgrading packages within that set was even worse. On Windows, anything that required a new network package was likely to end in tears. The MinGHC project solved the network issue. Stackage solved the consistent set of packages, and Stack made it even easier. I now consider Haskell package management a strength for large projects, not a risk.

IDE: The lack of an IDE certainly harms Haskell. There are a number of possibilities, but whenever I've tried them they have come up lacking. The fact that every Haskell programmer has an entrenched editor choice doesn't make it an easier task to solve. Fortunately, with Ghcid there is at least something near the minimum acceptable standard for everyone. At the same time various IDE projects have appeared, notably the Haskell IDE Engine and Intero. With Ghcid the lack of an IDE stops being a risk, and with the progress on other fronts I hope the Haskell IDE story continues to improve.

Space leaks: As Haskell programs get bigger, the chance of hitting a space leak increases, becoming an inevitability. While I am a big fan of laziness, space leaks are the big downside. Realising space leaks were on my flaws list, I started investigating methods for detecting space leaks, coming up with a simple detection method that works well. I've continued applying this method to other libraries and tools. I'll be giving a talk on space leaks at Haskell eXchange 2016. With these techniques space leaks don't go away, but they can be detected with ease and solved relatively simply - no longer a risk to Haskell projects.

Array/String libraries: When working with strings/arrays, the libraries that tend to get used are vector, bytestring, text and utf8-string. While each are individually nice projects, they don't work seamlessly together. The utf8-string provides UTF8 semantics for bytestring, which provides pinned byte arrays. The text package provides UTF16 encoded unpinned Char arrays. The vector package provides mutable and immutable vectors which can be either pinned or unpinned. I think the ideal situation would be a type that was either pinned or unpinned based on size, where the string was just a UTF8 encoded array with a newtype wrapping. Fortunately the foundation library provides exactly that. I'm not brave enough to claim a 0.0.1 package released yesterday has derisked Haskell projects, but things are travelling in the right direction.

It has certainly been possible to use Haskell for large projects for some time, but there were some risks. With the improvements over the last year the remaining risks have decreased markedly. In contrast, the risks of using an untyped or impure language remain significant, making Haskell a great choice for new projects.

Thursday, August 04, 2016

Upcoming talk: Writing build systems with Shake, 16 Aug 2016, London

Summary: I'm giving a talk on Shake.

I'm delighted to announce that I'll be giving a talk/hack session on Shake as part of the relatively new "Haskell Hacking London" meetup.

Title: Writing build systems with Shake

Date: Tuesday, August 16, 2016. 6:30 PM

Location: Pusher Office, 28 Scrutton Street, London

Abstract: Shake is a general purpose library for expressing build systems - forms of computation, with caching, dependencies and more besides. Like all the best stuff in Haskell, Shake is generic, with details such as "files" written on top of the generic library. Of course, the real world doesn't just have "files", but specifically has "C files that need to be compiled with gcc". In this hacking session we'll look at how to write Shake rules, what existing functions people have already layered on top of Shake for compiling with specific compilers, and consider which rules are missing. Hopefully by the end we'll have a rule that people can use out-of-the-box for compiling C++ and Haskell.

To put it another way, it's all about layering up. Haskell is a programming language. Shake is a Haskell library for dependencies, minimal recomputation, parallelism etc. Shake also provides as a layer on top (but inside the same library) to write rules about files, and ways to run command line tools. Shake doesn't yet provide a layer that compiles C files, but it does provide the tools with which you can write your own. The aim of this talk/hack session is to figure out what the next layer should be, and write it. It is definitely an attempt to move into the SCons territory of build systems, which knows how to build C etc. out of the box.

Monday, July 25, 2016

Maintainer required for Derive

Summary: I'm looking for a maintainer to take over Derive. Interested?

The Derive tool is a library and set of definitions for generating fragments of code in a formulaic way from a data type. It has a mechanism for guessing the pattern from a single example, plus a more manual way of writing out generators. It supports 35 generators out of the box, and is depended upon by 75 libraries.

The tool hasn't seen much love for some time, and I no longer use it myself. It requires somewhat regular maintenance to upgrade to new versions of GHC and haskell-src-exts. There are lots of directions the tool could be taken, more derivations, integration with the GHC Generic derivation system etc. There's a few generic pieces that could be broken off (translation between Template Haskell and haskell-src-exts, the guessing mechanism).

Anyone who is interested should comment on the GitHub ticket. In the absence of any volunteers I may continue to do the regular upgrade work, or may instead have it taken out of Stackage and stop upgrading it.

Monday, July 18, 2016

Why did Stack stop using Shake?

Summary: Stack originally used Shake. Now it doesn't. There are reasons for that.

The Stack tool originally used the Shake build system, as described on the page about Stack's origins. Recently Edward Yang asked why doesn't Stack still use Shake - a very interesting question. I've taken the information shared in that mailing list thread and written it up, complete with my comments and distortions/inferences.

Stack is all about building Haskell code, in ways that obey dependencies and perform minimal rebuilds. Already in Haskell the dependency story is somewhat muddied. GHC (as available through ghc --make) does advanced dependency tracking, including header includes and custom Template Haskell dependency directives. You can also run ghc in single-shot mode, compiling a file at a time, but the result is about 3x slower and GHC will still do some dependency tracking itself anyway. Layered on top of ghc --make is Cabal which is responsible for tracking dependencies with .cabal files, configured Cabal information and placing things into the GHC package database. Layered on top of that is Stack, which has multiple projects and needs to track information about which Stackage snapshot is active and shared build dependencies.

Shake is good at taking complex dependencies and hiding all the messy details. However, for Stack many of these messy details were the whole purpose of the project. When Michael Snoyman and Chris Done were originally writing Stack they didn't have much experience with Shake, and opted to go for simplicity and directly managing the pieces, which they viewed to be less risky.

Now that Stack is written, and works nicely, the question changes to if it is worth changing existing working code to make use of Shake. Interestingly, at the heart of Stack there is a "Shake-lite" - see Control.Concurrent.Execute. This piece could certainly be replaced by Shake, but what would the benefit be? Looking at it with my Shake implementers hat on, there are a few things that spring to mind:

This existing code is O(n^2) in lots of places. For the size of Stack projects, compared to the time required to compile Haskell, that probably doesn't matter.
Shake persists the dependencies, but the Stack code does not seem to. Would that be useful? Or is the information already persisted elsewhere? Would Shake persisting the information make stack builds which had nothing to do go faster? (The answer is almost certainly yes.)
Since the code is only used on one project it probably isn't as well tested as Shake, which has a lot of tests. On the other hand, it has a lot less features, so a lot less scope for bugs.
The code makes a lot of assumptions about the information fed to it. Shake doesn't make such assumptions, and thus invalid input is less likely to fail silently.
Shake has a lot of advanced dependency forms such as resources. Stack currently blocks when simultaneous configures are tried, whereas Shake would schedule other tasks to run.
Shake has features such as profiling that are not worth creating for a single project, but that when bundled in the library can be a useful free feature.

In some ways Stack as it stands avoids a lot of the best selling points about Shake:

If you have lots of complex interdependencies, Shake lets you manage
them nicely. That's not really the case for Stack, but is in large
heterogeneous build systems, e.g. the GHC build system.
If you are writing things quickly, Shake lets you manage
exceptions/retries/robustness quickly. For a project which has the
effort invested that Stack does, that's less important, but for things
like MinGHC (something Stack killed), it was critically important because no one cared enough to do all this nasty engineering.
If you are experimenting, Shake provides a lot of pieces (resources,
parallelism, storage) that help explore the problem space without
having to do lots of work at each iteration. That might mean Shake is
more of a benefit at the start of a project than in a mature project.

If you are writing a version of Stack from scratch, I'd certainly recommend thinking about using Shake. I suspect it probably does make sense for Stack to switch to Shake eventually, to simplify ongoing maintenance, but there's no real hurry.

Tuesday, July 05, 2016

More space leaks: Alex/Happy edition

Summary: Alex and Happy had three space leaks, now fixed.

Using the techniques described in my previous blog post I checked happy and alex for space leaks. As expected, both had space leaks. Three were clear and unambiguous space leaks, two were more nuanced. In this post I'll describe all five, starting with the obvious ones.

1: Happy - non-strict accumulating fold

Happy contains the code:

indexInto :: Eq a => Int -> a -> [a] -> Maybe Int
indexInto _ _ []                 = Nothing
indexInto i x (y:ys) | x == y    = Just i
                     | otherwise = indexInto (i+1) x ys

This code finds the index of an element in a list, always being first called with an initial argument of 0. However, as it stands, the first argument is a classic space leak - it chews through the input list, building up an equally long chain of +1 applications, which are only forced later.

The fix is simple, change the final line to:

let j = i + 1 in j `seq` indexInto j x ys

Or (preferably) switch to using the space-leak free Data.List.elemIndex. Fixed in a pull request.

2: Happy - sum using foldr

Happy also contained the code:

foldr (\(a,b) (c,d) -> (a+b,b+d)) (0,0) conflictList

The first issue is that the code is using foldr to produce a small atomic value, when foldl' would be a much better choice. Even after switching to foldl' we still have a space leak because foldl' only forces the outer-most value - namely just the pair, not the Int values inside. We want to force the elements inside the pair so are forced into the more painful construction:

foldl' (\(a,b) (c,d) ->
    let ac = a + c; bd = b + d
    in ac `seq` bd `seq` (ac,bd))
    (0,0) conflictList

Not as pleasant, but it does work. In some cases people may prefer to define the auxiliary:

let strict2 f !x !y = f x y
in foldr (\(a,b) (c,d) -> strict2 (,) (a+b) (b+d)) (0,0) conflictList

Fixed in a pull request.

3: Alex - lazy state in a State Monad

Alex features the code:

N $ \s n _ -> (s, addEdge n, ())

Here N roughly corresponds to a state monad with 2 fields, s and n. In this code n is a Map, which operates strictly, but the n itself is not forced until the end. We solve the problem by forcing the value before returning the triple:

N $ \s n _ -> let n' = addEdge n in n' `seq` (s, n', ())

Fixed in a pull request.

4: Alex - array freeze

Alex calls the Data.Array.MArray.freeze function, to convert an STUArray (unboxed mutable array in the ST monad) into a UArray (unboxed immutable array). Unfortunately the freeze call in the array library uses an amount of stack proportional to the size of the array. Not necessarily a space leak, but not ideal either. Looking at the code, it's also very inefficient, constructing and deconstructing lots of intermediate data. Fortunately under normal optimisation a rewrite rule fires for this type to replace the call with one to freezeSTUArray, which is much faster and has bounded stack, but is not directly exported.

Usually I diagnose space leaks under -O0, on the basis that any space leak problems at -O0 may eventually cause problems anyway if an optimisation opportunity is lost. In this particular case I had to -O1 that module.

5: Happy - complex fold

The final issue occurs in a function fold_lookahead, which when given lists of triples does an mconcat on all values that match in the first two components. Using the extra library that can be written as:

map (\((a,b),cs) -> (a,b,mconcat cs)) .
groupSort .
map (\(a,b,c) -> ((a,b),c))

We first turn the triple into a pair where the first two elements are the first component of the pair, call groupSort, then mconcat the result. However, in Happy this construction is encoded as a foldr doing an insertion sort on the first component, followed by a linear scan on the second component, then individual mappend calls. The foldr construction uses lots of stack (more than 1Mb), and also uses an O(n^2) algorithm instead of O(n log n).

Alas, the algorithms are not identical - the resulting list is typically in a different order. I don't believe this difference matters, and the tests all pass, but it does make the change more dangerous than the others. Fixed in a pull request.

The result

Thanks to Simon Marlow for reviewing and merging all the changes. After these changes Happy and Alex on the sample files I tested them with use < 1Kb of stack. In practice the space leaks discovered here are unlikely to materially impact any real workflows, but they probably go a bit faster.

Saturday, June 11, 2016

js-jquery 3.0.0 now out

The js-jquery Haskell library bundles the minified jQuery Javascript code into a Haskell package, so it can be depended upon by Cabal packages and incorporated into generated HTML pages. It does so in a way that doesn't require each Haskell package to bundle its own extra data files, and in a way that meets the licensing requirements of distributions such as Debian.

I've just released version 3.0.0, following on from jQuery 3.0.0 a few days ago. This release breaks compatibility with IE6-8, so if that's important to you, insert an upper bound on the package.

Friday, May 20, 2016

Another space leak: QuickCheck edition

Summary: QuickCheck had a space leak in property, now fixed (in HEAD).

Using the techniques described in my previous blog post I found another space leak, this time in QuickCheck, which has now been fixed. Using QuickCheck we can chose to "label" certain inputs, for example:

$ quickCheck $ \p -> label (if p > 0 then "+ve" else "-ve") True
+++ OK, passed 100 tests:
54% -ve
46% +ve

Here we label numbers based on their value, and at the end QuickCheck tells us how many were in each set. As you might expect, the underlying QuickCheck implementation contains a Map String Int to record how many tests get each label.

Unfortunately, the implementation in QuickCheck-2.8.1 has a space leak, meaning that the memory usage is proportional to the number of tests run. We can provoke such a space leak with:

quickCheckWithResult stdArgs{maxSuccess=10000} $
    \(p :: Double) -> label "foo" True

When running with ghc --make Main.hs -rtsopts && Main +RTS -K1K we get the error:

Main: Stack space overflow: current size 33624 bytes.

Using -K1K we have detected when we evaluate the space leak, at the end of the program, when trying to print out the summary statistics. The approach taken by QuickCheck for label is to generate a separate Map String Int per run, then at each step merge these Map values together using unionWith (+). As such, there are two likely culprits for the space leak:

Perhaps the Map is not evaluated, so in memory we have unionWith (+) x1 $ unionWith (+) x2 $ unionWith (+) x3 $ ....
Perhaps the values inside the Map are not evaluated, so in memory we have Map {"foo" = 1 + 1 + 1 + ...}.

QuickCheck avoids the first space leak by keeping its intermediate state in a record type with a strict field for the Map. QuickCheck suffers from the second problem. As usual, actually fixing the space leak is easy - just switch from importing Data.Map to Data.Map.Strict. The Strict module ensures that the computations passed to unionWith are forced before it returns, and the memory usage remains constant, not linear in the number of tests.

I detected this space leak because the Shake test suite runs with -K1K and when running one particular test on a Mac with GHC 8.0 in profiling mode it caused a stack overflow. I did not diagnose which of those factors was the ultimate cause (it may have even been the random seed at a particular point in time - only certain inputs call label).

Many space leaks are now easy to detect (using -K1K), moderate difficulty to debug (using the -xc technique or just by eye) and usually easy to fix.

Tuesday, April 19, 2016

New Shake with better wildcard patterns

Summary: The new version of Shake supports ** patterns for directory wildcards.

I've just released Shake 0.15.6. Don't be mislead by the 0.0.1 increment of the release, it's got over 50 entries in the changelog since the last release. There are quite a few bug fixes, documentation improvements and optimisations.

One of the most user visible features is the new wildcard patterns. In the previous version Shake supported // for matching any number of directories and * for matching within a path component, so to match all C source files in src you could write:

src//*.c

In the new version of Shake you can also write:

src/**/*.c

The // patterns remain supported, but I intend to encourage use of ** in new code if these patterns don't end up having any unforeseen problems. The advantages of the patterns in the new version are:

The ** patterns seem to be the defacto standard nowadays, being used by rsync, Ant, Gradle, Jenkins etc.
People would often write "src" </> "//*.c", which gives the unexpected result of //*.c. With ** you aren't overloading directories at the same time so everything works out as expected.
** patterns only match relative files, not absolute ones, which is what you usually want in a build system. If you want to match absolute files use */**.
The semantics of patterns were a bit confusing for things like /// - I've now given them precise semantics, but ** avoids this confusion.
I've optimised the pattern matching for both flavours, so there is more precomputation and less backtracking (in practice I don't think that makes any difference).
I've optimised directory traversal using a file pattern, so it doesn't list directories that can't possibly match, which gives a significant speedup.

For this release I've also improved the website at shakebuild.com with more documentation - hopefully it is useful.

Monday, April 11, 2016

GHCid 0.6 Released

Summary: I've released a new version of GHCid, which can interrupt running tests.

I've just released version 0.6.1 of GHCid. To a first approximation, ghcid opens ghci and runs :reload whenever your source code changes, formatting the output to fit a fixed height console. Unlike other Haskell development tools, ghcid is intended to be incredibly simple - it works when nothing else does. This new version features:

Much faster: Since version 0.5 GHCid passes -fno-code to ghci when it makes sense, which is about twice as fast.

Interruptible test commands: Since version 0.4 ghcid has supported a --test flag to pass a test command (e.g. :main) which is run whenever the code is warning free. As of version 0.6 that command will be interrupted if it needs to :reload, allowing long running tests and persistent "tests" - e.g. spawning web servers or GUIs. Thanks to Reid Draper for showing it was possible as part of his ordeal project and Luigy Leon for merging that with GHCid.

Stack integration: If you have a stack.yaml function and a .stack-work directory it will use stack commands to run your project. Thanks to the Stack Team, in particular Michael Sloan, for helping get through all the hoops and providing the necessary functionality in Stack.

More restart/reload flags: It's been possible for a while to pass --restart to restart ghci if certain files change (e.g. the .cabal file). Now there is a separate --reload flag to cause :reload instead of a full restart, and both flags can take directories instead of individual files.

Major relayering: For 0.6 I significantly refactored much of the GHCid code. There has always been an underlying Language.Haskell.Ghcid API, and GHCid was built on top. With the new version the underlying library has been given a significant refactoring, mostly so that interruptions are handled properly without race conditions and with a sane multithreading story. On top of that is a new Session layer, which provides a session abstraction - a ghci instance which tracks more state (e.g. which warnings have been emitted for already loaded files). Then the Ghcid module builds on top, with much less state management. By simplifying and clarifying the responsibility of each layer certain issues such as leaking old ghci processes and obscure race conditions disappeared.

I've been making use of many of these features in the Shake website generator, which I invoke with:

ghcid --test=":main debug" --reload=parts --reload=../docs

This project uses Stack, so relies on the new stack integration. It runs :main debug as the test suite, which generates the website whenever the code reloads. Furthermore, if any of the parts (template files) or docs (Markdown pages) change the website regenerates. I can now edit the website, and saving it automatically regenerates the web pages within a second.

Wednesday, April 06, 2016

Github Offline Issues with IssueSync

For a while I've been looking for something to download the GitHub issues for a project. I do a lot of development work on a train with no internet, so referring to the tickets offline is very useful. I've tried lot of tools, in a very wide variety of languages (Ruby, Python, Perl, Javascript, PHP) - but most of them don't seem to work - and the only one I did manage to get working only gave a curses UI.

Finally, I've found one that works - IssueSync. Installing it worked as described. Running it worked as described. I raised tickets for the author and they fixed them. I even sent a pull request and the author discussed and merged it. It downloads all your issues to Markdown files in an issues directory. I then "list" my issues using:

head -n1 -q issues/*.md | grep -v CLOSED

It's simple and works nicely.

Thursday, March 03, 2016

Compiling GHC on Windows

Summary: How to compile GHC on Windows using Stack and the new Shake-based GHC build system.

Regularly updated instructions are at https://github.com/snowleopard/hadrian/blob/master/doc/windows.md - use them.

Here are a list of instructions to compile GHC, from source, on Windows. I tested these instructions on a clean machine using the free Windows 10 VirtualBox image (I bumped the VM CPUs to 4, and RAM to 4096Mb).

The first step is to install Stack (I just accepted all the defaults), then open a command prompt and run:

stack setup
stack install happy alex
stack exec -- pacman -S gcc binutils git automake-wrapper tar make patch autoconf --noconfirm
stack exec -- git clone --recursive git://git.haskell.org/ghc.git
cd ghc
stack exec -- git clone git://github.com/snowleopard/hadrian
stack build --stack-yaml=hadrian/stack.yaml --only-dependencies
stack exec --stack-yaml=hadrian/stack.yaml -- hadrian/build.bat -j

The entire process (after the VM has downloaded) takes a bit less than an hour. These steps use the Stack supplied tools (MinGW, Git), and the new Shake-based build system. The hope is that by using the isolation Stack provides, combined with the portability improvements from writing the build system in Haskell, these instructions will work robustly on many Windows machines.

I have not tried these instructions on other platforms, but suspect that removing the pacman line might be sufficient to get it to work.

Update: Instructions simplified following improvements to the build system.

Wednesday, February 17, 2016

Selling Haskell in the pub

Summary: A Haskell sales pitch without any code.

I often find myself in a pub, without pen or paper, trying to persuade someone to try Haskell. In these situations I tend to use three arguments:

Low cost abstractions: In Haskell the cost of creating a helper function is low, because the syntax for functions is very concise (a single line) and the optimiser often removes all overhead of the function (by inlining it). The power of such helper functions is greatly enhanced by higher-order functions. In many languages each function must be top-level (within a file or class), but Haskell permits functions local to a small block, providing encapsulation and less necessity for good names. In most languages there is a much higher cost per function, and thus trivial functions are insufficiently valuable. By reducing the cost, Haskell encourages both less repetition and describing problems in a more abstract way, which is very helpful for taming program complexity.

Refactoring works: In Haskell, refactoring is easy, safe and common. Most projects involve writing a chunk of code, then continually changing it as the project evolves. In other languages most refactorings have complex side conditions that must be met. In Haskell, most refactorings are simple, and even refactoring tools can have complex refactorings mechanically proven correct. Any code which violates the expected side-conditions is considered "dangerous" and libraries are expected to provide robust abstractions. The static type checker ensures that most refactorings have been carried correctly. It is common to change a fundamental type in the middle of millions of lines of code, quickly make the changes required and have a high degree of confidence that it still works. The ability to easily refactor means that code can evolve with the project, without accumulating technical debt.

Language polygots: There are few programmers who know only Haskell. I know Haskell programmers who are also experts in Perl, PHP, C, Javascript, C++, Fortran, R, Algol - pretty much any language you care to name. In contrast, when my wife has attended other programming language meetups, many of the participants knew only that language. There are many reasons Haskell programmers know lots of languages, not least because Haskell has rarely been taught as a first language. I find it interesting that many people who are experts in both Haskell and other languages typically prefer Haskell.

In response to these arguments, if people are starting to get convinced, they usually ask:

Are there lots of libraries? Yes, 9000+. There are more R libraries for statistics, more Perl libraries for Bioinformatics etc - but most of the standard stuff is covered. Wrapping a C library with the foreign function interface (FFI) isn't too hard.
What's the performance like? It's usually somewhere between C and Python. If you carefully optimise you can get close to the performance of C. The profiling tools are reasonable. Performance is not usually a problem, but can be solved with C and FFI if it is.

Many of the above arguments are also supportive of other statically typed functional languages. I tend to find my pub interventions are usually aimed at Python/Java/C++/PHP programmers, where I'd consider it a win if they tried O'Caml or Scala instead.

Monday, February 01, 2016

New HLint version and API features

Summary: In the new version of HLint, errors are warnings and warnings are suggestions. I plan to remove the old API very soon.

I've just released a new version of HLint, the tool for suggesting hints to improve your Haskell code. This version comes with two big changes.

Firstly, hints have been reclassified in severity. What used to be errors are now warnings, and hints that used to be warnings are now suggestions. As people have mentioned in the past, nothing HLint suggested was really an error, and now HLint matches that.

Secondly, there is now an hlint API entry point in Language.Haskell.HLint3 which captures the pattern of running HLint with command-line arguments, but in process as a library. With that, I don't think there are any API patterns that are better captured by the Language.Haskell.HLint API. If no one contacts me with issues, I will be making the module Language.Haskell.HLint a copy of Language.Haskell.HLint3 in the next release.

This release (like all minor releases) fixes a few bugs and adds a few new features. As a few random examples, HLint now warns on a redundant DefaultSignatures language extension, suggests fmap in preference to liftM and warns when otherwise is used as a pattern variable. There is a complete change log in the repo.

Tuesday, January 19, 2016

One Bag of Huel

I've been a fan of the idea of meal replacement products for a while - the Matrix gruel always seemed simple and appealing. One such product is Huel, which is available in the UK, seems fairly sensibly thought out, and isn't cadim-yummy. When having lunch at home, I found myself eating stuff that was nutritionally garbage, inconvenient and time consuming to get and not very nice. Given the downsides, Huel seemed worth a try. Having gone through one bag, it seems to be working nicely, and I intend to continue this as my default lunch-at-home plan.

General random thoughts on Huel:

My wife pronounces Huel with a strong and long 'u', huuuuuuel, to imitate the sound of someone vomiting. Soylent is probably a better name.
I bought a first bag, and they threw in a free flask/shaker and t-shirt. I've been using the flask they supplied, and the t-shirt is very nice.
I like that it has all the vitamins and minerals required, so I'm not likely to be missing anything.
The taste is OK. Not horrid, perfectly drinkable, but not too nice, so unlikely to go overboard. They got the balance right.
Most of the flavouring suggestions seem like someone trolling. The default vanilla flavour works fine for me.
The taste/consistency depends a lot on exactly how much water goes in, so experiment a bit (I find 420ml and 3 scoops works for me).
By the end of the flask, it's significantly more dense, so I add a little bit of water with about 40ml to go.
If you add the powder before the water it's a bit of a disaster and you get clumped powder at the bottom.
Wash the flask as soon as you finish, or it sets quite hard.
I think I spend about 3 mins preparation time making the mixture and washing up after, which isn't bad.
The pouches come sealed at the top with a resealable strip that is initially unsealed. Cut the top of the strip, don't tear it, or you bump into the resealable strip. Before starting, you have to clean the resealable strip out with a knife (which gives a bad first impression, but otherwise is unproblematic).
The powder has a habit of escaping a bit when filling up the flask. If it gets on clothes, it stays there until you wash them, and a gentle brushing down has little effect. A little annoying, but not fatal.
It's sufficiently filling that I think I'm probably going under the number of calories I should be getting at lunch. I'm experimenting with trying to start my lunch Huel a bit earlier, and then have another Huel or snack in the afternoon - splitting lunch in two.

Since this is a post about a specific product, I should probably mention I have no relationship with the company other than having spent £45 on Huel.

Thursday, January 14, 2016

A simple Haskell function

Summary: An example of a small function I recently wrote - from type signature to tests.

When writing a build system there are lots of nasty corner cases to consider. One is that command line limits combined with lots of arguments sometimes requires splitting a single command up into multiple commands, each of which is under some maximum length. In this post I'll describe a function that was required, my implementation, and how I tested it.

Type signature and documentation

Before I even got to the function, it already had a type signature and some Haddock documentation:

-- | @chunksOfSize size strings@ splits a given list of strings into chunks not
--   exceeding @size@ characters. If that is impossible, it uses singleton chunks.
chunksOfSize :: Int -> [String] -> [[String]]

As an example:

chunksOfSize 5 ["this","is","a","test"] == [["this"],["is","a"],["test"]]

Implementation

My implementation was:

chunksOfSize n = repeatedly $ \xs ->
    let i = length $ takeWhile (<= n) $ scanl1 (+) $ map length xs
    in splitAt (max 1 i) xs

First we use the repeatedly function from the extra library. This has the signature:

repeatedly :: ([a] -> (b, [a])) -> [a] -> [b]

Given a list of input, you supply a function that splits off an initial piece and returns the rest. One of the examples in the documentation is:

repeatedly (splitAt 3) xs  == chunksOf 3 xs

So we can see how repeatedly lets us focus on just the "next step" of this list, ignoring the recursion. For the function argument we have two tasks - first decide how many items to put in this chunk, then to split the chunks. Splitting the chunks is the easy bit, and can be written:

splitAt (max 1 i) xs

If we know the next i elements will be at or below the limit, then we can use splitAt to divide the elements. As a special case, if no elements would be allowed, we allow one, using max 1 to ensure we never pass 0 to splitAt (and thus enter an infinite loop). That leaves us with:

i = length $ takeWhile (<= n) $ scanl1 (+) $ map length xs

Reading from right to left, we reduce each element to it's length, then use scanl1 to produce a running total - so each element represents the total length up to that point. We then use takeWhile (<= n) to keep grabbing elements while they are short enough, and finally length to convert back to something we can use with splitAt.

Tests

When testing, I tend to start with a few concrete examples then move on to QuickCheck properties. As an initial example we can do:

quickCheck $
    chunksOfSize 3 ["a","b","c","defg","hi","jk"] ==
    [["a","b","c"],["defg"],["hi"],["jk"]]

Here we are explicitly testing some of the corner cases - we want to make sure the full complement of 3 get into the first chunk (and we haven't got an off-by-one), we also test a singleton chunk of size 4. Now we move on to QuickCheck properties:

quickCheck $ \n xs ->
    let res = chunksOfSize n xs
    in concat res == xs &&
       all (\r -> length r == 1 || length (concat r) <= n) res

There are really two properties here - first, the chunks concat together to form the original. Secondly, each chunk is either under the limit or a singleton. These properties capture the requirements in the documentation.

A final property we can check is that it should never be possible to move the first piece from a chunk to the previous chunk. We can write such a property as:

all (> n) $ zipWith (+)
    (map (sum . map length) res)
    (drop 1 $ map (length . head) res)

This property isn't as important as the other invariants, and is somewhat tested in the example, so I didn't include it in the test suite.

Performance and alternatives

The complexity is O(n) in the number of Char values, which is as expected, since we have to count them all. Some observations about this point in the design space:

In a strict language this would be an O(n^2) implementation, since we would repeatedly length and scanl the remainder of the tail each time. As it is, we are calling length on the first element of each chunk twice, so there is minor constant overhead.
Usually in Haskell, instead of counting the number of elements and then doing splitAt we would prefer to use span - something like span ((<= n) . fst) .... While possible, it makes the special singleton case more difficult, and requires lots of tuples/contortions to associate each element with its rolling sum.
For a build system, the entire input will be evaluated before, and the entire output will be kept in memory afterwards. However, if we think about this program with lazy streaming inputs and outputs, it will buffer each element of the output list separately. As a result memory would be bounded by the maximum of the longest string and the Int argument to chunksOfSize.
It is possible to write a streaming version of this function, which returns each String as soon as it is consumed, with memory bounded by the longest string alone. Moreover, if the solution above was to use lazy naturals, it would actually come quite close to being streaming (albeit gaining a quadratic complexity term from the takeWhile (<= n)).
The type signature could be generalised to [a] instead of String, but I would suspect in this context it's more likely for String to be replaced by Text or ByteString, rather than to be used on [Bool]. As a result, sticking to String seems best.

Refactoring the previous version

The function already existed in the codebase I was working on, so below is the original implementation. This implementation does not handle the long singleton special case (it loops forever). We can refactor it to support the singleton case, which we do in several steps. The original version was:

chunksOfSize _    [] = []
chunksOfSize size strings = reverse chunk : chunksOfSize size rest
  where
    (chunk, rest) = go [] 0 strings
    go res _         []     = (res, [])
    go res chunkSize (s:ss) =
        if newSize > size then (res, s:ss) else go (s:res) newSize ss
      where
        newSize = chunkSize + length s

Refactoring to use repeatedly we get:

chunksOfSize size = repeatedly $ second reverse . go [] 0
  where
    go res _         []     = (res, [])
    go res chunkSize (s:ss) =
        if newSize > size then (res, s:ss) else go (s:res) newSize ss
      where
        newSize = chunkSize + length s

Changing go to avoid the accumulator we get:

chunksOfSize size = repeatedly $ go 0
  where
    go _         []     = ([], [])
    go chunkSize (s:ss) =
        if newSize > size then ([], s:ss) else first (s:) $ go newSize ss
      where
        newSize = chunkSize + length s

It is then reasonably easy to fix the singleton bug:

chunksOfSize size = repeatedly $ \(x:xs) -> first (x:) $ go (length x) xs
  where
    go _         []     = ([], [])
    go chunkSize (s:ss) =
        if newSize > size then ([], s:ss) else first (s:) $ go newSize ss
      where
        newSize = chunkSize + length s

Finally, it is slightly simpler to keep track of the number of characters still allowed, rather than the number of characters already produced:

chunksOfSize size = repeatedly $ \(x:xs) -> first (x:) $ go (size - length x) xs
  where
    go n (x:xs) | let n2 = n - length x, n2 >= 0 = first (x:) $ go n2 xs
    go n xs = ([], xs)

Now we have an alternative version that is maximally streaming, only applies length to each element once, and would work nicely in a strict language. I find the version at the top of this post more readable, but this version is a reasonable alternative.

Acknowledgements: Thanks to Andrey Mokhov for providing the repo, figuring out all the weird corner cases with ar, and distilling it down into a Haskell problem.

Monday, December 21, 2015

Solving the GCHQ puzzle "by hand"

The GCHQ 2015 Christmas puzzle is a Nonogram puzzle, which involves filling in squares on a grid to reveal a picture, guided by constraints on the rows and columns. For a computer, a nice way to solve this problem is using a SAT solver. But humans aren't great at SAT solving, and I was given a print-out of this puzzle while on holiday, with no computer. I'd never encountered such a puzzle before, so working with a friend (and some wine) we came up with an approach, and set about applying it. Alas, robustly applying an algorithm with many steps is not easy for a human, and we eventually ended up with contradictions. On returning from holiday, I automated our approach, and tested it. Our approach worked, and the code is below.

The Problem

The puzzle is:

It comprises a 25x25 grid, some filled in squares, and alongside each row/column are the number of consecutive squares that must be filled in each line. For example, the 8th row down must have two runs of three filled squares, with a gap in between, and potentially gaps before or after.

Our Approach

Our approach was to take each line and compute the number of "free" gaps - how many spaces could be inserted with choice. For one row (4 from the bottom) the entire grid is constrained, with no free gaps. Starting with the most constrained lines, we tried to figure out where the pieces could go, based on the existing squares. We quickly realised that negative information was important, so tagged each square with "don't know" (left blank), must be filled (we shaded it in) or must be unfilled (we drew a circle in it). For each line in isolation, looking at the constraints, we inferred squares to be filled or unfilled by examining the possible locations of each run.

The Code

I implemented our approach in Haskell, complete code is available here.

Our constraint system works over a grid where each square is in one of three states. We can encode the grid as [[Maybe Bool]]. The [[.]] is a list of lists, where the outer list is a list of rows, and the inner list is a list of squares. Each of the inner lists must be the same length, and for the GCHQ puzzle, they must all be 25 elements long. For the squares we use Maybe Bool, with Nothing for unknown and Just for known, using True as filled and False as unfilled.

Given the [[Maybe Bool]] grid and the constraints, our approach was to pick a single line, and try to layout the runs, identifying squares that must be True/False. To replicate that process on a computer, I wrote a function tile that given the constraints and the existing line, produces all possible lines that fit. The code is:

tile :: [Int] -> [Maybe Bool] -> [[Bool]]
tile [] xs = maybeToList $ xs ~> replicate (length xs) False
tile (c:cs) xs = concat [map (\r -> a ++ b ++ c ++ r) $ tile cs xs
    | gap <- [0 .. length xs - (c + sum cs + length cs)]
    , (false,xs) <- [splitAt gap xs]
    , (true,xs) <- [splitAt c xs]
    , (space,xs) <- [splitAt 1 xs]
    , Just a <- [false ~> replicate gap False]
    , Just b <- [true ~> replicate c True]
    , Just c <- [space ~> replicate (length space) False]]

The first equation (second line) says that if there are no remaining constraints we set all remaining elements to False. We use the ~> operator to check our desired assignment is consistent with the information already in the line:

(~>) :: [Maybe Bool] -> [Bool] -> Maybe [Bool]
(~>) xs ys | length xs == length ys &&
             and (zipWith (\x y -> maybe True (== y) x) xs ys)
           = Just ys
(~>) _ _ = Nothing

This function takes a line of the grid (which may have unknowns), and a possible line (which is entirely concrete), and either returns Nothing (inconsistent) or Just the proposed line. We first check the sizes are consistent, then that everything which is concrete (not a Nothing) matches the proposed value.

Returning to the second equation in tile, the idea is to compute how many spaces could occur at this point. Taking the example of a line 25 long, with two runs of size 3, we could have anywhere between 0 and 18 (25-3-3-1) spaces first. For each possible size of gap, we split the line up (the splitAt calls), then constrain each piece to match the existing grid (using ~>).

Given a way of returning all possible lines, we then collapse that into a single line, by marking all squares which could be either True or False as Nothing:

constrainLine :: [Int] -> [Maybe Bool] -> Maybe [Maybe Bool]
constrainLine cs xs = if null xs2 then Nothing else mapM f $ transpose xs2
    where xs2 = tile cs xs
          f (x:xs) = Just $ if not x `elem` xs then Nothing else Just x

If there are no satisfying assignments for the line, we return Nothing - that implies the constraints are unsatisfiable. Next, we scale up to a side of constraints, by combining all the constraints and lines:

constrainSide :: [[Int]] -> [[Maybe Bool]] -> Maybe [[Maybe Bool]]
constrainSide cs xs = sequence $ zipWith constrainLine cs xs

Finally, to constrain the entire grid, we constrain one side, then the other. To simplify the code, we just transpose the grid in between, so we can treat the rows and columns identically:

constrainGrid :: [[Int]] -> [[Int]] -> [[Maybe Bool]] -> Maybe [[Maybe Bool]]
constrainGrid rows cols xs =
    fmap transpose . constrainSide cols .
    transpose =<< constrainSide rows xs

To constrain the whole problem we apply constrainGrid repeatedly, until it returns Nothing (the problem is unsatisfiable), we have a complete solution (problem solved), or nothing changes. If nothing changes then there might be two solutions, or our approach might not be powerful enough without using search.

The Result

After four iterations we end up with a fully constrained answer. To see the progress, after one iteration we have:

..XXXXX...X.OO..X.XXXXX..
..OOOOX.X.O.....O.XOOOO..
..XXX.X....O...OX.X.XX...
X.XXX.X....XXXXXX.X.XX...
X.XXX.X..XXXX..XX.X.XX..X
X.OOOOX...XO...OO.XOOO..X
XXXXXXXOXOXOXOXOXOXXXXXXX
..OOO.OO..XOOOX.XOOOOO.O.
..XX..XX.OXOXOXXXOXO...X.
..XO..OO....OXX.O.O....X.
..X...X......X..X......O.
..O...O......XO.X........
..XX..X.X....O.OO.X......
..OXX.O.X....XXXX.X......
..XX..XXXXX..O.OO........
..X...O.X..O..O.X...O....
..X...X.X.OXO.O.X...X....
..OOO.O.X..O..O.X...X..X.
X.XXXXX.......O.X...X..X.
X.OOO.X.....XOO.X...X..X.
X.XXX.X.....XXOOX...X...O
XOXXXOXOXXXOXXXXXXXXXXOXX
..XXX.X.....XXXXX..XXXX.O
..OOOOX......OOOO...O.X..
..XXXXX......XOXX.O.X.X..

Here a . stands for Nothing. After four iterations we reach the answer in a total of 0.28s:

XXXXXXXOXXXOOOXOXOXXXXXXX
XOOOOOXOXXOXXOOOOOXOOOOOX
XOXXXOXOOOOOXXXOXOXOXXXOX
XOXXXOXOXOOXXXXXXOXOXXXOX
XOXXXOXOOXXXXXOXXOXOXXXOX
XOOOOOXOOXXOOOOOOOXOOOOOX
XXXXXXXOXOXOXOXOXOXXXXXXX
OOOOOOOOXXXOOOXXXOOOOOOOO
XOXXOXXXOOXOXOXXXOXOOXOXX
XOXOOOOOOXXXOXXOOOOXOOOXO
OXXXXOXOXXXXOXXOXOOOOXXOO
OXOXOOOXOOOXOXOXXXXOXOXXX
OOXXOOXOXOXOOOOOOXXOXXXXX
OOOXXXOXXOXXOXXXXXXOXXXOX
XOXXXXXXXXXOXOXOOXXOOOOXO
OXXOXOOXXOOOXXOXXXOOOOOXO
XXXOXOXOXOOXOOOOXXXXXOXOO
OOOOOOOOXOOOXXOXXOOOXXXXX
XXXXXXXOXOOXXOOOXOXOXOXXX
XOOOOOXOXXOOXOOXXOOOXXOXO
XOXXXOXOOOXXXXOOXXXXXOOXO
XOXXXOXOXXXOXXXXXXXXXXOXX
XOXXXOXOXOOXXXXXXOXXXXXXO
XOOOOOXOOXXOOOOOOXOXOXXOO
XXXXXXXOXXOOOXOXXOOOXXXXX

Update: On the third attempt, my friend managed to solve it manually using our technique, showing it does work for humans too.

Wednesday, December 09, 2015

MinGHC is Dead, Long Live Stack

Summary: The MinGHC project has now finished.

The MinGHC project was started to produce a minimal Windows installer which didn't contain many packages, but which could install many packages - in particular the network package. But time has moved on, and Stack now offers everything MinGHC does, but cross-platform and better. To install GHC using Stack, just do stack setup, then stack exec -- my command. Even if you prefer to use Cabal, stack exec -- cabal install hlint is a reasonable approach.

A few final remarks on MinGHC:

MinGHC was an experiment started by Michael Snoyman, which myself (Neil Mitchell) and Elliot Cameron joined in with. I had fun working with those guys.
The ideas and approaches behind MinGHC got reused in Stack, so the knowledge learnt has transferred.
The MinGHC project involved a Shake build system coupled to an NSIS EDSL installer. I think the technologies for building the installer worked out quite nicely.
The existing MinGHC installers are not going away, but we won't be making any changes, and probably won't upload new versions for GHC 7.10.3.
It's possible to build Windows installers on a Linux box using a Wine version of NSIS. Very cool.
I find maintaining such fundamental pieces of an ecosystem, involving installation and system configuration, to be much less fun than just writing Haskell code. Kudos to the Stack and Cabal guys.

Tuesday, December 08, 2015

What's the point of Stackage LTS?

Summary: Stackage LTS is mostly pointless.

Stackage provides a set of precise versions of a subset of Hackage which all place nicely together. At any one time, there are two such version sets:

Nightly, which aims to be the latest version of all packages.
LTS (Long Term Support), which with every major release updates every package, and every minor release updates packages that only changed their minor version.

Stack currently defaults to LTS. I don't think LTS fulfils any need, and does cause harm (albeit very mild harm), so should be abandoned. Instead, people should use a specific nightly that suits them, and upgrade to later nightlies on whatever schedule suits them. I share these somewhat half-baked thoughts to hopefully guide how I think Stackage should evolve, and there may be good counterarguments I haven't thought of.

Why is Nightly better?

Nightly always has newer versions of packages. I believe that newer versions of packages are better.

As a package author, with each new release, I fix bugs and evolve the library in ways which I think are beneficial. I make mistakes, and new versions are how I fix them, so people using old versions of my software are suffering for my past mistakes.
As a package user, if I find a bug, I will endeavour to report it to the author. If I'm not using the latest version, it's highly likely the author will ask me to upgrade first. If a feature I want gets added, it's going to be to the latest version. The latest version of a package always has better support.
If the latest version of a package does not suit my needs, that's an important alarm bell, and I need to take action. In some cases, that requires finding a fork or alternative package to do the same thing. In some cases, that involves talking to the author to make them aware of my particular use case. In either case, doing this sooner rather than later makes it easier to find a good solution.

What's the benefit of LTS?

As far as I am aware, LTS major releases are just Nightly at a particular point in time - so if you only ever use x.0 LTS, you might as well just use Nightly. The main benefit of LTS comes from picking a major LTS version, then upgrading to the subsequent minor LTS versions, to access new minor versions of your dependencies.

Upgrading packages is always a risk, as LTS Haskell says. What LTS minor releases do are minimize the risk of having to fix compilation errors, at the cost of not upgrading to the latest packages. However, the reasons I do upgrades are:

Sometimes, if one package has a known bug that impacts me, I specifically upgrade just that one package. I don't want to upgrade other packages (it introduces unnecessary risk), so I take my package set (be it LTS or Nightly) and replace one constraint.
Every so often, I want to upgrade all my packages, so that I'm not missing out on improvements and so I'm not falling behind the latest versions - reducing the time to upgrade in future. I do that by picking the latest version of all packages, fixing breakages, and upgrading.
When upgrading, compile-time errors are a minor issue at worst. Before Stackage, my major headache was finding a compatible set of versions, but now that is trivial. Fixing compile-time errors is usually a little work, but fairly easy and predictable. Checking for regressions is more time consuming, and running the risk of regressions that escape the test suite has a significant cost. Tracking down if there are any resulting regressions is very time-consuming.

I don't see a use case for upgrading "a little bit", since I get a lot less benefit for only marginally less work.

Why is LTS better?

When I asked this question on Twitter, I got two reasons that did seem plausible:

Matt Parsons suggested that LTS provided a higher likelihood of precompiled binaries shared between projects. That does make sense, but a similar benefit could be achieved by preferring Nightly on the 1st on the month.
Gabriel Gonzalez suggested that when including a resolver in tutorials, LTS looks much better than a Nightly. I agree, but if there only was Nightly, then it wouldn't look quite as bad.

If Haskell packages had bug fixes that were regularly backported to minor releases, then the case for an LTS version would be stronger. However, my experience is that fixes are rarely backported, even for the rare security issues.

Is this a problem?

Currently everyone developing Haskell has two choices, and thanks to Stack, might not even be aware which one they've ended up making (Stack will pick nightly on its own if it thinks it needs to). Reducing the choices and simplifying the story removes work for the Stackage maintainers, and cognitive burden from the Stackage users.

Stackage was a radical experiment in doing things in an entirely different way. I think it's fair to say it has succeeded. Now, after experience using the features, I think it's right to critically look at tweaks to the model.