Neil Mitchell's Blog (Haskell etc)

Monday, May 20, 2019

Hoogle XSS Vulnerability

Summary: Hoogle 5.0.17.6 and below have an XSS vulnerability, fixed in later versions.

On Friday afternoon I got an email from Alexander Gugel with the subject line "Non-persistent XSS vulnerability on hoogle.haskell.org" - never a good thing to get. He had found that Hoogle was echoing the user search string back into the page, meaning that if you searched for %27"><marquee style you could make all the results scroll past in a disturbingly hypnotic manner. Oh dear!

Step 1: Fix the website

The first concern was to fix the website. While there aren't any cookies stored by Hoogle, and there are no logon forms or similar, the Project Zero blog has taught me that given the tiniest chink, everything can be broken. Fortunately, Alex emailed me using the email address on my webpage, described the problem, and provided a 3 line diff that escaped all the problematic variables. I applied this fix and pushed a new version to hoogle.haskell.org.

Step 2: Use the type system

Like any good Haskeller, my first thought on encountering a bug is to use the type system to prevent it by construction. The problem boils down to taking user input and splicing it into an HTML page. My initial fix was to introduce a type Taint:

newtype Taint a = Taint a

escapeUntaint :: Taint String -> String
escapeUntaint (Taint x) = escapeHTML x

The idea is that instead of the query parameters to the web page being String's that can be carelessly spliced into the output, they were Taint String values whose only real unwrapping function involves escaping any HTML they may contain. Furthermore, Taint can have instances for Monad etc, meaning you can work on tainted values, but the result will always remain tainted.

Using this approach uncovered no additional problems, but gave me much more confidence there weren't any I just hadn't found.

Step 3: Make a release

At this point I made a release of Hoogle 5.0.17.7. This version has no known XSS issues with it.

Step 4: Switch to blaze-html

While Taint is an effective tool for some domains, the real problem for Hoogle was that I was building up HTML values using String - making it way too easy to create invalid HTML, and providing an easy attack vector. The next change was to switch to blaze-html, which uses strong typing to ensure the HTML is always valid. Instead of having to call escapeHTML to turn bad String into good String, I instead used H.string to turn bad String into good Markup. For the rare case where there genuinely was String that contained HTML for good reasons I used H.preEscapedString, making the "don't escape" explicit and longer, and the "do escape" the default - a much safer default.

Step 5: Use Content Security Policy headers

There are a whole suite of headers that can be returned by the server to opt in to additional checking, known as CSP headers. These headers can ban inline script, detect XSS attacks, avoid confusion with MIME types, avoid http downgrade attacks and more. Thanks to Gary Verhaegen many of these are now applied to Hoogle, meaning that even if my code is wrong, the chances of it causing any damange (even just hypnotic scrolling) are much reduced.

Step 6: Relax

Hoogle 5.0.17.8 has all the security fixes listed and is deployed to hoogle.haskell.org. Hopefully no more security issues for a while!

Many thanks to Alexander Gugel for the responsible disclosure, and to Gary Verhaegen for his work on CSP headers.

Tuesday, May 14, 2019

Shake with Applicative Parallelism

Summary: Shake now does that Applicative trick from Haxl.

In Shake 0.17.9 and below, need xs >> need ys builds xs in parallel, then afterwards builds ys in parallel. The same is true of need xs *> need ys, where *> is the applicative equivalent of >>. From Shake 0.18 onwards both versions run everything in parallel. Hopefully that makes some Shake-based build systems go faster.

What change is being made?

If you make two calls to apply without any IO, monadic-bind or state operations in between then they will be executed as though you had made a single call to apply. As examples, need, askOracle and getDirectoryFiles are all calls to apply under the hood, so can be merged. However, note that the invariants are somewhat subtle. Something as simple as:

myNeed xs = do putNormal "Needing here"; need xs

Will not be merged with a preceeding need - the function putNormal queries the state (what is the verbosity level), does IO and contains a monadic bind.

Why are you making this change?

I am making the change for two reasons: 1) people have kept asking for it since Haxl does it; 2) the Hadrian build probably benefits from it. The downsides are relatively low (more complexity inside Shake, slightly slower Action operations) but the benfits are potentially large.

Why didn't you make this change sooner?

My previous reasoning for not making the change was:

Shake could follow the Haxl approach, but does not, mainly because they are targeting different problems. In Haxl, the operations are typically read-only, and any single step is likely to involve lots of operations. In contrast, with Shake the operations definitely change the file system, and there are typically only one or two per rule. Consequently, Shake opts for an explicit approach, rather than allow users to use *> (and then inevitably add a comment because its an unusual thing to do).

I stand by that comment - explicit grouping of need or explicit use of parallel is often better - all it takes is a sneaky >>= and the parallelism disappears. But if this change improves some build times, it's hard to argue strongly against.

Will it break any build systems?

Potentially, but unlikely, and those it will break were already on thin ice. As some examples:

If a depends on some state change from b (e.g. creating a directory), but doesn't have a dependency on it, then need [a] >> need [b] might have worked, while need [a,b] might not. The correct solution is for a to depend on b, if it does in fact depend on b, or at the very least use orderOnly.
If you use getDirectoryFiles on generated files (something the documentation says is a bad idea) then if merged with the thing that generates the files you will get incoherent results. The solution is to avoid using getDirectoryFiles on generated files.

Thanks to Pepe Iborra for encouraging, testing and troubleshooting this change.

Monday, April 29, 2019

foldr under the hood

Summary: The foldr function seems simple, but is actually very complex, with lots of layers. This post dives through the layers.

The foldr function takes a list and replaces all : (cons) and [] (nil) values with functions and a final value. It's available in the Haskell Prelude and described on Wikipedia. As some examples:

sum = foldr (+) 0
map f = foldr (\x xs -> f x : xs) []

But the simple foldr described on Wikipedia is many steps away from the one in the Haskell Prelude. In this post we'll peel back the layers, learning why foldr is a lot more complicated under the hood.

Layer 1: Wikipedia definition

The definition on Wikipedia is:

foldr :: (a -> b -> b) -> b -> [a] -> b
foldr f z []     = z
foldr f z (x:xs) = f x (foldr f z xs)

This recursive definition directly describes what foldr does. Given a list [1,2,3] we get f 1 (f 2 (f 3 z)).

Layer 2: Static argument transformation

The problem with this definition is that it is recursive, and GHC doesn't like to inline recursive functions, which prevents a lot of optimisation. Taking a look at sum, it's a real shame that operations like (+) are passed as opaque higher-order functions, rather than specialised to the machine instruction ADD. To solve that problem, GHC defines foldr as:

foldr f z = go
    where go []     = z
          go (x:xs) = f x (go xs)

The arguments f and z are constant in all sucessive calls, so they are lifted out with a manually applied static argument transformation.

Now the function foldr is no longer recursive (it merely has a where that is recursive), so foldr can be inlined, and now + can meet up with go and everything can be nicely optimised.

Layer 3: Inline later

We now have foldr that can be inlined. However, inlining foldr is not always a good idea. In particular, GHC has an optimisation called list fusion based on the idea that combinations of foldr and build can be merged, sometimes known as short-cut deforestation. The basic idea is that if we see foldr applied to build we can get rid of both (see this post for details). We remove foldr using the GHC rewrite rule:

{-# RULES "my foldr/build" forall g k z. foldr k z (build g) = g k z #-}

The most interesting thing about this rule (for this post at least!) is that it matches foldr by name. Once we've inlined foldr we have thrown away the name, and the rule can't fire anymore. Since this rule gives significant speedups, we really want it to fire, so GHC adds an extra pragma to foldr:

{-# INLINE [0] foldr #-}

This INLINE pragma says don't try and inline foldr until the final stage of the compiler, but in that final stage, be very keen to inline it.

Layer 4: More polymorphism

However, the foldr function in the Prelude is not the one from GHC.List, but actually a more general one that works for anything Foldable. Why limit yourself to folding over lists, when you can fold over other types like Set. So now foldr is generailsed from [] to t with:

foldr :: Foldable t => (a -> b -> b) -> b -> t a -> b

Where foldr on [] is GHC.List.foldr.

Layer 5: A default implementation

But foldr is actually in the type class Foldable, not just defined on the outside. Users defining Foldable can define only foldr and have all the other methods defined for them. But they can equally define only foldMap, and have an implicit version of foldr defined as:

foldr :: (a -> b -> b) -> b -> t a -> b
foldr f z t = appEndo (foldMap (Endo . f) t) z

Where Endo is defined as:

newtype Endo = Endo {appEndo :: a -> a}

instance Monoid (Endo a) where
    mempty = Endo id
    Endo a <> Endo b = Endo (a . b)

The function foldMap f is equivalent to mconcat . map f, so given a list [1,2,3] the steps are:

First apply map (Endo . f) to each element to get [Endo (f 1), Endo (f 2), Endo (f 3)].
Next apply mconcat to the list to get Endo (f 1) <> Endo (f 2) <> Endo (f 3).
Inline all the <> definitions to get Endo (f 1 . f 2 . f 3).
Apply the appEndo at the beginning and z at the end for (f 1 . f 2 . f 3) z.
Inline all the . to give f 1 (f 2 (f 3 z)), which is what we had at layer 1.

Layer 6: Optimising the default implementation

The real default implementation of foldr is:

foldr f z t = appEndo (foldMap (Endo #. f) t) z

Note that the . after Endo has become #.. Let's first explain why it's correct, then why it might be beneficial. The definition of #. is:

(#.) :: Coercible b c => (b -> c) -> (a -> b) -> (a -> c)
(#.) _ = coerce

Note that it has the same type as . (plus a Coercible constraint), but ignores it's first argument entirely. The coerce function transforms a value of type a into a value of type b with zero runtime cost, provided they have the same underlying representation. Since Endo is a newtype, that means Endo (f 1) and f 1 are implemented identically in the runtime, so coerce switches representation "for free". Note that the first argument to #. only serves to pin down the types, so if we'd passed an interesting function as the first argument it would have been ignored.

Of course, in normal circumstances, a newtype is free anyway, with no runtime cost. However, in this case we don't have a newtype, but a function application with a newtype. You can see the gory details in GHC ticket 7542, but at one point this impeeded other optimisations.

I tried a simulated version of foldr and found that if GHC can't tell that [] is the Foldable the code looks pretty bad, but if it can, at -O1 and above, the two implementations are 100% equivalent (to the point that common subexpression elimination makes them actually the same). It's possible this final layer is a vestigial optimisation, or perhaps it's still important in some circumstances.

Tuesday, April 16, 2019

Code Statistics and Measuring Contributions

Summary: The only way to understand a code base is to ask someone who works on it.

This weekend a relative asked me how can we tell who wrote the code behind the black hole image, and was interested in the stats available on GitHub. There are lots of available stats, but almost every stat can be misleading in some circumstances. The only people who have the context to interpret the stats are those who work on the project, hence my default approach to assessing a project is to ask someone who works on it, with the understanding that they may look at relevant stats on GitHub or similar. In this post lets go through some of the reasons that a simplistic interpretation of the stats is often wrong.

These remarks all apply whether you're trying to assign credit for a photo, trying to do performance reviews for an employee or trying to manage a software project.

What to measure

There are broadly two ways to measure activity on the code in a repo. The first is additions/deletions of lines of code, where a modified line of code is usually measured as an addition and deletion. The second is number of commits or pull requests, which measures how many batches of changes are made. The problem with the latter is that different people have different styles - some produce big batches, some tiny batches - a factor of 10 between different developers is not unusual. There are also policy reasons that commits may be misleading - some projects squash multiple commits down to one when merging. The number of lines of code gives a better measure of what has changed, but it's merely better, not good - the rest of this post assumes people are focusing on number of lines of code changed.

All code is equal

Treating number of lines changed as the contribution assumes that every line is equally hard - but that's far from the case. At a previous company I worked on code that ranged from the internals of a compiler, to intricate C++ libraries, to Haskell GUI's. I estimate that I could produce 100x the volume of Haskell GUI's compared to C++ libraries. Other colleagues worked only only on the compiler, or only on GUIs - vastly changing how much code they produced per hour.

Similarly, each line of code is not equally important. Last week I wrote a few 100 lines of code. Of those, nearly all were done on Monday, and the remainder of the week involved a single line that is ferociously difficult with lots of obscure side conditions (libraries and link order...). That one line is super valuable, but simplistic measuring suggests I napped all Tuesday and Wednesday.

Code is attributed properly

Developers typically have user handles or email addresses that are used for code contributions. I currently have at least two handles, and in the past when we did stats on a $WORK project there were 6 different email addresses that I claimed ownership of. As a consequence, my work shows up under lots of different names, and counting it can be difficult. The longer a project runs, the more chance of developers changing identity.

The person who changed code did the work

A big part of software engineering is making old code obsolete. I was recently involved in deleting many thousands of lines that was no longer necessary. With a small team, we created a new project, implemented it, converted 90% of the uses over to the new code, and then stopped. Separately, someone else did the last 10% of the conversion, and then was able to delete a huge number of lines of code. There was definitely work in deleting the final bit of code, but the "labour" involved in that final deletion was mostly carried out months ago by others.

Similarly, when copying a new project in (often called vendoring) there is a big commit to add a lot of new code that was originally written by others, but which gets attributed to a named individual.

All code is in one place

Often projects will incorporate library code. For example, the official contribution of Niklas Broberg to HLint is 8 lines. However, he's called out explicitly in the README as someone whose contributions were essential to the project. In this case, because he wrote a library called haskell-src-exts without which HLint could not exist, and then continued to improve it for the benefit of HLint for many years.

Furthermore, projects like HLint rely on a compiler, libraries, operating system, and even a version control system. Usually these get overlooked when giving credit since they are relatively old and shared between many projects - but they are an essential part of getting things to work.

More code is better

The only purpose of code is to do a thing - whatever that thing might be. In all other ways, code is a liability - it has to be read, tested, compiled etc. Given the choice between 10 lines or 1,000,000 lines of code, I would always prefer 10 lines if they did the same thing. A smarter programmer who can do more with less lines of code is better. The famous quote attributed to Bill Gates is still true many decades later:

Measuring programming progress by lines of code is like measuring aircraft building progress by weight.

Code is the essence

Measuring code suggests that code is the thing that matters. The code certainly does matter, but the code is just a representation of an underlying algorithm. The code follows a high-level design. Often much more significant contributions are made by picking the right design, algorithm, approach etc.

Code is all that matters

In a large project there is code, but the code doesn't exist in a vacuum. There are other code-adjacent tasks to be performed - discussions, mentoring, teaching, reporting issues, buying computers etc. Many of these are never reflected in the code, yet if omitted, the code wouldn't happen, or would happen slower.

Sunday, April 07, 2019

Code Review: Approve with Suggestions

Summary: Code review is not a yes/no decision - mostly I say yes with suggestions.

As I wrote previously, I didn't used to be a fan of code review for $WORK code, but now I am. After I review some code there are three responses I might give:

"Request changes" - this code has some fatal flaw. Maybe I saw a race condition. Maybe there's insufficient testing. Maybe it's just a bad idea. Please fix it or convince me I'm wrong and I'll review again.
"Approved" - this code is great. Let's merge it. If the CI has already passed, I'll probably merge it now myself.
"Approved with suggestions" - this code is fine, I'm happy for it to be merged, but I thought of a few ways to make it better.

I think I use "Approved with suggestions" about 80% of the time. To use this status I think the code is correct, readable, and will have no negative effects - I'm happy for it to be merged. At the same time, I can think of a few ways to improve it - e.g. using some utility function, simplifying things a bit, making the documentation clearer. If the original author disagrees with me, I'm not going to bother arguing. I'm happy for these commits to be in a follow up PR, or pushed on top of this PR, whatever suits them.

What's different about approved with suggestions, at least for trusted individuals (e.g. colleagues), is that if they make these tweaks I have no real interest in rereviewing. I'm happy for my approval to remain sticky and for them to seek out rereview only if they think they need it. Importantly, this guideline is consistent with my reasons for reviewing. After the first review, if the code doesn't change meaningfully, a rereview offers none of the benefits that make me want to review in the first place.

Since the programming language DAML that I work on is now open source I can point at a concrete example using a pull request to our GHC fork. Here Shayne added a function:

qualifyDesugar :: (String -> OccName) -> String -> RdrName
qualifyDesugar occName =
  (mkRdrQual $ mkModuleName "DA.Internal.Desugar") . occName

It matches the stated intention, but it seems to do a bit too much - it turns a String into an OccName using a supplied function, when it could have just taken the OccName directly. Simpler, better, more maintainable. So I suggested:

qualifyDesugar :: OccName -> RdrName
qualifyDesugar = mkRdrQual $ mkModuleName "DA.Internal.Desugar"

A nice improvement. However, Shayne is responsible enough to make such simple tweaks that it didn't require another review. A typical case of Approve with suggestions.

Monday, March 18, 2019

GHC Rebuild Times - Shake profiling

Summary: GHC rebuild times are slow. Using the Shake profiler and Hadrian we can find out why.

I just checked out GHC, and using Hadrian (the Shake-based build system), built GHC on Windows using:

hadrian\build.stack.bat -j --flavour=quickest --integer-simple --configure --profile

Namely use stack to install dependencies and build Hadrian itself, then compile as quick as I can get it, on all CPUs (8 on my machine), run configure for me and a profile report.html output. After compiling Hadrian, 40m54s later I had a built GHC. That's not a quick process! Why did it take so long? If I bought a better machine would it go faster? How might we optimise GHC? These questions and more can be answered with Shake profiling.

Shake has had profiling for years, but in the recently-released Shake 0.17.7 I've overhauled it. The profile report is generated as a web page, and the generated output in the new version is smaller (2x smaller), loads faster (100x or more) and is more intuitive (not really a numeric thing). In the rest of this post I'll pepper some screenshots from the Shake profiler without thoughts about what it could mean. I've also uploaded the profile so you can play around with it:

Hadrian Profile

Summary Page

The first page you see when opening the report is the summary.

This page gives us some basic stats. There was 1 run of the build system. It ran 3,645 traced actions (e.g. command line calls or other expensive actions) and there were 15,809 rules run (where a rule is something with dependency information - somewhere between one third to two thirds of those are likely to be files in typical build systems).

Turning to performance, the entire build, on one CPU would take 2h26m. The build on my 8 CPU machine took 40m45s, with on average 3.58 commands going at once (quite a bit less than the 8 I wanted). The critical path is about 37m14s, so that's the lower bound with infinite CPUs, so buying a machine with more CPUs won't really help (faster CPUs and a faster disk probably would though).

OK, so I'm now unhappy that GHC doesn't execute enough in parallel. So let's see what it does manage to do in parallel by switching to the Command Plot.

Command Plot

We now see a graph of what was executing at each point during the build. We see spikes in a hideous light blue for GHC, showing that when GHC gets going, it can get near to the 8 CPUs we requested. However, we see lots of periods of time with only 1 task executing. In most cases these are either sh at the start (which I happen to know is configure), or cabal-configure (which is more obviously configure). However, there are also Haskell blips where we get down to 1 CPU. I'm now increasingly convinced that the biggest problem Hadrian has (performance wise) is lack of parallelism. To confirm that, let's switch to the Parallelizability tab.

Parallelizability

This next tab predicts how long it will take to build Hadrian at various different thread counts. The green line is if there were no dependencies, the blue line is with the dependencies we have, and the yellow line is the difference. As we can see, at 8 CPU's the difference is about 16m - I could have had my GHC a whole 16m faster if we could parallelise GHC more. At the same time, it looks like the sweet spot for compiling GHC is currently around 6 CPUs - more doesn't make a huge amount of difference. How sad. To investigate let's jump to the Rules tab.

Rules

Now we've moved on from pretty graphs to tables of rules. The most interesting columns for performance work are Time (how long something took), ETime (how long it took if you only pay for the fraction of the computer you are using) and WTime (how long you were the only thing running for). The first is most relevant if you want to take less CPU, the second two if you aren't hitting the parallelism you are hoping for. Since we aren't hitting the parallelism, we can sort by WTime.

For WTime, if we eliminated that step, the total build would improve by that amount of time. Looking at the first two entries, which are the initial configure and then configure of the base library, we see a total of 8m38s. If we could get rid of configure, or speed it up, or interleave it with other operations, we could save almost 10 minutes off a GHC build. Using the search bar we can figure out how much configure costs us in total.

Now we have used the search bar to filter to only rules that run the command cabal-configure or sh, and we've named them all in the group configure (so it sums them up for us). We see we spend 15m18s configuring, and would go about 10m faster if we didn't - so it's expensive, and serialises the build a lot. I hate configure.

Slow Stage0 Compilation

Ignoring configure, the next slow things are building the stage0 compiler, so let's focus in on that.

While we can use the search bar for entering JavasScript expressions, we can equally just enter substrings. Let's delve into /compiler/ and sort by Time. We quickly see a bunch of stage0 and stage1 compiles, with HsInstances.o and DynFlags.o right at the top. Those files take a really long time to compile, and end up serialising the build quite a bit. However, it's a bit odd that we see stage0, but a lot less of stage1. Let's compare the two stages:

Now we're using a regular expression to pull out the .o compiles in compiler, and group them by their stage. We can see that both involve 1,527 compiles, but that stage0 takes 43m, while stage1 is under 18m. Why? Either we're using different sets of flags (e.g. compiling stage0 with higher optimisations or warnings), or GHC HEAD (the output of stage0 which we use to compile stage1) is significantly faster than GHC 8.6.3 (which I used to compile stage0). I've no idea, but tracking down the difference could save at least 7 minutes on the wall clock time of building GHC.

Conclusion

Compiling GHC is slow, but the biggest problem is it doesn't parallelise well. Using Shake profiling we've found that configure and stage0 are the culprits. There's lots to be done, and hopefully a Summer of Code project too.

Sunday, February 17, 2019

Quadratic "deriving Generic" Compile Times

Summary: For large data types, deriving Generic can take a long time to compile.

I was building GHC using Hadrian, and part of that process involves compiling Cabal multiple times - once to build Hadrian itself, and then once per GHC stage thereafter (typically 3 stages). Compiling Cabal takes quite a long time, but one thing I spotted was that compiling the LicenseId file alone was taking 30 seconds! Given that it seemed to be on the critical path, and usually gets compiled 4 times, that seemed worth investigating.

Looking at the file, it defines an enumeration with 354 license types in it. Pulling apart the file, I found that with only the data type and deriving Generic it was taking about 9 seconds without optimisation or 22 seconds with optimisation, using GHC 8.6.3. I then wrote a test program that generated and compiled instances of the program:

{-# LANGUAGE DeriveGeneric #-}
module Sample where
import GHC.Generics
data Sample = Ctor1 | Ctor2 | ...
   deriving Generic

The resulting graph shows that compilation time is quadratic in the number of constructors:

This phenomenon could either be because the deriving Generic itself is quadratic, or more likely that the output generated by deriving Generic provokes quadratic behaviour in some subsequent part of the compiler. I'm going to report this as a GHC bug.

Update: There's already GHC bug 5642 which explains that it might just be a fundamental property of how Generic instances look.

For Cabal the fix might be somewhat simpler - the deriving Generic is only used to define a Binary instance, which can be written in 2 lines for an Enum anyway - discussion on the issue tracker. Hopefully this change will speed up compiling Cabal, and thus speed up compiling GHC.

As I watched the GHC output go by I noticed that both Parser and DynFlags in GHC itself were also very slow, with the latter taking nearly 2 minutes. I'm sure there's more scope for optimising GHC builds, so Andrey Mokhov and I have proposed a Google Summer of Code project to tackle other low-hanging fruit.

The quick program I wrote for generating the data in this blog post is below:

import Control.Monad
import Data.List
import System.Time.Extra
import System.Process.Extra
import Numeric.Extra

main = do
    forM_ [1..] $ \n -> do
        writeFile "Sample.hs" $ unlines $
            ["{-# LANGUAGE DeriveGeneric #-}"
            ,"module Sample where"
            ,"import GHC.Generics"
            ,"data Sample = " ++ intercalate " | " ["Ctor" ++ show i | i <- [1..n]]
            ,"   deriving Generic"
            ]
        xs <- forM [0..2] $ \o ->
            fmap fst $ duration $ systemOutput_ $ "ghc -fforce-recomp Sample.hs -O" ++ show o
        putStrLn $ unwords $ show n : map (showDP 2) xs

Tuesday, February 05, 2019

Announcing ghc-lib

On behalf of Digital Asset I'm delighted to announce ghc-lib, a repackaging of the GHC API to allow it to be used on different GHC versions. The GHC API allows you use the GHC compiler as a library, so you can parse, analyze and compile Haskell code.

The GHC API comes pre-installed with GHC, and is tied to that GHC version - if you are using GHC 8.6.3, you get version 8.6.3 of the API, and can't change it. The ghc-lib package solves that problem, letting you mix and match versions of the GHC compiler and GHC API. Why might you want that?

Imagine you are writing a tool to work with several versions of the GHC compiler. The GHC API changes significantly between each version, so doing this would require writing a lot of C preprocessor code to support it. An alternative is to use one version of ghc-lib which works across multiple versions of GHC.
Imagine you are modifying the GHC API or want features from GHC HEAD. With ghc-lib you can depend on the revised GHC API, without upgrading the compiler used to build everything, speeding up iteration.

While ghc-lib provides the full GHC API, it doesn't contain a runtime system, nor does it create a package database. That means you can't run code produced by ghc-lib (no runtime), and compiling off-the-shelf code is very hard (no package database containing the base library). What you can do:

Parse Haskell code, making ghc-lib a potential replacement for haskell-src-exts. See the demo mini-hlint in the ghc-lib repo;
Compile Haskell code as far as GHC's Core language, which includes renaming and type checking. See the demo mini-compile in the ghc-lib repo, and the carefully tailored file it compiles.

The package ghc-lib is released on Hackage, and can be used like any normal package, e.g. cabal install ghc-lib. Since ghc-lib conflicts perfectly with the GHC API and template-haskell, you may wish to ghc-pkg hide ghc-lib and use the language extension PackageImports to do import "ghc-lib" GHC. There will be two release streams within the ghc-lib name:

Version 8.8.1 will be the version of ghc-lib produced against the released GHC 8.8.1, once GHC 8.8.1 is released. There is no release against GHC 8.6.3 because we had to make changes to GHC to enable ghc-lib, which were only upstreamed in the last few months.
Version 0.20190204 is the version of ghc-lib using GHC HEAD on the date 2019-02-04.

We've been developing and using ghc-lib internally at Digital Asset for the last six months. The use of ghc-lib has enabled us to track GHC HEAD for certain projects, and develop improvements to GHC itself, and then integrate them without requiring us to rebuild all Haskell source code on every step. Smoothing out that development loop has been a massive productivity boon to us.

While this is Digital Asset's first open source project in a while, we have been making lots of contributions behind the scenes - it's no coincidence several of my recent posts involve my colleague Shayne. In particular, our engineers have been behind several GHC proposals.

While I'm announcing the project, much of the work has been done by Shayne Fletcher and Jussi Maki, but neither of them have active blogs just yet!

Friday, January 25, 2019

HLint Unused Extension Hints

Summary: HLint detects unused extensions in LANGUAGE pragmas, including over 17,000 on Hackage.

HLint has detected unused LANGUAGE pragmas for a while - where you enable an extension (e.g. {-# LANGUAGE EmptyDataDecls #-}) but don't use it. HLint v2.1.13 includes some improvements from Yair and myself making these hints even more powerful. As a result, I thought it worth showing some examples of what HLint can do in this area. I started by running HLint on all of Hackage, which found 17,718 "Unused LANGUAGE pragma" hints, including the examples in this post.

Detecting unused extensions

For extensions that show up as syntax (e.g. EmptyDataDecls, ViewPatterns etc), HLint has rules saying which constructs require which extensions. For extensions that aren't syntax directed (e.g. AllowAmbiguousTypes or IncoherentInstances), HLint can't detect whether they are used or not. In all, HLint has rules for detecting 36 different unused extensions. Taking a look at some examples from Hackage:

abcBridge-0.15\src\Data\ABC\AIG.hs:18:1: Warning: Unused LANGUAGE pragma
Found:
  {-# LANGUAGE EmptyDataDecls #-}
Perhaps you should remove it.

mallard-0.6.1.1\lib\Database\Mallard\Validation.hs:4:1: Warning: Unused LANGUAGE pragma
Found:
  {-# LANGUAGE TemplateHaskell #-}
Perhaps you should remove it.

scholdoc-texmath-0.1.0.1\src\Text\TeXMath\Writers\TeX.hs:1:1: Warning: Unused LANGUAGE pragma
Found:
  {-# LANGUAGE GeneralizedNewtypeDeriving, ViewPatterns, GADTs #-}
Perhaps:
  {-# LANGUAGE GeneralizedNewtypeDeriving, GADTs #-}

As we can see, HLint can spot entirely redundant extension declarations, and also prune those that are partly redundant.

Duplicate extensions

Sometimes extension are simply duplicated, and HLint detects these, either between two separate pragmas, or within a single pragma.

ghcjs-base-stub-0.2.0.0\src\GHCJS\Marshal\Pure.hs:3:1: Warning: Use fewer LANGUAGE pragmas
Found:
  {-# LANGUAGE DefaultSignatures #-}
  {-# LANGUAGE DefaultSignatures #-}
Perhaps:
  {-# LANGUAGE DefaultSignatures #-}

abstract-deque-tests-0.3\Data\Concurrent\Deque\Tests.hs:1:1: Warning: Use fewer LANGUAGE pragmas
Found:
  {-# LANGUAGE BangPatterns, RankNTypes, CPP, BangPatterns #-}
Perhaps:
  {-# LANGUAGE BangPatterns, RankNTypes, CPP #-}

Implied extensions

The new feature for v2.1.13 is that extension are detected as redundant if they are implied by other extensions. For example, if you have PolyKinds defined then that implies KindSignatures. HLint now features a list of such implications, which it uses to detect redundant extensions.

AERN-RnToRm-0.5.0.1\src\Data\Number\ER\RnToRm\UnitDom\Base.hs:1:1: Warning: Unused LANGUAGE pragma
Found:
  {-# LANGUAGE MultiParamTypeClasses #-}
Perhaps you should remove it.
Note: Extension MultiParamTypeClasses is implied by FunctionalDependencies

attoparsec-0.13.2.2\Data\Attoparsec\ByteString\Char8.hs:1:1: Warning: Unused LANGUAGE pragma
Found:
  {-# LANGUAGE BangPatterns, CPP, FlexibleInstances, TypeFamilies,
    TypeSynonymInstances, GADTs #-}
Perhaps:
  {-# LANGUAGE BangPatterns, CPP, FlexibleInstances, TypeFamilies,
    GADTs #-}
Note: Extension TypeSynonymInstances is implied by FlexibleInstances

Redundant extensions that imply non-redundant extensions

Sometimes there is an extension that you can tell is unused (e.g. RecordWildCards), which implies an extension that is either being used or can't be detected (e.g. DisambiguateRecordFields). In such cases HLint gives a note that the implied extension might now need to be provided explicitly, although usually it won't be necessary. As examples:

gogol-maps-engine-0.3.0\gen\Network\Google\Resource\MapsEngine\Projects\List.hs:7:1: Warning: Unused LANGUAGE pragma
Found:
  {-# LANGUAGE RecordWildCards #-}
Perhaps you should remove it.
Note: may require `{-# LANGUAGE DisambiguateRecordFields #-}` adding to the top of the file

manifolds-0.5.0.1\Data\Function\Affine.hs:14:1: Warning: Unused LANGUAGE pragma
Found:
  {-# LANGUAGE FunctionalDependencies #-}
Perhaps you should remove it.
Note: may require `{-# LANGUAGE MultiParamTypeClasses #-}` adding to the top of the file

Being wrong

Finally, sometimes HLint gets it a bit wrong. As an example:

shake-0.17.1\src\Development\Shake\Internal\FileInfo.hs:1:1: Warning: Unused LANGUAGE pragma
Found:
  {-# LANGUAGE GeneralizedNewtypeDeriving, DeriveDataTypeable, CPP,
    ForeignFunctionInterface #-}
Perhaps:
  {-# LANGUAGE GeneralizedNewtypeDeriving, DeriveDataTypeable, CPP
    #-}

Here HLint has detected that ForeignFunctionInterface is not used, but in fact it is, although only under one branch of an #ifdef. To fix the hint we can put the extension itself under CPP, adjust the CPP definitions given to HLint, or ignore the hint.

Tuesday, January 22, 2019

Release delays with Stackage

Summary: There are two steps that delay new versions of packages in Stackage.

I aim to get the latest version of my software out to as many people as quickly as possible. Older versions have bugs, new versions have new features - that's why I release new versions. Unfortunately there are two steps in Stackage that slow down this process.

Taking an example, HLint depends on haskell-src-exts, and is tightly coupled, so (to a first approximation) every 0.1 bump to haskell-src-exts requires changing HLint. There are also lots of other packages that depend on haskell-src-exts. This situation leads to two delays in getting HLint to Stackage users, both of which are on display in bug 4214:

Issue 1: Reluctance to remove packages

Stackage has a policy that if a new package (e.g. haskell-src-exts) is released which breaks your package (e.g. haskell-src-meta) you have an unspecified amount of time to release an update. My experience is either packages are updated quickly (all upgrades on that ticket happened within 12 days) or the package maintainers never reply (46 days later no other maintainer has even left a comment).

It used to be the case that there were hard time limits (maximum one month), but my experience was those were never enforced. Unfortunately this lag can cause a significant delay until Stackage Nightly picks up an upgrade. It seems like a more mechanical rule (e.g. after 2 weeks with no update, or 6 weeks total) might keep the process ticking faster. I appreciate it's hard to break people's work, which is why making it come down to human judgement seems to lengthen the process significantly.

Delay imposed: up to 2 months, and sometimes requires chasing.

Issue 2: Existence of Stackage LTS

While the first issue is very much a trade off, the second one is (in my view) just a bad design of Stackage, as I've said before. There is Stackage Nightly which has the latest code. There is Stackage LTS which has older and therefore buggier code, up to 2-3 months older. Having two options is fine, but the stack tool and documentation direct people towards LTS as a preference. LTS is useful if you view the act of upgrading between 0.0.1 versions as low risk (which it isn't) or you find it easier to fix multiple distinct breaking changes when they are overlapped (which it isn't). Unfortunately Stackage LTS users won't get a new version of HLint until a new Stackage LTS version is created, even after it gets merged. On the plus side, this process happens automatically without any intervention by package authors.

Delay imposed: 2-3 months.

PS. While I criticise Stackage, that's because I want to make it better, since it is a very useful distribution channel for many people, and I'm grateful for the work the Stackage maintainers do to keep the process ticking along.

Thursday, January 17, 2019

Ignoring HLint

Summary: HLint now has more ways to ignore hints you don't like.

HLint makes suggestions about how to improve your Haskell code. But not everyone likes all the suggestions all the time, so HLint comes with ways of ignoring those annoying hints, and HLint 2.1.11 provides even more mechanisms. Without further ado, let's take a quick tour - full details are in the HLint README.

Method 1: the --default flag

To ignore all hints your code currently generates run hlint as normal, but passing the --default flag, which will generate a config file with all hints that fire set to ignore. Typically, when approaching a new code base to run HLint on, I start by doing:

hlint . --default > .hlint.yaml

After that, it's easy to remove ignored hints from .hlint.yaml one by one and fix the code.

Method 2: Add -ignore directives

In the .hlint.yaml file you can write:

- ignore: {name: Eta reduce}

This directive ignores the named hint, and is what --default generates. There are also more refined ways of ignoring a hint in certain modules, or ignoring all hints in certain modules (see the README).

Method 3: Add a {- comment -}

Method 3 actually has 3 sub-methods, you can write any of:

{-# ANN module "HLint: ignore Eta reduce" #-}
{-# HLINT ignore "Eta reduce" #-}
{- HLINT ignore "Eta reduce" -}

For ANN pragmas it is important to put them after any import statements. If you have the OverloadedStrings extension enabled you will need to give an explicit type to the annotation, e.g. {-# ANN module ("HLint: ignore Eta reduce" :: String) #-}. The ANN pragmas can also increase compile times or cause more recompilation than otherwise required, since they are evaluated by TemplateHaskell.

For {-# HLINT #-} pragmas GHC may give a warning about an unrecognised pragma, which can be supressed with -Wno-unrecognised-pragmas.

For {- HLINT -} comments they are likely to be treated as comments in syntax highlighting, which can lead to them being overlooked.

My current preference is {- HLINT -}, but I think GHC should just special case {-# HLINT #-} and then in a few GHC releases we could use that. Unfortunately, other people disagree with me, so {- HLINT -} is the best we have.

Method 4: Using the C Pre Processor

hlint defines the __HLINT__ preprocessor definition (with value 1), so problematic definitions (including those that don't parse) can be hidden with:

#ifndef __HLINT__
foo = ( -- HLint would fail to parse this
#endif

Wednesday, January 09, 2019

GHC: From Bug to Merge (2)

Summary: The story of another bug, from report, patch, revisions, to merge.

I recently posted the story of a GHC bug that took 3 months to fix, which isn't great. I hoped that the recent infrastructure work to move GHC to GitLab would speed that up in future. Fortunately, I got to test that theory shortly after.

When experimenting with RebindableSyntax and MonadFailDesugaring I kept getting the error:

The failable pattern ‘Just x’
    is used together with -XRebindableSyntax. If this is intentional,
    compile with -Wno-missing-monadfail-instances.

It's annoying the warning is on by default, but nevermind, let's add -Wno-missing-monadfail-instances to silence the compiler. But alas, no flags could make the warning go away. Looking at the code, it's clear why:

; if | rebindableSyntax && (desugarFlag || missingWarning)
        -> warnRebindableClash pat

If you have RebindableSyntax and MonadFailDesugaring turned on, the value of the warning flag (missingWarning) is ignored. Boolean logic is fiddly, but replacing || with && seems to do the right thing.

Raise a PR

At this point I got Shayne Fletcher involved, who actually ran with most of the steps from here onwards. Given the change is small and the original code was obviously untested, we decided to raise a GitHub PR, skipping the Trac ticket and GHC Proposal steps.

A few days later GHC GitLab became available, so we closed the first PR and opened a GitLab PR.

Fix the PR

As with the previous bug to merge story, the immediate feedback was "please add a test suite entry", which we did.

Thanks to the better integration with CI etc, the PR clearly passed the tests and got merged shortly thereafter.

Timeline

24 Dec - raise GitHub PR
27 Dec - raise GitLab PR
27 Dec - request for changes
28 Dec - code updated
29 Dec - merged

This bug was small enough to skip the bug tracker and proposal process, but even ignoring those steps, the speed was fantastic even over the holiday period. Hopefully this speed is the new normal!

Wednesday, December 12, 2018

GHC: From Bug to Merge

Summary: The story of a bug, from report, proposal, patch, revisions, to merge.

A while ago I found a bug in GHC. Like all good bugs, this one starts with playing around, in this case seeing if it was possible to eliminate String from a base-like library. There are two interesting extensions:

OverloadedStrings changes each string literal to be wrapped in fromString, so "hello" in the source code becomes fromString "hello" when desugared.
RebindableSyntax means that when GHC sees fromString or fail (or lots of other functions) it uses the version in scope, rather than the version in the base library.

Taking a look at the code:

foo x = do
    Just y <- x
    return y

In GHC 8.6 it desugars to something equivalent to:

foo x = x >>= \v ->
    case v of
        Just y -> return y
        _ -> fail ("pattern match error" :: String)

In particular, it does not insert a fromString around the generated "pattern match error". If you are using the built-in fail, that's fine, since the built-in fail wants a String. But if we define a custom fail function which doesn't use String, but which takes the output of fromString, then we can get type errors with the definitions:

newtype Text = Text String
fail (Text x) = error x
fromString = Text

The solution is fairly simple - desugar adding a fromString - which is what GHC 8.8 will do. This post is the story of getting that change integrated.

Raise a GHC ticket

The first step after finding a bug is to raise a GHC ticket, which I did as Trac 15645, also offering to fix it. After discussing a bit with the GHC developers, we came the conclusion that the essential property of the bug is:

Overloaded strings should imply even generated strings are overloaded, if they are passed to user-controlled functions.

I think (and still think) that this ticket is a pure bug fix, partly because changing it does not imply changing the user manual in any way. However, GHC developers quite reasonably disagreed, and suggested I go through the GHC proposal process.

Raise a GHC proposal

The GHC Proposal process involves creating a proposal, having people discuss it, and then passing it via a committee. At this point I got Shayne Fletcher involved, who actually ran with most of the steps from here onwards. We raised GHC Proposal 168, setting out the case why the change should be made. We @ mentioned people on GitHub, and posted to Twitter to solicit opposing views, but only 4 people made any comment, and none of them disagreed. We submitted to the committee after the discussion was verifiably dead (not that it was ever super alive), and had the proposal accepted.

Write the code

After having the proposal accepted the hard work of writing the code began as Phab D5251. The code itself wasn't simple, but neither was it very complex. An initial version was reviewed with lots of good suggestions from Simon Peyton Jones. Shayne made some revisions, including adding a regression test, and the patch was ready. Sometime later the code was merged to GHC master.

Timeline

The whole process took a long time - 14 Sep to 12 Dec (fortunately all 2018), so nearly 3 months. That timeline was significantly extended because GHC is in the process of changing hosting for their code, which both makes it harder to merge stuff and involves the people who would normally be merging stuff doing other work. However, I thought it instructive to show where the time went.

14 Sep - 14 Sep: Open bug and discuss it
14 Sep - 17 Sep: Writing the proposal
17 Sep - 20 Sep: Discussing the proposal
20 Sep - 24 Sep: Confirming the discussion had died down
24 Sep - 21 Oct: Waiting for the committee decision
21 Oct - 22 Oct: Opening the Phab ticket
22 Oct - 25 Oct: Discussion on the code review
25 Oct - 29 Oct: Addressing review comments
29 Oct - 31 Oct: Additional review and discussion
1 Nov - 26 Nov: Waiting for people to give a final approval
26 Nov - 11 Dec: Waiting to merge in

I think the interesting thing is that of the 3 months, over 2 months was spent waiting for confirmations of a mechanical nature. However, every time real in-depth thoughtful discussion was required (code review, discussing the bug) it happened almost immediately.

Friday, November 23, 2018

Counting the Cost of Colons in Haskell

Summary: Haskell uses :: as the type operator. That was a mistake that costs us over 1 million characters of source code.

Haskell uses :: for type annotations, e.g. (1 :: Int). Most other FP languages with types use :, including Scala, OCaml, Agda, Idris and Elm. Haskell uses : for list cons, so you can write:

myList = 1:2:[] :: [Int]

Whereas in other languages you write:

myList = 1::2::[] : [Int]

Moreover, the reason Haskell preferred :: was the belief that if the cons operator was :: then people would quite naturally insert spaces around it, giving:

myList = 1 :: 2 :: [] : [Int]

The final program is now noticeably longer than the original. Back when people first invented Haskell, I imagine they were mainly list manipulation operations, whereas now plenty of libraries work mainly at the type level. That raises the question - would Hackage be shorter or longer if we used : for types?

Method

I downloaded the latest version of every Hackage package. For each .hs file, I excluded comments and strings, then counted the number of instances of : and ::, also noting whether there were spaces around the :. The code I used can be found here.

Results

Instances of :: = 1,991,631
Instances of : = 265,880 (of which 79,109 were surrounded by two spaces, and 26,931 had one space)

Discussion

Assuming we didn't add/remove any spaces, switching the symbols would have saved 1,725,751 characters. If we assume that everyone would have written :: with spaces, that saving drops to 137,9140 characters. These numbers are fairly small, representing about 0.17% of the total 993,793,866 characters on Hackage.

Conclusion

The Haskell creators were wrong in their assumptions - type should have been :.

Update: the first version of this post had numbers that were too low due to a few bugs, now fixed.

Thursday, November 22, 2018

Downloading all of Hackage

Summary: I wanted to download the latest version of every package in Hackage. Here's a script and explanation of how to do it.

Imagine you want the latest version of every package on Hackage. I found two tools that mirror every version of every package:

Using hackage-mirror you can do hackage-mirror --from="http://hackage.haskell.org" --to="C:/hackage-mirror". But this project is long deprecated and doesn't actually work anymore.
Using hackage-mirror-tool you might be able to do it, but it requires a new Cabal, isn't on Hackage, doesn't seem to work on Windows and doesn't say whether it downloads to disk or not.

Given it's a fairly simple problem, after investigating these options for an hour, I decided to cut my losses and write a script myself. Writing the script took a lot less than an hour, and I even wrote this blog post while the download was running. The complete script is at the bottom of this post, but I thought it might be instructive to explain how I went about developing it.

Step 0: Set up my working environment

I created a file named Download.hs where I was writing the source code, used ghcid Download.hs in a VS Code terminal to get fast error feedback using Ghcid, and opened another terminal to execute runhaskell Download.hs for testing.

Step 1: Find where a download link is

You can download a package from Hackage at http://hackage.haskell.org/package/shake/shake-0.17.tar.gz. You can also use https, but for my purposes and bulk downloading I figured http was fine. I hunted around to find a link which didn't contain the version number (as then I wouldn't have to compute the version number), but failed.

Step 2: Find a list of package versions

Looking at the cabal tool I found the cabal list --simple command, which prints a big list of packages in the form:

foo 1.0
foo 2.1
bar 1.0

For each package on Hackage I get all versions sequentially, with the highest version number last. I can execute this command using systemOutput_ "cabal list --simple" (where systemOutput_ comes from the extra library).

Step 3: Generate the list of URLs

Now I have the data as a big string I want to convert it into a list of URL's. The full pipeline is:

map (toUrl . last) . groupOn fst .  map word1 . lines

Reading from right to left, I split the output into a list of lines with lines, then split each line on its first space (using word1 from the extra library). I then use groupOn fst so that I get consecutive runs of each package (no points for guessing where groupOn comes from). For each list of versions for a package I take the last (since I know that's the highest one) and transform it into the URL using:

let toUrl (name, ver) = "http://hackage.haskell.org/package/" ++ name ++ "/" ++ name ++ "-" ++ ver ++ ".tar.gz"

Step 4: Download the URLs

I could make multiple calls to wget, but that's very slow, so instead I write them to a file and make a single call:

writeFile "_urls.txt" $ unlines urls
system_ "wget --input-file=_urls.txt"

I use the name _urls.txt so I can spot that special file in amongst all the .tar.gz files this command produces.

Step 5: Putting it all together

The complete script is:

import Data.List.Extra
import System.Process.Extra

main :: IO ()
main = do
    let toUrl (name, ver) = "http://hackage.haskell.org/package/" ++ name ++ "/" ++ name ++ "-" ++ ver ++ ".tar.gz"
    urls <- map (toUrl . last) . groupOn fst .  map word1 . lines <$> systemOutput_ "cabal list --simple"
    writeFile "_urls.txt" $ unlines urls
    system_ "wget --input-file=_urls.txt"

After waiting 46 minutes I had 13,258 packages weighing in at 861Mb.

Update: In the comments Janek Stolarek suggested the simpler alternative of cabal list --simple | cut -d' ' -f1 | sort | uniq | xargs cabal get (I had missed the existence of cabal get). Niklas Hambüchen also shares a script https://github.com/nh2/hackage-download which can download even faster.

Sunday, October 21, 2018

Announcing Profiterole - GHC Profile Viewer

Summary: profiterole reformats GHC profile reports so they are easier to read.

Do you often work with GHC time profiling reports? Do you find them excessively long and hard to navigate? Profiterole reads standard GHC .prof files and generates both textual and HTML reports which are typically more than 10x smaller. As an example compare HLint profile input to HLint Profiterole output.

Usage

To run, first install (cabal update && cabal install profiterole), generate a GHC profile the normal way, then run:

profiterole myprogram.prof

Profiterole will generate myprogram.profiterole.txt and myprogram.profiterole.html - both contain the same information, but the HTML has hyperlinks. There are three columns of numbers:

TOT is the total time spent in any item under this code, what GHC calls inherited time.
INH is the total time spent in the items that Profiterole did not move out to the top level.
IND is the individual time, just like GHC profiles.

For large programs, using +RTS -P (instead of the common -p) will give more accurate results.

How it works

Profiterole aims to make the profile shorter by combining common subtrees and lifting them to the root - e.g. if you call parseFile from 7 places in the code, instead of having 7 pieces of parseFile profiling, Profiterole will give you one. With only 1 place containing parseFile, it's easier to optimise parseFile, and it's easier to read the code calling it without getting lost in the internals.

How to profile

Given profile data, different ways of looking at it reveal different insights, and the ones discovered by Profiterole have definitely had value. I tend to use:

I first use Profiteur to get an overall sense of where the time is going visually. Profiteur lets me orientate myself, but tends to be a little difficult to drill into the details and repeat experiments.
I then use Profiterole to see if there were any repeated pieces of code Profiteur missed, and then dig into the details using Profiterole.
Only if I'm really going into the details do I go to the GHC .prof output - it's pretty rare.

Tuesday, October 16, 2018

Announcing Shake 0.17

I'm delighted to announce Shake 0.17. As always, the full changelog is on GitHub, but I'd like to highlight three areas that have seen most attention.

Error Messages

Error messages now support GHC's HasCallStack feature, giving code locations in error messages. As an example, let's define rules for both *.txt and overlap.*, then try and build overlap.txt. With Shake 0.17 we get the far more informative error:

Error when running Shake build system:
at Example.hs:50:46-55:
* Depends on: overlap.txt
* Raised the exception:
Build system error - key matches multiple rules:
Key type:       FileQ
Key value:      overlap.txt
Rules matched:  2
Rule 1:         "overlap.*" %> at Example::21:94-106:
Rule 2:         ".txt" %> at Example::24:94-106:
Modify your rules so only one can produce the above key

We can see where the dependency was introduced (line 50), where the rules were defined (lines 21 and 24), and what their patterns were.

The Database module

The new module Development.Shake.Database provides operations for working with open Shake databases - meaning you can now open the database, run some actions against, and shut it. Unlike before, you can now run against an open database repeatedly, and query the resulting database for live or erroneous files. When combined with the new feature that /dev/null for shakeFiles results in no on-disk representation of Shake, you can create an in-memory database, execute it many times, then throw it away. These features aren't targetted at build systems, but allow reuse of Shake in other domains.

If you are using the Database module, and have a way to observe changes interactively, the deprioritize function may be of use, to cause Shake to build some unimportant rules last.

This work was supported by Digital Asset.

Changes to Builtin Rules

Most users shouldn't need to define their own types of rules, but for those who do, the biggest improvement is probably the better documentation in Development.Shake.Rule, complete with a worked example. At the same time, the builtin rules have changed slightly in incompatible ways - the docs explain the new state. These changes are intended to set the stage for Cloud Shake, following the pattern described in Build Systems a la Carte. I hope that a forthcoming release of Shake will provide an actual implementation of Cloud Shake.

Tuesday, October 02, 2018

Full-time Haskell jobs in Zürich/New York, at Digital Asset

UPDATE: All positions have been filled.

Summary: We're hiring 3 Haskell programmers and a few other roles too.

I am currently working at Digital Asset, working on our DAML programming language. We're seeking 3 additional Haskell programmers to join, 2 in New York and 1 in Zurich (remote work is not currently an option). There are also a ton of other jobs on our website, including Formal Methods and nix + some Haskell Build Engineering (also available at a more junior level).

What we do

We have built DAML, the Digital Asset Modelling Language, which is the centerpiece of our distributed ledger technology. DAML is a contract language that consists of a strongly-typed purely functional core extended with domain specific constructs to express the flow of rights and obligations underlying today's multi-party business processes. Application Developers using DAML and our distributed ledger technology are supported by the DAML SDK. It provides a type-safe integration of DAML with existing technology like Java, Scala, XML and SQL, and contains DAML Studio, which provides a modern IDE experience to develop, test, and analyse DAML programs.

Working on the Language Engineering team with Digital Asset involves partnering with people around the world (we have centers in New York, Zurich and Sydney), working with exciting new technology, where many of the answers haven't yet been figured out, producing solutions for clients, such as replacing the settlement and clearing platform of the Australian Stock Exchange (ASX), and making sure the end result has the quality required for robust usage. It's challenging work, but the impact could be huge.

What we require

We're looking for the best functional programmers out there, with a strong bias towards Haskell. If not Haskell, then Scala is useful, as other teams in the company write Scala. However, we've hired ML/F# programmers too, with good results. In particular we want:

Experienced functional programmer. Either some open-source libraries (Hackage/GitHub) or commercial experience.
Writes good, clean, effective code.
Existing compiler development experience is useful, if it's with GHC then even better.

We do not require any existing blockchain/DLT/finance knowledge.

How to apply

To apply, email neil.mitchell AT digitalasset.com with a copy of your CV. If you have any questions, email me.
The best way to assess technical ability is to look at code people have written. If you have any packages on Hackage or things on GitHub, please point me at the best projects. If your best code is not publicly available, please describe the Haskell projects you've been involved in.

Thursday, September 13, 2018

Review of Apple Watch (series 2)

Summary: I like mine.

I've worn a watch since I was 10, starting with a Casio F-91W, replacing it with an identical model about seven times over the years. Last year the strap broke (it's almost always the strap that breaks), and I made the decision to buy an Apple Watch. I'm very happy with my Apple Watch (series 2, 38mm).

What I use it for

The main things I use my watch for are:

Alarms and timers, often using Siri to set them. If you treat the voice interface like a bad command line it works well - if you say "Add an alarm at 4pm" and you had an alarm at 4pm yesterday, it just shows you the alarm but doesn't enable it. Instead you have to say "Set alarm at 4pm". If you say "Set alarm at 4:10pm" the odds of getting an alarm at 10pm are quite high.
Receiving notifications, email, texts, Slack messages, phone calls. When you lift your arm you get a quick preview of the message, which is enough to decide whether it's important (take out the phone), or can happily be ignored.
Sleep tracking, via the Sleepwatch app. It's an awesome app that tracks your sleep showing trends.
Paying for things via the ApplePay. Nowadays I'm on the verge of not shopping at places that don't take ApplePay. It's more secure than contactless, works everywhere contactless works, has a higher limit, and is incredibly convenient. It can also swipe through multiple credit cards. Easy and seamless.

What I wish was better

There are other features the watch offers that I was hoping to use, but for various reasons haven't worked out.

I am terrible at navigation, and wanted to use Google Maps on my watch, particularly while cycling. Unfortunately, there is no Google Maps app for the watch, and the Apple Maps one is even less useful than normal Apple Maps. There is a recurrent bug where it loses the map display and just displays a checkered background - which can be fixed by some complex steps including wiping the watch and reinstalling. Part of the reason for buying this watch was for navigation, so I hope this gets resolved eventually.
I wanted to quickly and easily track train departures on the move. Unfortunately the National Rail train time app is useless - it omits the time the train is leaving, merely saying "On time" or not. As a consequence you have to memorise the timetable, plus believe it has refreshed and isn't just showing stale information. All in all, impressively close, and totally useless.
The actual display of the watch is adequate - it takes a noticeable pause to display the watch face (sub-second), but compared to my Casio, it's not always available. I like actual seconds on my watch, which limits me to exactly one digital watch face. I can't believe "knowing the precise time" is such a niche feature on a watch.
You can't hide apps from the watch that you don't use, which means my watch face has 50 odd apps, of which I use maybe 10. Removing most of the apps would make navigation much easier.
The watch slows down over time, so after a couple of months it starts to lag. A reboot fixes that.
The straps you can buy for the watch are fantastically overpriced. The default one is OK, but my wrist in between two holes, so it's usually either a bit loose or a bit tight.
Exercise features haven't been much use to me, but I'd blame that on me rather than the watch...

Conclusions

The positives are enough to make it worth me having an Apple Watch, and inevitably replacing when the battery life gets too bad (for the moment, it runs about 30hrs on a charge, which is fine).

Thursday, August 30, 2018

Licensing my Haskell packages

Summary: I plan to license all future packages under the "BSD-3-Clause OR Apache-2.0" license.

A few weeks ago I calculated that the distribution of Haskell library licenses is approximately:

BSD-3-Clause: 67%
MIT: 20%
Apache-2.0: < 2%

In contrast, Wikipedia suggests for most open-source libraries the Apache-2.0 license beats BSD-3-Clause, and it is the permissive choice of FSF/Google etc. I was curious why it was so rare in the Haskell world, so asked on Twitter. The resulting thread got my thinking about license choices, which changed my view of what license I'd like to use. In this post I'm going to go through my reasoning.

The license I want to use

What I want to say is that anyone is free to use my code for any purpose. If they make changes to my code which would be of general benefit and have a chance of being accepted upstream, they should post those changes publicly. I give my work away freely, and want the version I'm giving away to be the best possible version for everyone. No license matches this intent, none force you to share code that you improve but only use internally, and the legal definition of "general benefit" can only be arbitrated by me. As a result, I'd like people to follow those principles, but chose to release my code with far fewer restrictions, in the hope people will do the right thing and share improvements anyway.

The license I use

When I first started releasing code (around 2004) I originally licensed my code as GPL-2.0, because that was a protective open-source license and I was still dipping my toes in the open source pond. By 2007 I was releasing new libraries as BSD-3-Clause, since that was what everyone in the Haskell community was using and seemed to provide the benefits I wanted (people sent me patches without being legally compelled to, just for the good of the code, which I prefer). It took until 2012 for me to switch my earliest libraries to BSD-3-Clause - one to avoid annoyance at work and another at the request of a commercial company who were linking it to proprietary pieces, and then went on to contribute extensively for the benefit of the project. Currently, all my projects are BSD3 licensed.

The license I will use

But what license should I be using? These questions prompted me to hunt around and I came to the conclusion that the right license for me is:

BSD-3-Clause OR Apache-2.0

Concretely, I plan to license all my future libraries under both the BSD-3-Clause and Apache-2.0 licenses, but a user is able to use it under either. My reasoning is as follows:

Why BSD-3-Clause over MIT?

I debated BSD-3-Clause vs MIT for a long time. They both offer substantially the same freedoms, but BSD-3-Clause also requires you can't use my name to endorse things you build - which seems reasonable. I like MIT because it's less ambiguous as a name, and it makes explicit freedoms that are implicit in the BSD-3-Clause license. The fact that BSD-3-Clause is more commonly used for Haskell libraries and my existing libraries is a point in it's favour. In the end, I picked BSD-3-Clause.

Why Apache-2.0?

The Apache-2.0 license offers a patent grant - I'm promising that if you use my code I'm not going to sue you using my patents. If I gave you code and then later sued you for using it, that would be mean. More importantly (from my side at least) it ensures everyone contributing to my library is following the same basic "don't be mean" principle, so I can continue to use my code free from patent concerns.

Why both?

The OR in the license means that I (and all contributors to my libraries) license all the code under BSD-3-Clause, and entirely separately also license all the code under Apache-2.0. Users are free to use the library under either of the available licenses, making it a user-centric OR. The Apache-2.0 license is incompatible with the GPL-2.0 and LGPL-2.1-only licenses, meaning any library building on my code plus the GTK bindings would be in a license quandary. By licensing under both most users can use Apache-2.0 (it gives you patent protection, so it's in your best interest), and those that would have problems otherwise can stick with BSD-3-Clause.

Next steps

Licensing is a slightly sensitive topic, so I'm declaring my intent, and waiting for feedback. Hopefully this change is positive for everyone, but anyone with concerns should let me know. As to technical detail, Cabal 2.2 supports SPDX license expressions, which is the syntax I've been using throughout this post.