Neil Mitchell's Blog (Haskell etc): June 2007

Friday, June 29, 2007

Functional composition

Haskell has great support for abstraction: small functions can be written which do one general thing. Haskell also has great support for composition: putting small functions together is easy.

On IRC today someone asked for a function that computes the mode of a list. This is the same as finding the most common element in a list. One simple solution is a nice display of composition of smaller abstraction functions:


mostCommon :: Ord a => [a] -> a
mostCommon = head . maximumBy (comparing length) . group . sort

And a single example:


; mostCommon "Haskell"
'l'

To read the function mostCommon, it is best to read from right to help:

sort: put the elements in order, i.e. "Haeklls".

group: group consecutive elemnts, i.e. ["H","a","e","k","ll","s"]. It would be entirely reasonable to have a groupSet function that did grouping of non-consecutive elements, and I often use such a function.

maximumBy (comparing length): find the maximum value, looking only at the number of elements in each list, i.e. "ll".

head: take a representative value from the list, i.e. 'l'.

If asked to write the same function in Javascript (say), I would have started with a Hash table that looped counting the number of occurrences of each letter, then done a loop through this to find the maximum. The code would have been longer, and more prone to bugs.

It is easy to see that by combining functions that have already been written, tested and documented, it is possible to come up with very succinct operations.

Monday, June 18, 2007

The Year 3000 problem

(Not a Haskell post, but I wish it was, I hate writing VBScript!)

When writing code for someone else, on a money per hour basis, you have to make choices that are in the clients best interest. Usually best interest is the cheapest solution that works. One example of this choice is perfection versus practicality.

When writing code for a single task, if the code meets the requirements for the task, it "works". If you have to write a function to parse a number, which launches nuclear missiles if given a negative number, and you don't give it negative numbers, then that's fine. If you spend additional time making it robust to negative numbers, you are wasting someone else's money.

If you are writing beautiful code, or a library (where people can poke the interface in unexpected ways) then handling side conditions is very important. The fewer obscure side conditions, the fewer surprises to the users. Fewer surprises means less need for documentation and fewer help emails.

As projects become larger, typically an increasing proportion of the code becomes a library which the remaining section of the program makes use of. Once this happens, it is worth going back and reconsidering and refining the interfaces to functions. Judging when a program goes from "quick hack" to "large system" is hard. The best time to go back and make this refinement is when the code is still small, but you know it will be growing shortly. Alas, this is exactly when deadlines are likely to make refactoring impossible.

So back to the title of the post, the year 3000 problem. I was writing code to search through a list for items with a particular year. The list goes back to 2005 only. I wrote a function which given a blank string returns the latest ten items, and given a year, it returns all the items which have that year. I implemented the year check using substring searching (indexOf/InStr/strstr - depending on your language). There are reasons for using substring searching rather than simple equality, but they are not relevant here. You can write this code as:


function search(year)
{
   var items = complicated_code_here();

   for (var i in items)
   {
      if ((year == "" && i < 10) ||
          (items[i].year.indexOf(year) != -1))
         display(items[i]);
   }
}

Simple enough. Now its very easy to add a facility to call the search function with the appropriate year and display the results. Now imagine that you get a change request, asking for a button to display all results. What would you do? Remember that you are looking for the solution that meets the clients needs with the least amount of work.

The answer I came up with, simply call search("2");. The code works, and will continue to work for 992 years. In the year 3000 there will no longer be a 2 in year field, and it will stop working. By then, I'm expecting to be dead, and I hope to outlive VBScript by decades, so this isn't something that will ever be a problem.

When learning to code it is important to think about the corner cases, and to deal with them appropriately - always aiming for perfection. Once you have a reasonable idea what the "perfect" solution looks like, then you need to start making concious decisions about whether perfection is the right choice. Usually it is, but sometimes you need to make compromises for the benefit of the person paying.

Friday, June 15, 2007

Boilerplate considered harmful (Uniplate edition!)

I've just released my Uniplate library, which hopes to remove the boilerplate from Haskell programs. I try to keep my PhD research as practically orientated as I can, but often research things start off as proof of concept and get refined into something more practically usable over time. Uniplate is different - its the one part of my PhD that should be useful to almost everyone, and is useful and polished right now.

I've been working on Uniplate (or Play, as it was originally called) for about two years now. Uniplate did not start off as a project to remove boilerplate, but in my work on Catch I found various patterns kept cropping up. I started to think about abstracting them out - first in a very chaotic manner - then as time progressed I refined the ideas until they hung together much more convincingly. I've used Uniplate in Catch (well over 100 times!), in Yhc.Core (support is built into the library), and most recently in Hoogle. In addition, a few other people have picked up Uniplate (Eric Mertens and Matt Naylor), and with very little time were quite fluent in the use of the library.

I previously posted a bit about how you could use Scrap Your Boilerplate (SYB) to remove boilerplate from Haskell. I'm now going to recap that post, but giving the examples using Uniplate as well. Hopefully this will start to encourage people to make use of Uniplate - the results can be very effective. Recently Matt ported one of his projects, and some functions went from 20 lines of complicated code to 3 lines of simple code - revealing some bugs in the original definition in the process!

A Data Structure

Before showing some operations, I'm going to first introduce a data structure on which we can imagine operations are performed. Here is a data type that looks like an imperative programming language:


{-# OPTIONS_GHC -fglasgow-exts #-}
import Data.Generics
import Data.Generics.PlateData

data Expr = Var String
          | Lit Int
          | Call String [Expr]
            deriving (Data, Typeable)

data Stmt = While Expr [Stmt]
          | Assign String Expr
          | Sequence [Stmt]
            deriving (Data,Typeable)

We define the data type as normal, adding deriving for Data and Typeable - the two key SYB types. We also add an import of Data.Generics and a flag, just to get the GHC machinery working for the derivings. Finally, we add an import of Data.Generics.PlateData - which says that we want to automatically derive the necessary Uniplate instances, and use the Uniplate operations.

Queries

So lets imagine you have to get a list of all literals. In SYB this is easy:


extractLits :: Data a => a -> [Int]
extractLits = everything (++) ([] `mkQ` f)
     where f (Lit x) = [x]
           f _ = []

Wow, easy! But we can make it even easier with Uniplate:


extractLits :: Data a => a -> [Int]
extractLits x = [y | Lit y <- universeBi x]

Both functions will operate on anything which has a Data instance, so you can run it on an Expr, Stmt, [Stmt], [Either Stmt Expr] - the choice is yours. The Uniplate function universeBi simply gives you a list of all the Expr types in x, from which you can pick the Lit's using a nice list comprehension.

Transformations

Now lets negate all the literals, in SYB we have:


negLit (Lit x) = Lit (negate x)
negLit x       = x

negateLits :: Data a => a -> a
negateLits = everywhere (mkT negLit)

Again, its pretty easy. We can also do something very similar in Uniplate:


negateLits :: Data a => a -> a
negateLits = transformBi negLit

The only difference is a mkT.

Advantages of Uniplate

There are a number of advantages:

Compatability - the above code will work only in GHC, but if we modify the instance declaration to remove deriving(Data,Typeable) and replace it with an explicit instance (generated by Derive, if you wish), then the Uniplate code will work perfectly happy in Hugs. The Uniplate library also provides substantial Haskell 98 compatibility.

Built in compiler support in GHC to the same level as SYB - piggy-backing off the SYB support.

Usually produces shorter code - especially for queries.

A range of traversals not in SYB (although SYB could add them - and I believe should)

Higher performance - about 30% faster in the above examples, up to 80% faster if you are will to write different instances.

What Uniplate is NOT

The SYB library has gone much further than queries and transformations - they provide what amounts to runtime reflection and an incredible level of power. They also offer type generic programming, extensible functions and much more besides. Uniplate does not attempt to offer anything beyond standard traversals.

It is also important to point out that SYB is strictly more powerful than Uniplate. You can implement Uniplate on top of SYB, but cannot implement SYB on top of Uniplate. I would still encourage everyone to read up on SYB - it can be complex to pull of some of the more advanced tricks, but the power can take Haskell to whole new levels.

Conclusion

I hope that people will take a look at Uniplate, and see if it can meet their needs - in any project where a data type is defined and operated over it can probably be of benefit. The code reductions that can be made with Uniplate (or SYB) are substantial, and can hopefully promote the concise declarative style which Haskell was designed for.

Friday, June 08, 2007

The Test Monad

Lots of people have written clever things about writing monads - their mathematical structure, properties etc. I wrote a Test Monad for my FilePath library, which I've since refined for Hoogle. The inspiration for this comes from Lennart Augustsson, who gave a talk at fun in the afternoon on ways to abuse Haskell syntax for other purposes.

For the testing of the parser in Hoogle, I want to write a list of strings, versus the data structures they should produce:


"/info" === defaultQuery{flags = [Flag "info" ""]}
"/count=10" === defaultQuery{flags = [Flag "count" "10"]}

In Haskell, the most syntactically pleasant way of writing a list of things is using do notation in a monad. This means that each (===) operation should be in a monad, and the (>>) bind operation should execute one test, then the next. We also need some way of "executing" all these tests. Fortunately this is all quite easy:


data Test a = Pass

instance Monad Test where
    a >> b = a `seq` b

instance Show (Test a) where
    show x = x `seq` "All tests passed"

pass :: Test ()
pass = Pass

The basic type is Test, which has only one value, being Pass. To represent failure, simply call error, which is failure in all Haskell programs and allows you to give a useful error message. The helper function pass is provided to pin down the type, so you don't get ambiguity errors. The Monad instance simply ensures that all the tests are evaluated, so that if any crash then the whole program will crash. The Show instance demands that all the tests passed, then gives a message stating that.

We can then layer our own test combinators on top, for example for parsec:



parseTest f input output =
    case f input of
        Left x -> err "Parse failed" (show x)
        Right x -> if x == output then pass else
                   err "Parse not equal" (show x)
    where
        err pre post = error $ pre ++ ":\n  " ++
                               input ++ "\n  " ++
                               show output ++ "\n  " ++
                               post

This parseTest function takes a parsec parser, an input and an output. If the parse fails, or doesn't produce the correct answer, an error is raised. If the test passes, then the function calls pass. It's then simple enough to define each test set as:


parse_Query = do
    let (===) = parseTest parseQuery

Here (===) is defined differently for each do block. By evaluating one of these test blocks in an interpreter, the show method will automatically be called, executing all the tests.

I've found this "monad" lets you express tests very succinctly, and execute them easily. I moved to this test format from stand-alone files listing sample inputs and expected results, and its proved much easier to maintain and more useful.

Releasing Catch

Finally, after over two years of work, I've released Catch! I recommend people use the tarball on hackage. The manual has all the necessary installation instructions. To make installation easier, rather than following the Cabal route of releasing lots of dependent libraries, I've take the transitive closure and shoved them into one .cabal file.

I've had the release prepared for some time, but was waiting until after the ICFP reviewing had finished (I got rejected :( ), to make a release, because of their anonymity requirements. Now that's all finished, I invite everyone to download and try Catch.

The final release is only 2862 lines of code, which doesn't seem like much for two years worth of work. The actual analysis is only 864 lines long, of which about half the lines are blank, and one or two are comments. If you include all the Yhc.Core libraries, my proposition library, my boilerplate removal library (all of which were developed to make Catch possible), you end up with just under 8000 lines of code. Still substantially shorter than my A-level computing project, which I wrote in C++.

Catch has seen a bit of real world use, particular in HsColour, where it found two crashing bugs, and XMonad where it influenced the design of the central API. I'm starting to routinely Catch check all the libraries I release, which is a good sign. One of the things that restricts Catch from being more widely used is the lack of Haskell 98 code in general use - Catch is in no way limited to Haskell 98, but Yhc which Catch uses as a front end is. Hopefully the Haskell' standard will encourage more people to code to a recognised standard, and increase the utility of source code analysis/manipulation tools.

If anyone does try out Catch, and has any comments, please do email me (ndmitchell @at@ gmail .dot. com). If you take the time to check your program with Catch you are welcome to use the "Checked by Catch logo".

Neil Mitchell's Blog (Haskell etc)