Friday, June 08, 2007

Releasing Catch

Finally, after over two years of work, I've released Catch! I recommend people use the tarball on Hackage. The manual has all the necessary installation instructions. To make installation easier, rather than following the Cabal route of releasing lots of dependent libraries, I've taken the transitive closure of the dependencies and shoved them into one .cabal file.

I've had the release prepared for some time, but because of the anonymity requirements I was waiting until after the ICFP reviewing had finished (I got rejected :( ) to make a release. Now that's all finished, I invite everyone to download and try Catch.

The final release is only 2862 lines of code, which doesn't seem like much for two years' worth of work. The actual analysis is only 864 lines long, of which about half the lines are blank, and one or two are comments. If you include all the Yhc.Core libraries, my proposition library and my boilerplate removal library (all of which were developed to make Catch possible), you end up with just under 8000 lines of code. Still substantially shorter than my A-level computing project, which I wrote in C++.

Catch has seen a bit of real world use, particularly in HsColour, where it found two crashing bugs, and XMonad, where it influenced the design of the central API. I'm starting to routinely Catch check all the libraries I release, which is a good sign. One of the things that restricts Catch from being more widely used is the lack of Haskell 98 code in general use - Catch is in no way limited to Haskell 98, but Yhc, which Catch uses as a front end, is. Hopefully the Haskell' standard will encourage more people to code to a recognised standard, and increase the utility of source code analysis/manipulation tools.


If anyone does try out Catch, and has any comments, please do email me (ndmitchell @at@ gmail .dot. com). If you take the time to check your program with Catch you are welcome to use the "Checked by Catch logo".

Wednesday, May 23, 2007

Preconditions on XMonad

{- ICFP referees look away now -}

I've proved that the central XMonad StackSet module is safe on several occasions, as the code keeps evolving. Each time I take the source code, run Catch on it, and send an email back to the XMonad list giving the results. So far only one other person (Spencer Janssen) has taken the time to download and install Catch and run the tests to validate my result. The reason for this is that building Catch is slightly tricky, due to the Yhc and Yhc.Core dependencies. I'm working on putting together a proper release for Hackage, expect that within a month - all the code works, it's just the packaging and user interface that's lacking.

The other day dons asked me how he could "get an idea" of what Catch is doing. If you blindly accept a formal proof, it's hard to get a feel for whether it's correct, or whether you are proving what you expect. The detailed answer is going to appear in my thesis (and hopefully as a paper beforehand), but I thought it may help to give a very light overview of what thoughts go through Catch.

In Principle

The concept behind Catch is that each function has a precondition that must hold in order for the function to execute without error. If a function f calls a function g which has a precondition, then the precondition for f must guarantee that the precondition to g holds. If you set this up so that error has the precondition False, then you have a pattern match checker.
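To make the idea concrete, here is a tiny hand-worked sketch (my own illustration using standard Prelude functions - nothing to do with Catch's internal representation):

```haskell
-- A hand-worked sketch of precondition propagation (not Catch's real
-- machinery). 'head' fails on [], so its precondition is "non-empty".
headPre :: [a] -> Bool
headPre xs = not (null xs)

-- 'firstWord' calls 'head', so its precondition must guarantee head's
-- precondition on the argument it passes, namely 'words s'.
firstWord :: String -> String
firstWord s = head (words s)

-- The derived precondition for firstWord: 'words s' is non-empty,
-- i.e. the string contains at least one non-space character.
firstWordPre :: String -> Bool
firstWordPre s = headPre (words s)
```

Catch derives and discharges such conditions automatically; the point of the sketch is only the direction of the propagation, from callee to caller.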

The second concept is that given a postcondition on a function, you can transform that to a precondition on the arguments to the function. If there is a requirement that the result of f x y meets a certain condition, then this can be expressed as conditions on the variables x and y.
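A hand-worked illustration of this direction (again my own sketch, not Catch output) - suppose the result of appending two lists must be non-empty:

```haskell
-- Hypothetical illustration of turning a postcondition into a
-- precondition. Suppose the result of 'f x y' must match (_:_).
f :: [Int] -> [Int] -> [Int]
f x y = x ++ y

-- Pushed back through (++), the postcondition becomes a condition on
-- the arguments: x matches (_:_), or y matches (_:_).
fPre :: [Int] -> [Int] -> Bool
fPre x y = not (null x) || not (null y)
```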

Before all this machinery can be put into action, it is first necessary to perform many transformations on the source code - filling in default pattern matches, doing simplifications, removing higher-order functions and abstracting some library functions. All these transformations are performed automatically, and are intended to set up the Catch machinery to do the best job it can.
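For instance, filling in default pattern matches makes the potential failure explicit - roughly this shape (a hand-written sketch of the transformation, with hypothetical function names):

```haskell
-- Before the transformation: an incomplete pattern match.
beforeFill :: [Int] -> Int
beforeFill (x:_) = x

-- After: the missing alternative is filled in with an explicit call to
-- error - which is what gives 'error' its role as the function with
-- precondition False.
afterFill :: [Int] -> Int
afterFill (x:_) = x
afterFill [] = error "pattern match failure in afterFill"
```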

In Reality

As it happens, most of the pattern matches that are checked in XMonad are relatively trivial - and do not push the power of Catch to its full potential. A few are slightly more complex, and one of these is focusLeft (XMonad code evolves very fast, so this may not be the current darcs version!):


focusLeft = modify Empty $ \c -> case c of
    Node _ _ [] [] -> c
    Node m t (l:ls) rs -> Node m l ls (t:rs)
    Node m t [] rs -> Node m x xs [t] where (x:xs) = reverse rs -- wrap


Catch identifies two separate potential pattern match errors in this statement. Firstly, the lambda expression passed as the second argument to modify is potentially unsafe - as it does not mention the Empty constructor. A quick look at the modify function shows that by this stage the value must be a Node. The way Catch solves this is by transforming the code, bringing the two pieces of code together. Once the case expression within modify is merged with the one in the lambda, pattern match safety is a simple matter of eliminating redundant alternatives.

The second potential error is in the where statement. If reverse rs is empty then the pattern will not match. This is a perfect example of the postconditions in action: the generated postcondition is that reverse rs must be a (:)-constructed value. Using the machinery in Catch, this condition is transformed into the condition that rs must be a (:) value. Looking at the alternatives, Catch can see this is always the case, and declares the pattern safe.

In order to prove the entire module, Catch requires 23 properties and 12 preconditions. The process takes 1.16 seconds and requires 1521.05 Kb of memory.

In Conclusion

Catch is an automated tool - no user annotations are required - which means that some people may feel excluded from its computational thoughts. If a user does wish to gain confidence in the Catch checking process, a full log is produced of all the preconditions and postconditions required, but it isn't bedtime reading. Hopefully this post will let people get an idea for how Catch works at a higher level.

I am very glad that Catch is an automated tool. XMonad is a fast changing code base, with many contributors. In a manual system, requiring proofs to remain in lockstep with the code, the pace of progress would be slowed dramatically. Hopefully Catch can give cheap verification, and therefore verification which can be of practical use to non-experts.

#if NOCPP

I use both GHC and Hugs for my development work. I primarily work with WinHugs for development, then move on to GHC for deployment. Unfortunately, one of the things that WinHugs doesn't do very well is CPP support: it can only be turned on or off for all files, requires multiple passes, and slows things down. So what you really need is some way of defining code which behaves differently in the presence and absence of CPP - so you can customise correctly.

It's not too hard to make this work, but it can be a bit fiddly to come up with the correct constructs. The easier one is to exclude code from the CPP, which can be easily accomplished with #if 0:


{-
#if 0
-}
cppOff = True
{-
#endif
-}


If the CPP is on, the closing comment and the cppOff definition will be ignored. The other direction, having code only available when CPP is on, requires the nested comments available in Haskell:


{-
#if 0
{-
#endif
-}
cppOn = 1
{-
#if 0
-}
#endif
-}


Without CPP, the extra open comment nests the comments, leaving the definition invisible.

With these two fragments, you can customise your Haskell code so that it performs differently on systems with and without a CPP, remaining valid Haskell on systems with no knowledge of CPP.
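Combining the two fragments gives a flag recording which way the file was compiled. This is my own extension of the trick above: the file below is valid plain Haskell (because Haskell comments nest), where usingCpp is False; run it through CPP and the other definition survives instead.

```haskell
-- 'usingCpp' is True when the file goes through CPP, False otherwise.
{-
#if 0
{-
#endif
-}
usingCpp :: Bool
usingCpp = True
{-
#if 0
-}
#endif
-}
{-
#if 0
-}
usingCpp :: Bool
usingCpp = False
{-
#endif
-}
```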

Friday, May 18, 2007

47% faster than GHC*

* This is building on from last time. In particular, that means that all the same disclaimers apply.

Having done "wc -c" as the previous benchmark, the next thing to look at is "wc -l" - line counting. A line is defined as a sequence of non-newline characters, optionally ending with a newline. This means that "test" has 1 line, "test\n" has 1 line, and "test\ntest" has 2 lines. It's trivial to express this in Haskell:


print . length . lines =<< getContents


The Haskell code uses the same getContents as before, but the C code needed more changes:


#include "stdio.h"

int main()
{
    int i = 0;
    int last = 0;
    while (last = getchar() != EOF) {
        if (last == '\n')
            i++;
    }
    if (last == '\n')
        i--;
    printf("%i\n", i);
    return 0;
}


The C code is quite hard to write. In fact, this code is the original version I wrote - with an accidental mistake. Rest assured that I benchmarked with the correct code, but see if you can spot the mistake.

As in the previous example, I've demanded that all implementations use getchar at their heart, to make everything a fair comparison. The benchmark was then line counting on the same 35Mb file as previously:



And the numbers: C=4.969, Supero=5.063, GHC=9.533. This time the difference between C and Supero is a mere 2%. However the difference between Supero and GHC has grown considerably! The reason for this is an advanced optimisation that has managed to remove all the intermediate lists. In a normal Haskell program, getContents would build a list of characters, lines would take that list, and build a list of strings, then length would take that list and consume it. With my optimisation, no intermediate lists are created at all.
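To get a feel for what the optimisation achieves, the fused program is roughly this shape (my guess at the general shape - not Supero's actual output):

```haskell
-- A hand-fused equivalent of 'length . lines': no intermediate list of
-- lines is ever built. 'go' is between lines; 'inLine' is inside one,
-- where the line has already been counted.
countLines :: String -> Int
countLines = go 0
  where
    go n []              = n
    go n ('\n' : cs)     = go (n + 1) cs
    go n (_    : cs)     = inLine (n + 1) cs
    inLine n []          = n
    inLine n ('\n' : cs) = go n cs
    inLine n (_    : cs) = inLine n cs
```

Checking it against the definition above: countLines "test" is 1, countLines "test\n" is 1, and countLines "test\ntest" is 2, matching length . lines.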

While in the previous benchmark it was easy to examine and understand the generated GHC Core, for this one it is much harder. The benchmark has ended up using a recursive group of functions, none of which use unboxed numbers. For this reason, I suspect that the benchmark could be improved further for Supero. Once again, credit needs to be given to GHC for the results - it is clearly doing some clever things.

This benchmark is slightly more interesting than the previous one, but is still a bit of a micro-benchmark - not representative of a real program. My next task is probably to move on to "wc -w" to complete the set, then a real benchmark suite such as nobench. I hope this demonstrates that Haskell can be beautiful, pure, and match C for speed in some cases.

Wednesday, May 16, 2007

13% faster than GHC*

* As usual, I'm going to qualify this with "on one particular benchmark, on one particular machine, with one particular set of following wind".

I've been working on optimisation techniques for the last week, as one chapter of my PhD is likely to be on that topic. The approach I'm using is to take Yhc.Core, optimise it at the Core level, then spit out GHC Haskell which can then be compiled. The optimisation technique I've been working on can best be described as "deforestation on steroids" - it's entirely automatic (no notion of special named functions as in foldr/build etc.) and isn't too expensive.

The particular benchmark I've been working on is "print . length =<< getContents" or "wc -c". In order to make it a fair comparison, I require that all implementations must use the C getchar function at their heart for getting the next character - not allowing any buffering or block reads. This is actually a substantial benefit for C, since in Haskell things like clever buffering can be done automatically (and GHC does them!) - but it does result in a test which is purely on computational crunch.

Writing this benchmark in C is quite easy:


#include "stdio.h"

int main()
{
    int i = 0;
    while (getchar() != EOF)
        i++;
    printf("%i\n", i);
    return 0;
}


Writing in Haskell, forcing the use of getchar, is a little bit more tricky:


{-# OPTIONS_GHC -fffi #-}

import Foreign.C.Types
import System.IO
import System.IO.Unsafe

foreign import ccall unsafe "stdio.h getchar" getchar :: IO CInt

main = print . length =<< getContents2

getContents2 :: IO String
getContents2 = do
    c <- getchar
    if c == (-1) then return [] else do
        cs <- unsafeInterleaveIO getContents2
        return (toEnum (fromIntegral c) : cs)


I then benchmarked the C on -O3, versus the Haskell going through my optimiser, and going through GHC on -O2. The benchmark was counting the characters in a 35Mb file, run repeatedly, as usual:
The actual numbers are C=4.830, Supero=5.064, GHC=5.705. This makes Supero 5% slower than C, and GHC 18% slower than C. The reason that Supero is faster than GHC is that it has managed to deforest the length/getContents pair. The inner loop in Supero allocates no data at all, compared to the GHC one which does. Supero allocates 11,092 bytes in the heap; GHC allocates 1,104,515,888.

This benchmark is not a particularly good one - the task is simple, the computation is minimal and most expense is spent inside the getchar function. It is however a good place to start, and shows that Haskell can get very close to C speeds - even using standard type String = [Char] processing. Supero has made global optimisations that GHC was unable to make, as GHC has no techniques with which it could achieve the same results. GHC has then taken the Supero output and automatically identified strictness, performed unboxing/inlining, and done lots of other useful low-level optimisations.
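The fused inner loop is roughly the following shape (a sketch with the character source passed in as a parameter, so it can run without the FFI; Supero's real output will differ):

```haskell
import Data.IORef

-- Fused equivalent of counting the characters from getContents2: count
-- directly, never building the intermediate String. 'next' stands in
-- for the foreign getchar, with -1 playing the role of EOF.
countChars :: IO Int -> IO Int
countChars next = go 0
  where
    go n = do
        c <- next
        if c == (-1) then return n else go (n + 1)

-- A small in-memory character source for trying it out.
charSource :: String -> IO (IO Int)
charSource s = do
    ref <- newIORef (map fromEnum s)
    return $ do
        cs <- readIORef ref
        case cs of
            []      -> return (-1)
            c : cs' -> writeIORef ref cs' >> return c
```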

I'm going to move on to "wc -l" next, and hopefully onto the nobench suite. The techniques used in Supero are all general, and I hope that as the complexity of the problem increases, the payoff will do so too.

Wednesday, May 09, 2007

Does XMonad crash?

{- ICFP referees look away now. -}

Dons recently asked me to look over the StackSet.hs module of XMonad, checking for pattern-match errors, using my Catch tool. This module is the main API for XMonad, and having a proof of pattern-match safety would be a nice result. I performed an initial check of the library, including introducing the necessary infrastructure to allow Catch to execute, then sent them the required patches. After I had given them the Catch tool and the initial results, sjanssen picked up the results and went on to modify the code further - with the end result that now StackSet will never crash with a pattern-match error.

I'm not going to cover the changes that XMonad required to achieve safety - that is a story that sjanssen is much better equipped to tell. What I am going to describe is the process I followed to get to the initial results.

Step 1: Download Catch, Yhc and XMonad

Catch can be obtained from http://www-users.cs.york.ac.uk/~ndm/catch/ - however a word of warning - it is an involved installation procedure. In the next few months I hope to make Catch use Cabal, and then ordinary users will be able to easily install Catch.

Yhc can be obtained from http://www.haskell.org/haskellwiki/Yhc

XMonad can be obtained from http://www.xmonad.org/

Step 2: Make sure StackSet.hs compiles with Yhc

Initially StackSet used pattern guards; one small patch and these were removed. There was also a small issue with defaulting and Yhc, which I was able to paper over with another small patch. With these two minor changes Yhc was able to compile StackSet.

Step 3: Write a test for StackSet

The test for StackSet is based on the Catch library testing framework, and is all available at http://darcs.haskell.org/~sjanssen/xmonad/tests/Catch.hs

The basic idea is to write a main function which combines all the functions under test:

--------------------------------------------------------
module Catch where
import StackSet

main = screen ||| peekStack ||| index ||| ...
--------------------------------------------------------

Catch takes care of all the remaining details. The ||| operator says to check that under all circumstances the function will not crash with a pattern-match error. The module is actually slightly more involved, but this is all detail which Catch will shortly hide in a library.

A future intention is to have a tool which automatically generates such a module from a library. This then allows the person performing the checking to simply point at a library, with no additional infrastructure.

Step 4: Test using Catch

This step is the first time we actually invoke Catch - and it's as simple as a single line:

> catch Catch.hs -errors -partial

And that's it - now Catch gives you all the necessary information to start fixing your code. I emailed a summary of the results as given by Catch to the XMonad list: http://www.haskell.org/pipermail/xmonad/2007-May/000196.html

At this point I hopped on IRC, and in discussions with sjanssen talked him through installing Catch and the necessary dependencies. Once he had this installed, he was able to modify the StackSet API and implementation to guarantee that no pattern match errors are raised. After all of this, running Catch over the test script now gives success.

A lesson to take away

There were only two sticking points in running Catch over XMonad. The first was that Yhc can't always compile the code, and may require slight tweaks to be made. There are two solutions to fixing this, we can either fix Yhc, or I can make Catch use GHC Core. The GHC Core solution shouldn't be too much work, once GHC has a library for external Core.

The second sticking point is that installing Catch is a lot harder than it should be. A little bit of Cabal magic and a proper release should squash these problems easily.

So what lessons did I learn? Perhaps the thing that surprised me most is how easily someone else was able to get to grips with the Catch tool. This gives me a strong sense of hope that average users may be able to employ the Catch tool upon their programs, either in this version or a future version.

Monday, April 30, 2007

Phantom Types for Real Problems

As part of my work on Hoogle 4 I've been bashing my deferred binary library quite a lot. I encountered one bug which I hadn't considered, and was able to come up with a solution, using phantom types. I've read about phantom types before, but never actually put them into practice!

The idea of the deferred binary library is that writing instances should be as short as possible - they don't need to be massively high performance, and I wanted to see if I could do it with minimal boilerplate. Each instance needs to be able to both save and load a value - which means it needs to do some sort of pattern-match (to see which value it has), and have the constructor available (to create a new value). The shortest scheme I was able to come up with is:


data Foo = Foo String Bool | Bar Foo

instance BinaryDefer Foo where
    bothDefer = defer
        [\ ~(Foo a b) -> unit Foo << a << b
        ,\ ~(Bar a) -> unit Bar << a
        ]


The idea is that to save a value, you step through the list trying to match each constructor, using Control.Exception.catch to see if the value does not match. Once you have matched, you save an index, and the components. To load a value, you take the index and pick the correct constructor. This time, you deliberately do not mention any of the fields, and by remaining lazy on the fields, the pattern match is not invoked. This solution took quite a lot of thought to arrive at, but does minimise the size of the instance declarations.
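The heart of the trick can be seen in a stripped-down sketch (my simplification - the real library also saves the fields and writes the index to disk):

```haskell
import qualified Control.Exception as E

data Foo = Foo String Bool | Bar Foo

-- Find which constructor a value uses by trying each lazy-pattern
-- function in turn: forcing a field triggers the delayed pattern match,
-- which throws if the constructor is wrong, and we catch that failure.
constructorIndex :: Foo -> IO Int
constructorIndex v = go 0 alternatives
  where
    alternatives = [ \ ~(Foo a b) -> a `seq` b `seq` ()
                   , \ ~(Bar a)   -> a `seq` () ]
    go _ []       = error "no alternative matched"
    go n (f : fs) = do
        r <- E.try (E.evaluate (f v)) :: IO (Either E.SomeException ())
        case r of
            Right _ -> return n
            Left _  -> go (n + 1) fs
```

So constructorIndex (Foo "x" True) gives 0, and constructorIndex (Bar (Foo "" False)) gives 1, because the first alternative's delayed match fails when a field is demanded.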

The instance relies crucially on the laziness/strictness of the functions - the load function must not examine any of its arguments (or the pattern match will fail), while the save function must examine at least one argument (to provoke pattern-match failure). After writing a description such as this, I am sure some readers will have spotted the issue!


data Bool = False | True

instance BinaryDefer Bool where
    bothDefer = defer
        [\ ~(False) -> unit False
        ,\ ~(True) -> unit True
        ]


The problem here is 0-arity constructors, the load code works just fine, but the save code has no fields it can demand to ensure the correct constructor has been reached. I encountered this bug in Hoogle 4 - it is subtle, and in a pile of instance rules, these 0-arity cases can easily be overlooked. The solution is to introduce a new operator, (<<!), whose only purpose is to strictly demand something:


instance BinaryDefer Bool where
    bothDefer = defer
        [\ ~(x@False) -> unit False <<! x
        ,\ ~(x@True) -> unit True <<! x
        ]


This (<<!) operator is only required for 0-arity constructors, but is not harmful in other cases. Now that the correct fix has been decided upon, we can fix the Hoogle instances. Unfortunately, we can still write the incorrect case, and will get no warning - a bug just waiting to happen. The solution to this is to use phantom types.

The whole point of a phantom type is that it doesn't actually reflect real data, but exists only at the type level, artificially restricting type signatures to ensure some particular property. In our case, the property we want to ensure is that "unit" is never used on its own, but either with (<<), (<<~) or (<<!). We can achieve this by giving unit a different type, even though in reality all operate on the same type of data:


data PendingNone
data PendingSome

unit :: a -> Pending a PendingNone
(<<) :: BinaryDefer a => Pending (a -> b) p -> a -> Pending b PendingSome
(<<~) :: BinaryDefer a => Pending (a -> b) p -> a -> Pending b PendingSome
(<<!) :: Pending a p -> a -> Pending a PendingSome


Now the type of unit is PendingNone, but the type of all other operations is PendingSome. By demanding that the list "defer" accepts contains only PendingSome values, we are able to raise a type error if the (<<!) is missed.
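A minimal self-contained sketch of the idea (simplified types - the real Pending carries the partially-applied constructor, which I've dropped here so only the phantom machinery remains):

```haskell
data PendingNone
data PendingSome

-- The phantom parameter 'p' records whether an operator has been
-- applied; the payload is the same either way.
newtype Pending a p = Pending a

unit :: a -> Pending a PendingNone
unit = Pending

-- Simplified (<<!): demand the second argument, flip the phantom.
(<<!) :: Pending a p -> b -> Pending a PendingSome
Pending x <<! y = y `seq` Pending x

-- 'defer' only accepts fully-applied alternatives.
defer :: [Pending a PendingSome] -> [a]
defer = map (\(Pending x) -> x)

-- defer [unit 'x']  -- rejected: PendingNone is not PendingSome
ok :: String
ok = defer [unit 'x' <<! (), unit 'y' <<! ()]
```

Uncommenting the rejected line gives a compile-time type error, which is exactly the static guarantee we were after.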

By using phantom types, an easy-to-make error can be guarded against statically. Plus, it's not very hard to do!

Friday, April 27, 2007

Hoogle 4 type matching

I've done some more work on Hoogle 4, and can now create a database from a textual file, then perform text searches upon the database and show the results. It is basically a command line version of Hoogle restricted to text search only. A lot of the time went on debugging binarydefer, which has saved a lot of effort and meant that all the mistakes due to complex file prodding and seeking are localised in a library which can be reused throughout. Hoogle 4 is both a library and a program, and now I have a minimal API which is mostly free from any monads.

Now I get on to the fun bit, how to do type search. I've done type search in three entirely separate ways:

  • Hoogle 1: Alpha conversion with missing arguments
  • Hoogle 2: Unification
  • Hoogle 3: Edit distance


Of course, it wouldn't be a new version of Hoogle if I didn't have a new way to perform type searches :) The new type searches are based on six operations; the idea is that you start with a target type (what the user queried for), and transform the type into the source type (from the source code). Each transformation has a cost associated with it, and can be run in both directions. I finalised the six transformation steps some time ago, and they have been floating around on my desk as a piece of paper - not a safe place for paper to live. As such, I've decided to write them up here, in the hope that Google won't lose them.

Each operation has an associated symbol; I've included these symbols in TeX as well as Unicode.

Alpha conversion \alpha (α): a --> b. The standard lambda calculus rule, but this time alpha conversion is not free.

Unboxing \box (□): m a --> a. Given something in a box, take it out. For example, Maybe a --> a.

Restriction (!): M --> a. Given something concrete like Bool, turn it into a free variable.

Aliasing (=): a --> alias(a). This makes use of the type keyword in Haskell, for example String --> [Char], because type String = [Char].

Context \Rightarrow (⇒): C a => a --> a. Drops the context on a variable, i.e. Eq a => a --> a.

Membership \in (∈): C M => M --> C a => a. For example, Bool --> Eq a => a.

There are some additional restrictions, and methods for applying these rules, but these are the core of the system.
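For reference, the six steps can be summarised as a plain datatype (the constructor names are my own shorthand for the operations above, not Hoogle's actual representation):

```haskell
-- The six Hoogle 4 transformations, target type --> source type.
data Step
    = Alpha        -- a --> b
    | Unbox        -- m a --> a
    | Restriction  -- M --> a
    | Aliasing     -- a --> alias(a), e.g. String --> [Char]
    | Context      -- C a => a --> a
    | Membership   -- C M => M --> C a => a, e.g. Bool --> Eq a => a
      deriving (Show, Eq, Enum, Bounded)

allSteps :: [Step]
allSteps = [minBound .. maxBound]
```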

Wednesday, April 25, 2007

More Hoogle progress

I haven't done much on Hoogle 4 in quite a while, and since I keep telling people that various bugs will be "fixed in Hoogle 4", it seems only polite to fix those bugs in the near future. I decided that this evening I would focus on Hoogle 4, since I was slowly going insane from writing academic papers. I'm now just about to go to bed, so thought I would summarise what I did.

One of the things that has been slowing me down on Hoogle is the fact that I rarely get a chance to work on it, and when I do that means I have to refamiliarise myself with the code base. Each time I do this, I forget something, forget what I was working on, and generally end up slowing down development. It also means that working on Hoogle for "an odd half hour" is not practical; it takes at least a couple of hours to get up to speed. In order to solve this, I've done some diagrams of the module structure, and the main Hoogle database. They may not mean much to anyone else, but I hope they are enough to keep my mind fresh with where things go.

As part of documenting the module structure, it also gives a chance to reflect on where the module structure is a bit unclear, and modify as necessary. Previously DataBase and Item were each dependent on the other - with tricks like having an extra field in Item which was of a parametrised type, i.e.:

data Item a = Item {name :: String, typeSig :: TypeSig, other :: a}

Fortunately now all that is gone, and the structure is a lot more logical.

The other thing I've been doing is moving to the Deferred Binary library. Part of the reason for Hoogle 4 becoming bogged down is the creation of a central database that is loaded on demand, and contains tries etc. to perform efficient searches. Unfortunately, this loading on demand aspect had infected the entire code, resulting in a big IO Monad mess. It was slightly more pleasant than coding in C, but all the low-level thoughts had seeped through, requiring careful thought, slow progress and frequent mistakes.

To take one example, the Hoogle.DataBase.Texts module, which is responsible for performing all textual searches and storing the lazily-loaded trie. Compare the old version and the current version. The code has shrunk from 186 lines to 65 - of which only about 30 are real code. The IO Monad is nowhere to be seen - everything is pure. The really low-level functions such as sizeTree and layoutTree have gone entirely.

I'm not sure when I will next be able to get to Hoogle 4, but next time I hope to start working on it faster, and that the cleaner code makes for happier coding!

Note: I have started displaying adverts on this blog because it was only a button click away (very impressive integration by Google), it seemed like a fun thing to try, and the adverts in Gmail are sometimes quite helpful. I have deliberately chosen not to show too many adverts, placed them off to the side where they won't interfere with reading, and tried to keep it minimal - but if they are annoying please let me know. I am going to donate whatever money (if any) comes from this to charity - but please don't take that as an incentive to artificially click on adverts you aren't interested in.

Monday, April 23, 2007

Boilerplate considered harmful

At the moment I'm working on a boilerplate removal library for Haskell, which is faster (runtime), shorter (code), more type safe and requires fewer extensions than Scrap Your Boilerplate (SYB). However, since I haven't finished and released a stable version, I can't really recommend people use it. The reason I started working on this is that I was unaware of SYB when I started. Last week I also introduced someone to SYB who had done quite a bit of Haskell programming, but had not stumbled across it. As a result, I think SYB needs a bit more attention - it is one of the strengths of Haskell!

Disadvantages

Before saying how great SYB is, it's important to point out the things that make it not so great:

  • Only implemented in GHC - sorry to the poor Hugs users
  • Requires rank-2 types, which means it's not actually Haskell 98
  • Occasionally the rank-2-ness infects the code you write, with unfortunate error messages (although this is not that common)


A data structure

Before showing some operations, I'm going to first introduce a data structure on which we can imagine operations are performed. I don't like the example from the SYB benchmark - it feels like an XML file (as is the stated intention), which means that the logic behind it is a bit disturbed. So instead I'll pick a data type like an imperative programming language:

{-# OPTIONS_GHC -fglasgow-exts #-}
import Data.Generics

data Expr = Var String | Lit Int | Call String [Expr] deriving (Data, Typeable)
data Stmt = While Expr [Stmt] | Assign String Expr | Sequence [Stmt] deriving (Data,Typeable)

We define the data type as normal, adding deriving for Data and Typeable - the two key SYB type classes. We also add an import and a flag, just to get the GHC machinery working for the derivings.

Queries


So let's imagine you have to get a list of all literals. In SYB this is easy:

extractLits :: Data a => a -> [Int]
extractLits = everything (++) ([] `mkQ` f)
    where f (Lit x) = [x] ; f _ = []

Wow, easy! This function will operate on anything which has a Data instance, so you can run it on an Expr, Stmt, [Stmt], [Either Stmt Expr] - the choice is yours. For the purposes of a short introduction, I'd recommend treating all the bits except the "f" as just something you write - read the full SYB paper to get all the details of what everything can be used for.

Traversals

Now let's negate all the literals; we have:

negateLits :: Data a => a -> a
negateLits = everywhere (mkT f)
    where f (Lit x) = Lit (negate x) ; f x = x

Again, it's pretty easy. And once again, consider everything apart from the "f" as just something you write.

The gains in code reduction that can be made with SYB are quite substantial, and by removing the boilerplate you get code which can be reused more easily. Boilerplate code is bad, and should be removed where possible.
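Putting the two together on a small fragment (standard SYB; I've re-listed the definitions so the example is self-contained, and used the LANGUAGE pragma form of the flag):

```haskell
{-# LANGUAGE DeriveDataTypeable #-}
import Data.Generics

data Expr = Var String | Lit Int | Call String [Expr]
            deriving (Data, Typeable, Show)
data Stmt = While Expr [Stmt] | Assign String Expr | Sequence [Stmt]
            deriving (Data, Typeable, Show)

extractLits :: Data a => a -> [Int]
extractLits = everything (++) ([] `mkQ` f)
    where f (Lit x) = [x] ; f _ = []

negateLits :: Data a => a -> a
negateLits = everywhere (mkT f)
    where f (Lit x) = Lit (negate x) ; f x = x

-- A small statement to try them on.
example :: Stmt
example = While (Call "f" [Lit 1, Var "x"]) [Assign "y" (Lit 2)]
```

Here extractLits example gives [1,2], and extractLits (negateLits example) gives [-1,-2] - both functions walk the whole Stmt/Expr structure without a single hand-written traversal.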

Thursday, April 19, 2007

Visual Basic, the story

Today on the Haskell mailing list, someone confused VBScript and VBA. It's not really surprising: the "Visual Basic" or "VB" name has been attached to at least 3 entirely different programming languages! This is an attempt to clear up the confusion. My first programming language (after .bat) was Visual Basic, and I loved it. I've worked on Excel/Word programs using VB, written applications for money in VB, and lots more besides - it's a great language. Often people criticise VB for being "under powered" or a "beginners' language" - those people are wrong - but that's not something I'm going to cover here :-) [Technical note: I may have got some of this wrong, I was a user of the language - not an implementer]

Visual Basic 3

We'll start the story at Visual Basic 3, because that's the first version I ever used. My Sixth Form would copy it to floppy disk if you brought in 3, which is basically school-sponsored piracy. Anyway, this language was an imperative language, with built-in support for graphical interfaces.

You can spot a VB 3 program a mile away: they look clunky, and alas VB made it really easy to change the background colour of a form. The ability to easily change the colours made it a hotbed for garish colours - does your in-house application have pastel shades of red? If so it's either a C programmer who went to a lot of work, or more likely, a VB programmer without any sense of style. I've seen these apps running on point-of-sale terminals in some shops.

Visual Basic 5/6

I'm going to lump these two versions into one; there were some changes, but not many. Continuing the same trend as VB 3, but moving towards being an object orientated language. This language had very strong COM interoperation - all VB objects were COM objects, and all COM objects could be used in VB. The COM ability made it ideal for many things, and opened up lots of easy to use libraries. All windows became objects, and some of the clunky bits in VB 3 got deprecated in favour of the newer methods, but most programs still worked exactly as before.

Alas, VB 6 still allows the background colour of a form to be set, but if you don't (please don't!) the program looks very Windows-native - much more so than Java or even C#.

Visual Basic for Applications (VBA)

At the time of VB 5, Microsoft moved over to using VB as the language of scripting in Office, replacing WordScript etc. The VBA language is VB, but rather than have the VB libraries, it has the Office libraries. The languages are identical in general use.

VBScript

Microsoft also decided they needed a JavaScript (or JScript as they call it, or ECMAScript as dull people call it) competitor. They decided to repurpose VB, changing it slightly to give it more of a scripting feel. If you use IE, you can run VBScript in the browser. If you like email viruses, you can run the .vbs files you get in your inbox. If you have Active Scripting installed, you can run .vbs files at the command line.

But please, don't use VBScript. JavaScript is a much nicer language, and works everywhere VBScript does and in more places besides. The language VBScript should quietly go away and die. There is one exception: ASP provides one feature (uploading a binary file) to VBScript but not JavaScript. As a result, I do have a website written in VBScript which I have to maintain.

VB.NET

And finally we get to the newest edition, or ¬VB.NET as I will refer to it. When changing to C# with .NET, Microsoft decided that shunning all the VB developers would be a bad idea, so instead they shunned them and wrote a syntax wrapper round C#. The ¬VB.NET language is a perfectly fine language, but it's not VB. Things like late binding have been removed, the syntax has been made more regular, the object facilities have totally changed, etc. If you have a VB project of any mild complexity, you'll not get very far with the automatic converter.

If you are a VB developer looking to move towards the new .NET framework, I'd look at both ¬VB.NET and C# before deciding which one to pick. The differences aren't that major, but at the time I was learning (.NET 1.0 beta), ¬VB.NET was clearly the unloved younger brother of C# - the examples were all in C#, which was clearly the favoured language. Now both languages have evolved so they are not as similar, but neither is worthy of the VB name.

Me and VB

VB was my first language, and my "first choice" language for years. After learning C I initially poured scorn on VB for not being a "real language"; then I realised I was an idiot, and that VB is a lot better for most application programming than C - and you can always interface to a C .dll if you must. Now Haskell and C# (for GUIs) have become my standard languages - I haven't started up VB 6 for quite a while. I've got a few programs written in VB 6 that I technically maintain, but none that I've had any reports on in ages, and none that I can imagine working on. I've got at least 50 programs in my VB directory, but I've forgotten quite where I put my VB directory - it's certainly not been opened in a while.

Even though I may not use stand-alone VB 6 anymore, I still maintain and actively work on some VBA solutions for accountants in Excel, and a couple of website templating systems using VBA in Word. I also develop and maintain a corporate website using VBScript, which requires new features to be added at least once a year. VB is not yet dead, but it's definitely going that way.

Tuesday, April 17, 2007

Coding Nirvana

Today I experienced the perfection of coding, and I never want to go back to the old hacking method! My FilePath library is nearly ready for release, but just before it went out I got a bug report.

The FilePath library is based on the idea that all functions must satisfy a list of properties, which are documented in the interface. I use QuickCheck to automatically test that these properties hold. This leads to the great situation where there are only two classes of bug:
  • A property is wrong
  • A property is missing
The bug report I received was of the second kind - a property was missing. The next step is to decide if this property should hold or not, and in this case the reporter was perfectly right: this property should hold. After that comes deciding whether a more general version of the property should hold - and in discussion with the reporter, a more general version was formulated. By working at the level of properties, the code is no longer the hard part, and while collaborative coding on a single function isn't that useful, collaborative property discussion works really well.

Once the properties have been decided upon, and having left a night to see if anyone had any other thoughts on the properties, the coding can begin. This then makes the process of coding much less hacky, and much more precise:

  1. Write out the properties
  2. Write out code to implement them
  3. If the tests fail, you have a counter example - go back to 2
  4. Done


By the time you get to 4, the code meets the properties. You have already run thousands of tests (more than 10,000 in FilePath) and can have a fairly strong degree of confidence in the work you have produced. The bug reporter has already agreed with the changes you've made, even before you've made them.
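As a sketch of what this workflow looks like in practice, here is a toy extension-splitting function of my own invention (not the real FilePath code) together with one property. In the real library QuickCheck generates the inputs; here the property is stated as a plain Boolean function and checked on a few samples:

```haskell
-- Toy illustration of the property-first workflow; splitExt and its
-- property are hypothetical, not taken from the real FilePath library.
splitExt :: String -> (String, String)
splitExt path = case break (== '.') (reverse path) of
    (_, "")           -> (path, "")                        -- no extension
    (rext, _ : rbase) -> (reverse rbase, '.' : reverse rext)

-- Step 1: write out the property. Splitting then rejoining must be
-- the identity on every input.
prop_joinSplit :: String -> Bool
prop_joinSplit x = uncurry (++) (splitExt x) == x

main :: IO ()
main = print (all prop_joinSplit ["file.txt", "noext", "archive.tar.gz", ""])
-- prints True
```

If the property fails, QuickCheck hands you a counter-example, which is exactly step 3 of the loop above.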

I have never before written a library so focused on properties, but it has had amazing advantages. The advantage of having correct and reliable code may be obvious - but in practice the amount of confidence in the code surprised me. By far the bigger benefit has been collaboration - the highly declarative nature of properties makes them well suited to "group thought".

Monday, April 16, 2007

No Termination Checking

Originally my PhD included an element on termination checking, and since someone asked on IRC how that bit was going, I thought I'd let everyone know that my PhD no longer includes termination checking. There are several reasons why:

Related Work

Not long before I started working on the termination element, this piece of work came out: Automated Termination Analysis for Haskell: From Term Rewriting to Programming Languages. It does a really good job. The authors behind that paper have been working on termination of term rewriting for a really long time, and have built up a very powerful framework. Their encoding of Haskell into that framework means they have a very powerful tool, based on years of experience. I don't think I could hope to come anywhere near that.

Less Motivation

I want the tool I produce to solve a practical problem. I want it to be something that people can use on a daily basis. In Haskell, termination is rarely a problem, and if it is, usually the problem occurs in all circumstances - not just the corner cases (of course there are exceptions, but this is my overall impression). A lot of termination issues relate to variables being bound unexpectedly, lazy evaluation and black holes - and these will occur on any sufficiently thorough test input. Once non-termination has occurred, there is no point in running a static checker - use a dynamic tool such as Hat-Nonterm or Black-Hat, both specially designed to pinpoint the cause. These factors mean that my motivation to work on a termination checker is less - I don't see it as fulfilling a massive practical need in Haskell. (I know some will disagree, but this is my reason for not doing it, not a reason for others not to - motivation is a personal thing.)

Hard to Prove

If you have a static checker, there are two main reasons for using it: to obtain a proof of safety and to find real bugs. It is most useful when the issues that are spotted can be worked around in some way, so that even if they were not a valid problem, afterwards you can obtain a statically checked version that you can guarantee will terminate for all inputs. Pattern-match checking has this property: by completing the patterns or performing extra (potentially redundant) checks, you can be sure all patterns are safe. From my understanding, termination checking does not - introducing a decreasing variable into an algorithm that does not already have one is a lot of work, and potentially requires a rethink of the entire algorithm. For this reason, even with the help of a willing programmer, I see termination checking as something that would obtain less successful results than I would like.

Thursday, April 12, 2007

SOC: GuiHaskell

The Google Summer of Code allocations are now out, and GuiHaskell got one of the slots from Haskell.org. There were quite a few students applying for GuiHaskell - I thought the students who applied were very good, and any of them would have done a wonderful job. In the end Asumu Takikawa was chosen, and I wish him well with the project.

A link to his application is available here.

The basic idea as I see it is to start from the existing GuiHaskell base and evolve it so that by the end of the summer it is a very solid alternative to WinHugs, but with several key advantages:

  • Written in Haskell using Gtk2Hs, not C with the Win32 API
  • Cross platform: Windows, Linux and Mac at least
  • Cross compiler: Hugs, GHC and GHCi to begin with
  • Integration with Haddock/Cabal/Hoogle etc.


The initial aim is not to make it an IDE, but to do everything an IDE might do apart from having an integrated editor. Depending on the progress, this may be something that is tackled this summer, or perhaps not.

Exploding Patterns in FilePath

[ICFP referees please stop reading now]

After what feels like forever, my FilePath library has finally hit 1.0. It took over 250 email messages debating FilePath design choices from Haskell 1.0 onwards, but finally we were able to come to an agreement about what the library should look like. I'd like to thank everyone who participated in the discussion, used the library, and encouraged me to keep pushing it.

I've previously blogged about my QuickCheck testing of the FilePath library, something which has been essential to coming up with a coherent design - and which led to various decisions being made for me. Now I'd like to talk about the pattern-match story. For the purposes of this post, I'm going to use the term "exploding patterns" to mean pattern-match errors that can be provoked by appropriately chosen input.

After doing all the QuickCheck tests, you might think I was quite confident of avoiding exploding patterns, but I was not. The QuickCheck tests all work with lists of 25 characters - perhaps the empty list crashes the function? Perhaps only some really rare input? After all that loose random testing, the time came for some serious proof!

Taking a look at the code, I have used 5 incomplete functions (fromJust, init, head, last, tail) and one explicit non-exhaustive pattern match. Can any of these explode? Turning the code over to my pattern-match checker, it quickly (2 seconds and 2Mb later) informs me that it can prove none of the incomplete functions crash, but it can't prove the explicit pattern match safe. So let's take a look:

f x | isWindows = ...
    | isPosix = ...

Oh dear, it won't crash currently (isPosix = not isWindows), but you can certainly see that these patterns aren't going to stay complete if we add another operating system. Quick as a flash, we replace "isPosix" with "otherwise", and run the checker again. Yay, no exploding patterns remain.
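Reconstructed from that description as a standalone sketch (the actual function in FilePath is something else; the definitions here are purely illustrative), the fixed code looks like:

```haskell
-- Hypothetical reconstruction of the fix: with `otherwise` as the final
-- guard, the match is provably complete, even if a third operating
-- system is ever added.
isWindows :: Bool
isWindows = False   -- assumption for illustration only

pathSeparator :: Char
pathSeparator
    | isWindows = '\\'
    | otherwise = '/'   -- previously `isPosix`; now a safe catch-all
```

The behaviour is unchanged today, but the checker can now prove completeness rather than relying on the fact that isPosix happens to be `not isWindows`.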

So there we have it: in under 5 minutes, I was able to prove my entire library to be safe. I can still use all the fun functions like head, but in a safe manner. It turns out the library was always safe, but with one minor modification to improve the code, I can be sure that things won't go wrong - the released 1.0 has this modification. I have the guarantees of safety, without much work at all. Not only that, but as the library evolves, I'm only 2 seconds away (literally!) from pattern-match safety; compared to my QuickCheck tests, which take around a minute, this is very cheap verification.

So how come that within an hour of releasing 1.0 someone had already replied with a bug report about a crashing pattern? The answer is simple: Setup.hs in Cabal has code in it which could (at the time of release, but not anymore) cause a pattern explosion. I didn't write, or validate, the Setup.hs script - but it just goes to show that pattern-match errors can easily slip through!

Wednesday, April 11, 2007

Redesigned Website

I've redesigned my main academic website: http://www-users.cs.york.ac.uk/~ndm/

Why did I decide to redesign my web page? Because I realised that:

  • I was being put off releasing code because adding it to my website was a pain
  • I wanted tags
  • The number of projects had exceeded the vertical space in some browsers
  • I wanted better cross-site linking - for example, all downloads should appear in one place


The site was previously written in PHP, as that is the language supported on the server, but I don't know PHP, and from what I do know, I hate it. The solution was obviously to rewrite the site using Haskell - but how?

I initially looked towards a web framework, but given I can't install anything on the server, that makes it a bit hard. The next choice was to write a preprocessor that takes the source code and generates the end result HTML. This was what I decided to do.

I now write each page in HTML, but in a modified form of HTML. For example, the code for the Catch page is at: http://www-users.cs.york.ac.uk/~ndm/catch/catch.html - do view source to see the original. The start is a list of properties, which include tagging information, slides/papers etc and names. The rest of the document is simply the content.

The preprocessor then gives an appropriate beginning and end to each document, fills in some magic tags I have defined, and outputs a complete document. All slides/tags etc. are collated and placed in central files - for example http://www-users.cs.york.ac.uk/~ndm/downloads/
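The tag-filling step could be sketched like this (a toy version: the tag name and the function are my invention, not what the actual Main.hs does):

```haskell
import Data.List (isPrefixOf)

-- Toy sketch of magic-tag substitution: scan the document, replacing
-- each known tag with its value and leaving everything else untouched.
expand :: [(String, String)] -> String -> String
expand tags = go
  where
    go [] = []
    go s@(c:cs) =
        case [ (val, drop (length key) s)
             | (key, val) <- tags, key `isPrefixOf` s ] of
            (val, rest) : _ -> val ++ go rest   -- a tag matched here
            []              -> c : go cs        -- ordinary character
```

For example, `expand [("<!--TITLE-->", "Catch")] "<h1><!--TITLE--></h1>"` gives `"<h1>Catch</h1>"`.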

The code weighs in at 350 lines, and can be found at http://www-users.cs.york.ac.uk/~ndm/Main.hs - it's overly long and could do with some proper abstraction being applied - I just hacked it together as required. However, it's fairly simple and generates a nice result without too much effort. Adding a new page is now as simple as creating a new file, and all the indexes are automatically maintained.

I am not sure if this approach will suit everyone, or even if it will suit anyone but me, but I think this is a very nice way to create a site which is easily extendable and very consistent, while keeping the overhead on each page down.

I am sure there is a common tool underlying this, just waiting to get out, but unless anyone else expresses an interest in this approach, the tool will remain buried in my specific needs.

Friday, April 06, 2007

A beautiful proof

I got told a rather beautiful proof the other day, and wanted to write it down.

Question: Given any two irrational numbers x and y, does x^y ever evaluate to a rational number?

Answer: Yes.

Proof: Given the definitions:

r = 2^0.5 (square root of 2, known to be irrational)
z = r^r

Either:

  • z is rational. Set x=r, y=r and you have a proof.
  • z is irrational. Set x=z, y=r. x^y = z^r = (r^r)^r = r^(r*r) = r^2 = 2, which is rational.


This proof does not provide a particular value of x and y, but merely shows there must be one.
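Not part of the proof, but a quick numeric sanity check of the irrational branch, within floating-point error:

```haskell
-- Numerically check that (r^r)^r = 2 for r = sqrt 2.
r :: Double
r = sqrt 2

main :: IO ()
main = print ((r ** r) ** r)   -- approximately 2.0
```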

Friday, March 30, 2007

Eq but not Ord

I recently wondered, which data types in the Haskell library have instances for Eq, but not for Ord? Of course, everything with Ord is obligated to provide Eq, but how many provide only Eq?

The answer is 82:

Action, Body, BuildInfo, Callconv, Clause, Compiler, CompilerFlavor, Con, ConfigFlags, Constr, ConstrRep, Counts, DataRep, Dec, Errno, Exception, Executable, Exp, Extension, Fixity, FixityDirection, Foreign, FunDep, Guard, Handle, HandlePosn, HsAlt, HsAssoc, HsBangType, HsConDecl, HsDecl, HsExp, HsExportSpec, HsFieldUpdate, HsGuardedAlt, HsGuardedAlts, HsGuardedRhs, HsImportDecl, HsImportSpec, HsLiteral, HsMatch, HsPat, HsPatField, HsQualType, HsRhs, HsStmt, HsType, IOArray, IOErrorType, IOException, IORef, Lexeme, Library, License, Lit, LocalBuildInfo, Match, MVar, Node, Orient, PackageDescription, PackageIdentifier, Pat, Range, Safety, SockAddr, Socket, SocketStatus, StableName, StablePtr, STArray, State, Stmt, STRef, Strict, Tree, TVar, TyCon, Type, TypeRep, URI, URIAuth

Wednesday, March 28, 2007

let vs letrec

I've been working hard on Catch for the last few months, and will hopefully have something to release to the world very soon - a few people have alpha copies already. As part of that I've been playing with Yhc Core lots and lots. At one stage I introduced a bug into Yhc, by accident, due to the difference between let and letrec, so I thought I'd describe my mistake for the whole world to see.

Haskell has a keyword called let, but some functional languages have both let and letrec - Haskell's form (although called let) is actually letrec. What's the difference? Consider the fragment of code:

let x = 1 in (let x = x in x)

If we are dealing with let, the answer is 1. If we are dealing with letrec, the answer is a black hole of non-termination - _|_. The rule is that in a letrec you introduce the names you are defining before computing their values (so the x = x refers to itself), but with let you only introduce the names once you get to in (so the x = 1 is still in scope).

Often people define lets in a non-recursive way, but occasionally the recursion is fundamental:

repeat x = let res = x : res in res

Here repeat is being defined with letrec - the res on the LHS is used on the RHS. This enables only one (:) cell to ever be created, saving memory.

Yhc.Core has let, which is recursive - unlike GHC Core which has both let and letrec. Yhc.Core also provides a method "removeRecursiveLet" which removes all recursive lets, replacing them with equivalent code.
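Conceptually (this is a sketch of the idea, not Yhc.Core's actual transformation), removing a recursive let means lifting the recursive binding out to a fresh top-level function:

```haskell
-- Before: repeat x = let res = x : res in res
-- After (sketch): the recursive binding becomes a top-level function.
-- Note that this particular lifting loses the sharing of the single
-- (:) cell that the letrec version enjoyed.
res' :: a -> [a]
res' x = x : res' x

repeatLifted :: a -> [a]
repeatLifted x = res' x
```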

So now we come to the bug. I recently changed Yhc.Core so that when desugaring FatBar/Failure (from complex pattern matches) we introduce let bindings, instead of inlining the expression multiple times. This can lead to a substantial reduction in code size for bigger programs. The basic idea is that where before, when a Failure was spotted, we replaced it with an expression, now we replace it with a variable which we bind to the expression. If there are multiple failures with the same expression, we gain by having less code.

Given the function:

binAdd False as [] = as
binAdd False [] (b:bs) = b:bs

This desugared to:

Num.binAdd v248 v249 v250 =
    let v_fail = error "Pattern match failure"
    in case v248 of
        Prelude.False ->
            let v_fail = case v249 of
                    Prelude.[] -> case v250 of
                        (Prelude.:) v251 v252 -> (Prelude.:) v251 v252
                        _ -> v_fail
                    _ -> v_fail
            in case v250 of
                Prelude.[] -> v249
                _ -> v_fail
        _ -> v_fail

As you might be able to see, there are two v_fails here which are part of a letrec, when I really wanted a let. Adding a simple numbering scheme to the v_fails made them refer to different bindings, and the problem was solved.

Wednesday, March 21, 2007

HsExcel

I was reading JFP this weekend, and read the paper "Spreadsheet functional programming" by David Wakeling. As a Windows user who has done a lot of work with Excel, I thought it was great. People before have either tried to pervert Excel into a functional language, or to write a new spreadsheet - and given that the Open Office spreadsheet sucks totally, this seems to be a hard challenge.

Despite being a lovely concept, I thought the implementation described missed the mark in a few places. These are not academic reasons, but reasons of practicality for a production system and integration with the underlying Excel. I fully expect that the reasons stem from the lack of a deep understanding of Excel - something few academics would be likely to master. However, my Dad is a Chartered Accountant, and I've been developing systems based around Excel for over 10 years, so I can fill in the Excel half.

Spreadsheet Functional Programming

Before describing how I would implement the ideas in the above paper, it's best to recap it a bit, for those people who can't be bothered to read it. (Note: the above paper is very accessible - even a novice programmer should be able to get something from it - so skip this summary and read the paper!)

The basic idea is that Excel should be able to call Haskell, so:

=Haskell("1+2")

should call off to Haskell and evaluate 1+2, hopefully returning 3. Naturally that sort of thing could be done in Excel much more conveniently, so the strength should be in definition of proper Haskell functions within Excel:

f :: Int -> Int -> Int
f a b = (a + b) `div` 2

=Haskell("f 1 2")

(imagine a more complex example for f, I'm too lazy to think of one, but the paper has one)

Obviously, =Haskell should go in the Cell you want the result to appear in, but what about the definition of f? This paper chooses to place definitions of f in comments, a neat solution that keeps them in the document, but out of the small cells, which would be inappropriate.

This is extended by allowing users to write references such as A1 in their Haskell strings. Unfortunately this shows up a weakness - now dragging formulae around does not automatically update their references as it normally does in Excel, since the reference is hidden from Excel inside a string.

The final point to mention is that the evaluation is done with Hugs, by writing out a static file and running it - something I believe could end up being too slow for practical use. Whether it is or not isn't something I can comment on, having not tried it in real life.

HsExcel

Reading the paper, there are lots of things which are done spot on, and seem obvious in retrospect. The best ideas always seem good in retrospect, so I consider this a very good thing. Unfortunately I see 3 issues:


  • Writing code in comments looks highly unpleasant

  • Haskell is in strings, which enforces separation between Haskell and Excel

  • Not fast enough



I have come up with a design which I believe addresses all of these issues, which I have named HsExcel. I have not implemented it, so cannot guarantee it would really work, but I suspect it would not be that much of a challenge. I have no free time currently, but it is on my todo list (somewhere).

The first issue I shall tackle is writing code in comments. A much nicer solution would be to add a tab to Excel which contained Haskell code in a nice big box, perhaps also allowing Haskell code in comments if people found it preferable for small snippets. This can be done using Excel dialog panes with a bit of VB magic.

Next is the issue of Haskell in strings, instead of =Haskell("f 1 2") I would write:

=hs("f",1,2)

Here Excel will evaluate 1 and 2, and pass them to the Haskell function f. If 1 was instead A1, then it would behave like a normal A1 reference in Excel. Perhaps allowing =Haskell as well would be nice, but I think the common case would be =hs. This also allows better flexibility, with Excel and Haskell functions nested together, intermingled. I also think this will make cell dependency updating work (an issue not covered in the paper).

Once this has been done, Haskell could operate on any Excel values, including Cell Ranges etc. Unfortunately the paper is silent on the issue of marshaling Excel data to Haskell, but I suspect it could be done relatively easily.
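On the Haskell side, the marshaling might be sketched like this (every type and function here is hypothetical - no such HsExcel library exists):

```haskell
-- Hypothetical Excel value type and dispatcher for =hs("f",1,2).
-- Excel numbers are Doubles; ranges marshal to lists of lists.
data ExcelValue = ENum Double | EText String | ERange [[ExcelValue]]
    deriving (Eq, Show)

-- Dispatch on the function name, applying the named Haskell function
-- to the already-evaluated Excel arguments; the averaging f is the
-- paper's running example, adapted to Doubles.
hs :: String -> [ExcelValue] -> ExcelValue
hs "f" [ENum a, ENum b] = ENum ((a + b) / 2)
hs name _               = EText ("unknown function: " ++ name)
```

The VBA side would then only need to box its arguments into this one variant type before calling across, rather than generating a fresh source file per request.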

One nice feature to add would be for =hs_f(1,2) to work, just like an Excel function. I'm not sure if VBA is permitted to add primitive functions to Excel, or just add-ins. I'm also not sure if add-ins can dynamically change the available functions. If this were the case, then all entries in the Haskell source could become first-class Excel functions.

The final question is the issue of speed. Currently the implementation runs the string from the beginning each time, writing out fresh values for the changed cells. I think spawning an interpreter once and talking to it with each request would give higher performance. Once this solution was adopted, either GHCi or Hugs could be used, potentially allowing for a further speed-up. The author comments that this is hard to do using VB's Shell command, and indeed it is. However, it is possible to call standard Win32 system calls from VB, albeit with a great deal of work, which could get this working properly.

Conclusion

The paper presented a lot of good ideas; I think a bit of user interface polish could bring them to the fore. HsExcel could be a useful tool - I certainly have projects I could use it for. I suspect that it would be under a week's work for anyone already fluent in Haskell and (more importantly) Excel/VBA. Any volunteers?