Neil Mitchell's Blog (Haskell etc): April 2007

Monday, April 30, 2007

Phantom Types for Real Problems

As part of my work on Hoogle 4 I've been bashing my deferred binary library quite a lot. I encountered one bug which I hadn't considered, and was able to come up with a solution, using phantom types. I've read about phantom types before, but never actually put them into practice!

The idea of the deferred binary library is that writing instances should be as short as possible - they don't need to be massively high performance, and I wanted to see if I could do it with minimal boilerplate. Each instance needs to be able to both save and load a value - which means it needs to do some sort of pattern-match (to see which value it has), and have the constructor available (to create a new value). The shortest scheme I was able to come up with is:


data Foo = Foo String Bool | Bar Foo

instance BinaryDefer Foo where
    bothDefer = defer
        [\ ~(Foo a b) -> unit Foo << a << b
        ,\ ~(Bar a) -> unit Bar << a
        ]

The idea is that to save a value, you step through the list trying to match each constructor, using Control.Exception.catch to see if the value does not match. Once you have matched, you save an index, and the components. To load a value, you take the index and pick the correct construction. This time, you deliberately do not mention any of the fields, and by remaining lazy on the fields, the pattern match is not invoked. This solution took quite a lot of thought to arrive at, but does minimize the size of the instance declarations.

The instance relies crucially on the laziness/strictness of the functions - the load function must not examine any of its arguments (or you will pattern-match fail), the save function must examine at least one argument (to provoke pattern-match failure). After writing a description such as this, I am sure some users will have spotted the issue!


data Bool = False | True

instance BinaryDefer Bool where
    bothDefer = defer
        [\ ~(False) -> unit False
        ,\ ~(True) -> unit True
        ]

The problem here is 0-arity constructors, the load code works just fine, but the save code has no fields it can demand to ensure the correct constructor has been reached. I encountered this bug in Hoogle 4 - it is subtle, and in a pile of instance rules, these 0-arity cases can easily be overlooked. The solution is to introduce a new operator, (<<!), whose only purpose is to strictly demand something:


instance BinaryDefer Bool where
    bothDefer = defer
        [\ ~(x@False) -> unit False <<! x
        ,\ ~(x@True) -> unit True <<! x
        ]

This (<<!) operator is only required for 0-arity constructors, but is not harmful in other cases. Now the correct fix has been decided upon, we can fix the Hoogle instances. Unfortunately, we can still write the incorrect case, and will get no warning - a bug just waiting to happen. The solution to this is to use phantom types.

The whole point of a phantom type is that it doesn't actually reflect real data, but exists only at the type level, artificially restricting type signatures to ensure some particular property. In our case, the property we want to ensure is that "unit" is never used on its own, but either with (<<), (<<~) or (<<!). We can achieve this by giving unit a different type, even though in reality all operate on the same type of data:


data PendingNone
data PendingSome

unit :: a -> Pending a PendingNone
(<<) :: BinaryDefer a => Pending (a -> b) p -> a -> Pending b PendingSome
(<<~) :: BinaryDefer a => Pending (a -> b) p -> a -> Pending b PendingSome
(<<!) :: Pending a p -> a -> Pending a PendingSome

Now the type of unit is PendingNone, but the type of all other operations is PendingSome. By demanding that the list "defer" accepts is all PendingSome, we are able to raise a type error if the (<<!) is missed.

By using phantom types, a easy to make error can be guarded against statically. Plus, its not very hard to do!

Friday, April 27, 2007

Hoogle 4 type matching

I've done some more work on Hoogle 4, and can now create a database from a textual file, then perform text searches upon the database and show the results. It is basically a command line version of Hoogle restricted to text search only. A lot of the time went on debugging binarydefer, which has saved a lot of effort and meant that all the mistakes due to complex file prodding and seeking are localised in a library which can be reused throughout. Hoogle 4 is both a library and a program, and now I have a minimal API which is mostly free from any monads.

Now I get on to the fun bit, how to do type search. I've done type search in three entirely separate ways:

Hoogle 1: Alpha conversion with missing arguments
Hoogle 2: Unification
Hoogle 3: Edit distance

Of course, it wouldn't be a new version of Hoogle if I didn't have a new way to perform type searches :) The new type searches are based on six operations, the idea is that you start with a target type (what the user queried for), and transform the type into the source type (from the source code). Each transformation has a cost associated with it, and can be run in both directions. I finalised the six tranformation steps some time ago, and they have been floating around on my desk as a piece of paper - not a safe place for paper to live. As such, I've decided to write them up there, in the hope that Google won't loose them.

Each operation has an associated symbol, I've included these symbols in \tex as as well Unicode.

Alpha conversion \alpha (α): a --> b. The standard lambda calculus rule, but this time alpha conversion is not free.

Unboxing \box (□) : m a --> a. Given something in a box, take it out. For example Maybe a --> a would be an example.

Restriction (!): M --> a. Given something concrete like Bool, turn it into a free variable.

Aliasing (=): a --> alias(a). This makes use of the type keyword in Haskell, for example String --> [Char], because type String = [Char].

Context \Rightarrow (⇒): C a => a --> a. Drops the context on a variable, i.e. Eq a => a --> a.

Membership \in (∈): C M => M --> C a => a. For example, Bool --> Eq a => a.

There are some additional restrictions, and methods for applying these rules, but these are the core of the system.

Wednesday, April 25, 2007

More Hoogle progress

I haven't done much on Hoogle 4 in quite a while, and since I keep telling people that various bugs will be "fixed in Hoogle 4", it seems only polite to fix those bugs in the near future. I decided that this evening I would focus on Hoogle 4, since I was slowly going insane from writing academic papers. I'm now just about to go to bed, so thought I would summarise what I did.

One of the things that has been slowing me down on Hoogle is the fact that I rarely get a chance to work on it, and when I do that means I have to refamiliarise myself with the code base. Each time I do this, I forget something, forget what I was working for, and generally end up slowing down development. It also means that working on Hoogle for "an odd half hour" is not practical, it takes at least a couple of hours to get up to speed. In order to solve this, I've done some diagrams of the module structure, and the main Hoogle database. They may not mean much to anyone else, but I hope they are enough to keep my mind fresh with where things go.

As part of documenting the module structure, it also gives a chance to reflect on where the module structure is a bit unclear, and modify as necessary. Previously the module structure had DataBase and Item each being dependent on each other - with tricks like having an extra field in Item which was of a parametrised type, i.e:

data Item a = Item {name :: String, typeSig :: TypeSig, other :: a}

Fortunately now all that is gone, and the structure is a lot more logical.

The other thing I've been doing is moving to the Deferred Binary library. Part of the reason for Hoogle 4 becoming bogged down is the creation of a central database that is loaded on demand, and contains tries etc to perform efficient searches. Unfortunately, this loading on demand aspect had infected the entire code, resulting in a big IO Monad mess. It was slightly more pleasant than coding in C, but all the low-level thoughts had seeped through, requiring careful though, slow progress and frequent mistakes.

To take one example, the Hoogle.DataBase.Texts module, which is responsible for performing all textual searches and storing the lazily-loaded trie. Compare the old version and the current version. The code has shrunk from 186 lines to 65 - of which only about 30 are real code. The IO Monad is nowhere to be seen - everything is pure. The really low-level functions such as sizeTree and layoutTree have gone entirely.

I'm not sure when I will next be able to get to Hoogle 4, but next time I hope to start working on it faster, and that the cleaner code makes for happier coding!

Note: I have started displaying adverts on this blog because it was only a button click away (very impressive integration by Google), it seemed like a fun thing to try and that the adverts in Gmail are sometimes quite helpful. I have deliberately picked to not show too many adverts, placed them off to the side where they won't interfere with reading and tried to keep it minimal - but if they are annoying please let me know. I am going to donate whatever money (if any) from this to charity - but please don't take that as an incentive to artificially click on adverts you aren't interested in.

Monday, April 23, 2007

Boilerplate considered harmful

At the moment I'm working on a boilerplate removal for Haskell, which is faster (runtime), shorter (code), more type safe and requires fewer extensions than Scrap Your Boilerplate (SYB). However, since I haven't finished and released a stable version, I can't really recommend people use that. The reason I started working on this is because I was unaware of SYB when I started. Last week I also introduced someone to SYB, who had done quite a bit of Haskell programming, but had not stumbled across SYB. As a result, I think it needs a bit more attention - SYB is one of the strengths of Haskell!

Disadvantages

Before saying how great SYB is, its important to point out the things that make it not so great:

Only implemented in GHC - sorry to the poor Hugs users
Requires rank-2 types, which means its not actually Haskell 98
Occassionally the rank-2-ness infects the code you write, with unfortunate error messages (although this is not that common)

A data structure

Before showing some operations, I'm going to first introduce a data structure on which we can imagine operations are performed. I don't like the example from the SYB benchmark - it feels like an XML file (as is the stated intention), which means that the logic behind it is a bit disturbed. So instead I'll pick a data type like an imperative programming language:

{-# OPTIONS_GHC -fglasgow-exts #-}
import Data.Generics

data Expr = Var String | Lit Int | Call String [Expr] deriving (Data, Typeable)
data Stmt = While Expr [Stmt] | Assign String Expr | Sequence [Stmt] deriving (Data,Typeable)

We define the data type as normal, adding deriving for Data and Typeable - the two key SYB types. We also add an import and a flag, just to get the GHC machinery working for the derivings.

Queries

So lets imagine you have to get a list of all literals. In SYB this is easy:

extractLits :: Data a => a -> [Int]
extractLits = everything (++) ([] `mkQ` f)
where f (Lit x) = [x] ; f _ = []

Wow, easy! This function will operate on anything which has a Data instance, so you can run it on an Expr, Stmt, [Stmt], [Either Stmt Expr] - the choice is yours. For the purposes of a short introduction, I'd recommend treating all the bits except the "f" as just something you write - read the full SYB paper to get all the details of what everything can be used for.

Traversals

Now lets negate all the literals, we have:

negateLits :: Data a => a -> a
negateLits = everywhere (mkT f)
where f (Lit x) = Lit (negate x) ; f x = x

Again, its pretty easy. And once again, consider all apart from the "f" as just something you write.

The gains in code reduction that can be made with SYB are quite substantial, and by removing the boilerplate you get code which can be reused more easily. Boilerplate code is bad, and should be removed where necessary.

Thursday, April 19, 2007

Visual Basic, the story

Today on the Haskell mailing list, someone confused VBScript and VBA. Its not really surprising, the "Visual Basic" or "VB" name has been attached to at least 3 entirely different programming languages! This is an attempt to clear up the confusion. My first programming language (after .bat) was Visual Basic, and I loved it. I've worked on Excel/Word programs using VB, written applications for money in VB, and lots more besides - its a great language. Often people criticize VB for being "under powered" or a "beginners language" - those people are wrong - but that's not something I'm going to cover here :-) [Technical note: I may have got some of this wrong, I was a user of the language - not an implementer]

Visual Basic 3

We'll start the story at Visual Basic 3, because that's the first version I ever used. My Sixth Form would copy it to floppy disk if you bought in 3, which is basically school sponsored piracy. Anyway, this language was an imperative language, with build in support for graphical interfaces.

You can spot a VB 3 program a mile away, they look clunky and alas VB made it really easy to change the background colour of a form. The ability to easily change the colours made it a hotbed for garish colours - does you inhouse application have pastel shades of red? If so its either a C programmer who went to a lot of work, or more likely, a VB programmer without any sense of style. I've seen these apps running on point-of-sale terminals in some shops.

Visual Basic 5/6

I'm going to lump these two versions into one, there were some changes, but not many. Continuing the same trend as VB 3, but moving towards being an object orientated language. This language had a very strong COM interoperation - all VB objects were COM objects, all COM objects could be used in VB. The COM ability made it ideal for many things, and opened up lots of easy to use libraries. All windows become objects, and some of the clunky bits in VB 3 got depreciated in favour of the newer methods, but most programs still worked exactly as before.

Alas VB 6 still allows the background colours of a form to be set, but if you don't (please don't!) the program looks very Windows native - much more than Java or even C#.

Visual Basic for Applications (VBA)

At the time of VB 5, Microsoft moved over to using VB as the language of scripting in Office, replacing WordScript etc. The VBA language is VB, but rather than have the VB libraries, it has the Office libraries. The languages are identical in general use.

VBScript

Microsoft also decided they needed a JavaScript (or JScript as they call it, or ECMAScript as dull people call it) competitor. They decided to repurpose VB, changing it slightly to give it a more scripting feel. If you use IE, you can run VBScripts in the browser. If you like email viruses you can run the .vbs files you can your inbox. If you have active scripting installed, you can run .vbs files at the command line.

But please, don't use VBScript. JavaScript is a much nicer language, and works everywhere VBScript does and in more places besides. The language VBScript should quietly go away and die. There is one exception: ASP only provides one method (uploading a binary file) for VBScript but not JavaScript. As a result, I do have a website written in VBScript which I have to maintain.

VB.NET

And finally we get to the newest edition, or ¬VB.NET as I will refer to it. When changing to C# with .NET, Microsoft decided that shunning all the VB developers would be a bad idea, so instead they shunned them and wrote a syntax wrapper round C#. The ¬VB.NET language is a perfectly fine language, but its not VB. Things like late binding have been removed, the syntax has been made more regular, the object facilities have totally changed etc. If you have a VB project of any mild complexity, then you'll not get very far with the automatic converter.

If you are a VB developer looking to move towards the new .NET framework, I'd look at both ¬VB.NET and C#, before deciding which one to pick. The differences aren't that major, but at the time I was learning (.NET 1.0 beta) the ¬VB.NET was clearly the unloved younger brother of C# - the examples were all C#, and it was clearly the favoured language. Now both languages have evolved so they are not as similar, but neither is worthy of the VB name.

Me and VB

VB was my first language, and my "first choice" language for years. After learning C I initially poured scorn on VB for not being a "real language", then I realised I was an idiot and that VB is a lot better for most application programming than C, and you can always interface to a C .dll if you must. Now Haskell and C# (for GUI's) have become my standard languages - I haven't started up VB 6 for quite a while. I've got a few programs written in VB 6 around that I technically maintain, but none that I've had any reports on in ages, and none that I can imagine working on. I've got at least 50 programs in my VB directory, but I've forgotten quite where I put my VB directory - its certainly not been opened in a while.

Even though I may not use stand alone VB 6 anymore, I still maintain and actively work on some VBA solutions for accountants in Excel, and a couple of website templating systems using VBA in Word. I also develop and maintain a corporate website using VBScript, which requires new features adding at least once a year. VB is not yet dead, but its definitely going that way.

Tuesday, April 17, 2007

Coding Nirvana

Today I experienced the perfection of coding, and I never want to go back to the old hacking method! My FilePath library is nearly ready for release, but just before it went out I got a bug report.

The FilePath library is based on the idea that all functions must satisfy a list of properties, which are documented in the interface. I use QuickCheck to automatically test that these properties hold. This leads to the great situation where there are only two classes of bug:

A property is wrong
A property is missing

The bug report I received was of the second kind - a property was missing. The next step is to decide if this property should hold or not, and in this case the reporter was perfectly right, this property should hold. Next step is to decide if a more general version of this property should hold - and in discussion with the reporter a more generalised version was formulated. By working at the level of properties the code is no longer the hard part, and while collaborative coding on a single function isn't that useful, collaborative property discussion works really well.

Once the properties have been decided upon, and having left a night to see if anyone had any other thoughts on the properties, the coding can begin. This then makes the process of coding much less hacky, and much more precise:

Write out the properties
Write out code to implement them
If the tests fail, you have a counter example - go back to 2
Done

By the time you get to 4, the code meets the properties. You have already done 1000's of tests (more than 10,000 in FilePath) and can have a fairly strong degree of confidence in the work you have produced. The bug reporter has already agreed with the changes you've made, even before you've made them.

I have never done a library before so focused on properties, but it has had amazing advantages. The advantage of having correct and reliable code may be obvious - but in practice the amount of confidence in the code surprised me. By far the bigger benefit has been collaboration - the very high declarative nature of properties makes them something well suited to "group thought".

Monday, April 16, 2007

No Termination Checking

Originally my PhD had an element in it on Termination Checking, and since someone asked on IRC how that bit was going, I thought I'd let everyone know that my PhD now doesn't include termination checking - there are several reasons why:

Related Work

Not long before I started working on the termination thing, this piece of work came out: Automated Termination Analysis for Haskell: From Term Rewriting to Programming Languages. That does a really good job. The authors behind that paper have been working on term rewriting termination for a really long time, and have built up a very powerful framework. Their encoding of Haskell into their framework means they have a very powerful tool - based on years of experience. I don't think I could hope to come anywhere near that.

Less Motivation

I want the tool I produce to solve a practical problem. I want it to be something that people can use on a daily basis. In Haskell termination is rarely a problem, and if it is, usually the problem occurs in all circumstances - not just the corner cases (of course there are exceptions, but this is my overall impression). A lot of termination issues relate to variables being bound unexpectedly, lazy evaluation and black holes - and these will occur on any sufficient test input. Once non-termination has occured, there is no point in running a static checker - use a dynamic tool such as Hat-Nonterm or Black-Hat - both specially designed to pinpoint the cause. These factors mean that my motivation to work on a termination checker is less - I don't see it as fulfiling a massive practical need in Haskell. (I know some will disagree, but this is the reason for me not doing it, not for others not doing it - motivation is a personal thing)

Hard to Proove

If you have a static checker, there are two main reasons for using it - to obtain a proof of safety and to find real bugs. It is most useful when the issues that are spotted can be worked around in some way, so that even if they were not a valid problem, afterwards you can obtain a statically checked version that you can gurantee will terminate for all inputs. Pattern-match checking has this property, by completing the patterns or performing extra (potentially redundant) checks, you can be sure all patterns are safe. From my understanding, termination checking does not - to introduce a decreasing variable into an algorithm that does not already have one is a lot of work, and potentially requires a rethink of the entire algorithm. For this reason, even with the help of a willing programmer, I see termination checking as something that will obtain less successful results than I would like.

Thursday, April 12, 2007

SOC: GuiHaskell

The Google Summer of Code allocations are now out, and GuiHaskell got one of the slots from Haskell.org. There were quite a few students applying for GuiHaskell - I thought the students who applied were very good, and any of them would have done a wonderful job. In the end Asumu Takikawa was chosen, and I wish him well with the project.

The link with his application is available here.

The basic idea as I see the project is to start from the existing GuiHaskell base and evolving so that by the end of the summer it is a very solid alternative to WinHugs, but with several key advantages:

Written in Haskell using Gtk2Hs, not C with the Win32 API
Cross platform: Windows, Linux and Mac at least
Cross compiler: Hugs, GHC and GHCi to begin with
Integration with Haddock/Cabal/Hoogle etc.

The initial aim is not to make it an IDE, but to do everything an IDE might do apart from having an integrated editor. Depending on the progress, this may be something that is tackled this summer, or perhaps not.

Exploding Patterns in FilePath

[ICFP referees please stop reading now]

After what feels like forever, finally my FilePath library has hit 1.0. It took over 250 email messages debating FilePath design choices from Haskell 1.0 onwards, but finally we were able to come to an agreement about what the libary should look like. I'd like to thank everyone who participated in the discussion, used the library, and encouraged me to keep pushing it.

I've previously blogged about my QuickCheck testing of the FilePath library, something which has been essential to coming up with a coherent design - and which lead to various decisions being made for me. Now I'd like to talk about the pattern match story. For the purposes of this post, I'm going to term "exploding patterns" to mean pattern-match errors that can be provoked by appropriately chosen input.

After doing all the QuickCheck tests, you might think I was quite confident of avoiding exploding patterns, but I was not. The QuickCheck tests all work with lists of 25 characters - perhaps the empty list crashes the function? Perhaps only some really rare input? After all that loose random testing, the time came for some serious proof!

Taking a look at the code, I have used 5 incomplete functions (fromJust, init, head, last, tail) and one explicit non-exhaustive pattern-match. Can any of these explode? Turning the code over to my pattern-match checker, it quickly (2 seconds, 2Mb later) informs me that it can prove none of the incomplete functions crash, but it can't prove the explicit pattern match. So lets take a look:

f x | isWindows =
| isPosix = ....

Oh dear, it won't crash currently (isPosix = not isWindows), but you can certainly see that these patterns aren't going to stay complete if we add another operating system. Quick as a flash, we replace "isPosix" with "otherwise", and run the checker again. Yay, no exploding patterns remain.

So there we have it, in under 5 minutes, I was able to proove my entire library to be safe. I can still use all the fun functions like head, but in a safe manner. It turns out the library was always safe, but one minor modification to improve the code, and I can be sure that things won't go wrong - the released 1.0 has this modification. I have the guarantees of safety, without much work at all. Not only that, but as the library evolves, I'm only 2 seconds away (literally!) from pattern-match safety, compared to my QuickCheck test which takes around a minute, this is a very cheap verification.

So how come that within an hour of releasing 1.0 someone had already replied with a bug report about a crashing pattern? The answer is simple, Setup.hs in Cabal has code in it which could (at the time of release, but not anymore) cause a pattern explosion. I didn't write, or validate, the Setup.hs script - but it just goes to show that pattern-match errors can easily slip through!

Wednesday, April 11, 2007

Redesigned Website

I've redesigned my main academic website: http://www-users.cs.york.ac.uk/~ndm/

Why did I decide to redesign my web page? Because I realised that:

I was being put off releasing code because adding it to my website was a pain to add to
I wanted tags
The number of projects had exceeded the vertical space in some browsers
I wanted better cross-site linking - for example all downloads should appear on one place

The site was previously written in PHP as that is the language supported on the server, but I don't know PHP, and from what I do know, I hate it. The solution was obviously to rewrite the site using Haskell - but how.

I initially looked towards a web framework, but given I can't install anything on the server, that makes it a bit hard. The next choice was to write a preprocessor that takes the source code and generates the end result HTML. This was what I decided to do.

I now write each page in HTML, but in a modified form of HTML. For example, the code for the Catch page is at: http://www-users.cs.york.ac.uk/~ndm/catch/catch.html - do view source to see the original. The start is a list of properties, which include tagging information, slides/papers etc and names. The rest of the document is simply the content.

The preprocessor the gives an appropriate beginning and end to each document, fills in some magic tags I have defined, and outputs a complete document. All slides/tags etc are collated and placed in central files - for example http://www-users.cs.york.ac.uk/~ndm/downloads/

The code weighs in at 350 lines, and can be found at http://www-users.cs.york.ac.uk/~ndm/Main.hs - its overly long and could do with some proper abstraction being applied - I just hacked it together as required. However, its fairly simple and generates a nice result without too much effort. Adding a new page is now as simple as creating a new file, and all the indexes are automatically maintained.

I am not sure if this approach will suit everyone, or even if it will suit anyone but me, but I think this is a very nice way to create a site which is easily extendable and very consistent, while keeping the overhead on each page down.

I am sure there is a common tool underlying this, just waiting to get out, but unless anyone else expresses an interest in this approach, the tool will remain buried in my specific needs.

Friday, April 06, 2007

A beautiful proof

I got told a rather beautiful proof the other day, and wanted to write it down.

Question: Given any two irrational numbers x and y, does x^y ever evaluate to a rational number?

Answer: Yes.

Proof: Given the definitions:

r = 2^0.5 (square root of 2, known to be irrational)
z = r^r

Either:

z is rational. Set x=r, y=r and you have a proof.
z is irrational. Set x=z, y=r. x^y = z^r = (r^r)^r = r^(r*r) = r^2 = 2, which is rational.

This proof does not provide a particular value of x and y, but merely shows there must be one.

Neil Mitchell's Blog (Haskell etc)