Sunday, March 30, 2008

Toddler's play with HTML in Haskell

I just read a blog article entitled Kid's play with HTML in Haskell, where the author extracts some information from an HTML document, using the Haskell XML Toolbox. I have an alternative XML/HTML library, TagSoup, so I decided to implement their problem with my library.

The Problem

Given an HTML file, extract all hyperlinks to mp3 files.

In TagSoup

[mp3 | TagOpen "a" atts <- parseTags txt
, ("href",mp3) <- atts
, takeExtension mp3 == ".mp3"]

The code is a list comprehension. The first line says use TagSoup to parse the text, and pick all "a" links. The second line says pick all "href" attributes from the tag you matched. The final line uses the FilePath library to check the extension is mp3.

A Complete Program

The above fragment is all the TagSoup logic, but to match exact the interface to the original code, we can wrap it up as so:

import System.FilePath
import System.Environment
import Text.HTML.TagSoup

main = do
[src] <- getArgs
txt <- readFile src
mapM_ putStrLn [mp3 | TagOpen "a" atts <- parseTags txt
, ("href",mp3) <- atts
, takeExtension mp3 == ".mp3"]


If you have a desire to quickly get a bit of information out of some XML/HTML page, TagSoup may be the answer. It isn't intended to be a complete HTML framework, but it does nicely optimise fairly common patterns of use.

Tuesday, March 11, 2008

Poor Ada Error Message

I have been demonstrating on the York University "Algorithms and Data Structures" course for 4 years now. As part of the course, first year students learn Ada. A lot of the error messages are really bad - but over time I've created a mental mapping between the message and the cause. I am now fairly fluent at recognising what mistake a student has made, given the exercise they are attempting and the error message. But yesterday I encountered a brand new misleading error message.

The error message was:

bad.adb:12:22: actual for "N" must be a variable

As always, the most useful thing in the error message is the line number. I read enough of the error message to check whether its a parse error, type error or something else, then head for the line mentioned. (I follow this same tactic in all languages, not just Ada.)

12: Add_Cell(I, Next(N));

Nothing obviously wrong about this statement, so I read the error message. It seems to want N to be a variable. But I already know that N is a variable, or at the very least a parameter, so this condition seems to be met.

Next step is to head to the definitions of Next and Add_Cell, to see if they can shed some light on the situation.

function Next(N: List) return List is ...
procedure Add_Cell(I: Integer, N: in out List) is ...

The in out in Add_Cell can be read as "pass as a pointer". Aha, maybe the error message is complaining that the second argument to Add_Cell can't be made a pointer, as its a return value from a function. That would explain it, and indeed, that turned out to be the cause of the problem. But back to the error message, what was it trying to tell us?

bad.adb:12:22: actual for "N" must be a variable
12: Add_Cell(I, Next(N));

When the error message refers to N it isn't talking about the variable N I can see, but the second argument of Add_Cell, which is also called N. If the function being calling was in a separate library, it would have been even harder to understand. A more helpful error message might have been:

bad.adb:12:22: expression passed as the second argument to Add_Cell must be a variable
Found: Next(N)
Expected: A variable
Reason: Second argument of Add_Cell is declared "in out"

In general compiler error messages should be in terms of the line where the error resides, not requiring a large amount of global knowledge. The error can be resolved, but without help from the message. All compilers have bad error messages in some circumstances, but this one seems almost malicious!

Monday, March 10, 2008

Sorting At Speed

Sorting is currently a hot topic within the the Haskell community. Christopher brought it up in a recent thread on the mailing list, and this weekend I ended up spending several hours looking at sort routines.

I was browsing through the Yhc standard libraries, as one does on the weekend, and was drawn to Yhc's sort function. It had some undesirable characteristics for one of the projects I was working on, so I wondered if other Haskell systems used different implementations. I checked GHC, and discovered it their sort was different. In general, when Yhc and GHC have different implementations of a standard library function, the GHC one is better tuned for performance. I decided to replace the Yhc sort function with the GHC one, but before doing so, thought a quick performance test was in order. So I came up with something simple:

test = do
src <- readFile "Sort.hs"
print $ ordered $ sort $ sort $ reverse $ sort src

The file "sort.txt" was simply the source code to the program. The code sorts the contents of this file, then reverses it, sorts it and sorts it again. This means that the program performs one sort over semi-random data, one over reverse-ordered data and one over ordered-data. These are some fairly standard cases that should be checked. This test is not a comprehensive benchmark, but a nice quick indicator.

I ran the Yhc sort function against the GHC version, and was shocked to find that the Yhc code was twice as fast. I ran the benchmark under Yhc, GHC and Hugs (using reduction count in Hugs), and in all cases the performance was doubled. I was not expecting this result!

The code for the GHC sort is:

mergesort :: (a -> a -> Ordering) -> [a] -> [a]
mergesort cmp = mergesort' cmp . map (:[])

mergesort' :: (a -> a -> Ordering) -> [[a]] -> [a]
mergesort' cmp [] = []
mergesort' cmp [xs] = xs
mergesort' cmp xss = mergesort' cmp (merge_pairs cmp xss)

merge_pairs :: (a -> a -> Ordering) -> [[a]] -> [[a]]
merge_pairs cmp [] = []
merge_pairs cmp [xs] = [xs]
merge_pairs cmp (xs:ys:xss) = merge cmp xs ys : merge_pairs cmp xss

merge :: (a -> a -> Ordering) -> [a] -> [a] -> [a]
merge cmp [] ys = ys
merge cmp xs [] = xs
merge cmp (x:xs) (y:ys)
= case x `cmp` y of
GT -> y : merge cmp (x:xs) ys
_ -> x : merge cmp xs (y:ys)

The function works by splitting the list into one element lists, resulting in each basic list being ordered. These lists are then merged in pairs until a single list is left. For example, given the input "sort":

"s" "o" "r" "t"
"os" "rt"

We first split each character into its own list, then merge adjacent pairs. This code corresponds to the standard merge sort. But instead of making each initial list a single element, we could use sequences of increasing elements, for example using the risers function:

risers :: Ord a => [a] -> [[a]]
risers [] = []
risers [x] = [[x]]
risers (x:y:etc) = if x <= y then (x:s):ss else [x]:(s:ss)
where (s:ss) = risers (y:etc)

Now, if we apply risers "sort" we get ["s","ort"]. We can now follow the same merge procedure as before:

"s" "ort"

Instead of doing 3 merges, we have done only 1. Given the input "abcd" the effect would have been even more dramatic. We can refine this scheme further, by detecting both ascending and descending chains of elements in the initial list. This technique is used by Yhc, and is based on code originally written by Lennart Augustsson. Knowing the original source of the code, my shock at the performance benefits offered by the Yhc version have decreased substantially.

The GHC sort function should probably be replaced by the one from the Yhc libraries. This would offer increased performance, especially in the case of ordered or reverse-ordered lists. The asymptotic complexity of the two sorts means that there must exist some value of n such that sort [1..n] runs faster in Yhc than GHC (assuming sufficient stack/heap for both). I wonder whether Int32 is capable of expressing such a value...

A side note: I have been playing with the risers function for several years. I've used it as an example of pattern match checking, both specialised to Int and on a general Ord class. I've used it for supercompiling. It's appeared in blog posts, a TFP paper (Mitchell+Runciman 2007), a Haskell Workshop paper (Xu 2006) etc. I originally had this example suggested from a functional programming exam paper, but only today at lunch did I discover its true origins. The risers function had originally been the first-step in a merge sort!

Tuesday, March 04, 2008

Lazy Evaluation: Strict vs. Speculative

For the last few days I have been thinking about how to write a low-level program optimiser, based on the ideas from Supero. Supero works at the level of a lazy Core expressions, but actual hardware works on a sequence of strict instructions. The possible idea is to translate the lazy expressions to strict sequences, then borrow the ideas from supercompilation once more. In particular I have been looking at the GRIN approach, which defines such a set of instructions.

The GRIN work is very clever, and has many ideas that I would like to reuse. However, the one aspect that gave me slight concern is the complexity. A GRIN program requires the use of several analysis passes, and many many transformation rules. While this approach is perfectly acceptable, one of the goals of the Supero work is to make the optimisation process simpler -- comprising of a few simple but powerful rules.

I will first explain how strictness works, then how my speculative approach works. Readers who already know about unboxing are encouraged to skip to the speculative section.


When doing low-level compilation, one of the most important stages is strictness analysis, and the associated unboxing. To take the example of the factorial function in Haskell:

factorial :: Int -> Int
factorial n = if n > 0 then n * factorial (n-1) else 1

Here it is easy to see that the factorial function always evaluates n. We can also use our knowledge of the definition of Int:

data Int = Int# I#

Where I# is an actual machine integer (possibly stored in a register), and Int# is a lazy box surrounding it. Since we know that factorial will always unwrap our n, we can pass the n around without the Int# box. I have made all the conversions from I# to Int explicit using an Int#, but have left all the unboxings implicit. The operators ># etc. are simply unboxed and strict variants of the standard operators.

factorial# :: Int# -> Int
factorial# n# = if n# ># 0 then n# *# factorial (Int# n# - 1) else 1

Also, since we know factorial is strict in its first argument, we can evaluate the first argument to the recursive call strictly. Applying all these optimisations can now write:

factorial# :: Int# -> Int
factorial# n# = if n# ># 0 then n# *# factorial (n# -# 1) else 1

We have removed the explicit boxing in the recursive call, and work entirely with unboxed integers. Now factorial is entirely strict. We can even write a wrapper around our strict version, to provide a lazy interface matching the original.

factorial :: Int -> Int
factorial n = factorial# n#

I have used n# to denote the unboxing of n. Now factorial looks like it did before, but operates much faster, on unboxed integers.


I would like to not include a strictness analyser in my optimiser, or if it is included, have it be the result of a series of transformations -- without explicit "stop and analyse" then "use the results" stages. As part of my thoughts on this, I was trying to consider how to optimise factorial without invoking the strictness analyser.

The speculative transformation I have defined first generates factorial# - I have left out the details of why it decides to.

factorial :: Int -> Int
factorial n = if n > 0 then n * factorial (n-1) else 1

factorial# :: Int# -> Int
factorial# n# = if n# ># 0 then n# *# factorial (Int# n# - 1) else 1

This step is entirely safe - we have defined factorial#, but we have not written a wrapper that invokes it, even in the recursive case. The factorial# function is equivalent to factorial if the initial argument was evaluated. We have transformed factorial# using only local knowledge, at the point. We can also transform factorial, replacing any uses of n which are guaranteed to come after n is evaluated, with (Int# n#). This transformation is merely reusing the knowledge we have gained unwrapping n:

factorial n = if n > 0 then Int# n# * factorial (Int# n# - 1) else 1

Now we promote any primitive operations on only unboxed values. Given (-), it is cheaper to evalute the subtraction than to store a lazy thunk to the function.

factorial n = if n > 0 then Int# n# * factorial (Int# (n# -# 1)) else 1

factorial# n# = if n# ># 0 then n# *# factorial (Int# (n# -# 1)) else 1

We can now use our knowledge that if we know an argument to a function is already evaluated, we can call the strict variant (this corresponds closely to constructor specialisation):

factorial n = if n > 0 then n# *# factorial# (n# -# 1) else 1

factorial# n# = if n# ># 0 then n# *# factorial# (n# -# 1) else 1

We can also replace the * in factorial with *# as we know we will have to evaluate the result of a function. Now we have ended up with a fast inner loop, operating only on unboxed integers. We have not required strictness information to make any transformation.

One way of viewing the difference between strictness and this transformation is the flow of information. In strictness, the caller is informed that a particular argument will be evaluated. In speculative, the callee informs the caller that an argument has already been evaluated. These two concepts are not the same, and while they overlap, there are instances where they differ considerably.

Strict vs. Speculative

Consider the following example:

strict :: Int -> Int
strict x = x `seq` lazy x (x-1) (x+1)

lazy :: Int -> Int -> Int -> Int
lazy a b c = if a == 0 then b else c

Here the lazy function is strict in a, but not either of b or c. A strictness analyser would generate a variant of lazy with only the first argument unboxed. In contrast the speculative variant will determine that x-1 and x+1 should be evaluated, and pass unboxed values in all arguments of lazy, even though lazy may not evaluate b or c.

To see this behaviour in GHC, it helps to make lazy recursive:

module Temp where

strict :: Int -> Int
strict x = x `seq` lazy x (x+1) (x-1)

lazy :: Int -> Int -> Int -> Int
lazy a b c = if a == 0 then lazy b b b else c

Now run with the options ghc Temp.hs -c -O2 -ddump-simpl, and you will see the lazy variant has type lazy :: Int# -> Int -> Int -> Int.

These thoughts are still very preliminary, and there are a number of unanswered questions:

  • What is the overlap between strict and speculative?

  • Can both variants be combined? (almost certainly yes)

  • Is speculative really simpler?

  • Is speculative sufficient?

  • What are the performance benefits of speculative?