Sunday, April 25, 2010

Dangerous Primes - Why Uniplate Doesn't Contain transform'

The Uniplate library contains many traversal operations, some based on functions in SYB (Scrap Your Boilerplate). SYB provides everywhere for bottom-up traversals and everywhere' for top-down traversals. Uniplate provides transform for bottom-up traversals, but has no operation similar to everywhere'. This article explains why I didn't include a transform' operation, and why I believe that most uses of everywhere' are probably incorrect.

The transform operation applies a function to every node in a tree, starting at the bottom and working upwards. To give a simple example:

data Exp = Lit Int | Add Exp Exp | Mul Exp Exp

f (Mul (Lit 1) x) = x
f (Mul (Lit 3) x) = Add x (Add x x)
f x = x

Calling transform f on the input 3 * (1 * 2) gives 2 + (2 + 2). We can write transform in terms of the Uniplate operation descend, which applies a function to every child node:

transform f x = f (descend (transform f) x)

On every application of f the argument always consists of a root node which has not yet been processed, along with child nodes that have been processed. My thesis explains how we can guarantee a transform reaches a fixed point, by calling f again before every constructor on the RHS of any clause:

f (Mul (Lit 1) x) = x
f (Mul (Lit 3) x) = f (Add x (f (Add x x)))
f x = x

Now transform f is guaranteed to reach a fixed point. The transform operation is predictable, and naturally defines bottom-up transformations matching the users intention. Unfortunately, the ordering and predictability of transform' is significantly more subtle. We can easily define transform':

transform' f x = descend (transform' f) (f x)

Here the transformation is applied to every child of the result of f. With transform every node is processed exactly once, but with transform' some nodes are processed multiple times, and some are not processed at all. The first clause of f, which returns x, does not result in the root of x being processed. Similarly, our second cause returns two levels of constructor, causing the inner Add to be both generated and then processed.

When people look at transform' the intuitive feeling tends to be that all the variables on the RHS will be processed (i.e. x), which in many cases mostly matches the behaviour of transform'. Being mostly correct means that many tests work, but certain cases fail - with our function f, the first example works, but 1 * (1 * 1) results in 1 * 1. The original version of Uniplate contained transform', and I spent an entire day tracking down a bug whose cause turned out to be a function whose RHS was just a variable, much like the first clause of f.

Before describing the solution to top-down transformations, it's interesting to first explore where top-down transformations are necessary. I have identified two cases, but there many be more. Firstly, top-down transformations are useful when one LHS is contained within another LHS, and you wish to favour the larger LHS. For example:

f (Mul (Add _ _) _) = ...
f (Add _ _) = Mul ...

Here the second LHS is contained within the first. If we perform a bottom-up transformation then the inner Add expression will be transformed to Mul, and the first clause will never match. Changing to a top-down transformation allows the larger rule to match first.

Secondly, top-down transformations are useful when some information about ancestors is accumulated as you proceed downwards. The typical example is a language with let expressions which builds up a set of bindings as it proceeds downwards, these bindings then affect which transformations are made. Sadly, transform' cannot express such a transformation.

The solution to both problems is to use the descend operation, and explicitly control the recursive step. We can rewrite the original example using descend:

f (Mul (Lit 1) x) = f x
f (Mul (Lit 3) x) = Add (f x) (Add (f x) (f x))
f x = descend f x

Here we explicitly call f to continue the transformation. The intuition that all variables are transformed is now stated explicitly. In this particular example we can also common-up the three subexpressions f x in the second clause, giving a more efficient transformation. If needed, we can add an extra argument to f to pass down information from the ancestors.

After experience with Uniplate I decided that using transform'/everywhere' correctly was difficult. I looked at all my uses of transform' and found that a few of them had subtle bugs, and in most cases using transform would have done exactly the same job. I looked at all the code on Hackage (several years ago) and found only six uses of everywhere', all of which could be replaced with everywhere without changing the meaning. I consider top-down transformations in the style of everywhere' to be dangerous, and strongly recommend using operations like descend instead.

Upcoming Workshops

Are you interested in either generics programming or Haskell? Are you going to ICFP 2010? Why not:

Submit a paper to the Workshop on Generic Programming - Generic programming is about making programs more adaptable by making them more general. This workshop brings together leading researchers and practitioners in generic programming from around the world, and features papers capturing the state of the art in this important area.

Offer a talk at the Haskell Implementors Workshop - an informal gathering of people involved in the design and development of Haskell implementations, tools, libraries, and supporting infrastructure.

I'm on the program committee for both, and look forward to receiving your contributions.

Wednesday, April 07, 2010

File Recovery with Haskell

Haskell has long been my favoured scripting language, and in this post I thought I'd share one of my more IO heavy scripts. I have an external hard drive, that due to regular dropping, is somewhat unreliable. I have a 1Gb file on this drive, which I'd like to copy, but is partly corrupted. I'd like to copy as much as I can.

In the past I've used JFileRecovery, which I thoroughly recommend. The basic algorithm is that it copies the file in chunks, and if a chunk copy exceeds a timeout it is discarded. It has a nice graphical interface, and some basic control over timeout and block sizes. Unfortunately, JFileRecovery didn't work for this file - it has three basic problems:

  1. The timeout sometimes fails to stop the IO, causing the program to hang.

  2. If the copy takes too long, it sometimes gives up before the end of the file.

  3. If the block size is too small it takes forever, if it is too large it drops large parts of the file.

To recover my file I needed something better, so wrote a quick script in Haskell. The basic algorithm is to copy the the file in 10Mb chunks. If any chunk fails to copy, I split the chunk and retry it after all other pending chunks. The result is that the file is complete after the first pass, but the program then goes back and recovers more information where it can. I can terminate the program at any point with a working file, but waiting longer will probably recover more of the file.

I have included the script at the bottom of this post. I ran this script from GHCi, but am not going to turn it in to a proper program. If someone does wish to build on this script please do so (I hereby place this code in the public domain, or if that is not possible then under the ∀ n . BSD n licenses).

The script took about 15 minutes to write, and makes use of exceptions and file handles - not the kind of program traditionally associated with Haskell. A lot of hard work has been spent polishing the GHC runtime, and the Haskell libraries (bytestring, exceptions). Now this work has been done, slotting together reasonably complex scripts is simple.

{-# LANGUAGE ScopedTypeVariables #-}

import Data.ByteString(hGet, hPut)
import System.IO
import System.Environment
import Control.Monad
import Control.Exception

src = "file on dodgy drive (source)"
dest = "file on safe drive (destination)"

main :: IO ()
main =
withBinaryFile src ReadMode $ \hSrc ->
withBinaryFile dest WriteMode $ \hDest -> do
nSrc <- hFileSize hSrc
nDest <- hFileSize hDest
when (nSrc /= nDest) $ hSetFileSize hDest nSrc
copy hSrc hDest $ split start (0,nSrc)

copy :: Handle -> Handle -> [(Integer,Integer)] -> IO ()
copy hSrc hDest [] = return ()
copy hSrc hDest chunks = do
putStrLn $ "Copying " ++ show (length chunks) ++ " of at most " ++ show (snd $ head chunks)
chunks <- forM chunks $ \(from,len) -> do
res <- Control.Exception.try $ do
hSeek hSrc AbsoluteSeek from
hSeek hDest AbsoluteSeek from
bs <- hGet hSrc $ fromIntegral len
hPut hDest bs
case res of
Left (a :: IOException) -> do putChar '#' ; return $ split (len `div` 5) (from,len)
Right _ -> do putChar '.' ; return []
putChar '\n'
copy hSrc hDest $ concat chunks

start = 10000000
stop = 1000

split :: Integer -> (Integer,Integer) -> [(Integer,Integer)]
split i (a,b) | i < stop = []
| i >= b = [(a,b)]
| otherwise = (a,i) : split i (a+i, b-i)

There are many limitations in this code, but it was sufficient to recover my file quickly and accurately.