Saturday, September 12, 2009

How I Use HLint

HLint is a tool for automatically suggesting improvements to your Haskell code. This post describes how I use HLint, and provides some background on its development. Before reading this article, if you are an active Haskell programmer who has not yet tried out HLint, I suggest you perform the following steps:


cabal update && cabal install hlint
cd your-current-project
hlint . --report
# open report.html in your web browser


The original purpose of HLint was to help teach beginners. When helping with the functional programming course at York, I used to wander round the students, looking at their code, and suggesting improvements. After three years helping with the same course, I found myself regularly suggesting the same improvements. For example, the pattern if a then True else b came up a lot, which can be written more succinctly as a || b. Of course, having turned myself into a pattern recognition tool, the obvious step was to automate myself - and HLint is the result.
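As a small illustration of the kind of rewrite involved, the sketch below puts the verbose form next to its shorter equivalent. The function names are made up for this example, and the second pair is an assumed pattern of the same flavour rather than one mentioned above.

```haskell
-- The pattern from the course, written out in full.
verboseOr :: Bool -> Bool -> Bool
verboseOr a b = if a then True else b

-- The more succinct form.
conciseOr :: Bool -> Bool -> Bool
conciseOr a b = a || b

-- A similar pattern (an assumption for illustration, not from the text above).
verboseAnd :: Bool -> Bool -> Bool
verboseAnd a b = if a then b else False

conciseAnd :: Bool -> Bool -> Bool
conciseAnd a b = a && b
```

Each pair agrees on all four combinations of inputs, and neither form inspects b when a already decides the answer.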

I am no longer at a University, and so the way I use HLint has changed. Often on the Haskell Cafe mailing list people ask for code reviews - intermediate level Haskellers trying to gain knowledge from those around them. The suggestions resulting from a code review are often split into two categories. There are small-scale suggestions about things such as using a better library function, and large-scale suggestions about what the structure of the program should be. Often it is useful to tackle the small-scale issues, tidying and polishing what is already there, before investigating any large-scale issues. Unfortunately reviewers are often short of time, so they may not get round to making large-scale suggestions. The hope is that HLint can automate many of the small-scale suggestions, allowing clever people to use their time more effectively on the more complex problems.

Another reason to use HLint is one of developer pride. Some developers do not react well to criticism, and take comments about their code in a very personal way. Worse still, if you declare that some small syntactic pattern is the "wrong way to do it", then you can inadvertently end up just pointing out failings. In contrast, if HLint is run first, then the human suggestions are typically deeper, and are design trade-offs that can be debated.

HLint is not designed as a tool to fix existing code, but more as a tool to promote learning, thus pre-emptively fixing future code. I do not intend people to slavishly apply the hints given by HLint - each hint should be carefully considered. For example, the darcs project uses HLint, but has decided that they are not interested in eta reduction hints, so have used HLint's ignoring facility.
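For readers who have not seen the ignoring facility, one mechanism HLint documents is an annotation pragma placed in the source file. The sketch below assumes the installed version of HLint reads such pragmas and that the hint is named "Eta reduce"; the function is a made-up example, and this is not necessarily how the darcs project configures things.

```haskell
-- Ask HLint to skip eta-reduction hints for this whole module
-- (assumes a version of HLint that reads ANN pragmas).
{-# ANN module "HLint: ignore Eta reduce" #-}

-- Without the pragma, HLint would be expected to suggest: total = sum
total :: [Int] -> Int
total xs = sum xs
```

The pragma only changes what HLint reports; the compiled program behaves the same with or without it.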

One use of HLint is to provide an easy mechanism to start participating in an open source project. One of the largest hurdles in project participation is writing your first patch. Many projects have different conventions and requirements, plus there is usually a large code base that needs to be learnt. A good first step might be to run HLint over the code. While many of the hints suggested by HLint might be design decisions, or minor issues, there are likely to be a few more unambiguous improvements. As a simple example, taking the xmonad code base and applying HLint shows that the import Data.Maybe statements in XMonad\Core.hs could be combined. This would be a perfect first patch for a budding xmonad developer.

HLint can be used in many ways, but my two golden rules for HLint usage are:


  1. Do not blindly apply the output of HLint

  2. Never review code that hasn't had HLint applied

Tuesday, June 16, 2009

Draft paper on Derive, comments wanted

It's been a long time since I last blogged (about 3 months). Since then I've had a paper on Firstify accepted to the Haskell Symposium (I'll post the final version to my website shortly). I've also been writing a paper on Derive to go with my invited talk at Approaches and Applications of Inductive Programming (co-located with ICFP this year). I have to submit a final version by the 22nd of June (6 days' time), but any comments on this draft would be gratefully received - either add them as comments to this post or send an email to ndmitchell AT gmail DOT com.

Download link: http://community.haskell.org/~ndm/temp/derive_draft.pdf

Title: Deriving a DSL from One Example

Abstract: Given an appropriate domain specific language (DSL), it is possible to describe the relationship between Haskell data types and many generic functions, typically type class instances. While describing the relationship is possible, it is not always an easy task. There is an alternative -- simply give one example output for a carefully chosen input, and have the relationship derived.

When deriving a relationship from only one example, it is important that the derived relationship is the intended one. We identify general restrictions on the DSL, and on the provided example, to ensure a level of predictability. We then apply these restrictions in practice, to derive the relationship between Haskell data types and generic functions. We have used our scheme in the Derive tool, where over 60% of type classes are derived from a single example.

Home page: http://community.haskell.org/~ndm/derive/

Darcs repo: http://community.haskell.org/~ndm/darcs/derive

The work presented in this paper will become the basis of Derive 2.0. Many thanks for any comments!

Saturday, March 21, 2009

Concise Generic Queries

A few weeks ago David Miani asked how to write concise queries over a data type. The answer is certainly generic programming, a technique that I feel is underused in the Haskell community. I suggested David look at Uniplate, but he found greater success with SYB. Sean Leather gave a solution using EMGM. One of the advantages of Uniplate is conciseness, so I decided to tackle the same problem and compare.

A full description of the task, including data type definitions, is at Sean's blog. From a data type representing structured files (tables, headings, paragraphs) find a heading with a particular name then within that heading find a paragraph starting with "Description". The rest of this post contains solutions using Uniplate, EMGM (taken from Sean) and SYB (from David). The SYB solution is slightly different from the EMGM or Uniplate solutions, but they all do roughly the same generic operations. It is entirely possible that the EMGM/SYB solutions could be improved, but that is a job for other people.

Uniplate Solution

The Uniplate solution is:


projDesc :: String -> OrgFileP -> Maybe String
projDesc name p = listToMaybe [y |
    OrgHeadingP _ x ys <- universeBi p, name == x,
    ParagraphP y <- universeBi ys, "Description" `isPrefixOf` y]


The code can be read as:


  • Line 1: Type signature, given a name and a file, return the paragraph if you find one

  • Line 3: Find a heading with the right name

  • Line 4: Find a paragraph below that heading, whose name starts with "Description"

  • Line 2: Pick the paragraph



I find this code to be a clear, concise and simple description of the problem. The thought process to come up with the solution was as follows: You want to search, or perform a query. The first question is whether this is a deep (all nodes) or shallow (just the children) query - David doesn't say but the example seems to imply deep. If it's deep use universeBi. Operations are combined with a list comprehension that finds an element, checks it has the necessary properties (the name), then performs more operations. The result is the code you see above.
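To make the deep versus shallow distinction concrete without installing anything, here is a minimal sketch that uses hand-written stand-ins for the two kinds of traversal. The Node type, the sample document and the functions children and universe below are all assumptions written for this illustration: they mimic the idea on one simplified type, whereas Uniplate provides the traversals generically.

```haskell
import Data.List (isPrefixOf)
import Data.Maybe (listToMaybe)

-- A simplified, hypothetical document type (not the real OrgFileP).
data Node = Heading String [Node] | Paragraph String
    deriving (Show, Eq)

-- Shallow query: only the immediate children of a node.
children :: Node -> [Node]
children (Heading _ ns) = ns
children (Paragraph _)  = []

-- Deep query: the node itself and everything below it.
universe :: Node -> [Node]
universe n = n : concatMap universe (children n)

-- Same shape as the query above: find the heading, then search below it.
projDesc :: String -> Node -> Maybe String
projDesc name doc = listToMaybe
    [ y
    | Heading x ys <- universe doc, name == x
    , Paragraph y <- concatMap universe ys, "Description" `isPrefixOf` y ]

sample :: Node
sample = Heading "Root"
    [ Paragraph "Intro"
    , Heading "Project14"
        [ Paragraph "Notes"
        , Heading "Details" [Paragraph "Description: deep"] ] ]
```

With this sample, children sample has two elements while universe sample has six, and projDesc "Project14" sample finds the paragraph even though it is nested one heading further down. A shallow search below the heading would miss it.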

EMGM Solution

Sean's solution can be found at his blog:


projDesc :: String -> OrgFileP -> Maybe String
projDesc name file = do
    hdg <- G.firstr (headings name file)
    para <- firstPara hdg
    if para =~ "Description" then return para else Nothing

headings :: String -> OrgFileP -> [OrgHeadingP]
headings name = filter check . G.collect
    where
        check (OrgHeadingP _ possible _) = name == possible

firstPara :: OrgHeadingP -> Maybe String
firstPara hdg = paraStr =<< G.firstr (G.collect hdg)
    where
        paraStr (ParagraphP str) = Just str
        paraStr _ = Nothing


This solution isn't bad, but is more verbose than the Uniplate solution. Perhaps it could be rewritten with list comprehensions? It seems that G.collect is similar to universeBi - although I am not sure.

SYB Solution

David's SYB solution can be found here along with another solution using different combinators.


eitherOr :: Either a b -> Either a b -> Either a b
eitherOr x@(Right _) _ = x
eitherOr _ y = y

getP14Desc :: OrgElement -> Either ErrString String
getP14Desc org = everything eitherOr (Left descError `mkQ` findDesc) =<<
                 everything eitherOr (Left findError `mkQ` findP14) org
    where
        findP14 h@(Heading {headingName=name})
            | name == "Project14" = Right h
        findP14 _ = Left findError

        findDesc (Paragraph {paragraphText=text})
            | text =~ "Description" = Right text
        findDesc _ = Left findError

        descError = "Couldn't find description for project"
        findError = "Couldn't find project."


Summary

The relative merits of each solution are highly subjective, but I believe the Uniplate solution is concise. The Uniplate solution is a simple translation of the problem, without any clever steps, so hopefully other users (who didn't write the library!) will be able to achieve similar results. The Uniplate solution required only one function from the Uniplate library, so has a small learning curve. Even if you don't choose Uniplate, generic programming techniques are very useful, and can make your code concise and robust.

Monday, March 09, 2009

Website move

Today I spotted that I could no longer push to my darcs repos hosted at York University. A little more checking showed that my home page had also been removed - I guess that's what happens when you are no longer a student there (although a warning email before would have been nice...). So I am pleased to announce my new website address:

http://community.haskell.org/~ndm/

Thanks to the wondrous Haskell community for providing all the resources I needed to move my website in haste, with no human intervention. Expect my darcs repos to move somewhere shortly too.

I have now submitted the final bound copies of my thesis, and have uploaded a copy to my website (I had uploaded it to York, but didn't get chance to announce it!). I should say a great thank you to everyone who helped with my work/thesis, in particular Colin Runciman for supervising me for six years, and Detlef Plump and Simon Peyton Jones for examining me and really helping improve the final document with their comments.

The thesis has four content chapters, corresponding to Uniplate, Supero, Firstify and Catch. I have submitted a paper to ICFP 09 which expands/clarifies the Firstify work, which I'll upload as a draft shortly. For the other chapters, the version in the thesis is an improvement on the version in any papers I've published.

Sunday, February 22, 2009

Hoogle package search

Recently on the Haskell mailing list there have been some discussions about which packages Hoogle searches by default. One person remarked that it was unfortunate that the network package isn't searched by default. There are lots of packages on Hackage, and Hoogle needs to decide how to cope with so much choice. There are a number of questions that I need to answer in Hoogle:


  1. What packages should Hoogle search by default? All of hackage? The base libraries? Only the packages a user has installed? Only packages that make it in to the Haskell Platform?

  2. What groups of packages should Hoogle have available? Each package individually? All packages which compile on Windows? All packages by a certain author? All packages whose minor version number is even?

  3. What UI should Hoogle show? Should there be checkboxes for each OS's packages? Should there be a checkbox for each compiler/version? Should there be no UI but some documentation?



And these questions present a number of trade-offs:


  • The packages have to be divided under sensible and clear lines - I don't want to (and shouldn't) arbitrate divisions like "good" or "popular".

  • The more packages you search, the less relevant the results will be.

  • The fewer packages you search, the more chance that you miss something.

  • The more UI that is added the more confusing things get.

  • My development time for Hoogle derives Bounded, Finite and increasingly also derives Small.



Thoughts and suggestions are very welcome. I've set up a wiki page to track people's thoughts, so please make your views and arguments known: http://haskell.org/haskellwiki/Hoogle/Packages.

(As an aside, I recently found that dolphin friendly tuna is actually really harmful to the environment, far more harmful than dolphin unfriendly tuna. Read more here.)

Tuesday, February 03, 2009

Monomorphism and Defaulting

Haskell has some ugly corners - not many, but a few. One that many people consider exceptionally ugly is the monomorphism restriction. In this post I'm going to discuss three related issues - Constant Applicative Forms (CAFs), the monomorphism restriction and defaulting. But before we start, let's take a simple example.

Computing Pi

Haskell already provides pi, a constant which represents the value of pi, but let's assume it didn't. Taking a quick look at Wikipedia we can see that one way of computing pi is the Gregory-Leibniz series. We can calculate pi as:

pi = (4/1) + (-4/3) + (4/5) + (-4/7) + (4/9) + (-4/11) ...

So let's write that as a Haskell program:


pie = sum $ take 1000000 $ zipWith (/) (iterate negate 4) [1,3..]


Here the constant 1000000 gives the accuracy of our approach; increasing this value will give a higher precision. As it currently stands, the Haskell library says pi = 3.14159265358979 and our program says pie = 3.14159165358977. The two values first differ in the sixth decimal place, but six matching digits should be sufficient for most uses of pi :-)
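To see how the number of terms relates to the accuracy, here is a minimal sketch that takes the term count as a parameter. The names leibniz and leibnizError are made up for this illustration.

```haskell
-- Sum of the first n terms of the Gregory-Leibniz series for pi.
leibniz :: Int -> Double
leibniz n = sum $ take n $ zipWith (/) (iterate negate 4) [1,3..]

-- Absolute error compared with the library value of pi.
leibnizError :: Int -> Double
leibnizError n = abs (pi - leibniz n)
```

The error after n terms is roughly 1/n, so a thousand terms give an error of about 0.001, and the million terms used above give an error of about 0.000001. That matches the difference between the two values quoted above.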

CAFs

The disadvantage of our pie function is that (under Hugs) it takes about 4 seconds to evaluate. If we are performing lots of calculations with pi, calculating pie each time will be a serious problem. CAFs are the solution!

A CAF is a top-level constant, which doesn't take any arguments, and will be computed at most once per program execution. As a slight subtlety, if the constant has class constraints on it (i.e. is Num a => a, instead of a) then it isn't a CAF because the class constraints act like implicit arguments. Our pie function above doesn't take any arguments, so is a CAF.
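The distinction can be sketched with two otherwise identical definitions. The names below are made up for this illustration, the term count is reduced to keep it quick, and the comment about recomputation describes the unoptimised behaviour; a compiler may specialise the constrained version and share the work.

```haskell
-- A CAF: no arguments and no class constraints, so the sum is
-- computed at most once and then shared by every use.
pieCaf :: Double
pieCaf = sum $ take 100000 $ zipWith (/) (iterate negate 4) [1,3..]

-- Not a CAF: the Enum and Fractional constraints behave like hidden
-- arguments, so each use may redo the whole sum.
pieGeneric :: (Enum a, Fractional a) => a
pieGeneric = sum $ take 100000 $ zipWith (/) (iterate negate 4) [1,3..]

-- Both uses of pieCaf here refer to the same shared value.
circleStats :: Double -> (Double, Double)
circleStats r = (pieCaf * r * r, 2 * pieCaf * r)
```

Used at type Double, pieGeneric gives the same answer as pieCaf; the difference is only in how often the work might be repeated.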

Defaulting

While pie doesn't have any class constraints, the right-hand side of pie does! Take a look in Hugs:


Main> :t sum $ take 1000000 $ zipWith (/) (iterate negate 4) [1,3..]
:: (Enum a, Fractional a) => a

Main> :t pie
:: Double


The right-hand side works for any Enum and Fractional type, for example Float, but pie is restricted to Double. The reason is the defaulting mechanism in Haskell - if a type can't be nailed down precisely, but is constrained only by a handful of built-in classes, then it will default to a particular type. This feature is handy when working in an interactive environment, but can sometimes be a little unexpected.
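A smaller experiment shows the same mechanism at work. In the sketch below (the names are made up for this illustration) the first and third definitions leave the numeric type undetermined, so the standard defaulting rules pick one; the second pins the type down with an annotation.

```haskell
-- Only Fractional and Show constrain the type, so it defaults to Double.
defaulted :: String
defaulted = show (1 / 3)

-- An explicit annotation overrides the default.
asFloat :: String
asFloat = show (1 / 3 :: Float)

-- With no Fractional constraint, the default is Integer rather than Double.
big :: String
big = show (2 ^ 64)
```

Here defaulted is the same string as show (1 / 3 :: Double) and differs from asFloat, while big is computed as an arbitrary-precision Integer and so does not overflow. GHC can report each defaulted type if asked to warn about type defaults.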

Monomorphism restriction

Without defaulting the compiler would infer the type of pie as ::(Enum a, Fractional a) => a. However, such a definition would be rejected by the monomorphism restriction. The monomorphism restriction states that a function with no explicit arguments, but with class constraints, must be given a type annotation. This rejects functions like:


snub = sort . nub


To fix the problem there are two solutions:


snub i_hate_the_evil_mr = (sort . nub) i_hate_the_evil_mr

snub :: Ord a => [a] -> [a]
snub = sort . nub


For a function like pie only the second approach is applicable. The addition of dummy arguments to avoid the monomorphism restriction is sufficiently common that the HLint tool never suggests eta-reduction if the argument is named mr.

Conclusion

So why was the monomorphism restriction first introduced? For a function with no explicit arguments, the programmer might think they had written a CAF, but class constraints may substantially degrade the performance. Defaulting reduces the number of cases where the monomorphism restriction would otherwise bite, but it is still useful to be aware of the ugly corners.

There are proposals afoot to remove the monomorphism restriction and to increase the power of the default mechanism - hopefully both will be included in Haskell'.

Tuesday, January 27, 2009

Small scripts with Haskell

Normally I give blog posts detailing the fun, interesting or advanced stuff I do with Haskell. But that isn't a real representation of my programming life! Most of the time I am doing small scripts that do little tasks, so I thought I'd describe one of those. This post is written as Literate Haskell, which means you can save the whole contents as a .lhs file and run it in GHCi or Hugs.

The task I had to complete was to take a directory of files, and for each file foo.txt generate the files foo_m1.txt to foo_m3.txt, where each file is a block of lines from the original, delimited by a blank line. For example, given the file with the lines ["","1","1","","2","","3"], the numbers "1" would go in foo_m1.txt etc.

This blog post isn't how I actually wrote the original script - I didn't use literate Haskell (since I find it ugly), I didn't give explicit import lists (since they are needlessly verbose), I didn't give type signatures (but I should have) and I didn't split the IO and non-IO as well (but again, I should have). It is intended as a guide to the simple things you can easily do with Haskell. Now on to the code...


> import System.FilePath(takeExtension, dropExtension, (<.>), (</>))
> import System.Directory(getDirectoryContents)
> import Data.Char(isSpace)
> import Control.Monad


First, let's import some useful modules. To find more about a particular function just use Hoogle and search for it, but a quick summary:


takeExtension "foo.txt" = ".txt"
dropExtension "foo.txt" = "foo"
"foo" <.> "txt" = "foo.txt"
"bar" </> "foo.txt" = "bar/foo.txt"
getDirectoryContents "C:\Windows" = running "dir C:\Windows" at the command prompt
isSpace ' ' = True


Every Haskell program starts with a main function, which is an IO action. For this program, we are going to keep all the IO in main, and only use other pure functions. With most file processing applications it's best to read files from one directory, and write them to another. That way, if anything goes wrong, it's usually easy to recover. In this case we read from "data" and write to "res".


> main :: IO ()
> main = do
>     files <- getDirectoryContents "data"
>     forM_ files $ \file -> when (takeExtension file == ".txt") $ do
>         src <- readFile $ "data" </> file
>         forM_ (zip [1..] (splitFile src)) $ \(i,x) ->
>             writeFile ("res" </> dropExtension file ++ "_m" ++ show i <.> "txt") x


Or in some kind of pseudo-code:


main =
    set files to be the list of files in the directory "data"
    for each file in files which has the extension ".txt"
    {
        set src to be the result of reading the file
        for each numbered result of splitFile
        {
            write out the value from splitFile to the location "res/file_m#.txt"
            where # is the 1-based index into the list of results
        }
    }
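The expression that builds the output file name mixes three operators, so it may help to look at it in isolation. The sketch below pulls it out into a pure function; the name outPath is made up for this illustration, and it relies on the operator fixities declared by System.FilePath (<.> binds tighter than </> and ++, which are both right-associative at the same level).

```haskell
import System.FilePath (dropExtension, (<.>), (</>))

-- Build the output location for chunk i of the given input file.
-- Parsed as: "res" </> (dropExtension file ++ ("_m" ++ (show i <.> "txt")))
outPath :: FilePath -> Int -> FilePath
outPath file i = "res" </> dropExtension file ++ "_m" ++ show i <.> "txt"
```

For the input foo.txt and index 1 this gives foo_m1.txt inside the res directory, with whichever path separator the platform uses.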


We can now move on to the pure bits left over. We want a function splitFile that takes the contents of a file, and splits it into three chunks, one for each of the blocks in the file. When processing text, often there will be stray blank lines, and the term "blank lines" will also apply to lines consisting only of spaces. The code is below:


> splitFile :: String -> [String]
> splitFile xs = map (tabify . unlines) [s1,s2,s3]
>     where
>         xs2 = dropWhile null $ map (dropWhile isSpace) $ lines xs
>         (s1,_:rest) = break null xs2
>         (s2,_:s3) = break null $ dropWhile null rest


And now presented more as a list of steps:


  • split the text in to lines

  • for each line drop all the leading spaces from it

  • drop all the leading blank lines

  • break on the first empty line, the bits before are chunk 1

  • drop all leading blank lines for the rest

  • break on the first empty line in the rest, before is chunk 2, after is chunk 3

  • for each of the chunks, put the lines back together, then tabify them



The tabify requirement was added afterwards. The person decided that all contiguous runs of spaces should be converted to tabs, so the file could be loaded more easily into a spreadsheet. Easy enough to add, just a simple bit of recursive programming:


> tabify (' ':xs) = '\t' : tabify (dropWhile (== ' ') xs)
> tabify (x:xs) = x : tabify xs
> tabify [] = []


And again in English:


  • if you encounter a space, drop it and all successive spaces, and write out a tab

  • otherwise just continue onwards
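As a worked example of the two pure functions, here is a minimal sketch that can be loaded on its own. It repeats the definitions from this post in ordinary (non-literate) form, adds a type signature for tabify, and introduces two example values whose names are made up for this illustration.

```haskell
import Data.Char (isSpace)

splitFile :: String -> [String]
splitFile xs = map (tabify . unlines) [s1, s2, s3]
  where
    -- strip leading whitespace from every line, then skip leading blank lines
    xs2 = dropWhile null $ map (dropWhile isSpace) $ lines xs
    (s1, _:rest) = break null xs2
    (s2, _:s3) = break null $ dropWhile null rest

tabify :: String -> String
tabify (' ':xs) = '\t' : tabify (dropWhile (== ' ') xs)
tabify (x:xs) = x : tabify xs
tabify [] = []

-- The sample file from the task description.
exampleChunks :: [String]
exampleChunks = splitFile (unlines ["", "1", "1", "", "2", "", "3"])
-- exampleChunks == ["1\n1\n", "2\n", "3\n"]

-- Runs of spaces collapse to a single tab.
exampleTabs :: String
exampleTabs = tabify "a   b c"
-- exampleTabs == "a\tb\tc"
```

One thing to be aware of is that splitFile assumes there really are three blocks: the patterns in the where clause cannot match an input with fewer, so such a file fails at run time. That is acceptable for a quick script over known data, but a longer-lived program would want to handle that case explicitly.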



Haskell is a great language for writing short scripts, and as the libraries improve it just keeps getting better.