Monday, April 21, 2008

Summer of Code 2008

This year I am going to be participating in the Google Summer of Code as a student, for the organisation, on the Haskell API Search as an interface to Hackage project - aka Hoogle 4. I will be mentored by Niklas Broberg, author of various tools including Haskell Source Extensions, which is already used by Hoogle. My project link gives the summary I gave for the project, but below I've posted the interesting bits from my full application. I am going to be posting my progress at least once a week once the project phase starts (about 6 weeks time). I welcome any comments!

What is the goal of the project you propose to do?

There are two main goals:

1) Make Hoogle more useful to the community, along the same path as it is currently used.

2) Make Hoogle suitable to use as the standard interface to Hackage.

Can you give some more detailed design of what precisely you intend to achieve?

# Removal of all bugs

Hoogle 3 has a number of embarrassing bugs, some of which are not easily fixed. The nastiest of these is to do with monads, which are horribly mistreated. Since I now know the underlying issues which have caused a problem with Hoogle 3, things like higher-kinded type classes can be solved in a more principled manner.

# Support for some Haskell type extensions

Hoogle 3 does not deal with multi-parameter type classes. I would like to support a variety of type system extensions, primarily by mapping them on to Haskell 98 equivalent types.

# Faster searching

The current implementation is O(n) in the number of functions in the library, where the constant factor is absolutely massive. I wish to make text searching O(s), where s is the length of the search string, and have an incredibly low constant overhead -- using the a lazy file-based trie.

The type searching also needs a massive speed up. I have some ideas on how to proceed, but it is a difficult problem! I will spend a small amount of time investigating this problem, but may have to use a simpler algorithm, rather than delay the rest of the project.

# Better deployment

Currently there is Hoogle for the base libraries, and a special (very hacked) version that supports Gtk2hs only. I have received several requests for custom Hoogle instances for tools such as XMonad, Ycr2js, wxHaskell etc. The new Hoogle will make deployment of individual versions for specific packages easy.

# Support for multiple packages

I wish to support searching through every package on Hackage at once. This requires a massive speed up in the searching algorithms.

# Generalised text searching

By searching both function names, and also cabal package descriptions, Hoogle can be much more useful in finding packages, as opposed to individual functions.

# Better Design

Hoogle 3 is a web application, with a hacked on command line program. Hoogle 4 will be a central API which can be reused from any IDE tools, and also used to build the web interface and the command line application.

# Generalised interface to all of Cabal

Hopefully all the above goals will result in a tool that is suitable to be an interface to Cabal.

What deliverables do you think are reasonable targets? Can you outline an approximate schedule of milestones?

I would plan to release a beta of Hoogle 4 approximately half way through the project, as a web application. Much of the initial design has been done, so this is primarily hacking time.

I would then hope to complete the final hackage integration for the second half. This stage will require discussion with the cabal people, and will be a combination of design, implementation and server administration/setup.

In what ways will this project benefit the wider Haskell community?

Hoogle is already of use to the community, but has never seen a final release, and has a reasonable number of known bugs. This project would produce a polished version of a tool for which we already know there is huge demand.

Hackage is working well, and gaining new packages every day. As the number of packages increases, the interface to hackage must be updated to handle this volume. Discussions with some of the hackage/cabal team seem to suggest that a search interface is the way forward. By making Hackage easier to use, everyone benefits.

Saturday, April 12, 2008

darcs Feature Request (Part II)

I previously requested a feature for darcs. I always pull from an http repo, and push over SSH. I have to push using --no-set-default and typing the ssh repo in full, which I automate with a .bat file in each repo.

Today I noticed that darcs has _darcs/prefs/repos, which seems to list the repo's that darcs has used. In one of my typical repo files, I have an http entry and an SSH entry. To get darcs to behave the way I want, all I need to do is push using the first non-http repo in that list.

I have implemented my version of the darcs push command inside my paper tool, the code is all online here. Now I can delete all my push.bat scripts, and just type paper push from any darcs repo. As an added bonus, I now don't need to change to the root directory to perform a push.

It would be really nice if someone could incorporate this feature into the main darcs codebase. However, I'm quite happy using my paper tool for now. I certainly don't have time to patch, or even build darcs :-)

Wednesday, April 09, 2008

TagSoup Parsing: Dictionary Extraction

I've just read issue 10 of The Monad.Reader. It's a great issue, including a tutorial on using the new GHCi debugger, and how to write an efficient Haskell interpreter in Haskell. The running example for the GHCi debugger is parsing the computer dictionary and extracting descriptions from keywords, using the TagSoup library. The article starts with an initial version of the extraction code, then fixes some mistakes using the debugger present in GHCi. The code was written to teach debugging, not as a demonstration of TagSoup. This post explains how I would have written the program.

The original program is written in a low-level style. To search for a keyword, the program laboriously traverses through the file looking for the keyword, much like a modern imperative language might. But Haskell programmers can do better. We can separate the task: first parsing the keyword/description pairs into a list; then searching the list. Lazy evaluation will combine these separate operations to obtain something just as efficient as the original. By separating the concerns, we can express each at a higher-level, reducing the search function to a simple lookup. It also gives us more flexibility for the future, allowing us to potentially reuse the parsing functions.

I have fixed a number of other bugs in the code, and my solution is:

module Main where
import Text.HTML.TagSoup
import Maybe

main = do
putStr "Enter a term to search for: "
term <- getLine
html <- readFile "Dictionary.html"
let dict = parseDict $ parseTags html
putStrLn $ fromMaybe "No match found." $ lookup term dict

parseDict :: [Tag] -> [(String,String)]
parseDict = map parseItem
. sections (~== "<dt>")
. dropWhile (~/= "<div class=glosslist>")

parseItem xs = (innerText a, unwords $ words $ innerText b)
where (a,b) = break (~== "<dd>") (takeWhile (~/= "</dd>") xs)

Instead of searching for a single keyword, I parse all keywords using parseDict. The parseDict function first skips over the gunk at the top of the file, then finds each definition, and parses it. The parseItem function spots where tags begin and end, and takes the text from inside. The unwords $ words expression is a neat trick for normalising the spacing within an arbitrary string.

This revised program is shorter than the original, I find it easier to comprehend, and it provides more functionality with fewer bugs. The TagSoup library provides a robust base to work from, allowing concise expression of HTML/XML extraction programs.

Tuesday, April 08, 2008

Optional Parameters in Haskell

I use optional parameters in my TagSoup library, but it seems not to be a commonly known trick, as someone recently asked if the relevant line was a syntax error. So, here is how to pass optional parameters to a Haskell function.

Optional Parameters in Other Languages

Optional parameters are in a number of other languages, and come in a variety of flavours. Ada and Visual Basic both provide named and positional optional parameters. For example, given the definition:

Sub Foo(b as Boolean = True, i as Integer = 0, s as String = "Hello")

We can make the calls:

Call Foo(s = "Goodbye", b = False)
Call Foo(False, 1)

In the first case we give named parameters, in the second we give all the parameters up to a certain position.

In some languages, such as GP+, you can say which parameters should take their default values:

Call Foo(_, 42, _)

Optional Parameters in Haskell

Haskell doesn't have built-in optional parameters, but using the record syntax, it is simple to encode named optional parameters.

data Foo = Foo {b :: Bool, i :: Integer, s :: String}
defFoo = Foo True 0 "Hello"

foo :: Foo -> IO ()

Now we can pass arguments by name, for example:

foo defFoo{s = "Goodbye", b = False}

This syntax takes the value defFoo, and replaces the fields s and b with the associated values. Using a type class, we can abstract this slightly:

class Def a where
def :: a

instance Def Foo where
def = defFoo

Now we can make all functions taking default arguments use the def argument as a basis, allowing type inference and type classes to choose the correct default type. Even still, optional parameters in Haskell are not quite as neat as in other languages, but the other features of Haskell mean they are required less often.

This technique has been used in TagSoup, particularly for the parseTagOptions function. I've also seen this technique used in cpphs with the runCpphs function.

Tuesday, April 01, 2008

Automated Proof Reading

I've been spending the last few months writing papers, which requires lots and lots of proof reading. Yesterday Tom shared one trick he'd used to help with his thesis. He took his thesis, ran detex over it, then used a text-to-speech program to read it back to him. This trick ensures you pick up all the subtle things like repeated words, and forces you to listen to every word, not skim reading bits. I thought this was a great idea, so implemented something similar in my paper preparation tool (release forthcoming), so typing paper talk reads the paper back to you, converting the LaTeX in a sensible way, to make it as readable as possible.

Here is result of running paper talk on the introduction section of my Catch paper:

The audio is a little clearer before being compressed for upload, but still has a very clunky feel. It is surprisingly useful - I made about 12 minor changes as a result. It has some oddities, for example ML becomes millilitre, and some pause issues with brackets, which the preprocessor sorts out. The entire paper takes 49 minutes to read, but I think I will be doing this will all papers from now on.