Neil Mitchell's Blog (Haskell etc): 2008

Thursday, December 11, 2008

mapM, mapM_ and monadic statements

In my last post on F# I mentioned that do mapM f xs; return 1 caused a space leak, and that the programmer should have written mapM_. I also proposed that monadic statements should work more like in F# where non-unit return values can't be ignored. Various people seemed to misunderstand both points, so I thought I'd elaborate.

mapM as a Space Leak

First, I should clarify what I understand as a space leak. A space leak is not a memory leak in the C sense. A space leak is when a computation retains more live memory than necessary for some period of time. One sign of a possible space leak is that lots of memory is retained by garbage collection.

Comparing mapM and mapM_ on the following program:


main = do
    mapM* putChar (replicate 10000 'a')
    return ()

The mapM_ variant has a maximum heap residency of 2Kb, while the mapM variant has 226Kb. Given an input list of length n, the residency of mapM_ is O(1), while mapM is O(n).

The exact reasons for the space leak are quite detailed, and I'm not going to attempt to cover them. My intuition is that the return list is wrapped in the IO monad, and therefore can't be deallocated until the IO action finishes. In summary, unless you are going to use the end result of a monadic map, always use mapM_.

Monadic Statements

In the above example it would be nice if the compiler had complained. You generated a value, but you didn't use it. Fortunately, it is a very easy fix - change the type of monadic bind (>>) :: Monad m => m a -> m b -> b to Monad m => m () -> m b -> m b. Now, if a monadic statement generates a value that isn't (), you get a type error. The above examples with mapM would be rejected by the type checker.

But what if we really wanted to call mapM? There are two options. The first is to bind the result, for example do _ <- mapM f xs; return 1. The second option, which F# favours, is do ignore $ mapM f xs ; return 1, with the auxiliary ignore :: Monad m => m a -> m (). I prefer the second option, as it clearly states that you want to ignore the result of a computation. You could even write a rule ignore . mapM f == mapM_ f.

Sunday, December 07, 2008

F# from a Haskell perspective

I've recently started a full-time job at Standard Chartered. Before that I was doing an internship with Credit Suisse, where I spent a reasonable amount of time doing F# programming. Before I started F# I had 6 years of Haskell experience, plenty of C# experience, but little exposure to ML. I've now had 3 months to experiment with F#, using an old version (the one before the latest Community Technology Preview) and here are my impressions.

F# is a functional language from Microsoft, previously a Microsoft Research language, which is moving towards a fully supported language. F# is based on ML, and some (perhaps many) ML programs will compile with F#. At the same time, F# has complete access to the .NET framework and can interoperate with languages such as C#. F# is a hybrid language - at one extreme you can write purely functional ML, and at the other extreme you can write imperative C#, just using a different syntax. F# seems to be designed as a practical language - it isn't elegant or small, but does interoperate very nicely with every .NET feature.

The Language

The F# language is based on ML, with the addition of indentation based layout, and many of the weaknesses in F# come from ML. F# is certainly more verbose than Haskell: in some places you need an extra keyword (often a let); pattern matching is not nearly as complete; the indentation isn't as natural as Haskell. However there are some nice syntactic features in F# that are not in Haskell, including generalised list/array/sequence comprehensions and active patterns.

The type checker in F# is powerful, but unpredictable. I often get surprised by where type annotations need to go, particularly when working with .NET object types. The tuple type is treated specially in many cases, and this also leads to surprise - inserting or removing a pair of brackets can effect the type checker. Much of this complexity is necessary to manage the interaction with .NET, but it does complicate the language. Unfortunately, even with the advanced type features in F#, there are no type classes. The lack of type classes precludes the standard implementation of things such as Uniplate and QuickCheck. However, F# does have some nice reflection capabilities, and often entirely generic implementations can be given using reflection. There is certainly an interesting design trade-off between reflection based operations and type classes, something I have looked at in the past and hope to explore again in future.

F# is an impure language, which offers some practical benefits from Haskell, but also encourages a less functional style. In Haskell I sometimes work within a localised state monad - F# makes this much more natural. The impurity also allows simple interaction with .NET. Having programmed with an impure language I did find myself reaching for localised state much more often - and was occasionally tempted into using global state. In most cases, this state became problematic later. Before using F# I thought purity was a good thing, now I'm convinced that purity is a good thing but that impurity is often very useful!

Haskell could learn some things from F#. Every statement in F# must either be bound to a value or evaluate to (). In Haskell it is possible to write do mapM f xs; return 1. Any experienced Haskell programmer should spot that the mapM is a space leak (it should be mapM_), but the type system doesn't enforce it. In F# the type system does. The change in Haskell is simple, and in my opinion, desirable.

The Platform

F# is a fully fledged member of the .NET platform. You can write a class in VB.NET, write a derived class in F#, and then derive from that class in C#. All the languages can produce and consume the same libraries. This integration with .NET allows companies that already use the Microsoft tools to easily migrate - even on a function by function basis. However, the combination of both an imperative framework and a functional language at some times leads to confusing choices. All of the standard .NET libraries work with arrays, but for a functional program the list is a more natural type. F# provides both, and it was never clear which I should use where, leading to lots of conversions. The .NET libraries are very powerful, but often are overly imperative. For example, the XSD libraries (Xml Schema Description) are very imperative - you have to create objects, mutate properties, then make calls. However, in Haskell, I probably wouldn't have had any XSD support, certainly nothing as well-supported as in .NET.

The Tool Chain

The most impressive aspect of F# is the integration with the Visual Studio environment. F# contains a debugger, profiler, auto-completion, identifier lookup and many other tools. While other functional languages have some of these tools, the Visual Studio environment tends to have very refined and polished implementations. The integration with F# is sometimes a little fragile, or at least was in the version I was using, but the tools are already very powerful and are likely to continue to improve.

Overall

The F# language isn't the most beautiful language ever, but it's not bad. The integration with .NET is incredible, and while this requires compromises in the language, the benefits are considerable. I still prefer Haskell as a language, but for many users the tool chain is a more important consideration, and here F# excels.

This post was brought to you by Ctrl and v, as the computer I am currently using doesn't have a # key!

Update: I'd recommend reading Vesa Karvonen's comment below - he has additional perspectives on F# from a more ML perspective.

Saturday, December 06, 2008

Enabling Reply To All in Outlook

Update: See http://neilmitchell.blogspot.com/2011/12/enabling-reply-to-all-in-outlook.html for an updated version of this functionality.

Some companies lock down the use of Outlook by disabling the Reply To All button. This makes it harder to manage email, and requires manually copying email addresses to get the same effect. But using a bit of Office VBA, it is possible to make a functioning Reply To All button. The following solution has been tested in Outlook 2003, but should work for older versions as well.

First, enable macros in Outlook. Go to Tools, Macro, Security and select Medium or Low security.

Second, add a Reply To All action. Go to Tools, Macro, Visual Basic Editor and put the following code in the text editor.


Option Explicit

Public Sub ReallyReplyAll()
Dim o As MailItem
Set o = Application.ActiveExplorer.Selection.Item(1)
o.ReplyAll.Display
End Sub

Finally, add a toolbar button to invoke the action. Go to Tools, Customise, Commands, Macros, and drag and drop the command Project1.ThisOutlookSession.ReallyReplyAll on to the toolbar. You can put this command exactly where you used to have Reply To All, and give it the same icon/name.

To test, select an email and click on the button you just added, it should do exactly what Reply To All would have done. There are some minor limitations to this method:

The button will not disable itself when it isn't applicable, i.e. when there are no emails selected. You will still be able to click on the button, but it won't do anything.
If you select a medium level of macro security, you will have to go through a security confirmation the first time you click Reply To All in an Outlook session.

If possible, try to educate the person in charge that Reply To All is perfectly good email etiquette, and that people should be trusted to use it responsibly. However, if that fails, the above method is a useful fallback.

Monday, September 29, 2008

General Updates

It's been a little while since I last posted. I've recently got back from ICFP 2008, and quite a few people asked me what I was doing now. I've also got a few comments on a a few other things. The following as a section of disjointed paragraphs on a variety of topics, both academic and personal.

Catch Talk

I gave a talk at the Haskell Symposium, about Catch. A video of the talk is now online.

Generics Talk

Alexey gave a talk about generic programming libraries at the Haskell Symposium. I was particularly interested in this talk as it is somewhat like a competition between libraries, where Uniplate is one of the competitors. One thing I noticed is that the Uniplate version of the SYB example in the talk can be written as one single lexeme, namely uniplateBi. The talk was much more about generics libraries, while Uniplate is probably more accurately described as a traversal library, so issues such as conciseness of code were left out. One thing I did disagree with from the talk was the assertion that Uniplate requires Template Haskell and Data/Typeable deriving. In reality Uniplate requires neither, but if they are present, then you have the option of using them to write even less code.

From a combination of the paper and the talk I think its fair to conclude that if Uniplate does what you want, its a pretty good choice. This fits well with the Uniplate philosophy of giving up a small amount of power, to allow a massive simplification, while still being powerful enough for most tasks.

PhD/Work

I've had my PhD viva (passed with minor corrections), and have nearly finished making the minor corrections. I'll update my website with a revised copy of the thesis shortly. I'm currently working at Credit Suisse on a three month internship. I'm not working on Haskell stuff, but instead am doing F# programming. To get a feel for some of the things that are done by Credit Suisse I recommend looking at Ganesh's ICFP talk/paper and Howard's CUFP talk. Disclaimer: Nothing I say on this blog, or anywhere public, has anything to do with Credit Suisse, but are my personal thoughts.

Personal Life

I've just moved to Cambridge, and got engaged to my girlfriend (now fiancee), Emily King. I'll be commuting to Credit Suisse for the next two months.

My Libraries/Tools

Now I'm working full-time (long hours plus a long commute), its hard for me to put the same amount of time into updating and maintaining my Haskell libraries and tools. I will still be accepting patches and answering questions, but probably not fixing too many bugs at any great speed. I'm still maintaining my bug tracker, so feel free to add bugs, fix bugs, or comment on bugs. If anyone has any particular interest in a tool, I'd consider taking on a co-maintainer to reduce some of the maintenance burden.

Well Typed

There is now a Haskell consultancy, Well Typed, comprising of Duncan Coutts and Ian Lynagh. These are two very good Haskell hackers, who are now selling their knowledge and experience. Between them, they've had substantial experience with GHC, Cabal, Hackage, ByteString, TemplateHaskell and numerous Haskell libraries. They've also taught lots of students Haskell, and helped lots of beginners on IRC and mailing lists. If I want help with Haskell, or with the general infrastructure and tools, they are usually the first people I approach. I strongly recommend that anyone needing Haskell help in a commercial environment get in contact with them - they can help you get the most out of Haskell. Disclaimer: I haven't been asked to write this section, and haven't checked with Ian/Duncan first, but I do wish them luck!

Thursday, August 28, 2008

Running your own Hoogle on a Web Server

As promised, here is a guide on deploying Hoogle on a web server. Before doing so, you need to generate the necessary Hoogle databases, as described yesterday, and place them in the datadir configured with Cabal. Then:

Move the hoogle binary to a location where it can act as a CGI binary, perhaps changing its name to index.cgi, if necessary. Configure the CGI program to run, possibly changing the program to be executable or adding settings somewhere.

Copy the files from src/res in the darcs repo into a res directory located beside the binary.

Create a file log.txt and give it global write permissions.

Now you should have Hoogle running on a web server! Some of the features, such as OpenSearch integration, won't work - but Hoogle should be usable. If anyone does get Hoogle running on a web server I'd love to hear, any feedback appreciated. In particular, if there are any tweaks required please let me know.

Wednesday, August 27, 2008

Hoogle Database Generation

Brief Annoucement: A new release of the Hoogle command line is out, including bug fixes and additional features. Upgrading is recommended.

Two interesting features of Hoogle 4 are working with mulitple function databases (from multiple packages), and running your own web server. Both these features aren't fully developed yet, and may change in their use, but can be used with care. This post covers how to generate your own databases, and how the web version databases are generated. Tomorrow I'm going to post on how to run your own Hoogle web server, but you'll need to generate your databases first! I'm going to walk through all the steps to create a database from the filepath library, as an example

Hoogle Databases

A Hoogle database is a set of searchable things, including text and type searching, and has a ".hoo" extension. A database may include the definitions from one package, or from multiple packages. Typically the Hoogle databases installed would include one database for each package (i.e. base.hoo, filepath.hoo), a default database (default.hoo) comprising of all the standard search items, and any number of custom databases (all.hoo) which comprise of different combinations of the other databases.

When using Hoogle, adding +name will include the given database in the search list, and -name will exclude the given package from the search. By default, Hoogle will use default.hoo, but if any +name commands are given then those databases will be used instead.

Hoogle looks for databases in the current directory, in the data directory specified by Cabal, and in any --include directories passed at the command line.

Step 1: Creating a Textbase

A Textbase is a textual representation of a function database. To generate a textbase you need to install the darcs version of Haddock, then use runhaskell Setup haddock --hoogle on your package. For filepath, this will create the file dist/doc/html/filepath/filepath.txt, which is a textbase.

Step 2: Converting a Textbase to a Database

To convert a textbase to a database use the command hoogle --convert=filepath.txt in the appropriate folder. If a package depends on any other packages, then adding +package will allow Hoogle to use the dependencies to generate a more accurate database. In the case of filepath, which depends on base, we use hoogle --convert=filepath.txt +base. This command requires base.hoo to be present.

Adding the dependencies is not strictly necessary, but will allow Hoogle to generate a more accurate database. For example, the base package defines type String = [Char], without the +base flag this type synonym would not be known to Hoogle.

We now have filepath.hoo, which can be used as a search database.

Step 3: Combining Databases

To generate a database comprising of both filepath and base, type hoogle --output=default.hoo --combine=filepath.hoo --combine=base.hoo. By combining databases you allow easy access to common groups of packages, and searching all these packages at once becomes faster than listing each database separately.

Web Version Databases

The web version uses the Hackage tarballs to generate documentation for most of its databases, but also has three custom databases:

base - the base package is just too weird, and isn't even on hackage. A darcs version and some tweaking is required.

keyword - the keyword database is a list of the keywords in Haskell, and is taken from the web page on the wiki.

hackage - the hackage database is a list of all the packages on Hackage, indexed only by the package name.

All the code for generating the web version databases is found in data/generate in the Hoogle darcs repo at http://code.haskell.org/hoogle.

Future Improvements

There are two database related tasks that still need to be done: Cabal integration and indexing all of Hackage.

Bug 80: In the future I would like Hoogle databases to be generated by Cabal automatically on installing a package. Unfortunately, I don't have the time to implement such a feature currently, and even if I did implement it, I'm unlikely to ever use it. If anyone wants to work on this, please get in contact. This is mainly a project working with Cabal.

Bug 79: The other work is to index all the packages on Hackage. The problem here is generating the textbases, once they have been created the rest is fairly simple. However, to run Haddock 2 over a package requires that the package builds, and that all the dependencies are present. Unfortunatley my machine is not powerful enough to cope with the number of packages on Hackage. Hopefully at some point the machinery that builds Haddock documentation for Hackage will also generate textbases, however in the mean time if someone wants to take on the task of generating textbases for Hackage, please get in contact.

Bug Tracker

I'm not working on Hoogle full-time anymore, so am using my bug tracker to keep track of outstanding issues. In order to interact more effectively with my bug tracker, you might want to read this guide. It describes how to vote for bugs etc.

Wednesday, August 20, 2008

Hoogle New Features

I've now finished my Hoogle Summer of Code work, though I still intend to continue working on Hoogle when I get the chance. Before the coding period expired, I was able to add a number of new features to Hoogle. These features are all available at Hoogle, under http://haskell.org/hoogle/.

More Compact Text Searching

The old text search feature was very fast, using an on disk trie to navigate around the possible matches. The downside to this trie was the space it consumed, about half the database was devoted to it. Fortunately, I came up with an alternative way to get fast text searching (albeit slightly slower), in a lot more compact form.

Much smaller database files also mean much faster database generation, as the time spent in the IO routines is the main bottleneck.

Faster IO routines

I rewrote the underlying binary layer in Hoogle, to make it faster. It's not as fast as I would like, and I think that moving to memory-mapped files is probably a good idea. With these improvements, along with the compact text searching, I am able to generate databases in about 2 seconds (compared to about 20 seconds before).

Database Restricted Searches

Hoogle has been able to run database restricted searches for some time, but now the databases contain enough information to make it practical. By adding +package or -package to the search you can include or exclude certain packages. For example, to find out which map functions are in the containers package try map +containers. To find out which map functions are not in the containers or bytestring packages try map -containers -bytestring. I have also split out the GHC.* modules from base, so if you want to find some unboxed types in GHC's libraries try # +ghc. Note that not all the documentation links work from the GHC modules, I am still trying to fix this.

By default Hoogle searches the following packages: array, base, bytestring, cabal, containers, directory, filepath, haskell-src, hunit, keyword, mtl, parallel, parsec, pretty, process, quickcheck, random, stm, template-haskell, time, xhtml

The "ghc" package is also available if specified with +ghc and includes the GHC.* modules of base only.

Hoogle 3

I have now replaced the default Hoogle with Hoogle 4, but have copied Hoogle 3 to http://haskell.org/hoogle/3. Unfortunately, it doesn't yet work, as I need some admin help. But it will in the next few days, I hope. The only reason I can think of for using Hoogle 3 is Gtk2hs library searching, which I do want to add to Hoogle 4 when possible.

Give Me Feedback

There are quite a lot of enhancements to Hoogle that I still want to make. I have tried to list all these improvements in my bug tracker. If you find a bug, or want some feature, open an issue. If you have a particular interest in a bug, you can star it, to be informed on its progress and to indicate to me that you care.

I'm particularly interested in two pieces of feedback:

I don't use Hoogle 4 because ...

Do you use any type/name search engine? Do you want to still use Hoogle 3? Do you use Hayoo? If you use something else, what feature draws you to it? What do you dislike about Hoogle 4?

I use Hoogle 4, but my life would be nicer if ...

There are many things which effect Hoogle 4 users that I'm not aware of. If you open a bug saying what annoys you (or leave a comment and I'll do it for you) then I can keep track of this information. Even if you don't necessarily see any way to fix the problems, I'd still like to know them.

Thanks for everyone who has given feedback on Hoogle so far, it has been very useful.

Friday, August 15, 2008

GSoC Hoogle: Week 12

This week I've been trying to get Hoogle 4 to the point where it can replace Hoogle 3. This is the final official week of Google Summer of Code, but I'm planning to continue hacking Hoogle next week, and then as time allows after that.

The priority this week was getting the documentation links working. The problem was not with Hoogle - displaying the links is trivial - but ensuring that Cabal + Haddock + Hoogle + random build scripts combine to generate the correct databases. This work involved lots of little changes in lots of places, but is now working properly. Included in this work is dependency tracking of packages (so that all packages using base know that String = [Char] etc), and merging multiple databases to create a single one.

After the Hoogle database was generated correctly, I started looking at using some of the additional information present. I have now added Haddock documentation inline in the search results. If the documentation is too long to fit comfortably, Hoogle uses AJAX wizzy-ness (or more accurately, DHTML) to allow the user to expand and show all the documentation. I suspect that this will eliminate many cases of the user actually following to the Haddock webpages. This feature is fairly new, and I have pushed it out because its useful - there are still many small improvements that need to be made.

This week I also spent some time attempting to generate documentation for all the Hackage libraries. I had some success, but the computer I am currently using is years old and lacks the necessary processing power. I will tackle this at some point in the future, once I have purchased a new machine (which should be quite soon).

With all these changes, I find Hoogle 4 to be significantly more usable than Hoogle 3. Please give it a try, and give feedback. At this point I'm particularly interested in any issues that would cause you to use Hoogle 3 instead of Hoogle 4.

Hoogle 3: http://haskell.org/hoogle

Hoogle 4: http://haskell.org/hoogle/beta

If there are no major issues, I will be replacing Hoogle 4 as the standard Hoogle sometime next week.

Next week: I will be no longer doing Google Summer of Code :-) I plan to refine some of the existing bits of Hoogle, and ensure that anything I haven't done is in a bug tracker for later.

User visible changes: The web search engine now gives Haddock links and displays Haddock documentation inline.

Monday, August 11, 2008

GSoC Hoogle: Week 11

This week I've been releasing lots. Hoogle 4 is finally starting to come together, and should be a worthy replacement for Hoogle 3 very shortly. Rather than go into detail about the past week, I'm just going to give some of the bullet points:

I have released 4 versions of the command line version of Hoogle, available on Hackage. Many bugs have been spotted by some very useful testers, and improvements have been made.

I have released a web version of Hoogle 4, and encourage feedback.

I have started to update the wiki Manual, which now contains some details of Hoogle's query syntax.

I gave a talk at AngloHaskell 2008, which is available online, as slides and an audio stream. All of the other talks were excellent and are well worth listening to.

I have started to build Hoogle documentation for all of Hackage. The machine I'm doing this on is very slow, so its not a quick process!

Next week: I'm hoping to work on generating better Hoogle databases, including a Hoogle database for the whole of Hackage. I also have a number of bugs to fix.

User visible changes: Users can download and use Hoogle, and the web interface is online.

Tuesday, August 05, 2008

Hoogle 4.0 web client preview

Since releasing a command line version of Hoogle 4 yesterday, I've had some useful feedback from a number of people. As a result, I have added a few bugs to the bug tracker, and fixed a few mistakes in the searching and ranking. The Hoogle on Hackage is currently 4.0.0.3 and is a recommended upgrade to all early testers.

I've now written a web interface to Hoogle 4, which has been uploaded to http://haskell.org/hoogle/beta/. This web interface is primarily so people can test searching/ranking without installing anything. There are a number of limitations:

The links to documentation do not work - this is the most severe problem, and probably stops people permanently changing to the new version.

The Haddock documentation is not present.

Some database entries are duplicates.

The Lambdabot says feature is missing.

The Suggestion feature is incomplete.

The AJAX style client features are not present.

The first three issues are fixed in Hoogle, but need various support through Haddock and Cabal to work. Other than these limitations, I am very interested in hearing what people think. As before, particularly regressions from Hoogle 3 or poor results/ranking.

Monday, August 04, 2008

Hoogle 4.0 release (beta, command line)

I am pleased to announce Hoogle 4.0, available on Hackage. A couple of things to note:

This is a release of the command-line version only. It will have identical searching abilities to the web-based version, which I'm about to write.

It currently only searches the same packages as Hoogle 3 (the final release will search more).

It currently doesn't support the --info flag as previously described (problems with Haddock, not with Hoogle).

Walkthrough: Installation

If you have cabal-install available, it should be as simple as:


$ cabal update && cabal install hoogle

Otherwise, follow the standard Cabal/Hackage guidelines. Hoogle depends on about 4 packages on Hackage which are not available with a standard GHC install, so these will need to be built.

Walkthrough: A few searches

Here are some example searches. I have used --count=5 to limit the number of results displayed. If you are using a terminal with ANSI escape codes I recommend also passing --color to enable colored output.


$ hoogle map --count=5
Prelude map :: (a -> b) -> [a] -> [b]
Data.ByteString map :: (Word8 -> Word8) -> ByteString -> ByteString
Data.IntMap map :: (a -> b) -> IntMap a -> IntMap b
Data.IntSet map :: (Int -> Int) -> IntSet -> IntSet
Data.List map :: (a -> b) -> [a] -> [b]

$ hoogle "(a -> b) -> [a] -> [b]" --count=5
Prelude map :: (a -> b) -> [a] -> [b]
Data.List map :: (a -> b) -> [a] -> [b]
Control.Parallel.Strategies parMap :: Strategy b -> (a -> b) -> [a] -> [b]
Prelude fmap :: Functor f => (a -> b) -> f a -> f b
Control.Applicative <$> :: Functor f => (a -> b) -> f a -> f b

$ hoogle Data.Map.map --count=5
Data.Map map :: (a -> b) -> Map k a -> Map k b
Data.Map data Map k a
module Data.Map
Data.Map mapAccum :: (a -> b -> (a, c)) -> a -> Map k b -> (a, Map k c)
Data.Map mapAccumWithKey :: (a -> k -> b -> (a, c)) -> a -> Map k b -> (a, Map k c)

$ hoogle "Functor f => (a -> b) -> f a -> f b" --count=5
Prelude fmap :: Functor f => (a -> b) -> f a -> f b
Control.Applicative <$> :: Functor f => (a -> b) -> f a -> f b
Control.Monad fmap :: Functor f => (a -> b) -> f a -> f b
Control.Monad.Instances fmap :: Functor f => (a -> b) -> f a -> f b
Data.Traversable fmapDefault :: Traversable t => (a -> b) -> t a -> t b

How you can help

I've released a command line version of the search to solicit feedback. I'm interested in all comments, but especially ones of the form:

I prefer the command line version of Hoogle 3 because ...

When I search for ... I would expect result ... to appear, or to appear above result ...

I was hoping for the feature ...

It takes too long when I ...

I'm going to be accumulating Hoogle 4 bugs in my bug tracker, or by email (http://www-users.cs.york.ac.uk/~ndm/contact/) - whichever you find more convenient.

Now I'm going to start work on the Web search :-)

Sunday, August 03, 2008

GSoC Hoogle: Week 10

This week I've been in Bristol, and am just about to head off to the Harbour Festival. Next week I'm heading off to AngloHaskell 2008, and will be talking about Hoogle type searching on the Saturday.

This week has been type search, yet again. There were issues with algorithmic complexity, combinatorial explosions and other fun stuff. However, its now finished. The type search is now fast enough (you can run Hoogle in Hugs against the core libraries) and gives good results. Rather than describe type searching, its easier to give an example. Searching for (a -> b) -> [a] -> [b] in Hoogle 3 gives:


Prelude.map :: (a -> b) -> [a] -> [b]
Data.List.map :: (a -> b) -> [a] -> [b]
Control.Parallel.S... parMap :: Strategy b -> (a -> b) -> [a] -> [b]
Prelude.scanr :: (a -> b -> b) -> b -> [a] -> [b]
Data.List.scanr :: (a -> b -> b) -> b -> [a] -> [b]
Prelude.scanl :: (a -> b -> a) -> a -> [b] -> [a]
Data.List.scanl :: (a -> b -> a) -> a -> [b] -> [a]
Prelude.concatMap :: (a -> [b]) -> [a] -> [b]

But in Hoogle 4 gives:


Prelude map :: (a -> b) -> [a] -> [b]
Data.List map :: (a -> b) -> [a] -> [b]
Prelude fmap :: Functor f => (a -> b) -> f a -> f b
Control.Applicative <$> :: Functor f => (a -> b) -> f a -> f b
Control.Monad fmap :: Functor f => (a -> b) -> f a -> f b
Control.Monad.Instances fmap :: Functor f => (a -> b) -> f a -> f b
Control.Applicative liftA :: Applicative f => (a -> b) -> f a -> f b
Data.Traversable fmapDefault :: Traversable t => (a -> b) -> t a -> t b
Control.Monad liftM :: Monad m => (a1 -> r) -> m a1 -> m r
Control.Parallel.Strategies parMap :: Strategy b -> (a -> b) -> [a] -> [b]

I think the new results are better. For more details, come to the AngloHaskell talk.

Next Week: I want to release a public beta of Hoogle 4 in command line form. I want to start on the web search engine and tweak the ranking algorithm. I'll also be writing up type search in the form of a presentation.

User Visible Changes: Type search works well and fast.

Thursday, July 24, 2008

GSoC Hoogle: Week 9

I'm off camping for the next weekend in a couple of hours, so this is my early weekly summary. From next week, for a week and a half, I'll actually have an SSH connection so expect to see 200+ patches flow into the Hoogle repo in a few days.

This week I've been rewriting the type search. I spent 3 days writing code, type checking it, but not actually having enough written to run it. Late last night I finished the code, and this morning I debugged it. Amazingly (although actually quite commonly for Haskell) it worked with only minor tweaks. I now have a type search which should scale to large databases and provide fast and accurate searches.

All the basic tests work, and I can generate a Hoogle database for the array library. I still can't generate a Hoogle database for the base library, due to a stack overflow, but I think the cause of the stack overflow has changed and should be easy to debug.

Next Week: A public beta of the command line version is now overdue, and hopefully will happen next week. I aim to finish the actual search side of Hoogle, and move on to the web interface.

User Visible Changes: Type search works again, mostly.

Sunday, July 20, 2008

GSoC Hoogle: Week 8

This week I've been travelling quite a bit, and rather busy with other things. Hopefully next week I'll be able to focus more time on Hoogle!

This week I fleshed out the final part of type search, including support for instances and alpha renaming of variables. After having implemented all the bits in the type search, I tried to convert the base libraries - and it failed, taking up too much time/memory to feasibly finish.

The type search is based around the idea of having nodes in a graph representing types, and then moving between these nodes, at a cost. In order to avoid a blow-up in the number of nodes in the graph, types are alpha-normalised and then alpha-renaming is performed afterwards. Instead of having 3 type nodes for (a,b), (c,d) and (a,a) there is just one named (a,b) and a 3 sets of alpha-renamings. All is good.

However, once you introduce instance restrictions, the types blow up. For example, from the type node a, you can move to Eq a => a, Ord a => a, Show a => a etc. The large (but feasible) number of type nodes, combined with even a small number of class names, gives a huge number of nodes. In fact, for every type variable in a node there are 2^n possible instance contexts it could take. All is bad.

Fortunately there is a solution - move instance checking outside the type graph. This makes the number of nodes feasible, and should work fairly well. It also has a few other benefits, including slightly better scoring and a simpler implementation in a few places. I also came up with a strategy for moving the cost associated with alpha-renaming into the graph search, which further simplifies things.

Of course, all this work takes time, so overall progress is slower than I would have liked. However, the results so far are promising, and the problems of scale seem to have been successfully addressed. The problem of fast and accurate type searching is hard, but hopefully Hoogle 4 will have a scalable solution that should be useful.

Next week: I want to finish the implementation of type searching, and check it works on the full base libraries. A release would be good, although may take place early in the following week.

User visible changes: Creating a database for the base library will now fail with a stack overflow. Hopefully next weeks changes will fix this!

Friday, July 11, 2008

GSoC Hoogle: Week 7

This week I've continued to improve the type searching, and generated Hoogle databases for the core libraries. I'm away from a computer all weekend until Tuesday evening, which has happened every ICFP contest for the last 3 years.

I've substantially refactored the type searching, basing it on a proper abstract Graph data type. Now the mechanisms for dealing with type search and graph traversal are separate it is much easier to express clearly what type search is doing. I've also fleshed out the type searching code so that it can accurately perform searches with all the necessary features. There are still a number of tasks to do before the type searching code is finished, but each is a fairly discrete unit of work with well-understood problems.

The other challenge for the week has been generating Hoogle databases for the core libraries - the base library and all the other libraries GHC ships with a release. With these libraries in place, it is feasible to use Hoogle to perform useful queries. The libraries are generated using a combination of Cabal, Haddock and Hoogle. I've made changes in both the Haddock and Hoogle layers so that the full base libraries can now be processed.

In order to deal with the full base libraries there are numerous GHC extensions that must be supported. In particular, Hoogle now supports multi-parameter type classes, higher-ranked types, type operators, unboxed types, unboxed tuples and NDP style arrays. All of these features are translated down into Haskell 98 types, but most closely approximate their behaviour in GHC, and can be used in searches.

Throughout the week I've been profiling the database creation code in Hoogle. The databases for the core libraries come to about 4.5Mb, and are highly optimised for performing searches - often at the cost of making them harder to create. I've halved the time to create databases during the week, using profiling to direct improvements. The time required to process the Core libraries takes 60 seconds, which is a certainly an acceptable timeframe, but could always be faster. Currently the biggest culprit in the profile is the hPutByte function:


hPutByte :: Handle -> Int -> IO ()
hPutByte hndl i = hPutChar hndl $ chr i

Since the underlying databases are written using lots of hTell and hSeek commands, it is not possible to use something like the Data.Binary library. However, if anyone has any suggestions on how to improve performance they would be gratefully received.

Next week: I want to finish off the remaining type search features, and then package up a command line release for Hackage. Hopefully Hoogle 4 will be ready for initial use by early testers.

User visible changes: Type search is more robust, but still not fully featured. Database creation if faster and robust. You can search the base libraries.

Sunday, July 06, 2008

GSoC Hoogle: Week 6

This week I've been tackling type searching. I have just (in the last few minutes) got my first type search to work. At the moment type search is very limited, but all the ideas and scafolding are in place, so should now proceed relatively quickly.

In all previous versions on Hoogle, type searching was O(n), where n is the number of functions in the database. Hoogle compared the type search to each possible answer, computed a closeness score, then at the end wrote out the closest matches. This meant that before the first answer could be given, all functions had to be checked, i.e. the time for the first answer was O(n). As the Hoogle database is about to get massively bigger, this approach is insufficient.

The new version of Hoogle is much cleverer. It works by exploring a graph, following similar ideas to Dijkstra's algorithm, to reach more suitable results first. Typically, the best answers will be given without any search of the graph, and then as the graph is searched more results will appear with lower closeness. With the new scheme the complexity is O(m), where m is the number of results you want. I hope at some point after the SoC is finished to describe the algorithm properly, so others can understand it, and hopefully improve upon it.

Next week: Finishing off type searching, so it supports all the features planned. Build system work, and potentially a cabal pre-release.

User visible changes: Type search works to some degree, but not perfectly. Database debugging options (conversion and dumping to a text file) have been added.

Thursday, June 26, 2008

GSoC Hoogle: Week 5

This week I was going to tackle type searching, but then realised I'm going to spend 6 hours on Friday on a train (hence the weekly update on Thursday), so can spend that time productively working on paper tackling type search. So instead of type search, I worked on a few other pieces, some of which make type search easier:

Haddock Database Generation More patches to get better output from Haddock. The code now handles class methods properly, and deals with some FFI bits.

Lazy Name Searching Searching for a name is now fairly lazy. When searching for a name, Hoogle can return the prefix of the results without doing too much computation to calculate all the results. This work is useful in its own right, but very necessary for the type searching, and can be reused.

Hoogle --info The biggest feature added this week is the --info flag. When this flag is given, Hoogle picks the first result and gives more details, including any Haddock documentation associated with the function. For example:

$ hoogle +tagsoup openurl --info Text.HTML.Download openURL :: String -> IO String This function opens a URL on the internet. Any http:// prefix is ignored. > openURL "www.haskell.org/haskellwiki/Haskell" Known Limitations: * Only HTTP on port 80 * Outputs the HTTP Headers as well * Does not work with all servers It is hoped that a more reliable version of this function will be placed in a new HTTP library at some point!

Next week: Type searching! See last week for a description of what I hope to achieve.

User visible changes: The --info flag now exists.

Sunday, June 22, 2008

GSoC Hoogle: Week 4

This week I've stayed in one place, and had lots of opportunity to get on with Hoogle. I've done a number of different things this week:

More on Haddock databases I fixed a number of issues with the Haddock generated Hoogle information. These patches have been submitted back to Haddock.

Binary Defer library I merged the binary defer library into the Hoogle sources, and modified it substantially. Some of the modifications were thanks to suggestions from the Haskell community, particularly David Roundy. The library is now more robust, and is being used as a solid foundation to build the rest of Hoogle on top of.

Text Searching You can now search for words, even multiple words, and the search will be performed. The text searching uses efficient data structures, scales excellently, and returns better results first.

Suggestions These improvements were detailed earlier in the week.

Next week: Type searching. I have various ideas on how to go about this, but it is the most tricky part of the whole project. I hope to come up with the perfect solution by the end of the week, but if not, will come up with something good enough for Hoogle 4 then revise it after the Summer is over (it could easily suck in a whole Summer of time if I am not careful!). Much of the low-level infrastructure is already present, so it is just the search algorithm.

User visible changes: Text searching works. A session with Hoogle as it currently stands:


> cabal haddock --hoogle
-- generates tagsoup.txt
> hoogle --convert=tagsoup.txt
Generating Hoogle database
Written tagsoup.hoo
> hoogle +tagsoup is open --color
Text.HTML.TagSoup.Type isTagOpen :: Tag -> Bool
Text.HTML.TagSoup.Type isTagOpenName :: String -> Tag -> Bool

Wednesday, June 18, 2008

Hoogle 4 New Features

I'm still developing Hoogle 4, and there are many things that don't work (such as searching for types and the web version). However, it's starting to come together, and I'm beginning to implement new features that aren't in Hoogle 3. Today I've implemented two useful features.

Multi Word Text Search

In Hoogle 3, if you entered "is just" it would be treated as a type search, exactly the same as "m a". Now, it will search for "is" and search for "just" and intersect the results. This seems to be something that people often try, so hopefully will make Hoogle more intuitive.

Intelligent Suggestions

Hoogle 3 tries to give suggestions, for example if I search for "a -> maybe a" it will helpfully suggest "a -> Maybe a". Unfortunately it's not that clever. If your search term contains a type variable (starting with a lower-case letter), which is more than one letter, it will suggest you try the capitalised version. For example, "(fst,snd) -> snd" will suggest "(Fst,Snd) -> Snd", which isn't very helpful.

The new mechanism uses knowledge about the types, arities and constructors present in the Hoogle database. Some examples:


"Just a -> a"  ===> "Maybe a -> a"
"a -> Maybe"   ===> "a -> Maybe b"
"a -> MayBe a" ===> "a -> Maybe a"
"a -> maybe a" ===> "a -> Maybe a"

Sunday, June 15, 2008

darcs over FTP

I'm currently unable to access SSH, and suspect this situation will persist for most of the Summer. Most of my darcs repo's are behind SSH, so this presents a problem. I've been looking for a way to work with darcs over FTP, and have managed to get it going on Windows. The following are instructions for (1) me when I forget them and (2) any Windows users who want to follow the same path. If you are a Linux user, then similar information is available from this blog post.

Step 1: Install Sitecopy

Go to http://dennisbareis.com/freew32.htm and download and install SITECPY.

Add "C:\Program Files\SITECOPY" to your path.

Add "C:\Home" to a %HOME% variable.

Open up a command line and type:


c:\> mkdir home
c:\> cd home
c:\home> mkdir .sitecopy
c:\home> echo . > .sitecopyrc

Step 2: Prepare the FTP site

Go to the FTP site, and create a directory. In my particular example, I created darcs/hoogle so I could mirror the Hoogle repo.

Step 3: Configure Sitecopy

Edit the file "c:\home\.sitecopyrc" to contain:


site          hoogle
   server     ftp.york.ac.uk
   username   ndm500
   local      C:\Neil\hoogle
   remote     web/darcs/hoogle
   port       21

Obviously, substituting in your relevant details.

Step 4: Initialise Sitecopy

Type:


sitecopy --init hoogle

darcs push

Now to do a darcs push, you can type:


sitecopy --update hoogle

The first copy will take a long time, but subsequent copies should be a lot faster.

darcs pull

After all this, you can either pull using FTP, or if your FTP is also a web site, you can pull over http. For example:


darcs get http://www-users.york.ac.uk/~ndm500/darcs/hoogle/

GSoC Hoogle: Week 3

This week I've travelled a further 600 miles by train, but am now starting to get settled for the Summer, and down to work on Hoogle.

My main focus this week has been getting Haddock to generate Hoogle databases. For Haddock 0.8 I added in a --hoogle flag to generate Hoogle databases, and a similar --hoogle flag to Cabal. Unfortunately, for Haddock 2.0, the feature was removed as most of the code got rewritten. Now I've added the feature back, making extensive use of the GHC API to reduce the amount of custom pretty-printing required, and to support more Haskell features. The code has been added to the development Haddock branch, and will be present in the next release.

Most of the challenge was working with the GHC API. It's certainly a powerful body of code, but suffers from being inconsistent in various places and poorly documented. I mainly worked with the code using :i to view the API. I got bitten by various problems such as the Outputable module exporting useful functions such as mkUserStyle :: QueryQualifies -> Depth -> PprStyle, but not exporting any functions that can create a Depth value, and therefore not actually being usable. If Hoogle and Haddock could be used over the GHC API, it would substantially improve the development experience!

I've also worked more on defining the database format. I am about to start work on the implementation today. I've also added a few more command line flags, but mainly as placeholders.

Next week: Database creation and text searches (looking back I see some similarity to last week!)

User visible changes: haddock --hoogle now works.

Monday, June 09, 2008

GSoC Hoogle: Week 2

This week I submitted my PhD thesis, emptied my entire rented house of furniture, spent £96 on petrol, drove (or was driven) 400 miles, travelled a similar distance by train, have been to the north of Scotland and am currently working on a borrowed Mac in London. Needless to say, its been rather busy - but now all the excitement is over and I should be able to focus properly on Hoogle.

In the last week I've been focusing on the database, the store of all the function names and type signatures, so a very critical piece of information. I want to support fast searching, which doesn't slow down as the number of known functions increases - a nasty property of the current version. For text searching, the trie data structure has this nice property, and can deal with searching for substrings. For fuzzy type searching, things are a lot more complex. However, I think I have an algorithm which is fast (few operations), accurate (gives better matches), scalable (independent of the number of functions in the database) and lazy (returns the best results first). The idea is to have a graph of function results, and then navigate this graph to find the best match.

Most of the database work has been theoretical, but I have done some coding. In particular, I have started on the database creation code, and polished the flag argument interaction code some more. Part of the development required the Derive tool, and in doing this work I noticed a few deficiencies. In particular, if you run Windows and run derive over a UNIX line-ending file, the tool will generate a Windows line-ending file. This problem, and a few others, are now fixed.

Next week: Database creation and searching. I want text searches to work by the end of the week.

User visible changes: The --help flag prints out information on the arguments.

PS. I was looking forward to seeing some blog posts from the other Haskell summer of code students on the Haskell Planet. If any Haskell GSoC student does have a blog, please ask for it to be included!

Sunday, June 01, 2008

GSoC Hoogle: Week 1

I started my Google Summer of Code project on Hoogle at the beginning of this week. In my initial application I promised to make my weekly updates via blog, so here is the first weeks report:

I've only done about half a weeks work on Hoogle this week, because I'm handing in my PhD thesis early next week, and because I'm moving house on Wednesday. I spent 14 hours on Saturday moving furniture, and many more hours than that on my thesis! I should be fully devoted to GSoC by the middle of next week.

Despite all the distractions, I did manage to start work on Hoogle. I created a new project for Hoogle at the community.haskell.org site, and an associated darcs repo at http://code.haskell.org/hoogle. I've done a number of things on Hoogle:

Improved the developer documentation in some places

Reorganised the repo, moving away dead files

Work on command line flags, parsing them etc.

Added a framework for running regression tests

Organise the command line/CGI division

I've started work from the front, and am intending to first flesh out an API and command line client, then move on to the web front end. The biggest change from the current implementation of Hoogle will be that there is one shared binary, which will be able to function in a number of modes. These modes will include running as a web server, as a command line version, as an interactive (Hugs/GHCi style) program, documentation location etc. This will allow easier installation, and let everyone host their own web-based Hoogle without much effort.

Next week: I hope to move towards the command line client and central Hoogle database structure. I also hope to chat to the Haddock 2 people, and try and get some integration similar to Haddock 1's --hoogle flag.

User visible changes: Hoogle 4 as it currently stands is unable to run searches, although hoogle --test will run some regression tests.

Thursday, May 22, 2008

Interactive TagSoup parsing

I've written quite a few programs using the tagsoup library, but have never really used the library interactively. Today I was wondering how many packages on hackage use all lower case names, compared to those starting with an initial capital. This sounds like a great opportunity to experiment! The rest of this post is a GHCi transcript, with my comments on what I'm doing prefixed with -- characters.


$ ghci
GHCi, version 6.8.2: http://www.haskell.org/ghc/  :? for help
Loading package base ... linking ... done.
-- load some useful packages
Prelude> :m Text.HTML.TagSoup Text.HTML.Download Data.List Data.Char Data.Maybe
Prelude Data.Maybe Data.Char Data.List Text.HTML.Download Text.HTML.TagSoup>
-- ouch, that prompt is a bit long - we can use :set prompt to shorten it
-- side note: I actually supplied the patch for set prompt :)
   :set prompt "Meep> "
-- lets download the list of packages
Meep> src <- openURL "http://hackage.haskell.org/packages/archive/pkg-list.html"
... src scrolls pass the screen ...
-- parse the file, dropping everything before the packages
Meep> let parsed = dropWhile (~/= "<h3>") $ parseTags src
-- grab the list of packages
Meep> let packages = sort [x | a:TagText x:_ <- tails parsed, a ~== "<a href>"]
-- now we can query the list of packages
Meep> length packages
648
Meep> length $ filter (all isLower) packages
320
Meep> length $ filter ('_' `elem`) packages
0
Meep> length $ filter ('-' `elem`) packages
165
Meep> length $ filter (any isUpper . dropWhile isUpper) packages
100
Meep> length $ filter (isPrefixOf "hs" . map toLower) packages
47
Meep> length $ filter (any isDigit) packages
37
Meep> reverse $ sort $ map (\(x:xs) -> (1 + length xs,x)) $ group $ sort $ conca
t packages
[(484,'e'),(374,'a'),(346,'r'),(336,'s'),(335,'t'),(306,'i'),(272,'l'),(248,'c')
,(247,'n'),(240,'o'),(227,'p'),(209,'h'),(185,'-'),(171,'m'),(159,'d'),(126,'g')
,(112,'b'),(96,'u'),(87,'y'),(78,'k'),(76,'f'),(74,'x'),(58,'S'),(53,'H'),(35,'w
'),(33,'v'),(29,'q'),(29,'L'),(27,'A'),(26,'F'),(24,'D'),(23,'C'),(22,'T'),(16,'
P'),(16,'M'),(16,'I'),(16,'G'),(13,'B'),(12,'W'),(12,'3'),(12,'2'),(10,'O'),(9,'
R'),(9,'1'),(8,'z'),(8,'j'),(8,'E'),(7,'X'),(7,'U'),(7,'N'),(6,'Y'),(6,'V'),(5,'
J'),(4,'Q'),(4,'5'),(4,'4'),(3,'Z'),(3,'8'),(3,'6'),(1,'9')]

We can see that loads of packages use lowercase, lots of packages use upper case, quite a few use CamelCase, quite a few start with "hs", none use "_", but lots use "-". The final query figures out which is the most common letter in hackage packages, and rather unsurprisingly, it roughly follows the frequency of English letters.

TagSoup and GHCi make a potent combination for obtaining and playing with webpages.

Sunday, May 18, 2008

Haskell and Performance

There has been lots of discussion on the Haskell mailing lists about the speed of Haskell. There are many conflicting opinions, and lots of different advice. Some of the information on Haskell's performance is written as a sales pitch, some is based on outdated knowledge. Since I've been working on optimisation for a while, I thought I'd try and give a snapshot of Haskell performance. Most of the following is personal opinion, and others could quite validly disagree. Since GHC is the best performing Haskell compiler, I have used Haskell to mean GHC with the -O2 flag throughout.

High-level Haskell is not as fast as C

If you write Haskell in a standard manner, it is unlikely to perform as fast as C. In most cases, linked-lists are slower than arrays. Laziness is more expensive than strictness. The Haskell code will almost always be shorter, and more concise, since it will abstract over low-level detail. But by writing that low-level detail in the C code, you are likely to produce faster code.

Low-level Haskell is competitive with C

If you use GHC, with unboxed operations, written in a low-level style, you can obtain similar performance to C. The Haskell won't be as nice as it was before, but will still probably express fewer details than the C code. Writing in such a low-level manner requires more knowledge of Haskell, and is probably something that a beginner should not be attempting. However, for a critical inner loop, low-level Haskell is a very attractive option.

Haskell's Code Generator is weak

The back end assembly generator in GHC is a weak link, but improvements are being carried out. After this work has been finished, it is likely that low-level Haskell will be able to produce nearly identical assembly code to C.

Some Haskell libraries are poorly optimised

Some of the central Haskell libraries have functions which are badly optimised. For example, the MTL library is known to be poorly performing. The words and isSpace functions in the base library aren't very good. These issues are being addressed over time, the Binary and ByteString libraries have fixed two holes. A new implementation of inits has been contributed. Over time, more issues will be identified and fixed, improving the speed of all code.

Haskell's multi-threaded performance is amazing

A lot of clever people have done a lot of clever work on making multi-threaded programming in Haskell both simple and fast. While low-level speed matters for general programming, for multi-threaded programming there are lots of much higher-level performance considerations. Haskell supports better abstraction, and can better optimise at this level, outperforming C.

Reading the Core is not easy

A standard advice to people trying to optimise Haskell is to read the Core - the low-level functional language used as an intermediate form in the compiler. While Core provides much useful information about what optimisations were performed, it isn't easy to read, and takes a lot of practice. Some effort has been done to make reading Core easier, but I still wouldn't recommend it for beginners.

Optimisation without profiling is pointless

People often want to make programs run faster. In general, this activity is a waste of time. I recently wrote a program for the HCI group at my university, which takes 10 minutes to run, and requires 4Gb of RAM, on a very expensive machine. I haven't even bothered to profile the program, because I have better things to do. Unless the speed of something actually makes a difference, you should not be spending excessive effort on optimisation.

If you have determined that the program in question is running too slowly, then profile. After profiling, you can usually identify some small part of the program that needs optimisation. Too often there is a focus on speeding up something that is not slow enough to make a difference.

The trend is for higher-level optimisation

As time goes buy, higher-level programs keep getting faster and faster. The ByteString work allows programmers to write high-level programs that are competitive with C. Performance enhancements are being made to the compiler regularly, pointer tagging, constructor specialisation etc. are all helping to improve things. More long term projects such as Supero and NDP are showing some nice results. Optimisation is a difficult problem, but progress is being made, allowing programs to be written in a higher-level.

My goal is that one day Haskell programs will be written in a very declarative, high-level style - and outperform C at the same time. I think this goal is obtainable, albeit some way in the future.

Saturday, May 10, 2008

Bad strictness

Haskell has one primitive construct for enforcing strictness, seq :: a -> b -> b. The idea is that the first argument is evaluated to weak-head normal form (WHNF), then the second argument is evaluated to WHNF and returned. WHNF is reduction until the outermost bit is available - either a function value, or an outer constructor.

You can model the behaviour by introducing an evaluate function, in a lower-level language, and showing how to perform reduction:


evaluate (seq a b) = do
    a' <- evaluate a
    b' <- evaluate b
    return b'

The evaluate function must return an evaluated argument, and it wants to return b which is not already evaluated, so it must make a recursive call. The evaluate function for id, which simply returns its argument, is:


evaluate (id a) = do
    a' <- evaluate a
    return a'

Notice that even though id does "nothing", it still has to evaluate its argument. Of course, evaluate (id x) is the same as evaluate x, so id performs no additional work.

Haskell is lazy, so if an expression has already been evaluated, then the evaluate call will be incredibly cheap, and just return the previous result. So let's consider the result of calling seq with the same argument twice.


evaluate (seq a a) = do
    a' <- evaluate a
    return a'

This time the second evaluation of a is skipped, as a is already evaluated. We can easily see that evaluation of seq a a is exactly equivalent to a. This means that any code which writes a `seq` a is wrong. There is plenty of this code around, and one example (which prompted me to write this) is on slide 15 of this talk.

The other classic seq related mistake is id $! x. The ($!) operator is for strict application, and is defined:


f $! x = x `seq` f x

For the particular instance of id $! x, we obtain x `seq` id x. Of course, all that id x does is evaluate x, so again there is no change from just writing x.

There are valid uses of seq, but any time you see either of the following constructs, you know the programmer got it wrong:

x `seq` x
id $! x

Monday, April 21, 2008

Summer of Code 2008

This year I am going to be participating in the Google Summer of Code as a student, for the haskell.org organisation, on the Haskell API Search as an interface to Hackage project - aka Hoogle 4. I will be mentored by Niklas Broberg, author of various tools including Haskell Source Extensions, which is already used by Hoogle. My project link gives the summary I gave for the project, but below I've posted the interesting bits from my full application. I am going to be posting my progress at least once a week once the project phase starts (about 6 weeks time). I welcome any comments!

What is the goal of the project you propose to do?

There are two main goals:

1) Make Hoogle more useful to the community, along the same path as it is currently used.

2) Make Hoogle suitable to use as the standard interface to Hackage.

Can you give some more detailed design of what precisely you intend to achieve?

# Removal of all bugs

Hoogle 3 has a number of embarrassing bugs, some of which are not easily fixed. The nastiest of these is to do with monads, which are horribly mistreated. Since I now know the underlying issues which have caused a problem with Hoogle 3, things like higher-kinded type classes can be solved in a more principled manner.

# Support for some Haskell type extensions

Hoogle 3 does not deal with multi-parameter type classes. I would like to support a variety of type system extensions, primarily by mapping them on to Haskell 98 equivalent types.

# Faster searching

The current implementation is O(n) in the number of functions in the library, where the constant factor is absolutely massive. I wish to make text searching O(s), where s is the length of the search string, and have an incredibly low constant overhead -- using the a lazy file-based trie.

The type searching also needs a massive speed up. I have some ideas on how to proceed, but it is a difficult problem! I will spend a small amount of time investigating this problem, but may have to use a simpler algorithm, rather than delay the rest of the project.

# Better deployment

Currently there is Hoogle for the base libraries, and a special (very hacked) version that supports Gtk2hs only. I have received several requests for custom Hoogle instances for tools such as XMonad, Ycr2js, wxHaskell etc. The new Hoogle will make deployment of individual versions for specific packages easy.

# Support for multiple packages

I wish to support searching through every package on Hackage at once. This requires a massive speed up in the searching algorithms.

# Generalised text searching

By searching both function names, and also cabal package descriptions, Hoogle can be much more useful in finding packages, as opposed to individual functions.

# Better Design

Hoogle 3 is a web application, with a hacked on command line program. Hoogle 4 will be a central API which can be reused from any IDE tools, and also used to build the web interface and the command line application.

# Generalised interface to all of Cabal

Hopefully all the above goals will result in a tool that is suitable to be an interface to Cabal.

What deliverables do you think are reasonable targets? Can you outline an approximate schedule of milestones?

I would plan to release a beta of Hoogle 4 approximately half way through the project, as a web application. Much of the initial design has been done, so this is primarily hacking time.

I would then hope to complete the final hackage integration for the second half. This stage will require discussion with the cabal people, and will be a combination of design, implementation and server administration/setup.

In what ways will this project benefit the wider Haskell community?

Hoogle is already of use to the community, but has never seen a final release, and has a reasonable number of known bugs. This project would produce a polished version of a tool for which we already know there is huge demand.

Hackage is working well, and gaining new packages every day. As the number of packages increases, the interface to hackage must be updated to handle this volume. Discussions with some of the hackage/cabal team seem to suggest that a search interface is the way forward. By making Hackage easier to use, everyone benefits.

Saturday, April 12, 2008

darcs Feature Request (Part II)

I previously requested a feature for darcs. I always pull from an http repo, and push over SSH. I have to push using --no-set-default and typing the ssh repo in full, which I automate with a .bat file in each repo.

Today I noticed that darcs has _darcs/prefs/repos, which seems to list the repo's that darcs has used. In one of my typical repo files, I have an http entry and an SSH entry. To get darcs to behave the way I want, all I need to do is push using the first non-http repo in that list.

I have implemented my version of the darcs push command inside my paper tool, the code is all online here. Now I can delete all my push.bat scripts, and just type paper push from any darcs repo. As an added bonus, I now don't need to change to the root directory to perform a push.

It would be really nice if someone could incorporate this feature into the main darcs codebase. However, I'm quite happy using my paper tool for now. I certainly don't have time to patch, or even build darcs :-)

Wednesday, April 09, 2008

TagSoup Parsing: Dictionary Extraction

I've just read issue 10 of The Monad.Reader. It's a great issue, including a tutorial on using the new GHCi debugger, and how to write an efficient Haskell interpreter in Haskell. The running example for the GHCi debugger is parsing the computer dictionary and extracting descriptions from keywords, using the TagSoup library. The article starts with an initial version of the extraction code, then fixes some mistakes using the debugger present in GHCi. The code was written to teach debugging, not as a demonstration of TagSoup. This post explains how I would have written the program.

The original program is written in a low-level style. To search for a keyword, the program laboriously traverses through the file looking for the keyword, much like a modern imperative language might. But Haskell programmers can do better. We can separate the task: first parsing the keyword/description pairs into a list; then searching the list. Lazy evaluation will combine these separate operations to obtain something just as efficient as the original. By separating the concerns, we can express each at a higher-level, reducing the search function to a simple lookup. It also gives us more flexibility for the future, allowing us to potentially reuse the parsing functions.

I have fixed a number of other bugs in the code, and my solution is:


module Main where
import Text.HTML.TagSoup
import Maybe

main = do
    putStr "Enter a term to search for: "
    term <- getLine
    html <- readFile "Dictionary.html"
    let dict = parseDict $ parseTags html
    putStrLn $ fromMaybe "No match found." $ lookup term dict


parseDict :: [Tag] -> [(String,String)]
parseDict = map parseItem 
          . sections (~== "<dt>")
          . dropWhile (~/= "<div class=glosslist>")

parseItem xs = (innerText a, unwords $ words $ innerText b)
    where (a,b) = break (~== "<dd>") (takeWhile (~/= "</dd>") xs)

Instead of searching for a single keyword, I parse all keywords using parseDict. The parseDict function first skips over the gunk at the top of the file, then finds each definition, and parses it. The parseItem function spots where tags begin and end, and takes the text from inside. The unwords $ words expression is a neat trick for normalising the spacing within an arbitrary string.

This revised program is shorter than the original, I find it easier to comprehend, and it provides more functionality with fewer bugs. The TagSoup library provides a robust base to work from, allowing concise expression of HTML/XML extraction programs.

Tuesday, April 08, 2008

Optional Parameters in Haskell

I use optional parameters in my TagSoup library, but it seems not to be a commonly known trick, as someone recently asked if the relevant line was a syntax error. So, here is how to pass optional parameters to a Haskell function.

Optional Parameters in Other Languages

Optional parameters are in a number of other languages, and come in a variety of flavours. Ada and Visual Basic both provide named and positional optional parameters. For example, given the definition:


Sub Foo(b as Boolean = True, i as Integer = 0, s as String = "Hello")

We can make the calls:


Call Foo(s = "Goodbye", b = False)
Call Foo(False, 1)

In the first case we give named parameters, in the second we give all the parameters up to a certain position.

In some languages, such as GP⁺, you can say which parameters should take their default values:


Call Foo(_, 42, _)

Optional Parameters in Haskell

Haskell doesn't have built-in optional parameters, but using the record syntax, it is simple to encode named optional parameters.


data Foo = Foo {b :: Bool, i :: Integer, s :: String}
defFoo = Foo True 0 "Hello"

foo :: Foo -> IO ()

Now we can pass arguments by name, for example:


foo defFoo{s = "Goodbye", b = False}

This syntax takes the value defFoo, and replaces the fields s and b with the associated values. Using a type class, we can abstract this slightly:


class Def a where
    def :: a

instance Def Foo where
    def = defFoo

Now we can make all functions taking default arguments use the def argument as a basis, allowing type inference and type classes to choose the correct default type. Even still, optional parameters in Haskell are not quite as neat as in other languages, but the other features of Haskell mean they are required less often.

This technique has been used in TagSoup, particularly for the parseTagOptions function. I've also seen this technique used in cpphs with the runCpphs function.

Tuesday, April 01, 2008

Automated Proof Reading

I've been spending the last few months writing papers, which requires lots and lots of proof reading. Yesterday Tom shared one trick he'd used to help with his thesis. He took his thesis, ran detex over it, then used a text-to-speech program to read it back to him. This trick ensures you pick up all the subtle things like repeated words, and forces you to listen to every word, not skim reading bits. I thought this was a great idea, so implemented something similar in my paper preparation tool (release forthcoming), so typing paper talk reads the paper back to you, converting the LaTeX in a sensible way, to make it as readable as possible.

Here is result of running paper talk on the introduction section of my Catch paper:

The audio is a little clearer before being compressed for upload, but still has a very clunky feel. It is surprisingly useful - I made about 12 minor changes as a result. It has some oddities, for example ML becomes millilitre, and some pause issues with brackets, which the preprocessor sorts out. The entire paper takes 49 minutes to read, but I think I will be doing this will all papers from now on.

Sunday, March 30, 2008

Toddler's play with HTML in Haskell

I just read a blog article entitled Kid's play with HTML in Haskell, where the author extracts some information from an HTML document, using the Haskell XML Toolbox. I have an alternative XML/HTML library, TagSoup, so I decided to implement their problem with my library.

The Problem

Given an HTML file, extract all hyperlinks to mp3 files.

In TagSoup


[mp3 | TagOpen "a" atts <- parseTags txt
     , ("href",mp3) <- atts
     , takeExtension mp3 == ".mp3"]

The code is a list comprehension. The first line says use TagSoup to parse the text, and pick all "a" links. The second line says pick all "href" attributes from the tag you matched. The final line uses the FilePath library to check the extension is mp3.

A Complete Program

The above fragment is all the TagSoup logic, but to match exact the interface to the original code, we can wrap it up as so:


import System.FilePath
import System.Environment
import Text.HTML.TagSoup

main = do
   [src] <- getArgs
   txt <- readFile src
   mapM_ putStrLn [mp3 | TagOpen "a" atts <- parseTags txt
                       , ("href",mp3) <- atts
                       , takeExtension mp3 == ".mp3"]

Summary

If you have a desire to quickly get a bit of information out of some XML/HTML page, TagSoup may be the answer. It isn't intended to be a complete HTML framework, but it does nicely optimise fairly common patterns of use.

Tuesday, March 11, 2008

Poor Ada Error Message

I have been demonstrating on the York University "Algorithms and Data Structures" course for 4 years now. As part of the course, first year students learn Ada. A lot of the error messages are really bad - but over time I've created a mental mapping between the message and the cause. I am now fairly fluent at recognising what mistake a student has made, given the exercise they are attempting and the error message. But yesterday I encountered a brand new misleading error message.

The error message was:


bad.adb:12:22: actual for "N" must be a variable

As always, the most useful thing in the error message is the line number. I read enough of the error message to check whether its a parse error, type error or something else, then head for the line mentioned. (I follow this same tactic in all languages, not just Ada.)


12:     Add_Cell(I, Next(N));

Nothing obviously wrong about this statement, so I read the error message. It seems to want N to be a variable. But I already know that N is a variable, or at the very least a parameter, so this condition seems to be met.

Next step is to head to the definitions of Next and Add_Cell, to see if they can shed some light on the situation.


function Next(N: List) return List is ...
procedure Add_Cell(I: Integer, N: in out List) is ...

The in out in Add_Cell can be read as "pass as a pointer". Aha, maybe the error message is complaining that the second argument to Add_Cell can't be made a pointer, as its a return value from a function. That would explain it, and indeed, that turned out to be the cause of the problem. But back to the error message, what was it trying to tell us?


bad.adb:12:22: actual for "N" must be a variable
12:     Add_Cell(I, Next(N));

When the error message refers to N it isn't talking about the variable N I can see, but the second argument of Add_Cell, which is also called N. If the function being calling was in a separate library, it would have been even harder to understand. A more helpful error message might have been:


bad.adb:12:22: expression passed as the second argument to Add_Cell must be a variable
    Found: Next(N)
    Expected: A variable
    Reason: Second argument of Add_Cell is declared "in out"

In general compiler error messages should be in terms of the line where the error resides, not requiring a large amount of global knowledge. The error can be resolved, but without help from the message. All compilers have bad error messages in some circumstances, but this one seems almost malicious!

Monday, March 10, 2008

Sorting At Speed

Sorting is currently a hot topic within the the Haskell community. Christopher brought it up in a recent thread on the mailing list, and this weekend I ended up spending several hours looking at sort routines.

I was browsing through the Yhc standard libraries, as one does on the weekend, and was drawn to Yhc's sort function. It had some undesirable characteristics for one of the projects I was working on, so I wondered if other Haskell systems used different implementations. I checked GHC, and discovered it their sort was different. In general, when Yhc and GHC have different implementations of a standard library function, the GHC one is better tuned for performance. I decided to replace the Yhc sort function with the GHC one, but before doing so, thought a quick performance test was in order. So I came up with something simple:


test = do
    src <- readFile "Sort.hs"
    print $ ordered $ sort $ sort $ reverse $ sort src

The file "sort.txt" was simply the source code to the program. The code sorts the contents of this file, then reverses it, sorts it and sorts it again. This means that the program performs one sort over semi-random data, one over reverse-ordered data and one over ordered-data. These are some fairly standard cases that should be checked. This test is not a comprehensive benchmark, but a nice quick indicator.

I ran the Yhc sort function against the GHC version, and was shocked to find that the Yhc code was twice as fast. I ran the benchmark under Yhc, GHC and Hugs (using reduction count in Hugs), and in all cases the performance was doubled. I was not expecting this result!

The code for the GHC sort is:


mergesort :: (a -> a -> Ordering) -> [a] -> [a]
mergesort cmp = mergesort' cmp . map (:[])

mergesort' :: (a -> a -> Ordering) -> [[a]] -> [a]
mergesort' cmp [] = []
mergesort' cmp [xs] = xs
mergesort' cmp xss = mergesort' cmp (merge_pairs cmp xss)

merge_pairs :: (a -> a -> Ordering) -> [[a]] -> [[a]]
merge_pairs cmp [] = []
merge_pairs cmp [xs] = [xs]
merge_pairs cmp (xs:ys:xss) = merge cmp xs ys : merge_pairs cmp xss

merge :: (a -> a -> Ordering) -> [a] -> [a] -> [a]
merge cmp [] ys = ys
merge cmp xs [] = xs
merge cmp (x:xs) (y:ys)
 = case x `cmp` y of
        GT -> y : merge cmp (x:xs)   ys
        _  -> x : merge cmp    xs (y:ys)

The function works by splitting the list into one element lists, resulting in each basic list being ordered. These lists are then merged in pairs until a single list is left. For example, given the input "sort":


"s"  "o"  "r"  "t"
  "os"      "rt"
      "orst"

We first split each character into its own list, then merge adjacent pairs. This code corresponds to the standard merge sort. But instead of making each initial list a single element, we could use sequences of increasing elements, for example using the risers function:


risers :: Ord a => [a] -> [[a]]
risers [] = []
risers [x] = [[x]]
risers (x:y:etc) = if x <= y then (x:s):ss else [x]:(s:ss)
     where (s:ss) = risers (y:etc)

Now, if we apply risers "sort" we get ["s","ort"]. We can now follow the same merge procedure as before:


"s"    "ort"
  "orst"

Instead of doing 3 merges, we have done only 1. Given the input "abcd" the effect would have been even more dramatic. We can refine this scheme further, by detecting both ascending and descending chains of elements in the initial list. This technique is used by Yhc, and is based on code originally written by Lennart Augustsson. Knowing the original source of the code, my shock at the performance benefits offered by the Yhc version have decreased substantially.

The GHC sort function should probably be replaced by the one from the Yhc libraries. This would offer increased performance, especially in the case of ordered or reverse-ordered lists. The asymptotic complexity of the two sorts means that there must exist some value of n such that sort [1..n] runs faster in Yhc than GHC (assuming sufficient stack/heap for both). I wonder whether Int32 is capable of expressing such a value...

A side note: I have been playing with the risers function for several years. I've used it as an example of pattern match checking, both specialised to Int and on a general Ord class. I've used it for supercompiling. It's appeared in blog posts, a TFP paper (Mitchell+Runciman 2007), a Haskell Workshop paper (Xu 2006) etc. I originally had this example suggested from a functional programming exam paper, but only today at lunch did I discover its true origins. The risers function had originally been the first-step in a merge sort!

Tuesday, March 04, 2008

Lazy Evaluation: Strict vs. Speculative

For the last few days I have been thinking about how to write a low-level program optimiser, based on the ideas from Supero. Supero works at the level of a lazy Core expressions, but actual hardware works on a sequence of strict instructions. The possible idea is to translate the lazy expressions to strict sequences, then borrow the ideas from supercompilation once more. In particular I have been looking at the GRIN approach, which defines such a set of instructions.

The GRIN work is very clever, and has many ideas that I would like to reuse. However, the one aspect that gave me slight concern is the complexity. A GRIN program requires the use of several analysis passes, and many many transformation rules. While this approach is perfectly acceptable, one of the goals of the Supero work is to make the optimisation process simpler -- comprising of a few simple but powerful rules.

I will first explain how strictness works, then how my speculative approach works. Readers who already know about unboxing are encouraged to skip to the speculative section.

Strictness

When doing low-level compilation, one of the most important stages is strictness analysis, and the associated unboxing. To take the example of the factorial function in Haskell:


factorial :: Int -> Int
factorial n = if n > 0 then n * factorial (n-1) else 1

Here it is easy to see that the factorial function always evaluates n. We can also use our knowledge of the definition of Int:


data Int = Int# I#

Where I# is an actual machine integer (possibly stored in a register), and Int# is a lazy box surrounding it. Since we know that factorial will always unwrap our n, we can pass the n around without the Int# box. I have made all the conversions from I# to Int explicit using an Int#, but have left all the unboxings implicit. The operators ># etc. are simply unboxed and strict variants of the standard operators.


factorial# :: Int# -> Int
factorial# n# = if n# ># 0 then n# *# factorial (Int# n# - 1) else 1

Also, since we know factorial is strict in its first argument, we can evaluate the first argument to the recursive call strictly. Applying all these optimisations can now write:


factorial# :: Int# -> Int
factorial# n# = if n# ># 0 then n# *# factorial (n# -# 1) else 1

We have removed the explicit boxing in the recursive call, and work entirely with unboxed integers. Now factorial is entirely strict. We can even write a wrapper around our strict version, to provide a lazy interface matching the original.


factorial :: Int -> Int
factorial n = factorial# n#

I have used n# to denote the unboxing of n. Now factorial looks like it did before, but operates much faster, on unboxed integers.

Speculative

I would like to not include a strictness analyser in my optimiser, or if it is included, have it be the result of a series of transformations -- without explicit "stop and analyse" then "use the results" stages. As part of my thoughts on this, I was trying to consider how to optimise factorial without invoking the strictness analyser.

The speculative transformation I have defined first generates factorial# - I have left out the details of why it decides to.


factorial :: Int -> Int
factorial n = if n > 0 then n * factorial (n-1) else 1

factorial# :: Int# -> Int
factorial# n# = if n# ># 0 then n# *# factorial (Int# n# - 1) else 1

This step is entirely safe - we have defined factorial#, but we have not written a wrapper that invokes it, even in the recursive case. The factorial# function is equivalent to factorial if the initial argument was evaluated. We have transformed factorial# using only local knowledge, at the point. We can also transform factorial, replacing any uses of n which are guaranteed to come after n is evaluated, with (Int# n#). This transformation is merely reusing the knowledge we have gained unwrapping n:


factorial n = if n > 0 then Int# n# * factorial (Int# n# - 1) else 1

Now we promote any primitive operations on only unboxed values. Given (-), it is cheaper to evalute the subtraction than to store a lazy thunk to the function.


factorial n = if n > 0 then Int# n# * factorial (Int# (n# -# 1)) else 1

factorial# n# = if n# ># 0 then n# *# factorial (Int# (n# -# 1)) else 1

We can now use our knowledge that if we know an argument to a function is already evaluated, we can call the strict variant (this corresponds closely to constructor specialisation):


factorial n = if n > 0 then n# *# factorial# (n# -# 1) else 1

factorial# n# = if n# ># 0 then n# *# factorial# (n# -# 1) else 1

We can also replace the * in factorial with *# as we know we will have to evaluate the result of a function. Now we have ended up with a fast inner loop, operating only on unboxed integers. We have not required strictness information to make any transformation.

One way of viewing the difference between strictness and this transformation is the flow of information. In strictness, the caller is informed that a particular argument will be evaluated. In speculative, the callee informs the caller that an argument has already been evaluated. These two concepts are not the same, and while they overlap, there are instances where they differ considerably.

Strict vs. Speculative

Consider the following example:


strict :: Int -> Int
strict x = x `seq` lazy x (x-1) (x+1)

lazy :: Int -> Int -> Int -> Int
lazy a b c = if a == 0 then b else c

Here the lazy function is strict in a, but not either of b or c. A strictness analyser would generate a variant of lazy with only the first argument unboxed. In contrast the speculative variant will determine that x-1 and x+1 should be evaluated, and pass unboxed values in all arguments of lazy, even though lazy may not evaluate b or c.

To see this behaviour in GHC, it helps to make lazy recursive:


module Temp where

strict :: Int -> Int
strict x = x `seq` lazy x (x+1) (x-1)

lazy :: Int -> Int -> Int -> Int
lazy a b c = if a == 0 then lazy b b b else c

Now run with the options ghc Temp.hs -c -O2 -ddump-simpl, and you will see the lazy variant has type lazy :: Int# -> Int -> Int -> Int.

These thoughts are still very preliminary, and there are a number of unanswered questions:

What is the overlap between strict and speculative?

Can both variants be combined? (almost certainly yes)

Is speculative really simpler?

Is speculative sufficient?

What are the performance benefits of speculative?