Neil Mitchell's Blog (Haskell etc)

Sunday, February 22, 2009

Hoogle package search

Recently on the Haskell mailing list there have been some discussions about which packages Hoogle searches by default. One person remarked that it was unfortunate that the network package isn't searched by default. There are lots of packages on Hackage, and Hoogle needs to decide how to cope with so much choice. There are a number of questions that I need to answer in Hoogle:

What packages should Hoogle search by default? All of Hackage? The base libraries? Only the packages a user has installed? Only packages that make it into the Haskell Platform?

What groups of packages should Hoogle have available? Each package individually? All packages which compile on Windows? All packages by a certain author? All packages whose minor version number is even?

What UI should Hoogle show? Should there be checkboxes for each OS's packages? Should there be a checkbox for each compiler/version? Should there be no UI but some documentation?

And these questions present a number of trade-offs:

The packages have to be divided under sensible and clear lines - I don't want to (and shouldn't) arbitrate divisions like "good" or "popular".

The more packages you search, the less relevant the results will be.

The fewer packages you search, the more chance that you miss something.

The more UI that is added the more confusing things get.

My development time for Hoogle derives Bounded, Finite and increasingly also derives Small.

Thoughts and suggestions are very welcome. I've set up a wiki page to track people's thoughts, so please make your views and arguments known: http://haskell.org/haskellwiki/Hoogle/Packages.

(As an aside, I recently found that dolphin friendly tuna is actually really harmful to the environment, far more harmful than dolphin unfriendly tuna. Read more here.)

Tuesday, February 03, 2009

Monomorphism and Defaulting

Haskell has some ugly corners - not many, but a few. One that many people consider exceptionally ugly is the monomorphism restriction. In this post I'm going to discuss three related issues - Constant Applicative Forms (CAFs), the monomorphism restriction and defaulting. But before we start, let's take a simple example.

Computing Pi

Haskell already provides the pi function which represents the value of pi, but let's assume it didn't. Taking a quick look at Wikipedia we can see that one way of computing pi is the Gregory-Leibniz series. We can calculate pi as:

pi = (4/1) + (-4/3) + (4/5) + (-4/7) + (4/9) + (-4/11) ...

So let's write that as a Haskell program:


pie = sum $ take 1000000 $ zipWith (/) (iterate negate 4) [1,3..]

Here the constant 1000000 gives the accuracy of our approach; increasing this value will give a higher precision. As it currently stands, the Haskell library says pi = 3.14159265358979 and our program says pie = 3.14159165358977. Thirteen matching digits should be sufficient for most uses of pi :-)
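To see how the number of terms affects the accuracy, the series can be parameterised by the term count. This is a small sketch rather than part of the original program: the name leibniz is made up for illustration, and the result is fixed at Double to keep the example simple.

```haskell
-- Hypothetical helper: the Gregory-Leibniz series truncated to n terms.
leibniz :: Int -> Double
leibniz n = sum $ take n $ zipWith (/) (iterate negate 4) [1,3..]
```

With this definition, leibniz 1000000 is the same computation as pie. The series converges slowly: the error after n terms is roughly 1/n, so each extra correct digit costs about ten times as many terms.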

CAFs

The disadvantage of our pie function is that (under Hugs) it takes about 4 seconds to evaluate. If we are performing lots of calculations with pi, calculating pie each time will be a serious problem. CAFs are the solution!

A CAF is a top-level constant, which doesn't take any arguments, and will be computed at most once per program execution. As a slight subtlety, if the constant has class constraints on it (i.e. is Num a => a, instead of a) then it isn't a CAF because the class constraints act like implicit arguments. Our pie function above doesn't take any arguments, so is a CAF.

Defaulting

While pie doesn't have any class constraints, the right-hand side of pie does! Take a look in Hugs:


Main> :t sum $ take 1000000 $ zipWith (/) (iterate negate 4) [1,3..]
:: (Enum a, Fractional a) => a

Main> :t pie
:: Double

The right-hand side works for any Enum and Fractional type, for example Float, but pie is restricted to Double. The reason is the defaulting mechanism in Haskell - if a type can't be nailed down precisely, but is one of a handful of built-in classes, then it will default to a particular type. This feature is handy for working at an interactive environment, but can sometimes be a little unexpected.
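As a sketch of how a type annotation interacts with defaulting, the same right-hand side can be given a polymorphic signature and then used at specific types. The names series, asFloat and asDouble are illustrative and not from the original program, and the term count is reduced to 1000 to keep it quick.

```haskell
-- The polymorphic series; the signature keeps it general.
series :: (Enum a, Fractional a) => a
series = sum $ take 1000 $ zipWith (/) (iterate negate 4) [1,3..]

-- An annotation picks the type explicitly, so no defaulting takes place.
asFloat :: Float
asFloat = series

-- Double is the type that defaulting chooses when nothing else fixes it.
asDouble :: Double
asDouble = series
```

Because series carries class constraints it is not a CAF, so asFloat and asDouble each trigger their own evaluation of the sum.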

Monomorphism restriction

Without defaulting the compiler would infer the type of pie as :: (Enum a, Fractional a) => a. However, such a definition would be rejected by the monomorphism restriction. The monomorphism restriction states that a function with no explicit arguments, but with class constraints, must be given a type annotation. This rejects functions like:


snub = sort . nub

To fix the problem there are two solutions:


snub i_hate_the_evil_mr = (sort . nub) i_hate_the_evil_mr

snub :: Ord a => [a] -> [a]
snub = sort . nub

For a function like pie only the second approach is applicable. The addition of dummy arguments to avoid the monomorphism restriction is sufficiently common that the HLint tool never suggests eta-reduction if the argument is named mr.
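For GHC users there is a third route that the two solutions above do not cover: switching the restriction off for a whole module with the NoMonomorphismRestriction language extension. This is a GHC extension rather than standard Haskell, so treat the following as a sketch that assumes GHC is the compiler.

```haskell
{-# LANGUAGE NoMonomorphismRestriction #-}
import Data.List (nub, sort)

-- With the restriction disabled, the inferred type is Ord a => [a] -> [a],
-- so neither a dummy argument nor a signature is required.
snub = sort . nub
```

The cost is the one the restriction was meant to guard against: a constrained definition without arguments, such as pie, would no longer be shared as a CAF.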

Conclusion

So why was the monomorphism restriction first introduced? For a function with no explicit arguments, the programmer might think they had written a CAF, but class constraints may substantially degrade the performance. Defaulting reduces the number of cases where the monomorphism restriction would otherwise bite, but it is still useful to be aware of the ugly corners.

There are proposals afoot to remove the monomorphism restriction and to increase the power of the default mechanism - hopefully both will be included in Haskell'.

Tuesday, January 27, 2009

Small scripts with Haskell

Normally I give blog posts detailing the fun, interesting or advanced stuff I do with Haskell. But that isn't a real representation of my programming life! Most of the time I am doing small scripts that do little tasks, so I thought I'd describe one of those. This post is written as Literate Haskell, which means you can save the whole contents as a .lhs file and run it in GHCi or Hugs.

The task I had to complete was to take a directory of files, and for each file foo.txt generate the files foo_m1.txt to foo_m3.txt, where each file is a block of lines from the original delimited by a blank line. For example, given the file with the lines ["","1","1","","2","","3"], the lines "1" would go in foo_m1.txt, and so on.

This blog post isn't how I actually wrote the original script - I didn't use literate Haskell (since I find it ugly), I didn't give explicit import lists (since they are needlessly verbose), I didn't give type signatures (but I should have) and I didn't split the IO and non-IO as well (but again, I should have). It is intended as a guide to the simple things you can easily do with Haskell. Now on to the code...


> import System.FilePath(takeExtension, dropExtension, (<.>), (</>))
> import System.Directory(getDirectoryContents)
> import Data.Char(isSpace)
> import Control.Monad

First, let's import some useful modules. To find more about a particular function just use Hoogle and search for it, but a quick summary:


takeExtension "foo.txt" = ".txt"
dropExtension "foo.txt" = "foo"
"foo" <.> "txt" = "foo.txt"
"bar" </> "foo.txt" = "bar/foo.txt"
getDirectoryContents "C:\Windows" = running "dir C:\Windows" at the command prompt
isSpace ' ' = True

Every Haskell program starts with a main function, which is an IO action. For this program, we are going to keep all the IO in main, and only use other pure functions. With most file processing applications it's best to read files from one directory, and write them to another. That way, if anything goes wrong, it's usually easy to recover. In this case we read from "data" and write to "res".


> main :: IO ()
> main = do
>     files <- getDirectoryContents "data"
>     forM_ files $ \file -> when (takeExtension file == ".txt") $ do
>         src <- readFile $ "data" </> file
>         forM_ (zip [1..] (splitFile src)) $ \(i,x) ->
>              writeFile ("res" </> dropExtension file ++ "_m" ++ show i <.> "txt") x

Or in some kind of pseudo-code:


main =
    set files to be the list of files in the directory "data"
    for each file in files which has the extension ".txt"
    {
        set src to be the result of reading the file
        for each numbered result of splitFile
        {
            write out the value from splitFile to the location "res/file_m#.txt"
            where # is the 1-based index into the list of results
        }
    }
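The file name expression in main relies on operator precedence: (<.>) binds tighter (infixr 7) than both (++) and (</>) (infixr 5), so show i <.> "txt" is grouped first. As a sketch, the same path can be built with explicit brackets in a small helper; outputPath is a hypothetical name, not part of the original script.

```haskell
import System.FilePath (dropExtension, (<.>), (</>))

-- Hypothetical helper: where the i-th block of a given input file is written.
-- The brackets make the grouping explicit instead of relying on precedence.
outputPath :: FilePath -> Int -> FilePath
outputPath file i = "res" </> (dropExtension file ++ "_m" ++ show i) <.> "txt"
```

For the input foo.txt and index 1 this gives foo_m1.txt inside the res directory, using whichever path separator the platform expects.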

We can now move on to the pure bits left over. We want a function splitFile that takes a file, and splits it in to three chunks for each of the blocks in the file. When processing text, often there will be stray blank lines, and the term "blank lines" will also apply to lines consisting only of spaces. The code is below:


> splitFile :: String -> [String]
> splitFile xs = map (tabify . unlines) [s1,s2,s3]
>     where
>         xs2 = dropWhile null $ map (dropWhile isSpace) $ lines xs
>         (s1,_:rest) = break null xs2
>         (s2,_:s3) = break null $ dropWhile null rest

And now presented more as a list of steps:

split the text in to lines

for each line drop all the leading spaces from it

drop all the leading blank lines

break on the first empty line, the bits before are chunk 1

drop all leading blank lines for the rest

break on the first empty line in the rest, before is chunk 2, after is chunk 3

for each of the chunks, put the lines back together, then tabify them
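splitFile is tied to exactly three chunks, and its pattern matches fail at runtime if the input has fewer blocks than that. As a sketch of a more general variant (not the script that was actually used), the same break/dropWhile idea can be applied recursively to produce any number of blocks. The name splitBlocks is hypothetical, and unlike splitFile it leaves leading spaces on each line alone.

```haskell
import Data.Char (isSpace)

-- Hypothetical generalisation: split lines into blocks separated by one or
-- more blank lines, where a blank line is empty or contains only spaces.
splitBlocks :: [String] -> [[String]]
splitBlocks ls = case dropWhile isBlank ls of
    []   -> []
    rest -> let (block, others) = break isBlank rest
            in block : splitBlocks others
  where
    isBlank = all isSpace
```

On the example from the start of the post, splitBlocks ["","1","1","","2","","3"] gives [["1","1"],["2"],["3"]].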

The tabify requirement was added afterwards. The person decided that all contiguous runs of spaces should be converted to tabs, so the file could more easily be loaded into a spreadsheet. Easy enough to add, just a simple bit of recursive programming:


> tabify (' ':xs) = '\t' : tabify (dropWhile (== ' ') xs)
> tabify (x:xs) = x : tabify xs
> tabify [] = []

And again in English:

if you encounter a space, drop it and all successive spaces, and write out a tab

otherwise just continue onwards
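For comparison, the same transformation can be phrased without explicit recursion by grouping equal adjacent characters first. This is an alternative sketch with a made-up name, tabifyGroups, and is meant to behave the same as tabify on any input.

```haskell
import Data.List (group)

-- Alternative formulation: collapse each run of spaces to a single tab,
-- leaving every other run of characters untouched.
tabifyGroups :: String -> String
tabifyGroups = concatMap squash . group
  where
    squash (' ':_) = "\t"
    squash run     = run
```

For example, tabifyGroups "a  b c" gives "a\tb\tc".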

Haskell is a great language for writing short scripts, and as the libraries improve it just keeps getting better.

Sunday, January 18, 2009

FsCheck changes

Kurt Schelfthout has just released FsCheck 0.4, a tool similar to QuickCheck but for F#. While working at my internship for Credit Suisse I spent a little bit of time modifying FsCheck to include automatic generators (so you don't have to describe how to generate arbitrary values) and failure shrinking (so the counter-examples are smaller). Both these changes have now been incorporated in to the main FsCheck tool. It is really nice to see the work being contributed back, and that big companies are taking the time to get the necessary legal clearance etc.

I find shrinking to be a particularly potent feature. In one real-world task I struggled to debug a test failure for 8 hours, before shrinking was available. Attacking the same example with FsCheck and shrinking made the reason for the test failure immediately obvious.

Thursday, December 11, 2008

mapM, mapM_ and monadic statements

In my last post on F# I mentioned that do mapM f xs; return 1 caused a space leak, and that the programmer should have written mapM_. I also proposed that monadic statements should work more like in F# where non-unit return values can't be ignored. Various people seemed to misunderstand both points, so I thought I'd elaborate.

mapM as a Space Leak

First, I should clarify what I understand as a space leak. A space leak is not a memory leak in the C sense. A space leak is when a computation retains more live memory than necessary for some period of time. One sign of a possible space leak is that lots of memory is retained by garbage collection.

Comparing mapM and mapM_ on the following program:


main :: IO ()
main = do
    -- use mapM_ here for one run and mapM for the other
    mapM_ putChar (replicate 10000 'a')
    return ()

The mapM_ variant has a maximum heap residency of 2Kb, while the mapM variant has 226Kb. Given an input list of length n, the residency of mapM_ is O(1), while mapM is O(n).

The exact reasons for the space leak are quite detailed, and I'm not going to attempt to cover them. My intuition is that the return list is wrapped in the IO monad, and therefore can't be deallocated until the IO action finishes. In summary, unless you are going to use the end result of a monadic map, always use mapM_.
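One way to build intuition is to write both functions out by hand. The definitions below are simplified sketches, not the ones in the base library, and the names myMapM and myMapM_ are made up. They show that the mapM shape has to hold on to every result until the whole list has been assembled, while the mapM_ shape discards each result as it goes.

```haskell
-- Simplified mapM: every y is kept until the final list is returned.
myMapM :: Monad m => (a -> m b) -> [a] -> m [b]
myMapM _ []     = return []
myMapM f (x:xs) = do
    y  <- f x
    ys <- myMapM f xs
    return (y : ys)

-- Simplified mapM_: each result is dropped immediately.
myMapM_ :: Monad m => (a -> m b) -> [a] -> m ()
myMapM_ _ []     = return ()
myMapM_ f (x:xs) = f x >> myMapM_ f xs
```

Both can be swapped into the test program above in place of mapM and mapM_.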

Monadic Statements

In the above example it would be nice if the compiler had complained. You generated a value, but you didn't use it. Fortunately, it is a very easy fix - change the type of monadic bind (>>) :: Monad m => m a -> m b -> m b to Monad m => m () -> m b -> m b. Now, if a monadic statement generates a value that isn't (), you get a type error. The above examples with mapM would be rejected by the type checker.

But what if we really wanted to call mapM? There are two options. The first is to bind the result, for example do _ <- mapM f xs; return 1. The second option, which F# favours, is do ignore $ mapM f xs ; return 1, with the auxiliary ignore :: Monad m => m a -> m (). I prefer the second option, as it clearly states that you want to ignore the result of a computation. You could even write a rule ignore . mapM f == mapM_ f.
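The proposal can be tried out without changing the language by defining a separate operator with the restricted type. This is only a sketch: the operator name (>>!) is hypothetical, and do notation would still desugar to the ordinary (>>).

```haskell
-- Discard the result of an action, as proposed in the text.
ignore :: Monad m => m a -> m ()
ignore act = act >> return ()

-- Hypothetical operator with the restricted type m () -> m b -> m b.
infixl 1 >>!
(>>!) :: Monad m => m () -> m b -> m b
(>>!) = (>>)

-- Accepted: the result of mapM is explicitly ignored.
example :: Maybe Int
example = ignore (mapM Just [1, 2, 3 :: Int]) >>! return 1

-- Rejected by the type checker if uncommented, since mapM returns m [Int]:
-- bad = mapM Just [1, 2, 3 :: Int] >>! return 1
```

Current versions of base provide void in Control.Monad, which does the same job as ignore.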

Sunday, December 07, 2008

F# from a Haskell perspective

I've recently started a full-time job at Standard Chartered. Before that I was doing an internship with Credit Suisse, where I spent a reasonable amount of time doing F# programming. Before I started F# I had 6 years of Haskell experience, plenty of C# experience, but little exposure to ML. I've now had 3 months to experiment with F#, using an old version (the one before the latest Community Technology Preview) and here are my impressions.

F# is a functional language from Microsoft, previously a Microsoft Research language, which is moving towards a fully supported language. F# is based on ML, and some (perhaps many) ML programs will compile with F#. At the same time, F# has complete access to the .NET framework and can interoperate with languages such as C#. F# is a hybrid language - at one extreme you can write purely functional ML, and at the other extreme you can write imperative C#, just using a different syntax. F# seems to be designed as a practical language - it isn't elegant or small, but does interoperate very nicely with every .NET feature.

The Language

The F# language is based on ML, with the addition of indentation based layout, and many of the weaknesses in F# come from ML. F# is certainly more verbose than Haskell: in some places you need an extra keyword (often a let); pattern matching is not nearly as complete; the indentation isn't as natural as Haskell. However there are some nice syntactic features in F# that are not in Haskell, including generalised list/array/sequence comprehensions and active patterns.

The type checker in F# is powerful, but unpredictable. I often get surprised by where type annotations need to go, particularly when working with .NET object types. The tuple type is treated specially in many cases, and this also leads to surprise - inserting or removing a pair of brackets can affect the type checker. Much of this complexity is necessary to manage the interaction with .NET, but it does complicate the language. Unfortunately, even with the advanced type features in F#, there are no type classes. The lack of type classes precludes the standard implementation of things such as Uniplate and QuickCheck. However, F# does have some nice reflection capabilities, and often entirely generic implementations can be given using reflection. There is certainly an interesting design trade-off between reflection based operations and type classes, something I have looked at in the past and hope to explore again in future.

F# is an impure language, which offers some practical benefits over Haskell, but also encourages a less functional style. In Haskell I sometimes work within a localised state monad - F# makes this much more natural. The impurity also allows simple interaction with .NET. Having programmed with an impure language I did find myself reaching for localised state much more often - and was occasionally tempted into using global state. In most cases, this state became problematic later. Before using F# I thought purity was a good thing; now I'm convinced that purity is a good thing but that impurity is often very useful!

Haskell could learn some things from F#. Every statement in F# must either be bound to a value or evaluate to (). In Haskell it is possible to write do mapM f xs; return 1. Any experienced Haskell programmer should spot that the mapM is a space leak (it should be mapM_), but the type system doesn't enforce it. In F# the type system does. The change in Haskell is simple, and in my opinion, desirable.

The Platform

F# is a fully fledged member of the .NET platform. You can write a class in VB.NET, write a derived class in F#, and then derive from that class in C#. All the languages can produce and consume the same libraries. This integration with .NET allows companies that already use the Microsoft tools to easily migrate - even on a function by function basis. However, the combination of both an imperative framework and a functional language sometimes leads to confusing choices. All of the standard .NET libraries work with arrays, but for a functional program the list is a more natural type. F# provides both, and it was never clear which I should use where, leading to lots of conversions. The .NET libraries are very powerful, but often are overly imperative. For example, the XSD libraries (Xml Schema Description) are very imperative - you have to create objects, mutate properties, then make calls. However, in Haskell, I probably wouldn't have had any XSD support, certainly nothing as well-supported as in .NET.

The Tool Chain

The most impressive aspect of F# is the integration with the Visual Studio environment. F# contains a debugger, profiler, auto-completion, identifier lookup and many other tools. While other functional languages have some of these tools, the Visual Studio environment tends to have very refined and polished implementations. The integration with F# is sometimes a little fragile, or at least was in the version I was using, but the tools are already very powerful and are likely to continue to improve.

Overall

The F# language isn't the most beautiful language ever, but it's not bad. The integration with .NET is incredible, and while this requires compromises in the language, the benefits are considerable. I still prefer Haskell as a language, but for many users the tool chain is a more important consideration, and here F# excels.

This post was brought to you by Ctrl and v, as the computer I am currently using doesn't have a # key!

Update: I'd recommend reading Vesa Karvonen's comment below - he has additional perspectives on F# from a more ML perspective.

Saturday, December 06, 2008

Enabling Reply To All in Outlook

Update: See http://neilmitchell.blogspot.com/2011/12/enabling-reply-to-all-in-outlook.html for an updated version of this functionality.

Some companies lock down the use of Outlook by disabling the Reply To All button. This makes it harder to manage email, and requires manually copying email addresses to get the same effect. But using a bit of Office VBA, it is possible to make a functioning Reply To All button. The following solution has been tested in Outlook 2003, but should work for older versions as well.

First, enable macros in Outlook. Go to Tools, Macro, Security and select Medium or Low security.

Second, add a Reply To All action. Go to Tools, Macro, Visual Basic Editor and put the following code in the text editor.


Option Explicit

' Opens a Reply To All window for the first selected email.
Public Sub ReallyReplyAll()
    Dim o As MailItem
    Set o = Application.ActiveExplorer.Selection.Item(1)
    o.ReplyAll.Display
End Sub

Finally, add a toolbar button to invoke the action. Go to Tools, Customise, Commands, Macros, and drag and drop the command Project1.ThisOutlookSession.ReallyReplyAll on to the toolbar. You can put this command exactly where you used to have Reply To All, and give it the same icon/name.

To test, select an email and click on the button you just added; it should do exactly what Reply To All would have done. There are some minor limitations to this method:

The button will not disable itself when it isn't applicable, i.e. when there are no emails selected. You will still be able to click on the button, but it won't do anything.

If you select a medium level of macro security, you will have to go through a security confirmation the first time you click Reply To All in an Outlook session.

If possible, try to educate the person in charge that Reply To All is perfectly good email etiquette, and that people should be trusted to use it responsibly. However, if that fails, the above method is a useful fallback.

Monday, September 29, 2008

General Updates

It's been a little while since I last posted. I've recently got back from ICFP 2008, and quite a few people asked me what I was doing now. I've also got a few comments on a few other things. The following is a section of disjointed paragraphs on a variety of topics, both academic and personal.

Catch Talk

I gave a talk at the Haskell Symposium, about Catch. A video of the talk is now online.

Generics Talk

Alexey gave a talk about generic programming libraries at the Haskell Symposium. I was particularly interested in this talk as it is somewhat like a competition between libraries, where Uniplate is one of the competitors. One thing I noticed is that the Uniplate version of the SYB example in the talk can be written as one single lexeme, namely uniplateBi. The talk was much more about generics libraries, while Uniplate is probably more accurately described as a traversal library, so issues such as conciseness of code were left out. One thing I did disagree with from the talk was the assertion that Uniplate requires Template Haskell and Data/Typeable deriving. In reality Uniplate requires neither, but if they are present, then you have the option of using them to write even less code.

From a combination of the paper and the talk I think it's fair to conclude that if Uniplate does what you want, it's a pretty good choice. This fits well with the Uniplate philosophy of giving up a small amount of power, to allow a massive simplification, while still being powerful enough for most tasks.

PhD/Work

I've had my PhD viva (passed with minor corrections), and have nearly finished making the minor corrections. I'll update my website with a revised copy of the thesis shortly. I'm currently working at Credit Suisse on a three month internship. I'm not working on Haskell stuff, but instead am doing F# programming. To get a feel for some of the things that are done by Credit Suisse I recommend looking at Ganesh's ICFP talk/paper and Howard's CUFP talk. Disclaimer: Nothing I say on this blog, or anywhere public, has anything to do with Credit Suisse, but are my personal thoughts.

Personal Life

I've just moved to Cambridge, and got engaged to my girlfriend (now fiancee), Emily King. I'll be commuting to Credit Suisse for the next two months.

My Libraries/Tools

Now I'm working full-time (long hours plus a long commute), it's hard for me to put the same amount of time into updating and maintaining my Haskell libraries and tools. I will still be accepting patches and answering questions, but probably not fixing too many bugs at any great speed. I'm still maintaining my bug tracker, so feel free to add bugs, fix bugs, or comment on bugs. If anyone has any particular interest in a tool, I'd consider taking on a co-maintainer to reduce some of the maintenance burden.

Well Typed

There is now a Haskell consultancy, Well Typed, comprising Duncan Coutts and Ian Lynagh. These are two very good Haskell hackers, who are now selling their knowledge and experience. Between them, they've had substantial experience with GHC, Cabal, Hackage, ByteString, TemplateHaskell and numerous Haskell libraries. They've also taught lots of students Haskell, and helped lots of beginners on IRC and mailing lists. If I want help with Haskell, or with the general infrastructure and tools, they are usually the first people I approach. I strongly recommend that anyone needing Haskell help in a commercial environment get in contact with them - they can help you get the most out of Haskell. Disclaimer: I haven't been asked to write this section, and haven't checked with Ian/Duncan first, but I do wish them luck!

Thursday, August 28, 2008

Running your own Hoogle on a Web Server

As promised, here is a guide on deploying Hoogle on a web server. Before doing so, you need to generate the necessary Hoogle databases, as described yesterday, and place them in the datadir configured with Cabal. Then:

Move the hoogle binary to a location where it can act as a CGI binary, perhaps changing its name to index.cgi, if necessary. Configure the CGI program to run, possibly changing the program to be executable or adding settings somewhere.

Copy the files from src/res in the darcs repo into a res directory located beside the binary.

Create a file log.txt and give it global write permissions.

Now you should have Hoogle running on a web server! Some of the features, such as OpenSearch integration, won't work - but Hoogle should be usable. If anyone does get Hoogle running on a web server I'd love to hear about it, and any feedback is appreciated. In particular, if there are any tweaks required please let me know.

Wednesday, August 27, 2008

Hoogle Database Generation

Brief Announcement: A new release of the Hoogle command line is out, including bug fixes and additional features. Upgrading is recommended.

Two interesting features of Hoogle 4 are working with multiple function databases (from multiple packages), and running your own web server. Neither of these features is fully developed yet, and both may change in their use, but they can be used with care. This post covers how to generate your own databases, and how the web version databases are generated. Tomorrow I'm going to post on how to run your own Hoogle web server, but you'll need to generate your databases first! I'm going to walk through all the steps to create a database from the filepath library, as an example.

Hoogle Databases

A Hoogle database is a set of searchable things, including text and type searching, and has a ".hoo" extension. A database may include the definitions from one package, or from multiple packages. Typically the Hoogle databases installed would include one database for each package (e.g. base.hoo, filepath.hoo), a default database (default.hoo) comprising all the standard search items, and any number of custom databases (e.g. all.hoo) which comprise different combinations of the other databases.

When using Hoogle, adding +name will include the given database in the search list, and -name will exclude the given package from the search. By default, Hoogle will use default.hoo, but if any +name commands are given then those databases will be used instead.
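The selection rule above can be modelled as a small pure function. This is a hypothetical sketch of the described behaviour, not Hoogle's actual query parser: the Query type and the names parseQuery and databasesFor are made up, and a word only counts as a flag when + or - is followed by a letter, so that the arrow in a type search is left alone.

```haskell
import Data.Char (isAlpha)
import Data.List (partition)

-- Hypothetical model of a query: search terms plus +name and -name flags.
data Query = Query
    { terms    :: [String]
    , included :: [String]
    , excluded :: [String]
    } deriving (Eq, Show)

parseQuery :: String -> Query
parseQuery s = Query ts [n | ('+':n) <- flags] [n | ('-':n) <- flags]
  where
    (flags, ts) = partition isFlag (words s)
    isFlag (c:d:_) = c `elem` "+-" && isAlpha d
    isFlag _       = False

-- default.hoo is used unless at least one +name flag was given.
databasesFor :: Query -> [String]
databasesFor q
    | null (included q) = ["default.hoo"]
    | otherwise         = map (++ ".hoo") (included q)
```

The excluded names are kept in the Query so that results from those packages can be filtered out when searching; how that filtering is done is left open here.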

Hoogle looks for databases in the current directory, in the data directory specified by Cabal, and in any --include directories passed at the command line.

Step 1: Creating a Textbase

A Textbase is a textual representation of a function database. To generate a textbase you need to install the darcs version of Haddock, then use runhaskell Setup haddock --hoogle on your package. For filepath, this will create the file dist/doc/html/filepath/filepath.txt, which is a textbase.

Step 2: Converting a Textbase to a Database

To convert a textbase to a database use the command hoogle --convert=filepath.txt in the appropriate folder. If a package depends on any other packages, then adding +package will allow Hoogle to use the dependencies to generate a more accurate database. In the case of filepath, which depends on base, we use hoogle --convert=filepath.txt +base. This command requires base.hoo to be present.

Adding the dependencies is not strictly necessary, but will allow Hoogle to generate a more accurate database. For example, the base package defines type String = [Char], without the +base flag this type synonym would not be known to Hoogle.

We now have filepath.hoo, which can be used as a search database.

Step 3: Combining Databases

To generate a database comprising both filepath and base, type hoogle --output=default.hoo --combine=filepath.hoo --combine=base.hoo. By combining databases you allow easy access to common groups of packages, and searching all these packages at once becomes faster than listing each database separately.
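When scripting steps 2 and 3 for several packages, it can help to build the argument lists programmatically. The sketch below only assembles the flags shown above (--convert, +package, --output and --combine); the function names are hypothetical, and actually running the commands is left out.

```haskell
-- Arguments for step 2: convert a textbase, naming the packages it depends on.
convertArgs :: FilePath -> [String] -> [String]
convertArgs textbase deps = ("--convert=" ++ textbase) : map ('+' :) deps

-- Arguments for step 3: combine several databases into one output file.
combineArgs :: FilePath -> [FilePath] -> [String]
combineArgs out dbs = ("--output=" ++ out) : map ("--combine=" ++) dbs
```

For example, unwords ("hoogle" : convertArgs "filepath.txt" ["base"]) reproduces the command line used in step 2.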

Web Version Databases

The web version uses the Hackage tarballs to generate documentation for most of its databases, but also has three custom databases:

base - the base package is just too weird, and isn't even on hackage. A darcs version and some tweaking is required.

keyword - the keyword database is a list of the keywords in Haskell, and is taken from the web page on the wiki.

hackage - the hackage database is a list of all the packages on Hackage, indexed only by the package name.

All the code for generating the web version databases is found in data/generate in the Hoogle darcs repo at http://code.haskell.org/hoogle.

Future Improvements

There are two database related tasks that still need to be done: Cabal integration and indexing all of Hackage.

Bug 80: In the future I would like Hoogle databases to be generated by Cabal automatically on installing a package. Unfortunately, I don't have the time to implement such a feature currently, and even if I did implement it, I'm unlikely to ever use it. If anyone wants to work on this, please get in contact. This is mainly a project working with Cabal.

Bug 79: The other work is to index all the packages on Hackage. The problem here is generating the textbases; once they have been created the rest is fairly simple. However, to run Haddock 2 over a package requires that the package builds, and that all the dependencies are present. Unfortunately my machine is not powerful enough to cope with the number of packages on Hackage. Hopefully at some point the machinery that builds Haddock documentation for Hackage will also generate textbases, however in the mean time if someone wants to take on the task of generating textbases for Hackage, please get in contact.

Bug Tracker

I'm not working on Hoogle full-time anymore, so am using my bug tracker to keep track of outstanding issues. In order to interact more effectively with my bug tracker, you might want to read this guide. It describes how to vote for bugs etc.

Wednesday, August 20, 2008

Hoogle New Features

I've now finished my Hoogle Summer of Code work, though I still intend to continue working on Hoogle when I get the chance. Before the coding period expired, I was able to add a number of new features to Hoogle. These features are all available at Hoogle, under http://haskell.org/hoogle/.

More Compact Text Searching

The old text search feature was very fast, using an on-disk trie to navigate around the possible matches. The downside to this trie was the space it consumed: about half the database was devoted to it. Fortunately, I came up with an alternative way to get fast text searching (albeit slightly slower), in a much more compact form.
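For readers unfamiliar with tries, here is a minimal in-memory sketch of the idea. It is not Hoogle's on-disk format or its search algorithm, and every name in it is made up. It does show where the space goes: there is a node for every character of every distinct prefix.

```haskell
import qualified Data.Map as Map

-- A node records whether a word ends here, plus one child per next character.
data Trie = Trie { endsHere :: Bool, children :: Map.Map Char Trie }

emptyTrie :: Trie
emptyTrie = Trie False Map.empty

insertWord :: String -> Trie -> Trie
insertWord []     t = t { endsHere = True }
insertWord (c:cs) t = t { children = Map.insert c (insertWord cs child) (children t) }
  where child = Map.findWithDefault emptyTrie c (children t)

-- Follow a prefix down the trie, if it is present.
descend :: String -> Trie -> Maybe Trie
descend []     t = Just t
descend (c:cs) t = Map.lookup c (children t) >>= descend cs

-- Every word stored at or below a node, relative to that node.
suffixes :: Trie -> [String]
suffixes t = [ "" | endsHere t ] ++
             [ c : w | (c, sub) <- Map.toList (children t), w <- suffixes sub ]

-- All stored words that start with the given prefix.
withPrefix :: String -> Trie -> [String]
withPrefix prefix t = maybe [] (map (prefix ++) . suffixes) (descend prefix t)
```

Looking up a prefix costs time proportional to the length of the prefix, regardless of how many words are stored.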

Much smaller database files also mean much faster database generation, as the time spent in the IO routines is the main bottleneck.

Faster IO routines

I rewrote the underlying binary layer in Hoogle, to make it faster. It's not as fast as I would like, and I think that moving to memory-mapped files is probably a good idea. With these improvements, along with the compact text searching, I am able to generate databases in about 2 seconds (compared to about 20 seconds before).

Database Restricted Searches

Hoogle has been able to run database restricted searches for some time, but now the databases contain enough information to make it practical. By adding +package or -package to the search you can include or exclude certain packages. For example, to find out which map functions are in the containers package try map +containers. To find out which map functions are not in the containers or bytestring packages try map -containers -bytestring. I have also split out the GHC.* modules from base, so if you want to find some unboxed types in GHC's libraries try # +ghc. Note that not all the documentation links work from the GHC modules; I am still trying to fix this.

By default Hoogle searches the following packages: array, base, bytestring, cabal, containers, directory, filepath, haskell-src, hunit, keyword, mtl, parallel, parsec, pretty, process, quickcheck, random, stm, template-haskell, time, xhtml

The "ghc" package is also available if specified with +ghc and includes the GHC.* modules of base only.
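As a rough illustration of the +package and -package syntax, here is a minimal sketch of how such a query could be split into search terms and package filters. This is not Hoogle's actual parser: the Query type, parseQuery and keepPackage are hypothetical names, and the rule that any +package flag restricts results to the named packages is an assumption based on the map +containers example above.

```haskell
import Data.Char (isAlpha, toLower)
import Data.List (partition)

-- | A parsed query: ordinary search terms plus package filters.
-- (Hypothetical type, for illustration only.)
data Query = Query
  { terms    :: [String]  -- ordinary search words, e.g. "map"
  , included :: [String]  -- packages named with +package
  , excluded :: [String]  -- packages named with -package
  } deriving (Show, Eq)

-- | Split a query on whitespace and classify each word.
-- A flag is '+' or '-' followed by a letter, so "->" in a type
-- signature is still treated as an ordinary term.
parseQuery :: String -> Query
parseQuery s = Query ts (named '+') (named '-')
  where
    (flags, ts) = partition isFlag (words s)
    named sign = [ map toLower name | (c : name) <- flags, c == sign ]
    isFlag (c : x : _) = c `elem` "+-" && isAlpha x
    isFlag _           = False

-- | Should a result from this package be shown?
-- Assumed rule: excluded packages are always dropped; if any package
-- is included explicitly, only included packages are kept.
keepPackage :: Query -> String -> Bool
keepPackage q pkg = p `notElem` excluded q && (null (included q) || p `elem` included q)
  where p = map toLower pkg
```

With this sketch, parseQuery "map -containers -bytestring" yields the single term map and two excluded packages, while a type query such as (a -> b) -> [a] -> [b] contains no package flags at all.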

Hoogle 3

I have now replaced the default Hoogle with Hoogle 4, but have copied Hoogle 3 to http://haskell.org/hoogle/3. Unfortunately, it doesn't yet work, as I need some admin help. But it will in the next few days, I hope. The only reason I can think of for using Hoogle 3 is Gtk2hs library searching, which I do want to add to Hoogle 4 when possible.

Give Me Feedback

There are quite a lot of enhancements to Hoogle that I still want to make. I have tried to list all these improvements in my bug tracker. If you find a bug, or want some feature, open an issue. If you have a particular interest in a bug, you can star it, to be informed on its progress and to indicate to me that you care.

I'm particularly interested in two pieces of feedback:

I don't use Hoogle 4 because ...

Do you use any type/name search engine? Do you want to still use Hoogle 3? Do you use Hayoo? If you use something else, what feature draws you to it? What do you dislike about Hoogle 4?

I use Hoogle 4, but my life would be nicer if ...

There are many things which affect Hoogle 4 users that I'm not aware of. If you open a bug saying what annoys you (or leave a comment and I'll do it for you) then I can keep track of this information. Even if you don't necessarily see any way to fix the problems, I'd still like to know them.

Thanks to everyone who has given feedback on Hoogle so far; it has been very useful.

Friday, August 15, 2008

GSoC Hoogle: Week 12

This week I've been trying to get Hoogle 4 to the point where it can replace Hoogle 3. This is the final official week of Google Summer of Code, but I'm planning to continue hacking Hoogle next week, and then as time allows after that.

The priority this week was getting the documentation links working. The problem was not with Hoogle - displaying the links is trivial - but ensuring that Cabal + Haddock + Hoogle + random build scripts combine to generate the correct databases. This work involved lots of little changes in lots of places, but is now working properly. Included in this work is dependency tracking of packages (so that all packages using base know that String = [Char] etc), and merging multiple databases to create a single one.

After the Hoogle database was generated correctly, I started looking at using some of the additional information present. I have now added Haddock documentation inline in the search results. If the documentation is too long to fit comfortably, Hoogle uses AJAX wizzy-ness (or more accurately, DHTML) to allow the user to expand and show all the documentation. I suspect that this will eliminate many cases of the user actually following links to the Haddock webpages. This feature is fairly new, and I have pushed it out because it's useful - there are still many small improvements that need to be made.

This week I also spent some time attempting to generate documentation for all the Hackage libraries. I had some success, but the computer I am currently using is years old and lacks the necessary processing power. I will tackle this at some point in the future, once I have purchased a new machine (which should be quite soon).

With all these changes, I find Hoogle 4 to be significantly more usable than Hoogle 3. Please give it a try, and give feedback. At this point I'm particularly interested in any issues that would cause you to use Hoogle 3 instead of Hoogle 4.

Hoogle 3: http://haskell.org/hoogle

Hoogle 4: http://haskell.org/hoogle/beta

If there are no major issues, I will be making Hoogle 4 the standard Hoogle sometime next week.

Next week: I will no longer be doing Google Summer of Code :-) I plan to refine some of the existing bits of Hoogle, and ensure that anything I haven't done is in a bug tracker for later.

User visible changes: The web search engine now gives Haddock links and displays Haddock documentation inline.

Monday, August 11, 2008

GSoC Hoogle: Week 11

This week I've been releasing lots. Hoogle 4 is finally starting to come together, and should be a worthy replacement for Hoogle 3 very shortly. Rather than go into detail about the past week, I'm just going to give some of the bullet points:

I have released 4 versions of the command line version of Hoogle, available on Hackage. Many bugs have been spotted by some very useful testers, and improvements have been made.

I have released a web version of Hoogle 4, and encourage feedback.

I have started to update the wiki Manual, which now contains some details of Hoogle's query syntax.

I gave a talk at AngloHaskell 2008, which is available online, as slides and an audio stream. All of the other talks were excellent and are well worth listening to.

I have started to build Hoogle documentation for all of Hackage. The machine I'm doing this on is very slow, so it's not a quick process!

Next week: I'm hoping to work on generating better Hoogle databases, including a Hoogle database for the whole of Hackage. I also have a number of bugs to fix.

User visible changes: Users can download and use Hoogle, and the web interface is online.

Tuesday, August 05, 2008

Hoogle 4.0 web client preview

Since releasing a command line version of Hoogle 4 yesterday, I've had some useful feedback from a number of people. As a result, I have added a few bugs to the bug tracker, and fixed a few mistakes in the searching and ranking. The Hoogle on Hackage is currently 4.0.0.3 and is a recommended upgrade to all early testers.

I've now written a web interface to Hoogle 4, which has been uploaded to http://haskell.org/hoogle/beta/. This web interface is primarily so people can test searching/ranking without installing anything. There are a number of limitations:

The links to documentation do not work - this is the most severe problem, and probably stops people permanently changing to the new version.

The Haddock documentation is not present.

Some database entries are duplicates.

The Lambdabot says feature is missing.

The Suggestion feature is incomplete.

The AJAX style client features are not present.

The first three issues are fixed in Hoogle, but need various support through Haddock and Cabal to work. Other than these limitations, I am very interested in hearing what people think. As before, I am particularly interested in regressions from Hoogle 3 or poor results/ranking.

Monday, August 04, 2008

Hoogle 4.0 release (beta, command line)

I am pleased to announce Hoogle 4.0, available on Hackage. A couple of things to note:

This is a release of the command-line version only. It will have identical searching abilities to the web-based version, which I'm about to write.

It currently only searches the same packages as Hoogle 3 (the final release will search more).

It currently doesn't support the --info flag as previously described (problems with Haddock, not with Hoogle).

Walkthrough: Installation

If you have cabal-install available, it should be as simple as:


$ cabal update && cabal install hoogle

Otherwise, follow the standard Cabal/Hackage guidelines. Hoogle depends on about 4 packages on Hackage which are not available with a standard GHC install, so these will need to be built.

Walkthrough: A few searches

Here are some example searches. I have used --count=5 to limit the number of results displayed. If you are using a terminal with ANSI escape codes I recommend also passing --color to enable colored output.


$ hoogle map --count=5
Prelude map :: (a -> b) -> [a] -> [b]
Data.ByteString map :: (Word8 -> Word8) -> ByteString -> ByteString
Data.IntMap map :: (a -> b) -> IntMap a -> IntMap b
Data.IntSet map :: (Int -> Int) -> IntSet -> IntSet
Data.List map :: (a -> b) -> [a] -> [b]

$ hoogle "(a -> b) -> [a] -> [b]" --count=5
Prelude map :: (a -> b) -> [a] -> [b]
Data.List map :: (a -> b) -> [a] -> [b]
Control.Parallel.Strategies parMap :: Strategy b -> (a -> b) -> [a] -> [b]
Prelude fmap :: Functor f => (a -> b) -> f a -> f b
Control.Applicative <$> :: Functor f => (a -> b) -> f a -> f b

$ hoogle Data.Map.map --count=5
Data.Map map :: (a -> b) -> Map k a -> Map k b
Data.Map data Map k a
module Data.Map
Data.Map mapAccum :: (a -> b -> (a, c)) -> a -> Map k b -> (a, Map k c)
Data.Map mapAccumWithKey :: (a -> k -> b -> (a, c)) -> a -> Map k b -> (a, Map k c)

$ hoogle "Functor f => (a -> b) -> f a -> f b" --count=5
Prelude fmap :: Functor f => (a -> b) -> f a -> f b
Control.Applicative <$> :: Functor f => (a -> b) -> f a -> f b
Control.Monad fmap :: Functor f => (a -> b) -> f a -> f b
Control.Monad.Instances fmap :: Functor f => (a -> b) -> f a -> f b
Data.Traversable fmapDefault :: Traversable t => (a -> b) -> t a -> t b

How you can help

I've released a command line version of the search to solicit feedback. I'm interested in all comments, but especially ones of the form:

I prefer the command line version of Hoogle 3 because ...

When I search for ... I would expect result ... to appear, or to appear above result ...

I was hoping for the feature ...

It takes too long when I ...

I'm going to be accumulating Hoogle 4 bugs in my bug tracker, or by email (http://www-users.cs.york.ac.uk/~ndm/contact/) - whichever you find more convenient.

Now I'm going to start work on the Web search :-)

Sunday, August 03, 2008

GSoC Hoogle: Week 10

This week I've been in Bristol, and am just about to head off to the Harbour Festival. Next week I'm heading off to AngloHaskell 2008, and will be talking about Hoogle type searching on the Saturday.

This week has been type search, yet again. There were issues with algorithmic complexity, combinatorial explosions and other fun stuff. However, it's now finished. The type search is now fast enough (you can run Hoogle in Hugs against the core libraries) and gives good results. Rather than describe type searching, it's easier to give an example. Searching for (a -> b) -> [a] -> [b] in Hoogle 3 gives:


Prelude.map :: (a -> b) -> [a] -> [b]
Data.List.map :: (a -> b) -> [a] -> [b]
Control.Parallel.S... parMap :: Strategy b -> (a -> b) -> [a] -> [b]
Prelude.scanr :: (a -> b -> b) -> b -> [a] -> [b]
Data.List.scanr :: (a -> b -> b) -> b -> [a] -> [b]
Prelude.scanl :: (a -> b -> a) -> a -> [b] -> [a]
Data.List.scanl :: (a -> b -> a) -> a -> [b] -> [a]
Prelude.concatMap :: (a -> [b]) -> [a] -> [b]

But in Hoogle 4 gives:


Prelude map :: (a -> b) -> [a] -> [b]
Data.List map :: (a -> b) -> [a] -> [b]
Prelude fmap :: Functor f => (a -> b) -> f a -> f b
Control.Applicative <$> :: Functor f => (a -> b) -> f a -> f b
Control.Monad fmap :: Functor f => (a -> b) -> f a -> f b
Control.Monad.Instances fmap :: Functor f => (a -> b) -> f a -> f b
Control.Applicative liftA :: Applicative f => (a -> b) -> f a -> f b
Data.Traversable fmapDefault :: Traversable t => (a -> b) -> t a -> t b
Control.Monad liftM :: Monad m => (a1 -> r) -> m a1 -> m r
Control.Parallel.Strategies parMap :: Strategy b -> (a -> b) -> [a] -> [b]

I think the new results are better. For more details, come to the AngloHaskell talk.

Next Week: I want to release a public beta of Hoogle 4 in command line form. I want to start on the web search engine and tweak the ranking algorithm. I'll also be writing up type search in the form of a presentation.

User Visible Changes: Type search works well and fast.

Thursday, July 24, 2008

GSoC Hoogle: Week 9

I'm off camping for the next weekend in a couple of hours, so this is my early weekly summary. From next week, for a week and a half, I'll actually have an SSH connection so expect to see 200+ patches flow into the Hoogle repo in a few days.

This week I've been rewriting the type search. I spent 3 days writing code, type checking it, but not actually having enough written to run it. Late last night I finished the code, and this morning I debugged it. Amazingly (although actually quite commonly for Haskell) it worked with only minor tweaks. I now have a type search which should scale to large databases and provide fast and accurate searches.

All the basic tests work, and I can generate a Hoogle database for the array library. I still can't generate a Hoogle database for the base library, due to a stack overflow, but I think the cause of the stack overflow has changed and should be easy to debug.

Next Week: A public beta of the command line version is now overdue, and hopefully will happen next week. I aim to finish the actual search side of Hoogle, and move on to the web interface.

User Visible Changes: Type search works again, mostly.

Sunday, July 20, 2008

GSoC Hoogle: Week 8

This week I've been travelling quite a bit, and rather busy with other things. Hopefully next week I'll be able to focus more time on Hoogle!

This week I fleshed out the final part of type search, including support for instances and alpha renaming of variables. After having implemented all the bits in the type search, I tried to convert the base libraries - and it failed, taking up too much time/memory to feasibly finish.

The type search is based around the idea of having nodes in a graph representing types, and then moving between these nodes, at a cost. In order to avoid a blow-up in the number of nodes in the graph, types are alpha-normalised and then alpha-renaming is performed afterwards. Instead of having 3 type nodes for (a,b), (c,d) and (a,a) there is just one named (a,b) and 3 sets of alpha-renamings. All is good.
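To make the alpha-normalisation idea concrete, here is a minimal sketch under my own assumptions. The Type representation and the normalise function are hypothetical and are not taken from the Hoogle source. Every variable occurrence is replaced, left to right, with a fresh name, and the original names are returned separately as the alpha-renaming.

```haskell
-- | A tiny type representation: variables and applied constructors.
-- (Hypothetical, far simpler than a real type representation.)
data Type = TVar String | TCon String [Type]
  deriving (Show, Eq)

-- | Replace each variable occurrence, left to right, with a fresh name
-- (a, b, c, ...) and return the original names as the alpha-renaming.
normalise :: Type -> (Type, [String])
normalise t = (t', reverse olds)
  where
    (t', _, olds) = go t fresh []
    -- An infinite supply of names: a..z, then v1, v2, ...
    fresh = [ [c] | c <- ['a' .. 'z'] ] ++ [ 'v' : show i | i <- [1 :: Int ..] ]

    -- Thread the remaining fresh names and the accumulated original names.
    go :: Type -> [String] -> [String] -> (Type, [String], [String])
    go (TVar v) (n : ns) acc = (TVar n, ns, v : acc)
    go (TVar v) []       acc = (TVar v, [], v : acc)  -- unreachable: fresh is infinite
    go (TCon c args) ns acc = (TCon c (reverse args'), ns', acc')
      where
        (args', ns', acc') = foldl step ([], ns, acc) args
        step (done, ns0, acc0) a =
          let (a', ns1, acc1) = go a ns0 acc0 in (a' : done, ns1, acc1)

-- | Helper for building pair types such as (a,b).
pair :: Type -> Type -> Type
pair x y = TCon "(,)" [x, y]
```

In this sketch (a,b), (c,d) and (a,a) all normalise to the same node (a,b), with the renamings [a,b], [c,d] and [a,a] respectively, so the graph needs one node rather than three.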

However, once you introduce instance restrictions, the types blow up. For example, from the type node a, you can move to Eq a => a, Ord a => a, Show a => a etc. The large (but feasible) number of type nodes, combined with even a small number of class names, gives a huge number of nodes. In fact, for every type variable in a node there are 2^n possible instance contexts it could take. All is bad.
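The size of the blow-up is easy to check with a few lines of code. This is only a counting sketch with hypothetical names (contexts, variants); it assumes each type variable can carry any subset of the known classes, independently of the other variables.

```haskell
import Data.List (subsequences)

-- | Every instance context a single type variable could carry,
-- given the class names known to the database.
contexts :: [String] -> [[String]]
contexts = subsequences

-- | Number of variants of one type node, given the number of classes
-- and the number of type variables in the node, assuming each variable
-- picks its context independently.
variants :: Int -> Int -> Integer
variants classes vars = (2 ^ classes) ^ vars
```

With just Eq, Ord and Show there are already 8 contexts per variable, so a node with two type variables has 64 variants. With 10 classes, the same node has over a million.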

Fortunately there is a solution - move instance checking outside the type graph. This makes the number of nodes feasible, and should work fairly well. It also has a few other benefits, including slightly better scoring and a simpler implementation in a few places. I also came up with a strategy for moving the cost associated with alpha-renaming into the graph search, which further simplifies things.

Of course, all this work takes time, so overall progress is slower than I would have liked. However, the results so far are promising, and the problems of scale seem to have been successfully addressed. The problem of fast and accurate type searching is hard, but hopefully Hoogle 4 will have a scalable solution that should be useful.

Next week: I want to finish the implementation of type searching, and check it works on the full base libraries. A release would be good, although may take place early in the following week.

User visible changes: Creating a database for the base library will now fail with a stack overflow. Hopefully next week's changes will fix this!

Friday, July 11, 2008

GSoC Hoogle: Week 7

This week I've continued to improve the type searching, and generated Hoogle databases for the core libraries. I'm away from a computer all weekend until Tuesday evening, which has happened every ICFP contest for the last 3 years.

I've substantially refactored the type searching, basing it on a proper abstract Graph data type. Now that the mechanisms for dealing with type search and graph traversal are separate, it is much easier to express clearly what type search is doing. I've also fleshed out the type searching code so that it can accurately perform searches with all the necessary features. There are still a number of tasks to do before the type searching code is finished, but each is a fairly discrete unit of work with well-understood problems.

The other challenge for the week has been generating Hoogle databases for the core libraries - the base library and all the other libraries GHC ships with a release. With these libraries in place, it is feasible to use Hoogle to perform useful queries. The libraries are generated using a combination of Cabal, Haddock and Hoogle. I've made changes in both the Haddock and Hoogle layers so that the full base libraries can now be processed.

In order to deal with the full base libraries there are numerous GHC extensions that must be supported. In particular, Hoogle now supports multi-parameter type classes, higher-ranked types, type operators, unboxed types, unboxed tuples and NDP style arrays. All of these features are translated down into Haskell 98 types, but most closely approximate their behaviour in GHC, and can be used in searches.

Throughout the week I've been profiling the database creation code in Hoogle. The databases for the core libraries come to about 4.5Mb, and are highly optimised for performing searches - often at the cost of making them harder to create. I've halved the time to create databases during the week, using profiling to direct improvements. Processing the core libraries now takes 60 seconds, which is certainly an acceptable timeframe, but could always be faster. Currently the biggest culprit in the profile is the hPutByte function:


import Data.Char (chr)
import System.IO (Handle, hPutChar)
hPutByte :: Handle -> Int -> IO ()
hPutByte hndl i = hPutChar hndl $ chr i

Since the underlying databases are written using lots of hTell and hSeek commands, it is not possible to use something like the Data.Binary library. However, if anyone has any suggestions on how to improve performance they would be gratefully received.
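To show why hTell and hSeek sit awkwardly with a purely streaming serialiser, here is a minimal sketch of one common pattern: writing a placeholder, writing the data, then seeking back to patch the placeholder. The post does not say exactly what Hoogle's seeks are used for, so the putSized function and its one-byte length field are assumptions made for illustration, not the real database layout.

```haskell
import Data.Char (chr)
import System.IO

-- | Write one byte, as in the function above.
-- The handle is assumed to be in binary mode.
hPutByte :: Handle -> Int -> IO ()
hPutByte hndl i = hPutChar hndl $ chr i

-- | Write a payload preceded by a one-byte length field that is only
-- filled in after the payload has been written (hypothetical layout,
-- so payloads must be shorter than 256 bytes).
putSized :: Handle -> [Int] -> IO ()
putSized h payload = do
  start <- hTell h              -- remember where the length field lives
  hPutByte h 0                  -- placeholder for the length
  mapM_ (hPutByte h) payload
  end <- hTell h
  hSeek h AbsoluteSeek start    -- jump back and patch the placeholder
  hPutByte h (fromIntegral (end - start - 1))
  hSeek h AbsoluteSeek end      -- resume writing at the end of the file
```

A usage example would open the file with openBinaryFile or withBinaryFile in WriteMode and call putSized once per record. The point is that each record touches an earlier position in the file, which a write-once stream cannot do.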

Next week: I want to finish off the remaining type search features, and then package up a command line release for Hackage. Hopefully Hoogle 4 will be ready for initial use by early testers.

User visible changes: Type search is more robust, but still not fully featured. Database creation is faster and more robust. You can search the base libraries.

Sunday, July 06, 2008

GSoC Hoogle: Week 6

This week I've been tackling type searching. I have just (in the last few minutes) got my first type search to work. At the moment type search is very limited, but all the ideas and scaffolding are in place, so work should now proceed relatively quickly.

In all previous versions of Hoogle, type searching was O(n), where n is the number of functions in the database. Hoogle compared the type search to each possible answer, computed a closeness score, then at the end wrote out the closest matches. This meant that before the first answer could be given, all functions had to be checked, i.e. the time for the first answer was O(n). As the Hoogle database is about to get massively bigger, this approach is insufficient.

The new version of Hoogle is much cleverer. It works by exploring a graph, following similar ideas to Dijkstra's algorithm, to reach more suitable results first. Typically, the best answers will be given without any search of the graph, and then as the graph is searched more results will appear with lower closeness. With the new scheme the complexity is O(m), where m is the number of results you want. I hope at some point after the SoC is finished to describe the algorithm properly, so others can understand it, and hopefully improve upon it.
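The general shape of such a search can be sketched as a lazy best-first traversal. This is not the Hoogle algorithm, only an illustration of the Dijkstra-like idea: the search function, the toyMoves graph and all the costs in it are made up, and the sketch assumes the containers package (which ships with GHC) for Data.Set.

```haskell
import qualified Data.Set as Set

-- | Lazily list the nodes reachable from a start node, cheapest first.
-- 'next' gives each neighbour together with the non-negative cost of
-- moving to it. Because the result list is lazy, asking for the first
-- m results only explores as much of the graph as those results need.
search :: Ord n => (n -> [(Int, n)]) -> n -> [(Int, n)]
search next start = go (Set.singleton (0, start)) Set.empty
  where
    go frontier seen = case Set.minView frontier of
      Nothing -> []
      Just ((cost, node), rest)
        | node `Set.member` seen -> go rest seen  -- already emitted more cheaply
        | otherwise ->
            (cost, node) : go (foldr Set.insert rest moves) (Set.insert node seen)
        where moves = [ (cost + c, n) | (c, n) <- next node ]

-- | A toy graph whose nodes are type signatures written as strings.
-- The edges and costs are invented purely for this example.
toyMoves :: String -> [(Int, String)]
toyMoves "(a -> b) -> [a] -> [b]" =
  [ (1, "(a -> b) -> f a -> f b")
  , (4, "Strategy b -> (a -> b) -> [a] -> [b]") ]
toyMoves "(a -> b) -> f a -> f b" =
  [ (2, "(a -> b) -> m a -> m b") ]
toyMoves _ = []
```

Taking the first two results of search toyMoves "(a -> b) -> [a] -> [b]" returns the exact match at cost 0 followed by the f a -> f b generalisation at cost 1, without ever expanding the more expensive nodes. The same function works on an infinite graph, as long as only a finite prefix of the results is demanded.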

Next week: Finishing off type searching, so it supports all the features planned. Build system work, and potentially a cabal pre-release.

User visible changes: Type search works to some degree, but not perfectly. Database debugging options (conversion and dumping to a text file) have been added.