Friday, July 11, 2008

GSoC Hoogle: Week 7

This week I've continued to improve the type searching, and generated Hoogle databases for the core libraries. I'm away from a computer all weekend until Tuesday evening, which has happened every ICFP contest for the last 3 years.

I've substantially refactored the type searching, basing it on a proper abstract Graph data type. Now that the mechanisms for dealing with type search and graph traversal are separate, it is much easier to express clearly what type search is doing. I've also fleshed out the type searching code so that it can accurately perform searches with all the necessary features. There are still a number of tasks to do before the type searching code is finished, but each is a fairly discrete unit of work with well-understood problems.

The other challenge for the week has been generating Hoogle databases for the core libraries - the base library and all the other libraries GHC ships with a release. With these libraries in place, it is feasible to use Hoogle to perform useful queries. The libraries are generated using a combination of Cabal, Haddock and Hoogle. I've made changes in both the Haddock and Hoogle layers so that the full base libraries can now be processed.

In order to deal with the full base libraries there are numerous GHC extensions that must be supported. In particular, Hoogle now supports multi-parameter type classes, higher-ranked types, type operators, unboxed types, unboxed tuples and NDP style arrays. All of these features are translated down into Haskell 98 types, but most closely approximate their behaviour in GHC, and can be used in searches.

Throughout the week I've been profiling the database creation code in Hoogle. The databases for the core libraries come to about 4.5MB, and are highly optimised for performing searches - often at the cost of making them harder to create. I've halved the time to create databases during the week, using profiling to direct improvements. Processing the core libraries now takes 60 seconds, which is certainly an acceptable timeframe, but could always be faster. Currently the biggest culprit in the profile is the hPutByte function:

hPutByte :: Handle -> Int -> IO ()
hPutByte hndl i = hPutChar hndl $ chr i

Since the underlying databases are written using lots of hTell and hSeek commands, it is not possible to use something like the Data.Binary library. However, if anyone has any suggestions on how to improve performance they would be gratefully received.
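
To make the hTell and hSeek point concrete, here is a minimal sketch of the kind of back-patching that a purely streaming serialiser cannot easily express: write a placeholder, write the data, then seek back and fill in the offset. The layout (a 4-byte big-endian offset followed by a payload) and the names hPutInt32, writeWithOffset and demo are assumptions made up for illustration, not Hoogle's actual database format.

```haskell
import Control.Monad (replicateM)
import Data.Bits (shiftR, (.&.))
import Data.Char (chr, ord)
import System.Directory (getTemporaryDirectory, removeFile)
import System.IO

-- Same definition as in the post.
hPutByte :: Handle -> Int -> IO ()
hPutByte hndl i = hPutChar hndl $ chr i

-- Hypothetical helper: write an Int as four big-endian bytes.
hPutInt32 :: Handle -> Int -> IO ()
hPutInt32 hndl n =
    mapM_ (\s -> hPutByte hndl ((n `shiftR` s) .&. 0xff)) [24, 16, 8, 0]

-- Hypothetical layout: a 4-byte offset of the payload's end, then the payload.
-- The offset is not known until the payload has been written, so a
-- placeholder is written first and patched afterwards.
writeWithOffset :: Handle -> [Int] -> IO ()
writeWithOffset hndl payload = do
    start <- hTell hndl                 -- remember where the offset lives
    hPutInt32 hndl 0                    -- placeholder
    mapM_ (hPutByte hndl) payload
    end <- hTell hndl                   -- position just after the payload
    hSeek hndl AbsoluteSeek start       -- jump back to the placeholder
    hPutInt32 hndl (fromInteger end)    -- patch in the real offset
    hSeek hndl AbsoluteSeek end         -- carry on writing from the end

-- Write a three-byte payload to a temporary binary file and read it back.
demo :: IO [Int]
demo = do
    dir <- getTemporaryDirectory
    (path, hndl) <- openBinaryTempFile dir "hoogle-sketch.bin"
    writeWithOffset hndl [1, 2, 3]
    hSeek hndl AbsoluteSeek 0
    bytes <- replicateM 7 (fmap ord (hGetChar hndl))
    hClose hndl
    removeFile path
    return bytes
```

Running demo should give back [0,0,0,7,1,2,3]: the first four bytes are the patched offset and the last three are the payload. The file is opened in binary mode so that each Char below 256 is written as exactly one byte.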

Next week: I want to finish off the remaining type search features, and then package up a command line release for Hackage. Hopefully Hoogle 4 will be ready for initial use by early testers.

User visible changes: Type search is more robust, but still not fully featured. Database creation is faster and more robust. You can search the base libraries.


meteficha said...

mmap, maybe?

Neil Mitchell said...

meteficha, sounds like it could work - will give it a go in the next few weeks and see what the speed is. Thanks!