This week I was going to tackle type searching, but then realised I'm going to spend 6 hours on a train on Friday (hence this weekly update arriving on Thursday), so I can spend that time productively working on type search on paper. So instead of type search, I worked on a few other pieces, some of which make type search easier:
Haddock Database Generation
More patches to get better output from Haddock. The code now handles class methods properly, and deals with some FFI bits.
Lazy Name Searching
Searching for a name is now fairly lazy: Hoogle can return a prefix of the results without doing the computation needed to produce all of them. This work is useful in its own right, but is also necessary for type searching, where it can be reused.
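The idea can be sketched in a few lines (an illustration of laziness, not Hoogle's actual code): because the result list is produced lazily, a consumer can take the first few matches without the whole database being scanned.

```haskell
import Data.List (isInfixOf)

-- A lazy search: filter produces matches on demand, so taking a
-- prefix of the results only scans as much of the input as needed.
search :: String -> [String] -> [String]
search query = filter (query `isInfixOf`)

main :: IO ()
main = do
    -- An infinite "database": laziness means we can still take the
    -- first matches without computing all results.
    let db = ["name" ++ show i | i <- [1 :: Int ..]]
    print (take 3 (search "name1" db))
```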
Hoogle --info
The biggest feature added this week is the --info flag. When this flag is given, Hoogle picks the first result and gives more details, including any Haddock documentation associated with the function. For example:
$ hoogle +tagsoup openurl --info
Text.HTML.Download openURL :: String -> IO String
This function opens a URL on the internet. Any http:// prefix is ignored.
> openURL "www.haskell.org/haskellwiki/Haskell"
Known Limitations:
* Only HTTP on port 80
* Outputs the HTTP Headers as well
* Does not work with all servers
It is hoped that a more reliable version of this function will be placed in a new HTTP library at some point!
Next week: Type searching! See last week for a description of what I hope to achieve.
User visible changes: The --info flag now exists.
Thursday, June 26, 2008
Sunday, June 22, 2008
GSoC Hoogle: Week 4
This week I've stayed in one place, and had lots of opportunity to get on with Hoogle. I've done a number of different things this week:
More on Haddock databases
I fixed a number of issues with the Haddock generated Hoogle information. These patches have been submitted back to Haddock.
Binary Defer library
I merged the binary defer library into the Hoogle sources, and modified it substantially. Some of the modifications were thanks to suggestions from the Haskell community, particularly David Roundy. The library is now more robust, and is being used as a solid foundation to build the rest of Hoogle on top of.
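As an illustration of the deferred-loading idea (a sketch only, not the real binary defer library), a record field can be backed by an IO action that only runs when the field is first demanded, via unsafeInterleaveIO:

```haskell
import System.IO.Unsafe (unsafeInterleaveIO)

-- Hypothetical sketch: the expensive part of an entry is deferred,
-- so it is only loaded if it is actually demanded.
data Entry = Entry { entryName :: String, entryDocs :: String }

loadEntry :: IO String -> IO Entry
loadEntry expensive = do
    docs <- unsafeInterleaveIO expensive  -- deferred until demanded
    return (Entry "map" docs)

main :: IO ()
main = do
    e <- loadEntry (return "Apply a function to every element")
    putStrLn (entryName e)  -- does not force the deferred field
    putStrLn (entryDocs e)  -- the deferred action runs here
```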
Text Searching
You can now search for one or more words at once. The text searching uses efficient data structures, scales excellently, and returns better results first.
Suggestions
These improvements were detailed earlier in the week.
Next week: Type searching. I have various ideas on how to go about this, but it is the most tricky part of the whole project. I hope to come up with the perfect solution by the end of the week, but if not, will come up with something good enough for Hoogle 4 then revise it after the Summer is over (it could easily suck in a whole Summer of time if I am not careful!). Much of the low-level infrastructure is already present, so it is just the search algorithm.
User visible changes: Text searching works. A session with Hoogle as it currently stands:
> cabal haddock --hoogle
-- generates tagsoup.txt
> hoogle --convert=tagsoup.txt
Generating Hoogle database
Written tagsoup.hoo
> hoogle +tagsoup is open --color
Text.HTML.TagSoup.Type isTagOpen :: Tag -> Bool
Text.HTML.TagSoup.Type isTagOpenName :: String -> Tag -> Bool
Wednesday, June 18, 2008
Hoogle 4 New Features
I'm still developing Hoogle 4, and there are many things that don't work (such as searching for types and the web version). However, it's starting to come together, and I'm beginning to implement new features that aren't in Hoogle 3. Today I've implemented two useful features.
Multi Word Text Search
In Hoogle 3, if you entered "is just" it would be treated as a type search, exactly the same as "m a". Now, it will search for "is" and search for "just" and intersect the results. This seems to be something that people often try, so hopefully will make Hoogle more intuitive.
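A sketch of the idea (hypothetical code, not Hoogle's implementation): split the query into words and keep only the entries that match every word, which is the intersection of the per-word results.

```haskell
import Data.Char (toLower)
import Data.List (isInfixOf)

-- A multi-word query matches the entries containing every word,
-- compared case-insensitively.
matches :: String -> [String] -> [String]
matches query entries =
    [e | e <- entries, all (`isInfixOf` map toLower e) ws]
  where ws = words (map toLower query)

main :: IO ()
main = print (matches "is just" ["isJust", "isNothing", "fromJust"])
```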
Intelligent Suggestions
Hoogle 3 tries to give suggestions; for example, if I search for "a -> maybe a" it will helpfully suggest "a -> Maybe a". Unfortunately, it's not that clever: if your search term contains a type variable (starting with a lower-case letter) that is more than one letter long, it will suggest the capitalised version. For example, "(fst,snd) -> snd" will suggest "(Fst,Snd) -> Snd", which isn't very helpful.
The new mechanism uses knowledge about the types, arities and constructors present in the Hoogle database. Some examples:
"Just a -> a" ===> "Maybe a -> a"
"a -> Maybe" ===> "a -> Maybe b"
"a -> MayBe a" ===> "a -> Maybe a"
"a -> maybe a" ===> "a -> Maybe a"
Sunday, June 15, 2008
darcs over FTP
I'm currently unable to access SSH, and suspect this situation will persist for most of the Summer. Most of my darcs repos are behind SSH, so this presents a problem. I've been looking for a way to work with darcs over FTP, and have managed to get it going on Windows. The following are instructions for (1) me, when I forget them, and (2) any Windows users who want to follow the same path. If you are a Linux user, similar information is available from this blog post.
Step 1: Install Sitecopy
Go to http://dennisbareis.com/freew32.htm and download and install SITECPY.
Add "C:\Program Files\SITECOPY" to your path.
Add "C:\Home" to a %HOME% variable.
Open up a command line and type:
c:\> mkdir home
c:\> cd home
c:\home> mkdir .sitecopy
c:\home> echo . > .sitecopyrc
Step 2: Prepare the FTP site
Go to the FTP site, and create a directory. In my particular example, I created darcs/hoogle so I could mirror the Hoogle repo.
Step 3: Configure Sitecopy
Edit the file "c:\home\.sitecopyrc" to contain:
site hoogle
server ftp.york.ac.uk
username ndm500
local C:\Neil\hoogle
remote web/darcs/hoogle
port 21
Obviously, substituting in your relevant details.
Step 4: Initialise Sitecopy
Type:
sitecopy --init hoogle
darcs push
Now to do a darcs push, you can type:
sitecopy --update hoogle
The first copy will take a long time, but subsequent copies should be a lot faster.
darcs pull
After all this, you can either pull using FTP, or if your FTP is also a web site, you can pull over http. For example:
darcs get http://www-users.york.ac.uk/~ndm500/darcs/hoogle/
GSoC Hoogle: Week 3
This week I've travelled a further 600 miles by train, but am now starting to get settled for the Summer, and down to work on Hoogle.
My main focus this week has been getting Haddock to generate Hoogle databases. For Haddock 0.8 I added in a --hoogle flag to generate Hoogle databases, and a similar --hoogle flag to Cabal. Unfortunately, for Haddock 2.0, the feature was removed as most of the code got rewritten. Now I've added the feature back, making extensive use of the GHC API to reduce the amount of custom pretty-printing required, and to support more Haskell features. The code has been added to the development Haddock branch, and will be present in the next release.
Most of the challenge was working with the GHC API. It's certainly a powerful body of code, but suffers from being inconsistent in various places and poorly documented. I mainly worked with the code using :i to view the API. I got bitten by various problems such as the Outputable module exporting useful functions such as mkUserStyle :: QueryQualifies -> Depth -> PprStyle, but not exporting any functions that can create a Depth value, and therefore not actually being usable. If Hoogle and Haddock could be used over the GHC API, it would substantially improve the development experience!
I've also worked more on defining the database format. I am about to start work on the implementation today. I've also added a few more command line flags, but mainly as placeholders.
Next week: Database creation and text searches (looking back I see some similarity to last week!)
User visible changes: haddock --hoogle now works.
Monday, June 09, 2008
GSoC Hoogle: Week 2
This week I submitted my PhD thesis, emptied my entire rented house of furniture, spent £96 on petrol, drove (or was driven) 400 miles, travelled a similar distance by train, have been to the north of Scotland and am currently working on a borrowed Mac in London. Needless to say, it's been rather busy, but now all the excitement is over and I should be able to focus properly on Hoogle.
In the last week I've been focusing on the database, the store of all the function names and type signatures, so a very critical piece of information. I want to support fast searching, which doesn't slow down as the number of known functions increases - a nasty property of the current version. For text searching, the trie data structure has this nice property, and can deal with searching for substrings. For fuzzy type searching, things are a lot more complex. However, I think I have an algorithm which is fast (few operations), accurate (gives better matches), scalable (independent of the number of functions in the database) and lazy (returns the best results first). The idea is to have a graph of function results, and then navigate this graph to find the best match.
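A minimal trie sketch (illustrative, not Hoogle's database code) shows why lookup cost depends on the length of the query rather than on how many names are stored: each character of the query follows one edge.

```haskell
import qualified Data.Map as Map
import Data.Map (Map)

-- A trie node: values finishing at this node, plus labelled children.
data Trie = Trie [String] (Map Char Trie)

empty :: Trie
empty = Trie [] Map.empty

-- Insert a value under the given key.
insert :: String -> String -> Trie -> Trie
insert []     v (Trie vs cs) = Trie (v:vs) cs
insert (c:cs) v (Trie vs m)  =
    Trie vs (Map.insert c (insert cs v (Map.findWithDefault empty c m)) m)

-- All stored values whose key starts with the given prefix.
lookupPrefix :: String -> Trie -> [String]
lookupPrefix []     t = collect t
  where collect (Trie vs m) = vs ++ concatMap collect (Map.elems m)
lookupPrefix (c:cs) (Trie _ m) =
    maybe [] (lookupPrefix cs) (Map.lookup c m)

main :: IO ()
main = do
    let t = foldr (\k -> insert k k) empty ["map", "mapM", "filter"]
    print (lookupPrefix "map" t)
```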
Most of the database work has been theoretical, but I have done some coding. In particular, I have started on the database creation code, and polished the flag argument interaction code some more. Part of the development required the Derive tool, and in doing this work I noticed a few deficiencies. In particular, if you run Windows and run derive over a UNIX line-ending file, the tool will generate a Windows line-ending file. This problem, and a few others, are now fixed.
Next week: Database creation and searching. I want text searches to work by the end of the week.
User visible changes: The --help flag prints out information on the arguments.
PS. I was looking forward to seeing some blog posts from the other Haskell summer of code students on the Haskell Planet. If any Haskell GSoC student does have a blog, please ask for it to be included!
Sunday, June 01, 2008
GSoC Hoogle: Week 1
I started my Google Summer of Code project on Hoogle at the beginning of this week. In my initial application I promised to make my weekly updates via blog, so here is the first week's report:
I've only done about half a week's work on Hoogle this week, because I'm handing in my PhD thesis early next week, and because I'm moving house on Wednesday. I spent 14 hours on Saturday moving furniture, and many more hours than that on my thesis! I should be fully devoted to GSoC by the middle of next week.
Despite all the distractions, I did manage to start work on Hoogle. I created a new project for Hoogle at the community.haskell.org site, and an associated darcs repo at http://code.haskell.org/hoogle. I've done a number of things on Hoogle:
- Improved the developer documentation in some places
- Reorganised the repo, moving away dead files
- Worked on command line flags and their parsing
- Added a framework for running regression tests
- Organised the command line/CGI division
I've started work from the front, and am intending to first flesh out an API and command line client, then move on to the web front end. The biggest change from the current implementation of Hoogle will be that there is one shared binary, which will be able to function in a number of modes. These modes will include running as a web server, as a command line version, as an interactive (Hugs/GHCi style) program, documentation location etc. This will allow easier installation, and let everyone host their own web-based Hoogle without much effort.
Next week: I hope to move towards the command line client and central Hoogle database structure. I also hope to chat to the Haddock 2 people, and try and get some integration similar to Haddock 1's --hoogle flag.
User visible changes: Hoogle 4 as it currently stands is unable to run searches, although hoogle --test will run some regression tests.
Thursday, May 22, 2008
Interactive TagSoup parsing
I've written quite a few programs using the tagsoup library, but have never really used the library interactively. Today I was wondering how many packages on hackage use all lower case names, compared to those starting with an initial capital. This sounds like a great opportunity to experiment! The rest of this post is a GHCi transcript, with my comments on what I'm doing prefixed with -- characters.
$ ghci
GHCi, version 6.8.2: http://www.haskell.org/ghc/ :? for help
Loading package base ... linking ... done.
-- load some useful packages
Prelude> :m Text.HTML.TagSoup Text.HTML.Download Data.List Data.Char Data.Maybe
Prelude Data.Maybe Data.Char Data.List Text.HTML.Download Text.HTML.TagSoup>
-- ouch, that prompt is a bit long - we can use :set prompt to shorten it
-- side note: I actually supplied the patch for set prompt :)
:set prompt "Meep> "
-- lets download the list of packages
Meep> src <- openURL "http://hackage.haskell.org/packages/archive/pkg-list.html"
... src scrolls past the screen ...
-- parse the file, dropping everything before the packages
Meep> let parsed = dropWhile (~/= "<h3>") $ parseTags src
-- grab the list of packages
Meep> let packages = sort [x | a:TagText x:_ <- tails parsed, a ~== "<a href>"]
-- now we can query the list of packages
Meep> length packages
648
Meep> length $ filter (all isLower) packages
320
Meep> length $ filter ('_' `elem`) packages
0
Meep> length $ filter ('-' `elem`) packages
165
Meep> length $ filter (any isUpper . dropWhile isUpper) packages
100
Meep> length $ filter (isPrefixOf "hs" . map toLower) packages
47
Meep> length $ filter (any isDigit) packages
37
Meep> reverse $ sort $ map (\(x:xs) -> (1 + length xs,x)) $ group $ sort $ concat packages
[(484,'e'),(374,'a'),(346,'r'),(336,'s'),(335,'t'),(306,'i'),(272,'l'),(248,'c'),(247,'n'),(240,'o'),(227,'p'),(209,'h'),(185,'-'),(171,'m'),(159,'d'),(126,'g'),(112,'b'),(96,'u'),(87,'y'),(78,'k'),(76,'f'),(74,'x'),(58,'S'),(53,'H'),(35,'w'),(33,'v'),(29,'q'),(29,'L'),(27,'A'),(26,'F'),(24,'D'),(23,'C'),(22,'T'),(16,'P'),(16,'M'),(16,'I'),(16,'G'),(13,'B'),(12,'W'),(12,'3'),(12,'2'),(10,'O'),(9,'R'),(9,'1'),(8,'z'),(8,'j'),(8,'E'),(7,'X'),(7,'U'),(7,'N'),(6,'Y'),(6,'V'),(5,'J'),(4,'Q'),(4,'5'),(4,'4'),(3,'Z'),(3,'8'),(3,'6'),(1,'9')]
We can see that loads of packages use lowercase, lots of packages use upper case, quite a few use CamelCase, quite a few start with "hs", none use "_", but lots use "-". The final query figures out which is the most common letter in hackage packages, and rather unsurprisingly, it roughly follows the frequency of English letters.
TagSoup and GHCi make a potent combination for obtaining and playing with webpages.
Sunday, May 18, 2008
Haskell and Performance
There has been lots of discussion on the Haskell mailing lists about the speed of Haskell. There are many conflicting opinions, and lots of different advice. Some of the information on Haskell's performance is written as a sales pitch, some is based on outdated knowledge. Since I've been working on optimisation for a while, I thought I'd try and give a snapshot of Haskell performance. Most of the following is personal opinion, and others could quite validly disagree. Since GHC is the best performing Haskell compiler, I have used Haskell to mean GHC with the -O2 flag throughout.
High-level Haskell is not as fast as C
If you write Haskell in a standard manner, it is unlikely to perform as fast as C. In most cases, linked-lists are slower than arrays. Laziness is more expensive than strictness. The Haskell code will almost always be shorter, and more concise, since it will abstract over low-level detail. But by writing that low-level detail in the C code, you are likely to produce faster code.
Low-level Haskell is competitive with C
If you use GHC, with unboxed operations, written in a low-level style, you can obtain similar performance to C. The Haskell won't be as nice as it was before, but will still probably express fewer details than the C code. Writing in such a low-level manner requires more knowledge of Haskell, and is probably something that a beginner should not be attempting. However, for a critical inner loop, low-level Haskell is a very attractive option.
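As a small illustration (my own example, with made-up function names, not from the original post), here is the same computation written in a high-level style and as a strict, accumulator-passing loop of the kind GHC's optimiser can turn into a tight unboxed loop with -O2:

```haskell
{-# LANGUAGE BangPatterns #-}

-- High-level style: concise, but builds intermediate structure.
sumHigh :: [Int] -> Int
sumHigh = foldr (+) 0

-- Low-level style: a strict accumulator avoids building thunks,
-- so with -O2 the loop can run on unboxed machine integers.
sumLow :: [Int] -> Int
sumLow = go 0
  where
    go !acc []     = acc
    go !acc (x:xs) = go (acc + x) xs

main :: IO ()
main = print (sumHigh [1..100], sumLow [1..100])
```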
Haskell's Code Generator is weak
The back end assembly generator in GHC is a weak link, but improvements are being carried out. After this work has been finished, it is likely that low-level Haskell will be able to produce nearly identical assembly code to C.
Some Haskell libraries are poorly optimised
Some of the central Haskell libraries have functions which are badly optimised. For example, the MTL library is known to be poorly performing. The words and isSpace functions in the base library aren't very good. These issues are being addressed over time; the Binary and ByteString libraries have fixed two holes, and a new implementation of inits has been contributed. Over time, more issues will be identified and fixed, improving the speed of all code.
Haskell's multi-threaded performance is amazing
A lot of clever people have done a lot of clever work on making multi-threaded programming in Haskell both simple and fast. While low-level speed matters for general programming, for multi-threaded programming there are lots of much higher-level performance considerations. Haskell supports better abstraction, and can better optimise at this level, outperforming C.
Reading the Core is not easy
A standard advice to people trying to optimise Haskell is to read the Core - the low-level functional language used as an intermediate form in the compiler. While Core provides much useful information about what optimisations were performed, it isn't easy to read, and takes a lot of practice. Some effort has been done to make reading Core easier, but I still wouldn't recommend it for beginners.
Optimisation without profiling is pointless
People often want to make programs run faster. In general, this activity is a waste of time. I recently wrote a program for the HCI group at my university, which takes 10 minutes to run, and requires 4Gb of RAM, on a very expensive machine. I haven't even bothered to profile the program, because I have better things to do. Unless the speed of something actually makes a difference, you should not be spending excessive effort on optimisation.
If you have determined that the program in question is running too slowly, then profile. After profiling, you can usually identify some small part of the program that needs optimisation. Too often there is a focus on speeding up something that is not slow enough to make a difference.
The trend is for higher-level optimisation
As time goes by, higher-level programs keep getting faster and faster. The ByteString work allows programmers to write high-level programs that are competitive with C. Performance enhancements are being made to the compiler regularly; pointer tagging, constructor specialisation etc. are all helping to improve things. More long term projects such as Supero and NDP are showing some nice results. Optimisation is a difficult problem, but progress is being made, allowing programs to be written at a higher level.
My goal is that one day Haskell programs will be written in a very declarative, high-level style - and outperform C at the same time. I think this goal is obtainable, albeit some way in the future.
Saturday, May 10, 2008
Bad strictness
Haskell has one primitive construct for enforcing strictness, seq :: a -> b -> b. The idea is that the first argument is evaluated to weak-head normal form (WHNF), then the second argument is evaluated to WHNF and returned. WHNF is reduction until the outermost bit is available - either a function value, or an outer constructor.
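A quick demonstration of WHNF (my own example, not from the post): seq only evaluates to the outermost constructor, so a bottom hidden inside a constructor is left untouched, while a bottom at the outermost level throws.

```haskell
import Control.Exception (SomeException, evaluate, try)

main :: IO ()
main = do
    -- Just is already a constructor, so seq stops there and the
    -- undefined inside is never forced.
    print (seq (Just undefined) "ok")
    -- A bare undefined has no outer constructor, so seq forces it.
    r <- try (evaluate (seq undefined "no")) :: IO (Either SomeException String)
    case r of
        Left _  -> putStrLn "seq undefined threw, as expected"
        Right s -> putStrLn s
```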
You can model the behaviour by introducing an evaluate function, in a lower-level language, and showing how to perform reduction:
evaluate (seq a b) = do
    a' <- evaluate a
    b' <- evaluate b
    return b'
The evaluate function must return an evaluated argument, and it wants to return b which is not already evaluated, so it must make a recursive call. The evaluate function for id, which simply returns its argument, is:
evaluate (id a) = do
    a' <- evaluate a
    return a'
Notice that even though id does "nothing", it still has to evaluate its argument. Of course, evaluate (id x) is the same as evaluate x, so id performs no additional work.
Haskell is lazy, so if an expression has already been evaluated, then the evaluate call will be incredibly cheap, and just return the previous result. So let's consider the result of calling seq with the same argument twice.
evaluate (seq a a) = do
    a' <- evaluate a
    return a'
This time the second evaluation of a is skipped, as a is already evaluated. We can easily see that evaluation of seq a a is exactly equivalent to a. This means that any code which writes a `seq` a is wrong. There is plenty of this code around, and one example (which prompted me to write this) is on slide 15 of this talk.
The other classic seq related mistake is id $! x. The ($!) operator is for strict application, and is defined:
f $! x = x `seq` f x
For the particular instance of id $! x, we obtain x `seq` id x. Of course, all that id x does is evaluate x, so again there is no change from just writing x.
There are valid uses of seq, but any time you see either of the following constructs, you know the programmer got it wrong:
x `seq` x
id $! x
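A runnable sketch confirming that both constructs are no-ops: each prints exactly what plain print x would.

```haskell
main :: IO ()
main = do
    let x = 2 + 2 :: Int
    print x            -- 4
    print (x `seq` x)  -- 4: seq forces x, but print forces it anyway
    print (id $! x)    -- 4: ($!) forces x, then id does nothing extra
```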
Monday, April 21, 2008
Summer of Code 2008
This year I am going to be participating in the Google Summer of Code as a student, for the haskell.org organisation, on the Haskell API Search as an interface to Hackage project - aka Hoogle 4. I will be mentored by Niklas Broberg, author of various tools including Haskell Source Extensions, which is already used by Hoogle. My project link gives the summary I gave for the project, but below I've posted the interesting bits from my full application. I am going to be posting my progress at least once a week once the project phase starts (in about six weeks' time). I welcome any comments!
What is the goal of the project you propose to do?
There are two main goals:
1) Make Hoogle more useful to the community, along the same path as it is currently used.
2) Make Hoogle suitable to use as the standard interface to Hackage.
Can you give some more detailed design of what precisely you intend to achieve?
# Removal of all bugs
Hoogle 3 has a number of embarrassing bugs, some of which are not easily fixed. The nastiest of these is to do with monads, which are horribly mistreated. Since I now know the underlying issues which have caused a problem with Hoogle 3, things like higher-kinded type classes can be solved in a more principled manner.
# Support for some Haskell type extensions
Hoogle 3 does not deal with multi-parameter type classes. I would like to support a variety of type system extensions, primarily by mapping them on to Haskell 98 equivalent types.
# Faster searching
The current implementation is O(n) in the number of functions in the library, where the constant factor is absolutely massive. I wish to make text searching O(s), where s is the length of the search string, and have an incredibly low constant overhead -- using a lazy file-based trie.
The type searching also needs a massive speed up. I have some ideas on how to proceed, but it is a difficult problem! I will spend a small amount of time investigating this problem, but may have to use a simpler algorithm, rather than delay the rest of the project.
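The trie idea behind the O(s) claim can be sketched as follows. This is a toy in-memory version for illustration only, not Hoogle's planned lazy file-based structure: lookup walks one map per character, so the cost depends on the length of the search string, not on the number of entries.

```haskell
import qualified Data.Map as Map

-- A toy trie: each node holds the values for the key ending here,
-- plus a map from the next character to the child trie.
data Trie a = Trie [a] (Map.Map Char (Trie a))

emptyTrie :: Trie a
emptyTrie = Trie [] Map.empty

insertTrie :: String -> a -> Trie a -> Trie a
insertTrie []     v (Trie vs m) = Trie (v:vs) m
insertTrie (c:cs) v (Trie vs m) = Trie vs (Map.alter f c m)
    where f = Just . insertTrie cs v . maybe emptyTrie id

-- O(s) steps in the search string, independent of the entry count
searchTrie :: String -> Trie a -> [a]
searchTrie []     (Trie vs _) = vs
searchTrie (c:cs) (Trie _ m)  = maybe [] (searchTrie cs) (Map.lookup c m)
```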
# Better deployment
Currently there is Hoogle for the base libraries, and a special (very hacked) version that supports Gtk2hs only. I have received several requests for custom Hoogle instances for tools such as XMonad, Ycr2js, wxHaskell etc. The new Hoogle will make deployment of individual versions for specific packages easy.
# Support for multiple packages
I wish to support searching through every package on Hackage at once. This requires a massive speed up in the searching algorithms.
# Generalised text searching
By searching both function names, and also cabal package descriptions, Hoogle can be much more useful in finding packages, as opposed to individual functions.
# Better Design
Hoogle 3 is a web application, with a hacked-on command line program. Hoogle 4 will be a central API which can be reused by IDE tools, and also used to build the web interface and the command line application.
# Generalised interface to all of Cabal
Hopefully all the above goals will result in a tool that is suitable to be an interface to Cabal.
What deliverables do you think are reasonable targets? Can you outline an approximate schedule of milestones?
I would plan to release a beta of Hoogle 4 approximately half way through the project, as a web application. Much of the initial design has been done, so this is primarily hacking time.
I would then hope to complete the final hackage integration for the second half. This stage will require discussion with the cabal people, and will be a combination of design, implementation and server administration/setup.
In what ways will this project benefit the wider Haskell community?
Hoogle is already of use to the community, but has never seen a final release, and has a reasonable number of known bugs. This project would produce a polished version of a tool for which we already know there is huge demand.
Hackage is working well, and gaining new packages every day. As the number of packages increases, the interface to hackage must be updated to handle this volume. Discussions with some of the hackage/cabal team seem to suggest that a search interface is the way forward. By making Hackage easier to use, everyone benefits.
Saturday, April 12, 2008
darcs Feature Request (Part II)
I previously requested a feature for darcs. I always pull from an http repo, and push over SSH. I have to push using --no-set-default and type the ssh repo in full, which I automate with a .bat file in each repo.
Today I noticed that darcs has _darcs/prefs/repos, which seems to list the repos that darcs has used. In one of my typical repo files, I have an http entry and an SSH entry. To get darcs to behave the way I want, all I need to do is push using the first non-http repo in that list.
I have implemented my version of the darcs push command inside my paper tool, the code is all online here. Now I can delete all my push.bat scripts, and just type paper push from any darcs repo. As an added bonus, I now don't need to change to the root directory to perform a push.
It would be really nice if someone could incorporate this feature into the main darcs codebase. However, I'm quite happy using my paper tool for now. I certainly don't have time to patch, or even build darcs :-)
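The push-target logic is small enough to sketch. Here pushTarget is a hypothetical helper, not the actual code from my paper tool; it assumes the one-location-per-line format of _darcs/prefs/repos.

```haskell
import Data.List (find, isPrefixOf)

-- Pick the first repo location that is not accessed over http.
pushTarget :: String -> Maybe String
pushTarget = find (not . ("http" `isPrefixOf`)) . lines

main :: IO ()
main = do
    repos <- readFile "_darcs/prefs/repos"
    case pushTarget repos of
        Just repo -> putStrLn ("would push to: " ++ repo)
        Nothing   -> putStrLn "no non-http repo recorded"
```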
Wednesday, April 09, 2008
TagSoup Parsing: Dictionary Extraction
I've just read issue 10 of The Monad.Reader. It's a great issue, including a tutorial on using the new GHCi debugger, and how to write an efficient Haskell interpreter in Haskell. The running example for the GHCi debugger is parsing the computer dictionary and extracting descriptions from keywords, using the TagSoup library. The article starts with an initial version of the extraction code, then fixes some mistakes using the debugger present in GHCi. The code was written to teach debugging, not as a demonstration of TagSoup. This post explains how I would have written the program.
The original program is written in a low-level style. To search for a keyword, the program laboriously traverses through the file looking for the keyword, much like a modern imperative language might. But Haskell programmers can do better. We can separate the task: first parsing the keyword/description pairs into a list; then searching the list. Lazy evaluation will combine these separate operations to obtain something just as efficient as the original. By separating the concerns, we can express each at a higher-level, reducing the search function to a simple lookup. It also gives us more flexibility for the future, allowing us to potentially reuse the parsing functions.
I have fixed a number of other bugs in the code, and my solution is:
module Main where
import Text.HTML.TagSoup
import Data.Maybe
main = do
    putStr "Enter a term to search for: "
    term <- getLine
    html <- readFile "Dictionary.html"
    let dict = parseDict $ parseTags html
    putStrLn $ fromMaybe "No match found." $ lookup term dict
parseDict :: [Tag] -> [(String,String)]
parseDict = map parseItem
          . sections (~== "<dt>")
          . dropWhile (~/= "<div class=glosslist>")

parseItem :: [Tag] -> (String,String)
parseItem xs = (innerText a, unwords $ words $ innerText b)
    where (a,b) = break (~== "<dd>") (takeWhile (~/= "</dd>") xs)
Instead of searching for a single keyword, I parse all keywords using parseDict. The parseDict function first skips over the gunk at the top of the file, then finds each definition, and parses it. The parseItem function spots where tags begin and end, and takes the text from inside. The unwords $ words expression is a neat trick for normalising the spacing within an arbitrary string.
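The unwords $ words trick in isolation: words splits on any run of whitespace (spaces, tabs, newlines), and unwords rejoins the pieces with single spaces.

```haskell
main :: IO ()
main = putStrLn (unwords $ words "  too \n   much\tspace  ")
-- prints: too much space
```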
This revised program is shorter than the original, I find it easier to comprehend, and it provides more functionality with fewer bugs. The TagSoup library provides a robust base to work from, allowing concise expression of HTML/XML extraction programs.
Tuesday, April 08, 2008
Optional Parameters in Haskell
I use optional parameters in my TagSoup library, but it seems not to be a commonly known trick, as someone recently asked if the relevant line was a syntax error. So, here is how to pass optional parameters to a Haskell function.
Optional Parameters in Other Languages
Optional parameters are in a number of other languages, and come in a variety of flavours. Ada and Visual Basic both provide named and positional optional parameters. For example, given the definition:
Sub Foo(b as Boolean = True, i as Integer = 0, s as String = "Hello")
We can make the calls:
Call Foo(s = "Goodbye", b = False)
Call Foo(False, 1)
In the first case we give named parameters, in the second we give all the parameters up to a certain position.
In some languages, such as GP+, you can say which parameters should take their default values:
Call Foo(_, 42, _)
Optional Parameters in Haskell
Haskell doesn't have built-in optional parameters, but using the record syntax, it is simple to encode named optional parameters.
data Foo = Foo {b :: Bool, i :: Integer, s :: String}
defFoo = Foo True 0 "Hello"
foo :: Foo -> IO ()
Now we can pass arguments by name, for example:
foo defFoo{s = "Goodbye", b = False}
This syntax takes the value defFoo, and replaces the fields s and b with the associated values. Using a type class, we can abstract this slightly:
class Def a where
    def :: a

instance Def Foo where
    def = defFoo
Now we can make all functions taking default arguments use the def argument as a basis, allowing type inference and type classes to choose the correct default type. Even still, optional parameters in Haskell are not quite as neat as in other languages, but the other features of Haskell mean they are required less often.
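Putting the pieces together as a runnable sketch (the body of foo is invented for illustration; only the record-with-defaults pattern is from the post):

```haskell
data Foo = Foo {b :: Bool, i :: Integer, s :: String}

class Def a where
    def :: a

instance Def Foo where
    def = Foo True 0 "Hello"

-- a hypothetical function taking its options as a record
foo :: Foo -> IO ()
foo opts = putStrLn (s opts ++ " " ++ show (i opts) ++ " " ++ show (b opts))

main :: IO ()
main = do
    foo def                            -- all defaults: Hello 0 True
    foo def{s = "Goodbye", b = False}  -- override two fields by name
```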
This technique has been used in TagSoup, particularly for the parseTagOptions function. I've also seen this technique used in cpphs with the runCpphs function.
Tuesday, April 01, 2008
Automated Proof Reading
I've been spending the last few months writing papers, which requires lots and lots of proof reading. Yesterday Tom shared one trick he'd used to help with his thesis. He took his thesis, ran detex over it, then used a text-to-speech program to read it back to him. This trick ensures you pick up all the subtle things like repeated words, and forces you to listen to every word, not skim reading bits. I thought this was a great idea, so implemented something similar in my paper preparation tool (release forthcoming), so typing paper talk reads the paper back to you, converting the LaTeX in a sensible way, to make it as readable as possible.
Here is the result of running paper talk on the introduction section of my Catch paper:
The audio is a little clearer before being compressed for upload, but still has a very clunky feel. It is surprisingly useful - I made about 12 minor changes as a result. It has some oddities, for example ML becomes millilitre, and some pause issues with brackets, which the preprocessor sorts out. The entire paper takes 49 minutes to read, but I think I will be doing this with all papers from now on.
Sunday, March 30, 2008
Toddler's play with HTML in Haskell
I just read a blog article entitled Kid's play with HTML in Haskell, where the author extracts some information from an HTML document, using the Haskell XML Toolbox. I have an alternative XML/HTML library, TagSoup, so I decided to implement their problem with my library.
The Problem
Given an HTML file, extract all hyperlinks to mp3 files.
In TagSoup
[mp3 | TagOpen "a" atts <- parseTags txt
     , ("href",mp3) <- atts
     , takeExtension mp3 == ".mp3"]
The code is a list comprehension. The first line says use TagSoup to parse the text, and pick all "a" links. The second line says pick all "href" attributes from the tag you matched. The final line uses the FilePath library to check the extension is mp3.
A Complete Program
The above fragment is all the TagSoup logic, but to match the exact interface of the original code, we can wrap it up like so:
import System.FilePath
import System.Environment
import Text.HTML.TagSoup
main = do
    [src] <- getArgs
    txt <- readFile src
    mapM_ putStrLn [mp3 | TagOpen "a" atts <- parseTags txt
                        , ("href",mp3) <- atts
                        , takeExtension mp3 == ".mp3"]
Summary
If you have a desire to quickly get a bit of information out of some XML/HTML page, TagSoup may be the answer. It isn't intended to be a complete HTML framework, but it does nicely optimise fairly common patterns of use.
Tuesday, March 11, 2008
Poor Ada Error Message
I have been demonstrating on the York University "Algorithms and Data Structures" course for 4 years now. As part of the course, first year students learn Ada. A lot of the error messages are really bad - but over time I've created a mental mapping between the message and the cause. I am now fairly fluent at recognising what mistake a student has made, given the exercise they are attempting and the error message. But yesterday I encountered a brand new misleading error message.
The error message was:
bad.adb:12:22: actual for "N" must be a variable
As always, the most useful thing in the error message is the line number. I read enough of the error message to check whether it's a parse error, type error or something else, then head for the line mentioned. (I follow this same tactic in all languages, not just Ada.)
12: Add_Cell(I, Next(N));
Nothing obviously wrong about this statement, so I read the error message. It seems to want N to be a variable. But I already know that N is a variable, or at the very least a parameter, so this condition seems to be met.
Next step is to head to the definitions of Next and Add_Cell, to see if they can shed some light on the situation.
function Next(N: List) return List is ...
procedure Add_Cell(I: Integer; N: in out List) is ...
The in out in Add_Cell can be read as "pass as a pointer". Aha, maybe the error message is complaining that the second argument to Add_Cell can't be made a pointer, as it's a return value from a function. That would explain it, and indeed, that turned out to be the cause of the problem. But back to the error message, what was it trying to tell us?
bad.adb:12:22: actual for "N" must be a variable
12: Add_Cell(I, Next(N));
When the error message refers to N it isn't talking about the variable N I can see, but the second argument of Add_Cell, which is also called N. If the function being called were in a separate library, it would have been even harder to understand. A more helpful error message might have been:
bad.adb:12:22: expression passed as the second argument to Add_Cell must be a variable
Found: Next(N)
Expected: A variable
Reason: Second argument of Add_Cell is declared "in out"
In general compiler error messages should be in terms of the line where the error resides, not requiring a large amount of global knowledge. The error can be resolved, but without help from the message. All compilers have bad error messages in some circumstances, but this one seems almost malicious!
Monday, March 10, 2008
Sorting At Speed
Sorting is currently a hot topic within the Haskell community. Christopher brought it up in a recent thread on the mailing list, and this weekend I ended up spending several hours looking at sort routines.
I was browsing through the Yhc standard libraries, as one does on the weekend, and was drawn to Yhc's sort function. It had some undesirable characteristics for one of the projects I was working on, so I wondered if other Haskell systems used different implementations. I checked GHC, and discovered that their sort was different. In general, when Yhc and GHC have different implementations of a standard library function, the GHC one is better tuned for performance. I decided to replace the Yhc sort function with the GHC one, but before doing so, thought a quick performance test was in order. So I came up with something simple:
import Data.List (sort)

test = do
    src <- readFile "Sort.hs"
    print $ ordered $ sort $ sort $ reverse $ sort src

ordered xs = and $ zipWith (<=) xs (tail xs)
The file "Sort.hs" was simply the source code of the program itself. The code sorts the contents of this file, then reverses it, sorts it and sorts it again. This means that the program performs one sort over semi-random data, one over reverse-ordered data and one over ordered data. These are some fairly standard cases that should be checked. This test is not a comprehensive benchmark, but a nice quick indicator.
I ran the Yhc sort function against the GHC version, and was shocked to find that the Yhc code was twice as fast. I ran the benchmark under Yhc, GHC and Hugs (using reduction count in Hugs), and in all cases the performance was doubled. I was not expecting this result!
The code for the GHC sort is:
mergesort :: (a -> a -> Ordering) -> [a] -> [a]
mergesort cmp = mergesort' cmp . map (:[])
mergesort' :: (a -> a -> Ordering) -> [[a]] -> [a]
mergesort' cmp [] = []
mergesort' cmp [xs] = xs
mergesort' cmp xss = mergesort' cmp (merge_pairs cmp xss)
merge_pairs :: (a -> a -> Ordering) -> [[a]] -> [[a]]
merge_pairs cmp [] = []
merge_pairs cmp [xs] = [xs]
merge_pairs cmp (xs:ys:xss) = merge cmp xs ys : merge_pairs cmp xss
merge :: (a -> a -> Ordering) -> [a] -> [a] -> [a]
merge cmp [] ys = ys
merge cmp xs [] = xs
merge cmp (x:xs) (y:ys) = case x `cmp` y of
    GT -> y : merge cmp (x:xs) ys
    _  -> x : merge cmp xs (y:ys)
The function works by splitting the list into one-element lists, so that each initial list is trivially ordered. These lists are then merged in pairs until a single list is left. For example, given the input "sort":
"s" "o" "r" "t"
"os" "rt"
"orst"
We first split each character into its own list, then merge adjacent pairs. This code corresponds to the standard merge sort. But instead of making each initial list a single element, we could use sequences of increasing elements, for example using the risers function:
risers :: Ord a => [a] -> [[a]]
risers [] = []
risers [x] = [[x]]
risers (x:y:etc) = if x <= y then (x:s):ss else [x]:(s:ss)
where (s:ss) = risers (y:etc)
Now, if we apply risers "sort" we get ["s","ort"]. We can now follow the same merge procedure as before:
"s" "ort"
"orst"
Instead of doing 3 merges, we have done only 1. Given the input "abcd" the effect would have been even more dramatic. We can refine this scheme further, by detecting both ascending and descending chains of elements in the initial list. This technique is used by Yhc, and is based on code originally written by Lennart Augustsson. Knowing the original source of the code, my shock at the performance benefits offered by the Yhc version has decreased substantially.
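To make the idea concrete, here is a sketch of a run-seeded merge sort built from risers. This handles ascending runs only -- the actual Yhc code also detects descending runs -- and sortRuns and mergeAll are names I have invented for illustration:

```haskell
-- Detect ascending runs (as in the post).
risers :: Ord a => [a] -> [[a]]
risers [] = []
risers [x] = [[x]]
risers (x:y:etc) = if x <= y then (x:s):ss else [x]:(s:ss)
    where (s:ss) = risers (y:etc)

-- Standard merge of two ordered lists.
merge :: Ord a => [a] -> [a] -> [a]
merge [] ys = ys
merge xs [] = xs
merge (x:xs) (y:ys)
    | x <= y    = x : merge xs (y:ys)
    | otherwise = y : merge (x:xs) ys

-- Merge adjacent pairs of runs until one list remains.
mergeAll :: Ord a => [[a]] -> [a]
mergeAll []   = []
mergeAll [xs] = xs
mergeAll xss  = mergeAll (pairs xss)
    where pairs (a:b:rest) = merge a b : pairs rest
          pairs xs         = xs

-- Merge sort seeded with runs instead of singleton lists.
sortRuns :: Ord a => [a] -> [a]
sortRuns = mergeAll . risers
```

On "sort" this performs the single merge shown above, and on already-ordered input risers produces one run, so no merges happen at all.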
The GHC sort function should probably be replaced by the one from the Yhc libraries. This would offer increased performance, especially in the case of ordered or reverse-ordered lists. The asymptotic complexity of the two sorts means that there must exist some value of n such that sort [1..n] runs faster in Yhc than GHC (assuming sufficient stack/heap for both). I wonder whether Int32 is capable of expressing such a value...
A side note: I have been playing with the risers function for several years. I've used it as an example of pattern match checking, both specialised to Int and on a general Ord class. I've used it for supercompiling. It's appeared in blog posts, a TFP paper (Mitchell+Runciman 2007), a Haskell Workshop paper (Xu 2006) etc. I originally had this example suggested from a functional programming exam paper, but only today at lunch did I discover its true origins. The risers function had originally been the first-step in a merge sort!
Labels: yhc
Tuesday, March 04, 2008
Lazy Evaluation: Strict vs. Speculative
For the last few days I have been thinking about how to write a low-level program optimiser, based on the ideas from Supero. Supero works at the level of lazy Core expressions, but actual hardware works on a sequence of strict instructions. The idea is to translate the lazy expressions to strict sequences, then borrow the ideas from supercompilation once more. In particular I have been looking at the GRIN approach, which defines such a set of instructions.
The GRIN work is very clever, and has many ideas that I would like to reuse. However, the one aspect that gave me slight concern is the complexity. A GRIN program requires several analysis passes, and many, many transformation rules. While this approach is perfectly acceptable, one of the goals of the Supero work is to make the optimisation process simpler -- comprising a few simple but powerful rules.
I will first explain how strictness works, then how my speculative approach works. Readers who already know about unboxing are encouraged to skip to the speculative section.
Strictness
When doing low-level compilation, one of the most important stages is strictness analysis, and the associated unboxing. To take the example of the factorial function in Haskell:
factorial :: Int -> Int
factorial n = if n > 0 then n * factorial (n-1) else 1
Here it is easy to see that the factorial function always evaluates n. We can also use our knowledge of the definition of Int:
data Int = Int# I#
Where I# is an actual machine integer (possibly stored in a register), and Int# is a lazy box surrounding it. Since we know that factorial will always unwrap its n, we can pass n around without the Int# box. I have made all the conversions from I# to Int explicit using an Int#, but have left all the unboxings implicit. The operators ># etc. are simply unboxed and strict variants of the standard operators.
factorial# :: Int# -> Int
factorial# n# = if n# ># 0 then n# *# factorial (Int# n# - 1) else 1
Also, since we know factorial is strict in its first argument, we can evaluate the first argument to the recursive call strictly. Applying all these optimisations, we can now write:
factorial# :: Int# -> Int
factorial# n# = if n# ># 0 then n# *# factorial (n# -# 1) else 1
We have removed the explicit boxing in the recursive call, and work entirely with unboxed integers. Now factorial is entirely strict. We can even write a wrapper around our strict version, to provide a lazy interface matching the original.
factorial :: Int -> Int
factorial n = factorial# n#
I have used n# to denote the unboxing of n. Now factorial looks like it did before, but operates much faster, on unboxed integers.
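For reference, the same worker/wrapper split can be written in real GHC syntax. Note this is my transcription, not the post's notation: GHC's actual definition is data Int = I# Int#, so the constructor is named I# rather than Int#, and unboxed comparisons go through isTrue#.

```haskell
{-# LANGUAGE MagicHash #-}
import GHC.Exts

-- Wrapper: the lazy, boxed interface. GHC's real Int is  data Int = I# Int#.
factorial :: Int -> Int
factorial (I# n#) = I# (factorial# n#)

-- Worker: a strict loop entirely on unboxed Int# values.
factorial# :: Int# -> Int#
factorial# n# = if isTrue# (n# ># 0#)
                  then n# *# factorial# (n# -# 1#)
                  else 1#
```

Compiling the original lazy definition with -O2 and -ddump-simpl shows GHC performing essentially this split automatically.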
Speculative
I would prefer not to include a strictness analyser in my optimiser, or if one is included, to have it be the result of a series of transformations -- without explicit "stop and analyse" then "use the results" stages. As part of my thoughts on this, I have been trying to work out how to optimise factorial without invoking a strictness analyser.
The speculative transformation I have defined first generates factorial# - I have left out the details of why it decides to do so.
factorial :: Int -> Int
factorial n = if n > 0 then n * factorial (n-1) else 1
factorial# :: Int# -> Int
factorial# n# = if n# ># 0 then n# *# factorial (Int# n# - 1) else 1
This step is entirely safe - we have defined factorial#, but we have not yet written a wrapper that invokes it, even in the recursive case. The factorial# function is equivalent to factorial whenever the initial argument has been evaluated. We have transformed factorial# using only local knowledge at this point. We can also transform factorial, replacing any uses of n which are guaranteed to come after n is evaluated with (Int# n#). This transformation merely reuses the knowledge we have gained by unwrapping n:
factorial n = if n > 0 then Int# n# * factorial (Int# n# - 1) else 1
Now we promote any primitive operations that use only unboxed values. Given (-), it is cheaper to evaluate the subtraction than to store a lazy thunk for it.
factorial n = if n > 0 then Int# n# * factorial (Int# (n# -# 1)) else 1
factorial# n# = if n# ># 0 then n# *# factorial (Int# (n# -# 1)) else 1
We can now use our knowledge that if we know an argument to a function is already evaluated, we can call the strict variant (this corresponds closely to constructor specialisation):
factorial n = if n > 0 then n# *# factorial# (n# -# 1) else 1
factorial# n# = if n# ># 0 then n# *# factorial# (n# -# 1) else 1
We can also replace the * in factorial with *#, as we know we will have to evaluate the result of the function. Now we have ended up with a fast inner loop, operating only on unboxed integers. We have not required strictness information to make any of these transformations.
One way of viewing the difference between strictness and this transformation is the flow of information. In strictness, the caller is informed that a particular argument will be evaluated. In speculative, the callee informs the caller that an argument has already been evaluated. These two concepts are not the same, and while they overlap, there are instances where they differ considerably.
Strict vs. Speculative
Consider the following example:
strict :: Int -> Int
strict x = x `seq` lazy x (x-1) (x+1)
lazy :: Int -> Int -> Int -> Int
lazy a b c = if a == 0 then b else c
Here the lazy function is strict in a, but not in b or c. A strictness analyser would generate a variant of lazy with only the first argument unboxed. In contrast, the speculative variant will determine that x-1 and x+1 should be evaluated, and pass unboxed values in all arguments of lazy, even though lazy may not evaluate b or c.
To see this behaviour in GHC, it helps to make lazy recursive:
module Temp where
strict :: Int -> Int
strict x = x `seq` lazy x (x+1) (x-1)
lazy :: Int -> Int -> Int -> Int
lazy a b c = if a == 0 then lazy b b b else c
Now run with the options ghc Temp.hs -c -O2 -ddump-simpl, and you will see the lazy variant has type lazy :: Int# -> Int -> Int -> Int.
These thoughts are still very preliminary, and there are a number of unanswered questions:
- What is the overlap between strict and speculative?
- Can both variants be combined? (almost certainly yes)
- Is speculative really simpler?
- Is speculative sufficient?
- What are the performance benefits of speculative?
Friday, February 29, 2008
Hoogle 3 Security Bug
I recently released Hoogle 3.1, in response to a security bug spotted by Tillmann Rendel. Checking back through my records, I found that Botje on #haskell had previously spotted the same issue, but at the time I hadn't noticed it was a security bug. The security implications of the bug are very low, and it could not be used to cause any real harm.
The Bug
The bug is that in the Hoogle web interface, user supplied data may end up being shown to the user without escaping. For example, searching for the number 1 results in an error message, which says "Parse Error: Unexpected character '1'". Unfortunately, in Hoogle 3, if that search string had been "1<u>2<u>", then the result would have been "Parse Error: Unexpected character '12'" - i.e. the number 2 would be underlined. If you try this same example in Hoogle 3.1, you do not get any formatting, and instead see the entered tags.
The bug could be provoked in several places:
- The error message, on a parse error.
- The input box, after a search had been entered.
- As the string listed as what the user searched for.
To perturb the input box would require entering a quote character ("), and to perturb the other instances would require an opening angle bracket (<).
The Severity
I am fairly sure the severity of this bug is "incredibly low". As a result of entering a malicious query, the attacker could cause the page displayed to contain whatever was desired. However, the Hoogle online interface has no privileges beyond that of a normal web page, and so can't actually do anything evil. The bug does not permit any supplied code to be executed on the server.
There is only one malicious use I could think of: browser redirects. Sometimes evil companies will send out spam mail, with links such as "click here to go to example.com and order Viagra". One anti-spam measure is to reject all emails linking to a particular domain name. By crafting a URL, it was possible for a link to Hoogle to redirect to another domain, thus appearing that the initial link was to a trusted website. The spam recipient still goes to the original page, but it may defeat their spam filters.
Checking the server logs for Hoogle shows that no one ever actually exploited the flaw to perform a redirect, or even to insert a <script> tag - the first step to any such exploit.
The Fix
I had to make two fixes to the code. I use Haskell Source Extensions to generate most of the HTML shown in Hoogle. As part of that, I have a ToXML class that automatically converts values to an XML representation, which is then rendered. The ToXML instance for String did not escape special HTML characters; now it does. I wrote the ToXML instances myself, instead of relying on those supplied with the associated HSP library, and thus introduced the bug.
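The essential change is an escaping pass of this shape (a sketch only -- escapeHTML is an illustrative name, not Hoogle's actual code):

```haskell
-- Escape the characters that allow HTML injection; everything else
-- passes through unchanged.
escapeHTML :: String -> String
escapeHTML = concatMap esc
    where
        esc '<' = "&lt;"
        esc '>' = "&gt;"
        esc '&' = "&amp;"
        esc '"' = "&quot;"
        esc c   = [c]
```

With something like this inside the ToXML instance for String, a query containing <u> tags renders as literal text rather than underlining anything.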
The only other code that generates HTML uses a formatted string type, which can represent hyperlinks and various formatting, and can be rendered as either console escape characters or as HTML. Since this part of the code was written before moving to Haskell Source Extensions, it generates raw strings. This generating code was also patched to escape certain characters.
As a result of using libraries and abstractions, it wasn't necessary to fix each of the security flaws one by one, but to fix the interface to the library. In doing so, I have much more confidence that all the security flaws have been tackled once and for all, and that they will not reoccur.
Is Haskell Insecure?
Enhanced security is one of the many advantages that Haskell offers. It is not possible to overrun a buffer and conduct stack smashing attacks on a Haskell program. Passing query strings will not overwrite global variables, and escaping cannot cause user code to be executed on the server. However, when Haskell code generates HTML, it is not immune from code injection attacks on the client side.
In the beginning Hoogle did not use any HTML generation libraries. As I have slowly moved towards Haskell Source Extensions, I have benefited from better guarantees about well-formed HTML. By creating appropriate abstractions, and dealing with concerns like escaping at the right level, and enforcing these decisions with appropriate types, the number of places to introduce a security bug is lowered. Hopefully Hoogle will not fall victim to such a security problem in future.
Labels: hoogle