This week I've been trying to get Hoogle 4 to the point where it can replace Hoogle 3. This is the final official week of Google Summer of Code, but I'm planning to continue hacking Hoogle next week, and then as time allows after that.
The priority this week was getting the documentation links working. The problem was not with Hoogle - displaying the links is trivial - but ensuring that Cabal + Haddock + Hoogle + random build scripts combine to generate the correct databases. This work involved lots of little changes in lots of places, but is now working properly. Included in this work is dependency tracking of packages (so that all packages using base know that String = [Char] etc), and merging multiple databases to create a single one.
After the Hoogle database was generated correctly, I started looking at using some of the additional information present. I have now added Haddock documentation inline in the search results. If the documentation is too long to fit comfortably, Hoogle uses AJAX wizzy-ness (or more accurately, DHTML) to allow the user to expand and show all the documentation. I suspect that this will eliminate many cases of the user actually following to the Haddock webpages. This feature is fairly new, and I have pushed it out because its useful - there are still many small improvements that need to be made.
This week I also spent some time attempting to generate documentation for all the Hackage libraries. I had some success, but the computer I am currently using is years old and lacks the necessary processing power. I will tackle this at some point in the future, once I have purchased a new machine (which should be quite soon).
With all these changes, I find Hoogle 4 to be significantly more usable than Hoogle 3. Please give it a try, and give feedback. At this point I'm particularly interested in any issues that would cause you to use Hoogle 3 instead of Hoogle 4.
Hoogle 3: http://haskell.org/hoogle
Hoogle 4: http://haskell.org/hoogle/beta
If there are no major issues, I will be replacing Hoogle 4 as the standard Hoogle sometime next week.
Next week: I will be no longer doing Google Summer of Code :-) I plan to refine some of the existing bits of Hoogle, and ensure that anything I haven't done is in a bug tracker for later.
User visible changes: The web search engine now gives Haddock links and displays Haddock documentation inline.
Showing posts with label soc. Show all posts
Showing posts with label soc. Show all posts
Friday, August 15, 2008
Monday, August 11, 2008
GSoC Hoogle: Week 11
This week I've been releasing lots. Hoogle 4 is finally starting to come together, and should be a worthy replacement for Hoogle 3 very shortly. Rather than go into detail about the past week, I'm just going to give some of the bullet points:
Next week: I'm hoping to work on generating better Hoogle databases, including a Hoogle database for the whole of Hackage. I also have a number of bugs to fix.
User visible changes: Users can download and use Hoogle, and the web interface is online.
- I have released 4 versions of the command line version of Hoogle, available on Hackage. Many bugs have been spotted by some very useful testers, and improvements have been made.
- I have released a web version of Hoogle 4, and encourage feedback.
- I have started to update the wiki Manual, which now contains some details of Hoogle's query syntax.
- I gave a talk at AngloHaskell 2008, which is available online, as slides and an audio stream. All of the other talks were excellent and are well worth listening to.
- I have started to build Hoogle documentation for all of Hackage. The machine I'm doing this on is very slow, so its not a quick process!
Next week: I'm hoping to work on generating better Hoogle databases, including a Hoogle database for the whole of Hackage. I also have a number of bugs to fix.
User visible changes: Users can download and use Hoogle, and the web interface is online.
Sunday, August 03, 2008
GSoC Hoogle: Week 10
This week I've been in Bristol, and am just about to head off to the Harbour Festival. Next week I'm heading off to AngloHaskell 2008, and will be talking about Hoogle type searching on the Saturday.
This week has been type search, yet again. There were issues with algorithmic complexity, combinatorial explosions and other fun stuff. However, its now finished. The type search is now fast enough (you can run Hoogle in Hugs against the core libraries) and gives good results. Rather than describe type searching, its easier to give an example. Searching for (a -> b) -> [a] -> [b] in Hoogle 3 gives:
But in Hoogle 4 gives:
I think the new results are better. For more details, come to the AngloHaskell talk.
Next Week: I want to release a public beta of Hoogle 4 in command line form. I want to start on the web search engine and tweak the ranking algorithm. I'll also be writing up type search in the form of a presentation.
User Visible Changes: Type search works well and fast.
This week has been type search, yet again. There were issues with algorithmic complexity, combinatorial explosions and other fun stuff. However, its now finished. The type search is now fast enough (you can run Hoogle in Hugs against the core libraries) and gives good results. Rather than describe type searching, its easier to give an example. Searching for (a -> b) -> [a] -> [b] in Hoogle 3 gives:
Prelude.map :: (a -> b) -> [a] -> [b]
Data.List.map :: (a -> b) -> [a] -> [b]
Control.Parallel.S... parMap :: Strategy b -> (a -> b) -> [a] -> [b]
Prelude.scanr :: (a -> b -> b) -> b -> [a] -> [b]
Data.List.scanr :: (a -> b -> b) -> b -> [a] -> [b]
Prelude.scanl :: (a -> b -> a) -> a -> [b] -> [a]
Data.List.scanl :: (a -> b -> a) -> a -> [b] -> [a]
Prelude.concatMap :: (a -> [b]) -> [a] -> [b]
But in Hoogle 4 gives:
Prelude map :: (a -> b) -> [a] -> [b]
Data.List map :: (a -> b) -> [a] -> [b]
Prelude fmap :: Functor f => (a -> b) -> f a -> f b
Control.Applicative <$> :: Functor f => (a -> b) -> f a -> f b
Control.Monad fmap :: Functor f => (a -> b) -> f a -> f b
Control.Monad.Instances fmap :: Functor f => (a -> b) -> f a -> f b
Control.Applicative liftA :: Applicative f => (a -> b) -> f a -> f b
Data.Traversable fmapDefault :: Traversable t => (a -> b) -> t a -> t b
Control.Monad liftM :: Monad m => (a1 -> r) -> m a1 -> m r
Control.Parallel.Strategies parMap :: Strategy b -> (a -> b) -> [a] -> [b]
I think the new results are better. For more details, come to the AngloHaskell talk.
Next Week: I want to release a public beta of Hoogle 4 in command line form. I want to start on the web search engine and tweak the ranking algorithm. I'll also be writing up type search in the form of a presentation.
User Visible Changes: Type search works well and fast.
Thursday, July 24, 2008
GSoC Hoogle: Week 9
I'm off camping for the next weekend in a couple of hours, so this is my early weekly summary. From next week, for a week and a half, I'll actually have an SSH connection so expect to see 200+ patches flow into the Hoogle repo in a few days.
This week I've been rewriting the type search. I spent 3 days writing code, type checking it, but not actually having enough written to run it. Late last night I finished the code, and this morning I debugged it. Amazingly (although actually quite commonly for Haskell) it worked with only minor tweaks. I now have a type search which should scale to large databases and provide fast and accurate searches.
All the basic tests work, and I can generate a Hoogle database for the array library. I still can't generate a Hoogle database for the base library, due to a stack overflow, but I think the cause of the stack overflow has changed and should be easy to debug.
Next Week: A public beta of the command line version is now overdue, and hopefully will happen next week. I aim to finish the actual search side of Hoogle, and move on to the web interface.
User Visible Changes: Type search works again, mostly.
This week I've been rewriting the type search. I spent 3 days writing code, type checking it, but not actually having enough written to run it. Late last night I finished the code, and this morning I debugged it. Amazingly (although actually quite commonly for Haskell) it worked with only minor tweaks. I now have a type search which should scale to large databases and provide fast and accurate searches.
All the basic tests work, and I can generate a Hoogle database for the array library. I still can't generate a Hoogle database for the base library, due to a stack overflow, but I think the cause of the stack overflow has changed and should be easy to debug.
Next Week: A public beta of the command line version is now overdue, and hopefully will happen next week. I aim to finish the actual search side of Hoogle, and move on to the web interface.
User Visible Changes: Type search works again, mostly.
Sunday, July 20, 2008
GSoC Hoogle: Week 8
This week I've been travelling quite a bit, and rather busy with other things. Hopefully next week I'll be able to focus more time on Hoogle!
This week I fleshed out the final part of type search, including support for instances and alpha renaming of variables. After having implemented all the bits in the type search, I tried to convert the base libraries - and it failed, taking up too much time/memory to feasibly finish.
The type search is based around the idea of having nodes in a graph representing types, and then moving between these nodes, at a cost. In order to avoid a blow-up in the number of nodes in the graph, types are alpha-normalised and then alpha-renaming is performed afterwards. Instead of having 3 type nodes for (a,b), (c,d) and (a,a) there is just one named (a,b) and a 3 sets of alpha-renamings. All is good.
However, once you introduce instance restrictions, the types blow up. For example, from the type node a, you can move to Eq a => a, Ord a => a, Show a => a etc. The large (but feasible) number of type nodes, combined with even a small number of class names, gives a huge number of nodes. In fact, for every type variable in a node there are 2^n possible instance contexts it could take. All is bad.
Fortunately there is a solution - move instance checking outside the type graph. This makes the number of nodes feasible, and should work fairly well. It also has a few other benefits, including slightly better scoring and a simpler implementation in a few places. I also came up with a strategy for moving the cost associated with alpha-renaming into the graph search, which further simplifies things.
Of course, all this work takes time, so overall progress is slower than I would have liked. However, the results so far are promising, and the problems of scale seem to have been successfully addressed. The problem of fast and accurate type searching is hard, but hopefully Hoogle 4 will have a scalable solution that should be useful.
Next week: I want to finish the implementation of type searching, and check it works on the full base libraries. A release would be good, although may take place early in the following week.
User visible changes: Creating a database for the base library will now fail with a stack overflow. Hopefully next weeks changes will fix this!
This week I fleshed out the final part of type search, including support for instances and alpha renaming of variables. After having implemented all the bits in the type search, I tried to convert the base libraries - and it failed, taking up too much time/memory to feasibly finish.
The type search is based around the idea of having nodes in a graph representing types, and then moving between these nodes, at a cost. In order to avoid a blow-up in the number of nodes in the graph, types are alpha-normalised and then alpha-renaming is performed afterwards. Instead of having 3 type nodes for (a,b), (c,d) and (a,a) there is just one named (a,b) and a 3 sets of alpha-renamings. All is good.
However, once you introduce instance restrictions, the types blow up. For example, from the type node a, you can move to Eq a => a, Ord a => a, Show a => a etc. The large (but feasible) number of type nodes, combined with even a small number of class names, gives a huge number of nodes. In fact, for every type variable in a node there are 2^n possible instance contexts it could take. All is bad.
Fortunately there is a solution - move instance checking outside the type graph. This makes the number of nodes feasible, and should work fairly well. It also has a few other benefits, including slightly better scoring and a simpler implementation in a few places. I also came up with a strategy for moving the cost associated with alpha-renaming into the graph search, which further simplifies things.
Of course, all this work takes time, so overall progress is slower than I would have liked. However, the results so far are promising, and the problems of scale seem to have been successfully addressed. The problem of fast and accurate type searching is hard, but hopefully Hoogle 4 will have a scalable solution that should be useful.
Next week: I want to finish the implementation of type searching, and check it works on the full base libraries. A release would be good, although may take place early in the following week.
User visible changes: Creating a database for the base library will now fail with a stack overflow. Hopefully next weeks changes will fix this!
Friday, July 11, 2008
GSoC Hoogle: Week 7
This week I've continued to improve the type searching, and generated Hoogle databases for the core libraries. I'm away from a computer all weekend until Tuesday evening, which has happened every ICFP contest for the last 3 years.
I've substantially refactored the type searching, basing it on a proper abstract Graph data type. Now the mechanisms for dealing with type search and graph traversal are separate it is much easier to express clearly what type search is doing. I've also fleshed out the type searching code so that it can accurately perform searches with all the necessary features. There are still a number of tasks to do before the type searching code is finished, but each is a fairly discrete unit of work with well-understood problems.
The other challenge for the week has been generating Hoogle databases for the core libraries - the base library and all the other libraries GHC ships with a release. With these libraries in place, it is feasible to use Hoogle to perform useful queries. The libraries are generated using a combination of Cabal, Haddock and Hoogle. I've made changes in both the Haddock and Hoogle layers so that the full base libraries can now be processed.
In order to deal with the full base libraries there are numerous GHC extensions that must be supported. In particular, Hoogle now supports multi-parameter type classes, higher-ranked types, type operators, unboxed types, unboxed tuples and NDP style arrays. All of these features are translated down into Haskell 98 types, but most closely approximate their behaviour in GHC, and can be used in searches.
Throughout the week I've been profiling the database creation code in Hoogle. The databases for the core libraries come to about 4.5Mb, and are highly optimised for performing searches - often at the cost of making them harder to create. I've halved the time to create databases during the week, using profiling to direct improvements. The time required to process the Core libraries takes 60 seconds, which is a certainly an acceptable timeframe, but could always be faster. Currently the biggest culprit in the profile is the hPutByte function:
Since the underlying databases are written using lots of hTell and hSeek commands, it is not possible to use something like the Data.Binary library. However, if anyone has any suggestions on how to improve performance they would be gratefully received.
Next week: I want to finish off the remaining type search features, and then package up a command line release for Hackage. Hopefully Hoogle 4 will be ready for initial use by early testers.
User visible changes: Type search is more robust, but still not fully featured. Database creation if faster and robust. You can search the base libraries.
I've substantially refactored the type searching, basing it on a proper abstract Graph data type. Now the mechanisms for dealing with type search and graph traversal are separate it is much easier to express clearly what type search is doing. I've also fleshed out the type searching code so that it can accurately perform searches with all the necessary features. There are still a number of tasks to do before the type searching code is finished, but each is a fairly discrete unit of work with well-understood problems.
The other challenge for the week has been generating Hoogle databases for the core libraries - the base library and all the other libraries GHC ships with a release. With these libraries in place, it is feasible to use Hoogle to perform useful queries. The libraries are generated using a combination of Cabal, Haddock and Hoogle. I've made changes in both the Haddock and Hoogle layers so that the full base libraries can now be processed.
In order to deal with the full base libraries there are numerous GHC extensions that must be supported. In particular, Hoogle now supports multi-parameter type classes, higher-ranked types, type operators, unboxed types, unboxed tuples and NDP style arrays. All of these features are translated down into Haskell 98 types, but most closely approximate their behaviour in GHC, and can be used in searches.
Throughout the week I've been profiling the database creation code in Hoogle. The databases for the core libraries come to about 4.5Mb, and are highly optimised for performing searches - often at the cost of making them harder to create. I've halved the time to create databases during the week, using profiling to direct improvements. The time required to process the Core libraries takes 60 seconds, which is a certainly an acceptable timeframe, but could always be faster. Currently the biggest culprit in the profile is the hPutByte function:
hPutByte :: Handle -> Int -> IO ()
hPutByte hndl i = hPutChar hndl $ chr i
Since the underlying databases are written using lots of hTell and hSeek commands, it is not possible to use something like the Data.Binary library. However, if anyone has any suggestions on how to improve performance they would be gratefully received.
Next week: I want to finish off the remaining type search features, and then package up a command line release for Hackage. Hopefully Hoogle 4 will be ready for initial use by early testers.
User visible changes: Type search is more robust, but still not fully featured. Database creation if faster and robust. You can search the base libraries.
Sunday, July 06, 2008
GSoC Hoogle: Week 6
This week I've been tackling type searching. I have just (in the last few minutes) got my first type search to work. At the moment type search is very limited, but all the ideas and scafolding are in place, so should now proceed relatively quickly.
In all previous versions on Hoogle, type searching was O(n), where n is the number of functions in the database. Hoogle compared the type search to each possible answer, computed a closeness score, then at the end wrote out the closest matches. This meant that before the first answer could be given, all functions had to be checked, i.e. the time for the first answer was O(n). As the Hoogle database is about to get massively bigger, this approach is insufficient.
The new version of Hoogle is much cleverer. It works by exploring a graph, following similar ideas to Dijkstra's algorithm, to reach more suitable results first. Typically, the best answers will be given without any search of the graph, and then as the graph is searched more results will appear with lower closeness. With the new scheme the complexity is O(m), where m is the number of results you want. I hope at some point after the SoC is finished to describe the algorithm properly, so others can understand it, and hopefully improve upon it.
Next week: Finishing off type searching, so it supports all the features planned. Build system work, and potentially a cabal pre-release.
User visible changes: Type search works to some degree, but not perfectly. Database debugging options (conversion and dumping to a text file) have been added.
In all previous versions on Hoogle, type searching was O(n), where n is the number of functions in the database. Hoogle compared the type search to each possible answer, computed a closeness score, then at the end wrote out the closest matches. This meant that before the first answer could be given, all functions had to be checked, i.e. the time for the first answer was O(n). As the Hoogle database is about to get massively bigger, this approach is insufficient.
The new version of Hoogle is much cleverer. It works by exploring a graph, following similar ideas to Dijkstra's algorithm, to reach more suitable results first. Typically, the best answers will be given without any search of the graph, and then as the graph is searched more results will appear with lower closeness. With the new scheme the complexity is O(m), where m is the number of results you want. I hope at some point after the SoC is finished to describe the algorithm properly, so others can understand it, and hopefully improve upon it.
Next week: Finishing off type searching, so it supports all the features planned. Build system work, and potentially a cabal pre-release.
User visible changes: Type search works to some degree, but not perfectly. Database debugging options (conversion and dumping to a text file) have been added.
Thursday, June 26, 2008
GSoC Hoogle: Week 5
This week I was going to tackle type searching, but then realised I'm going to spend 6 hours on Friday on a train (hence the weekly update on Thursday), so can spend that time productively working on paper tackling type search. So instead of type search, I worked on a few other pieces, some of which make type search easier:
Haddock Database Generation More patches to get better output from Haddock. The code now handles class methods properly, and deals with some FFI bits.
Lazy Name Searching Searching for a name is now fairly lazy. When searching for a name, Hoogle can return the prefix of the results without doing too much computation to calculate all the results. This work is useful in its own right, but very necessary for the type searching, and can be reused.
Hoogle --info The biggest feature added this week is the --info flag. When this flag is given, Hoogle picks the first result and gives more details, including any Haddock documentation associated with the function. For example:
$ hoogle +tagsoup openurl --info
Text.HTML.Download openURL :: String -> IO String
This function opens a URL on the internet. Any http:// prefix is ignored.
> openURL "www.haskell.org/haskellwiki/Haskell"
Known Limitations:
* Only HTTP on port 80
* Outputs the HTTP Headers as well
* Does not work with all servers
It is hoped that a more reliable version of this function will be placed in a new HTTP library at some point!
Next week: Type searching! See last week for a description of what I hope to achieve.
User visible changes: The --info flag now exists.
Haddock Database Generation More patches to get better output from Haddock. The code now handles class methods properly, and deals with some FFI bits.
Lazy Name Searching Searching for a name is now fairly lazy. When searching for a name, Hoogle can return the prefix of the results without doing too much computation to calculate all the results. This work is useful in its own right, but very necessary for the type searching, and can be reused.
Hoogle --info The biggest feature added this week is the --info flag. When this flag is given, Hoogle picks the first result and gives more details, including any Haddock documentation associated with the function. For example:
$ hoogle +tagsoup openurl --info
Text.HTML.Download openURL :: String -> IO String
This function opens a URL on the internet. Any http:// prefix is ignored.
> openURL "www.haskell.org/haskellwiki/Haskell"
Known Limitations:
* Only HTTP on port 80
* Outputs the HTTP Headers as well
* Does not work with all servers
It is hoped that a more reliable version of this function will be placed in a new HTTP library at some point!
Next week: Type searching! See last week for a description of what I hope to achieve.
User visible changes: The --info flag now exists.
Sunday, June 22, 2008
GSoC Hoogle: Week 4
This week I've stayed in one place, and had lots of opportunity to get on with Hoogle. I've done a number of different things this week:
More on Haddock databases I fixed a number of issues with the Haddock generated Hoogle information. These patches have been submitted back to Haddock.
Binary Defer library I merged the binary defer library into the Hoogle sources, and modified it substantially. Some of the modifications were thanks to suggestions from the Haskell community, particularly David Roundy. The library is now more robust, and is being used as a solid foundation to build the rest of Hoogle on top of.
Text Searching You can now search for words, even multiple words, and the search will be performed. The text searching uses efficient data structures, scales excellently, and returns better results first.
Suggestions These improvements were detailed earlier in the week.
Next week: Type searching. I have various ideas on how to go about this, but it is the most tricky part of the whole project. I hope to come up with the perfect solution by the end of the week, but if not, will come up with something good enough for Hoogle 4 then revise it after the Summer is over (it could easily suck in a whole Summer of time if I am not careful!). Much of the low-level infrastructure is already present, so it is just the search algorithm.
User visible changes: Text searching works. A session with Hoogle as it currently stands:
.
More on Haddock databases I fixed a number of issues with the Haddock generated Hoogle information. These patches have been submitted back to Haddock.
Binary Defer library I merged the binary defer library into the Hoogle sources, and modified it substantially. Some of the modifications were thanks to suggestions from the Haskell community, particularly David Roundy. The library is now more robust, and is being used as a solid foundation to build the rest of Hoogle on top of.
Text Searching You can now search for words, even multiple words, and the search will be performed. The text searching uses efficient data structures, scales excellently, and returns better results first.
Suggestions These improvements were detailed earlier in the week.
Next week: Type searching. I have various ideas on how to go about this, but it is the most tricky part of the whole project. I hope to come up with the perfect solution by the end of the week, but if not, will come up with something good enough for Hoogle 4 then revise it after the Summer is over (it could easily suck in a whole Summer of time if I am not careful!). Much of the low-level infrastructure is already present, so it is just the search algorithm.
User visible changes: Text searching works. A session with Hoogle as it currently stands:
> cabal haddock --hoogle
-- generates tagsoup.txt
> hoogle --convert=tagsoup.txt
Generating Hoogle database
Written tagsoup.hoo
> hoogle +tagsoup is open --color
Text.HTML.TagSoup.Type isTagOpen :: Tag -> Bool
Text.HTML.TagSoup.Type isTagOpenName :: String -> Tag -> Bool
.
Sunday, June 15, 2008
GSoC Hoogle: Week 3
This week I've travelled a further 600 miles by train, but am now starting to get settled for the Summer, and down to work on Hoogle.
My main focus this week has been getting Haddock to generate Hoogle databases. For Haddock 0.8 I added in a --hoogle flag to generate Hoogle databases, and a similar --hoogle flag to Cabal. Unfortunately, for Haddock 2.0, the feature was removed as most of the code got rewritten. Now I've added the feature back, making extensive use of the GHC API to reduce the amount of custom pretty-printing required, and to support more Haskell features. The code has been added to the development Haddock branch, and will be present in the next release.
Most of the challenge was working with the GHC API. It's certainly a powerful body of code, but suffers from being inconsistent in various places and poorly documented. I mainly worked with the code using :i to view the API. I got bitten by various problems such as the Outputable module exporting useful functions such as mkUserStyle :: QueryQualifies -> Depth -> PprStyle, but not exporting any functions that can create a Depth value, and therefore not actually being usable. If Hoogle and Haddock could be used over the GHC API, it would substantially improve the development experience!
I've also worked more on defining the database format. I am about to start work on the implementation today. I've also added a few more command line flags, but mainly as placeholders.
Next week: Database creation and text searches (looking back I see some similarity to last week!)
User visible changes: haddock --hoogle now works.
My main focus this week has been getting Haddock to generate Hoogle databases. For Haddock 0.8 I added in a --hoogle flag to generate Hoogle databases, and a similar --hoogle flag to Cabal. Unfortunately, for Haddock 2.0, the feature was removed as most of the code got rewritten. Now I've added the feature back, making extensive use of the GHC API to reduce the amount of custom pretty-printing required, and to support more Haskell features. The code has been added to the development Haddock branch, and will be present in the next release.
Most of the challenge was working with the GHC API. It's certainly a powerful body of code, but suffers from being inconsistent in various places and poorly documented. I mainly worked with the code using :i to view the API. I got bitten by various problems such as the Outputable module exporting useful functions such as mkUserStyle :: QueryQualifies -> Depth -> PprStyle, but not exporting any functions that can create a Depth value, and therefore not actually being usable. If Hoogle and Haddock could be used over the GHC API, it would substantially improve the development experience!
I've also worked more on defining the database format. I am about to start work on the implementation today. I've also added a few more command line flags, but mainly as placeholders.
Next week: Database creation and text searches (looking back I see some similarity to last week!)
User visible changes: haddock --hoogle now works.
Monday, June 09, 2008
GSoC Hoogle: Week 2
This week I submitted my PhD thesis, emptied my entire rented house of furniture, spent £96 on petrol, drove (or was driven) 400 miles, travelled a similar distance by train, have been to the north of Scotland and am currently working on a borrowed Mac in London. Needless to say, its been rather busy - but now all the excitement is over and I should be able to focus properly on Hoogle.
In the last week I've been focusing on the database, the store of all the function names and type signatures, so a very critical piece of information. I want to support fast searching, which doesn't slow down as the number of known functions increases - a nasty property of the current version. For text searching, the trie data structure has this nice property, and can deal with searching for substrings. For fuzzy type searching, things are a lot more complex. However, I think I have an algorithm which is fast (few operations), accurate (gives better matches), scalable (independent of the number of functions in the database) and lazy (returns the best results first). The idea is to have a graph of function results, and then navigate this graph to find the best match.
Most of the database work has been theoretical, but I have done some coding. In particular, I have started on the database creation code, and polished the flag argument interaction code some more. Part of the development required the Derive tool, and in doing this work I noticed a few deficiencies. In particular, if you run Windows and run derive over a UNIX line-ending file, the tool will generate a Windows line-ending file. This problem, and a few others, are now fixed.
Next week: Database creation and searching. I want text searches to work by the end of the week.
User visible changes: The --help flag prints out information on the arguments.
PS. I was looking forward to seeing some blog posts from the other Haskell summer of code students on the Haskell Planet. If any Haskell GSoC student does have a blog, please ask for it to be included!
In the last week I've been focusing on the database, the store of all the function names and type signatures, so a very critical piece of information. I want to support fast searching, which doesn't slow down as the number of known functions increases - a nasty property of the current version. For text searching, the trie data structure has this nice property, and can deal with searching for substrings. For fuzzy type searching, things are a lot more complex. However, I think I have an algorithm which is fast (few operations), accurate (gives better matches), scalable (independent of the number of functions in the database) and lazy (returns the best results first). The idea is to have a graph of function results, and then navigate this graph to find the best match.
Most of the database work has been theoretical, but I have done some coding. In particular, I have started on the database creation code, and polished the flag argument interaction code some more. Part of the development required the Derive tool, and in doing this work I noticed a few deficiencies. In particular, if you run Windows and run derive over a UNIX line-ending file, the tool will generate a Windows line-ending file. This problem, and a few others, are now fixed.
Next week: Database creation and searching. I want text searches to work by the end of the week.
User visible changes: The --help flag prints out information on the arguments.
PS. I was looking forward to seeing some blog posts from the other Haskell summer of code students on the Haskell Planet. If any Haskell GSoC student does have a blog, please ask for it to be included!
Sunday, June 01, 2008
GSoC Hoogle: Week 1
I started my Google Summer of Code project on Hoogle at the beginning of this week. In my initial application I promised to make my weekly updates via blog, so here is the first weeks report:
I've only done about half a weeks work on Hoogle this week, because I'm handing in my PhD thesis early next week, and because I'm moving house on Wednesday. I spent 14 hours on Saturday moving furniture, and many more hours than that on my thesis! I should be fully devoted to GSoC by the middle of next week.
Despite all the distractions, I did manage to start work on Hoogle. I created a new project for Hoogle at the community.haskell.org site, and an associated darcs repo at http://code.haskell.org/hoogle. I've done a number of things on Hoogle:
I've started work from the front, and am intending to first flesh out an API and command line client, then move on to the web front end. The biggest change from the current implementation of Hoogle will be that there is one shared binary, which will be able to function in a number of modes. These modes will include running as a web server, as a command line version, as an interactive (Hugs/GHCi style) program, documentation location etc. This will allow easier installation, and let everyone host their own web-based Hoogle without much effort.
Next week: I hope to move towards the command line client and central Hoogle database structure. I also hope to chat to the Haddock 2 people, and try and get some integration similar to Haddock 1's --hoogle flag.
User visible changes: Hoogle 4 as it currently stands is unable to run searches, although hoogle --test will run some regression tests.
I've only done about half a weeks work on Hoogle this week, because I'm handing in my PhD thesis early next week, and because I'm moving house on Wednesday. I spent 14 hours on Saturday moving furniture, and many more hours than that on my thesis! I should be fully devoted to GSoC by the middle of next week.
Despite all the distractions, I did manage to start work on Hoogle. I created a new project for Hoogle at the community.haskell.org site, and an associated darcs repo at http://code.haskell.org/hoogle. I've done a number of things on Hoogle:
- Improved the developer documentation in some places
- Reorganised the repo, moving away dead files
- Work on command line flags, parsing them etc.
- Added a framework for running regression tests
- Organise the command line/CGI division
I've started work from the front, and am intending to first flesh out an API and command line client, then move on to the web front end. The biggest change from the current implementation of Hoogle will be that there is one shared binary, which will be able to function in a number of modes. These modes will include running as a web server, as a command line version, as an interactive (Hugs/GHCi style) program, documentation location etc. This will allow easier installation, and let everyone host their own web-based Hoogle without much effort.
Next week: I hope to move towards the command line client and central Hoogle database structure. I also hope to chat to the Haddock 2 people, and try and get some integration similar to Haddock 1's --hoogle flag.
User visible changes: Hoogle 4 as it currently stands is unable to run searches, although hoogle --test will run some regression tests.
Monday, April 21, 2008
Summer of Code 2008
This year I am going to be participating in the Google Summer of Code as a student, for the haskell.org organisation, on the Haskell API Search as an interface to Hackage project - aka Hoogle 4. I will be mentored by Niklas Broberg, author of various tools including Haskell Source Extensions, which is already used by Hoogle. My project link gives the summary I gave for the project, but below I've posted the interesting bits from my full application. I am going to be posting my progress at least once a week once the project phase starts (about 6 weeks time). I welcome any comments!
What is the goal of the project you propose to do?
There are two main goals:
1) Make Hoogle more useful to the community, along the same path as it is currently used.
2) Make Hoogle suitable to use as the standard interface to Hackage.
Can you give some more detailed design of what precisely you intend to achieve?
# Removal of all bugs
Hoogle 3 has a number of embarrassing bugs, some of which are not easily fixed. The nastiest of these is to do with monads, which are horribly mistreated. Since I now know the underlying issues which have caused a problem with Hoogle 3, things like higher-kinded type classes can be solved in a more principled manner.
# Support for some Haskell type extensions
Hoogle 3 does not deal with multi-parameter type classes. I would like to support a variety of type system extensions, primarily by mapping them on to Haskell 98 equivalent types.
# Faster searching
The current implementation is O(n) in the number of functions in the library, where the constant factor is absolutely massive. I wish to make text searching O(s), where s is the length of the search string, and have an incredibly low constant overhead -- using the a lazy file-based trie.
The type searching also needs a massive speed up. I have some ideas on how to proceed, but it is a difficult problem! I will spend a small amount of time investigating this problem, but may have to use a simpler algorithm, rather than delay the rest of the project.
# Better deployment
Currently there is Hoogle for the base libraries, and a special (very hacked) version that supports Gtk2hs only. I have received several requests for custom Hoogle instances for tools such as XMonad, Ycr2js, wxHaskell etc. The new Hoogle will make deployment of individual versions for specific packages easy.
# Support for multiple packages
I wish to support searching through every package on Hackage at once. This requires a massive speed up in the searching algorithms.
# Generalised text searching
By searching both function names, and also cabal package descriptions, Hoogle can be much more useful in finding packages, as opposed to individual functions.
# Better Design
Hoogle 3 is a web application, with a hacked on command line program. Hoogle 4 will be a central API which can be reused from any IDE tools, and also used to build the web interface and the command line application.
# Generalised interface to all of Cabal
Hopefully all the above goals will result in a tool that is suitable to be an interface to Cabal.
What deliverables do you think are reasonable targets? Can you outline an approximate schedule of milestones?
I would plan to release a beta of Hoogle 4 approximately half way through the project, as a web application. Much of the initial design has been done, so this is primarily hacking time.
I would then hope to complete the final hackage integration for the second half. This stage will require discussion with the cabal people, and will be a combination of design, implementation and server administration/setup.
In what ways will this project benefit the wider Haskell community?
Hoogle is already of use to the community, but has never seen a final release, and has a reasonable number of known bugs. This project would produce a polished version of a tool for which we already know there is huge demand.
Hackage is working well, and gaining new packages every day. As the number of packages increases, the interface to hackage must be updated to handle this volume. Discussions with some of the hackage/cabal team seem to suggest that a search interface is the way forward. By making Hackage easier to use, everyone benefits.
What is the goal of the project you propose to do?
There are two main goals:
1) Make Hoogle more useful to the community, along the same path as it is currently used.
2) Make Hoogle suitable to use as the standard interface to Hackage.
Can you give some more detailed design of what precisely you intend to achieve?
# Removal of all bugs
Hoogle 3 has a number of embarrassing bugs, some of which are not easily fixed. The nastiest of these is to do with monads, which are horribly mistreated. Since I now know the underlying issues which have caused a problem with Hoogle 3, things like higher-kinded type classes can be solved in a more principled manner.
# Support for some Haskell type extensions
Hoogle 3 does not deal with multi-parameter type classes. I would like to support a variety of type system extensions, primarily by mapping them on to Haskell 98 equivalent types.
# Faster searching
The current implementation is O(n) in the number of functions in the library, where the constant factor is absolutely massive. I wish to make text searching O(s), where s is the length of the search string, and have an incredibly low constant overhead -- using the a lazy file-based trie.
The type searching also needs a massive speed up. I have some ideas on how to proceed, but it is a difficult problem! I will spend a small amount of time investigating this problem, but may have to use a simpler algorithm, rather than delay the rest of the project.
# Better deployment
Currently there is Hoogle for the base libraries, and a special (very hacked) version that supports Gtk2hs only. I have received several requests for custom Hoogle instances for tools such as XMonad, Ycr2js, wxHaskell etc. The new Hoogle will make deployment of individual versions for specific packages easy.
# Support for multiple packages
I wish to support searching through every package on Hackage at once. This requires a massive speed up in the searching algorithms.
# Generalised text searching
By searching both function names, and also cabal package descriptions, Hoogle can be much more useful in finding packages, as opposed to individual functions.
# Better Design
Hoogle 3 is a web application, with a hacked on command line program. Hoogle 4 will be a central API which can be reused from any IDE tools, and also used to build the web interface and the command line application.
# Generalised interface to all of Cabal
Hopefully all the above goals will result in a tool that is suitable to be an interface to Cabal.
What deliverables do you think are reasonable targets? Can you outline an approximate schedule of milestones?
I would plan to release a beta of Hoogle 4 approximately half way through the project, as a web application. Much of the initial design has been done, so this is primarily hacking time.
I would then hope to complete the final hackage integration for the second half. This stage will require discussion with the cabal people, and will be a combination of design, implementation and server administration/setup.
In what ways will this project benefit the wider Haskell community?
Hoogle is already of use to the community, but has never seen a final release, and has a reasonable number of known bugs. This project would produce a polished version of a tool for which we already know there is huge demand.
Hackage is working well, and gaining new packages every day. As the number of packages increases, the interface to hackage must be updated to handle this volume. Discussions with some of the hackage/cabal team seem to suggest that a search interface is the way forward. By making Hackage easier to use, everyone benefits.
Thursday, April 12, 2007
SOC: GuiHaskell
The Google Summer of Code allocations are now out, and GuiHaskell got one of the slots from Haskell.org. There were quite a few students applying for GuiHaskell - I thought the students who applied were very good, and any of them would have done a wonderful job. In the end Asumu Takikawa was chosen, and I wish him well with the project.
The link with his application is available here.
The basic idea as I see the project is to start from the existing GuiHaskell base and evolving so that by the end of the summer it is a very solid alternative to WinHugs, but with several key advantages:
The initial aim is not to make it an IDE, but to do everything an IDE might do apart from having an integrated editor. Depending on the progress, this may be something that is tackled this summer, or perhaps not.
The link with his application is available here.
The basic idea as I see the project is to start from the existing GuiHaskell base and evolving so that by the end of the summer it is a very solid alternative to WinHugs, but with several key advantages:
- Written in Haskell using Gtk2Hs, not C with the Win32 API
- Cross platform: Windows, Linux and Mac at least
- Cross compiler: Hugs, GHC and GHCi to begin with
- Integration with Haddock/Cabal/Hoogle etc.
The initial aim is not to make it an IDE, but to do everything an IDE might do apart from having an integrated editor. Depending on the progress, this may be something that is tackled this summer, or perhaps not.
Labels:
soc
Subscribe to:
Posts (Atom)
