Thursday, May 22, 2008

Interactive TagSoup parsing

I've written quite a few programs using the tagsoup library, but have never really used the library interactively. Today I was wondering how many packages on hackage use all lower case names, compared to those starting with an initial capital. This sounds like a great opportunity to experiment! The rest of this post is a GHCi transcript, with my comments on what I'm doing prefixed with -- characters.

$ ghci
GHCi, version 6.8.2: :? for help
Loading package base ... linking ... done.
-- load some useful packages
Prelude> :m Text.HTML.TagSoup Text.HTML.Download Data.List Data.Char Data.Maybe
Prelude Data.Maybe Data.Char Data.List Text.HTML.Download Text.HTML.TagSoup>
-- ouch, that prompt is a bit long - we can use :set prompt to shorten it
-- side note: I actually supplied the patch for set prompt :)

:set prompt "Meep> "
-- let's download the list of packages
Meep> src <- openURL ""
... src scrolls past the screen ...
-- parse the file, dropping everything before the packages
Meep> let parsed = dropWhile (~/= "<h3>") $ parseTags src
-- grab the list of packages
Meep> let packages = sort [x | a:TagText x:_ <- tails parsed, a ~== "<a href>"]
-- now we can query the list of packages
Meep> length packages
Meep> length $ filter (all isLower) packages
Meep> length $ filter ('_' `elem`) packages
Meep> length $ filter ('-' `elem`) packages
Meep> length $ filter (any isUpper . dropWhile isUpper) packages
Meep> length $ filter (isPrefixOf "hs" . map toLower) packages
Meep> length $ filter (any isDigit) packages
Meep> reverse $ sort $ map (\(x:xs) -> (1 + length xs,x)) $ group $ sort $ concat packages


We can see that loads of packages use lowercase, lots of packages use upper case, quite a few use CamelCase, quite a few start with "hs", none use "_", but lots use "-". The final query figures out which is the most common letter in hackage packages, and rather unsurprisingly, it roughly follows the frequency of English letters.
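
To try the same queries without downloading anything, here is a minimal sketch that packages them as ordinary functions. The names `samplePackages`, `countWhere`, `allLower`, `hasInnerUpper`, `startsWithHs` and `letterFrequency` are my own for illustration and do not come from the transcript, and the sample list is a small hypothetical stand-in for the real Hackage package list.

```haskell
import Data.Char (isDigit, isLower, isUpper, toLower)
import Data.List (group, isPrefixOf, sort)

-- Hypothetical sample data standing in for the downloaded package list.
samplePackages :: [String]
samplePackages =
  ["HUnit", "QuickCheck", "bytestring", "hscolour", "tagsoup", "utf8-string"]

-- Count the elements that satisfy a predicate.
countWhere :: (a -> Bool) -> [a] -> Int
countWhere p = length . filter p

-- Every character is a lowercase letter (so digits and '-' disqualify a name).
allLower :: String -> Bool
allLower = all isLower

-- CamelCase-ish: an uppercase letter appears after the leading run of capitals.
hasInnerUpper :: String -> Bool
hasInnerUpper = any isUpper . dropWhile isUpper

-- The name starts with "hs", ignoring case.
startsWithHs :: String -> Bool
startsWithHs = isPrefixOf "hs" . map toLower

-- Characters paired with how often they occur, most frequent first.
letterFrequency :: [String] -> [(Int, Char)]
letterFrequency =
  reverse . sort . map (\(x:xs) -> (1 + length xs, x)) . group . sort . concat
```

Loaded into GHCi, `countWhere allLower samplePackages` gives 3 and `countWhere hasInnerUpper samplePackages` gives 1 for this sample. Note that `allLower` follows the transcript's `all isLower`, so a name like utf8-string is not counted as lowercase.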

TagSoup and GHCi make a potent combination for obtaining and playing with webpages.
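
The extraction step can also be tried offline. The sketch below assumes the tagsoup package is installed and runs the same two expressions from the transcript over an in-memory page instead of a download, which is handy if Text.HTML.Download is not available in your version of the library. `samplePage` is a hypothetical fragment I made up to resemble a package listing, and `packageNames` is my own name for the combined steps.

```haskell
import Data.List (sort, tails)
import Text.HTML.TagSoup (Tag (TagText), parseTags, (~/=), (~==))

-- Hypothetical HTML standing in for the downloaded package list page.
samplePage :: String
samplePage = unlines
  [ "<html><body>"
  , "<a href=\"/\">Home</a>"
  , "<h3>Packages</h3>"
  , "<ul>"
  , "<li><a href=\"/package/tagsoup\">tagsoup</a></li>"
  , "<li><a href=\"/package/HUnit\">HUnit</a></li>"
  , "</ul></body></html>"
  ]

-- Drop everything before the first <h3>, then collect the text that
-- immediately follows each <a> tag carrying an href attribute.
packageNames :: String -> [String]
packageNames src =
  sort [x | a : TagText x : _ <- tails parsed, a ~== "<a href>"]
  where
    parsed = dropWhile (~/= "<h3>") (parseTags src)
```

In GHCi, `packageNames samplePage` evaluates to `["HUnit","tagsoup"]`: the "Home" link sits before the `<h3>` and is dropped, and capitals sort before lowercase letters.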


Anonymous said...

Hey, that's really cool. You have piqued my interest as to what's possible :).

Daniel Yokomizo said...

Time to update my (half baked) Haskell spider script to start using TagSoup instead of the weird combination of regexes that I was using...

Anonymous said...

Hi Neil. Great post. But could you please be so kind as to explain the difference between the a:_ and [a,_] patterns? I mean, why are a:_ <- x and [a,_] <- x not the same?

Neil Mitchell said...

Anon: [a,_] <- x is equal to (a:_:[]) <- x. The pattern a:_ requires x to be a list of at least length 1, while [a,_] requires the list to be exactly length 2.
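
A small sketch may make the difference concrete. The function names `firstOfNonEmpty` and `firstOfPairs` are made up for this illustration. In a list comprehension, elements that fail to match the pattern are silently skipped, so the two patterns select different elements.

```haskell
import Data.List (tails)

-- a:_ matches any list with at least one element.
firstOfNonEmpty :: [[a]] -> [a]
firstOfNonEmpty xss = [a | a : _ <- xss]

-- [a,_] matches only lists with exactly two elements.
firstOfPairs :: [[a]] -> [a]
firstOfPairs xss = [a | [a, _] <- xss]

-- The suffixes of "abcd": ["abcd","bcd","cd","d",""].
suffixes :: [String]
suffixes = tails "abcd"
```

In GHCi, `firstOfNonEmpty suffixes` gives `"abcd"` because every non-empty suffix matches, while `firstOfPairs suffixes` gives `"c"` because only the suffix `"cd"` has exactly two elements. The same reasoning applies to the tails-based comprehension in the post: a pattern ending in `:_` matches at every position in the tag list, whereas a two-element list pattern would only match the final two tags.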