Thursday, May 22, 2008

Interactive TagSoup parsing

I've written quite a few programs using the tagsoup library, but have never really used the library interactively. Today I was wondering how many packages on hackage use all lower case names, compared to those starting with an initial capital. This sounds like a great opportunity to experiment! The rest of this post is a GHCi transcript, with my comments on what I'm doing prefixed with -- characters.


$ ghci
GHCi, version 6.8.2: http://www.haskell.org/ghc/ :? for help
Loading package base ... linking ... done.
-- load some useful packages
Prelude> :m Text.HTML.TagSoup Text.HTML.Download Data.List Data.Char Data.Maybe
Prelude Data.Maybe Data.Char Data.List Text.HTML.Download Text.HTML.TagSoup>
-- ouch, that prompt is a bit long - we can use :set prompt to shorten it
-- side note: I actually supplied the patch for set prompt :)

:set prompt "Meep> "
-- lets download the list of packages
Meep> src <- openURL "http://hackage.haskell.org/packages/archive/pkg-list.html"
... src scrolls pass the screen ...
-- parse the file, dropping everything before the packages
Meep> let parsed = dropWhile (~/= "<h3>") $ parseTags src
-- grab the list of packages
Meep> let packages = sort [x | a:TagText x:_ <- tails parsed, a ~== "<a href>"]
-- now we can query the list of packages
Meep> length packages
648
Meep> length $ filter (all isLower) packages
320
Meep> length $ filter ('_' `elem`) packages
0
Meep> length $ filter ('-' `elem`) packages
165
Meep> length $ filter (any isUpper . dropWhile isUpper) packages
100
Meep> length $ filter (isPrefixOf "hs" . map toLower) packages
47
Meep> length $ filter (any isDigit) packages
37
Meep> reverse $ sort $ map (\(x:xs) -> (1 + length xs,x)) $ group $ sort $ conca
t packages

[(484,'e'),(374,'a'),(346,'r'),(336,'s'),(335,'t'),(306,'i'),(272,'l'),(248,'c')
,(247,'n'),(240,'o'),(227,'p'),(209,'h'),(185,'-'),(171,'m'),(159,'d'),(126,'g')
,(112,'b'),(96,'u'),(87,'y'),(78,'k'),(76,'f'),(74,'x'),(58,'S'),(53,'H'),(35,'w
'),(33,'v'),(29,'q'),(29,'L'),(27,'A'),(26,'F'),(24,'D'),(23,'C'),(22,'T'),(16,'
P'),(16,'M'),(16,'I'),(16,'G'),(13,'B'),(12,'W'),(12,'3'),(12,'2'),(10,'O'),(9,'
R'),(9,'1'),(8,'z'),(8,'j'),(8,'E'),(7,'X'),(7,'U'),(7,'N'),(6,'Y'),(6,'V'),(5,'
J'),(4,'Q'),(4,'5'),(4,'4'),(3,'Z'),(3,'8'),(3,'6'),(1,'9')]


We can see that loads of packages use lowercase, lots of packages use upper case, quite a few use CamelCase, quite a few start with "hs", none use "_", but lots use "-". The final query figures out which is the most common letter in hackage packages, and rather unsurprisingly, it roughly follows the frequency of English letters.

TagSoup and GHCi make a potent combination for obtaining and playing with webpages.

4 comments:

Anonymous said...

Hey thats really cool you have peaked my interest as to whats possible :).

Daniel Yokomizo said...

Time to update my (half baked) Haskell spider script to start using TagSoup instead of the weird combination of regexes that I was using...

Anonymous said...

Hi Neil. Great post. But could you pleased so kind to explaing the difference between a:_ and [a,_] pattern models ? I mean, why a:_ <- x and [a,_] <- is not the same models.

Neil Mitchell said...

Anon: [a,_] <- x is equal to (a:_:[]) <- x. One requires x to be a list at least length 1, the other requires the list to be exactly length 2.