Wednesday, April 09, 2008

TagSoup Parsing: Dictionary Extraction

I've just read issue 10 of The Monad.Reader. It's a great issue, including a tutorial on the new GHCi debugger and an article on writing an efficient Haskell interpreter in Haskell. The running example for the GHCi debugger is parsing a computer dictionary and extracting descriptions for keywords, using the TagSoup library. The article starts with an initial version of the extraction code, then fixes some mistakes using the debugger present in GHCi. The code was written to teach debugging, not as a demonstration of TagSoup. This post explains how I would have written the program.

The original program is written in a low-level style. To search for a keyword, the program laboriously traverses the file looking for that keyword, much as one might in an imperative language. But Haskell programmers can do better. We can separate the task: first parse all the keyword/description pairs into a list, then search the list. Lazy evaluation will combine these separate operations into something just as efficient as the original. By separating the concerns, we can express each at a higher level, reducing the search function to a simple lookup. It also gives us more flexibility for the future, allowing us to potentially reuse the parsing functions.
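To illustrate just the separation, here is a sketch with a hypothetical two-entry dictionary standing in for the parsed HTML. With lazy evaluation, if this list were instead produced by a parser, lookup would only force as much parsing as is needed to reach a match:

```haskell
import Data.Maybe (fromMaybe)

-- A hypothetical stand-in for the parsed dictionary; in the real
-- program this list is built lazily from the HTML file.
dict :: [(String, String)]
dict = [("laziness", "evaluation on demand")
       ,("monad", "a composable computation")]

main :: IO ()
main = putStrLn $ fromMaybe "No match found." $ lookup "monad" dict
```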

I have fixed a number of other bugs in the code, and my solution is:

module Main where

import Text.HTML.TagSoup
import Data.Maybe

main :: IO ()
main = do
    putStr "Enter a term to search for: "
    term <- getLine
    html <- readFile "Dictionary.html"
    let dict = parseDict $ parseTags html
    putStrLn $ fromMaybe "No match found." $ lookup term dict

-- Parse the whole dictionary into keyword/description pairs.
parseDict :: [Tag] -> [(String,String)]
parseDict = map parseItem
          . sections (~== "<dt>")
          . dropWhile (~/= "<div class=glosslist>")

-- Split one definition into its keyword (before the <dd>) and its
-- description (inside the <dd>), normalising the whitespace.
parseItem :: [Tag] -> (String,String)
parseItem xs = (innerText a, unwords $ words $ innerText b)
    where (a,b) = break (~== "<dd>") (takeWhile (~/= "</dd>") xs)

Instead of searching for a single keyword, I parse all keywords using parseDict. The parseDict function first skips over the gunk at the top of the file, then finds each definition, and parses it. The parseItem function spots where tags begin and end, and takes the text from inside. The unwords $ words expression is a neat trick for normalising the spacing within an arbitrary string.
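Both tricks can be seen in isolation on plain lists and strings (a sketch using stand-in string tokens rather than TagSoup tags):

```haskell
main :: IO ()
main = do
    -- unwords . words collapses every run of whitespace, including
    -- newlines, into a single space
    putStrLn (unwords (words "  a   term\n  with messy   spacing "))
    -- parseItem's split: takeWhile drops everything from the closing
    -- tag onwards, then break separates keyword from description
    print (break (== "dd")
                 (takeWhile (/= "/dd") ["dt","Term","dd","Desc","/dd","junk"]))
```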

This revised program is shorter than the original, I find it easier to comprehend, and it provides more functionality with fewer bugs. The TagSoup library provides a robust base to work from, allowing concise expression of HTML/XML extraction programs.
