The original program is written in a low-level style. To search for a keyword, it laboriously traverses the file looking for the keyword, much as a program in a modern imperative language might. But Haskell programmers can do better. We can separate the task: first parse the keyword/description pairs into a list, then search the list. Lazy evaluation combines these separate operations into something just as efficient as the original. By separating the concerns, we can express each at a higher level, reducing the search function to a simple lookup. It also gives us more flexibility for the future, letting us reuse the parsing functions elsewhere.
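To see why laziness keeps this efficient, note that lookup stops at the first match, so only as much of the parsed list is ever built as the search demands. A tiny GHCi illustration (not from the dictionary program; the infinite list here stands in for a lazily parsed file):

    ghci> lookup 4 (map (\x -> (x, x * x)) [1 ..])
    Just 16

Even though the association list is infinite, the search terminates, because map produces pairs only on demand.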
I have fixed a number of other bugs in the code, and my solution is:
module Main where

import Text.HTML.TagSoup
import Data.Maybe
import System.IO

main :: IO ()
main = do
    putStr "Enter a term to search for: "
    hFlush stdout  -- make sure the prompt appears before we block on input
    term <- getLine
    html <- readFile "Dictionary.html"
    let dict = parseDict $ parseTags html
    putStrLn $ fromMaybe "No match found." $ lookup term dict

-- Turn the whole page into keyword/description pairs.
parseDict :: [Tag String] -> [(String, String)]
parseDict = map parseItem
          . sections (~== "<dt>")
          . dropWhile (~/= "<div class=glosslist>")

-- A single entry: the keyword sits before the <dd>, the description inside it.
parseItem :: [Tag String] -> (String, String)
parseItem xs = (innerText a, unwords $ words $ innerText b)
    where (a, b) = break (~== "<dd>") (takeWhile (~/= "</dd>") xs)
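The dictionary file itself is not shown here, but the patterns the code matches imply its shape: entries live inside a <div class=glosslist>, with each keyword in a <dt> tag and its description in the following <dd>. Assuming markup of that shape, a quick GHCi check (the snippet below is invented for illustration):

    ghci> let snippet = "<div class=glosslist><dt>badger</dt><dd>  a  burrowing   mammal </dd></div>"
    ghci> parseDict (parseTags snippet)
    [("badger","a burrowing mammal")]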
Instead of searching for a single keyword, I parse out every keyword/description pair using parseDict. The parseDict function first skips over the gunk at the top of the file, then finds each definition and parses it. The parseItem function spots where tags begin and end, and takes the text from inside them. The unwords $ words expression is a neat trick for normalising the spacing within an arbitrary string.
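For example, in GHCi:

    ghci> unwords (words "  too   much \n whitespace  ")
    "too much whitespace"

words splits on any run of whitespace (including newlines), and unwords rejoins the pieces with single spaces.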
This revised program is shorter than the original, I find it easier to comprehend, and it provides more functionality with fewer bugs. The TagSoup library provides a robust base to work from, allowing concise expression of HTML/XML extraction programs.