Thursday, September 12, 2013

Summary: I wrote a simple program to detect repeated words in a file. This post gives a walk-through of how I developed it.

My wife is currently writing her PhD thesis, and in a recent draft her supervisor spotted a few instances where she had repeated repeated a word by accident (as I just did with repeated). While her LaTeX editor of choice highlights spelling mistakes, it does not spot repeated words. Therefore, I offered to write a quick script to spot repeated words for her. I used Haskell because it is a great choice for quick one-off scripts.

The problem can roughly be described as "for each file, find words that repeat". My first approach with any Haskell program is to decompose the problem, and this problem naturally breaks itself into "(for each file) (find words) (that repeat)". I need a function to iterate through a list of files doing IO and calling the other functions (main), a function to split an input file into words (worder) and a function to spot repeats (dupes).

Iteration 1

To get started, I wrote simple versions of each function:

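```haskell
import Data.List (group)

-- Read one fixed file and print every word that is immediately repeated.
main :: IO ()
main = print . dupes . worder =<< readFile "thesis.tex"

-- Split the input into words on whitespace.
worder :: String -> [String]
worder = words

-- Keep the first word of each run of two or more equal adjacent words.
dupes :: [String] -> [String]
dupes = map head . filter ((> 1) . length) . group
```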

The main function is restricted to a single static file (namely thesis.tex), which it reads, splits into words, checks for duplicates, and then prints the result. The function to split into words just uses the standard words function, which splits on whitespace boundaries. The function dupes is the most interesting - it uses group to create lists of equal adjacent words, filter to keep any group of more than 1 adjacent word, then map head to take only the first element from each group to report as the word at fault.
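To see what each stage of dupes contributes, here is a small sketch that runs the same pipeline on a made-up sentence rather than a real file. The name sample and its contents are my own illustration, not part of the original script.

```haskell
import Data.List (group)

-- Same pipeline as dupes above, repeated so the example stands alone.
dupes :: [String] -> [String]
dupes = map head . filter ((> 1) . length) . group

-- Hypothetical sample input, not taken from the thesis.
sample :: [String]
sample = words "she had repeated repeated a word by by accident"

main :: IO ()
main = do
    print (group sample)  -- runs of equal adjacent words, most of length 1
    print (dupes sample)  -- ["repeated","by"]
```

Only the runs ["repeated","repeated"] and ["by","by"] survive the filter, so those are the two words reported.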

Iteration 2

The first iteration only prints out the words that have been duplicated. It would be much more civilised to also print the line number of the duplicated word, so my wife can quickly find the problem. The solution is to refine the type passed between worder and dupes to include the line number alongside each word. Instead of passing [String] we pass [(Int,String)].

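```haskell
import Data.Function (on)
import Data.List (groupBy)

-- Pair every word with the (1-based) number of the line it appears on.
worder :: String -> [(Int, String)]
worder whole = [(i, word) | (i, line) <- zip [1..] $ lines whole, word <- words line]

-- Group on the word only, ignoring the line number.
dupes :: [(Int,String)] -> [(Int,String)]
dupes = map head . filter ((> 1) . length) . groupBy ((==) `on` snd)
```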

For worder we first split into lines and use zip [1..] to assign line numbers, then split each line into words. The changes to dupes are fairly minor - when grouping we use groupBy and consider two entries to be equal by looking only at the word part, not the line number. We are now printing out line numbers, making the error easy to find.
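As a quick check of the line-numbered version, the sketch below feeds it a two-line string of my own invention (sample is a hypothetical input), with one repeat inside a line and one that straddles the line break.

```haskell
import Data.Function (on)
import Data.List (groupBy)

worder :: String -> [(Int, String)]
worder whole = [(i, word) | (i, line) <- zip [1..] $ lines whole, word <- words line]

dupes :: [(Int, String)] -> [(Int, String)]
dupes = map head . filter ((> 1) . length) . groupBy ((==) `on` snd)

-- Hypothetical sample: "is is" on line 1, and "the" ending line 1 and starting line 2.
sample :: String
sample = "this is is a draft of the\nthe first chapter"

main :: IO ()
main = print (dupes (worder sample))  -- [(1,"is"),(1,"the")]
```

Because map head takes the first element of each group, a repeat that crosses a line break is reported against the line of its first occurrence.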

Iteration 3

The type of dupes is more specific than we need, so we can generalise it. Thinking about what dupes should do, we are really getting in a list of pairs of some information, and a value to check for repetition on. Therefore, we can write:

```haskell
dupes :: Eq v => [(k, v)] -> [(k, v)]
dupes = map head . filter ((> 1) . length) . groupBy ((==) `on` snd)
```

Note that the code has not changed, merely the signature. With the new signature we can also be sure we are not inadvertently comparing on the line number, since we have no Eq context for k.
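One way to convince yourself of this is to use a key type that has no Eq instance at all. The sketch below invents a Loc type for that purpose (Loc and the sample data are hypothetical, not part of the script); it compiles only because dupes never compares keys.

```haskell
import Data.Function (on)
import Data.List (groupBy)

dupes :: Eq v => [(k, v)] -> [(k, v)]
dupes = map head . filter ((> 1) . length) . groupBy ((==) `on` snd)

-- Hypothetical key type: a file name and a line number, deliberately without Eq.
data Loc = Loc FilePath Int
    deriving Show

sample :: [(Loc, String)]
sample =
    [ (Loc "intro.tex" 3, "the")
    , (Loc "intro.tex" 3, "the")
    , (Loc "intro.tex" 4, "cat")
    ]

main :: IO ()
main = do
    print (dupes sample)                              -- [(Loc "intro.tex" 3,"the")]
    print (dupes (zip "abcd" [1, 1, 2, 3 :: Int]))    -- [('a',1)]
```

The second call shows the value side is just as general: any type with an Eq instance will do, not only String.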

Iteration 4

Looking at some sample documents, it became clear our worder implementation is insufficient, in particular:

• It is case sensitive, while our repeated words might be at the start of a sentence, e.g. "In in this chapter"
• It keeps punctuation, while our repeated words might be followed by a comma, e.g. "as shown in this chapter chapter, we have"
• It finds non-alphabetic words, e.g. "The position is at coordinate 1 1 3."
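These three failures are easy to reproduce. The sketch below runs the plain whitespace splitter from Iteration 1 over the three example phrases (line numbers are left out to keep it short).

```haskell
import Data.List (group)

-- The Iteration 1 style of dupes, working on bare words.
dupes :: [String] -> [String]
dupes = map head . filter ((> 1) . length) . group

main :: IO ()
main = do
    -- Missed: "In" and "in" differ in case.
    print (dupes (words "In in this chapter"))                        -- []
    -- Missed: "chapter" and "chapter," differ by the comma.
    print (dupes (words "as shown in this chapter chapter, we have")) -- []
    -- False positive: the repeated coordinate is reported.
    print (dupes (words "The position is at coordinate 1 1 3."))      -- ["1"]
```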

Fixing the first problem is simple - just map toLower over the string at some point. Fixing the others requires more thought, as we still want punctuation and non-alphabetic characters to separate words, so they cannot simply be discarded. There are two approaches to the problem - one is to change the splitting procedure (which effectively involves designing a finite state machine for when to split), the other is to preprocess the input to make it suitable for words. While the first is more likely to produce something maintainable and adaptable, the second is often quicker to implement at first. For this program, I chose the second approach.

I realised that if we convert to lowercase, then replace all non-alphabetic non-space characters with " 1 2 ", we meet all the criteria above. Punctuation separates words, as does " 1 2 ". By replacing with a sequence of two distinct words we ensure that repeated punctuation does not flag as a spurious repeated word. By choosing characters that are themselves replaced by the sequence (the digits 1 and 2 are non-alphabetic), we ensure the replacement can never form a repeated word with the word before or after it.

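```haskell
import Data.Char (isAlpha, isSpace, toLower)

worder :: String -> [(Int, String)]
worder whole = [(i, word) | (i, line) <- zip [1..] $ lines $ f whole, word <- words line]
    where -- lowercase everything, then turn each non-alphabetic, non-space character into " 1 2 "
          f = concatMap (\x -> if isAlpha x || isSpace x then [x] else " 1 2 ") . map toLower
```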

We define a local function f to make the changes to the input string. To perform Char to String replacements on a string we use concatMap with the replacement described above. We could have fused the two iterations over the string, but keeping them separate makes it slightly clearer.
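To make the effect of f concrete, the sketch below lifts it to the top level under the name normalise and puts a single-pass variant next to it. Both names, normalise and normaliseFused, are my own for this illustration, and the fused form is just one possible way to combine the two passes.

```haskell
import Data.Char (isAlpha, isSpace, toLower)

-- The two-pass version from above: lowercase first, then replace.
normalise :: String -> String
normalise = concatMap (\x -> if isAlpha x || isSpace x then [x] else " 1 2 ") . map toLower

-- A possible fused variant: lowercase and replace each character in one pass.
normaliseFused :: String -> String
normaliseFused = concatMap (step . toLower)
  where
    step x
        | isAlpha x || isSpace x = [x]
        | otherwise              = " 1 2 "

main :: IO ()
main = do
    -- Hypothetical sample combining the problem cases listed earlier.
    let sample = "In in this chapter chapter, see coordinate 1 1 3."
    print (words (normalise sample))
    print (normalise sample == normaliseFused sample)  -- True
```

In the printed word list the comma, the digits and the full stop have all become alternating "1" and "2" words, so the only adjacent repeats left are "in in" and "chapter chapter".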

Iteration 5

The final part of the spec we have ignored until now is "for each file", which we can implement as:

```haskell
import Control.Monad (forM)
import System.Environment (getArgs)

main :: IO ()
main = do
    files <- getArgs
    bad <- fmap concat $ forM files $ \file -> do