Saturday, March 21, 2009

Concise Generic Queries

A few weeks ago David Miani asked how to write concise queries over a data type. The answer is certainly generic programming, a technique that I feel is underused in the Haskell community. I suggested David look at Uniplate, but he found greater success with SYB. Sean Leather gave a solution using EMGM. One of the advantages of Uniplate is conciseness, so I decided to tackle the same problem and compare.

A full description of the task, including data type definitions, is at Sean's blog. From a data type representing structured files (tables, headings, paragraphs) find a heading with a particular name then within that heading find a paragraph starting with "Description". The rest of this post contains solutions using Uniplate, EMGM (taken from Sean) and SYB (from David). The SYB solution is slightly different from the EMGM or Uniplate solutions, but they all do roughly the same generic operations. It is entirely possible that the EMGM/SYB solutions could be improved, but that is a job for other people.

Uniplate Solution

The Uniplate solution is:


projDesc :: String -> OrgFileP -> Maybe String
projDesc name p = listToMaybe [y |
OrgHeadingP _ x ys <- universeBi p, name == x,
ParagraphP y <- universeBi ys, "Description" `isPrefixOf` y]


The code can be read as:


  • Line 1: Type signature, given a name and a file, return the paragraph if you find one

  • Line 3: Find a heading with the right name

  • Line 4: Find a paragraph below that heading, whose name starts with "Description"

  • Line 2: Pick the paragraph



I find this code to be a clear, concise and simple description of the problem. The thought process to come up with the solution was as follows: You want to search, or perform a query. The first question is whether this is a deep (all nodes) or shallow (just the children) query - David doesn't say but the example seems to imply deep. If it's deep use universeBi. Operations are combined with a list comprehension that finds an element, check it has the necessary properties (the name), then performs more operations. The result is the code you see above.

EMGM Solution

Sean's solution can be found at his blog:


projDesc :: String -> OrgFileP -> Maybe String
projDesc name file = do
hdg <- G.firstr (headings name file)
para <- firstPara hdg
if para =~ "Description" then return para else Nothing

headings :: String -> OrgFileP -> [OrgHeadingP]
headings name = filter check . G.collect
where
check (OrgHeadingP _ possible _) = name == possible

firstPara :: OrgHeadingP -> Maybe String
firstPara hdg = paraStr =<< G.firstr (G.collect hdg)
where
paraStr (ParagraphP str) = Just str
paraStr _ = Nothing


This solution isn't bad, but is more verbose than the Uniplate solution. Perhaps it could be rewritten with list comprehensions? It seems that G.collect is similar to universeBi - although I am not sure.

SYB Solution

David's SYB solution can be found here along with another solution using different combinators.


eitherOr :: Either a b -> Either a b -> Either a b
eitherOr x@(Right _) _ = x
eitherOr _ y = y

getP14Desc :: OrgElement -> Either ErrString String
getP14Desc org = everything eitherOr (Left descError `mkQ` findDesc) =<<
everything eitherOr (Left findError `mkQ` findP14) org
where
findP14 h@(Heading {headingName=name})
| name == "Project14" = Right h
findP14 _ = Left findError

findDesc (Paragraph {paragraphText=text})
| text =~ "Description" = Right text
findDesc _ = Left findError

descError = "Couldn't find description for project"
findError = "Couldn't find project."


Summary

The relative merits of each solution are highly subjective, but I believe the Uniplate solution is concise. The Uniplate solution is a simple translation of the problem, without any clever steps, so hopefully other users (who didn't write the library!) will be able to achieve similar results. The Uniplate solution required only one function from the Uniplate library, so has a small learning curve. Even if you don't choose Uniplate, generic programming techniques are very useful, and can make your code concise and robust.

8 comments:

  1. Hey, Neil! Could you please try to use '<' instead of '<' in the HTML? My browser handles it but my RSS reader is stricter.

    Thanks!

    ReplyDelete
  2. I mean, use "& l t ;" ;)

    ReplyDelete
  3. meteficha: Should be all fixed now. In future I'll try and remember to avoid < in the code snippets without escaping.

    ReplyDelete
  4. Hello Neil. Nice post.

    The SYB solution is working with Either to provide feedback when the element is not found. The feedback is better than that provided by the Uniplate or EMGM versions, at the cost of some extra lines. Using Maybe we can get a shorter version:

    getP14Desc :: OrgElement -> Maybe String
    getP14Desc org = everything mplus (Nothing `mkQ` findP14) org >>=
    everything mplus (Nothing `mkQ` findDesc)

    where
    findP14 h@(Heading {headingName=name})
    | name == "Project14" = Just h
    findP14 _ = Nothing

    findDesc (Paragraph {paragraphText=text})
    | text =~ "Description" = Just text
    findDesc _ = Nothing


    Nevertheless, I agree that here the Uniplate style is the clearest.

    ReplyDelete
  5. Anonymous11:53 AM

    You are right in that Uniplate's "universeBi" and EMGM's "collect" are the same. EMGM's solution used Maybe directly, instead of keeping things in [], so list comprehensions couldn't be used. However, since the EMGM solution uses "firstr" only to convert lists to Maybe, you could use listToMaybe as well. We can keep using [] and rely on lazyness such that the outermost listToMaybe will only actually force calculation of the first element in the comprehension, instead of cutting off at the first element with firstr inside "firstPara", we get:

    projDesc :: String -> OrgFileP -> Maybe String
    projDesc name file = listToMaybe $
    [ str
    | OrgHeadingP _ x ys <- G.collect file, x == name
    , ParagraphP y <- G.collect ys, str ~= "Description"
    ]

    Which is exactly the same as the Uniplate version (barring (~=) and isPrefixOf, but that's not the point).

    ReplyDelete
  6. José: Thanks for the Maybe version of SYB, that does follow the same pattern now. I realised there were differences in the operations with the Either vs Maybe, but couldn't be bothered to change/test the SYB again.

    Joeri: Is there any reason for EMGM to use Maybe over list comprehensions? You revised EMGM seems much clearer.

    ReplyDelete
  7. Anonymous10:44 AM

    There's actually a couple of differences in that they both search for something different, due to the backtracking nature of [].

    The Maybe version will only consider the first OrgHeaderP with the right name, and only return Just when the first ParagraphP inside that OrgHeaderP matches "Description".

    The [] version will consider all OrgHeaderPs with the right name, and return Just if any ParagraphP inside matches "Description". It'll return Just the text of the first matching ParagraphP inside the first matching OrgHeaderP.

    If we somehow have the guarantees (due to file structure or other) that if we have a "Description", it's in the first ParagraphP, and the names of OrgHeaderP are unique, it's the same thing. If we cannot guarantee such a thing, it depends on what you're looking for.

    Of course, the same holds for the Uniplate version, which can, in a similar way, be rewritten using Maybe. Just throw in some extra listToMaybes.

    ReplyDelete
  8. Anonymous10:53 AM

    I worded that third paragraph poorly. The [] version will search search for other matching OrgHeaderPs when the first one with the right name doesn't have a matching ParagraphP. The conversion to Maybe cuts short this further searching.

    If we have the guarantees mentioned in my previous comment, the shortcutting done by conversion to Maybe will be more efficient, because it'll save us from searching through the whole tree.

    But as far as I know, it's not something library-specific. If needed, insert listToMaybe in the Uniplate version if you want the notationally more cumbersome, but non-backtracking version.

    So I think in this case, I think it's just a different interpretation of the problem.

    ReplyDelete