Sunday, February 18, 2018

Atomic Expressions Generically

Summary: For certain hints HLint needs to determine if a Haskell expression is atomic. I wrote a generic method to generate expressions and test if they are atomic.

With HLint, if you write a statement such as:

main = print ("Hello")

You get the hint:

Sample.hs:1:14: Warning: Redundant bracket
Found:
  ("Hello")
Why not:
  "Hello"

One of ways HLint figures out if brackets are redundant is if the expression inside the brackets is "atomic" - if you never have to bracket it in any circumstances. As an example, a literal string is atomic, but an if expression is not. The isAtom function from haskell-src-exts-util has a list of the types of expression which are atomic, but the Exp type from haskell-src-exts has 55 distinct constructors, and I don't even know what many of them do. How can we check the isAtom function is correct?

One approach is to use human thought, and that's the approach used until now, with reasonable success. However, I've recently written a script which solves the problem more permanently, generating random expressions and checking that isAtom gives the right value. In this post I'm going to outline a few features of how that script works. There are basically three steps:

1) Generate a type-correct Exp

The first step is to generate a random Exp which follows the type definition. Fortunately the Data class in Haskell lets us generate values. We define:

mkValue :: forall a . Data a => Int -> IO a
mkValue depth
    | Just x <- cast "aA1:+" = randomElem x
    | Just x <- cast [-1 :: Int, 1] = randomElem x
    | Just x <- cast [-1 :: Integer, 1] = randomElem x
    | AlgRep cs <- dataTypeRep $ dataTypeOf (undefined :: a) =
        if depth <= 0 then throwIO LimitReached else fromConstrM (mkValue $ depth - 1) =<< randomElem cs

Here we are saying that given a depth, and a result type a, we generate a value of type a. Note that the a argument is the result, but we don't pass anything in of type a. The first three lines of the body follow the pattern:

    | Just x <- cast [list_of_element] = randomElem x

This tries to convert list_of_element to [a] by using runtime type information. If it succeeds, we pick a random element from the list. If it doesn't we continue onwards.

The final case uses dataTypeRep/dataTypeOf to get a list of the constructors of a. Note that we don't have a value of a, so we make one up using undefined :: a - but that's OK because dataTypeOf promises not to look at its argument. Given a list of constructors, we pick one at random, and then call fromConstrM - which says how to create a value of the right constructor, using some argument to fill in all the fields. We pass mkValue as that argument, which causes us to recursively build up random values.

One immediate problem is what if we are building a [Int] and the random generator often picks (:)? We'll take a very long time to finish. To solve this problem we keep a depth counter, decrement it in every recursive call, and when it runs out, throwIO an exception and give up.

2) Generate a parsing Exp

Now we've got a valid Exp value, but just because an Exp can be represented in the AST doesn't mean it corresponds to Haskell fragment. As an example, consider Var (UnQual (Ident "Test")). That's a valid value of type Exp, but if you pretty print it you get Test, and if you parse it back you'll get Con (UnQual (Ident "Test")) - variables must start with a leading lower-case letter.

To ignore invalid expressions we try pretty printing then parsing the expression, and ignore all expressions which don't roundtrip.

3) Determine if the Exp is atomic

Now we've got a valid Exp, which we know the user could have typed in as a source program, we need to figure out if isAtom is correct. To do that we see if given expression x whether self-application roundtrips, i.e. x x. As a positive example, foo (a variable) roundtrips as foo foo being foo applied to itself. However, if b then t else f when applied to itself gives if b then t else f if b then t else f, which parses back more like if b then t else f (if b then t else f), and is not atomic.

Putting it all together

Now we've got a random expression, and we know if the atomicity agrees with what we were expecting, we can report any differences. That approach has identified many additional patterns to match, but it's not perfect, in particular:

  • Most values either exceed the depth limit or fail to roundtrip. For 10,000 if expressions I typically get 1 or 2 which roundtrip properly. For non-if expressions it's usually 100 or so. The advantage of random testing is that throwing more time at a problem solves such issues without thinking too hard.
  • For some expressions, e.g. ParComp, I've never managed to get a valid value created. Perhaps haskell-src-exts can't parse it, or perhaps it requires constants I don't have in my hardcoded list - none of these were particularly common examples.
  • haskell-src-exts has a bug where -1 is pretty printed as (-1), which is then parsed as a paren and -1. That fails step 2, so we don't test with negative literals. As it happens, non-negative literals are atomic, but negative literals aren't, so we need to take care.
  • There are some patterns which appear to roundtrip successfully on their own, but not when surrounded by brackets, but secretly are just very weird. For example do rec\n [] parses successfully, but with source positions that are error values, and when applied to itself pretty prints incorrectly. There's at least one haskell-src-exts bug here.
  • The program appears to leak progressively more memory. I solved that by running slices of it at a time, and didn't look too hard. I've seen cases of blowup in Data constructors when recursing, so it could be that. but needs investigating.

As a result of all this work a future HLint will spot unnecessary brackets for 20 more types of expression, 8 more types of pattern and 7 more types of type.

4 comments:

Warbo said...

The memory problems sound very similar to naively generating recursive values in QuickCheck.

The obvious approach of picking a constructor then generating arguments, as you hinted at, quickly blows up. More specifically, if the expected number of recursive calls is >=1 then the expected depth is infinite. For example, we might have a binary tree with a nullary `Leaf` constructor and a binary `Node` constructor, which we pick between with a 50/50 chance. The expected number of recursive calls is then 0.5*0 + 0.5*2 = 1. Hence every time we generate a tree we should expect, on average, to generate one sub-tree; which will need one sub-tree; and so on.

Unfortunately, that's what most generic/derived implementations of Arbitrary will do :(

The sort of depth-based approach you've opted for does solve the memory blow up, but I've found it to be too biased towards small values. Since depth information only flows from parent to child, we can end up with lots of large siblings at the same level of the tree, and since the number of terms at each level grows exponentially with depth, we're forced to pick a very small depth to avoid the case where this exponential number of siblings are all large.

My preferred method is to pass in a "fuel" parameter, which gets decremented by each additional constructor and divided up (randomly) between sub-terms. This way, we only get large sub-terms if their siblings are small, so we can end up generating much deeper structures with confidence (since they'll necessarily be narrow).

Perhaps this isn't so much of an issue with checking atomicity, since that presumably only depends on 2 or 3 levels of nesting at most. Still, it's such a common issue that I like to mention it when random testing is brought up :)

Also, the generic-random package looks like a nice way to derive Arbitrary generators which don't suffer these blow ups. I found it via https://byorgey.wordpress.com/2016/09/20/the-generic-random-library-part-1-simple-generic-arbitrary-instances/

> As a result of all this work a future HLint will spot unnecessary brackets for 20 more types of expression, 8 more types of pattern and 7 more types of type.

Wearing my cynical hat, I would point out that the very fact that it required an automated random test suite to discover that these are "unnecessary" is proof enough that they *are* necessary ;)

Presumably HLint will have a switch to ignore some/all such checks? I routinely run HLint in my projects' test suites (because it's excellent, thank you), but I'm also of the "smug lisp weenie" camp which thinks that operator precedence is nothing other than a mental burden, a language bug and solvable by the liberal addition of extra brackets :)

Neil Mitchell said...

Warbo: Note that with the depth-based approach I have opted for there STILL HAS a memory blow up, it's just I think it's not from that code, but data instances. I agree a fuel parameter can work quite nicely, and that with other types of random generation I'd do a better job - in this case a poor approach coupled with a relatively large amount of compute does the job nicely enough. The generic-random package might indeed have been less work, but it seems to want to know the weights for each argument - which would have required a per-type information.

Note that HLint never suggests avoiding brackets because of operators (apart from $, because if you didn't want to avoid the bracket, why write the $). In the examples I've caught it has more usually been because there are quite a lot of constructors and while I know the syntax is atomic, I don't know that from the haskell-src-exts constructor name.

MitchellSalad said...

What's the significance of the lists "aA1:+-", [-1::Int,1], and [-1::Integer,1]?

Neil Mitchell said...

@MitchellSalad, they are for the non-algebraic constructors, so I have to have another way of choosing them. I just picked a few representative values that let me explore a reasonable design space and select from them.