Sunday, May 31, 2020

HLint --cross was accidentally quadratic

Summary: HLint --cross was accidentally quadratic in the number of files.

One of my favourite blogs is Accidentally Quadratic, so when the Haskell linter HLint suffered such a fate, I felt duty bound to write it up. Most HLint hints work module-at-a-time (or smaller scopes), but there is one hint that can process multiple modules simultaneously - the duplication hint. If you write a sufficiently large repeated fragment in two modules, and pass --cross, then this hint will detect the duplication. The actual application of hints is HLint is governed by:

applyHintsReal :: [Setting] -> Hint -> [ModuleEx] -> [Idea]

Given a list of settings, a list of hints (which gets merged to a single composite Hint) and a list of modules, produce a list of ideas to suggest. Usually this function is called in parallel with a single module at a time, but when --cross is passed, all the modules being analysed get given in one go.

In HLint 3, applyHintsReal became quadratic in the number of modules. When you have 1 module, 1^2 = 1, and everything works fine, but --cross suffers a lot. The bug was simple. Given a Haskell list comprehension:

[(a,b) | a <- xs, b <- xs]

When given the list xs of [1,2] you get back the pairs [(1,1),(1,2),(2,1),(2,2)] - the cross product, which is quadratic in the size of xs. The real HLint code didn't look much different:

[ generateHints m m'
| m <- ms
, ...
, (nm',m') <- mns'
, ...
]
where
    mns' = map (\x -> (scopeCreate (GHC.unLoc $ ghcModule x), x)) ms

We map over ms to create mns' containing each module with some extra information. In the list comprehension we loop over each module ms to get m, then for each m in ms, loop over mns' to get m'. That means you take the cross-product of the modules, which is quadratic.

How did this bug come about? HLint used to work against haskell-src-exts (HSE), but now works against the GHC parser. We migrated the hints one by one, changing HLint to thread through both ASTs, and then each hint could pick which AST to use. The patch that introduced this behaviour left ms as the HSE AST, and made mns' the GHC AST. It should have zipped these two together, so for each module you have the HSE and GHC AST, but accidentally took the cross-product.

How did we spot it? Iustin Pop filed a bug report noting that each hint was repeated once per file being checked and performance had got significantly worse, hypothesising it was O(n^2). Iustin was right!

How did we fix it? By the time the bug was spotted, the HSE AST had been removed entirely, and both m and m' were the same type, so deleting one of the loops was easy. The fix is out in HLint version 3.1.4.

Should I be using --cross? If you haven't heard about --cross in HLint, I don't necessarily suggest you start experimenting with it. The duplicate detection hints are pretty dubious and I think most people would be better suited with a real duplicate detection tool. I've had good experiences with Simian in the past.

1 comment:

Iustin Pop said...

Thanks a lot for the fast fix!