Monday, March 26, 2012

Finding Two unordered-containers Bugs

Summary: Over the past few weeks I found two bugs in unordered-containers. I strongly recommend everyone upgrades to unordered-containers- or later, which fixes these bugs.

Both Shake and Uniplate use the unordered-containers library, which provides fast Map and Set types. Both my libraries had bugs reported against them which turned out to be introduced by unordered-containers- As soon as these bugs were identified Johan released a new version ( fixing the bugs, and both my packages now work as expected.

Bug 1: Shake breaks because delete fails

Just over two weeks ago Evan Laforge reported he'd started seeing "thread blocked indefinitely in an MVar operation" if there was a compile error. Shake goes to a lot of effort to clean up nicely, so something was going wrong. Since Evan's project compiles on Linux, and I only have access to Windows machines, I tried replicating it on one of my projects. I perturbed the project in several ways and did manage to replicate the error once, but the same perturbation next time made it succeed - something was clearly non-deterministic (which is the norm in a highly concurrent program such as Shake).

I'd been wanting to write a random build system generator and tester for a while, to thoroughly test Shake, and used this bug as motivation. I wrote the tester, and usually within 5 random runs (about 60 seconds of churning) the error was reproduced on my desktop. However, the error wouldn't reproduce on my laptop even after an hour (because I was on an older version of unordered-containers, I now know), and as I was off travelling I had to put the bug away to look into on my return.

On my return I investigated further. Figuring out the command line to run the random test took about an hour (note to self - write these things down). Once I had that, it still reproduced reliably in about 60 seconds. I cut out the random parts of the test that I didn't think were contributing to the bug (tests for building successfully, minimal rebuilding etc) and cut the reproduction time down to about 5 seconds. I then started augmenting the code with print statements. The cleanup code for exceptions is almost exclusively within the thread pool implementation (see Development.Shake.Pool if you are reading the source). In particular, the code that cleans up after an exception is:

t <- myThreadId
mapM_ killThread $ Set.toList $ Set.delete t $ threads s
signalBarrier done $ Just e

This code finds all threads in the thread pool, with the exception of this thread (the Set.delete t), and kills them. After that has finished, signal to the thread who first called the thread pool that everything has finished with an exception.

Adding trace statements at every step showed that the exception started being cleaned up, a handful of threads were killed, but not all the threads were killed and the barrier was never signaled. An initial guess was that my thread was being killed, and thus that Set.delete had failed to remove t. Since I had debugged a separate unordered-containers bug a week before (bug 2 started after and finished before this one), I went to the unordered-containers commit list and immediately found a bug fixed against delete. I upgraded, and ran my random tester for 3 hours without failure.

Bug 2: Uniplate breaks because filter fails

The Uniplate bug was reported just over a week ago, while I was travelling, by Aleksey Khudyakov. This bug report included a relatively small test case demonstrating a clear bug, along with the vital piece of information that the test case worked with unordered-containers-0.1, but not 0.2. With that information the question became whether Uniplate was using some non-guaranteed behaviour (such as assuming the result from Set.toList is ordered), or whether unordered-containers had a bug. The supplied test case was very sensitive - it involved three modules, and simply changing any module name caused the bug to disappear. The code was also exercising a very sensitive and complex part of Uniplate. After 12 hours of cooperative debugging between myself, Aleksey and Johan, I came up with the reduced test case:

let broken = Set.filter (`elem` good) $ Set.fromList $ good ++ bad
working = Set.fromList $ Set.toList broken
print $ Set.member (head good) broken
print $ Set.member (head good) working

This printed False/True, when it should print True/True. Of course, the choice of the good and bad lists is critical, and was based on the Hashable instance for TypeRep, which uses the fully qualified type name, and thus changes if you rename the modules.

Once I had provided Johan with the reduced test case, it was fixed the same day.

No comments: