Tuesday, November 19, 2013

The oldest Shake bug - async exceptions are hard

Summary: Shake could corrupt the build database under rare circumstances. Last week I finally figured out why, and fixed it.

The Shake build system records build metadata as it runs, using several techniques to ensure the metadata is not corrupted, even if the build process is killed. Over the past few years I've received a small number of reports where Shake ended up with corrupted metadata, causing a complete rebuild. These reports usually involved a build error, and often a laptop, but I was never able to reproduce the bug. Last week I finally understood the cause, and the bug is now fixed in shake-0.10.9. In this post I explain how Shake records its metadata, what the bug was and how I fixed it.

The Shake metadata

Shake records build metadata in a file, as a sequence of entries. Each entry represents one firing of a build rule, and contains the length of the entry, followed by the key of that rule (e.g. a filename) and the values produced by that rule (e.g. a modification time and list of dependencies).

The metadata file is designed to be robust even if the build is forcibly killed, using the length of the entry to detect whether the entry was written out in full, or if the program aborted while writing the entry. The key property I rely on is that all entries are complete and valid, apart from the last which may be only partially written.

The Bug

I was writing to the file with:

withLock lock $ LBS.hPut handle entry

This code was called from multiple threads, so to ensure writes were not interleaved, I used withLock. In Shake, if any build rule fails, I abort the build by killing all worker threads. However, during the process of aborting, one thread could take the lock, write out some prefix of its entry, then be killed. Another thread could follow the same pattern, resulting in two half-entries and a corrupted metadata file.

For this problem to arise you need to abort a thread that is in the middle of writing to the file, then wait sufficiently long to give a second thread the chance to start writing before it too is killed. In practice I believe it requires an error to be raised and two rules to produce entries almost simultaneously, and for the first thread to be killed to be the one that took the lock, and for the threads to be killed somewhat slowly. In days of random testing once every 20 seconds I never managed to reproduce such a delicate arrangement.

The Fix

The fix is straightforward, just make a single thread responsible for all writes to the file:

chan <- newChan
forkIO $ forever $ LBS.hPut h =<< readChan
let write x = do
    evaluate $ LBS.length x
    writeChan x

Here the chan keeps a list of things to write out, a single thread reads from chan and writes to the file, and any thread can call write. It is important to evaluate before writing to chan so that any exceptions caused by binary encoding are raised by the thread that caused them, and errors get the correct stack trace (Shake adds nice stack traces to all error messages).

In practice, the code is complicated by the possibility of exceptions, cleaning up on shutdown and a desire to flush the file periodically. The real code in Development.Shake.Storage is:

flushThread :: Maybe Double -> Handle -> ((LBS.ByteString -> IO ()) -> IO a) -> IO a
flushThread flush h act = do
    chan <- newChan -- operations to perform on the file
    kick <- newEmptyMVar -- kicked whenever something is written
    died <- newBarrier -- has the writing thread finished

    flusher <- case flush of
        Nothing -> return Nothing
        Just flush -> fmap Just $ forkIO $ forever $ do
            takeMVar kick
            threadDelay $ ceiling $ flush * 1000000
            tryTakeMVar kick
            writeChan chan $ hFlush h >> return True

    root <- myThreadId
    writer <- forkIO $ handle (\(e :: SomeException) -> signalBarrier died () >> throwTo root e) $
        -- only one thread ever writes, ensuring only the final write can be torn
        whileM $ join $ readChan chan

    (act $ \s -> do
            evaluate $ LBS.length s -- ensure exceptions occur on this thread
            writeChan chan $ LBS.hPut h s >> tryPutMVar kick () >> return True)
        `finally` do
            maybe (return ()) killThread flusher
            writeChan chan $ signalBarrier died () >> return False
            waitBarrier died

This function takes the flush interval (in seconds, or Nothing to never flush), and the file handle, and an action to run which requires the write function. It's pretty complex code, which is why it has such a high density of comments (compared to my usual code).


Steve Severance said...

Sweet. I run into this maybe 10 times per day with our codebase. Can't wait to pull this fix in.

Neil Mitchell said...

Steve: Wow, 10 times per day? How many Shake builds are you running a day? Are you doing something unusual, or running in weird conditions? I see less than 1 error every 1000 builds, so am surprised it is hitting you that hard.

If you continue to see corruption, or if there are any other issues with Shake, please email the mailing list or me personally (if it's secret stuff).