Summary: Shake could corrupt the build database under rare circumstances. Last week I finally figured out why, and fixed it.
The Shake build system records build metadata as it runs, using several techniques to ensure the metadata is not corrupted, even if the build process is killed. Over the past few years I've received a small number of reports where Shake ended up with corrupted metadata, causing a complete rebuild. These reports usually involved a build error, and often a laptop, but I was never able to reproduce the bug. Last week I finally understood the cause, and the bug is now fixed in shake-0.10.9. In this post I explain how Shake records its metadata, what the bug was and how I fixed it.
The Shake metadata
Shake records build metadata in a file, as a sequence of entries. Each entry represents one firing of a build rule, and contains the length of the entry, followed by the key of that rule (e.g. a filename) and the values produced by that rule (e.g. a modification time and list of dependencies).
The metadata file is designed to be robust even if the build is forcibly killed, using the length of the entry to detect whether the entry was written out in full, or if the program aborted while writing the entry. The key property I rely on is that all entries are complete and valid, apart from the last which may be only partially written.
I was writing to the file with:
withLock lock $ LBS.hPut handle entry
This code was called from multiple threads, so to ensure writes were not interleaved, I used
withLock. In Shake, if any build rule fails, I abort the build by killing all worker threads. However, during the process of aborting, one thread could take the lock, write out some prefix of its entry, then be killed. Another thread could follow the same pattern, resulting in two half-entries and a corrupted metadata file.
For this problem to arise you need to abort a thread that is in the middle of writing to the file, then wait sufficiently long to give a second thread the chance to start writing before it too is killed. In practice I believe it requires an error to be raised and two rules to produce entries almost simultaneously, and for the first thread to be killed to be the one that took the
lock, and for the threads to be killed somewhat slowly. In days of random testing once every 20 seconds I never managed to reproduce such a delicate arrangement.
The fix is straightforward, just make a single thread responsible for all writes to the file:
chan <- newChan forkIO $ forever $ LBS.hPut h =<< readChan let write x = do evaluate $ LBS.length x writeChan x
chan keeps a list of things to write out, a single thread reads from
chan and writes to the file, and any thread can call
write. It is important to
evaluate before writing to
chan so that any exceptions caused by binary encoding are raised by the thread that caused them, and errors get the correct stack trace (Shake adds nice stack traces to all error messages).
In practice, the code is complicated by the possibility of exceptions, cleaning up on shutdown and a desire to flush the file periodically. The real code in
flushThread :: Maybe Double -> Handle -> ((LBS.ByteString -> IO ()) -> IO a) -> IO a flushThread flush h act = do chan <- newChan -- operations to perform on the file kick <- newEmptyMVar -- kicked whenever something is written died <- newBarrier -- has the writing thread finished flusher <- case flush of Nothing -> return Nothing Just flush -> fmap Just $ forkIO $ forever $ do takeMVar kick threadDelay $ ceiling $ flush * 1000000 tryTakeMVar kick writeChan chan $ hFlush h >> return True root <- myThreadId writer <- forkIO $ handle (\(e :: SomeException) -> signalBarrier died () >> throwTo root e) $ -- only one thread ever writes, ensuring only the final write can be torn whileM $ join $ readChan chan (act $ \s -> do evaluate $ LBS.length s -- ensure exceptions occur on this thread writeChan chan $ LBS.hPut h s >> tryPutMVar kick () >> return True) `finally` do maybe (return ()) killThread flusher writeChan chan $ signalBarrier died () >> return False waitBarrier died
This function takes the flush interval (in seconds, or
Nothing to never flush), and the file handle, and an action to run which requires the
write function. It's pretty complex code, which is why it has such a high density of comments (compared to my usual code).