Tuesday, November 19, 2013

The oldest Shake bug - async exceptions are hard

Summary: Shake could corrupt the build database under rare circumstances. Last week I finally figured out why, and fixed it.

The Shake build system records build metadata as it runs, using several techniques to ensure the metadata is not corrupted, even if the build process is killed. Over the past few years I've received a small number of reports where Shake ended up with corrupted metadata, causing a complete rebuild. These reports usually involved a build error, and often a laptop, but I was never able to reproduce the bug. Last week I finally understood the cause, and the bug is now fixed in shake-0.10.9. In this post I explain how Shake records its metadata, what the bug was and how I fixed it.

The Shake metadata

Shake records build metadata in a file, as a sequence of entries. Each entry represents one firing of a build rule, and contains the length of the entry, followed by the key of that rule (e.g. a filename) and the values produced by that rule (e.g. a modification time and list of dependencies).

The metadata file is designed to be robust even if the build is forcibly killed, using the length of the entry to detect whether the entry was written out in full, or if the program aborted while writing the entry. The key property I rely on is that all entries are complete and valid, apart from the last which may be only partially written.

The Bug

I was writing to the file with:

withLock lock $ LBS.hPut handle entry

This code was called from multiple threads, so to ensure writes were not interleaved, I used withLock. In Shake, if any build rule fails, I abort the build by killing all worker threads. However, during the process of aborting, one thread could take the lock, write out some prefix of its entry, then be killed. Another thread could follow the same pattern, resulting in two half-entries and a corrupted metadata file.

For this problem to arise you need to abort a thread that is in the middle of writing to the file, then wait sufficiently long to give a second thread the chance to start writing before it too is killed. In practice I believe it requires an error to be raised and two rules to produce entries almost simultaneously, and for the first thread to be killed to be the one that took the lock, and for the threads to be killed somewhat slowly. In days of random testing once every 20 seconds I never managed to reproduce such a delicate arrangement.

The Fix

The fix is straightforward, just make a single thread responsible for all writes to the file:

chan <- newChan
forkIO $ forever $ LBS.hPut h =<< readChan
let write x = do
    evaluate $ LBS.length x
    writeChan x

Here the chan keeps a list of things to write out, a single thread reads from chan and writes to the file, and any thread can call write. It is important to evaluate before writing to chan so that any exceptions caused by binary encoding are raised by the thread that caused them, and errors get the correct stack trace (Shake adds nice stack traces to all error messages).

In practice, the code is complicated by the possibility of exceptions, cleaning up on shutdown and a desire to flush the file periodically. The real code in Development.Shake.Storage is:

flushThread :: Maybe Double -> Handle -> ((LBS.ByteString -> IO ()) -> IO a) -> IO a
flushThread flush h act = do
    chan <- newChan -- operations to perform on the file
    kick <- newEmptyMVar -- kicked whenever something is written
    died <- newBarrier -- has the writing thread finished

    flusher <- case flush of
        Nothing -> return Nothing
        Just flush -> fmap Just $ forkIO $ forever $ do
            takeMVar kick
            threadDelay $ ceiling $ flush * 1000000
            tryTakeMVar kick
            writeChan chan $ hFlush h >> return True

    root <- myThreadId
    writer <- forkIO $ handle (\(e :: SomeException) -> signalBarrier died () >> throwTo root e) $
        -- only one thread ever writes, ensuring only the final write can be torn
        whileM $ join $ readChan chan

    (act $ \s -> do
            evaluate $ LBS.length s -- ensure exceptions occur on this thread
            writeChan chan $ LBS.hPut h s >> tryPutMVar kick () >> return True)
        `finally` do
            maybe (return ()) killThread flusher
            writeChan chan $ signalBarrier died () >> return False
            waitBarrier died

This function takes the flush interval (in seconds, or Nothing to never flush), and the file handle, and an action to run which requires the write function. It's pretty complex code, which is why it has such a high density of comments (compared to my usual code).

Saturday, November 16, 2013

ACM Article - Leaking Space

I wrote an article for the Communcations of the ACM which is now available on the ACM Queue, entitled Leaking Space, all about space leaks. It's got plenty of Haskell, but is intended as a general article on space leaks, and even has an example in Javascript. The introduction features both printed encyclopedia's and postage stamps - both things my son might only learn about on Wikipedia. The abstract:

A space leak occurs when a computer program uses more memory than necessary. In contrast to memory leaks, where the leaked memory is never released, the memory consumed by a space leak is released, but later than expected. This article presents example space leaks and how to spot and eliminate them.

Read Leaking Space for free, or if you have an ACM subscription, also here.

Thanks to Andy Gill for requesting the article and for suggestions to improve it, and to the team at ACM for typesetting/proofing etc.

Tuesday, November 12, 2013

NSIS plugins and LZMA solid

Summary: If you use NSIS plugins along with LZMA solid compression, make sure you reference the plugins first.

The NSIS Windows Installer supports plugins, which are dlls added to the installer which are called at install time. It also supports LZMA solid compression, which means that all files are compressed as a single continuous run for better compression. These two features can interact poorly. As an example:

SetCompressor /SOLID lzma
Page instfiles "" createInstFiles
Section dummy
    SetOutPath "$INSTDIR"
    File "*.exe"
Function createInstFiles

This installer is LZMA solid compressed and installs all .exe files into the installation directory. It also uses the Taskbar progress plugin to provide progress information in the taskbar.

Unfortunately, if the .exe files are large (say 1Gb) the installer will visibly freeze for around 40 seconds when you start to install. When the createInstFiles function is run, it first extracts the 3Kb file w7tbp.dll and loads it. NSIS adds files to the archive in the order they are discovered, meaning that since w7tbp.dll is first mentioned at the end of the file it is placed at the end of the archive. Solid compression requires decompressing everything before a file to extract it, and thus accessing the last file requires a lot of disk access and CPU time.

The workaround is to move all addin calls before all sections, so they go at the front of the archive, and can be extracted quickly. However, sections define names which are only accessible later on in the file, so not all functions can be moved before sections. Therefore, as a general technique, you can write a dummy function that is never called but which references each plugin, and put that function before any sections.

My NSIS wrapper library automatically uses the workaround.

Monday, November 04, 2013

NSIS Haskell wrapper now supports plugins

Summary: I've just released NSIS-0.2.3 which supports NSIS plugins.

My NSIS library is a Haskell EDSL to make it easier to write NSIS Windows Installers. I've just released a new version which has support for NSIS plugins. These plugins can provide extra functions to an NSIS installer, for example base64 encode/decode or displaying Windows 7 taskbar progress indications.

My NSIS library provides a low-level wrapper to interact with plugins, which can be wrapped to produce type-safe easy-to-use plugins. As an example, here are two NSIS plugins I have wrapped:

Example 1: base64 encode/decode

The base64 plugin provides two methods, encrypt and decrypt, both with the same pattern. In NSIS we can write:

StrLen $1 $2
Base64::Encrypt $1 $2
Pop $3

Using the Haskell library this becomes:

encrypt :: Exp String -> Exp String
encrypt x = share x $ \x -> do
    plugin "Base64" "Encrypt" [exp_ x, exp_ $ strLength x]

We have given the encrypt function a proper type, and compute the strLength automatically. The only complication is that since we are using a call-by-name language, and are using x twice, we must share it to ensure it is computed only once. We can now write:

alert $ "Hello encoded is " & encrypt "Hello"

The module Development.NSIS.Plugins.Base64 provides this function, along with decrypt.

Example 2: Windows 7 taskbar progress

The Taskbar progress plugin provides a progress bar inside the taskbar on Windows 7 and above. In NSIS we can write:

Function ShowInstFiles

The !define must occur before the appropriate !insertmacro, so you are restricted to where this code can go. Using the Haskell library this becomes:

taskbar :: Action ()
taskbar = onPageShow InstFiles $ plugin "w7tbp" "Start" []

Now the taskbar function can be called anywhere, and it will automatically move the macros to the right place. The module Development.NSIS.Plugins.Taskbar provides this function.

Contributions welcome

NSIS is hosted on github, and I welcome contributions. Some things you could help with:

  • I currently wrap most of the standard NSIS functions, but not all of them. I welcome any additional wrappers. I have been wrapping functions by need, but eventually would like to have everything wrapped.
  • The functions are mostly intended to be understood in conjunction with the NSIS documentation. I welcome any enhanced documented or work to make the documentation standalone, so people don't need to look at the underlying NSIS docs.
  • I currently wrap 2 plugins - one I needed and one that made a good demo. I welcome wrappers for all plugins.

I have written this library to address my needs. I would welcome bug reports or pull requests, in particular if you can write the installer you need in NSIS but not in this library.