Thursday, January 14, 2016

A simple Haskell function

Summary: An example of a small function I recently wrote - from type signature to tests.

When writing a build system there are lots of nasty corner cases to consider. One is that command line limits combined with lots of arguments sometimes requires splitting a single command up into multiple commands, each of which is under some maximum length. In this post I'll describe a function that was required, my implementation, and how I tested it.

Type signature and documentation

Before I even got to the function, it already had a type signature and some Haddock documentation:

-- | @chunksOfSize size strings@ splits a given list of strings into chunks not
--   exceeding @size@ characters. If that is impossible, it uses singleton chunks.
chunksOfSize :: Int -> [String] -> [[String]]

As an example:

chunksOfSize 5 ["this","is","a","test"] == [["this"],["is","a"],["test"]]

Implementation

My implementation was:

chunksOfSize n = repeatedly $ \xs ->
    let i = length $ takeWhile (<= n) $ scanl1 (+) $ map length xs
    in splitAt (max 1 i) xs

First we use the repeatedly function from the extra library. This has the signature:

repeatedly :: ([a] -> (b, [a])) -> [a] -> [b]

Given a list of input, you supply a function that splits off an initial piece and returns the rest. One of the examples in the documentation is:

repeatedly (splitAt 3) xs  == chunksOf 3 xs

So we can see how repeatedly lets us focus on just the "next step" of this list, ignoring the recursion. For the function argument we have two tasks - first decide how many items to put in this chunk, then to split the chunks. Splitting the chunks is the easy bit, and can be written:

splitAt (max 1 i) xs

If we know the next i elements will be at or below the limit, then we can use splitAt to divide the elements. As a special case, if no elements would be allowed, we allow one, using max 1 to ensure we never pass 0 to splitAt (and thus enter an infinite loop). That leaves us with:

i = length $ takeWhile (<= n) $ scanl1 (+) $ map length xs

Reading from right to left, we reduce each element to it's length, then use scanl1 to produce a running total - so each element represents the total length up to that point. We then use takeWhile (<= n) to keep grabbing elements while they are short enough, and finally length to convert back to something we can use with splitAt.

Tests

When testing, I tend to start with a few concrete examples then move on to QuickCheck properties. As an initial example we can do:

quickCheck $
    chunksOfSize 3 ["a","b","c","defg","hi","jk"] ==
    [["a","b","c"],["defg"],["hi"],["jk"]]

Here we are explicitly testing some of the corner cases - we want to make sure the full complement of 3 get into the first chunk (and we haven't got an off-by-one), we also test a singleton chunk of size 4. Now we move on to QuickCheck properties:

quickCheck $ \n xs ->
    let res = chunksOfSize n xs
    in concat res == xs &&
       all (\r -> length r == 1 || length (concat r) <= n) res

There are really two properties here - first, the chunks concat together to form the original. Secondly, each chunk is either under the limit or a singleton. These properties capture the requirements in the documentation.

A final property we can check is that it should never be possible to move the first piece from a chunk to the previous chunk. We can write such a property as:

all (> n) $ zipWith (+)
    (map (sum . map length) res)
    (drop 1 $ map (length . head) res)

This property isn't as important as the other invariants, and is somewhat tested in the example, so I didn't include it in the test suite.

Performance and alternatives

The complexity is O(n) in the number of Char values, which is as expected, since we have to count them all. Some observations about this point in the design space:

  • In a strict language this would be an O(n^2) implementation, since we would repeatedly length and scanl the remainder of the tail each time. As it is, we are calling length on the first element of each chunk twice, so there is minor constant overhead.
  • Usually in Haskell, instead of counting the number of elements and then doing splitAt we would prefer to use span - something like span ((<= n) . fst) .... While possible, it makes the special singleton case more difficult, and requires lots of tuples/contortions to associate each element with its rolling sum.
  • For a build system, the entire input will be evaluated before, and the entire output will be kept in memory afterwards. However, if we think about this program with lazy streaming inputs and outputs, it will buffer each element of the output list separately. As a result memory would be bounded by the maximum of the longest string and the Int argument to chunksOfSize.
  • It is possible to write a streaming version of this function, which returns each String as soon as it is consumed, with memory bounded by the longest string alone. Moreover, if the solution above was to use lazy naturals, it would actually come quite close to being streaming (albeit gaining a quadratic complexity term from the takeWhile (<= n)).
  • The type signature could be generalised to [a] instead of String, but I would suspect in this context it's more likely for String to be replaced by Text or ByteString, rather than to be used on [Bool]. As a result, sticking to String seems best.

Refactoring the previous version

The function already existed in the codebase I was working on, so below is the original implementation. This implementation does not handle the long singleton special case (it loops forever). We can refactor it to support the singleton case, which we do in several steps. The original version was:

chunksOfSize _    [] = []
chunksOfSize size strings = reverse chunk : chunksOfSize size rest
  where
    (chunk, rest) = go [] 0 strings
    go res _         []     = (res, [])
    go res chunkSize (s:ss) =
        if newSize > size then (res, s:ss) else go (s:res) newSize ss
      where
        newSize = chunkSize + length s

Refactoring to use repeatedly we get:

chunksOfSize size = repeatedly $ second reverse . go [] 0
  where
    go res _         []     = (res, [])
    go res chunkSize (s:ss) =
        if newSize > size then (res, s:ss) else go (s:res) newSize ss
      where
        newSize = chunkSize + length s

Changing go to avoid the accumulator we get:

chunksOfSize size = repeatedly $ go 0
  where
    go _         []     = ([], [])
    go chunkSize (s:ss) =
        if newSize > size then ([], s:ss) else first (s:) $ go newSize ss
      where
        newSize = chunkSize + length s

It is then reasonably easy to fix the singleton bug:

chunksOfSize size = repeatedly $ \(x:xs) -> first (x:) $ go (length x) xs
  where
    go _         []     = ([], [])
    go chunkSize (s:ss) =
        if newSize > size then ([], s:ss) else first (s:) $ go newSize ss
      where
        newSize = chunkSize + length s

Finally, it is slightly simpler to keep track of the number of characters still allowed, rather than the number of characters already produced:

chunksOfSize size = repeatedly $ \(x:xs) -> first (x:) $ go (size - length x) xs
  where
    go n (x:xs) | let n2 = n - length x, n2 >= 0 = first (x:) $ go n2 xs
    go n xs = ([], xs)

Now we have an alternative version that is maximally streaming, only applies length to each element once, and would work nicely in a strict language. I find the version at the top of this post more readable, but this version is a reasonable alternative.

Acknowledgements: Thanks to Andrey Mokhov for providing the repo, figuring out all the weird corner cases with ar, and distilling it down into a Haskell problem.

6 comments:

Andrey Mokhov said...

Thanks for the detailed analysis! Surprisingly interesting for what initially appeared to be a boring function.

I imagine the following generalisation to be useful:

chunksOfSizeBy :: (a -> Int) -> Int -> [a] -> [[a]]

So that

chunksOfSize == chunksOfSizeBy length

Then chunksOfSizeBy id makes sense too. I could even imagine applying it to [Bool] when we need to get chunks with no more than a certain number of ones or zeroes.

The ultimate generalisation is, perhaps, as follows:

chunksOfSizeBy :: Real measure => (a -> measure) -> measure -> [a] -> [[a]]

:)

Daniel Díaz said...

"The type signature could be generalised to [a] instead of String, but I would suspect in this context it's more likely for String to be replaced by Text or ByteString, rather than to be used on [Bool]. As a result, sticking to String seems best."

Perhaps the function could be implemented in terms of the StableFactorialMonoid typeclass from monoid-subclasses. It provides the required methods (length,splitAt) and has instances for lists, Text and ByteString, among others.

I also like to use TextualMonoid whenever possible to make functions work with both String and Text.

Neil Mitchell said...

Andrey: Yes, this function was a lot deeper than I expected! With your generalisation, that is quite general - although Semigroup + Ord would work instead of Real too.

Daniel: I need split on the outer list, but for the inner piece I only need length. There are a lot of divisions of the Monoid class I never knew about before!

Andrey Mokhov said...

Neil: Semigroup + Ord is a reasonable constraint too, but Real permits negative sizes! This may be useful, when you are never allowed to exceed the size of an intermediate chunk, yet cancellations are allowed. Imagine calling chunksOfSizeBy on a list of data StackOp a = Push a | Pop with weights +1 and -1, respectively.

Neil Mitchell said...

Andrey: Using the Sum Monoid wrapper you can just have any Num, so it's just as powerful as Real. Having push/pop grouped that way would be a very novel use of chunksOfSizeBy, certainly one I hadn't thought of.

Andrey Mokhov said...

Neil: Ah, of course, you are right. I'm now tempted to generalise our implementation in Shaking-up-GHC, just for fun :-)