*Summary: We have a version of mapM for IO that takes *

*O(1)*stack and is faster than the standard Haskell/GHC one for long lists.

The standard Haskell/GHC base library `sequence`

function in `IO`

takes *O(n)* stack space. However, working with Tom Ellis, we came up with a version that takes *O(1)* stack space. Our version is slower at reasonable sizes, but faster at huge sizes (100,000+ elements). The standard definition of `sequence`

(when specialised for both `IO`

and `[]`

) is equivalent to:

```
sequence :: [IO a] -> IO [a]
sequence [] = return []
sequence (x:xs) = do y <- x; ys <- sequence xs; return (y:ys)
```

Or, when rewritten inlining the monadic bind and opening up the internals of GHC's IO type, becomes:

```
sequence :: [IO a] -> IO [a]
sequence [] = IO $ \r -> (# r, () #)
sequence (y:ys) = IO $ \r -> case unIO y r of
(# r, y #) -> case unIO (sequence xs) r of
(# r, ys #) -> (# r, y:ys #)
```

For those not familiar with `IO`

, it is internally defined as:

```
newtype IO a = IO (State# RealWorld -> (# State# RealWorld, a #)
```

Each `IO`

action takes a `RealWorld`

token and returns a `RealWorld`

token, which ensures that `IO`

actions run in order. See here for a full tutorial.

Our observation was that this version requires *O(n)* stack space, as each recursive call is performed inside a `case`

. The algorithm proceeds in two phases:

- First, it traverses the input list, evaluating each action and pushing
`y`

on the stack. - After reaching the
`[]`

at the end of the list, it traverses the stack constructing the output list.

By constructing the list directly on the heap we can avoid the extra copy and use *O(1)* stack. Our version is:

```
sequenceIO :: [IO a] -> IO [a]
sequenceIO xs = do
ys <- IO $ \r -> (# r, apply r xs #)
evaluate $ demand ys
return ys
where
apply r [] = []
apply r (IO x:xs) = case x r of
(# r, y #) -> y : apply r xs
demand [] = ()
demand (x:xs) = demand xs
```

Here the two traversals are explicit:

- First, we traverse the list using
`apply`

. Note that we pass the`RealWorld`

token down the list (ensuring the items happen in the right order), but we do not pass it back. - To ensure all the
`IO`

actions performed during`apply`

happen before we return any of the list, we then`demand`

the list, ensuring the`[]`

element has been forced.

Both these traversals use *O(1)* stack. The first runs the actions and constructs the list. The second ensures evaluation has happened before we continue. The trick in this algorithm is:

```
ys <- IO $ \r -> (# r, apply r xs #)
```

Here we cheat by duplicating `r`

, which is potentially unsafe. This line does not evaluate `apply`

, merely returns a thunk which when evaluated will force `apply`

, and cause the `IO`

to happen during evaluation, at somewhat unspecified times. To regain well-defined evaluation order we force the result of `apply`

on the next line using `demand`

.

**Benchmarks**

We benchmarked using GHC 7.10.2, comparing the default `sequence`

(which has identical performance to the specialised monomorphic variant at the top of this post), and our `sequenceIO`

. We benchmarked at different lengths of lists. Our `sequenceIO`

is twice as slow at short lists, draws even around 10,000-100,000 elements, and goes 40% faster by 1,000,000 elements.

Our algorithm saves allocating stack, at the cost of iterating through the list twice. It is likely that by tuning the stack allocation flags the standard algorithm would be faster everywhere.

**Using sequence at large sizes**

Despite improved performance at large size, we would not encourage people to use `sequence`

or `mapM`

at such large sizes - these functions still require *O(n)* memory. Instead:

- If you are iterating over the elements, consider fusing this stage with subsequence stages, so that each element is processed one-by-one. The
`conduit`

and`pipes`

libraries both help with that approach. - If you are reducing the elements, e.g. performing a
`sum`

, consider fusing the`mapM`

and`sum`

using something like`foldM`

.

For common operations, such a `concat`

after a `mapM`

, an obvious definition of `concatMapM`

is:

```
concatMapM :: Monad m => (a -> m [b]) -> [a] -> m [b]
concatMapM f = liftM concat . mapM f
```

But that if the result of the argument is regularly `[]`

then a more efficient version is:

```
concatMapM op = foldr f (return [])
where f x xs = do x <- op x
if null x then xs else liftM (x ++) xs
```

For lists of 10,000 elements long, using the function `const (return [])`

, this definition is about 4x faster. Version 1.4.2 of the extra library uses this approach for both `concatMapM`

and `mapMaybeM`

.

**Update:** there are now real benchmarks of this technique and some others.

## 7 comments:

This is (more or less) the instance mentioned back when I talked about why `mapM` needed to remain in Traversable for now.

There we just constructed the output list in reverse order on the heap and then reversed it at the end. This uses more memory than the chicanery you use here, but it works for more types. You can actually use that form of `mapM` with any monad walking [].

However for other monads it usually isn't a pure win, in fact its more or less just IO, ST s, and STM-based monads where this wins. =( Many other (non-IO) monads that can productively produce the result list inside of their respectively monads.

I did try the reversing list, but it benchmarked quite a lot slower - I suspect two traversals and two allocations is too expensive. Note that even for IO it's only a win at fairly large sizes, I don't think it's ever worth using.

Some related thoughts can be found on https://www.joachim-breitner.de/blog/620-Constructing_a_list_in_a_Monad

Thanks nomeata, very interesting, I guess this technique could be another way to do what you were doing.

I have included your’s and Twern’s ideas in my benchmarks, see http://www.joachim-breitner.de/blog/684-Constructing_a_list_in_a_monad_revisited for the result.

Thanks nomeata - I've updated the post to link to your post.

I was also able to use DList is a go-through to have O(1) stack space, using foldM. Thus, the variant of mapM that uses DList for all monads depends on Foldable instead of Traversable, and declared as such:

dlistMapM :: (Monad m, Foldable l) => (a -> m b) -> l a -> m [b]

dlistMapM f = fmap DList.toList . foldM item DList.empty

where item acc x = f x >>= return . (acc `DList.snoc`)

Post a Comment