Thursday, December 11, 2008

mapM, mapM_ and monadic statements

In my last post on F# I mentioned that do mapM f xs; return 1 caused a space leak, and that the programmer should have written mapM_. I also proposed that monadic statements should work more like in F# where non-unit return values can't be ignored. Various people seemed to misunderstand both points, so I thought I'd elaborate.

mapM as a Space Leak

First, I should clarify what I understand as a space leak. A space leak is not a memory leak in the C sense. A space leak is when a computation retains more live memory than necessary for some period of time. One sign of a possible space leak is that lots of memory is retained by garbage collection.

Comparing mapM and mapM_ on the following program:

main = do
mapM* putChar (replicate 10000 'a')
return ()

The mapM_ variant has a maximum heap residency of 2Kb, while the mapM variant has 226Kb. Given an input list of length n, the residency of mapM_ is O(1), while mapM is O(n).

The exact reasons for the space leak are quite detailed, and I'm not going to attempt to cover them. My intuition is that the return list is wrapped in the IO monad, and therefore can't be deallocated until the IO action finishes. In summary, unless you are going to use the end result of a monadic map, always use mapM_.

Monadic Statements

In the above example it would be nice if the compiler had complained. You generated a value, but you didn't use it. Fortunately, it is a very easy fix - change the type of monadic bind (>>) :: Monad m => m a -> m b -> b to Monad m => m () -> m b -> m b. Now, if a monadic statement generates a value that isn't (), you get a type error. The above examples with mapM would be rejected by the type checker.

But what if we really wanted to call mapM? There are two options. The first is to bind the result, for example do _ <- mapM f xs; return 1. The second option, which F# favours, is do ignore $ mapM f xs ; return 1, with the auxiliary ignore :: Monad m => m a -> m (). I prefer the second option, as it clearly states that you want to ignore the result of a computation. You could even write a rule ignore . mapM f == mapM_ f.


Andy Gill said...

I like the idea of >> taking a 'm ()' as its first argument. Matches my intuition that >> ignores its first arguments result.

I wonder how much code would break? Or how many space leaks would be fixed?

Neil Mitchell said...

My guess is that it will break a reasonable amount of code, but that all the code should be easily fixable. You could probably prototype it using the syntax rebinding that GHC offers.

Lennart suggested it would be good to have as a warning to start with - no idea how you can achieve that in the type checker though!

Neil Mitchell said...

My Dr Haskell tool ( will spot obvious things like using mapM and not using the result. Unfortunately its syntax directed, so it probably catches a reasonable amount of mistakes, but not all.

Josef said...

I also like the idea of changing the type of >>. Apart from changing the type of >> one would also change the typing rules for the do notation. It may come as a surprise to you but ghc doesn't use >> when translating the do notation, it uses >>= all the time.

Neil Mitchell said...

Josef: That sounds like a GHC bug, doesn't the report say to use >> for statements with no bindings?

maltem said...

Actually, why is the memory behaviour of the examples as observed? That is, why can't the values of the additional list be gc'ed immediately?

Arne Vogel said...

Your suggestion to redeclare

(>>) :: m () -> m b -> m b

is highly problematic. For example, like many Haskell developers I wrote a Parser monad (actually a MonadPlus instance) that can be used like this (this example demonstrates how to use it with a simplistic markup language):

charP :: Char -> Parser Char

stringP :: String -> Parser String
stringP = mapM charP

elementP :: Parser MLElement
elementP = do
  charP '<'
  tagName <- identP
  charP '>'
  children <- many0P elementP
  stringP ("</" ++ tagName ++ ">")
  return (MLElement tagName children)

My use of stringP would thus be a "space leak" even if the amount of memory required is probably insignificant. As you suggest redeclaring (>>) means I would either have to write something like

elementP = do
  _ <- charP '<'
  tagName <- identP
  _ <- charP '>'
  children <- many0P elementP
  _ <- stringP ("</" ++ tagName ++ ">")
  return (MLElement tagName children)

(effectively reimplementing (>>) without the constraint that the first argument be m () ), which is ugly and unnecessarily verbose, or come up with a unit version of all parser generators whose results I won't be interested in. E.g.

stringP_ :: String -> Parser ()
stringP_ = mapM_ charP

which is rather easy (assuming you won't redeclare mapM_ :: (a -> m ()) -> [a] -> m () at which point things start to get ugly), but I also have a very useful

consP :: Parser a -> Parser [a] -> Parser [a]
consP = liftM2 (:)

It's easy to define

consP_ = consP >>= \_ -> return ()

but while this solves the type problem, it doesn't solve the "space leak" you refer to. In fact, to solve this leak, I would have to write two versions of every parser that uses consP, one that gathers results and one that discards them.

By the way, some may say that I don't need a String -> Parser String function anyway because the result string is simply the same as the argument. However, in this case I couldn't write:

stringsetP :: [String] -> Parser String
stringsetP = msum . map stringP

Long story short, redeclaring (>>) :: m () -> m b -> m b would be a classical case of premature optimization. Worse, enforcing this doesn't necessarily reduce the CPU or memory usage of any Haskell program as this restriction is easy to circumvent and programmers will absolutely do that. The two undisputable effects are code bloat and making code more difficult to read. (Given that Haskell code already has a reputation for being hard to read, undeserved as it may be, this spells harm to Haskell's attractiveness.)

Haskell is, in my opinion, THE language for rapid prototype development especially for data-processing programs. Let's keep it like that and apply optimizations later after the real bottlenecks have been identified. In this example, it would be much more expedient to rewrite stringP to not use charP at all by defining a parser function that consumes more than one character at a time, as Parser a is actually

data Parser a = Parser { parse :: String -> [ParseResult a] }

Neil Mitchell said...

@maltem: Someone else can probably give a better explanation than me! Think of the IO as holding the list, and you need to evaluate the IO action to get round it.

@Arne: Certainly being () is much more applicable to the IO monad than other monads. However, in your parser example I'd respecify charP as Char -> Parser () - they give you a character, so they already know what the result would be. I think its rare you'd need the other form.

My reason for wanting >> to take a unit _isn't_ optimisation - its clarity. My goal isn't to make programs faster, but to attempt to remove another source of confusion for people, and match my intuition. And its only an idea, not saying its a good one :-)

Anonymous said...

I thought that I wouldn't need a String -> Parser String function because the result string is simply the same as the argument. But, I'll look into it anyway.

P.S. Wasn't your former partner Daniel Tammet or 'The Boy with The Incredible Brain?'

tushar said...

Hello Neil,

Thanks for sharing the article. I am new to Haskell and have done a lot of C++/C# and also learning F#.

As you worked in both, C# and Haskell, whats the bet way to get to speed to Haskell and the lambda expressions.


Neil Mitchell said...

@Tushar: Haskell doesn't really use explicit lambdas, since we've got things like partial application etc. Just learn general Haskell, and the lambdas will come naturally. I recommend this book:

gwern said...

Incidentally, if anyone is interested, there will be a Control.Monad.void in GHC 6.14

Generic Viagra For Men said...

Neil Nice blog written. I am new to Haskell ....