[Haskell-cafe] mapM is supralinear?

Thu Sep 22 02:05:30 CEST 2011

On Wed, Sep 21, 2011 at 1:57 PM, Tim Docker <tim at dockerz.net> wrote:
>
> On 09/09/2011, at 8:19 PM, John Lato wrote:
>
>> Agreed.  Whenever I'd like to use mapM (or any other function for
>> which a *M_ is available), I've found the following rules helpful:
>>
>> 1.  If I can guarantee the list is short (~ n<=20), go ahead and use mapM
>> 2.  Otherwise use mapM_, foldM_, or foldM if a real reduction is
>> possible (i.e. not "foldM snocM []").
>>
>> Step 2 sometimes requires changing my design, but it's always been for
>> the better.  `mapM_` tends to require more pipeline composition, so
>> it's leveraging the language's strengths.
>
> This thread is really interesting - it relates directly to problems I am
> currently
> having with mapM over large lists (see the thread "stack overflow pain").
>
> Can you explain what you mean by "mapM_ tends to require more pipeline
> composition"?
> In what way is it leveraging the language strengths?

Hmm, that is suitably cryptic.  One way to think of it is an inversion
of control.  Instead of operating on whole collections of things in a
monad, you specify monadic actions (pipelines) which are applied
sequentially to each input.

Here's a simple example.  Suppose you have a bunch of data serialized
to files, and you want to read each file into a data structure, apply
some process based upon the last file's data, and write out the output
to new files.  One way to do that would look like:

do
    dats <- mapM readMyData files
    let pairs = zip (mempty:dats) dats
    zipWithM_ (\(last, this) fname -> writeMyData (update last this)
fname) pairs newFiles

However, you could also put everything into a single monadic
operation, like this

do
    foldM_ (\last (infile, outfile) -> do
                                                    this <- readMyData infile
                                                    writeMyData
(update last this) outfile
                                                    return this
               )
               mempty
               (zip files newFiles)

The first interleaves control (mapM, zipWIthM_) with monadic actions
(file IO), whereas the second only has one control function (foldM_)
which completely processes one input.  I say this is more pipeline
composition because you have to create an entire pipeline from input
to output, which is then sequentially fed inputs by the control
function.

I say this leverages Haskell's strengths because it's quite easy to
compose functions and monadic actions in Haskell.  It also tends to be
garbage-collector friendly.  I also find it much easier to reason
about space usage.  You don't need to worry if part of a list is being
retained, because the full list of data doesn't appear anywhere.  If
you need to access prior elements they're specified explicitly so you
know exactly how much data you're holding on to.

My perspective might be warped by my work on iteratees, but I find
this a very natural approach.

John L.