[Haskell-cafe] monadic MapReduce

Thu Mar 5 13:48:27 EST 2009

On Tue, Mar 03, 2009 at 07:27:35AM -1000, Tim Newsham wrote:
>> How about this. Is there a reason why I can't
>> replace the variables b and c in the type signature of mapReduce with with (IO b')
>> and (IO c'). b and c  can be any types.
>>
>> mapReduce :: Strategy (IO b')    -- evaluation strategy for mapping
>>           -> (a -> IO b')      -- map function
>>           -> Strategy (IO c')    -- evaluation strategy for reduction
>>           -> ([IO b'] -> (IO c'))    -- reduce function
>>           -> [a]           -- list to map over
>>           -> (IO c')
>>
>> Just remember to wrap all values back in the IO monad.
>
> Remember, the idea of map-reduce is to provide a very restrictive
> programming interface so that you have a lot of flexibility in
> your execution strategy.  If you start loosening the interface
> you will still be able to execute the program, but you may
> not be able to perform all the great optimizations you want to
> perform.  For example, if you are using IO actions that are
> stateful, what are the semantics?  Can one map action affect
> other map actions?  Does this imply an ordering of the map functions?
> Does this imply they all run on the same machine or at least have
> state communicated between the machines on which they run?

Yes, IO actions in mapReduce will introduce all these questions.
I thought Manlio's example was a  case where the IO actions consist 
only of reading an entire file into a string and no ordering or 
dependence between map actions is implied. I have a similar application
where this, unsafe as it is, might be useful. I need to read in a large 
number of CSV files, with numeric columns, and group columns with high 
correlation together.
I have yet to try my own suggestion though.
I have tried the first approach I suggested, read in the contents into 
strings and work with the normal map-reduce. That does keep the handles
open till map-reduce is done.

> The austere interface precludes any of these issues, and therein
> lies the beauty.
>
>> Anish
>
> Btw. I prefer the sawzall formulation over the map-reduce formulation.
> A sawzall program just specifies how to map some data to a monoid
> and the system is free to mappend the monoid values in whatever order
> it wants (by applying associativity).

Thanks for the tip, I wasn't aware of this.
At least in the map-reduce formulation in this thread, the fold is 
completely specified by the user and not optimized further.  So the 
sawazall formulation seems promising, because sometimes the fold action 
is more expensive than the map action.

And while we are talking about different formulations, the code here 
http://article.gmane.org/gmane.comp.lang.haskell.cafe/41944
splits the list of operands into chunks and does the fold action in 
parallel on each chunk. This actually works well for me, though I find 
the semantics a little hard to remember.

Anish

> Tim Newsham
> http://www.thenewsh.com/~newsham/