Jan-Willem Maessen jmaessen at alum.mit.edu
Mon Aug 13 16:01:14 EDT 2007

```On Aug 13, 2007, at 2:53 PM, Mitar wrote:

> Hi!
>
> I am thinking about a model where you would have only n threads on a
> n-core (or processor) machine. They would be your worker threads and
> you would spawn them only once (at the beginning of the program) and
> then just delegate work between them.
>
> On 8/13/07, Jan-Willem Maessen <jmaessen at alum.mit.edu> wrote:
>> The problem here is that while Cilk spawns are incredibly cheap,
>> they're still more than a simple procedure call (2-10x as expensive
>> if my fading memory serves me rightly).  Let's imagine we have a
>> nice, parallelizable computation that we've expressed using recursive
>> subdivision (the Cilk folks like to use matrix multiplication as an
>> example here).  Near the leaves of that computation we still spend
>> the majority of our time paying the overhead of spawning.  So we end
>> up actually imposing a depth bound, and writing two versions of our
>> computation---the one that spawns, which we use for coarse-grained
>> computations, and the one that doesn't, which we use when computation
>> gets fine-grained.  It makes a really big difference in practice.
>
> But this could be done at the runtime too. If the
> lazy-non-evaluated-yet chunk is "big" then divide it into a few parts
> and run each part in its thread. But if the chunk is small (you are at
> the end of the evaluation and you already evaluated necessary
> subexpressions) you do it in the thread which encounters this
> situation (which is probably near the end of the program or the end of

I didn't make my point very well.  The hard part is determining
exactly when a chunk is "big" or "small" without actually computing
its value.  Consider recursive fib (which I use only because it's
easy to understand, and has been used as an example of this problem
by the Cilk implementors):

fib n = if n <= 1 then n else fib (n-1) + fib (n-2)

Imagine we're computing (fib 30).  We can divide and conquer; plenty
of parallelism there!  But do most calls to fib represent enough work
to justify running them in parallel?  No, because most of the calls
are to (fib 0) or (fib 1)!  We should only pay the spawn cost up to a
certain bound --- say n >= 5 --- and then run serially for smaller
n.  This has a dramatic effect on how fast fib runs, but of course
the best possible choice of bound is going to be machine-dependent.

We can instrument our program and have some chance of doing OK for
fib :: Int -> Int; it's not at all obvious what to do for:
myFunction :: [Int] -> (Int -> Bool) -> [Frob]

In effect, I end up needing to write myFunction with a built-in bound
on computation, and I need to do it in such a way that the underlying
systems knows that one branch should be serial and the other branch
parallel.  This is the problem I was attempting to allude to above.
Yes, we can decide on a function-by-function or callsite-by-callsite
basis that "enough work" is being done to justify parallelism.  But
the answer to this question is often "maybe", or even "no" (as in fib).

> Yes, you have parMap but the problem I saw with it (and please correct
> me if I am wrong) is that it spawns a new thread for every application
> of the function to the element?

> But what if functions are small? Then
> this is quite an overhead. And you have to know this in advance if you
> want to use something else than the default parMap which is not always
> possible (if we are passing a function as an argument to the function
> which calls map).
>
> For example:
>
> calculate f xs = foldl (+) 0 \$ map f xs -- or foldr, I am not sure

You seem to be arguing that we can pick the right implementation of
"map" and "fold" (serial or parallel) here if we only knew that xs
was "big enough" and f "expensive enough".

I agree.  But that begs the question: let's assume "calculate" is a
function that's called from numerous places, with a mixture of "big"
and "small" arguments.  Now I need two versions of calculate, and I
need to decide at each call site whether to call "big calculate" or
"small calculate".  We also need to make sure any eagerness we
introduce is semantically sound, too, but I think we've got a pretty
good handle on that part in practice between my work on resource-
bounded evaluation in Eager Haskell, Rob Ennals' work on eager
evaluation in GHC, and the Singh & Harris paper.

That isn't to say that any of this is impossible---but it's going to
take a while to figure out how to get it right [it'd be a great Ph.D.
project to even knock off the low-hanging fruit like fib and
recursive matrix multiply].

Meanwhile, we also need to work hard educating programmers how to
write code that'll run in parallel in the first place, and giving
them the libraries that'll make it easy.

-Jan-Willem Maessen

```