[Haskell-cafe] Dynamic thread management?

Mon Aug 13 14:53:06 EDT 2007

Hi!

I am thinking about a model where you would have only n threads on a
n-core (or processor) machine. They would be your worker threads and
you would spawn them only once (at the beginning of the program) and
then just delegate work between them.

On 8/13/07, Jan-Willem Maessen <jmaessen at alum.mit.edu> wrote:
> The problem here is that while Cilk spawns are incredibly cheap,
> they're still more than a simple procedure call (2-10x as expensive
> if my fading memory serves me rightly).  Let's imagine we have a
> nice, parallelizable computation that we've expressed using recursive
> subdivision (the Cilk folks like to use matrix multiplication as an
> example here).  Near the leaves of that computation we still spend
> the majority of our time paying the overhead of spawning.  So we end
> up actually imposing a depth bound, and writing two versions of our
> computation---the one that spawns, which we use for coarse-grained
> computations, and the one that doesn't, which we use when computation
> gets fine-grained.  It makes a really big difference in practice.

But this could be done at the runtime too. If the
lazy-non-evaluated-yet chunk is "big" then divide it into a few parts
and run each part in its thread. But if the chunk is small (you are at
the end of the evaluation and you already evaluated necessary
subexpressions) you do it in the thread which encounters this
situation (which is probably near the end of the program or the end of
the monadic IO action).

And this line when you choose to delegate or not can be determined at
runtime too.

In combination with some transactional memory or some other trick
which would be behind this delegation this could be probably possible.

We could also hint runtime that the function would probably take a
long time to compute (even if it is lazy) with making a type for such
functions which would signal this. Of course this could also make
things worse if used improperly. But sometimes you know that you will
be running the map of time-consuming function.

Yes, you have parMap but the problem I saw with it (and please correct
me if I am wrong) is that it spawns a new thread for every application
of the function to the element? But what if functions are small? Then
this is quite an overhead. And you have to know this in advance if you
want to use something else than the default parMap which is not always
possible (if we are passing a function as an argument to the function
which calls map).

For example:

calculate f xs = foldl (+) 0 $ map f xs -- or foldr, I am not sure

And I would like to see something like this: it gets to the point when
we need to evaluate this function call, for some "big" f and some
"big" list of xs, so the thread which gets to it starts evaluating the
first value and when it starts with another (it is recursive call so
it is a "similar" evaluation) it sees that the other thread is empty
and the function would probably take a long time (it speculates) so it
delegates it there and continues with the third element that is a
dummy recursive call to the function, in this case foldl (dummy
because it did not really evaluate everything at the previous level).
Now, if the second thread finishes first, it goes to the next element
(recursive call) but sees that it is already (still) evaluating, so it
gets to the fourth. Otherwise, it the first thread finishes first just
goes to the next element.

This would be some kind of speculative evaluating. If the CPUs know
how to do that why would not we at the higher level?

It would be also an interesting lazy evaluation of the, in this
example, foldl function. The system would make a recursive call but
would just go another call deeper if it finds that it is impossible
(because it is still evaluating somewhere else) to evaluate it. And at
every level it finishes it will check previous levels if it can
collapse them (and maybe prematurely finish the whole thing).

It would be like that I unwind the loop (in this case recursion),
evaluate everything (on many threads) and then (try to) calculate the
value. If it finishes prematurely ... sad, but for "big" lists and
"big" functions it would be a saver. And the size of this window
(unwinding) would be a heuristic speculative function of number of
possible threads (cores, processors) and of information how long have
previous evaluations of the same function taken.

I really messed this explanation up. Or maybe it is completely wrong
and this is why it looks like a mess to me. :-)

Mitar