[Haskell-cafe] Grouping and SIMD in parallel Haskell (using Nested Data Parallel Haskell ideas in legacy code)

Zefirov Sergey zefirov at prosoft.ru
Tue Aug 18 06:06:21 EDT 2009


> -----Original Message-----
> From: Eugene Kirpichov [mailto:ekirpichov at gmail.com]
> Sent: Tuesday, August 18, 2009 11:28 AM
> To: Zefirov Sergey
> Cc: haskell-cafe at haskell.org
> Subject: Re: [Haskell-cafe] Grouping and SIMD in parallel Haskell
> (using Nested Data Parallel Haskell ideas in legacy code)
> 
> Now I finally understood the idea that you were talking about :) I
> didn't quite get it during the MskHUG meeting.
> The idea is brilliant to my mind; I am surprised that none of the
> gurus are answering.
> 
> So, if I understood it correctly, you are suggesting to:
>  - make a separate spark queue (implemented by a linked list of
> groupSize-sized arrays) for every thunk type in the program that can
> possibly be the argument of a `par` or `pseq` (which, however, may
> turn out to be almost all thunk types present in the program)

Yep.

And it could be beneficial by itself.
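
As a rough illustration of the grouping part, here is a minimal, hypothetical model of one such queue (SparkQueue, groupSize and pushSpark are all made-up names for this sketch, with lists standing in for the arrays): since every entry in a given queue shares the same code pointer, only the payload (the thunk's free variables) needs to be stored, chunked into groupSize-sized pieces.

    import Data.IORef

    -- One queue per thunk (closure) type.  The code is the same for
    -- every entry, so only the payloads are stored, chunked into
    -- groupSize-sized pieces (modelled by plain lists for brevity).
    data SparkQueue a = SparkQueue
      { sqCode   :: a -> ()       -- the shared code of this thunk type
      , sqSparks :: IORef [[a]]   -- linked list of groupSize-sized chunks
      }

    groupSize :: Int
    groupSize = 16

    -- Append a spark: fill the head chunk, starting a fresh chunk
    -- when the current one is full.
    pushSpark :: SparkQueue a -> a -> IO ()
    pushSpark q x = modifyIORef' (sqSparks q) push
      where
        push (c:cs) | length c < groupSize = (x : c) : cs
        push cs                            = [x] : cs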

>  - automagically create vectorized versions of these thunks' code

If we can.

Looks like we can. ;)
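
For concreteness, a hand-written sketch of what such a vectorized version might look like (fibThunk and fibThunkGroup are illustrative names; a real implementation would emit actual SIMD code rather than U.map):

    import qualified Data.Vector.Unboxed as U

    -- Scalar code of the thunk: what a single spark evaluates.
    fibThunk :: Int -> Int
    fibThunk n = if n < 2 then n else fibThunk (n - 1) + fibThunk (n - 2)

    -- The hoped-for vectorized counterpart: the same code applied to a
    -- whole group of pending sparks at once.  U.map is only a stand-in,
    -- but it does keep the arguments unboxed and contiguous, which is
    -- the layout SIMD instructions need.
    fibThunkGroup :: U.Vector Int -> U.Vector Int
    fibThunkGroup = U.map fibThunk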

>  - spawn sparks by appending them to the corresponding queue (to a
> non-locked array node?)

Spark queues can be non-locked if they belong to threads. Or, in other words, each thread has its own spark queue and all threads share a main queue.

If spark queues are shared between threads, they should be locked.

My model suggests that private spark queues slow down execution, but I haven't paid enough attention to the subject, so my code could be wrong.

See for yourself:
    usePrivateGroups  maxABTaskLength  ticks (fib 15)
    0                 0                7568
    0                 16               3157
    0                 32               2977
    1                 0                7753
    1                 16               3772
    1                 32               4242
All other parameters are: cpuCount 16, taskFetchsPerCycle 1, thunksPerAddition 1, cyclesPerAddition 1.

Anyway, let's say we have several processors and they all share all queues. When one processor encounters a locked queue, it immediately proceeds to another queue, which is less likely to be locked.
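
A minimal sketch of that skipping strategy, modelling each queue's lock with an MVar (grabWork is a made-up name; tryTakeMVar returns immediately instead of blocking when the MVar is already taken):

    import Control.Concurrent.MVar

    -- Scan the shared queues; on a locked queue, move on to the next
    -- one instead of blocking.  All tasks of the first non-empty,
    -- unlocked queue are taken in one go.
    grabWork :: [MVar [task]] -> IO [task]
    grabWork []     = return []
    grabWork (q:qs) = do
      m <- tryTakeMVar q
      case m of
        Just ts@(_:_) -> putMVar q [] >> return ts
        Just []       -> putMVar q [] >> grabWork qs
        Nothing       -> grabWork qs   -- locked: skip to the next queue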

>  - evaluate sparks in a vectorized fashion by invoking the vectorized
> code over groups

Again, if we can.


> The problems I see here are:
>  - not all thunks code can be vectorized automatically

I tried to find a counterexample and failed. But I haven't tried very hard. ;)

I noticed that the first argument of par should be considered strict. So we can demand that fib receive an Int# instead of a boxed (and delayed) Int.
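
In par terms, that strictness is already visible in the usual parallel fib (standard code, shown only to fix notation):

    import Control.Parallel (par, pseq)

    -- The sparked expression `x` is evaluated to WHNF by whichever
    -- capability runs the spark, so its result slot only ever holds an
    -- evaluated Int -- an unboxed Int# per queue entry would suffice.
    parFib :: Int -> Int
    parFib n
      | n < 2     = n
      | otherwise = x `par` (y `pseq` (x + y))
      where
        x = parFib (n - 1)
        y = parFib (n - 2)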

Operations on [:Int#:] ([:Float#:], [:Char#:]) should be very efficient.

An argument could be delayed (lazy) or of an algebraic type (like Maybe). In that case we will have forcing of a computation or pattern matching and, therefore, conditional execution.

Conditional execution can be expressed in parallel SIMD operations; an example is given in Tim Sweeney's presentation (near the end).
(http://graphics.cs.williams.edu/archive/SweeneyHPG2009/TimHPG2009.pdf)

The version there isn't the best possible, but it works.
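
For example, in the select/blend style of those slides, both branches are computed for every lane and a mask picks the per-lane result. A toy version over unboxed vectors (vselect and vabs are made-up names for this sketch):

    import qualified Data.Vector.Unboxed as U

    -- Branchless conditional over a whole group: evaluate both branches
    -- for every element, then blend with a per-element mask.
    vselect :: U.Vector Bool -> U.Vector Int -> U.Vector Int -> U.Vector Int
    vselect = U.zipWith3 (\m t f -> if m then t else f)

    -- Toy use: abs over a group, with both branches computed.
    vabs :: U.Vector Int -> U.Vector Int
    vabs xs = vselect (U.map (>= 0) xs) xs (U.map negate xs)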

>  - how should we represent the sparks in the queue for a certain thunk
> type? It's clear that it is no longer needed to put the whole closure
> into the spark queue, since the closure code is already known and
> fixed for each queue. For efficient vectorization, the queue probably
> should consist of closure data arrays.

I think we would borrow closure representation from NDP Haskell. ;)
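
That is, the queue for a given thunk type would store its entries unzipped field-by-field, the way NDP Haskell flattens arrays of records. Something like this hypothetical layout for pending fib sparks (FibSparks is a made-up name):

    import qualified Data.Vector.Unboxed as U

    -- An "array of closures" of one type, NDP-style: the code pointer
    -- is implicit (one per queue), and the environments are stored as a
    -- structure of unboxed arrays rather than an array of structures.
    data FibSparks = FibSparks
      { fsArgs    :: U.Vector Int   -- the n of every pending `fib n`
      , fsResults :: U.Vector Int   -- result slots, filled group-wise
      }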

>  - I do not immediately see an efficient (lockless) way of picking a
> nonempty spark queue; however, I am sure there must be such a way, and
> even if there isn't, the inefficiency may well be outweighed by the
> efficiency gain from vectorization

With separate spark queues we prevent frequent locking of the main spark queue. CPUs get more work per lock. I think that's good in itself; vectorization could only add to it.
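
That amortization is what maxABTaskLength models: under one lock acquisition, a CPU grabs a batch of tasks instead of a single one. A sketch, with the queue again modelled as an MVar (fetchBatch is a made-up name):

    import Control.Concurrent.MVar

    -- Take up to `batch` tasks under a single lock acquisition.
    fetchBatch :: Int -> MVar [task] -> IO [task]
    fetchBatch batch q = do
      ts <- takeMVar q
      let (mine, rest) = splitAt batch ts
      putMVar q rest
      return mine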

Look at the table:
    maxABTaskLength  ticks (fib 15)
    0                7568
    1                5651
    2                4656
    4                3663
    8                3263
    16               3157
    32               2977
    64               2931
(other model parameters: cpuCount 16, usePrivateGroups 0, maxABTaskLength <variable>, taskFetchsPerCycle 1, thunksPerAddition 1, cyclesPerAddition 1)

The effect of vectorization, according to my model, is negligible: thunksPerAddition 2 results in 2894 ticks, and beyond that the time stops decreasing. Increasing taskFetchsPerCycle also does not provide any noticeable speedup.

I cannot promise anything, but I will try to implement my idea, because I already see it as a good one. ;)

PS
I am always fascinated by how amenable Haskell is to RTS changes. Add a complication underneath once and free yourself from complication on the surface forever. ;)

