[Haskell] Implicit parallel functional programming

Bjorn Lisper lisper at it.kth.se
Thu Jan 20 10:16:18 EST 2005


Keean Schupke:
>>A guess is that the first generation will support a shared-memory model
>>much like the SMPs of today (shared main memory with on-chip caches, or
>>some other kind of local memories). Here, I think implicit parallelism in
>>functional languages can be a win in some situations, provided that
>>spawning parallel threads is cheap and local memory can be used efficiently
>>(so that controlling granularity is no problem, and dynamic parallelism is
>>cheap). That threads will be cheap is likely; efficient use of local memory
>>is a question mark.
>You only need one thread per CPU core, so the cost of spawning threads is
>a red herring, just as GHC schedules Haskell threads from a single
>OS thread.

Then you need some mechanism to assign jobs to threads, which also carries
an overhead. How large that overhead is depends to some degree on the
architectural characteristics of the machine (some shared work pool is
necessary, which means the cost of accesses to common memory comes into play).
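
To make the overhead concrete, here is a minimal sketch of such a shared
work pool in plain Concurrent Haskell (my illustration, not GpH; worker,
pool and the job list are made-up names). Every job dispatch goes through
one shared Chan, i.e. at least one synchronised access to common memory:

  import Control.Concurrent (forkIO, threadDelay)
  import Control.Concurrent.Chan (Chan, newChan, readChan, writeList2Chan)
  import Control.Monad (forever)

  -- One worker per core; all workers pull jobs from one shared pool.
  worker :: Chan (IO ()) -> IO ()
  worker pool = forever (readChan pool >>= \job -> job)

  main :: IO ()
  main = do
    pool <- newChan
    mapM_ (const (forkIO (worker pool))) [1 .. 4 :: Int]   -- say, 4 cores
    writeList2Chan pool [print n | n <- [1 .. 100 :: Int]] -- enqueue jobs
    threadDelay 1000000  -- crude: give the workers time to drain the pool

With a few coarse jobs this cost disappears in the noise; with many small
jobs the shared pool itself becomes the bottleneck.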

>>However, some obstacles:
>>
>>Laziness, if used strictly (sorry), gets in the way of parallelism. Waiting
>>to spawn a parallel thread until we know for sure that its result is
>>needed, when it is needed 99% of the time anyway, is not an optimal
>>strategy. I think optimistic evaluation strategies a la Robert Ennals can
>>be helpful here. But designing them to work for new parallel architectures
>>may be non-trivial. I think there is material for PhD theses here.
>I suggest speculative execution, as used by current CPUs to take advantage
>of multiple execution units. If a CPU has spare capacity, functions whose
>results might be needed are run. As soon as you know a result is not needed,
>the thread can be killed and the time used for something else. A lot of work
>has already been done on superscalar hardware architectures, and much of it
>should be applicable to the software case.

Yes, precisely what I refer to above (the work of Robert Ennals). But you
need to choose carefully what to speculate on, and that can be nontrivial.
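
For instance, here is roughly what speculation looks like with par from
GHC's Control.Parallel (my sketch; speculate, likely, x and y are
illustrative names, and this is weaker than Ennals' scheme: GHC merely
discards sparks that have not yet started running, it does not kill a
running speculation):

  import Control.Parallel (par)

  -- Spark x speculatively, before we know whether it is needed.
  -- If likely holds 99% of the time the spark is usually useful work;
  -- the remaining 1% of sparks are wasted effort.
  speculate :: Bool -> Int -> Int -> Int
  speculate likely x y = x `par` (if likely then x else y)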

>>If the program implements an inherently sequential algorithm then
>>parallelism won't help anyway. The use of lists often leads to sequential
>>algorithms: a prime example is list folds, which translate the linear
>>spine structure of a list into a sequential dependence graph in the
>>resulting computation. Folds over tree-like structures are better in this
>>regard. The FP community will have a problem with legacy code using
>>inherently sequential computations over lists. There has been quite a lot
>>of research on how to parallelize list computations (skeletons, program
>>transformations using BMF, ...). However, these techniques typically rely
>>on knowing the length of the list. In a lazy language lists are commonly
>>evaluated dynamically, which makes it hard to apply these techniques
>>efficiently. Again, speculation may help.
>But in the worst case it's just a sequential computation, so any gain from
>parallelism is still a gain...

It depends on what you compare with. Multicore CPUs will probably have
cores that are simpler than current processor cores, which means you will
want to exploit some parallelism. Compare a superscalar processor, which in
a sense really is a parallel machine, but with complex control hardware
added to make it emulate a sequential machine. Running a multicore CPU on a
single core at a time corresponds to running a superscalar machine while
using none of the superscalarity, or running a pipelined CPU (another form
of parallelism) without utilizing any of the pipelining. That is not what
you want.

>Also, the dependent part of a loop is often separate from an independent
>part... For example, consider taking the length of each string in a list
>of strings. The next iteration does not have to wait for the previous
>length to be computed; it can start the next length computation as soon as
>the pointer to the next list cell has been dereferenced.

Sure, you can get some overlap, which can be quite big if the task performed
per list element is big, but the spine traversal still forms a chain of O(n)
dependent steps, so the dependence graph has depth O(n) => at most a
constant-factor speedup.
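
To make the depth argument concrete, compare a left fold with a
divide-and-conquer fold (my sketch, using par and pseq from GHC's
Control.Parallel). The balanced version has depth O(log n), but splitting
in the middle needs the length, which forces the spine (exactly the problem
with dynamically evaluated lists mentioned above):

  import Control.Parallel (par, pseq)

  -- Sequential: each step depends on the previous accumulator,
  -- so the dependence graph is a chain of depth O(n).
  seqSum :: [Int] -> Int
  seqSum = foldl (+) 0

  -- Divide and conquer: depth O(log n), given a known length.
  parSum :: [Int] -> Int
  parSum []  = 0
  parSum [x] = x
  parSum xs  = l `par` (r `pseq` l + r)
    where (ls, rs) = splitAt (length xs `div` 2) xs
          l = parSum ls
          r = parSum rs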

>>In the long run, as hardware dimensions shrink even more and communication
>>becomes relatively even more expensive, it is likely that we will have some
>>kind of on-chip massively parallel, mesh-connected distributed-memory
>>machine. This will most likely favor static parallelization techniques,
>>which means that an automatic approach with implicit parallelism will work
>>well only for certain classes of functional programs.
>The next generation of CPUs seems to be using a shared-cache architecture,
>so the overheads are extremely small...

I'm not talking about the next generation but further ahead. Maybe those
architectures will have some kind of on-chip shared cache, who knows, but
that memory will still be much more expensive to access than the memory
local to a core. So you will want to use the core-local memory all the
time, and communicate only between processor cores that are close together
on the chip.

>Erm, by static I assume you mean programmer-specified parallelism. I would
>tend to think of static parallelization as something that takes place at
>compile time (and would still count as implicit).

No, I mean automatic techniques that detect the parallelism statically (and
schedule and allocate the work statically), which will work well only for
restricted classes of programs.

Björn

