[Haskell-cafe] optimising for vector units
Jan-Willem Maessen - Sun Labs East
Janwillem.Maessen at Sun.COM
Wed Jul 28 10:00:26 EDT 2004
MR K P SCHUPKE wrote:
>>That was me. I think you're underestimating the cost of starting
>>threads even in this very lightweight world.
>
>
> Maybe... Perhaps Haskell could be made to resemble dataflow instructions
> more... If, when a computation completes, we insert the result directly
> into the data structure which represents the function, we can in fact
> pick any function for execution with _no_ overhead. (In a true dataflow
> system any instruction from any thread can be executed with no overhead.)
With architectural support for dataflow execution, this is true
(though hardware support for scheduling necessarily imposes resource
limitations). You might be rather interested in a couple of books
about the Monsoon dataflow machine, and the compiler for the
functional language (Id) used to program it---Id is similar in many
ways to Haskell. The Haskell dialect pH was basically Id with Haskell
syntax, and early versions used the Id back end to generate Monsoon code.
@book{102865,
author = {Gregory M. Papadopoulos},
title = {Implementation of a general-purpose dataflow multiprocessor},
year = {1991},
isbn = {0-262-66069-5},
publisher = {MIT Press},
}
@book{103033,
author = {Kenneth R. Traub},
title = {Implementation of non-strict functional programming languages},
year = {1991},
isbn = {0-262-70042-5},
publisher = {MIT Press},
}
Interestingly, even in a general-purpose dataflow system it became
obvious that pushing data to and from memory was rather inefficient.
As a result, Monsoon had registers to handle loop code, and ran inner
loops sequentially. A discussion of the tradeoffs between sequential
and parallel loops can be found here:
@inproceedings{165203,
author = {Boon Seong Ang},
title = {Efficient implementation of sequential loops in dataflow
computation},
booktitle = {Proceedings of the conference on Functional programming
languages and computer architecture},
year = {1993},
isbn = {0-89791-595-X},
pages = {169--178},
location = {Copenhagen, Denmark},
doi = {http://doi.acm.org/10.1145/165180.165203},
publisher = {ACM Press},
}
> The suggestion I guess is to only use instructions that get their arguments
> from main memory.
In the absence of architectural support for dataflow, this is the
simplest option by far.
> This way any instruction can be sequenced with no overhead
> on any CPU. With modern on-die core-speed caches this can be almost as fast
> as registers (with good cache access patterns)
>
> ... Note that I am only suggesting interleaving instructions at the
> function level, so registers can be used within functions... of course
> as things get more and more parallel we may see hardware with no
> registers, just pipelined high-speed cache access. (The hardware may
> well use registers to pre-fetch cache values, but that can be made
> transparent to the software)...
Alas, I think you're making a few naive assumptions here:
* All your functions must be strict in their arguments, yet your
language as a whole is somehow non-strict. How?
* Tracking which computations are partially satisfied has no overhead
(we'll need to match arguments to computations somehow).
* Choosing a function which is ready to run has no overhead (we'll
need a queue, a stack, or some similar data structure to manage ready
instructions).
* The extra instructions do not interfere in any material way with the
instruction level parallelism that may have existed in our program
(doubtful, all those scheduling instructions, operand loads, and
result stores consume issue bandwidth at the very least).
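To make the second and third points concrete, here is a minimal Haskell
sketch (the names and structure are my own invention, not taken from any
real implementation) of the bookkeeping a software dataflow scheduler
cannot avoid: counting down missing operands and deciding when a node
becomes ready. Every one of those updates is itself an instruction that
competes for issue bandwidth with the "real" work.

```haskell
import Data.IORef

-- Hypothetical dataflow node: an operation waiting for its operands.
data Node = Node
  { pending  :: IORef Int    -- operands still missing
  , operands :: IORef [Int]  -- operands received so far
  , op       :: [Int] -> Int -- the computation to run when ready
  }

mkNode :: Int -> ([Int] -> Int) -> IO Node
mkNode arity f = Node <$> newIORef arity <*> newIORef [] <*> pure f

-- Deliver one operand; fire the node if it just became ready.
-- The decrement, the compare, and (in a full scheduler) the enqueue
-- onto a ready queue are exactly the steps claimed to be "no overhead".
deliver :: Node -> Int -> IO (Maybe Int)
deliver node x = do
  modifyIORef (operands node) (x :)
  n <- readIORef (pending node)
  writeIORef (pending node) (n - 1)
  if n - 1 == 0
    then Just . op node <$> readIORef (operands node)
    else pure Nothing
```

A real matching store also has to handle contention and spill to memory
when it fills, which is where the resource limits mentioned above bite.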
There's been quite a bit of work on splitting non-strict code up into
strict threads which can be run in much the way I think you're
imagining. The following paper describes some of the basics:
@inproceedings{141568,
author = {Kenneth R. Traub and David E. Culler and Klaus E. Schauser},
title = {Global analysis for partitioning non-strict programs into
sequential threads},
booktitle = {Proceedings of the 1992 ACM conference on LISP and
functional programming},
year = {1992},
isbn = {0-89791-481-3},
pages = {324--334},
location = {San Francisco, California, United States},
doi = {http://doi.acm.org/10.1145/141471.141568},
publisher = {ACM Press},
}
The real killer, of course, is memory latency. The cache resources
required to hold pending work can't be used to hold heap data and the
like instead. You're going to increase your cache miss rate as well.
The result? A drop in performance.
A multi-threaded processor can help in masking cache miss latency
(while increasing the miss rate of an individual thread). But it is
still limited by the fundamental bandwidth limitations of the
load/store units. You will, on the whole, still be doing
substantially worse than an ordinary program.
Why bother at all? The same reason we bother with laziness (which has
essentially the same set of problems in different clothing). It's
easy to write clean, beautiful programs. And I do still hold out hope
that we can find enough parallelism to make the idea of parallel
execution realistic.
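For a taste of what that clean style can look like, here is the standard
toy example using `par` and `pseq` from GHC's Control.Parallel module
(shipped in the `parallel` package); a sketch of the idea, not a tuned
program. The programmer states only *what* may run in parallel, and the
runtime decides whether it actually does:

```haskell
import Control.Parallel (par, pseq)

-- Naive Fibonacci with a parallelism hint: `par` sparks the left
-- branch for possible parallel evaluation, while `pseq` ensures the
-- right branch is evaluated before the results are combined.
pfib :: Int -> Int
pfib n
  | n < 2     = n
  | otherwise = left `par` (right `pseq` (left + right))
  where
    left  = pfib (n - 1)
    right = pfib (n - 2)
```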
> Hardware manufacturers have hit the limit for pure sequential execution speed,
> so more parallelism is the only way forward (see Intel's revised roadmap; they
> abandoned the Pentium 4 and 5 and have focused on an updated _low_power_
> Pentium 3M, and are planning multi-core versions for more speed).
>
> C and other imperative languages focus too
> much on the how and not the what to be easy to use in such multi-CPU
> environments... A language which abstracts and hides the parallelism could
> well take off in a big way.
It's a laudable goal that's remained just out of reach for a very long
time. As you can probably tell, I've thought a lot about it. I think
there's still a great deal to be learned. But I'd encourage you---and
anyone thinking about the problem---to understand the problems that
have stymied some very smart people.
> Keean.
-Jan-Willem Maessen