[Haskell-cafe] optimising for vector units
Jan-Willem Maessen - Sun Labs East
Janwillem.Maessen at Sun.COM
Wed Jul 28 10:00:26 EDT 2004
MR K P SCHUPKE wrote:
>>That was me. I think you're underestimating the cost of starting
>>threads even in this very lightweight world.
>
>
> Maybe... Perhaps Haskell could be made to resemble dataflow instructions
> more... If, when a computation completes, we insert the result directly
> into the data structure which represents the function, we can in fact
> pick any function for execution with _no_ overhead. (In a true dataflow
> system any instruction from any thread can be executed with no overhead.)
With architectural support for dataflow execution, this is true
(though hardware support for scheduling necessarily imposes resource
limitations). You might be rather interested in a couple of books
about the Monsoon dataflow machine, and the compiler for the
functional language (Id) used to program it---Id is similar in many
ways to Haskell. The Haskell dialect pH was basically Id with Haskell
syntax, and early versions used the Id back end to generate Monsoon code.
@book{102865,
author = {Gregory M. Papadopoulos},
title = {Implementation of a general-purpose dataflow multiprocessor},
year = {1991},
isbn = {0-262-66069-5},
publisher = {MIT Press},
}
@book{103033,
author = {Kenneth R. Traub},
title = {Implementation of non-strict functional programming languages},
year = {1991},
isbn = {0-262-70042-5},
publisher = {MIT Press},
}
Interestingly, even in a general-purpose dataflow system it became
obvious that pushing data to and from memory was rather inefficient.
As a result, Monsoon had registers to handle loop code, and ran inner
loops sequentially. A discussion of the tradeoffs between sequential
and parallel loops can be found here:
@inproceedings{165203,
author = {Boon Seong Ang},
title = {Efficient implementation of sequential loops in dataflow
computation},
booktitle = {Proceedings of the conference on Functional programming
languages and computer architecture},
year = {1993},
isbn = {0-89791-595-X},
pages = {169--178},
location = {Copenhagen, Denmark},
doi = {http://doi.acm.org/10.1145/165180.165203},
publisher = {ACM Press},
}
> The suggestion I guess is to only use instructions that get their arguments
> from main memory.
In the absence of architectural support for dataflow, this is the
simplest option by far.
> This way any instruction can be sequenced with no overhead
> on any CPU. With modern on-die core-speed caches this can be almost as fast
> as registers (with good cache access patterns)
>
> ... Note that I am only suggesting interleaving instructions at the
> function level, so registers can be used within functions... of course
> as things get more and more parallel we may see hardware with no
> registers, just pipelined high-speed cache access. (The hardware may
> well use registers to pre-fetch cache values, but that can be made
> transparent to the software)...
Alas, I think you're making a few naive assumptions here:
* All your functions must be strict in their arguments, yet your
language as a whole is somehow non-strict. How?
* Tracking which computations are partially satisfied has no overhead
(we'll need to match arguments to computations somehow).
* Choosing a function which is ready to run has no overhead (we'll
need a queue, a stack, or some similar data structure to manage ready
instructions).
* The extra instructions do not interfere in any material way with the
instruction level parallelism that may have existed in our program
(doubtful, all those scheduling instructions, operand loads, and
result stores consume issue bandwidth at the very least).
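To make the second and third points concrete, here is a minimal Haskell
sketch (the names and structure are my own invention, not taken from any
real implementation) of the bookkeeping a software dataflow scheduler
cannot avoid: counting down missing operands and deciding when a node
becomes ready. Every one of those updates is itself an instruction that
competes for issue bandwidth with the "real" work.

```haskell
import Data.IORef

-- Hypothetical dataflow node: an operation waiting for its operands.
data Node = Node
  { pending  :: IORef Int    -- operands still missing
  , operands :: IORef [Int]  -- operands received so far
  , op       :: [Int] -> Int -- the computation to run when ready
  }

mkNode :: Int -> ([Int] -> Int) -> IO Node
mkNode arity f = Node <$> newIORef arity <*> newIORef [] <*> pure f

-- Deliver one operand; fire the node if it just became ready.
-- The decrement, the compare, and (in a full scheduler) the enqueue
-- onto a ready queue are exactly the steps claimed to be "no overhead".
deliver :: Node -> Int -> IO (Maybe Int)
deliver node x = do
  modifyIORef (operands node) (x :)
  n <- readIORef (pending node)
  writeIORef (pending node) (n - 1)
  if n - 1 == 0
    then Just . op node <$> readIORef (operands node)
    else pure Nothing
```

A real matching store also has to handle contention and spill to memory
when it fills, which is where the resource limits mentioned above bite.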
There's been quite a bit of work on splitting non-strict code up into
strict threads which can be run in much the way I think you're
imagining. The following paper describes some of the basics:
@inproceedings{141568,
author = {Kenneth R. Traub and David E. Culler and Klaus E. Schauser},
title = {Global analysis for partitioning non-strict programs into
sequential threads},
booktitle = {Proceedings of the 1992 ACM conference on LISP and
functional programming},
year = {1992},
isbn = {0-89791-481-3},
pages = {324--334},
location = {San Francisco, California, United States},
doi = {http://doi.acm.org/10.1145/141471.141568},
publisher = {ACM Press},
}
The real killer, of course, is memory latency. The cache resources
required to hold pending work can't be used to hold heap data and the
like instead. You're going to increase your cache miss rate as well.
The result? A drop in performance.
A multi-threaded processor can help in masking cache miss latency
(while increasing the miss rate of an individual thread). But it is
still limited by the fundamental bandwidth limitations of the
load/store units. You will, on the whole, still be doing
substantially worse than an ordinary program.
Why bother at all? The same reason we bother with laziness (which has
essentially the same set of problems in different clothing). It's
easy to write clean, beautiful programs. And I do still hold out hope
that we can find enough parallelism to make the idea of parallel
execution realistic.
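For a taste of what that clean style can look like, here is the standard
toy example using `par` and `pseq` from GHC's Control.Parallel module
(shipped in the `parallel` package); a sketch of the idea, not a tuned
program. The programmer states only *what* may run in parallel, and the
runtime decides whether it actually does:

```haskell
import Control.Parallel (par, pseq)

-- Naive Fibonacci with a parallelism hint: `par` sparks the left
-- branch for possible parallel evaluation, while `pseq` ensures the
-- right branch is evaluated before the results are combined.
pfib :: Int -> Int
pfib n
  | n < 2     = n
  | otherwise = left `par` (right `pseq` (left + right))
  where
    left  = pfib (n - 1)
    right = pfib (n - 2)
```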
> Hardware manufacturers have hit the limit for pure sequential execution speed,
> so more parallelism is the only way forward (see Intel's revised roadmap; they
> abandoned the Pentium 4 and 5 and have focused on an updated _low_power_
> Pentium 3M, and are planning multi-core versions for more speed).
>
> C and other imperative languages focus too
> much on the how and not the what to be easy to use in such multi-CPU
> environments... A language which abstracts and hides the parallelism could
> well take off in a big way.
It's a laudable goal that's remained just out of reach for a very long
time. As you can probably tell, I've thought a lot about it. I think
there's still a great deal to be learned. But I'd encourage you---and
anyone thinking about the problem---to understand the problems that
have stymied some very smart people.
> Keean.
-Jan-Willem Maessen