[Haskell-cafe] CPU with Haskell support

Joachim Durchholz jo at durchholz.org
Wed Jan 20 12:43:32 UTC 2016


On 20.01.2016 at 13:16, Serguey Zefirov wrote:
> You are unnecessarily pessimistic; let me show you some things you
> probably have not thought of or heard about.

Okaaaay...

> A demonstration from the industry, albeit not quite hardware industry:
>
> http://www.disneyanimation.com/technology/innovations/hyperion - "Hyperion
> handles several million light rays at a time by sorting and bundling them
> together according to their directions. When the rays are grouped in this
> way, many of the rays in a bundle hit the same object in the same region of
> space. This similarity of ray hits then allows us – and the computer – to
> optimize the calculations for the objects hit."


Sure. If you have a few gazillion instances of the same algorithm, you can 
parallelize on that. That's the reason why 3D cards took off in the first 
place: the graphics pipeline grew processing capabilities and evolved into 
a (rather restricted) GPU core model. So it's not necessarily impossible 
to build something useful, merely very unlikely.
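
(For the curious: the sort-and-bundle step itself is cheap to express. 
A rough Haskell sketch of the idea - the Ray type and the bucketing are 
made up for illustration - could look like this:

  import Data.Function (on)
  import Data.List (groupBy, sortOn)

  -- Hypothetical ray: an origin and a direction in 3-space.
  data Ray = Ray { origin :: (Double, Double, Double)
                 , dir    :: (Double, Double, Double) }

  -- Quantise the direction into a coarse bucket so that rays pointing
  -- roughly the same way get the same key.
  dirBucket :: Ray -> (Int, Int, Int)
  dirBucket (Ray _ (x, y, z)) = (q x, q y, q z)
    where q c = round (c * 8)    -- 8 buckets per axis, arbitrary choice

  -- Sort rays by bucket and group neighbours; each bundle is then
  -- likely to hit the same region of the scene.
  bundle :: [Ray] -> [[Ray]]
  bundle = groupBy ((==) `on` dirBucket) . sortOn dirBucket

The hard part is obviously not this, but having enough rays per bundle 
to amortize the sorting.)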

> Then, let me bring up an old idea of mine:
> https://mail.haskell.org/pipermail/haskell-cafe/2009-August/065327.html
>
> Basically, we can group identical closures into vectors, ready for SIMD
> instructions to operate over them. The "vectors" should work just like
> Data.Vector.Unboxed - instead of a vector of tuples of arguments, there
> should be a tuple of vectors with the individual arguments (and the
> results to update for lazy evaluation).
>
> Combine this with sorting of addresses in case of references and you can
> get a lot of speedup by doing... not much.

Heh. Such stuff could work - *provided* that you can really show there is 
enough similar work to batch up.
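
To make the tuple-of-vectors idea concrete, here is a minimal sketch of 
what such a batch might look like - the Batch type and the example closure 
are my own invention, just to show the structure-of-arrays layout:

  import qualified Data.Vector.Unboxed as U

  -- A batch of pending applications of one and the same closure,
  -- say (\x y -> x * y + 1): a tuple of argument vectors instead of
  -- a vector of argument tuples.
  data Batch = Batch { xs :: U.Vector Double
                     , ys :: U.Vector Double }

  -- Forcing the whole batch is a single tight loop over unboxed,
  -- contiguous data - exactly the shape SIMD units like.
  forceBatch :: Batch -> U.Vector Double
  forceBatch (Batch as bs) = U.zipWith (\x y -> x * y + 1) as bs

Data.Vector.Unboxed already stores a vector of pairs that way internally, 
so the representation itself is well-trodden ground.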

Still, I'd work on making a model of that on GPGPU hardware first (rough 
sketch after the list below).
Two advantages:
1) No hardware investment.
2) You can see what the low-hanging fruit are and get a rough first idea 
of how much parallelization really gives you.
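
As a concrete starting point - and this is only a sketch, with the kernel 
a stand-in for the real per-bundle work - the accelerate library lets you 
write the batched computation once and move it between the interpreter and 
a GPU backend:

  import qualified Data.Array.Accelerate            as A
  import Data.Array.Accelerate                      (Z(..), (:.)(..))
  import Data.Array.Accelerate.Interpreter          (run)
    -- swap the Interpreter import for a GPU backend later

  -- Stand-in kernel: one arithmetic operation applied to a whole
  -- bundle of arguments at once.
  shade :: A.Acc (A.Vector Double) -> A.Acc (A.Vector Double)
  shade = A.map (\x -> x * x + 1)

  main :: IO ()
  main = print . run . shade . A.use $ A.fromList (Z :. 10) [0 .. 9 :: Double]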

The other approach: see what you can get out of a Xeon with a really high 
core count (14, or even more).
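
The CPU side of that experiment needs nothing beyond stock GHC and the 
parallel package; a first rough measurement could be as crude as this 
(the work function is again just a stand-in):

  import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

  -- Stand-in for the per-bundle work.
  work :: Double -> Double
  work x = sum [sin (x + fromIntegral i) | i <- [1 .. 10000 :: Int]]

  main :: IO ()
  main = do
    let results = map work [0 .. 999] `using` parListChunk 64 rdeepseq
    print (sum results)

  -- Build with: ghc -O2 -threaded Bench.hs
  -- Run with:   ./Bench +RTS -N14 -s
  -- and vary -N to see how far the speedup actually scales.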

Compare the single-GPGPU vs. multi-GPGPU speedup with the single-CPU-core 
vs. multi-CPU-core speedup. That might provide insight into how much the 
interconnects and cache coherence protocols interfere with the multicore 
speedup.
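
(To be precise about what I'd compare - the naming is mine:

  -- Speedup and scaling efficiency from wall-clock times t1 (one core)
  -- and tN (N cores); e.g. speedup 120 10 = 12, efficiency 14 120 10 ~ 0.86.
  speedup :: Double -> Double -> Double
  speedup t1 tN = t1 / tN

  efficiency :: Int -> Double -> Double -> Double
  efficiency n t1 tN = speedup t1 tN / fromIntegral n

The interesting number is how that efficiency curve differs between the 
GPGPU and the many-core Xeon.)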

Why am I so focused on multicore? Because that's where hardware is going: 
clock rates aren't going to rise much further, but people will still want 
to improve performance.
Actually I think that single-core improvements aren't going to be very 
important. First on my list would be exploiting multicore, second cache 
locality. There's more to be gotten from there than from specialized 
hardware, IMVHO.
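
(On the cache-locality point, the Haskell-level knob is mostly data 
representation - an illustrative example, nothing more:

  import qualified Data.Vector.Unboxed as U

  -- A boxed list scatters every Double behind a pointer; an unboxed
  -- vector keeps them in one contiguous block, so a traversal streams
  -- through the cache instead of chasing pointers.
  sumList :: [Double] -> Double
  sumList = sum

  sumVec :: U.Vector Double -> Double
  sumVec = U.sum

Same arithmetic, very different memory traffic.)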

Regards,
Jo

