> Thanks Bulat, but now you shattered my hopes that GHC would magically do all these optimizations for me ;-)
> I must say that although the performance of Haskell is not really a concern to me, I was a bit disappointed that even with all the tricks of the state monad, unboxing, and no-bounds-check, the matrix-vector multiplication was still 7 to 8 times slower than the C version. And at the end of the paper, it's only a factor 4 slower. Okay, going from 300x slower to 4x slower is impressive, but why is it *still* 4x slower? It would be interesting to compare the assembly code generated by the C compiler with the code generated by GHC; after all, we're just talking about a vector/matrix multiplication, which is just a couple of lines of assembly code... And now I'm talking about performance again, nooo! ;-)
> >http://www.cse.unsw.edu.au/~chak/papers/afp-arrays.ps.gz

Yeah, there are some known low-level issues in the code generator
regarding heap and stack checks inside loops, and the use of registers
on x86.
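
For concreteness, here is roughly the kind of loop in question. This is
only a sketch, not the code from the paper: the name mulMV, the row-major
layout, and the 0-based bounds are my own assumptions. It uses the tricks
mentioned above (ST monad, unboxed arrays, no bounds checks), and the
inner loop is where the heap/stack checks and register use matter.

```haskell
import Control.Monad (forM_)
import Data.Array.Base (unsafeAt, unsafeWrite)
import Data.Array.ST (newArray, runSTUArray)
import Data.Array.Unboxed (UArray, elems, listArray)

-- Hypothetical helper: multiply an n-by-n matrix (stored row-major in a
-- flat unboxed array with bounds (0, n*n-1)) by a vector with bounds
-- (0, n-1). unsafeAt/unsafeWrite take 0-based offsets and skip bounds
-- checks, so passing arrays of the wrong size is undefined behaviour.
mulMV :: Int -> UArray Int Double -> UArray Int Double -> UArray Int Double
mulMV n m v = runSTUArray (do
  out <- newArray (0, n - 1) 0
  forM_ [0 .. n - 1] $ \i -> do
    -- Strict accumulator loop computing the dot product of row i with v.
    let go j acc
          | j == n    = acc
          | otherwise =
              let acc' = acc + unsafeAt m (i * n + j) * unsafeAt v j
              in  acc' `seq` go (j + 1) acc'
    unsafeWrite out i (go 0 0)
  return out)
```

With m = listArray (0, 3) [1, 2, 3, 4] and v = listArray (0, 1) [5, 6],
elems (mulMV 2 m v) gives [17.0,39.0].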

But note this updated paper,

Add another core to your machine and it is no longer 4x slower :)
Add 15 more cores and it's really no longer 4x slower :)

