Low-level array performance

Mon Jun 16 15:51:31 EDT 2008

Greetings,

Recently, due to scattered complaints I'd seen on the internet, I set about to 
rewrite the fannkuch [1] benchmark on the Great Computer Language Shootout. 
The current entry uses Ptr/Addr#, malloc, etc. so it's not particularly 
representative of code one would actually write in Haskell these days. Over 
the past few days, I've written several versions of the benchmark, and 
collaborated a bit with local speed guru Don Stewart, but an entry that bests 
the current entry does not seem to be in the cards currently. So, I thought 
I'd write about some of the issues, and hopefully get some encouraging news 
about the situation.

Issue 1: STUArrays aren't optimized as fully as one might ideally expect.

This isn't so much of an issue, I suppose, except possibly for an environment 
like the shootout. I wrote versions of the benchmark (and particular pieces 
of the overall benchmark, as well) for both STUArrays, and MUArrays from 
dons' new uvector [2] library. The STUArray versions are consistently slower, 
and a look at the core reveals that STUArray code contains significantly more 
boxing than uvector. This isn't a big deal, as there's no reason not to use 
uvector, aside from it not being blessed by being in the GHC distribution. 
However, I thought I'd bring it up in case anyone hasn't heard of the library 
yet.

Issue 2: Reading from/writing to a MutableByteArray# is slower than an Addr#

This is, I think, the crux of the issue. The main content of the benchmark is 
reversing/shifting items in an array. To get a somewhat easier look at the 
core, I boiled things down to a benchmark that just reverses a small array 
many times. In the interest of further reducing things, I wrote a version of 
the benchmark that uses raw Addr#s, and a version that uses raw 
MutableByteArray#s. I've attached both versions.

The fannkuch benchmark (input 11) with byte arrays runs in around 12 seconds 
on my machine. To get the reversal benchmark to around that time, I can tell 
it to, for instance, reverse a 10-element array 250 million times. Those same 
inputs to the Addr# version of the reverse benchmark result in a runtime of 7 
to 8 seconds, so there's a significant discrepancy at the level the benchmark 
actually runs at (the current fannkuch entry that uses Addr#/Ptr takes around 
9 seconds, so there's other stuff going on in the benchmark evening them out, 
but this is likely the source of the difference between the two).

uvector performance on the reversal benchmark is comparable with 
MutableByteArray#, so the practical slowdown using a library for actual 
public consumption seems to be in the fact that it's based on byte arrays 
(STUArray again falls further behind by retaining too many boxes).

So, to wrap things up: dons and I were rather curious about the performance 
differences between MutableByteArray# and Addr#, as one might expect them to
be comparable, both being low-level, raw pointer type structures; the code for 
using the two is nearly identical.

(If one were to raise an Issue 3, it'd likely be that, even with Addr# and 
malloc, which is currently the fastest Haskell solution, we're not that fast. 
A glance at that shootout list reveals that we're beat by a myriad of 
Java-based languages, Python (Psyco), Ada, Ocaml, Eiffel, Lisp (SBCL), Clean, 
and of course C++, C, Fortran, etc. Is there any hope of getting performance 
improvements that will push us up in that list? I've heard that people are 
working on improved code generation, but I don't know if that can help with 
this sort of thing (or if it's more of a runtime system area).

While writing this mail, it came to mind that this is also a potential source 
of slowness in the sorting 'library' I've been writing for uvector. I already 
have some routines that are competitive with/beat qsort from glibc, but GNU's 
C++ STL introsort implementation has remained untouched. However, if 
reads/writes simply take twice as much time in Haskell as they do in C++, my 
results aren't as surprising.)

Anyhow, thanks for any light that can be shed on the subject.

1: http://shootout.alioth.debian.org/gp4/benchmark.php?test=fannkuch
2: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/uvector
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ByteArr.hs
Type: text/x-hssrc
Size: 1413 bytes
Desc: not available
Url : http://www.haskell.org/pipermail/glasgow-haskell-users/attachments/20080616/4aea9bc1/ByteArr.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Ptr.hs
Type: text/x-hssrc
Size: 867 bytes
Desc: not available
Url : http://www.haskell.org/pipermail/glasgow-haskell-users/attachments/20080616/4aea9bc1/Ptr.bin