Low-level array performance

Tue Jun 17 00:36:20 EDT 2008

Dan

John Dias is indeed spending 6 months at Microsoft to work on GHC's back end.  He's doing a pretty wholesale re-architecting job, so it will be a couple of months before we have the new setup glued together, but once we do I hope that we'll have a much more friendly framework in place for doing good optimizations.

Meanwhile, concrete examples like yours are very useful.  It would be still more useful if you and Don could pinpoint more accurately just what the difference (say) between Addr# and MutableByteArray# is. Is it array bounds checking, for example?  Looking first at the Core that is generated and then (if it looks the same on both cases) at the C--, is often illuminating.  If you can say "for this little loop, GHC generates this stupid code sequence" that's particularly helpful.   (Don often does this.)

Even if you don't, we'll still take a look in due course, but the more you can pin it down the more motivating it is for us to fix it!

Do start a Trac ticket so your thoughts and code examples are not lost in the welter of email.

Thanks

Simon

| -----Original Message-----
| From: glasgow-haskell-users-bounces at haskell.org [mailto:glasgow-haskell-
| users-bounces at haskell.org] On Behalf Of Dan Doel
| Sent: 16 June 2008 20:52
| To: glasgow-haskell-users at haskell.org
| Subject: Low-level array performance
|
| Greetings,
|
| Recently, due to scattered complaints I'd seen on the internet, I set about
| to
| rewrite the fannkuch [1] benchmark on the Great Computer Language Shootout.
| The current entry uses Ptr/Addr#, malloc, etc. so it's not particularly
| representative of code one would actually write in Haskell these days. Over
| the past few days, I've written several versions of the benchmark, and
| collaborated a bit with local speed guru Don Stewart, but an entry that bests
| the current entry does not seem to be in the cards currently. So, I thought
| I'd write about some of the issues, and hopefully get some encouraging news
| about the situation.
|
| Issue 1: STUArrays aren't optimized as fully as one might ideally expect.
|
| This isn't so much of an issue, I suppose, except possibly for an environment
| like the shootout. I wrote versions of the benchmark (and particular pieces
| of the overall benchmark, as well) for both STUArrays, and MUArrays from
| dons' new uvector [2] library. The STUArray versions are consistently slower,
| and a look at the core reveals that STUArray code contains significantly more
| boxing than uvector. This isn't a big deal, as there's no reason not to use
| uvector, aside from it not being blessed by being in the GHC distribution.
| However, I thought I'd bring it up in case anyone hasn't heard of the library
| yet.
|
| Issue 2: Reading from/writing to a MutableByteArray# is slower than an Addr#
|
| This is, I think, the crux of the issue. The main content of the benchmark is
| reversing/shifting items in an array. To get a somewhat easier look at the
| core, I boiled things down to a benchmark that just reverses a small array
| many times. In the interest of further reducing things, I wrote a version of
| the benchmark that uses raw Addr#s, and a version that uses raw
| MutableByteArray#s. I've attached both versions.
|
| The fannkuch benchmark (input 11) with byte arrays runs in around 12 seconds
| on my machine. To get the reversal benchmark to around that time, I can tell
| it to, for instance, reverse a 10-element array 250 million times. Those same
| inputs to the Addr# version of the reverse benchmark result in a runtime of 7
| to 8 seconds, so there's a significant discrepancy at the level the benchmark
| actually runs at (the current fannkuch entry that uses Addr#/Ptr takes around
| 9 seconds, so there's other stuff going on in the benchmark evening them out,
| but this is likely the source of the difference between the two).
|
| uvector performance on the reversal benchmark is comparable with
| MutableByteArray#, so the practical slowdown using a library for actual
| public consumption seems to be in the fact that it's based on byte arrays
| (STUArray again falls further behind by retaining too many boxes).
|
| So, to wrap things up: dons and I were rather curious about the performance
| differences between MutableByteArray# and Addr#, as one might expect them to
| be comparable, both being low-level, raw pointer type structures; the code
| for
| using the two is nearly identical.
|
| (If one were to raise an Issue 3, it'd likely be that, even with Addr# and
| malloc, which is currently the fastest Haskell solution, we're not that fast.
| A glance at that shootout list reveals that we're beat by a myriad of
| Java-based languages, Python (Psyco), Ada, Ocaml, Eiffel, Lisp (SBCL), Clean,
| and of course C++, C, Fortran, etc. Is there any hope of getting performance
| improvements that will push us up in that list? I've heard that people are
| working on improved code generation, but I don't know if that can help with
| this sort of thing (or if it's more of a runtime system area).
|
| While writing this mail, it came to mind that this is also a potential source
| of slowness in the sorting 'library' I've been writing for uvector. I already
| have some routines that are competitive with/beat qsort from glibc, but GNU's
| C++ STL introsort implementation has remained untouched. However, if
| reads/writes simply take twice as much time in Haskell as they do in C++, my
| results aren't as surprising.)
|
| Anyhow, thanks for any light that can be shed on the subject.
|
| 1: http://shootout.alioth.debian.org/gp4/benchmark.php?test=fannkuch
| 2: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/uvector