Low-level array performance
Dan Doel
dan.doel at gmail.com
Tue Jun 17 12:32:20 EDT 2008
On Tuesday 17 June 2008, Simon Marlow wrote:
> So I tried your examples and the Addr# version looks slower than the MBA#
> version:
Hmm...
> I tried with 6.8.2 and 6.8.3, using -O2 in both cases. I tried the Ptr
> version with and without -fvia-C -optc-O2, no difference.
I had forgotten about the -fvia-C in the OPTIONS pragma when I sent it, but I've tested
it both via C and with the new backend (and triple-checked since your
message), and I always come away with the Ptr version being faster. -fvia-C
doesn't seem to affect the speed of the Addr# version much, while it does improve
the speed of the MBA# version. However, even with the improved speed, Addr#
seems to edge it out here.
With the new backend, I get the results I sent in my initial mail. The
ByteArray version takes 11-12 seconds to reverse a size-10 array 250
million times, whereas the Addr# version takes around 7 seconds.
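To be concrete about what's being timed: both versions just reverse a buffer in
place, and they differ only in the kind of buffer they work over. Here's a minimal
sketch of the Ptr flavour (reversePtr is a made-up name, and the attached code
stays closer to the unboxed primitives, but the loop has this shape):

    import Foreign.Ptr (Ptr)
    import Foreign.Storable (peekElemOff, pokeElemOff)

    -- Reverse the first n Ints of a buffer in place by swapping from both ends.
    reversePtr :: Ptr Int -> Int -> IO ()
    reversePtr p n = go 0 (n - 1)
      where
        go i j
          | i >= j    = return ()
          | otherwise = do
              x <- peekElemOff p i
              y <- peekElemOff p j
              pokeElemOff p i y
              pokeElemOff p j x
              go (i + 1) (j - 1)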
(I also noticed a bug I'd missed before sending the ByteArray version. It
should allocate based on w, but I left it hard-coded to 4# when I was
experimenting. This was causing segmentation faults on large arrays on my
machine, since I'm running in 64-bit mode, where 8# is the correct value.
Are you running in 32-bit, and if so, could that be the source of our
discrepancy?)
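The fix is just to size the allocation by the element width rather than a literal
byte count. Roughly, the intent is something like this (a sketch with made-up
names; the real code stays in terms of Int# and the primops, but the arithmetic
is the same):

    {-# LANGUAGE MagicHash, UnboxedTuples #-}

    import GHC.Exts
    import GHC.IO (IO (..))
    import Foreign.Storable (sizeOf)

    -- Boxed wrapper, since IO can't return an unlifted MutableByteArray# directly.
    data MBA = MBA (MutableByteArray# RealWorld)

    -- Bytes per Int: 8 on a 64-bit machine, 4 on a 32-bit one.
    w :: Int
    w = sizeOf (undefined :: Int)

    -- Allocate room for n Ints, deriving the width from the platform
    -- instead of hard-coding 4# (or 8#).
    newIntArray :: Int -> IO MBA
    newIntArray n = case n * w of
      I# bytes -> IO $ \s ->
        case newByteArray# bytes s of
          (# s', mba #) -> (# s', MBA mba #)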
> Are these exactly the same programs you measured? What parameters did you
> use?
Aside from the couple of oversights above, yes. The actual fannkuch benchmark
doesn't use very large arrays. The current test input is n = 11, and all the
arrays it uses have length n. It gets its work from copying, reversing, and
shifting (portions of) those arrays n! or more times. So I thought it'd be
truer to the benchmark to reverse a small array many times. I've been running
with command lines like './ByteArr 250000000 10', which says to reverse a
size-10 array 250 million times.
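The drivers are shaped roughly like the following (a hypothetical sketch reusing
the reversePtr loop from above; the actual programs differ in details, but this
is the usage):

    import Foreign.Marshal.Array (withArray)
    import System.Environment (getArgs)

    -- Usage: ./Ptr ITERATIONS SIZE
    -- e.g. './Ptr 250000000 10' reverses a size-10 array 250 million times.
    main :: IO ()
    main = do
      [iters, size] <- fmap (map read) getArgs
      withArray [0 .. size - 1 :: Int] $ \p -> do
        let loop 0 = return ()
            loop k = reversePtr p size >> loop (k - 1 :: Int)
        loop iters
      putStrLn "Done."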
I tested with other sizes, and things seem to stay about the same when I increase
the array size and decrease the iteration count by the same factor, until I get
to an array size of around 100,000, at which point there's a drop-off for
both (Addr# still being faster). I assume that's due to cache effects.
Here are some example runs, using '--make -O2' for both (with the OPTIONS pragma
changed to contain only -fglasgow-exts for both, and the w bug fixed).
./ByteArr 250000000 10 +RTS -sstderr
Done.
          56,824 bytes allocated in the heap
             552 bytes copied during GC (scavenged)
               0 bytes copied during GC (not scavenged)
          45,056 bytes maximum residency (1 sample(s))

               1 collections in generation 0 (  0.00s)
               1 collections in generation 1 (  0.00s)

               1 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time   10.35s  ( 11.15s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   10.36s  ( 11.15s elapsed)

  %GC time       0.0%  (0.0% elapsed)

  Alloc rate    5,486 bytes per MUT second

  Productivity 100.0% of total user, 92.9% of total elapsed
./Ptr 250000000 10 +RTS -sstderr
Done.
          57,840 bytes allocated in the heap
             552 bytes copied during GC (scavenged)
               0 bytes copied during GC (not scavenged)
          45,056 bytes maximum residency (1 sample(s))

               1 collections in generation 0 (  0.00s)
               1 collections in generation 1 (  0.00s)

               1 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    6.53s  (  7.05s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    6.53s  (  7.05s elapsed)

  %GC time       0.0%  (0.0% elapsed)

  Alloc rate    8,854 bytes per MUT second

  Productivity 100.0% of total user, 92.7% of total elapsed
As I mentioned before, using -fvia-C -optc-O2 leaves Ptr unchanged and speeds
up ByteArr, but not enough for it to catch up with Ptr (here, at least).
Anyhow, my apologies for the mistakes above, and thanks for your time and
assistance. I'll try puzzling over the C-- a bit, and will probably open a Trac
ticket later, as the other Simon suggested (if that's still appropriate).
Thanks again,
-- Dan