Low-level array performance
Daniel Fischer
daniel.is.fischer at web.de
Tue Jun 17 15:28:14 EDT 2008
Am Dienstag, 17. Juni 2008 20:35 schrieb Dan Doel:
> On Tuesday 17 June 2008, Daniel Fischer wrote:
> > I've experimented a bit and found that Ptr is faster for small arrays
> > (only very slightly so if compiled with -fvia-C -optc-O3), but ByteArr
> > performs much better for larger arrays
> > ...
> > The GC time for the Addr# version is frightening
>
> I had an entire e-mail written about what a bizarre and interesting result
> you'd just found, but unfortunately, I then remembered exactly how the
> array gets filled in the Ptr version. Namely:
>
> (Ptr a) <- newArray [0..n-1]
Ouch. I should've looked at both sources, that would have been obvious then :)
>
> Which, I assume does something terrible, like calling length to get the
> size needed for the array, while also needing the values after array
> creation to feed into it. For small arrays like the ones I'd been testing
> with, this doesn't matter, because the work done on the list is negligible.
> However, when you get to very large lists (100,000 elements and above,
> apparently) this starts causing massive space leaks, which explains the
> terrible GC behavior we were seeing.
>
> If you change the benchmark like so:
>
> bench :: Int -> Int -> IO ()
> bench k n = do (Ptr a) <- mallocArray n :: IO (Ptr Int)
> fill a n
> replicateM_ k (reverse a 0 (n-1))
>
> fill :: Addr# -> Int -> IO ()
> fill a (I# n) = IO (go 0#)
> where
> go i s
>
> | i <# n = case writeIntOffAddr# a i i s of s -> go (i +# 1#) s
> | otherwise = (# s, () #)
>
> The space leak goes away, and the runtimes stay consistent. Up to around
> 10,000 elements, Ptr hovers around 6s, and ByteArray (-fasm) stays around
> 11. At 100,000, Ptr jumps to 12-13s, and ByteArray goes to 14-16 and stays
> there (again, I imagine due to running into bad cache effects at that
> level). This is all for size * iterations = 2.5 billion on my machine.
Since my computer is slower, I confined my tests to size*iterations ~= 10^9.
Mostly, I find no noticeable difference, between 0.2 and 0.5 %, sometimes one
faster, sometimes the other. I have the impression that the Ptr version is
the faster more often than the ByteArr version, but the tendency isn't very
strong.
That applies to both compiled with -O2 -fvia-C -optc-O3.
Compiling with -O2 -fasm doesn't make a noticeable difference for Ptr, but is
about 13% slower for ByteArr when the arrays are large (too large for the
cache, I suppose) and about 50% slower for small arrays.
So what I can read off that is that the native code generator still has to
catch up for such code, not that either way of implementing arrays is
inherently faster.
>
> A good catch anyhow, though. That could explain why Simon Marlow was seeing
> the Addr# version as slower, since he was using a large array, and thus
> work done on the list could have contributed significantly (although MUT
> time was higher with Ptr, so it would have had to contribute work there,
> not just GC thrashing).
>
> Thanks for the input,
> -- Dan
Cheers,
Daniel
More information about the Glasgow-haskell-users
mailing list