Low-level array performance

Tue Jun 17 15:28:14 EDT 2008

Am Dienstag, 17. Juni 2008 20:35 schrieb Dan Doel:
> On Tuesday 17 June 2008, Daniel Fischer wrote:
> > I've experimented a bit and found that Ptr is faster for small arrays
> > (only very slightly so if compiled with -fvia-C -optc-O3), but ByteArr
> > performs much better for larger arrays
> > ...
> > The GC time for the Addr# version is frightening
>
> I had an entire e-mail written about what a bizarre and interesting result
> you'd just found, but unfortunately, I then remembered exactly how the
> array gets filled in the Ptr version. Namely:
>
>     (Ptr a) <- newArray [0..n-1]

Ouch. I should've looked at both sources, that would have been obvious then :)

>
> Which, I assume does something terrible, like calling length to get the
> size needed for the array, while also needing the values after array
> creation to feed into it. For small arrays like the ones I'd been testing
> with, this doesn't matter, because the work done on the list is negligible.
> However, when you get to very large lists (100,000 elements and above,
> apparently) this starts causing massive space leaks, which explains the
> terrible GC behavior we were seeing.
>
> If you change the benchmark like so:
>
>     bench :: Int -> Int -> IO ()
>     bench k n = do (Ptr a) <- mallocArray n :: IO (Ptr Int)
>                    fill a n
>                    replicateM_ k (reverse a 0 (n-1))
>
>     fill :: Addr# -> Int -> IO ()
>     fill a (I# n) = IO (go 0#)
>      where
>      go i s
>
>        | i <# n    = case writeIntOffAddr# a i i s of s -> go (i +# 1#) s
>        | otherwise = (# s, () #)
>
> The space leak goes away, and the runtimes stay consistent. Up to around
> 10,000 elements, Ptr hovers around 6s, and ByteArray (-fasm) stays around
> 11. At 100,000, Ptr jumps to 12-13s, and ByteArray goes to 14-16 and stays
> there (again, I imagine due to running into bad cache effects at that
> level). This is all for size * iterations = 2.5 billion on my machine.

Since my computer is slower, I confined my tests to size*iterations ~= 10^9.
Mostly, I find no noticeable difference, between 0.2 and 0.5 %, sometimes one 
faster, sometimes the other. I have the impression that the Ptr version is 
the faster more often than the ByteArr version, but the tendency isn't very 
strong. 
That applies to both compiled with -O2 -fvia-C -optc-O3. 
Compiling with -O2 -fasm doesn't make a noticeable difference for Ptr, but is 
about 13% slower for ByteArr when the arrays are large (too large for the 
cache, I suppose) and about 50% slower for small arrays.

So what I can read off that is that the native code generator still has to 
catch up for such code, not that either way of implementing arrays is 
inherently faster.

>
> A good catch anyhow, though. That could explain why Simon Marlow was seeing
> the Addr# version as slower, since he was using a large array, and thus
> work done on the list could have contributed significantly (although MUT
> time was higher with Ptr, so it would have had to contribute work there,
> not just GC thrashing).
>
> Thanks for the input,
> -- Dan

Cheers,
Daniel