Low-level array performance

Tue Jun 17 14:35:38 EDT 2008

On Tuesday 17 June 2008, Daniel Fischer wrote:
> I've experimented a bit and found that Ptr is faster for small arrays (only
> very slightly so if compiled with -fvia-C -optc-O3), but ByteArr performs
> much better for larger arrays
> ...
> The GC time for the Addr# version is frightening

I had an entire e-mail written about what a bizarre and interesting result 
you'd just found, but unfortunately, I then remembered exactly how the array 
gets filled in the Ptr version. Namely:

    (Ptr a) <- newArray [0..n-1]

Which, I assume does something terrible, like calling length to get the size 
needed for the array, while also needing the values after array creation to 
feed into it. For small arrays like the ones I'd been testing with, this 
doesn't matter, because the work done on the list is negligible. However, 
when you get to very large lists (100,000 elements and above, apparently) 
this starts causing massive space leaks, which explains the terrible GC 
behavior we were seeing.

If you change the benchmark like so:

    bench :: Int -> Int -> IO ()
    bench k n = do (Ptr a) <- mallocArray n :: IO (Ptr Int)
                   fill a n
                   replicateM_ k (reverse a 0 (n-1))

    fill :: Addr# -> Int -> IO ()
    fill a (I# n) = IO (go 0#)
     where
     go i s
       | i <# n    = case writeIntOffAddr# a i i s of s -> go (i +# 1#) s
       | otherwise = (# s, () #)

The space leak goes away, and the runtimes stay consistent. Up to around 
10,000 elements, Ptr hovers around 6s, and ByteArray (-fasm) stays around 11. 
At 100,000, Ptr jumps to 12-13s, and ByteArray goes to 14-16 and stays there 
(again, I imagine due to running into bad cache effects at that level). This 
is all for size * iterations = 2.5 billion on my machine.

A good catch anyhow, though. That could explain why Simon Marlow was seeing 
the Addr# version as slower, since he was using a large array, and thus work 
done on the list could have contributed significantly (although MUT time was 
higher with Ptr, so it would have had to contribute work there, not just GC 
thrashing).

Thanks for the input,
-- Dan