[Haskell-cafe] Performance and STUArrays

Daniel Fischer daniel.is.fischer at web.de
Mon Apr 23 14:52:48 EDT 2007

Am Sonntag, 22. April 2007 15:06 schrieb Dominic Steinitz:
> I've been playing around some more trying improve the performance of the
> SHA1 implementation in the crypto library. I've isolated one of the
> functions and implemented it using
> a) unfold
> and
> b) STUArray
> The STUArray implementation is about twice as fast but I was expecting an
> order of magnitude improvement given I thought I would have been allocating
> 16 x 80 new 32 bit words with unfold but nothing with the STUArray.

How do you calculate that?
I'd think that in the unfolded function (h) you'd refer to 15 Word32s you
don't use in each step, so that would be roughly a 20% overhead; the tail of
h's argument could be reused, so in each step only one Word32 - or rather one
thunk to calculate a Word32 - would be appended (if the compiler knows how to
do that, otherwise there'd be a lot of copying of Word32s).

Warning: I know next to nothing about computers, so I'm quite naive in my
reasoning here.

What I've tested is whether strictifying h gives a performance gain; it does,
roughly 10% on my computer.

h [w0 .. w15] = let w16 = rotL 1 (w0 `xor` ..)
                in w16 `seq` Just (w0, [w1 .. w16])

A further speedup results from switching from lists to a custom datatype:

data WW = WW
        { f0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,f12,f13,f14,f15 :: !Word32 }

fromL :: [Word32] -> WW
fromL (w0:w1:w2:w3:w4:w5:w6:w7:w8:w9:w10:w11:w12:w13:w14:w15:_)
    = WW w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12 w13 w14 w15

next :: WW -> WW
next (WW w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12 w13 w14 w15)
    = WW w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12 w13 w14 w15
         (rotL 1 $ w0 `xor` w2 `xor` w8 `xor` w13)

v4 :: a -> [Word32] -> [Word32]
v4 ss xs = take 80 $ unfoldr (\ww -> Just (f0 ww, next ww)) $ fromL xs

is about 16% faster than your v1 here.
Very odd: now this is slower than just strictifying, though I haven't touched
that code. I still have the binary where v4 as above is faster than v1 (with
or without added strictness), but in all more recently compiled versions it's
slower than strictified v1.
How can that happen?
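
For reference, here is the unfold variant above made stand-alone: the imports
are filled in, rotL is an assumed stand-in for the crypto library's rotate-left
(written with Data.Bits.rotateL), and `expand` is just v4 without the unused
first argument - a sketch under those assumptions, not the library's code:

```haskell
import Data.Bits (rotateL, xor)
import Data.List (unfoldr)
import Data.Word (Word32)

-- Assumed stand-in for the crypto library's rotL (rotate left by i bits).
rotL :: Int -> Word32 -> Word32
rotL i w = rotateL w i

data WW = WW
        { f0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,f12,f13,f14,f15 :: !Word32 }

fromL :: [Word32] -> WW
fromL (w0:w1:w2:w3:w4:w5:w6:w7:w8:w9:w10:w11:w12:w13:w14:w15:_)
    = WW w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12 w13 w14 w15
fromL _ = error "fromL: need at least 16 words"

next :: WW -> WW
next (WW w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12 w13 w14 w15)
    = WW w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12 w13 w14 w15
         (rotL 1 $ w0 `xor` w2 `xor` w8 `xor` w13)

-- The 80-word SHA-1 message schedule from the 16 input words.
expand :: [Word32] -> [Word32]
expand = take 80 . unfoldr (\ww -> Just (f0 ww, next ww)) . fromL
```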

The STUArray version is about 9% faster if you use unsafeRead/unsafeWrite
instead of readArray/writeArray (we know it's safe here); I gained a further
2% with a slightly different loop (we needn't index into the list, since we
copy it in order):

v3 ss xs = ws
   where
      ws = runST us
      us =
         do w <- newArray (0,79) 0 :: ST s (STUArray s Int Word32)
            let put = unsafeWrite w
                look = unsafeRead w
                loop 80 = getElems w
                loop n = do wm16 <- look (n-16)
                            wm14 <- look (n-14)
                            wm8  <- look (n-8)
                            wm3  <- look (n-3)
                            put n (rotL 1 $ wm3 `xor` wm8 `xor` wm14 `xor` wm16)
                            loop (n+1)
                ini n [] = loop n
                ini n (x:xs)
                   = do put n x
                        ini (n+1) xs
            ini 0 xs
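
Likewise, a self-contained sketch of that STUArray loop, with the imports it
needs filled in (same assumed rotL as above; `expandST` is a name chosen here,
and the unused first argument is dropped):

```haskell
import Control.Monad.ST (ST, runST)
import Data.Array.ST (STUArray, newArray, getElems)
import Data.Array.Base (unsafeRead, unsafeWrite)
import Data.Bits (rotateL, xor)
import Data.Word (Word32)

-- Assumed stand-in for the crypto library's rotL (rotate left by i bits).
rotL :: Int -> Word32 -> Word32
rotL i w = rotateL w i

-- The 80-word SHA-1 message schedule, filled in place in an STUArray.
expandST :: [Word32] -> [Word32]
expandST xs = runST us
  where
    us = do w <- newArray (0,79) 0 :: ST s (STUArray s Int Word32)
            let put  = unsafeWrite w
                look = unsafeRead w
                loop 80 = getElems w
                loop n  = do wm16 <- look (n-16)
                             wm14 <- look (n-14)
                             wm8  <- look (n-8)
                             wm3  <- look (n-3)
                             put n (rotL 1 $ wm3 `xor` wm8 `xor` wm14 `xor` wm16)
                             loop (n+1)
                fill n []     = loop n
                fill n (y:ys) = do put n y
                                   fill (n+1) ys
            fill 0 xs
```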

> Should I have been disappointed?

Don't know, I'm somewhat disappointed that a C-version is almost three times 
as fast as the STUArray version.
And I'm really astonished that compiled with -O0 the Unfold version is more 
than twice as fast as the STUArray.
Phrased differently, optimising (-O2) doesn't help the Unfold version much, 
but speeds the STUArray version up more than four times!


> Dominic.
