[Haskell-cafe] Performance and STUArrays
Daniel Fischer
daniel.is.fischer at web.de
Mon Apr 23 14:52:48 EDT 2007
Am Sonntag, 22. April 2007 15:06 schrieb Dominic Steinitz:
> I've been playing around some more trying improve the performance of the
> SHA1 implmentation in the crypto library. I've isolated one of the
> functions and implemented it using
>
> a) unfold
>
> and
>
> b) STUArray
>
> The STUArray implementation is about twice as fast but I was expecting an
> order of magnitude improvement given I thought I would have been allocating
> 16 x 80 new 32 bit words with unfold but nothing with the STUArray.
How do you calculate that?
I'd think that in the unfoldred function (h) you'd refer to 15 Word32s you
don't use for each test, so that would be roughly a 20% overhead, the tail of
h's argument could be reused, so in each step one Word32 - or rather one
thunk to calculate a Word32 - would be appended (if the compiler knows how to
do that, otherwise there'd be a lot of copying Word32s).
Warning: I know next to nothing about computers, so I'm quite naive in my
expectations.
What I've tested is whether strictifying h gives a performance gain, it does,
roughly 10% on my computer.
h [w0 .. w15] = let w16 = rotL 1 (w0 `xor` ..) in w16 `seq`
Just (w0,[w1 .. w16])
A further speedup results from switching from lists to a custom datatype:
data WW = WW
{ f0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,f12,f13,f14,f15 :: !Word32 }
fromL :: [Word32] -> WW
fromL (w0:w1:w2:w3:w4:w5:w6:w7:w8:w9:w10:w11:w12:w13:w14:w15:_)
= WW w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12 w13 w14 w15
next :: WW -> WW
next (WW w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12 w13 w14 w15)
= WW w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12 w13 w14 w15 (rotL 1 $ w0 `xor`
w2 `xor` w8 `xor` w13)
v4 :: a -> [Word32] -> [Word32]
v4 ss xs = take 80 $ unfoldr (\ww -> Just (f0 ww, next ww)) $ fromL xs
is about 16% faster than your v1 here.
Very odd, now this is slower than just strictifying, though I haven't touched
that code. I still have the binary where v4 as above is faster than v1 (with
or without added strictness), but in all more recently compiled versions it's
slower than strictified v1.
How can that happen?
The STUArray version is about 9% faster if you use unsafeRead/Write instead of
read/writeArray (we know it's safe here), a further 2% I gained by a slightly
different loop (we needn't use list-indexing since we copy it in order):
v3 ss xs = ws
where
ws = runST us
us =
do w <- newArray (0,79) 0 :: ST s (STUArray s Int Word32)
let put = unsafeWrite w
look = unsafeRead w
loop 80 = getElems w
loop n = do wm16 <- look (n-16)
wm14 <- look (n-14)
wm8 <- look (n-8)
wm3 <- look (n-3)
put n (rotL 1 $ wm3 `xor` wm8 `xor` wm14 `xor`
wm16)
loop (n+1)
ini n [] = loop n
ini n (x:xs)
= do put n x
ini (n+1) xs
ini 0 xs
>
> Should I have been disappointed?
Don't know, I'm somewhat disappointed that a C-version is almost three times
as fast as the STUArray version.
And I'm really astonished that compiled with -O0 the Unfold version is more
than twice as fast as the STUArray.
Phrased differently, optimising (-O2) doesn't help the Unfold version much,
but speeds the STUArray version up more than four times!
Cheers,
Daniel
>
> Dominic.
>
More information about the Haskell-Cafe
mailing list