[Haskell] Performance, Optimization and Code Generation
George Beshers
gbeshers at cox.net
Fri Sep 22 20:30:20 EDT 2006
This starts out with my being interested in darcs <-> git related issues.
Since git uses sha1 I wanted to have the ability to calculate sha1 in
an application where I was intending to use darcs as a back-end.
The performance gap is > * 30 between Haskell and sha1sum. That
seemed rather steep and so I started looking...
Using ghc 6.4.2
The following code is from SHA1:
-- {-# INLINE step #-}
step :: ABCDE -> BS.ByteString -> ABCDE
step abcde0@(ABCDE a b c d e) words = abcde5
where s16 = get_word_32s words
s80 = s16 ++ (zipWith4 f0) (drop 13 s80) (drop 8 s80)
(drop 2 s80) s80
f0 a b c d = rotL (a `xor` b `xor` c `xor` d) 1
(s20_0, s60) = splitAt 20 s80
(s20_1, s40) = splitAt 20 s60
(s20_2, s20) = splitAt 20 s40
(s20_3, _) = splitAt 20 s20
abcde1 = foldl (doit f1 0x5a827999) abcde0 s20_0
abcde2 = foldl (doit f2 0x6ed9eba1) abcde1 s20_1
abcde3 = foldl (doit f3 0x8f1bbcdc) abcde2 s20_2
ABCDE a' b' c' d' e' = foldl (doit f2 0xca62c1d6) abcde3 s20_3
f1 (XYZ x y z) = (x .&. y) .|. ((complement x) .&. z)
f2 (XYZ x y z) = x `xor` y `xor` z
f3 (XYZ x y z) = (x .&. y) .|. (x .&. z) .|. (y .&. z)
abcde5 = ABCDE (a + a') (b + b') (c + c') (d + d') (e + e')
-- {-# INLINE get_word_32s #-}
get_word_32s :: BS.ByteString -> [Word32]
get_word_32s s = map f [0..15]
where f i = foldl (+) 0 $ map (\n -> toEnum (fromEnum (BS.index
s (i*4+n))) `shiftL` (8 * (3-n))) [0..3]
-- {-# INLINE doit #-}
doit :: (XYZ -> Word32) -> Word32 -> ABCDE -> Word32 -> ABCDE
doit f k (ABCDE a b c d e) w = ABCDE a' a (rotL b 30) c d
where a' = rotL a 5 + f (XYZ b c d) + e + w + k
-- {-# INLINE rotL #-}
rotL :: Word32 -> Rotation -> Word32
rotL a s = shiftL a s .|. shiftL a (s-32)
-- rotL a s = a `seq` rotate a s
I want to focus on the *rotL* function --- get_word_32s might be faster
with a rewrite.
Using -prof -auto-all and -P at runtime produced the following summary...
COST CENTRE MODULE %time %alloc
ticks bytes
get_word_32s MySHA1 39.3 35.0
10597 6952614960
rotL MySHA1 22.6 18.5
6089 3682275072
step MySHA1 21.2 24.1
5701 4783669848
doit MySHA1 16.2 21.8
4362 4332088320
As part of a larger program, but the sha1 portion was where 98% of the
time went.
So
* I tried using the built-in "rotate",
* I tried inlining,
* I tried using -fasm directly
* I tried generating C with -O2,
* I tried using 'seq' in a number of places.
None of this made much difference.
Comment, the C code for just rotL would be
unsigned int
rotate(unsigned int a, int b)
{
return (a << b) | (a >> (b - 32));
}
and the assembler at gcc -O2 is concise and straightforward:
.file "rotate.c"
.text
.p2align 4,,15
.globl rotate
.type rotate, @function
rotate:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %edx
movl 12(%ebp), %ecx
popl %ebp
movl %edx, %eax
sall %cl, %eax
subl $32, %ecx
shrl %cl, %edx
orl %edx, %eax
ret
.size rotate, .-rotate
.section .note.GNU-stack,"", at progbits
.ident "GCC: (GNU) 3.3.5 (Debian 1:3.3.5-3)"
GHC's code on this part is nothing like competitive because of the need
for closures --- I think.
1) What approaches to getting the compiler to optimize have I
overlooked? Or, how would
you change the code to get the compiler to generate better code for
rotL?
2) Am I missing an easy way to get strictness on the rotL function?
3) Is there something I am missing that keeps ghc's code from becoming
about as
concise as gcc -O2?
4) Is the code generator simply not able to handle this?
Thanks in advance,
George
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.haskell.org/pipermail/haskell/attachments/20060922/5b7a8802/attachment.htm
More information about the Haskell
mailing list