[Haskell] Performance, Optimization and Code Generation

Fri Sep 22 20:30:20 EDT 2006

This starts out with my being interested in darcs <-> git related issues.
Since git uses sha1 I wanted to have the ability to calculate sha1 in
an application where I was intending to use darcs as a back-end.

The performance gap is > * 30 between Haskell and  sha1sum.  That
seemed rather steep and so I started looking...

Using ghc 6.4.2

The following code is from SHA1:

    -- {-# INLINE step #-}
    step :: ABCDE -> BS.ByteString -> ABCDE
    step abcde0@(ABCDE a b c d e) words = abcde5
        where s16 = get_word_32s words
              s80 = s16 ++ (zipWith4 f0) (drop 13 s80) (drop 8 s80)
    (drop 2 s80) s80
              f0 a b c d = rotL (a `xor` b `xor` c `xor` d) 1
              (s20_0, s60) = splitAt 20 s80
              (s20_1, s40) = splitAt 20 s60
              (s20_2, s20) = splitAt 20 s40
              (s20_3, _)   = splitAt 20 s20
              abcde1 = foldl (doit f1 0x5a827999) abcde0 s20_0
              abcde2 = foldl (doit f2 0x6ed9eba1) abcde1 s20_1
              abcde3 = foldl (doit f3 0x8f1bbcdc) abcde2 s20_2
              ABCDE a' b' c' d' e' = foldl (doit f2 0xca62c1d6) abcde3 s20_3
              f1 (XYZ x y z) = (x .&. y) .|. ((complement x) .&. z)
              f2 (XYZ x y z) = x `xor` y `xor` z
              f3 (XYZ x y z) = (x .&. y) .|. (x .&. z) .|. (y .&. z)
              abcde5 = ABCDE (a + a') (b + b') (c + c') (d + d') (e + e')

    -- {-# INLINE get_word_32s #-}
    get_word_32s :: BS.ByteString -> [Word32]
    get_word_32s s = map f [0..15]
        where f i = foldl (+) 0 $ map (\n -> toEnum (fromEnum (BS.index
    s (i*4+n))) `shiftL` (8 * (3-n))) [0..3]

    -- {-# INLINE doit #-}
    doit :: (XYZ -> Word32) -> Word32 -> ABCDE -> Word32 -> ABCDE
    doit f k (ABCDE a b c d e) w = ABCDE a' a (rotL b 30) c d
     where a' = rotL a 5 + f (XYZ b c d) + e + w + k

    -- {-# INLINE rotL #-}
    rotL :: Word32 -> Rotation -> Word32
    rotL a s = shiftL a s .|. shiftL a (s-32)
     -- rotL a s = a `seq` rotate a  s

I want to focus on the *rotL* function --- get_word_32s might be faster 
with a rewrite.

Using -prof -auto-all and -P at runtime produced the following summary...

    COST CENTRE                    MODULE               %time %alloc 
    ticks     bytes

    get_word_32s                   MySHA1                39.3   35.0 
    10597 6952614960
    rotL                           MySHA1                22.6   18.5  
    6089 3682275072
    step                           MySHA1                21.2   24.1  
    5701 4783669848
    doit                           MySHA1                16.2   21.8  
    4362 4332088320

As part of a larger program, but the sha1 portion was where 98% of the 
time went.
So

    * I tried using the built-in "rotate",
    * I tried inlining,
    * I tried using -fasm directly
    * I tried generating C with -O2,
    *  I tried using 'seq' in a number of places.

None of this made much difference.

Comment, the C code for just rotL would be

    unsigned int
    rotate(unsigned int a, int b)
    {
      return (a << b) | (a >> (b - 32));
    }

and the assembler at gcc -O2 is concise and straightforward:

        .file    "rotate.c"
        .text
        .p2align 4,,15
    .globl rotate
        .type    rotate, @function
    rotate:
        pushl    %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %edx
        movl    12(%ebp), %ecx
        popl    %ebp
        movl    %edx, %eax
        sall    %cl, %eax
        subl    $32, %ecx
        shrl    %cl, %edx
        orl    %edx, %eax
        ret
        .size    rotate, .-rotate
        .section    .note.GNU-stack,"", at progbits
        .ident    "GCC: (GNU) 3.3.5 (Debian 1:3.3.5-3)"

GHC's code on this part is nothing like competitive because of the need 
for closures --- I think.

1) What approaches to getting the compiler to optimize have I 
overlooked?  Or, how would
    you change the code to get the compiler to generate better code for 
rotL?

2) Am I missing an easy way to get strictness on the rotL function?

3) Is there something I am missing that keeps ghc's code from becoming 
about as
concise as gcc -O2? 

4) Is the code generator simply not able to handle this?

Thanks in advance,

George

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.haskell.org/pipermail/haskell/attachments/20060922/5b7a8802/attachment.htm