[Haskell-cafe] Re: Performance Help

Mon Mar 12 07:53:06 EDT 2007

Dominic Steinitz wrote:
> I have re-written the sha1 code so that it is (hopefully) easy to see that it 
> faithfully implements the algorithm (see 
> http://www.itl.nist.gov/fipspubs/fip180-1.htm). Having got rid of the space 
> leak, I have been trying to improve performance.
> 
> Currently, the haskell code is 2 orders of magnitude slower than the sha1sum 
> that ships with my linux.
>
> Here's the code that is taking the majority of the time.
> 
>> ($&) :: [Word32] -> [Word32] -> [Word32]
>> a $& b = zipWith (+) a b
>>
>> -- Word128 -> Word512 -> Word128
>> oneBlock ss xs = Word128 (as!!80) (bs!!80) (cs!!80) (ds!!80) (es!!80)
>>    where
>>       ws = xs ++ map (rotL 1) (zipWith4 xxxor wm3s wm8s wm14s ws)
>>          where
>>             xxxor a b c d = a `xor` b `xor` c `xor` d
>>             wm3s  = drop (16-3)  ws
>>             wm8s  = drop (16-8)  ws
>>             wm14s = drop (16-14) ws
>>       as = ai:ts
>>       bs = bi:as
>>       cs = ci:(map (rotL 30) bs)
>>       ds = di:cs
>>       es = ei:ds
>>       ts = (map (rotL 5) as) $& (zipWith4 f [0..] bs cs ds) $& es $& (map k
>> [0..]) $& ws 
>>       Word128 ai bi ci di ei = ss 
> 
> Any help would be appreciated.

This code is clean Haskell without algorithmic flaws, optimizing it
means to scrape the constant factor off.

Of course, de- and constructing those lazy lists is quite expensive and
the canonical answer is: deforestation, also known as "fusion". The goal
is to avoid building intermediate lists like if they get consumed at
some point, here by (!! 80). This is like transforming the factorial or
the fibonacci numbers

  fac n = facs !! n
     where
     facs = 1 : zipWith (*) facs [1..]

  fib n = fibs !! n
     where
     fibs = 1:1: zipwWith (+) fibs (tail fibs)

to their accumulating cousins

  fac n = fac' n 1
     where
     fac'  0  x = x
     fac' !n !x = fac' (n-1) (x*n)

  fib n = fib' n (1,1)
     where
     fib'  0 ( x, y) = x
     fib' !n (!x,!y) = fib' (n-1) (y,x+y)

The algorithm splits in two parts: calculating ws and accumulating the
quintuple a,b,c,d,e over it. Fusing the quintuple is straightforward and
already suggested by the imperative specification on the website you
mentioned (use with -fbang-patterns):

  oneBlock ss xs = foldl' g (0,ss) ws
     where
     ws = ...
     g (!n,!Word160 a b c d e) w = (n+1,
        Word160 (rotL 5 a + f n b c d + e + w + k n) a (rotL 30 b) c d))

Together with -funbox-strict-fields, GHC should be able to generate
decent assembly from that.

Fusing the ws is trickier. Directly appealing to the fibonacci-number
example is not recommended because this would mean to keep the last 16
ws in memory and shifting them right to left by hand. But as the
"Alternate method of computation" on the website suggests, you can
delegate the shifting to an index that shifts around mod 16. Having a
mutable array is helpful, then.
Of course, you can also fill a large static (boxed!) array of 80 Word8s via

  ws :: Data.Array.IArray.Array Int Word8
  ws = accumArray 0 (0,80) (curry snd) $
       zip [0..15] xs ++ [(i, xxor i) | i<-[16..80]]
      where
      xxor i = ws ! (i-16) `xor`
           ws ! (i-3) `xor` ws ! (i-8) `xor` ws ! (i-14)

or something like that (I didn't care about correct indices and bounds).
GHC can fuse such array accumulations.

In general, keeping stuff in lists is not wrong, but ByteStrings are
more adapted to current CPU and RAM architecture.

Regards,
apfelmus