[Haskell-cafe] sha1 implementation thats "only" 12 times slower then C

Tue Jul 3 16:29:01 EDT 2007

inlining some of the functions definitely gave me a boost, so i am
about 8.5 times slower then openssl sha1sum.  I dont really understand
the core output, but after inlining i got a completely different
profile output, i am guessing its because the cost of the inlined
functions is spread to the callers.

COST CENTRE                    MODULE               %time %alloc

updateElem                     SHA1                  13.4    0.0
sRotateL                       SHA1                  13.4    0.0
hashElem                       SHA1                  12.5    0.0
sXor                           SHA1                  10.9    0.0
unboxW                         SHA1                  10.0    0.0
temp                           SHA1                   8.1    0.0
sAdd                           SHA1                   7.8    0.0
sAnd                           SHA1                   5.0    0.0
do20                           SHA1                   4.1   18.0
hashA16IntoA80                 SHA1                   2.8    0.9
do60                           SHA1                   2.5   18.0
splitByN                       SHA1                   2.2   15.6
ffkk                           SHA1                   2.2    0.0
sOr                            SHA1                   1.6    0.0
do40                           SHA1                   0.9   18.0
hashPtrIntoA80                 SHA1                   0.6    2.7
hashA80                        SHA1                   0.6    1.8
do80                           SHA1                   0.6   18.0
joinTail                       SHA1                   0.0    2.1
main                           Main                   0.0    4.8

On 6/30/07, Donald Bruce Stewart <dons at cse.unsw.edu.au> wrote:
> aeyakovenko:
> > So I tried implementing a more efficient sha1 in haskell, and i got to
> > about 12 times slower as C.  The darcs implementation is also around
> > 10 to 12 times slower, and the crypto one is about 450 times slower.
> > I haven't yet unrolled the loop like the darcs implementation does, so
> > I can still get some improvement from that, but I want that to be the
> > last thing i do.
> >
> > I think I've been getting speed improvements when minimizing
> > unnecessary allocations.  I went from 40 times slower to 12 times
> > slower by converting a foldM to a mapM that modifies a mutable array.
> >
> > Anyone have any pointers on how to get hashElem and updateElem to run
> > faster, or any insight on what exactly they are allocating.  To me it
> > seems that those functions should be able to do everything they need
> > to without a malloc.
>
> Try inlining key small functions, and check the core.
>
>     -O2 -ddump-simpl | less
>
> -- Don
>