[Haskell-cafe] Number 1, at least for now

Thu Feb 2 03:52:11 EST 2006

Everyone is welcome to write 'better' code for the shootout.  Its a good way to
learn useful techniques, and share them.  I will add notes about the benefits
that were to be had from using this ugly code.

Ben Lippmeier wrote:
> Donald Bruce Stewart wrote:
> 
>>> Ha! I don't think "pure lazy fp" would be the phrase I would choose
>>> to describe this code.
>>>
>>> An example (from fannkuch):
>>>
>>> t <- IO $ \s ->
>>>     case readIntOffAddr# p1# 0# s of
>>>       (# s, p1 #) -> case readIntOffAddr# p1# (n# -# 1#) s of
>>>           (# s, pn #) -> (# s, not (p1 ==# 0# || pn ==# (n# -# 1#)) #)
>>
>> Ok, I'll bite ;) This is a very atypical "example", as fannkuch is the
>> only entry written this way (and it could be rewritten, with careful
>> attention to the Core). There are also many lovely pure, lazy entries.
> 
> 

There are two Haskell entries for fannkuch.  The "normal" one has this code:

> rotate 2 (x1:x2:xs) = x2:x1:xs
> rotate 3 (x1:x2:x3:xs) = x2:x3:x1:xs
> rotate 4 (x1:x2:x3:x4:xs) = x2:x3:x4:x1:xs
> rotate 5 (x1:x2:x3:x4:x5:xs) = x2:x3:x4:x5:x1:xs
> rotate 6 (x1:x2:x3:x4:x5:x6:xs) = x2:x3:x4:x5:x6:x1:xs
> rotate 7 (x1:x2:x3:x4:x5:x6:x7:xs) = x2:x3:x4:x5:x6:x7:x1:xs
> rotate 8 (x1:x2:x3:x4:x5:x6:x7:x8:xs) = x2:x3:x4:x5:x6:x7:x8:x1:xs
> rotate 9 (x1:x2:x3:x4:x5:x6:x7:x8:x9:xs) = x2:x3:x4:x5:x6:x7:x8:x9:x1:xs
> rotate 10 (x1:x2:x3:x4:x5:x6:x7:x8:x9:x10:xs) = x2:x3:x4:x5:x6:x7:x8:x9:x10:x1:xs
> rotate n (x:xs) = rotate' n xs
>     where rotate' 1 xs     = x:xs
>           rotate' n (x:xs) = x:rotate' (n-1) xs

Which does not scale particularly well.  The "normal" entry is 9.0x slower than
the leader, which the readIntOffAddr# is 3.1x slower.  So the gain from explicit
unboxing and state passing was substantial.

> Not so atypical.
> 
> More examples (I didn't look /that/ hard.. :))
> 
> -----------------------------------
> * From reverse-complement
> 
> reverseit iubc strand i l s =
>     if i >=# l
>         then (# s, (I# i, I# l) #)
>         else case readWord8OffAddr# strand i s  of { (# s, c #) ->
>              case readWord8OffAddr# strand l s  of { (# s, x #) ->
>              case readWord8OffAddr# iubc (word2Int# x) s  of { (# s, y#)
>              case readWord8OffAddr# iubc   (word2Int# c) s  of { (# s, z #) ->
>              case writeWord8OffAddr# strand i y s of { s ->
>              case writeWord8OffAddr# strand l z s of { s ->
>              reverseit iubc strand (i +# 1#) (l -# 1#) s
>         } } } } } }

Again, there is a "normal" entry, with a screen full of the definition of

> complement :: Base -> Base
> complement 'A' = 'T'
> complement 'a' = 'T'
> complement 'C' = 'G'
> complement 'c' = 'G'
> complement 'G' = 'C'
> ...

The "normal" entry was 32x slower than the leader and used 31x the memory.  The
optimized entry runs 6.3x slower and  uses 1.1x the memory.  The fast entry uses
only the state & unboxing seen above, the rest of the program is normal Haskell,
e.g. the one line that replaces a screen full:

> pairs = map (c2w *** c2w) [('A','T'),('C','G'),('B','V'),('D','H'),('K','M'),('R','Y'),('\0','\0')]

> 
> 
> * From k-nucleotide.
> 
> eqmem i ptr1 ptr2 s = if i ==# 0# then (# s , True #) else
>     case readInt8OffAddr# ptr1 0# s of { (# s, i8a #) ->
>     case readInt8OffAddr# ptr2 0# s of { (# s, i8b #) ->
>     if i8a ==# i8b
>         then eqmem (i -# 1#) (plusAddr# ptr1 1#) (plusAddr# ptr2 1#) s
>         else (# s, False #) } }
> 

I wrote that, copying the technique from Don.  That is what a fast packed string
library which compares ascii bytes should do internally.  In c-code it is exacly
the 'memcmp' function.  And yet the GHC 6.4.1 distribution does not have this
function.  This is an example of the library missing things that it should have.

The "normal" Haskell entry was 26x slower and 29x bigger than the leader.  The
replacement code above is 22x and 7.1x instead.  Note that I did *not* get a big
speed increase.  This is because I must use the provided Data.Hashtable which is
simply not competitive to other languages.  This shootout benchmark has
highlighted the deficiency.

> 
> * From n-body.
> 
> kineticE i = let i' = (.|. (shiftL i 3))
>              in do m <- mass i
>                    vx <- unsafeRead b (i' vx)
>                    vy <- unsafeRead b (i' vy)
>                    vz <- unsafeRead b (i' vz)
>                    return $! 0.5 * m * (vx*vx + vy*vy + vz*vz)
> 

I wrote that out of desperation.  Einar has a previous entry that is beautiful
Haskell, creating an "open mutable records" system just for this benchmark.  It
ran in 18x time, 5.8 space.  Many of us put entries on the wiki which were
different ways to approach it (arrays of mutable data; mutable arrays of
constant data).  None were able to beat Einar's entry.  Eventually I and Don
wrote a flat ptr to array of double version, doing it's own offset computation
using (shiftL) and (.!.) as you see above.  This runs in 6.6x time and 4.1x
space.  So without resorting to state & unboxing we got nearly triple the speed.

Though we are still half the speed of OCaml here.

> ----------------------
> 
> *I* am certainly not implying that Haskell is anything less than the
> most wonderous language in the entire world.

The neat thing is that for speed we do not drop out into c-code or assembly.  We
just use different parts of Haskell.  The mutable state for the n-body code is
simple using the IO features to act on a ptr to machine doubles.

> 
> I'm saying that there's a stark difference in style between the programs
> submitted to the shootout, and the ones I would show to people that I
> myself was trying to introduce to the wonders of purely lazy functional
> programming. :).

True.  The ugly code above is not wondrous.

I will assert that the use of explicit-IO-state passing and unboxed code is good
for teaching.  If you want to explain the IO monad, you will end up talking
about the state of the "real world" and how nice it is that you don't have to
pass it around so that pure functions can peek/poke at it.  By showing someone
the examples above, they get to read what their IO-do-notation means once it has
been de-sugared to a lower level than (>>=).

Another way to look at it is that "readIntOffAddr# p1# 0# s" is making use of a
nominally pure function, where peekPtr / pokePtr are impure.

> 
> I think there's a big-fat-lesson about the tension between abstraction
> and implementation in these entries.
> 
> On one hand we've got "This is what I want", on the other it's "What do
> I have to do to implement it".
> 
> Ben.

You have the truth of it.

Haskell wins the lines-of-code metric by a large margin from the abstraction.
But entries that run into big performance problems get parts rewritten in lower
level Haskell.