[Haskell-cafe] Optimizing cellular automata & the beauty of unlifted types

Fri Dec 21 17:55:48 EST 2007

Justin Bailey wrote:
> On Dec 20, 2007 7:42 PM, Sterling Clover <s.clover at gmail.com> wrote:
> > I'm curious how much of the unboxing helped performance and how much
> > didn't. In my experience playing with this stuff, GHC's strictness
> > analyzer has consistently been really excellent, given the right
> > hints. Unboxed tuples are generally a win, but GHC was often smarter
> > at unboxing ints than I was.
> 
> It really did help. I started with an implementation that used Ints,
> and this sped the program up by at least 2x. I think that's because of
> the bit-manipulation I'm doing. For example, Data.Bits defines the
> bitwise and operation on Ints as:
> 
>   (I# x#) .&.   (I# y#)  = I# (word2Int# (int2Word# x# `and#` int2Word# y#))
> 
> Which you can see has 3 conversions in it.

I believe you're barking up the wrong tree here. Consider

| module Q (f, g) where
| 
| import GHC.Base
| 
| f :: Word# -> Word# -> Word#
| f a b = a `and#` b
| 
| g :: Int# -> Int# -> Int#
| g a b = word2Int# (int2Word# a `and#` int2Word# b)

If you look at the generated machine code, you'll find that f and g
are identical functions. The sole purpose of the int2Word# and
word2Int# operations is to satisfy the type checker. (This is
even true at the core level. The core language is typed, so you
still need to do the conversions there. No code is generated for
them though.)

> The I# deconstruction/construction is also suspicious to me, but I
> don't know if there are performance costs there or not. Regardless,
> using Word# directly lets me write (assuming w1# and w2# are already
> words):
> 
>   w1# `and#` w2#

The I# deconstruction has a cost, but with proper strictness annotations
ghc should optimize those away - check the intermediate Core to see
whether it actually does.

In fact most of the speedup you got seems to come from the use of
uncheckedShiftL# and uncheckedShiftRL# - just using

  shiftL, shiftR :: Int -> Int -> Int
  I# a `shiftR` I# b = I# (word2Int# (int2Word# a `uncheckedShiftRL#` b))
  I# a `shiftL` I# b = I# (word2Int# (int2Word# a `uncheckedShiftL#` b))

speeds up the stepWithUArray code by a factor of 2 here, starting with
the first program at http://hpaste.org/4151.

The normal shiftL and shiftR implementations include bounds checks to
get the results for shifts that are bigger than the word size correct.
(For example  1 `shiftL` 65 = 0, while  1 `uncheckedShiftL#` 65 = 2
(probably. it depends on your architecture)). Eliminating those checks
results in a major speedup.

Apparently those bounds checks aren't eliminated even if the shifting
width is constant. I wonder why, there's clearly some room for
optimization here.

(I used ghc 6.8.1 on an Athlon XP. I used ghc -O2, no fancy options.)

enjoy,

Bertram