[Haskell-cafe] Puzzled about inlining and specialise inline under ghc -O2

Thu Mar 22 15:35:08 CET 2012

Dear Haskell-Cafe,

I'm computing a histogram of a bunch of symbols with up to 8 bits of
information each, stored in a unboxed vector of Word8. The histogram is
represented as an unboxed vector of Int with size 2^bits. I compute the
histogram by folding an increment function.

The problem: depending on what types and what annotations I give to the
increment and histogram function (see below), the GC gets through a
different amount of memory. I'm using GHC 7.0.3 and -O2. I'd like to
better understand how and why the optimisation does/doesn't kick in.

Here are the functions with the most generic types I can think of:

import qualified Data.Vector.Unboxed as UV
import qualified Data.Vector.Unboxed.Mutable as UMV
import Control.Monad.Primitive (PrimMonad, PrimState)

increment :: (PrimMonad m, UMV.Unbox a, Num a, Integral b) =>
    UMV.MVector (PrimState m) a -> b -> m (UMV.MVector (PrimState m) a)
increment v x = do
    n <- UMV.read v (fromIntegral x)
    UMV.write v (fromIntegral x) (n+1)
    return v

histogram :: (Integral a, UMV.Unbox a) => Int -> UV.Vector a ->
UV.Vector Int
histogram bitsPerSym v = runST $ do
    a <- UMV.replicate (2^bitsPerSym) (0::Int)
    a' <- UV.foldM' increment a v
    UV.unsafeFreeze a'

Running my test load, I get: total alloc =  33,206,568 bytes

Looking at the core, ghc is not specialising the functions, even if I
tell it to inline them. So let's brutally change the types to be as
specific as I need for my application:

increment :: UMV.MVector s Int -> Word8 -> ST s (UMV.MVector s Int)
histogram :: Int -> UV.Vector Word8 -> UV.Vector Int

result: total alloc =  19,581,152 bytes

and if I put INLINE pragmas for both functions: 16,952,512 bytes

I should be able to achieve the same effect with SPECIALISE INLINE
pragmas, right? Let's try that:

{-# SPECIALISE INLINE increment :: UMV.MVector s Int -> Word8 -> ST s
(UMV.MVector s Int) #-}
{-# SPECIALISE INLINE histogram :: Int -> UV.Vector Word8 -> UV.Vector
Int #-}

result: 33,139,856 bytes
(GHC can't figure out application of the first rule, giving:
  Warning: RULE left-hand side too complicated to desugar)

So unfortunately my most generic form won't work here, I need to
specialise increment to be in ST (which sucks, because I want it to work
for both IO and ST):

increment :: (UMV.Unbox a, Num a, Integral b) =>
    UMV.MVector s a -> b -> ST s (UMV.MVector s a)
{-# SPECIALISE INLINE increment :: UMV.MVector s Int -> Word8 -> ST s
(UMV.MVector s Int) #-}

result: 17,016,192 bytes

This is very close to the most specific function instantiations and
INLINE, but:
- I've lost being generic between ST and IO
- it's still a little bigger than the specific instances + INLINE

So my questions are: what is going on? Can I have genericity between ST
and IO while keeping the low GC usage? How come SPECIALISE INLINE does
not give the same result as specific instances + INLINE?

Obviously, for this example, I don't really *need* increment to work
inside IO, since I'm using runST... but I want to understand what
is going on. Profiling Haskell performance and memory usage has always
been difficult for me.

Much thanks in advance,

Rafal Kolanski.