[Haskell-cafe] Mutable arrays

Thu Feb 7 13:55:54 EST 2008

Ok, I've done some testing on different array accessing strategies,
and at this point I believe that, as Stefan previously suggested,
using an Oleg-inspired fold for access offers the best performance.
My first non-unsafe idea was to try:

> import Data.Array.MArray
> import Data.Array.IO
> import Data.Int (Int32)
> import Data.List (foldl')
>
> -- |Create an array.  Change the type signature to experiment with different array types.
> makeAry :: IO (IOUArray Integer Float)
> makeAry = newListArray (1,1000000) [1..]

> getElemsLazy :: (MArray a e m, Ix i) => a i e -> m [e]
> getElemsLazy myAry = getBounds myAry >>=
> 	mapM (readArray myAry) . range

this works on a 1 million element Int32 array (IOArray or IOUArray):

> myElems <-getElemsLazy myAry
> let myMax = foldl' max 0 myElems

But with a 10-million element array, I get a stack overflow.

I next tried the foldA functions Stefan posted, but had some problems
with them.  I think that the array elements need to be instances of
Bounded and Fractional, and I don't know of a suitable type.  Anyway,
his code is a little too compact for me to understand without further
study, so I moved on.

Next I created my own fold operators.  These are fairly directly
copied from an email Oleg posted to the haskell mailing list:

> leftFoldA' myArray self iteratee seed = do
>     elem <- readArray myArray $ fst seed
>     case iteratee seed elem of
>         Left seed -> return seed
>         Right seed -> self iteratee seed
>
> leftFoldA :: (MArray a e m, Ix a1) => a a1 e -> ((a1, b) -> e -> Either t (a1, b)) -> a1, b) -> m t
> leftFoldA array iteratee seed =
>     let g = (leftFoldA' array) g in
>     g iteratee seed
>
> leftFoldModA' myArray self iteratee seed = do
>     elem <- readArray myArray $ fst seed
>     result <- iteratee seed elem
>     case result of
>         Left seed -> return seed
>         Right seed -> self iteratee seed
>
> leftFoldMA :: (MArray a e m, Ix a1) => a a1 e -> ((a1, b) -> e -> m (Either t (a1, b))) -> a1, b) -> m t
> leftFoldMA array iteratee seed =
>     let g = (leftFoldModA' array) g in
>     g iteratee seed

the two functions to use are leftFoldA and leftFoldMA.  The
difference, show in the type signature, is that the iteratee in
leftFoldA is pure while the iteratee in leftFoldMA operates in the
monad.  So if you're going to modify the array, for example, you need
to use the leftFoldMA version.  These folds let me define:

> findMax myAry = do
>     (lowB, highB) <- getBounds myAry
>     let maxReader (ix, acc) c =
>         let
>             ix' = ix + 1
>             acc' = max c acc
>         in acc' `seq` (if ix' > highB then Left else Right) (ix', acc')
>     (_, myMax) <- leftFoldA myAry maxReader (lowB, 0)
>     return myMax
>
> normalizeArray myAry = do
>     myMax <- findMax myAry
>     (lowB, highB) <- getBounds myAry
>     let normMod = 1/myMax
>     let normalizer (ix, _) c = do
>         let ix' = ix + 1
>         curVal <- readArray myAry ix
>         writeArray myAry ix (curVal * normMod)
>         return $ (if ix' > highB then Left else Right) (ix', ())
>     leftFoldMA myAry normalizer (lowB, ())
>
> -- |show all elements from lowB to highB
> -- this only works with IO arrays, since the readArray and print happen in the same monad.
> -- you could possibly make it work with ST arrays if you use a monad transformer, but you'd have
> -- to change leftFoldMA as well.
> showElems myAry lowB highB = do
>     let showFunc (ix, _) c = do
>         let ix' = ix + 1
>         readArray myAry ix >>= print
>         return $ (if ix' > highB then Left else Right) (ix', ())
>     leftFoldMA myAry showFunc (lowB, ())

There are probably better ways to define the folds, but I hope these
are pretty easy to follow.  If I was doing a lot of different
operations, I'd probably define functions like

> makeIteratee highB func = (\(ix, acc) c ->
>     let
>         ix' = ix + 1
>         acc' = func c acc
>     in acc' `seq` (if ix' > highB then Left else Right) (ix', acc'))

so that findMax becomes:
> findMax2 myAry = do
>     (lowB, highB) <- getBounds myAry
>     (_, myMax) <- leftFoldA myAry (makeIteratee highB max) (lowB, 0)
>     return myMax

instead of having to define a bunch of iteratees in let clauses.

Now, running the program

> main :: IO ()
> main = do
>     myAry <- makeAry
>     (lowB, highB) <- getBounds myAry
>     normalizeArray myAry
>     showElems myAry 1 10
>     showElems myAry 999990 1000000
>     return ()

takes about 5 seconds, with no optimizations.  It takes less than a
second with -O2.  Much better.  It does take longer with boxed arrays,
but that's to be expected.  No unsafe functions, either.  The only
sketchy bit is using seq in the iteratees (e.g. maxReader in findMax)
to force the accumulator.  I say sketchy not because it's unsafe or
unreliable, but it would be nice if it was unnecessary.

Notes: I used GHC 6.8.1 on WinXP, 3.2GHZ P4, 1.5GB RAM.  All of my
original tests were done without any optimizations, which is why I
didn't include timing information for the earlier stuff.  The
getElemsLazy function is fast, but the stack overflow is annoying.
Using the leftFold functions is both faster and avoids the space leak.
 I was also using the Data.List.Stream and Control.Monad.Stream
versions of foldl' and mapM, although I don't think those made much
difference.

Anyway, this was a good exercise for me.  I learned a lot, especially
about the power of left-fold enumerators.
Cheers,
John