[Haskell-cafe] Fun with the ST monad

Sat Feb 26 10:33:33 CET 2011

In part to help solidify my own understanding and usage, I wrote up
the following which shows a comparison of processing an input file.
Andrew Coppin originally posed the issue concerning strictness imposed
by using the ST monad for processing an input file.

This literate example shows a comparison of processing a file using:
    1. the ST monad
    2. the ST monad with Luke Palmer's suggested laziness
    3. the State monad
    4. a direct Iteratee (from John Millikin's Enumerator package)
    5. the same Iteratee in Monad form
    6. another slight variation of the Iteratee in Monad form

First, lets get the basics taken care of:

> import System.IO
> import System.Environment
> import Data.Word
> import Data.Bits
> import qualified Data.ByteString as B
> import Control.Applicative ( (<$>) )
> import Control.Monad.Trans.Class (lift)
> import Control.Monad.IO.Class
> import Control.Monad.ST.Lazy
> import Data.STRef.Lazy
> import Control.Monad.Trans.State.Lazy
> import qualified Data.Enumerator as E
> import Data.Enumerator ( ($$) )
> import qualified Data.Enumerator.Binary as EB
> import qualified Data.Enumerator.List as EL

This example is intended to show the effects of lazy or strict
processing of a file, so an input file is needed.

> inp = "input.example"

This input file can contain whatever you'd like, but for my testing I

simply created a 5MB file of zeros via:

    $ dd if=/dev/zero of=input.example count=10000
    $ ls -sh input.example
    4.9M input.example

The output file will use the following base name, with the number of
the processing mode appended.

> oup = "output.example"

The stats output of ghc will be used to compare the different
processing modes, so only one process will be performed each time the
application is run.  The processing mode desired will be input as a
command-line parameter, defaulting to the first mode.

> main = do tna <- getArgs
>           let tn = read $ head $ tna ++ ["1"]
>           case tn of
>             6 -> testE 6 transform6
>             5 -> testE 5 transform5
>             4 -> testE 4 transform4
>             3 -> testT 3 transform3
>             2 -> testT 2 transform2
>             _ -> testT 1 transform1

To build and run this example (assuming this literate source is saved
as fproc.lhs):

   $ ghc -o fproc --make fproc.lhs && for N in $(seq 1 5) ; do time ./fproc $N +RTS -t -RTS ; done

That's all the basic setup out of the way.

The actual processing of the file is irrelevant other than needing to
remember previous input to process the current input.  In my example
each byte is usually combined with the previous byte to determine the
output byte.  In the ST and State monad forms, the previous byte value
is stored in the state portion of the monad.

The ST form is my interpretation of Andrew's original intent.

> transform1 xs = runST (newSTRef 0 >>= work xs)

>     where work [] _ = return []
>           work (e:es) s = do n <- readSTRef s
>                              writeSTRef s $ shiftR e 4
>                              let r = if e < 32 then e else n+e
>                              (r :) <$> work es s

To run this with standardized file processing, ByteString -> Word8
conversion, and output, main uses the testT wrapper.  Hopefully all
the pack and unpack operations are fusing and I haven't skewed the
results by introducing strictness at this level.

> testT n t = let oun = oup ++ show n
>                 op = B.pack . t . B.unpack
>             in print n >> op <$> B.readFile inp >>= B.writeFile oun

My output from this is:

./fproc 1 +RTS -t
1
<<ghc: 2454775948 bytes, 4665 GCs, 130116178/526932092 avg/max bytes residency (9 samples), 1099M in use, 0.00 INIT (0.00 elapsed), 2.99 MUT (3.04 elapsed), 10.66 GC (11.82 elapsed) :ghc>>

real	0m14.998s
user	0m13.650s
sys	0m1.333s

This is a processing rate of about 333KB/s, and memory consumption is
quite high, despite lazy processing.  Note that this is GHC 6.12.3, so
it doesn't have the IO performance updates present in 7.x.

Just to verify that there was laziness, I changed the imports from
Control.Monad.ST.Lazy and Data.STRef.Lazy to the .Strict versions and
got this:

./fproc 1 +RTS -t
1
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.
<<ghc: 726175216 bytes, 1329 GCs, 48863531/163010616 avg/max bytes residency (7 samples), 334M in use, 0.00 INIT (0.00 elapsed), 3.46 MUT (3.51 elapsed), 6.41 GC (6.80 elapsed) :ghc>>

real	0m10.351s

user	0m9.865s
sys	0m0.455s

Luke Palmer recommended some laziness techniques.  Notably I think he
added strictness to the STRef update value computation and used fmap
(x:) to yield a value prior to the recursion.  I don't know if the
latter is also achieved by Applicative's <$> that I used above, but
here is the updated version:

> transform2 xs = runST (newSTRef 0 >>= work xs)
>     where work [] _ = return []
>           work (e:es) s = do n <- readSTRef s
>                              writeSTRef s $! shiftR e 4
>                              let r = if e < 32 then e else n+e
>                              fmap (r :) $ work es s
>

This yields nearly identical results (actually slightly worse, but
that may be within the measuring variance):

./fproc 2 +RTS -t
2
<<ghc: 2643891308 bytes, 5025 GCs, 130139248/527043288 avg/max bytes residency (9 samples), 1113M in use, 0.00 INIT (0.00 elapsed), 3.17 MUT (3.29 elapsed), 10.81 GC (11.95 elapsed) :ghc>>

real	0m15.378s
user	0m13.985s
sys	0m1.346s

And just to verify that the performance is not unique to the ST monad,
here's the same thing with the State monad:

> transform3 xs = evalState (work xs) 0
>     where work [] = return []
>           work (e:es) = do n <- get
>                            put $ shiftR e 4
>                            let r = if e < 32 then e else n+e
>                            (r :) <$> work es

./fproc 3 +RTS -t
3

<<ghc: 2783483064 bytes, 5292 GCs, 130145645/527083664 avg/max bytes residency (9 samples), 1106M in use, 0.00 INIT (0.00 elapsed), 3.42 MUT (3.51 elapsed), 10.51 GC (11.69 elapsed) :ghc>>

real	0m15.351s
user	0m13.932s
sys	0m1.369s

Pretty much the same as the ST results.

Now to try the iteratee approach.  I've used John Millikin's
Enumerator package.  His package provides an Enumerator for ByteString
and I don't have any intermediate processing that I'd use an
Enumeratee for, so I simply write an Iteratee.  This is a little
different than the ST/State monad forms above because I move the
processing of writing the output file into the Iteratee, but this is
in line with the overall intent of maximal laziness and interleaving
the file reading and writing.

> transform4 :: MonadIO m => Handle -> E.Iteratee B.ByteString m ()
> transform4 h = E.continue $ work 0
>     where work n E.EOF = E.yield () E.EOF
>           work n (E.Chunks []) = E.continue $ work n
>           work n (E.Chunks (e:es)) =
>               let op a b = (shiftR b 4, if b < 32 then b else a+b)
>                   (m, r) = B.mapAccumL op n e
>               in do liftIO $ B.hPut h r
>                     work m $ E.Chunks es

The state element is now simply the first argument to the recursive
inner work function.  Each chunk will be a ByteString, so I use
mapAccumL to process each ByteString as it's provided.

The testE wrapper is enumerator equivalent of the testT wrapper used
with the ST and State monads.

> testE n t = do print n
>                h <- openFile (oup ++ show n) WriteMode
>                E.run_ (EB.enumFile inp $$ t h)

The results of this Iteratee approach:

./fproc 4 +RTS -t

4
<<ghc: 1576097772 bytes, 2968 GCs, 69753/69776 avg/max bytes residency (22 samples), 2M in use, 0.00 INIT (0.00 elapsed), 2.52 MUT (2.61 elapsed), 0.03 GC (0.05 elapsed) :ghc>>

real	0m2.663s
user	0m2.549s
sys	0m0.078s

Very nice!  That's a throughput rate of about 1.85MB/s (almost 6x
faster) and memory consumption is about half of the ST/State monad
processes.

It's also worth noting that pretty much everything but the last 4
lines of the Iteratee version are mostly standard boilerplate for an
Iteratee.  The mapAccumL adds a little complexity, but the code
footprint is roughly the same size as the ST/State monad forms.
Iteratees do take some mental wrestling to come to terms with, but for
me it was probably on par with my initial introduction to using State
monads.

There are many writeups on Iteratees but I heavily replied on John
Millikin's description ("Understanding Iteratees" on
http://john-millikin.com/software/enumerator), Oleg's original paper
(http://okmij.org/ftp/Streams.html) and also Michael Snoyman's blog
(linked from the Millikin's page and directly beginning at
http://docs.yesodweb.com/blog/enumerators-tutorial-part-1).

For the next form, I'll follow Michael's lead and use the fact that an
Iteratee is itself a Monad instance to try an alternate form of
writing the transformation.  Most of the boilerplate I referred to
above is absorbed into the EL.head operation:

> transform5 :: MonadIO m => Handle -> E.Iteratee B.ByteString m ()
> transform5 h = work 0
>     where work n = do e <- EL.head
>                       case e of
>                         Nothing -> return ()
>                         Just e' -> next n e'
>           next n e = do (m,r) <- return $ B.mapAccumL op n e
>                         liftIO $ B.hPut h r
>                         work m

>           op a b = (shiftR b 4, if b < 32 then b else a+b)

As would be expected, this has nearly identical performance to the
non-monadic version:

./fproc 5 +RTS -t
5
<<ghc: 1576361992 bytes, 2969 GCs, 69708/69716 avg/max bytes residency (22 samples), 2M in use, 0.00 INIT (0.00 elapsed), 2.27 MUT (2.31 elapsed), 0.03 GC (0.05 elapsed) :ghc>>

real	0m2.356s
user	0m2.300s
sys	0m0.056s

And finally just for the sake of golfing, I reduced the work function in the previous example a litle bit:

> transform6 :: MonadIO m => Handle -> E.Iteratee B.ByteString m ()
> transform6 h = work 0
>     where work n = EL.head >>= maybe (return ()) (next n)
>           next n e = do let (m,r) = B.mapAccumL op n e
>                         liftIO $ B.hPut h r
>                         work m
>           op a b = (shiftR b 4, if b < 32 then b else a+b)

That's not so bad!  The learning curve of Iteratees is non-trivial,
but the results are pretty readable, IMHO.

And here's verification that the output is reasonable:

$ ls -1sh *.example*
4.9M input.example
4.9M output.example1
4.9M output.example2
4.9M output.example3
4.9M output.example4
4.9M output.example5
4.9M output.example6

Hopefully this has been a useful comparison of using Iteratee
techniques in relation to more conventional monads, and the
performance results are good support of the usefulness of Iteratee's.

As always, the greatest benefit was probably for myself in actually
implementing and writing this up, but if you read through this far I
hope you found it readable and useful.

-Kevin Quick

-- 
-KQ