efficient ByteString mapM_ for IO/ST?
Brandon Moore
brandon_m_moore at yahoo.com
Mon Mar 21 21:40:20 CET 2011
Is there an efficient way to iterate over the bytes of a ByteString?
For a benchmark, I am just computing the sum of the bytes in an STRef.
This is a bit silly, but makes for simple test code. (Accumulating a sum
in a boxed STRef is several times slower than computing a full histogram
in an STUArray, and foldl' (\x b -> x + fromIntegral b) 0 is faster still).
The skeleton of the test program is
import qualified Data.ByteString.Lazy as L
import Control.Monad.ST
import Control.Monad
import Data.STRef
import Data.Word
mapMBS_ST :: (Word8 -> ST s ()) -> L.ByteString -> ST s ()
mapMBS_ST = ???
main = do
  bytes <- L.readFile "input"
  print $ runST (do
    v <- newSTRef (0 :: Word64)
    mapMBS_ST (\b -> do
      x <- readSTRef v
      writeSTRef v $! x + fromIntegral b) bytes
    readSTRef v)
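For comparison, the STUArray histogram variant mentioned above might look
roughly like this. This is only a sketch, not the code actually benchmarked;
it reuses mapMBS_ST and assumes the extra imports Data.Array.ST and
Data.Array.Unboxed.

import Data.Array.ST (STUArray, newArray, readArray, writeArray, runSTUArray)
import Data.Array.Unboxed (elems)

histogram :: L.ByteString -> [Word64]
histogram bytes = elems $ runSTUArray (do
  -- one counter per possible byte value
  arr <- newArray (0, 255) 0 :: ST s (STUArray s Word8 Word64)
  mapMBS_ST (\b -> do
    n <- readArray arr b
    writeArray arr b $! n + 1) bytes
  return arr)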
The file "input" is 1GB of random bytes (from /dev/urandom).
A version which uses the internal representation and
explicit peeks runs several times faster than the
best implementation I could write without using the
unsafe interface.
The safe implementations I tried were
sum3: mapMBS_ST f bytes = L.foldr (\b r -> f b >> r) (return ()) bytes

sum4: mapMBS_ST f bytes = if L.null bytes then return ()
                          else f (L.head bytes) >> mapMBS_ST f (L.tail bytes)

sum5: mapMBS_ST f bytes = mapM_ f (L.unpack bytes)
sum uses the unsafe implementation given at the end of this message, and sum2 simply uses foldl'.
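For reference, sum2 is essentially the strict left fold mentioned at the top
of this message; a minimal sketch (details may differ from the program
actually benchmarked):

main :: IO ()
main = do
  bytes <- L.readFile "input"
  print (L.foldl' (\x b -> x + fromIntegral b) 0 bytes :: Word64)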
Here are times. Runs vary by a few hundredths of a second user time.
./sum 8.62s user 0.39s system 99% cpu 9.006 total
./sum2 1.26s user 0.34s system 99% cpu 1.604 total
./sum3 72.25s user 0.50s system 99% cpu 1:12.96 total
./sum4 20.55s user 0.66s system 99% cpu 21.251 total
./sum5 391.18s user 1.23s system 99% cpu 6:32.76 total
Comparing sum and sum2 gives some idea of the overhead that
comes from the boxed STRef, so the relative overhead of the
different iteration routines is probably even higher.
Looking at the ByteString code, I suspect the foldr version
performs badly because foldr runs a single unsafePerformIO around
a loop that builds the result in an accumulating argument, and GHC
doesn't see that the pointer accesses in that loop could be lifted out
of the unsafePerformIO and interleaved with the IO actions built up
in the argument. (I don't expect GHC to be too smart about taking
advantage of unsafePerformIO.)
Perhaps foldr could be modified so that the foldr version above
optimizes nicely. Are the benchmarks in
http://darcs.haskell.org/bytestring/
good enough to catch performance regressions?
That repository has recent changes, but
http://www.cse.unsw.edu.au/~dons/fps.html,
which is supposedly the bytestring homepage,
seems a little out of date.
Brandon
Code for the unsafe map follows:
module BytestringMaps (mapMS_IO, mapMBS_IO, mapMS_ST, mapMBS_ST) where
import qualified Data.ByteString.Lazy as L
import Data.ByteString as S
import Data.ByteString.Unsafe as BU
import Data.Word
import Control.Monad
import Data.Array.Storable
import Foreign.Storable
import Foreign.Ptr
import Control.Monad.ST
-- Monadic loop over the half-open index range [l, u)
for l u m = go l
  where
    go ix | ix >= u   = return ()
          | otherwise = m ix >> go (ix+1)
{-# INLINE mapMS_IO #-}
mapMS_IO :: (Word8 -> IO ()) -> S.ByteString -> IO ()
mapMS_IO fun chunk = unsafeUseAsCStringLen chunk (\(ptr,len) ->
  let ptr' = castPtr ptr :: Ptr Word8
  in for 0 len (\ix -> do
       c <- peekElemOff ptr' ix
       fun c))
{-# INLINE mapMBS_IO #-}
mapMBS_IO :: (Word8 -> IO ()) -> L.ByteString -> IO ()
mapMBS_IO fun str = mapM_ (mapMS_IO fun) (L.toChunks str)
{-# INLINE mapMS_ST #-}
mapMS_ST :: (Word8 -> ST s ()) -> S.ByteString -> ST s ()
mapMS_ST fun chunk = unsafeIOToST $ unsafeUseAsCStringLen chunk (\(ptr,len) ->
  let ptr' = castPtr ptr :: Ptr Word8
  in for 0 len (\ix -> do
       c <- peekElemOff ptr' ix
       unsafeSTToIO (fun c)))
{-# INLINE mapMBS_ST #-}
mapMBS_ST :: (Word8 -> ST s ()) -> L.ByteString -> ST s ()
mapMBS_ST fun str = mapM_ (mapMS_ST fun) (L.toChunks str)
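The IO variants can be driven the same way; a minimal usage sketch mirroring
the ST example above (not part of the benchmarks, assuming import Data.IORef):

sumIO :: L.ByteString -> IO Word64
sumIO bytes = do
  v <- newIORef (0 :: Word64)
  mapMBS_IO (\b -> do
    x <- readIORef v
    writeIORef v $! x + fromIntegral b) bytes
  readIORef v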