efficient Bytestring mapM_ for IO/ST?

Wed Mar 23 03:58:28 CET 2011

On 3/22/11 7:36 AM, John Lato wrote:
>> From: wren ng thornton<wren at freegeek.org>
>> On 3/21/11 4:40 PM, Brandon Moore wrote:
>>> Is there an efficient way to iterate over the bytes of a ByteString?
>>
>> The code I've been using (rather similar to your unsafe map) is:
>>
>>      import qualified Data.ByteString.Internal as BSI
>>      import qualified Foreign.ForeignPtr       as FFI
>>
>>      foldIO :: (a ->  Word8 ->  IO a) ->  a ->  ByteString ->  IO a
>>      foldIO f z0 (BSI.PS fp off len) =
>>          FFI.withForeignPtr fp $ \p0 ->  do
>>              let q = p0 `plusPtr` (off+len)
>>              let go z p
>>                      | z `seq` p `seq` False = undefined
>>                      | p == q    = return z
>>                      | otherwise = do
>>                          w<- peek p
>>                          z'<- f z w
>>                          go z' (p `plusPtr` 1)
>>              go z0 (p0 `plusPtr` off)
>>      {-# INLINE foldIO #-}
>>
>> Some things to note:
>>
>> * It's a left fold rather than a right fold, just like foldM, except
>> that we can't generalize it to work for all monads. (We could do a right
>> fold just as easily by starting with p0`plusPtr`(off+len) and counting
>> down to p0`plusPtr`off if desired.)
>>
>> * Because we're just keeping the head pointer, we can increment it as we
>> go instead of using peekElemOff. This improves the performance by only
>> performing one addition per loop (the p++) instead of two (ix++ and
>> *(p+ix)), and by requiring one less register (for ix).
>
> Out of curiosity, do you have measurements that demonstrate improved
> performance from this?

Only anecdotally, not formally. (As per my subsequent bullet about not 
doing much performance hacking on it.) But if you accept anecdotes then 
yes. For my particular use case I think freeing up the register was more 
important than removing the extra addition, since that allows avoiding 
stack spills et al.

> When I did some tests with a similar problem, there
> was no noticeable difference between the two approaches.  In my case I also
> needed the element index though, so it was a slightly different problem.

Yes, that's a rather different problem. I wouldn't expect to see any 
difference for this task since you have to maintain the index anyways. 
(E.g., there's not much difference between doing { *p ; ++i ; ++p } vs { 
*(p+i) ; ++i } unless you're trying to manually schedule your ALU 
pipelines.)

-- 
Live well,
~wren