[Haskell-cafe] [RFC] benchmarks of bytestrings, teaser

Don Stewart dons at galois.com
Sun Dec 16 18:21:57 EST 2007

I've had a look at how some of the code was being compiled for 
strict and lazy bytestrings, and also which rules weren't firing.
With some small tweaks the code seems back in good shape.

An updated bytestring library is at :


Enjoy! :)


Summary: the suspicious lazy bytestring program works now. (constant
space, and fastest overall, as expected originally)

Program 1, lazy bytestring length . filter

        ./A +RTS -sstderr < 150M  1.01s user 0.10s system 98% cpu 1.123 total
        40M allocated

  * Today (fixed!):
        ./A +RTS -sstderr < 150M  0.26s user 0.06s system 96% cpu 0.332 total
        2M allocated

    Reason, deprecated array fusion mucking up the optimiser.

I think we can close this regression.


Also, I had a look at Program 3: lazy bytestring, custom loop

    Unchanged. 2.4s, constant space. This was a bit slow.

Further investigation shows lots of unnecessary bounds checks, as we
take apart the Chunk lazy bytestring type, then test and continue.

This representation was chosen to make it possible to process chunks
efficiently, so that we can avoid these bounds check. 

Something like this instead:

        cnt :: Int -> B.ByteString -> Int
        cnt n B.Empty        = n
        cnt n (B.Chunk x xs) = cnt (n + cnt_strict 0 x) xs  -- process lazy spine

            -- now we can process a chunk without checking for Empty
                cnt_string !i !s                            -- then strict chunk 
                    | S.null s  = i
                    | c == ' '  = cnt_strict (i+1) t
                    | otherwise = cnt_strict i     t
                    (c,t) = (S.w2c (S.unsafeHead s), S.unsafeTail s) -- no bounds check

        main = do s <- B.getContents; print (cnt 0 s)

Let's us avoid redundant checks for Empty, while allowing 'go' to
avoid unnecessary checks for the empty strict bytestring. This is 
some 4x faster.

This alternating between lazy spines and strict chunk processing is
the best way to get reliable performance from lazy bytestring custom loops.

-- Don

More information about the Haskell-Cafe mailing list