[Haskell-cafe] [RFC] benchmarks of bytestrings, teaser

Duncan Coutts duncan.coutts at worc.ox.ac.uk
Sat Dec 15 09:34:27 EST 2007


On Sat, 2007-12-15 at 09:25 +0100, Peter Lund wrote:
> What do you think the relative speeds are of the six small haskell
> programs at the end of this email?

Ok, I presume this is a guessing game and we're supposed to just look at
the code without running and timing them.

> All they do is read from stdin and count the number of spaces they see.
> There are two that use strict bytestrings, two that use lazy
> bytestrings, and two that use the standard Haskell strings.  Three use a
> recursive function with an accumulator parameter and three use a foldl
> with a lambda function.
> 
> Say the fastest one takes the time 1.  How much time will the others
> take?
> 
> And how about memory?  How much memory do you think they require?  Let's
> say we feed a 150MB(*) file into each of them, how many megabytes do you
> think they end up using (as seen from the OS, not in terms of how big
> the live heap is)?
> 
> I'm going to post full benchmarks + analysis on Wednesday.

Right'o. I'll have a go. Let's see if I can't embarrass myself by being
completely inaccurate.


> PS: For extra credit, what do you think is the peak memory use for this
>     program when given an input file of 150MB?

Ok.

> {-# LANGUAGE BangPatterns #-}
> 
> import qualified Data.ByteString.Lazy.Char8 as B
> import GHC.Int (Int64)
> 
> -- note that D.BS.Lazy.Char8.length is ByteString -> Int64
> --           D.BS.C8.length is ByteString -> Int

Yes, because a strict bytestring cannot be bigger than the size of
virtual memory, and with GHC at least, Int tracks the size of a machine
pointer.
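
A quick sketch of that point (the exact numbers depend on the platform;
bitSize from Data.Bits is only used here to show that Int is word
sized):

    import Data.Bits (bitSize)

    -- GHC's Int is as wide as a machine pointer, so it can index any
    -- strict ByteString that fits in the address space; a lazy
    -- ByteString streamed from a handle can be longer than memory,
    -- hence the Int64 length.
    main :: IO ()
    main = do
      print (bitSize (0 :: Int))   -- 32 or 64, depending on the platform
      print (maxBound :: Int)      -- 2^31-1 or 2^63-1 respectively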

> cnt	:: B.ByteString -> Int64
> cnt bs	= B.length (B.filter (== ' ') bs)
> 
> main = do s <- B.getContents
> 	  print (cnt s)

Hmm. So that should work in constant memory, holding only a few 64k
chunks at once. I'd expect this to be pretty fast.
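
To illustrate why it stays in constant space, here's a little sketch
that walks the lazy ByteString chunk by chunk (the chunk size itself is
an implementation detail of the bytestring library):

    import qualified Data.ByteString.Lazy.Char8 as B

    -- A lazy ByteString is a lazy list of strict chunks; each chunk
    -- becomes garbage as soon as 'length' has stepped past it, so only
    -- a couple of chunks are ever live at a time.
    main :: IO ()
    main = do s <- B.getContents
              print (length (B.toChunks s))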

> 
> 
> 
> ==============================
> hs/space-bs-c8-acc-1.hs:
> {-# LANGUAGE BangPatterns #-}
> 
> import qualified Data.ByteString.Char8 as B
> 
> cnt	:: Int -> B.ByteString -> Int
> cnt !acc bs = if B.null bs
> 		then acc
> 	        else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs)
> 
> main = do s <- B.getContents
> 	  print (cnt 0 s)

This uses strict bytestrings, so it will use at least 150MB, and that
will make it a good deal slower. In fact it'll be worse than that,
since getContents does not know in advance how big the input will be,
so it has to play the doubling-and-copying game; it'll end up copying
all the data roughly twice. cnt is strict and tail recursive, so that
shouldn't be a problem, though it's probably not as fast as the earlier
length . filter, since head, tail and null all have to do bounds
checks.
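
One way to cut down on those repeated checks (a sketch of the idea, not
a claim about what this particular program compiles to) is to pattern
match on B.uncons, which tests for emptiness once per character:

    {-# LANGUAGE BangPatterns #-}

    import qualified Data.ByteString.Char8 as B

    -- One emptiness test per character via uncons, instead of separate
    -- null, head and tail calls.
    cnt :: Int -> B.ByteString -> Int
    cnt !acc bs = case B.uncons bs of
                    Nothing        -> acc
                    Just (c, rest) -> cnt (if c == ' ' then acc + 1 else acc) rest

    main :: IO ()
    main = do s <- B.getContents
              print (cnt 0 s)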

> ==============================
> hs/space-bslc8-acc-1.hs:
> {-# LANGUAGE BangPatterns #-}
> 
> import qualified Data.ByteString.Lazy.Char8 as B
> 
> cnt	:: Int -> B.ByteString -> Int
> cnt !acc bs = if B.null bs
> 		then acc
> 	        else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs)
> 
> main = do s <- B.getContents
> 	  print (cnt 0 s)

For the same reason as above, I'd expect this cnt to be slower than
B.length . B.filter (== ' ').

> ==============================
> hs/space-xxxxx-acc-1.hs:
> {-# LANGUAGE BangPatterns #-}
> 
> cnt	:: Int -> String -> Int
> cnt !acc bs = if null bs
> 		then acc
> 	        else cnt (if head bs == ' ' then acc+1 else acc) (tail bs)
> 
> main = do s <- getContents
> 	  print (cnt 0 s)

Lazy, so constant memory use, but much higher constant factors due to
using String.

> ==============================
> hs/space-bs-c8-foldlx-1.hs:
> {-# LANGUAGE BangPatterns #-}
> 
> import qualified Data.ByteString.Char8 as B
> 
> cnt	:: B.ByteString -> Int
> cnt bs	= B.foldl' (\sum c -> if c == ' ' then sum+1 else sum) 0 bs
> 
> main = do s <- B.getContents
> 	  print (cnt s)

This is of course still strict, so that's going to make the reading
slow.

It is a manually fused B.length . B.filter (== ' '), which hopefully is
the same speed as the automatically fused one if the fusion is working
ok. If not, then the B.length . B.filter (== ' ') version will be doing
an extra copy, and memory writes are expensive.
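
As a further baseline (not one of the six benchmark programs), the
strict Char8 API also has a specialised B.count, which does the
filter-and-length in a single pass:

    import qualified Data.ByteString.Char8 as B

    -- count ' ' walks the buffer once with no intermediate ByteString,
    -- so it gives an idea of how fast the fused versions can get.
    cnt :: B.ByteString -> Int
    cnt = B.count ' '

    main :: IO ()
    main = do s <- B.getContents
              print (cnt s)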

> ==============================
> hs/space-bslc8-foldlx-1.hs:
> {-# LANGUAGE BangPatterns #-}
> 
> import qualified Data.ByteString.Lazy.Char8 as B
> 
> cnt	:: B.ByteString -> Int
> cnt bs	= B.foldl' (\sum c -> if c == ' ' then sum+1 else sum) 0 bs
> 
> main = do s <- B.getContents
> 	  print (cnt s)

As above, but now in constant memory.

> ==============================
> hs/space-xxxxx-foldl.hs:
> {-# LANGUAGE BangPatterns #-}
> 
> cnt	:: String -> Int
> cnt bs	= foldl (\sum c -> if c == ' ' then sum+1 else sum) 0 bs
> 
> main = do s <- getContents
> 	  print (cnt s)

Oh no! Not foldl, that's a killer.
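
The problem is that foldl builds a chain of unevaluated +1 thunks as
long as the input before print forces anything. A sketch of the usual
fix, using the strict foldl' from Data.List:

    import Data.List (foldl')

    -- foldl' forces the accumulator at every step, so the count stays
    -- a plain Int instead of a 150-million-deep chain of thunks.
    cnt :: String -> Int
    cnt = foldl' (\sum c -> if c == ' ' then sum + 1 else sum) 0

    main :: IO ()
    main = do s <- getContents
              print (cnt s)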


Ok, so I think the best way to summarise is to organise by data type,
since I think that'll dominate.

So I think the lazy bytestring versions will be fastest, due to having
the best memory access patterns and doing the least copying. I think
the foldl versions will be faster than the explicit accumulators, due
to having fewer bounds checks. My guess at the ordering, fastest first:

space-bslc8-foldlx-1
space-bslc8-acc-1

space-bs-c8-foldlx-1
space-bs-c8-acc-1

space-xxxxx-acc-1
space-xxxxx-foldl

I'll try guessing at some ratios:

1.0 space-bslc8-foldlx-1
1.1 space-bslc8-acc-1

2.0 space-bs-c8-foldlx-1
2.1 space-bs-c8-acc-1

4.0 space-xxxxx-acc-1
15  space-xxxxx-foldl


Duncan


