[Haskell-cafe] [RFC] benchmarks of bytestrings, teaser

Sat Dec 15 10:19:22 EST 2007

On Sat, 2007-12-15 at 14:34 +0000, Duncan Coutts wrote:

> Ok, I presume this is a guessing game and we're supposed to just look at
> the code without running and timing them.

Precisely :)

> > All they do is read from stdin and count the number of spaces they see.
> > There are two that use strict bytestrings, two that use lazy
> > bytestrings, and two that use the standard Haskell strings.  Three use a
> > recursive function with an accumulator parameter and three use a foldl
> > with a lambda function.
> > 
> > Say the fastest one takes the time 1.  How much time will the others
> > take?
> > 
> > And how about memory?  How much memory do you think they require?  Let's
> > say we feed a 150MB(*) file into each of them, how many megabytes do you
> > think they end up using (as seen from the OS, not in terms of how big
> > the live heap is)?
> > 
> > I'm going to post full benchmarks + analysis on Wednesday.
> 
> Right'o. I'll have a go. Lets see if I can't embarrass myself with being
> completely inaccurate.

Thanks for biting!

You were, thankfully, only almost completely inaccurate ;)

> > PS: For extra credit, what do you think is the peak memory use for this
> >     program when given an input file of 150MB?

> Hmm. So that should work in constant memory, a few 64 chunks at once.
> I'd expect this to be pretty fast.

You are right about the speed.
Can you guess a number in kilobytes?

> > 
> > 
> > 
> > ==============================
> > hs/space-bs-c8-acc-1.hs:
> > {-# LANGUAGE BangPatterns #-}
> > 
> > import qualified Data.ByteString.Char8 as B
> > 
> > cnt	:: Int -> B.ByteString -> Int
> > cnt !acc bs = if B.null bs
> > 		then acc
> > 	        else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs)
> > 
> > main = do s <- B.getContents
> > 	  print (cnt 0 s)
> 
> This uses strict bytestrings so will use at least 150Mb and that'll make
> it a good deal slower. In fact it'll be worse than that since
> getContents does not know in advance how big the input will be so it has
> to play the doubling and copying game. So it'll end up copying all the
> data roughly twice. cnt is strict and tail recursive so that shouldn't
> be any problem, though it's probably not as fast as the first length .
> filter since head, tail, null all have to do bounds checks.

You are right about the memory.  It is actually slightly faster than the
"extra credit" (length/filter combination) above.

> > ==============================
> > hs/space-bslc8-acc-1.hs:
> > {-# LANGUAGE BangPatterns #-}
> > 
> > import qualified Data.ByteString.Lazy.Char8 as B
> > 
> > cnt	:: Int -> B.ByteString -> Int
> > cnt !acc bs = if B.null bs
> > 		then acc
> > 	        else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs)
> > 
> > main = do s <- B.getContents
> > 	  print (cnt 0 s)
> 
> For the same reason as above, I'd expect this cnt to be slower than
> B.length . B.filter (== ' ')

It is slower but not for the same reason as above.

> > ==============================
> > hs/space-xxxxx-acc-1.hs:
> > {-# LANGUAGE BangPatterns #-}
> > 
> > cnt	:: Int -> String -> Int
> > cnt !acc bs = if null bs
> > 		then acc
> > 	        else cnt (if head bs == ' ' then acc+1 else acc) (tail bs)
> > 
> > main = do s <- getContents
> > 	  print (cnt 0 s)
> 
> Lazy, so constant memory use, but much higher constant factors due to
> using String.

Spot on.

> > ==============================
> > hs/space-bs-c8-foldlx-1.hs:
> > {-# LANGUAGE BangPatterns #-}
> > 
> > import qualified Data.ByteString.Char8 as B
> > 
> > cnt	:: B.ByteString -> Int
> > cnt bs	= B.foldl' (\sum c -> if c == ' ' then sum+1 else sum) 0 bs
> > 
> > main = do s <- B.getContents
> > 	  print (cnt s)
> 
> This is of course still strict so that's going to make the reading slow.

Nope.

> This is a manually fused B.length . B.filter (== ' ') which hopefully is
> the same speed as the automatically fused one if the fusion is working
> ok. If not, then the B.length . B.filter (== ' ') will be doing a extra
> copy, and memory writes are expensive.
> 
> > ==============================
> > hs/space-bslc8-foldlx-1.hs:
> > {-# LANGUAGE BangPatterns #-}
> > 
> > import qualified Data.ByteString.Lazy.Char8 as B
> > 
> > cnt	:: B.ByteString -> Int
> > cnt bs	= B.foldl' (\sum c -> if c == ' ' then sum+1 else sum) 0 bs
> > 
> > main = do s <- B.getContents
> > 	  print (cnt s)
> 
> As above but now in constant memory space.

Nope.

> > ==============================
> > hs/space-xxxxx-foldl.hs:
> > {-# LANGUAGE BangPatterns #-}
> > 
> > cnt	:: String -> Int
> > cnt bs	= foldl (\sum c -> if c == ' ' then sum+1 else sum) 0 bs
> > 
> > main = do s <- getContents
> > 	  print (cnt s)
> 
> Oh, no! not foldl that's a killer.

You think it's worse than the program just above?

> 
> Ok, so best way to summarise I think is to organise by data type since I
> think that'll dominate.
> 
> So I think the lazy bytestring versions will be fastest due to having
> the best memory access patterns and doing the least copying. I think the
> foldl's will be faster than the explicit accumulators due to having
> fewer bounds checks.
> 
> space-bslc8-foldlx-1
> space-bslc8-acc-1
> 
> space-bs-c8-foldlx-1
> space-bs-c8-acc-1
> 
> space-xxxxx-acc-1
> space-xxxxx-foldl
> 
> I'll try guessing at some ratios:
> 
> 1.0 space-bslc8-foldlx-1
> 1.1 space-bslc8-acc-1
> 
> 2.0 space-bs-c8-foldlx-1
> 2.1 space-bs-c8-acc-1
> 
> 4.0 space-xxxxx-acc-1
> 15  space-xxxxx-foldl

I've done the measurements on a 2GHz Athlon64 3000+, a 1667 MHz Core
Duo, and a 600MHz Pentium III.  They all show the same pattern (with a
few minor aberrations for the PIII).

You did got one of the relative speeds right ;)
(and some of the memory usages)

I've tested with ghc 6.8.1 and 6.9.20071119 and 6.9.20071208 (or
thereabouts).  6.6.1 won't run my benchmarks and it also won't let me
install bytestring-0.9.0.1 to replace its built-in version.

-Peter