[Haskell-cafe] Re: String vs ByteString

Fri Aug 13 14:42:05 EDT 2010

On Friday 13 August 2010 19:53:37, Bryan O'Sullivan wrote:
> On Fri, Aug 13, 2010 at 9:55 AM, Daniel Fischer 
<daniel.is.fischer at web.de>wrote:
> > That's an unfortunate example. Using the stringsearch package,
> > substring searching in ByteStrings was considerably faster than in
> > Data.Text in my tests.
>
> Interesting. Got a test case so I can repro and fix? :-)

Sure, use http://norvig.com/big.txt (~6.2M), cat it together a few times to 
test on larger files.

ByteString code (bmLazy.hs):
----------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import System.Environment (getArgs)
import qualified Data.ByteString.Char8 as C
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Lazy.Search

main :: IO ()
main = do
    (file : pat : _) <- getArgs
    let !spat = C.pack pat
        work = indices spat
    L.readFile file >>= print . length . work
----------------------------------------------------------------
Data.Text.Lazy (textLazy.hs):
----------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import System.Environment (getArgs)
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as TIO
import Data.Text.Lazy.Search

main :: IO ()
main = do
    (file : pat : _) <- getArgs
    let !spat = T.pack pat
        work = indices spat
    TIO.readFile file >>= print . length . work
----------------------------------------------------------------
(Data.Text.Lazy.Search is of course not exposed by default ;), I use 
text-0.7.2.1)

Some local timings:

1. real words in a real text file:

$ time ./textLazy big.txt the
92805                                                                                   
0.59user 0.00system 0:00.61elapsed 97%CPU
$ time ./bmLazy big.txt the92805                                                                                 
0.02user 0.01system 0:00.04elapsed 104%CPU

$ time ./textLazy big.txt and
43587                                                                                   
0.56user 0.01system 0:00.58elapsed 100%CPU
$ time ./bmLazy big.txt and
43587                                                                                 
0.02user 0.01system 0:00.03elapsed 88%CPU

$ time ./textLazy big.txt mother
317
0.44user 0.01system 0:00.46elapsed 99%CPU
$ time ./bmLazy big.txt mother
317
0.00user 0.01system 0:00.02elapsed 69%CPU

$ time ./textLazy big.txt deteriorate
2
0.37user 0.00system 0:00.38elapsed 98%CPU
$ time ./bmLazy big.txt deteriorate
2
0.01user 0.01system 0:00.02elapsed 114%CPU

$ time ./textLazy big.txt "Project Gutenberg"
177
0.37user 0.00system 0:00.38elapsed 97%CPU
$ time ./bmLazy big.txt "Project Gutenberg"
177
0.00user 0.01system 0:00.01elapsed 100%CPU

2. periodic pattern in a file of 33.4M of aaaaa:

$ time ./bmLazy ../AAA 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
34999942
1.22user 0.04system 0:01.30elapsed 97%CPU
$ time ./textLazy ../AAA 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
593220
3.07user 0.03system 0:03.14elapsed 98%CPU

Oh, that's closer, but text doesn't find overlapping matches, well, we can 
do that too (replace indices with nonOverlappingIndices):

$ time ./noBMLazy ../AAA 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
593220
0.18user 0.04system 0:00.23elapsed 97%CPU

Yeah, that's more like it :D