[Haskell-cafe] Re: String vs ByteString
Daniel Fischer
daniel.is.fischer at web.de
Fri Aug 13 14:42:05 EDT 2010
On Friday 13 August 2010 19:53:37, Bryan O'Sullivan wrote:
> On Fri, Aug 13, 2010 at 9:55 AM, Daniel Fischer
<daniel.is.fischer at web.de>wrote:
> > That's an unfortunate example. Using the stringsearch package,
> > substring searching in ByteStrings was considerably faster than in
> > Data.Text in my tests.
>
> Interesting. Got a test case so I can repro and fix? :-)
Sure, use http://norvig.com/big.txt (~6.2M), cat it together a few times to
test on larger files.
ByteString code (bmLazy.hs):
----------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
module Main (main) where
import System.Environment (getArgs)
import qualified Data.ByteString.Char8 as C
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Lazy.Search
main :: IO ()
main = do
(file : pat : _) <- getArgs
let !spat = C.pack pat
work = indices spat
L.readFile file >>= print . length . work
----------------------------------------------------------------
Data.Text.Lazy (textLazy.hs):
----------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
module Main (main) where
import System.Environment (getArgs)
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as TIO
import Data.Text.Lazy.Search
main :: IO ()
main = do
(file : pat : _) <- getArgs
let !spat = T.pack pat
work = indices spat
TIO.readFile file >>= print . length . work
----------------------------------------------------------------
(Data.Text.Lazy.Search is of course not exposed by default ;), I use
text-0.7.2.1)
Some local timings:
1. real words in a real text file:
$ time ./textLazy big.txt the
92805
0.59user 0.00system 0:00.61elapsed 97%CPU
$ time ./bmLazy big.txt the92805
0.02user 0.01system 0:00.04elapsed 104%CPU
$ time ./textLazy big.txt and
43587
0.56user 0.01system 0:00.58elapsed 100%CPU
$ time ./bmLazy big.txt and
43587
0.02user 0.01system 0:00.03elapsed 88%CPU
$ time ./textLazy big.txt mother
317
0.44user 0.01system 0:00.46elapsed 99%CPU
$ time ./bmLazy big.txt mother
317
0.00user 0.01system 0:00.02elapsed 69%CPU
$ time ./textLazy big.txt deteriorate
2
0.37user 0.00system 0:00.38elapsed 98%CPU
$ time ./bmLazy big.txt deteriorate
2
0.01user 0.01system 0:00.02elapsed 114%CPU
$ time ./textLazy big.txt "Project Gutenberg"
177
0.37user 0.00system 0:00.38elapsed 97%CPU
$ time ./bmLazy big.txt "Project Gutenberg"
177
0.00user 0.01system 0:00.01elapsed 100%CPU
2. periodic pattern in a file of 33.4M of aaaaa:
$ time ./bmLazy ../AAA
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
34999942
1.22user 0.04system 0:01.30elapsed 97%CPU
$ time ./textLazy ../AAA
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
593220
3.07user 0.03system 0:03.14elapsed 98%CPU
Oh, that's closer, but text doesn't find overlapping matches, well, we can
do that too (replace indices with nonOverlappingIndices):
$ time ./noBMLazy ../AAA
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
593220
0.18user 0.04system 0:00.23elapsed 97%CPU
Yeah, that's more like it :D
More information about the Haskell-Cafe
mailing list