[Haskell-beginners] Re: Re: Re: When to use ByteString rather than
[Char] ... ?
Maciej Piechotka
uzytkownik2 at gmail.com
Sun Apr 11 19:01:36 EDT 2010
On Sun, 2010-04-11 at 22:07 +0200, Daniel Fischer wrote:
> Am Sonntag 11 April 2010 18:04:14 schrieb Maciej Piechotka:
> >
> > Of course:
> > - I haven't done any tests. I guessed (which I written)
>
> I just have done a test.
> Input file: "big.txt" from Norvig's spelling checker (6488666 bytes, no
> characters outside latin1 range) and the same with
> ('\n':map toEnum [256 .. 10000] ++ "\n") appended.
>
Converted myspell polish dictonary (a few % of non-ascii chars) added
twice (6531616 bytes).
> Code:
>
> main = A.readFile "big.txt" >>= print . B.length
>
{-# LANGUAGE BangPatterns #-}
import Control.Applicative
import qualified Data.ByteString as S
import qualified Data.ByteString.UTF8 as SU
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString.Lazy.UTF8 as LU
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TL
import Data.List hiding (find)
import Data.Time.Clock
import System.Mem
import System.IO hiding (readFile)
import Text.Printf
import Prelude hiding (readFile)
readFile :: String -> IO String
readFile p = do h <- openFile p ReadMode
hSetEncoding h utf8
hGetContents h
measure :: IO a -> IO (NominalDiffTime)
measure a = do performGC
start <- getCurrentTime
!_ <- a
end <- getCurrentTime
return $! end `diffUTCTime` start
find !x v | fromEnum v == 32 = x + 1
| otherwise = x
find' !x 'ą' = x + 1
find' !x 'Ą' = x + 1
find' !x _ = x
main = printMeasure "Length - ByteString" (S.length <$> S.readFile
"dict") >>
printMeasure "Length - Lazy ByteString" (L.length <$> L.readFile
"dict") >>
printMeasure "Length - String" (length <$> readFile "dict") >>
printMeasure "Length - UTF8 ByteString" (SU.length <$> S.readFile
"dict") >>
printMeasure "Length - UTF8 Lazy ByteString" (LU.length <$>
L.readFile "dict") >>
printMeasure "Length - Text" (T.length <$> T.readFile "dict") >>
printMeasure "Length - Lazy Text" (TL.length <$> TL.readFile
"dict") >>
printMeasure "Searching - ByteString" (S.foldl' find 0 <$>
S.readFile "dict") >>
printMeasure "Searching - ByteString" (L.foldl' find 0 <$>
L.readFile "dict") >>
printMeasure "Searching - String" (foldl' find 0 <$> readFile
"dict") >>
printMeasure "Searching - UTF8 ByteString" (SU.foldl find 0 <$>
S.readFile "dict") >>
printMeasure "Searching - UTF8 Lazy ByteString" (LU.foldl find 0
<$> L.readFile "dict") >>
printMeasure "Searching - Text" (T.foldl' find 0 <$> T.readFile
"dict") >>
printMeasure "Searching - Lazy Text" (TL.foldl' find 0 <$>
TL.readFile "dict") >>
printMeasure "Searching ą - String" (foldl' find' 0 <$> readFile
"dict") >>
printMeasure "Searching ą - UTF8 ByteString" (SU.foldl find' 0 <
$> S.readFile "dict") >>
printMeasure "Searching ą - UTF8 Lazy ByteString" (LU.foldl find'
0 <$> L.readFile "dict") >>
printMeasure "Searching ą - Text" (T.foldl' find' 0 <$>
T.readFile "dict") >>
printMeasure "Searching ą - Lazy Text" (TL.foldl' find' 0 <$>
TL.readFile "dict")
printMeasure :: String -> IO a -> IO ()
printMeasure s a = measure a >>= \v -> printf "%-40s %8.5f s\n" (s ++
":") (realToFrac v :: Float)
> where (A,B) is a suitable combination of
> - Data.ByteString[.Lazy][.Char8][.UTF8]
> - Data.Text[.IO]
> - Prelude
>
> Times:
> Data.ByteString[.Lazy]: 0.00s
> Data.ByteString.UTF8: 0.14s
> Prelude: 0.21s
> Data.ByteString.Lazy.UTF8: 0.56s
> Data.Text: 0.66s
>
Optimized:
Length - ByteString: 0.01223 s
Length - Lazy ByteString: 0.00328 s
Length - String: 0.15474 s
Length - UTF8 ByteString: 0.19945 s
Length - UTF8 Lazy ByteString: 0.30123 s
Length - Text: 0.70438 s
Length - Lazy Text: 0.62137 s
String seems to be fastest correct
Searching - ByteString: 0.04604 s
Searching - ByteString: 0.04424 s
Searching - String: 0.18178 s
Searching - UTF8 ByteString: 0.32606 s
Searching - UTF8 Lazy ByteString: 0.42984 s
Searching - Text: 0.26599 s
Searching - Lazy Text: 0.37320 s
While ByteString is clear winner String is actually good compared to
others.
Searching ą - String: 0.18557 s
Searching ą - UTF8 ByteString: 0.32752 s
Searching ą - UTF8 Lazy ByteString: 0.43811 s
Searching ą - Text: 0.28401 s
Searching ą - Lazy Text: 0.37612
String is fastest? Hmmm.
Compiled:
Length - ByteString: 0.00861 s
Length - Lazy ByteString: 0.00409 s
Length - String: 0.16059 s
Length - UTF8 ByteString: 0.20165 s
Length - UTF8 Lazy ByteString: 0.31885 s
Length - Text: 0.70891 s
Length - Lazy Text: 0.65553 s
ByteString is also clear winner but String once again wins in 'correct'
section.
Searching - ByteString: 1.27414 s
Searching - ByteString: 1.27303 s
Searching - String: 0.56831 s
Searching - UTF8 ByteString: 0.68742 s
Searching - UTF8 Lazy ByteString: 0.75883 s
Searching - Text: 1.16121 s
Searching - Lazy Text: 1.76678 s
I mean... what? I may be doing something wrong
Searching ą - String: 0.32612 s
Searching ą - UTF8 ByteString: 0.41564 s
Searching ą - UTF8 Lazy ByteString: 0.52919 s
Searching ą - Text: 0.87463 s
Searching ą - Lazy Text: 1.52369 s
No comment.
Intepreted
Length - ByteString: 0.00511 s
Length - Lazy ByteString: 0.00378 s
Length - String: 0.16657 s
Length - UTF8 ByteString: 0.21639 s
Length - UTF8 Lazy ByteString: 0.33952 s
Length - Text: 0.79771 s
Length - Lazy Text: 0.65320 s
As with others.
Searching - ByteString: 9.12051 s
Searching - ByteString: 8.94038 s
Searching - String: 8.57391 s
Searching - UTF8 ByteString: 7.71766 s
Searching - UTF8 Lazy ByteString: 7.79422 s
Searching - Text: 8.34435 s
Searching - Lazy Text: 9.07538 s
Now they are pretty much equal.
Searching ą - String: 3.17010 s
Searching ą - UTF8 ByteString: 3.94399 s
Searching ą - UTF8 Lazy ByteString: 3.92382 s
Searching ą - Text: 3.32901 s
Searching ą - Lazy Text: 4.18038 s
Hmm. Still the best?
Your test:
Optimized Compiled Interpreted
ByteString: 0.011 0.011 0.421
ByteString Lazy: 0.006 0.006 0.535
String: 0.237 0.240 0.650
Text: 0.767 0.720 1.192
Text Lazy: 0.661 0.614 1.061
ByteString UTF8: 0.204 0.204 0.631
ByteString Lazy UTF8: 0.386 0.309 0.744
System:
Core 2 Duo T9600 2.80 GHz, 2 GiB RAM
Gentoo Linux x86-64.
Linux 2.6.33 + gentoo patches + ck.
Glibc 2.11
GHC 6.12.1
base 4.2.0.0
bytestring 0.9.1.5
text 0.7.1.0
utf8-string 0.3.6
PS. Tests were repeated a few times and each gave similar results.
>
> > - It wasn't written what is the typical case
>
> Aren't there several quite different typical cases?
> One fairly typical case is big ASCII or latin1 files (e.g. fasta files,
> numerical data). For those, usually ByteString is by far the best choice.
>
On the other hand - if you load the numerical data it is likely that:
- It will have some labels. The labels can happen to need non-ascii or
non-latin elements
- Biggest time will be spent on operating on numbers then strings.
> Another fairly typical case is *text* processing, possibly with text in
> different scripts (latin, hebrew, kanji, ...). Depending on what you want
> to do (and the encoding), any of Prelude.String, Data.Text and
> Data.ByteString[.Lazy].UTF8 may be a good choice, vanilla ByteStrings
> probably aren't. String and Text also have the advantage that you aren't
> tied to utf-8.
>
> Choose your datatype according to your problem, not one size fits all.
>
My measurements seems to prefer String but they are probably wrong.
Regards
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 836 bytes
Desc: This is a digitally signed message part
Url : http://www.haskell.org/pipermail/beginners/attachments/20100411/5663b187/attachment.bin
More information about the Beginners
mailing list