[Haskell-beginners] Re: Re: Re: When to use ByteString rather than [Char] ... ?

Sun Apr 11 19:01:36 EDT 2010

On Sun, 2010-04-11 at 22:07 +0200, Daniel Fischer wrote:
> On Sunday 11 April 2010 18:04:14, Maciej Piechotka wrote:
> >
> > Of course:
> >  - I haven't done any tests. I guessed (as I wrote)
> 
> I have just done a test.
> Input file: "big.txt" from Norvig's spelling checker (6488666 bytes, no 
> characters outside latin1 range) and the same with
> ('\n':map toEnum [256 .. 10000] ++ "\n") appended.
> 

Converted MySpell Polish dictionary (a few % of non-ASCII chars) added
twice (6531616 bytes).

> Code:
> 
> main = A.readFile "big.txt" >>= print . B.length
> 

{-# LANGUAGE BangPatterns #-}
import Control.Applicative
import qualified Data.ByteString as S
import qualified Data.ByteString.UTF8 as SU
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString.Lazy.UTF8 as LU
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TL
import Data.List hiding (find)
import Data.Time.Clock
import System.Mem
import System.IO hiding (readFile)
import Text.Printf
import Prelude hiding (readFile)

readFile :: String -> IO String
readFile p = do h <- openFile p ReadMode
                hSetEncoding h utf8
                hGetContents h

measure :: IO a -> IO (NominalDiffTime)
measure a = do performGC
               start <- getCurrentTime
               !_ <- a
               end <- getCurrentTime
               return $! end `diffUTCTime` start

find !x v | fromEnum v == 32 = x + 1
          | otherwise        = x

find' !x 'ą' = x + 1
find' !x 'Ą' = x + 1
find' !x  _  = x

main = printMeasure "Length - ByteString"
         (S.length <$> S.readFile "dict") >>
       printMeasure "Length - Lazy ByteString"
         (L.length <$> L.readFile "dict") >>
       printMeasure "Length - String"
         (length <$> readFile "dict") >>
       printMeasure "Length - UTF8 ByteString"
         (SU.length <$> S.readFile "dict") >>
       printMeasure "Length - UTF8 Lazy ByteString"
         (LU.length <$> L.readFile "dict") >>
       printMeasure "Length - Text"
         (T.length <$> T.readFile "dict") >>
       printMeasure "Length - Lazy Text"
         (TL.length <$> TL.readFile "dict") >>
       printMeasure "Searching - ByteString"
         (S.foldl' find 0 <$> S.readFile "dict") >>
       -- Lazy ByteString; the label repeats "Searching - ByteString",
       -- which is why that label appears twice in the results.
       printMeasure "Searching - ByteString"
         (L.foldl' find 0 <$> L.readFile "dict") >>
       printMeasure "Searching - String"
         (foldl' find 0 <$> readFile "dict") >>
       printMeasure "Searching - UTF8 ByteString"
         (SU.foldl find 0 <$> S.readFile "dict") >>
       printMeasure "Searching - UTF8 Lazy ByteString"
         (LU.foldl find 0 <$> L.readFile "dict") >>
       printMeasure "Searching - Text"
         (T.foldl' find 0 <$> T.readFile "dict") >>
       printMeasure "Searching - Lazy Text"
         (TL.foldl' find 0 <$> TL.readFile "dict") >>
       printMeasure "Searching ą - String"
         (foldl' find' 0 <$> readFile "dict") >>
       printMeasure "Searching ą - UTF8 ByteString"
         (SU.foldl find' 0 <$> S.readFile "dict") >>
       printMeasure "Searching ą - UTF8 Lazy ByteString"
         (LU.foldl find' 0 <$> L.readFile "dict") >>
       printMeasure "Searching ą - Text"
         (T.foldl' find' 0 <$> T.readFile "dict") >>
       printMeasure "Searching ą - Lazy Text"
         (TL.foldl' find' 0 <$> TL.readFile "dict")

printMeasure :: String -> IO a -> IO ()
printMeasure s a = measure a >>= \v ->
  printf "%-40s %8.5f s\n" (s ++ ":") (realToFrac v :: Float)

> where (A,B) is a suitable combination of 
> - Data.ByteString[.Lazy][.Char8][.UTF8]
> - Data.Text[.IO]
> - Prelude
> 
> Times:
> Data.ByteString[.Lazy]: 0.00s
> Data.ByteString.UTF8: 0.14s
> Prelude:  0.21s
> Data.ByteString.Lazy.UTF8: 0.56s
> Data.Text:  0.66s
> 

                       Optimized:

Length - ByteString:                      0.01223 s
Length - Lazy ByteString:                 0.00328 s
Length - String:                          0.15474 s
Length - UTF8 ByteString:                 0.19945 s
Length - UTF8 Lazy ByteString:            0.30123 s
Length - Text:                            0.70438 s
Length - Lazy Text:                       0.62137 s

String seems to be the fastest of the correct ones.
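
On what "correct" means here: S.length and L.length count bytes, while
the String, Text and UTF8 variants count characters, so the two groups
only agree on pure-ASCII input. A minimal sketch of the difference using
only base (utf8Width and utf8ByteLength are illustrative names, not part
of any library used in this thread):

```haskell
import Data.Char (ord)

-- Number of bytes UTF-8 needs to encode a single Unicode code point.
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where n = ord c

-- Size in bytes of a String once encoded as UTF-8, i.e. what a
-- byte-oriented length would report for the same text.
utf8ByteLength :: String -> Int
utf8ByteLength = sum . map utf8Width
```

For example, length "a\261" is 2 ('\261' is 'ą'), while
utf8ByteLength "a\261" is 3, because 'ą' takes two bytes in UTF-8.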

Searching - ByteString:                   0.04604 s
Searching - ByteString:                   0.04424 s
Searching - String:                       0.18178 s
Searching - UTF8 ByteString:              0.32606 s
Searching - UTF8 Lazy ByteString:         0.42984 s
Searching - Text:                         0.26599 s
Searching - Lazy Text:                    0.37320 s

While ByteString is the clear winner, String is actually good compared
to the others.

Searching ą - String:                     0.18557 s
Searching ą - UTF8 ByteString:            0.32752 s
Searching ą - UTF8 Lazy ByteString:       0.43811 s
Searching ą - Text:                       0.28401 s
Searching ą - Lazy Text:                  0.37612 s

String is fastest? Hmmm.

                       Compiled:

Length - ByteString:                      0.00861 s
Length - Lazy ByteString:                 0.00409 s
Length - String:                          0.16059 s
Length - UTF8 ByteString:                 0.20165 s
Length - UTF8 Lazy ByteString:            0.31885 s
Length - Text:                            0.70891 s
Length - Lazy Text:                       0.65553 s

ByteString is again the clear winner, but String once again wins in the
'correct' section.

Searching - ByteString:                   1.27414 s
Searching - ByteString:                   1.27303 s
Searching - String:                       0.56831 s
Searching - UTF8 ByteString:              0.68742 s
Searching - UTF8 Lazy ByteString:         0.75883 s
Searching - Text:                         1.16121 s
Searching - Lazy Text:                    1.76678 s

I mean... what? I may be doing something wrong.
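
One possible factor, not measured here: find has no type signature, so
it is inferred as (Enum a, Num b) => b -> a -> b, and the accumulator
started with 0 defaults to Integer. Without optimization, every call may
then go through class dictionaries instead of specialised code. A hedged
sketch of monomorphic replacements (countSpaceByte and countSpaceChar
are hypothetical names, not from the code above):

```haskell
import Data.List (foldl')
import Data.Word (Word8)

-- Counts bytes equal to 32 (ASCII space); fixed to Int and Word8 so no
-- Enum/Num dictionaries are needed at the call site.
countSpaceByte :: Int -> Word8 -> Int
countSpaceByte n w
  | w == 32   = n + 1
  | otherwise = n

-- The same step function for Char-based containers (String, Text).
countSpaceChar :: Int -> Char -> Int
countSpaceChar n c
  | c == ' '  = n + 1
  | otherwise = n
```

These would be used as S.foldl' countSpaceByte 0 or
foldl' countSpaceChar 0 in place of the corresponding find folds.
Whether that changes the unoptimized timings would need to be measured.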

Searching ą - String:                     0.32612 s
Searching ą - UTF8 ByteString:            0.41564 s
Searching ą - UTF8 Lazy ByteString:       0.52919 s
Searching ą - Text:                       0.87463 s
Searching ą - Lazy Text:                  1.52369 s

No comment.

                       Interpreted:

Length - ByteString:                      0.00511 s
Length - Lazy ByteString:                 0.00378 s
Length - String:                          0.16657 s
Length - UTF8 ByteString:                 0.21639 s
Length - UTF8 Lazy ByteString:            0.33952 s
Length - Text:                            0.79771 s
Length - Lazy Text:                       0.65320 s

Same as with the others.

Searching - ByteString:                   9.12051 s
Searching - ByteString:                   8.94038 s
Searching - String:                       8.57391 s
Searching - UTF8 ByteString:              7.71766 s
Searching - UTF8 Lazy ByteString:         7.79422 s
Searching - Text:                         8.34435 s
Searching - Lazy Text:                    9.07538 s

Now they are pretty much equal.

Searching ą - String:                     3.17010 s
Searching ą - UTF8 ByteString:            3.94399 s
Searching ą - UTF8 Lazy ByteString:       3.92382 s
Searching ą - Text:                       3.32901 s
Searching ą - Lazy Text:                  4.18038 s

Hmm. Still the best?

Your test:
                        Optimized  Compiled  Interpreted
ByteString:             0.011      0.011     0.421
ByteString Lazy:        0.006      0.006     0.535
String:                 0.237      0.240     0.650
Text:                   0.767      0.720     1.192
Text Lazy:              0.661      0.614     1.061
ByteString UTF8:        0.204      0.204     0.631
ByteString Lazy UTF8:   0.386      0.309     0.744

System:
Core 2 Duo T9600 2.80 GHz, 2 GiB RAM
Gentoo Linux x86-64.
Linux 2.6.33 + gentoo patches + ck.
Glibc 2.11
GHC 6.12.1
base 4.2.0.0
bytestring 0.9.1.5
text 0.7.1.0
utf8-string 0.3.6

PS. Tests were repeated a few times and each gave similar results.
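
Repeating a measurement can be sketched roughly as follows. Note the
assumptions: this uses CPU time from System.CPUTime (base) rather than
the wall-clock getCurrentTime used in measure, and timeOnce and
timeRepeated are illustrative names, not what was actually run:

```haskell
import Control.Exception (evaluate)
import Control.Monad (replicateM)
import System.CPUTime (getCPUTime)

-- Time a single run of an action, in seconds of CPU time.
-- The result is forced to weak head normal form before the clock stops.
timeOnce :: IO a -> IO Double
timeOnce act = do
  start <- getCPUTime              -- picoseconds
  _ <- act >>= evaluate
  end <- getCPUTime
  return (fromIntegral (end - start) / 1e12)

-- Run the action n times and return all timings, so that the spread
-- between runs can be inspected.
timeRepeated :: Int -> IO a -> IO [Double]
timeRepeated n act = replicateM n (timeOnce act)
```

Taking minimum over the returned list gives a figure that is less
sensitive to other load on the machine than a single run.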

> 
> >  - It wasn't written what is the typical case
> 
> Aren't there several quite different typical cases?
> One fairly typical case is big ASCII or latin1 files (e.g. fasta files, 
> numerical data). For those, usually ByteString is by far the best choice.
> 

On the other hand, if you load numerical data it is likely that:
- It will have some labels, and the labels may need non-ASCII or
non-Latin characters.
- Most of the time will be spent operating on numbers rather than on
strings.

> Another fairly typical case is *text* processing, possibly with text in 
> different scripts (latin, hebrew, kanji, ...). Depending on what you want 
> to do (and the encoding), any of Prelude.String, Data.Text and 
> Data.ByteString[.Lazy].UTF8 may be a good choice, vanilla ByteStrings 
> probably aren't. String and Text also have the advantage that you aren't 
> tied to utf-8.
> 
> Choose your datatype according to your problem, not one size fits all.
> 

My measurements seem to favour String, but they are probably wrong.

Regards