[Haskell-beginners] Re: Re: Re: When to use ByteString rather than [Char] ... ?

Daniel Fischer daniel.is.fischer at web.de
Sun Apr 11 20:56:40 EDT 2010


On Monday, 12 April 2010 at 01:01:36, Maciej Piechotka wrote:
> On Sun, 2010-04-11 at 22:07 +0200, Daniel Fischer wrote:
> > On Sunday, 11 April 2010 at 18:04:14, Maciej Piechotka wrote:
> > > Of course:
> > >  - I haven't done any tests. I guessed (which I wrote)
> >
> > I just have done a test.
> > Input file: "big.txt" from Norvig's spelling checker (6488666 bytes,
> > no characters outside latin1 range) and the same with
> > ('\n':map toEnum [256 .. 10000] ++ "\n") appended.
>
> Converted myspell Polish dictionary (a few % of non-ASCII chars) added
> twice (6531616 bytes).

>
>                        Optimized:
>
> Length - ByteString:                      0.01223 s
> Length - Lazy ByteString:                 0.00328 s
> Length - String:                          0.15474 s
> Length - UTF8 ByteString:                 0.19945 s
> Length - UTF8 Lazy ByteString:            0.30123 s
> Length - Text:                            0.70438 s
> Length - Lazy Text:                       0.62137 s
>
> String seems to be the fastest correct option.

For me, strict UTF8 ByteString was faster than String, but as in your case, 
both were in the same league.
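
For reference, a sketch of the kind of length measurement being compared 
here (this is not the benchmark code from the thread, and the file name is 
only a placeholder):

    import qualified Data.ByteString as B
    import qualified Data.ByteString.UTF8 as UTF8   -- utf8-string package
    import qualified Data.Text as T
    import qualified Data.Text.IO as TIO

    main :: IO ()
    main = do
        bs <- B.readFile "big.txt"
        print (B.length bs)      -- byte count, O(1) for a strict ByteString
        print (UTF8.length bs)   -- character count, decodes UTF-8 as it goes
        s <- readFile "big.txt"
        print (length s)         -- String: length walks a lazy list of Char
        t <- TIO.readFile "big.txt"
        print (T.length t)       -- Text: length walks the decoded characters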

>
> Searching - ByteString:                   0.04604 s
> Searching - Lazy ByteString:              0.04424 s
> Searching - String:                       0.18178 s
> Searching - UTF8 ByteString:              0.32606 s
> Searching - UTF8 Lazy ByteString:         0.42984 s
> Searching - Text:                         0.26599 s
> Searching - Lazy Text:                    0.37320 s
>
> While ByteString is the clear winner, String is actually good compared to
> the others.

The ByteStrings should be faster if you use count instead of foldl' find 0.
Anyway, where applicable, ByteStrings are much faster for certain tasks - I 
suspect they wouldn't do so well if one had a (map function . filter 
predicate) on the input.
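
For instance, counting newlines with the library's count versus a 
hand-rolled foldl' (a sketch, since the benchmark code itself isn't shown 
in the thread):

    import qualified Data.ByteString.Char8 as BC

    -- count is a specialised loop over the underlying buffer
    countFast :: BC.ByteString -> Int
    countFast = BC.count '\n'

    -- the explicit fold goes through the general folding machinery
    countSlow :: BC.ByteString -> Int
    countSlow = BC.foldl' (\n c -> if c == '\n' then n + 1 else n) 0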

I'm surprised here that
a) searching takes so much longer than calculating the length for BS.UTF8
b) searching is *much faster* than calculating the length for Text
c) both UTF8 BS and Text are slower than String
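
The benchmark source isn't quoted in the thread, but "searching" for a 
character presumably looks something like this (function names are mine):

    import qualified Data.ByteString as B
    import qualified Data.ByteString.Char8 as BC
    import qualified Data.ByteString.UTF8 as UTF8   -- utf8-string package
    import qualified Data.Text as T

    -- ASCII/latin1 bytes: elem compares raw bytes
    hasCharBS :: Char -> B.ByteString -> Bool
    hasCharBS = BC.elem

    -- UTF-8-encoded ByteString: no elem, so fold over decoded characters
    hasCharUTF8 :: Char -> B.ByteString -> Bool
    hasCharUTF8 c = UTF8.foldr (\x found -> x == c || found) False

    -- Text: any decodes as it goes
    hasCharText :: Char -> T.Text -> Bool
    hasCharText c = T.any (== c)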

>
> Searching ą - String:                     0.18557 s
> Searching ą - UTF8 ByteString:            0.32752 s
> Searching ą - UTF8 Lazy ByteString:       0.43811 s
> Searching ą - Text:                       0.28401 s
> Searching ą - Lazy Text:                  0.37612 s
>
> String is fastest? Hmmm.

Not much difference from the previous results.

>
>                        Compiled:

Compiled means no optimisations? That's a *bad* idea for ByteStrings and 
Text. You only get the fusion magic with optimisations turned on; without 
them, well...
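
In other words, time a binary built with optimisations, e.g. (module name 
made up):

    ghc -O2 --make Bench.hs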

>
> Length - ByteString:                      0.00861 s
> Length - Lazy ByteString:                 0.00409 s
> Length - String:                          0.16059 s
> Length - UTF8 ByteString:                 0.20165 s
> Length - UTF8 Lazy ByteString:            0.31885 s
> Length - Text:                            0.70891 s
> Length - Lazy Text:                       0.65553 s
>
> ByteString is also the clear winner, but String once again wins in the
> 'correct' section.
>
> Searching - ByteString:                   1.27414 s
> Searching - Lazy ByteString:              1.27303 s
> Searching - String:                       0.56831 s
> Searching - UTF8 ByteString:              0.68742 s
> Searching - UTF8 Lazy ByteString:         0.75883 s
> Searching - Text:                         1.16121 s
> Searching - Lazy Text:                    1.76678 s
>
> I mean... what? I may be doing something wrong

Yes, using ByteString and Text without optimisations :)


>
> PS. Tests were repeated a few times and each gave similar results.
>
> > >  - It wasn't stated what the typical case is
> >
> > Aren't there several quite different typical cases?
> > One fairly typical case is big ASCII or latin1 files (e.g. fasta
> > files, numerical data). For those, usually ByteString is by far the
> > best choice.
>
> On the other hand - if you load numerical data, it is likely that:
> - It will have some labels, and the labels may need non-ASCII or
> non-Latin characters

Possible. But label-free formats of n columns of numbers are fairly common 
(and easier to handle).
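
A sketch of that case, reading whitespace-separated integer columns with 
the Char8 interface (assuming plain ASCII input; the file name and use of 
Int columns are invented for the example):

    import qualified Data.ByteString.Char8 as BC
    import Data.Maybe (mapMaybe)

    -- split into lines and whitespace-separated fields,
    -- keeping only the fields that parse as Ints
    readColumns :: BC.ByteString -> [[Int]]
    readColumns = map (mapMaybe (fmap fst . BC.readInt) . BC.words) . BC.lines

    main :: IO ()
    main = do
        contents <- BC.readFile "data.txt"
        print (map sum (readColumns contents))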

> - Most of the time will be spent operating on the numbers rather than
> the strings.
>

In that case, it is of course less important which type you use for IO.

> > Another fairly typical case is *text* processing, possibly with text
> > in different scripts (latin, hebrew, kanji, ...). Depending on what
> > you want to do (and the encoding), any of Prelude.String, Data.Text
> > and Data.ByteString[.Lazy].UTF8 may be a good choice, vanilla
> > ByteStrings probably aren't. String and Text also have the advantage
> > that you aren't tied to utf-8.
> >
> > Choose your datatype according to your problem, not one size fits all.
>
> My measurements seem to prefer String, but they are probably wrong.

Yes, measurements always are wrong ;)
More seriously, you measured a couple of tasks on one system. For different 
tasks and other systems, different results should be expected.

The best choice depends on the task. For SPOJ problems, ByteStrings are 
what you'll want. For text processing, probably not.
If you want to find a pattern in a long ASCII string, however, they likely 
are the right choice.
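
For example, with the substring search that ships with a reasonably recent 
bytestring (the pattern and file name are placeholders):

    import qualified Data.ByteString.Char8 as BC

    main :: IO ()
    main = do
        contents <- BC.readFile "big.txt"
        -- isInfixOf does a plain substring search over the bytes
        print (BC.pack "needle" `BC.isInfixOf` contents)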

>
> Regards

Cheers

