[Haskell-beginners] Re: Re: When to use ByteString rather than [Char] ... ?

Sun Apr 11 16:07:53 EDT 2010

On Sunday 11 April 2010 18:04:14, Maciej Piechotka wrote:
> On Sun, 2010-04-11 at 17:17 +0200, Daniel Fischer wrote:
> > > I *guess* that in most cases the overhead on I/O will be sufficiently
> > > great to make the difference insignificant. However:
> >
> > ? which difference?

I meant: the difference between ByteString-IO and [Char]-IO, or some other 
difference?

> >
> > Try reading large files.
>
> Well - while large files are not unimportant, IIRC most files are small
> (< 4 KiB) - at least on *nix file systems (at least that's the core
> 'idea' of reiserfs/reiser4 filesystems).

Well, sometimes one has to process large files even though most are small.
If the processing itself is simple, IO speed matters then.

>
> I guess that for large strings something like text (I think I mentioned
> it) is better
>

Unless you know you only have to deal with one-byte characters; in that 
case plain ByteStrings are the simplest and fastest method.

But those are special cases; in general I agree.

> > Count the lines or something else, as long as it's
> > simple. The speed difference between ByteString-IO and [Char]-IO is
> > enormous.
> > When you do something more complicated the difference in IO-speed may
> > become insignificant.
>
> Hmm. As newline is a single-byte character in most encodings it is
> believable.

You can measure it yourself :)
cat-ing together a few copies of /usr/share/dict/words should give a large 
enough file.

> However, what is the difference in counting chars (not bytes
> - chars)? I wouldn't be surprised if the difference was smaller.

Nor would I. In fact I'd be surprised if it wasn't smaller. [see below]

This example was meant to illustrate the difference in IO-speed, so 
extremely simple processing was appropriate. The combination of doing IO 
and processing is something different. If you're doing complicated things, 
IO time has a good chance of becoming negligible.

>
> Of course:
>  - I haven't done any tests. I guessed (which I wrote)

I have just done a test.
Input file: "big.txt" from Norvig's spelling checker (6488666 bytes, no 
characters outside latin1 range) and the same with
('\n':map toEnum [256 .. 10000] ++ "\n") appended.

Code:

main = A.readFile "big.txt" >>= print . B.length

where (A,B) is a suitable combination of 
- Data.ByteString[.Lazy][.Char8][.UTF8]
- Data.Text[.IO]
- Prelude
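
For concreteness, three instantiations of that one-liner might look roughly 
like this. The function names are invented for illustration, and this is 
only a sketch, not necessarily the exact programs that were timed:

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL

-- (A, B) = (Data.ByteString, Data.ByteString):
-- length in bytes; O(1) once the file is in memory.
lengthStrictBS :: FilePath -> IO Int
lengthStrictBS path = BS.length <$> BS.readFile path

-- (A, B) = (Data.ByteString.Lazy, Data.ByteString.Lazy):
-- length in bytes; sums the chunk lengths (Int64, converted to Int here).
lengthLazyBS :: FilePath -> IO Int
lengthLazyBS path = fromIntegral . BL.length <$> BL.readFile path

-- (A, B) = (Prelude, Prelude):
-- length in Chars, decoded using the handle's default (locale) encoding.
lengthString :: FilePath -> IO Int
lengthString path = length <$> readFile path
```

Each can be run as e.g. main = lengthStrictBS "big.txt" >>= print.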

Times:
Data.ByteString[.Lazy]:     0.00s
Data.ByteString.UTF8:       0.14s
Prelude:                    0.21s
Data.ByteString.Lazy.UTF8:  0.56s
Data.Text:                  0.66s

Of course the Data.ByteString variants counted bytes, not characters, so 
for the modified file they printed larger numbers than the others (well, 
it's BYTEString, isn't it?).

It's a little unfair, though, as the ByteString[.Lazy] variants don't need 
to look at each individual byte to get the length, so I also let them and 
Prelude.String count newlines to see how fast they can inspect each 
character/byte:

BS[.Lazy]: 0.02s
Prelude:   0.23s

So both take about 0.02s more than for the length alone, i.e. about 0.02s 
to inspect all the items.
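
A newline count along those lines could be written as follows. This is a 
minimal sketch; the function names are made up here, and the code actually 
timed may have differed:

```haskell
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Lazy.Char8 as BLC

-- Strict ByteString: read the whole file, then count the '\n' bytes.
newlinesStrict :: FilePath -> IO Int
newlinesStrict path = BC.count '\n' <$> BC.readFile path

-- Lazy ByteString: the file is read chunk by chunk while counting.
-- BLC.count returns an Int64, converted to Int here.
newlinesLazy :: FilePath -> IO Int
newlinesLazy path = fromIntegral . BLC.count '\n' <$> BLC.readFile path

-- Prelude String: every byte is decoded to a Char before it is inspected.
newlinesString :: FilePath -> IO Int
newlinesString path = length . filter (== '\n') <$> readFile path
```

Unlike the length, this forces the ByteString versions to touch every byte, 
so it compares the inspection cost on a more equal footing.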

To summarise:
* ByteString-IO is blazingly fast, since all it has to do is get a sequence 
of bytes from disk into memory.
* [Char]-IO is much slower because it has to transform the sequence of 
bytes to individual characters as they come.
* counting utf-8 encoded characters in a ByteString is - unsurprisingly - 
slow. I'm a bit surprised *how* slow it is for lazy ByteStrings.
(Caveat: I've no idea whether Data.ByteString.UTF8 would suffer from more 
multi-byte characters to the point where String becomes faster. My guess is 
no, not for single traversal. For multiple traversal, String has to 
identify each individual character only once, while BS.UTF8 must do it each 
time, so then String may be faster.)
* Data.Text isn't very fast for that one.
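
To illustrate why counting characters in utf-8 data costs more than taking 
the byte length, here is one simple way to count code points in a strict 
ByteString. It is only a sketch: it assumes valid utf-8, the names are 
invented, and it is not claimed to be how Data.ByteString.UTF8 does it.

```haskell
import qualified Data.ByteString as BS
import Data.Bits ((.&.))
import Data.Word (Word8)

-- A byte of the form 10xxxxxx continues a multi-byte utf-8 sequence.
isContinuation :: Word8 -> Bool
isContinuation b = b .&. 0xC0 == 0x80

-- Number of code points = number of bytes that are NOT continuation bytes.
-- Assumes valid utf-8; no validation is performed.
utf8Length :: BS.ByteString -> Int
utf8Length = BS.foldl' (\n b -> if isContinuation b then n else n + 1) 0
```

BS.length is O(1), whereas utf8Length has to look at every byte, and it has 
to do so again on every traversal.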

>  - It wasn't written what is the typical case

Aren't there several quite different typical cases?
One fairly typical case is big ASCII or latin1 files (e.g. fasta files, 
numerical data). For those, usually ByteString is by far the best choice.

Another fairly typical case is *text* processing, possibly with text in 
different scripts (latin, hebrew, kanji, ...). Depending on what you want 
to do (and the encoding), any of Prelude.String, Data.Text and 
Data.ByteString[.Lazy].UTF8 may be a good choice, vanilla ByteStrings 
probably aren't. String and Text also have the advantage that you aren't 
tied to utf-8.
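
As a small illustration of not being tied to one encoding, the same bytes 
can be decoded to Text in different ways. This sketch assumes the text 
package's Data.Text.Encoding module (decodeUtf8, decodeLatin1); the function 
names charsAsUtf8 and charsAsLatin1 are made up for the example.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Interpret the bytes as utf-8 and count characters.
-- Note: decodeUtf8 throws an exception on invalid input.
charsAsUtf8 :: BS.ByteString -> Int
charsAsUtf8 = T.length . TE.decodeUtf8

-- Interpret the same bytes as latin1: every byte is one character.
charsAsLatin1 :: BS.ByteString -> Int
charsAsLatin1 = T.length . TE.decodeLatin1
```

For the two bytes 0xC3 0xA9, the utf-8 reading gives one character and the 
latin1 reading gives two, so the encoding has to be known either way.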

Choose your datatype according to your problem; no one size fits all.

>  - What is 'significant' difference

Depends, of course. For a task performed once, who cares whether it takes 
one second or three? One hour or three, however, is a significant 
difference (assuming approximately equal times to write the code).
Sometimes 10% difference in performance is important, sometimes a factor of 
10 isn't.

The point is that you should be aware of the performance differences when 
making your choice.

>
> Regards

Cheers,
Daniel