[Haskell-beginners] Re: Re: When to use ByteString rather than [Char] ... ?
Daniel Fischer
daniel.is.fischer at web.de
Sun Apr 11 16:07:53 EDT 2010
On Sunday 11 April 2010 18:04:14, Maciej Piechotka wrote:
> On Sun, 2010-04-11 at 17:17 +0200, Daniel Fischer wrote:
> > > I *guess* that in most cases the overhead on I/O will be sufficiently
> > > great to make the difference insignificant. However:
> >
> > ? which difference?
I meant: the difference between ByteString-IO and [Char]-IO, or which
difference?
> >
> > Try reading large files.
>
> Well - while large files are not not-important IIRC most files are small
> (< 4 KiB) - at least on *nix file systems (at least that's the core
> 'idea' of reiserfs/reiser4 filesystems).
Well, sometimes one has to process large files even though most are small.
If the processing itself is simple, IO speed is what matters then.
>
> I guess that for large strings something like text (I think I mentioned
> it) is better
>
Unless you know you only have to deal with one-byte characters, in which
case plain ByteStrings are the simplest and fastest method.
But those are special cases; in general I agree.
> > Count the lines or something else, as long as it's
> > simple. The speed difference between ByteString-IO and [Char]-IO is
> > enormous.
> > When you do something more complicated the difference in IO-speed may
> > become insignificant.
>
> Hmm. As newline is a single-byte character in most encodings it is
> believable.
You can measure it yourself :)
cat-ing together a few copies of /usr/share/dict/words should give a large
enough file.
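For anyone who wants to try it, here is a minimal sketch of such a measurement. It is not code from this thread: the `timed` helper and the command-line handling are names made up for illustration, and `getCPUTime` reports CPU time rather than wall-clock time, so treat the numbers as rough.

```haskell
import Control.Exception (evaluate)
import qualified Data.ByteString as B
import System.CPUTime (getCPUTime)
import System.Environment (getArgs)

-- | Hypothetical helper: run an action, force its result to weak head
-- normal form, and return it with the CPU time used, in seconds.
timed :: IO a -> IO (a, Double)
timed action = do
  start  <- getCPUTime                 -- picoseconds
  result <- action >>= evaluate
  end    <- getCPUTime
  return (result, fromIntegral (end - start) / 1e12)

main :: IO ()
main = do
  args <- getArgs
  case args of
    [path] -> do
      -- 'length' forces the whole lazily read String; decoding uses the
      -- current locale's encoding and may fail on bytes it cannot decode.
      (n, tString) <- timed (length <$> readFile path)
      (m, tBytes)  <- timed (B.length <$> B.readFile path)
      putStrLn ("[Char]:     " ++ show n ++ " chars, " ++ show tString ++ "s")
      putStrLn ("ByteString: " ++ show m ++ " bytes, " ++ show tBytes ++ "s")
    _ -> putStrLn "usage: timing FILE"
```

Compile with optimisation (`ghc -O2`) before comparing, and run it a few times so the file is in the disk cache for both variants.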
> However what is the difference in counting chars (not bytes
> - chars)? I wouldn't be surprised if the difference was smaller.
Nor would I. In fact I'd be surprised if it wasn't smaller. [see below]
This example was meant to illustrate the difference in IO-speed, so
extremely simple processing was appropriate. The combination of doing IO
and processing is something different. If you're doing complicated things,
IO time has a good chance of becoming negligible.
>
> Of course:
> - I haven't done any tests. I guessed (which I wrote)
I have just done a test.
Input file: "big.txt" from Norvig's spelling checker (6488666 bytes, no
characters outside latin1 range) and the same with
('\n':map toEnum [256 .. 10000] ++ "\n") appended.
Code:
    main = A.readFile "big.txt" >>= print . B.length
where (A,B) is a suitable combination of
- Data.ByteString[.Lazy][.Char8][.UTF8]
- Data.Text[.IO]
- Prelude
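To make the template concrete, here is a sketch of four of the combinations in one program. The function name `lengthVia` and the variant strings are invented for this illustration, and the UTF8 variants are left out because they need the separate utf8-string package.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import System.Environment (getArgs)

-- | Read the file with the chosen library and return the length it reports.
-- "strict" and "lazy" count bytes; "string" and "text" count characters
-- after decoding with the current locale's encoding.
lengthVia :: String -> FilePath -> IO Int
lengthVia "strict" path = BS.length <$> BS.readFile path
lengthVia "lazy"   path = fromIntegral . BL.length <$> BL.readFile path
lengthVia "string" path = length <$> readFile path
lengthVia "text"   path = T.length <$> TIO.readFile path
lengthVia other    _    = ioError (userError ("unknown variant: " ++ other))

main :: IO ()
main = do
  args <- getArgs
  case args of
    [variant, path] -> lengthVia variant path >>= print
    _ -> putStrLn "usage: filelen (strict|lazy|string|text) FILE"
```

Running each variant under `time` on the same file, e.g. `time ./filelen strict big.txt`, gives a comparison of the same kind.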
Times:
  Data.ByteString[.Lazy]:     0.00s
  Data.ByteString.UTF8:       0.14s
  Prelude:                    0.21s
  Data.ByteString.Lazy.UTF8:  0.56s
  Data.Text:                  0.66s
Of course Data.ByteString didn't count characters but bytes, so for the
modified file, those printed larger numbers than the others (well, it's
BYTEString, isn't it?).
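The bytes-versus-characters distinction can be seen on a tiny example without any file. This is a sketch using the text package's UTF-8 encoder; the helper names `utf8ByteCount` and `charCount` are made up here.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Number of bytes in the UTF-8 encoding of a text.
utf8ByteCount :: T.Text -> Int
utf8ByteCount = B.length . TE.encodeUtf8

-- | Number of characters (Unicode code points) in a text.
charCount :: T.Text -> Int
charCount = T.length

main :: IO ()
main = do
  let ascii = T.pack "plain"
      mixed = T.pack "na\239ve"  -- '\239' is U+00EF, two bytes in UTF-8
  print (charCount ascii, utf8ByteCount ascii)  -- prints (5,5)
  print (charCount mixed, utf8ByteCount mixed)  -- prints (5,6)
```

For pure ASCII input the two counts agree, which is why the unmodified file gives the same number everywhere.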
It's a little unfair, though, as the ByteString[.Lazy] variants don't need
to look at each individual byte to report the length, so I also let them and
Prelude.String count newlines to see how fast they can inspect each
character/byte:
  BS[.Lazy]: 0.02s
  Prelude:   0.23s
Compared with the plain length runs (0.00s and 0.21s), both need about 0.02s
to inspect every item.
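A newline count of this kind could look like the sketch below. The exact code behind the timings is not shown in this mail, so the function names and the choice of `count` versus `filter` are assumptions.

```haskell
import qualified Data.ByteString.Char8 as BC
import System.Environment (getArgs)

-- | Count newline bytes in a strict ByteString.
newlinesBS :: BC.ByteString -> Int
newlinesBS = BC.count '\n'

-- | Count newline characters in a String.
newlinesString :: String -> Int
newlinesString = length . filter (== '\n')

main :: IO ()
main = do
  args <- getArgs
  case args of
    [path] -> do
      bytes <- BC.readFile path
      print (newlinesBS bytes)
      -- Prelude.readFile decodes with the current locale's encoding.
      chars <- readFile path
      print (newlinesString chars)
    _ -> putStrLn "usage: newlines FILE"
```

Both functions have to touch every byte or character, unlike `length` on a ByteString, so they are a fairer comparison of traversal speed.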
To summarise:
* ByteString-IO is blazingly fast, since all it has to do is get a sequence
of bytes from disk into memory.
* [Char]-IO is much slower because it has to transform the sequence of
bytes to individual characters as they come.
* counting utf-8 encoded characters in a ByteString is - unsurprisingly -
slow. I'm a bit surprised *how* slow it is for lazy ByteStrings.
(Caveat: I've no idea whether Data.ByteString.UTF8 would suffer from more
multi-byte characters to the point where String becomes faster. My guess is
no, not for single traversal. For multiple traversal, String has to
identify each individual character only once, while BS.UTF8 must do it each
time, so then String may be faster.)
* Data.Text isn't very fast for that one.
> - It wasn't written what is the typical case
Aren't there several quite different typical cases?
One fairly typical case is big ASCII or latin1 files (e.g. fasta files,
numerical data). For those, usually ByteString is by far the best choice.
Another fairly typical case is *text* processing, possibly with text in
different scripts (latin, hebrew, kanji, ...). Depending on what you want
to do (and the encoding), any of Prelude.String, Data.Text and
Data.ByteString[.Lazy].UTF8 may be a good choice; vanilla ByteStrings
probably aren't. String and Text also have the advantage that you aren't
tied to utf-8.
Choose your datatype according to your problem; no one size fits all.
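As an illustration of not being tied to utf-8, the sketch below decodes raw bytes into Text with two different decoders from Data.Text.Encoding. The fallback rule in `decodeGuess` is a made-up heuristic for this example, not a recommendation; real code should know its input encoding.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Hypothetical helper: decode as UTF-8 when the bytes are valid UTF-8,
-- otherwise treat every byte as one Latin-1 character.
decodeGuess :: B.ByteString -> T.Text
decodeGuess bytes = case TE.decodeUtf8' bytes of
  Right text -> text
  Left _     -> TE.decodeLatin1 bytes

main :: IO ()
main = do
  -- "caf" followed by e-acute (U+00E9), once in each encoding
  let latin1Bytes = B.pack [0x63, 0x61, 0x66, 0xE9]
      utf8Bytes   = B.pack [0x63, 0x61, 0x66, 0xC3, 0xA9]
  print (T.length (decodeGuess latin1Bytes))                -- prints 4
  print (decodeGuess latin1Bytes == decodeGuess utf8Bytes)  -- prints True
```

Once decoded, both inputs are the same four-character Text, whatever encoding they arrived in.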
> - What is 'significant' difference
That depends, of course. For a task performed once, who cares whether it takes
one second or three? One hour or three, however, is a significant
difference (assuming approximately equal times to write the code).
Sometimes 10% difference in performance is important, sometimes a factor of
10 isn't.
The point is that you should be aware of the performance differences when
making your choice.
>
> Regards
Cheers,
Daniel