[Haskell-beginners] Re: When to use ByteString rather than [Char]
... ?
Daniel Fischer
daniel.is.fischer at web.de
Sun Apr 11 11:17:00 EDT 2010
Am Sonntag 11 April 2010 15:31:52 schrieb Maciej Piechotka:
> On Sun, 2010-04-11 at 12:07 +0100, James Fisher wrote:
> > Hi,
> >
> >
> > After working through a few Haskell tutorials, I've come across
> > numerous recommendations to use the Data.ByteString library rather
> > than standard [Char], for reasons of "performance". I'm having
> > trouble swallowing this -- presumably the standard String is default
> > for good reasons.
But performance is none of those reasons.
The choice to make String a synonym for [Char] instead of a specialised
datatype allows you to easily manipulate strings with the plethora of list-
processing functions from the standard libraries.
However, that means some things can't be fast and there's a significant
space overhead for string handling.
> > Nothing has answered this question: in what case is
> > it better to use [Char]?
>
> In most cases you need an actuall String and it is not time-critical I
> believe. ByteString is... well string of bytes not char - you have no
> idea whether they are encoded as utf-8, ucs-2, ascii, iso-8859-1 (or as
> jpeg ;) ). If you want the next char you don't know how many bytes you
> need to read (1? 2? 3? depends on contents?).
And
- sorting short strings
- for using map or filter, [Char] is often superior
>
> String ([Char]) have defined representation - while read/write function
> might incorrect encode/decode it (up to GHC 6.12 System.IO had assumes
> ascii encoding IIRC on read) it is their error.
>
> > Could anyone point me to a good resource showing the differences
> > between how [Char] and ByteString are implemented, and giving good a
> > heuristic for me to decide which is better in any one case?
>
> ByteString is pointer with offset and length. Lazy ByteString is a
> linked list of ByteStrings (with additional condition that none of inner
> ByteStrings are empty).
And it's a head-strict list, not the usual lazy Haskell list.
>
> In theory String is [Char] i.e. [a] i.e.
>
> data [a] = [] | a:[a]
>
> In other words it is linked list of characters. That, for long strings,
> may be inefficient (because of cache, O(n) on random access and
> necessity of checking for errors while evaluating further[1]).
>
> I heard somewhere that actual implementations optimizes it to arrays
> when it is possible (i.e. can be detected and does not messes with
> non-strict semantics). However I don't know if it is true.
I've never heard of that before, so I'm skeptical.
>
> I *guess* that in most cases the overhead on I/O will be sufficiently
> great to make the difference insignificant. However:
? which difference?
Try reading large files. Count the lines or something else, as long as it's
simple. The speed difference between ByteString-IO and [Char]-IO is
enormous.
When you do something more complicated the difference in IO-speed may
become insignificant.
On the other hand, when you're appending a lot of short lines to a file one
by one, there's a good chance that [Char]-IO is actually faster.
>
> - If you need exact byte representation - for example for compression,
> digital signatures etc. you need ByteString
> - If you need to operate on text rather then bytes use String or
> specialized data structures as Data.Text & co.
> - If you don't care about performance and need easy of use (pattern
> matching etc.) use String.
> - If you have no special requirements than you can ByteString
>
> While some languages (for example C, Python, Ruby) mixes the text and
> it's representation I guess it is not always the best way. String in
> such separation is an text while ByteString is a binary representation
> of something (can be text, picture, compresses data etc.).
>
> > Best,
> >
> >
> > James Fisher
>
> Regards
>
> [1] However the O(n) access time and checking of errors are still
> introduced by decoding string. So if you need UTF-8 you will still get
> the O(n) access time ;)
It might then be a good idea to use a UArray Int Char if you need repeated
random access.
More information about the Beginners
mailing list