[Haskell-beginners] Re: When to use ByteString rather than [Char] ... ?

Sun Apr 11 09:31:52 EDT 2010

On Sun, 2010-04-11 at 12:07 +0100, James Fisher wrote:
> Hi,
> 
> 
> After working through a few Haskell tutorials, I've come across
> numerous recommendations to use the Data.ByteString library rather
> than standard [Char], for reasons of "performance".  I'm having
> trouble swallowing this -- presumably the standard String is default
> for good reasons.  Nothing has answered this question: in what case is
> it better to use [Char]?  
> 

In most cases, I believe, you need an actual String and the code is not
time-critical. ByteString is... well, a string of bytes, not chars - you
have no idea whether they are encoded as UTF-8, UCS-2, ASCII, ISO-8859-1
(or as JPEG ;) ). If you want the next char, you don't know how many
bytes you need to read (1? 2? 3? depends on the contents?).
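
To make the "how many bytes?" point concrete, here is a small sketch of my
own. It assumes the text and bytestring packages are installed and that the
bytes are UTF-8; utf8Widths is just a name I made up for illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Number of bytes each character occupies once encoded as UTF-8.
-- (utf8Widths is a hypothetical helper, not a library function.)
utf8Widths :: String -> [Int]
utf8Widths = map (B.length . TE.encodeUtf8 . T.singleton)

main :: IO ()
main = do
  let s = "a\233\8364"  -- 'a', U+00E9 (e with acute), U+20AC (euro sign)
  print (length s)                             -- 3 characters
  print (B.length (TE.encodeUtf8 (T.pack s)))  -- 6 bytes
  print (utf8Widths s)                         -- [1,2,3]
```

So the position of the n-th character in the bytes depends on everything
before it, and under a different encoding the numbers would differ again.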

String ([Char]) has a defined representation: each element is a Unicode
character. A read/write function might encode/decode it incorrectly
(IIRC, before GHC 6.12 System.IO did no real decoding on read and simply
treated each byte as one character), but that is the function's error.

> Could anyone point me to a good resource showing the differences
> between how [Char] and ByteString are implemented, and giving a good
> heuristic for me to decide which is better in any one case?
> 

A strict ByteString is a pointer with an offset and a length. A lazy
ByteString is a linked list of strict ByteStrings (with the additional
condition that none of the inner ByteStrings is empty).
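
A quick way to see the chunk structure, as a sketch assuming the bytestring
package (storedChunks is a name I made up; fromChunks and toChunks are the
real functions from Data.ByteString.Lazy):

```haskell
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Lazy as BL

-- Build a lazy ByteString from strict chunks, then list the chunks it
-- actually holds. (storedChunks is a hypothetical helper.)
storedChunks :: [BC.ByteString] -> [BC.ByteString]
storedChunks = BL.toChunks . BL.fromChunks

main :: IO ()
main = do
  let chunks = [BC.pack "ab", BC.empty, BC.pack "cde"]
  -- The empty chunk in the middle is dropped on construction.
  print (storedChunks chunks)
  -- The total length is the sum of the chunk lengths.
  print (BL.length (BL.fromChunks chunks))
```

Three chunks go in and two come out, because the empty one is never stored.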

In theory String is [Char], i.e. [a] with a = Char, where conceptually

data [a] = [] | a:[a]

In other words, it is a linked list of characters. For long strings that
may be inefficient (poor cache locality, O(n) random access, and the
need to check for errors while evaluating the rest of the list[1]).
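
As a sketch of what the O(n) access means (nth is my own illustrative
function; BC.index is the real one from Data.ByteString.Char8, assuming the
bytestring package):

```haskell
import qualified Data.ByteString.Char8 as BC

-- A String is made of (:) cells, so indexing has to walk them one by one.
-- (nth is a hypothetical helper; it behaves like a total version of (!!).)
nth :: Int -> String -> Maybe Char
nth _ []     = Nothing
nth 0 (c:_)  = Just c
nth n (_:cs) = nth (n - 1) cs

main :: IO ()
main = do
  print (nth 2 "hello")                 -- Just 'l', after stepping over 2 cells
  print (BC.index (BC.pack "hello") 2)  -- 'l', a direct offset into the buffer
```

The same pattern matching that makes nth easy to write is what forces the
walk; the ByteString version just adds an offset to a pointer.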

I heard somewhere that actual implementations optimize it into arrays
when possible (i.e. when it can be detected and does not mess with the
non-strict semantics). However, I don't know whether that is true.

I *guess* that in most cases the I/O overhead will be large enough to
make the difference insignificant. However:

- If you need the exact byte representation - for example for
compression, digital signatures etc. - you need ByteString.
- If you need to operate on text rather than bytes, use String or
specialized data structures such as Data.Text & co.
- If you don't care about performance and need ease of use (pattern
matching etc.), use String.
- If you have no special requirements, then you can use ByteString.

While some languages (for example C, Python, Ruby) mix the text and
its representation, I guess that is not always the best way. With the
two separated, String is text while ByteString is a binary
representation of something (it can be text, a picture, compressed data
etc.).
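
A small sketch of that separation, assuming the text and bytestring packages
(asUtf8 and asUtf16LE are names I made up; the encode/decode functions are
from Data.Text.Encoding):

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8)

-- The same text rendered as raw bytes under two different encodings.
-- (asUtf8 and asUtf16LE are hypothetical helpers.)
asUtf8, asUtf16LE :: String -> [Word8]
asUtf8    = B.unpack . TE.encodeUtf8    . T.pack
asUtf16LE = B.unpack . TE.encodeUtf16LE . T.pack

main :: IO ()
main = do
  let s = "h\233"        -- 'h' followed by U+00E9 (e with acute)
  print (asUtf8 s)       -- [104,195,169]
  print (asUtf16LE s)    -- [104,0,233,0]
  -- Decoding with the matching codec gives the text back.
  let roundTrip = T.unpack (TE.decodeUtf8 (TE.encodeUtf8 (T.pack s)))
  putStrLn (if roundTrip == s then "round trip ok" else "mismatch")
```

One text, two different byte strings; which bytes you get is a property of
the encoding you picked, not of the text.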

> 
> Best,
> 
> 
> James Fisher

Regards

[1] However, decoding the string still introduces the O(n) access time
and the error checking. So if you need UTF-8 text out of a ByteString,
you will still get the O(n) access time ;)
