[Haskell-beginners] Re: When to use ByteString rather than [Char] ... ?

Sun Apr 11 11:17:00 EDT 2010

Am Sonntag 11 April 2010 15:31:52 schrieb Maciej Piechotka:
> On Sun, 2010-04-11 at 12:07 +0100, James Fisher wrote:
> > Hi,
> >
> >
> > After working through a few Haskell tutorials, I've come across
> > numerous recommendations to use the Data.ByteString library rather
> > than standard [Char], for reasons of "performance".  I'm having
> > trouble swallowing this -- presumably the standard String is default
> > for good reasons.

But performance is none of those reasons.
The choice to make String a synonym for [Char] instead of a specialised 
datatype allows you to easily manipulate strings with the plethora of list-
processing functions from the standard libraries.
However, that means some things can't be fast and there's a significant 
space overhead for string handling.

> > Nothing has answered this question: in what case is
> > it better to use [Char]?
>
> In most cases you need an actuall String and it is not time-critical I
> believe. ByteString is... well string of bytes not char - you have no
> idea whether they are encoded as utf-8, ucs-2, ascii, iso-8859-1 (or as
> jpeg ;) ). If you want the next char you don't know how many bytes you
> need to read (1? 2? 3? depends on contents?).

And
- sorting short strings
- for using map or filter, [Char] is often superior

>
> String ([Char]) have defined representation - while read/write function
> might incorrect encode/decode it (up to GHC 6.12 System.IO had assumes
> ascii encoding IIRC on read) it is their error.
>
> > Could anyone point me to a good resource showing the differences
> > between how [Char] and ByteString are implemented, and giving good a
> > heuristic for me to decide which is better in any one case?
>
> ByteString is pointer with offset and length. Lazy ByteString is a
> linked list of ByteStrings (with additional condition that none of inner
> ByteStrings are empty).

And it's a head-strict list, not the usual lazy Haskell list.

>
> In theory String is [Char] i.e. [a] i.e.
>
> data [a] = [] | a:[a]
>
> In other words it is linked list of characters. That, for long strings,
> may be inefficient (because of cache, O(n) on random access and
> necessity of checking for errors while evaluating further[1]).
>
> I heard somewhere that actual implementations optimizes it to arrays
> when it is possible (i.e. can be detected and does not messes with
> non-strict semantics). However I don't know if it is true.

I've never heard of that before, so I'm skeptical.

>
> I *guess* that in most cases the overhead on I/O will be sufficiently
> great to make the difference insignificant. However:

? which difference?

Try reading large files. Count the lines or something else, as long as it's 
simple. The speed difference between ByteString-IO and [Char]-IO is 
enormous.
When you do something more complicated the difference in IO-speed may 
become insignificant.
On the other hand, when you're appending a lot of short lines to a file one 
by one, there's a good chance that [Char]-IO is actually faster.

>
> - If you need exact byte representation - for example for compression,
> digital signatures etc. you need ByteString
> - If you need to operate on text rather then bytes use String or
> specialized data structures as Data.Text & co.
> - If you don't care about performance and need easy of use (pattern
> matching etc.) use String.
> - If you have no special requirements than you can ByteString
>
> While some languages (for example C, Python, Ruby) mixes the text and
> it's representation I guess it is not always the best way. String in
> such separation is an text while ByteString is a binary representation
> of something (can be text, picture, compresses data etc.).
>
> > Best,
> >
> >
> > James Fisher
>
> Regards
>
> [1] However the O(n) access time and checking of errors are still
> introduced by decoding string. So if you need UTF-8 you will still get
> the O(n) access time ;)

It might then be a good idea to use a UArray Int Char if you need repeated 
random access.