[Haskell-cafe] Efficient string construction

Daniel Fischer daniel.is.fischer at web.de
Thu Jun 3 12:16:53 EDT 2010


On Thursday 03 June 2010 17:26:36, Kevin Jardine wrote:
> --- On Thu, 6/3/10, Daniel Fischer <daniel.is.fischer at web.de> wrote:
> > Perhaps Data.ByteString[.Lazy].UTF8 is an even better
> > choice than Data.Text (depends on what you do).
>
> I thought that I had the differences between the three libraries figured
> out but I guess not now from what you say.
>
> I had thought that String was a simple but memory inefficient model,
> that Text was for, well text, and that bytestrings were for binary data
> (eg. images, audio files and applications that required a true view on
> each text byte).

Well, not necessarily. 
String can be quite memory efficient. As a stupid example,

length (replicate 10000000 'a')

will need less memory than the equivalents using ByteString or Text.
Less stupidly, if the String is produced lazily and consumed from front to
back, with nothing else holding on to its head, it is memory efficient: the
cells already consumed can be garbage collected. And it's not necessarily
much slower than ByteString or Text.
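
A minimal sketch of that comparison, assuming the bytestring and text
packages are available. The function names are made up for illustration, and
how much the Text version allocates depends on the library version and on
whether fusion fires, so take the comments as expectations rather than
measurements.

```haskell
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T

-- Lazy String: cells are produced by replicate and consumed by length
-- one at a time, so the whole list need not be resident at once.
stringLength :: Int -> Int
stringLength n = length (replicate n 'a')

-- Strict ByteString: replicate allocates a buffer of n bytes before
-- length can look at it.
byteStringLength :: Int -> Int
byteStringLength n = BC.length (BC.replicate n 'a')

-- Text: the same computation via Data.Text; allocation here depends on
-- the text version and on stream fusion.
textLength :: Int -> Int
textLength n = T.length (T.replicate n (T.singleton 'a'))
```

Running each with a large n under +RTS -s is one way to compare the
maximum residency for yourself.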

In fact, String is sometimes faster than Text (cf. e.g.
http://www.haskell.org/pipermail/haskell-cafe/2010-May/078220.html and 
following).

When you have to deal with text that is ASCII or latin1 (or some other
encoding with a one-to-one byte <-> char correspondence), plain ByteStrings
are usually by far the fastest method. But that's of course a severe
restriction.
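
To make the restriction concrete, here is a small sketch using
Data.ByteString.Char8. The helper names are hypothetical. The point is that
Char8.pack keeps only the low 8 bits of each Char, so anything outside
latin1 is silently mangled.

```haskell
import qualified Data.ByteString.Char8 as BC

-- Count whitespace-separated words; fine for ASCII or latin1 input.
wordCount :: BC.ByteString -> Int
wordCount = length . BC.words

-- Does a String survive a trip through Char8 unchanged?
-- Char8.pack truncates every Char to 8 bits, so this is False as soon
-- as the String contains a code point above 255.
roundTrips :: String -> Bool
roundTrips s = BC.unpack (BC.pack s) == s
```

For example, roundTrips "caf\233" holds because U+00E9 fits in one byte,
while roundTrips "\955" (a Greek lambda) does not.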

>
> So why is there a UTF8 implementation for bytestrings? Does that not
> duplicate what Text is trying to do? If so, why the duplication?

I think Data.ByteString.UTF8 predates Data.Text.

> When is each library more appropriate?

Generally, ByteString for binary data, or for text when you know a
byte-per-character view is safe and you need the speed.
For general text, either String or Data.Text may be the better choice.
IIRC, Data.Text uses UTF-16 (or some other 16-bit encoding) internally, so
if you receive UTF-8 encoded text, Data.ByteString.UTF8 can be the better
choice, since it works on the bytes as they arrive and avoids transcoding.
I haven't much experience with either Data.Text or Data.ByteString.UTF8, so
I can't say much about their relative merits.
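
As a rough illustration of the extra step involved when UTF-8 input goes
through Data.Text, here is a sketch using decodeUtf8' from
Data.Text.Encoding. The function names and the sample bytes are mine, chosen
only for the example. Data.ByteString.UTF8 (from the utf8-string package) is
not used here; the sketch only shows the decode step that working directly
on the bytes would skip.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Number of bytes as received; no decoding needed.
byteCount :: B.ByteString -> Int
byteCount = B.length

-- Number of characters: the bytes must first be validated and decoded
-- into a Text. Invalid UTF-8 is reported as Left instead of throwing.
charCount :: B.ByteString -> Either String Int
charCount bytes = case TE.decodeUtf8' bytes of
  Left err   -> Left (show err)
  Right text -> Right (T.length text)

-- Sample input: a Greek lambda (two bytes, 0xCE 0xBB) followed by 'x'.
sampleBytes :: B.ByteString
sampleBytes = B.pack [0xCE, 0xBB, 0x78]
```

Here byteCount sampleBytes is 3 while charCount sampleBytes is Right 2, which
is the byte versus character distinction the question was about.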

>
> Kevin


