[Haskell-cafe] Re: String vs ByteString

Tom Harper rtomharper at gmail.com
Tue Aug 17 06:09:09 EDT 2010


> I agree, Data.Text is great.  Unfortunately, its internal use of UTF-16
> makes it inefficient for many purposes.

In the first iteration of the Text package, UTF-16 was chosen because
it offered a good balance of arithmetic overhead and space.  The
arithmetic for UTF-8 started to have a serious performance impact in
cases where the entire document was outside ASCII (e.g. a Russian
or Arabic document), while UTF-16 remained relatively compact compared
to both the UTF-32 and String alternatives.  This, however, obviously
does not represent your use case.  I don't know whether your use case
is the more common one (though it seems likely).
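
To make "arithmetic overhead" concrete, here is a rough sketch of
decoding one code point from each encoding.  It is not the Text
package's actual code: the function names are made up for illustration,
it works on plain lists rather than unboxed arrays, and the UTF-8 side
assumes well-formed input.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8, Word16)

-- Hypothetical helper: decode the first code point of a UTF-8 byte list.
-- Assumes well-formed input (continuation bytes are not validated).
decodeUtf8Head :: [Word8] -> Maybe (Char, [Word8])
decodeUtf8Head (a:rest)
  | a < 0x80 = Just (chr (i a), rest)                        -- 1 byte (ASCII)
  | a < 0xE0, (b:r) <- rest =                                -- 2 bytes
      Just (chr (((i a .&. 0x1F) `shiftL` 6) .|. cont b), r)
  | a < 0xF0, (b:c:r) <- rest =                              -- 3 bytes
      Just (chr (((i a .&. 0x0F) `shiftL` 12)
                 .|. (cont b `shiftL` 6) .|. cont c), r)
  | (b:c:d:r) <- rest =                                      -- 4 bytes
      Just (chr (((i a .&. 0x07) `shiftL` 18)
                 .|. (cont b `shiftL` 12)
                 .|. (cont c `shiftL` 6) .|. cont d), r)
  where
    i :: Word8 -> Int
    i = fromIntegral
    cont x = i x .&. 0x3F          -- payload bits of a continuation byte
decodeUtf8Head _ = Nothing         -- empty or truncated input

-- Hypothetical helper: decode the first code point of a UTF-16 unit list.
decodeUtf16Head :: [Word16] -> Maybe (Char, [Word16])
decodeUtf16Head (a:rest)
  | a < 0xD800 || a > 0xDFFF = Just (chr (j a), rest)        -- one unit (BMP)
  | a < 0xDC00, (b:r) <- rest, b >= 0xDC00, b <= 0xDFFF =    -- surrogate pair
      Just (chr (0x10000 + ((j a - 0xD800) `shiftL` 10) + (j b - 0xDC00)), r)
  where
    j :: Word16 -> Int
    j = fromIntegral
decodeUtf16Head _ = Nothing        -- empty input or unpaired surrogate

main :: IO ()
main = do
  print (fmap (fromEnum . fst) (decodeUtf8Head  [0xD0, 0x9F]))  -- U+041F
  print (fmap (fromEnum . fst) (decodeUtf16Head [0x041F]))      -- U+041F
```

For a Cyrillic letter the UTF-8 path takes a branch plus a mask, a
shift and an or, while the UTF-16 path is a range check and a
conversion.  Whether that difference still matters in practice is
exactly what would need measuring.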

The underlying principles of Text should work fine with UTF-8.  The
package has changed a lot since it was first written (thanks to some
excellent tuning and maintenance by bos), including some more efficient
binary arithmetic.  The situation may have changed with respect to the
performance limitations of UTF-8, or there may be room for both a UTF-8
and a UTF-16 version.  Any takers for implementing a UTF-8 version and
comparing the two?


> A large fraction - probably most - textual data isn't natural language
> text, but data formatted in textual form, and these formats are
> typically restricted to ASCII (except for a few text fields).
>
> For instance, a typical project for me might be 10-100GB of data, mostly
> in various text formats, "real" text only making up a few percent of
> this.  The combined (all languages) Wikipedia is 2G words, probably less
> than 20GB.
>
> Being agnostic about string encoding - viz. treating it as bytes - works
> okay, but it would be nice to allow Unicode in the bits that actually
> are text, like string fields and labels and such.

Is your point that ASCII characters take up the same amount of space
(i.e. 16 bits) as higher code points? Do you have any comparisons that
quantify how much this affects your ability to process text in real
terms?  Does it make it too slow? Infeasible memory-wise?
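
As a starting point for that kind of comparison, a small sketch like
the following counts the payload bytes a string would need in each
encoding.  The function names and sample strings are only illustrative,
and it uses nothing beyond base.  It counts encoded bytes only, not the
per-value overhead of Text, ByteString or String.

```haskell
import Data.Char (ord)

-- Bytes needed for one code point in each encoding.
utf8Bytes, utf16Bytes, utf32Bytes :: Char -> Int
utf8Bytes c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where n = ord c
utf16Bytes c = if ord c < 0x10000 then 2 else 4   -- surrogate pair above the BMP
utf32Bytes _ = 4

-- Total payload size of a string under a given per-character cost.
encodedSize :: (Char -> Int) -> String -> Int
encodedSize cost = sum . map cost

main :: IO ()
main = mapM_ report
  [ ("ASCII record", "key=value;id=42")
  , ("Russian text", "Привет, мир")
  ]
  where
    report (label, s) = putStrLn $
      label ++ ": UTF-8 " ++ show (encodedSize utf8Bytes s)
            ++ ", UTF-16 " ++ show (encodedSize utf16Bytes s)
            ++ ", UTF-32 " ++ show (encodedSize utf32Bytes s)
```

For these two samples the ASCII record comes out at 15, 30 and 60
bytes, and the Russian text at 20, 22 and 44 bytes.  So for
ASCII-dominated data UTF-16 doubles the payload, which is presumably
the cost you are pointing at.  Running the same count over a
representative slice of your data would show how much it matters.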

