[Haskell-cafe] Re: String vs ByteString

Johan Tibell johan.tibell at gmail.com
Tue Aug 17 04:20:37 EDT 2010

On Tue, Aug 17, 2010 at 9:08 AM, Ketil Malde <ketil at malde.org> wrote:

> Benedikt Huber <benjovi at gmx.net> writes:
> > Despite all this, I think the performance of the text
> > package is very promising, and hope it will improve further!
> I agree, Data.Text is great.  Unfortunately, its internal use of UTF-16
> makes it inefficient for many purposes.

It's not clear to me that using UTF-16 internally makes Data.Text
noticeably slower. If we could get conclusive evidence that using UTF-16
hurts performance, we could look into changing the internal representation
(a major undertaking). What Bryan and I need are benchmarks showing where
Data.Text performs poorly compared to String or ByteString, so we can
investigate the cause(s).
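
To make the request concrete, here is a minimal sketch of the kind of
benchmark that would help. It is not taken from the text package; the
workload (counting words in synthetic ASCII input), the input size, and
the helper names wordsString, wordsBytes, wordsText and timeIt are all
assumptions chosen for illustration. It uses only getCPUTime from base
for timing, so the numbers are rough; a dedicated benchmarking library
such as criterion would give more reliable measurements.

```haskell
module Main where

import Control.Exception (evaluate)
import qualified Data.ByteString.Char8 as B
import qualified Data.Text as T
import System.CPUTime (getCPUTime)

-- The same workload expressed against each representation.
wordsString :: String -> Int
wordsString = length . words

wordsBytes :: B.ByteString -> Int
wordsBytes = length . B.words

wordsText :: T.Text -> Int
wordsText = length . T.words

-- Hypothetical helper: apply f to x, force the Int result, and print
-- the CPU time spent. getCPUTime reports picoseconds.
timeIt :: String -> (a -> Int) -> a -> IO ()
timeIt label f x = do
  start <- getCPUTime
  n <- evaluate (f x)
  end <- getCPUTime
  let ms = (end - start) `div` 1000000000
  putStrLn (label ++ ": " ++ show n ++ " words, " ++ show ms ++ " ms")

main :: IO ()
main = do
  -- Synthetic ASCII input (an assumption; real data would be better).
  let s = concat (replicate 200000 "lorem ipsum dolor sit amet\n")
      b = B.pack s
      t = T.pack s
  -- Force the inputs first so that building them is not timed.
  _ <- evaluate (length s)
  _ <- evaluate (B.length b)
  _ <- evaluate (T.length t)
  timeIt "String" wordsString s
  timeIt "ByteString" wordsBytes b
  timeIt "Text" wordsText t
```

A benchmark like this only tells us where Data.Text is slower, not why;
it is the starting point for profiling, not a verdict on UTF-16.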

Hypotheses are a good starting point for performance improvements, but
they're not enough. We need benchmarks and people looking at profiling and
compiler output to really understand what's going on. For example, how many
people know that the Handle implementation first copies the input into a
mutable buffer and then into a Text value, for reads smaller than the buffer
size (8k if I remember correctly)? One of these copies could be avoided. How
do we know that it's UTF-16 that's our current performance bottleneck and
not this extra copy? We need to benchmark, change the code, and then
benchmark again.

Perhaps the outcome of all the benchmarking and investigation is indeed that
UTF-16 is a problem; then we can change the internal encoding. But there are
other possibilities, like poorly laid out branches in the generated code. We
need to understand what's going on if we are to make progress.

> A large fraction - probably most - of textual data isn't natural language
> text, but data formatted in textual form, and these formats are
> typically restricted to ASCII (except for a few text fields).
> For instance, a typical project for me might be 10-100GB of data, mostly
> in various text formats, "real" text only making up a few percent of
> this.  The combined (all languages) Wikipedia is 2G words, probably less
> than 20GB.

I think this is an important observation.
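
As a rough illustration of what that observation means for storage, the
sketch below compares the encoded size of the same Text value in UTF-8
and UTF-16. It uses encodeUtf8 and encodeUtf16LE from Data.Text.Encoding;
the helper name encodedSizes and the sample strings are assumptions made
for this example. It measures encoded bytes only and says nothing about
running time.

```haskell
module Main where

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as E

-- Hypothetical helper: (bytes as UTF-8, bytes as UTF-16) for a Text value.
encodedSizes :: T.Text -> (Int, Int)
encodedSizes t =
  (B.length (E.encodeUtf8 t), B.length (E.encodeUtf16LE t))

main :: IO ()
main = do
  -- Pure ASCII: one byte per character in UTF-8, two in UTF-16.
  print (encodedSizes (T.pack "ACGT"))
  -- A single Latin-1 letter (U+00E4): two bytes in either encoding.
  print (encodedSizes (T.pack "\228"))
```

For ASCII-only data the UTF-16 form is twice the size of the UTF-8 form,
which is the memory cost in question; whether that cost shows up as a
measurable slowdown is exactly what the benchmarks need to establish.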
