[Haskell-cafe] Re: String vs ByteString

Wed Aug 18 11:24:19 EDT 2010

Hi Michael,

On Wed, Aug 18, 2010 at 4:04 PM, Michael Snoyman <michael at snoyman.com> wrote:

> Here's my response to the two points:
>
> * I haven't written a patch showing that Data.Text would be faster using
> UTF-8 because that would require fulfilling the second point (I'll get to it
> in a second). I *have* shown where there are huge performance differences
> between text and ByteString/String. Unfortunately, the response has been
> "don't use bytestring, it's the wrong datatype, text will get fixed," which
> is quite underwhelming.
>

I went through all the emails you sent with the subjects "String vs ByteString"
and "Re: String vs ByteString" and I can't find a single benchmark. I do
agree with you that

    * UTF-8 is more compact than UTF-16, and
    * UTF-8 is by far the most used encoding on the web.

and that establishes a reasonable *theoretical* argument for why switching
to UTF-8 might be faster.
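
To put rough numbers on the compactness point, here is a minimal sketch that
encodes a few sample strings both ways and prints the byte counts. The sample
strings and the helper names (utf8Size, utf16Size) are made up for
illustration; only encodeUtf8 and encodeUtf16LE come from the text package.
The gap depends on the content: it is largest for ASCII-heavy input such as
markup and can reverse for scripts like CJK.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as E

-- Encoded size in bytes under each encoding (hypothetical helpers).
utf8Size, utf16Size :: T.Text -> Int
utf8Size  = B.length . E.encodeUtf8
utf16Size = B.length . E.encodeUtf16LE

main :: IO ()
main = mapM_ report samples
  where
    -- Illustrative samples only; not taken from any benchmark in the thread.
    samples =
      [ ("ASCII markup",   T.pack "<p>hello</p>")
      , ("accented Latin", T.pack "na\239ve caf\233")
      , ("CJK",            T.pack "\26085\26412\35486")
      ]
    report (label, t) =
      putStrLn (label ++ ": UTF-8 " ++ show (utf8Size t)
                ++ " bytes, UTF-16 " ++ show (utf16Size t) ++ " bytes")
```

Size alone says nothing about speed, of course, which is why the benchmarks
matter.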

What I'm looking for is a program that shows a big difference so we can
validate the hypothesis. As Duncan mentioned, we already ran some benchmarks
early on that showed the opposite. Someone posted a benchmark earlier in this
thread and Bryan addressed the issue raised by that poster. We want more of
those.
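
As a sketch of the kind of program I mean, something as small as the following
would already be a useful starting point. Everything in it is an assumption
made for illustration: the workload (counting words in repeated ASCII lines),
the input size, and the helper names mkInputs and timeMs. A single CPU-time
measurement is also very coarse, so a serious comparison should use a proper
benchmarking library and many runs.

```haskell
import Control.Exception (evaluate)
import qualified Data.ByteString.Char8 as B
import qualified Data.Text as T
import System.CPUTime (getCPUTime)

-- Hypothetical workload: n copies of a short ASCII line, built once as a
-- strict ByteString and once as a strict Text.
mkInputs :: Int -> (B.ByteString, T.Text)
mkInputs n =
  ( B.concat (replicate n (B.pack line))
  , T.replicate n (T.pack line)
  )
  where
    line = "the quick brown fox jumps over the lazy dog\n"

-- Run an action once and report the CPU time it took, in milliseconds.
timeMs :: IO a -> IO (a, Double)
timeMs act = do
  t0 <- getCPUTime
  r  <- act
  t1 <- getCPUTime
  return (r, fromIntegral (t1 - t0) / 1e9)  -- picoseconds to milliseconds

main :: IO ()
main = do
  let (bs, tx) = mkInputs 100000
  -- Force both inputs first so construction cost is not measured.
  _ <- evaluate bs
  _ <- evaluate tx
  (nb, msB) <- timeMs (evaluate (length (B.words bs)))
  (nt, msT) <- timeMs (evaluate (length (T.words tx)))
  putStrLn ("ByteString: " ++ show nb ++ " words, " ++ show msB ++ " ms")
  putStrLn ("Text:       " ++ show nt ++ " words, " ++ show msT ++ " ms")
```

Compile with optimization (for example ghc -O2) before reading anything into
the numbers.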

> * Since the prevailing attitude has been such disregard for any facts
> shown thus far, it seems that the effort required to learn the internals of
> the text package and attempt a patch would be wasted. In the meanwhile,
> Jasper has released blaze-builder which does an amazing job at producing
> UTF-8 encoded data, which for the moment is my main need. As much as I'll be
> chastised by the community, I'll stick with this approach for the moment.
>

I'm not sure this discussion has surfaced that many facts. What we do have
is plenty of theories. I can easily add some more:

    * GHC is not doing a good job laying out the branches in the validation
code that does arithmetic on the input byte sequence, to validate the input
and compute the Unicode code point that should be streamed using fusion.

    * The differences between text's and bytestring's fusion frameworks get in
the way of some optimization in GHC (text uses a more sophisticated fusion
framework that handles some cases bytestring's can't, according to Bryan).

    * Lingering space leaks are hurting performance (Bryan plugged one
already).

    * The use of a polymorphic loop state in the fusion framework gets in
the way of unboxing.

    * Extraneous copying in the Handle implementation slows down I/O.

All these are plausible reasons why Text might perform worse than
ByteString. We need to find out which ones are true by benchmarking and looking
at the generated Core.
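
To make the "polymorphic loop state" theory concrete, here is a simplified
stream-fusion sketch. It is not the actual definition used inside text (which
carries extra information); the names Step, Stream, streamList, mapS and
lengthS are chosen for illustration. The point is that the state type s is
hidden behind an existential, so whether the loop state ends up unboxed
depends on GHC inlining and specialising the loop, which is exactly the kind
of thing one can check in the Core.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

-- One way to inspect the Core for this module (flags from GHC's debugging
-- options; -dsuppress-all needs a reasonably recent GHC):
--
--   ghc -O2 -ddump-simpl -dsuppress-all StreamSketch.hs

-- A stepper either finishes, skips to a new state, or yields a value.
data Step s a = Done | Skip s | Yield a s

-- The state type s is existentially quantified: consumers cannot see it.
data Stream a = forall s. Stream (s -> Step s a) s

-- Turn a list into a stream; the remaining list is the loop state.
streamList :: [a] -> Stream a
streamList xs = Stream next xs
  where
    next []       = Done
    next (y : ys) = Yield y ys

-- Map over a stream without building an intermediate structure.
mapS :: (a -> b) -> Stream a -> Stream b
mapS f (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
      Done       -> Done
      Skip s'    -> Skip s'
      Yield a s' -> Yield (f a) s'

-- Consume a stream with a strict counter.
lengthS :: Stream a -> Int
lengthS (Stream next s0) = go 0 s0
  where
    go n s = case next s of
      Done       -> n
      Skip s'    -> go n s'
      Yield _ s' -> let n' = n + 1 in n' `seq` go n' s'

main :: IO ()
main = print (lengthS (mapS (+ 1) (streamList [1 .. 10 :: Int])))
```

In the dumped Core, the thing to look for is whether go works on unboxed
values or whether boxed state is allocated on every iteration.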

>  Now if you tell me that text would consider applying a UTF-8 patch, that
> would be a different story. But I don't have the time to maintain a separate
> UTF-8 version of text. For me, the whole point of this discussion was to
> determine whether we should attempt porting to UTF-8, which as I understand
> it would be a rather large undertaking.
>

I don't see any reason why Bryan wouldn't accept a UTF-8 patch if it was
faster on some set of benchmarks (starting with the ones already in the
library) that we agree on.

Cheers,
Johan