[Haskell-cafe] Re: String vs ByteString

Wed Aug 18 13:12:28 EDT 2010

On Wed, Aug 18, 2010 at 6:24 PM, Johan Tibell <johan.tibell at gmail.com> wrote:

> Hi Michael,
>
>
> On Wed, Aug 18, 2010 at 4:04 PM, Michael Snoyman <michael at snoyman.com> wrote:
>
>> Here's my response to the two points:
>>
>> * I haven't written a patch showing that Data.Text would be faster using
>> UTF-8 because that would require fulfilling the second point (which I'll get
>> to in a second). I *have* shown where there are huge performance differences
>> between text and ByteString/String. Unfortunately, the response has been
>> "don't use bytestring, it's the wrong datatype, text will get fixed," which
>> is quite underwhelming.
>>
>
> I went through all the emails you sent under the subjects "String vs
> ByteString" and "Re: String vs ByteString" and I can't find a single
> benchmark. I do agree with you that
>
>     * UTF-8 is more compact than UTF-16, and
>     * UTF-8 is by far the most used encoding on the web.
>
> and that establishes a reasonable *theoretical* argument for why switching
> to UTF-8 might be faster.
>
> What I'm looking for is a program that shows a big difference so we can
> validate the hypothesis. As Duncan mentioned we already ran some benchmarks
> early on that showed the opposite. Someone posted a benchmark earlier in this
> thread and Bryan addressed the issue raised by that poster. We want more of
> those.
>
>
Sorry, I thought I'd sent these out. While working on optimizing Hamlet I
started playing around with the BigTable benchmark. I wrote two blog posts
on the topic:

http://www.snoyman.com/blog/entry/bigtable-benchmarks/
http://www.snoyman.com/blog/entry/optimizing-hamlet/

Originally, Hamlet had been based on the text package; the huge slow-down
introduced by text convinced me to migrate to bytestrings, and ultimately
blaze-html/blaze-builder. It could be that these were flaws in text that are
correctable and have nothing to do with UTF-16; however, it will be
difficult to produce a benchmark purely on the UTF-8/UTF-16 divide. Using
UTF-16 bytestrings would probably overstate the impact since it wouldn't be
using Bryan's fusion logic.
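
As a rough illustration of the kind of comparison involved (this is not the
BigTable code from the blog posts above), here is a minimal sketch that renders
a small HTML table two ways: through Text followed by encodeUtf8, and directly
to UTF-8 bytes with a builder. Data.ByteString.Builder from the bytestring
package stands in for blaze-builder here; that substitution, the table shape,
and the names tableViaText, tableViaBuilder and timed are all assumptions made
for the example.

```haskell
module Main where

import qualified Data.ByteString.Builder as BB
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import System.CPUTime (getCPUTime)

-- Hypothetical table dimensions, chosen only for illustration.
rows, cols :: Int
rows = 1000
cols = 10

-- | Build the table as Text, then encode the result to UTF-8.
tableViaText :: Int -> Int -> BL.ByteString
tableViaText r c = BL.fromStrict (TE.encodeUtf8 table)
  where
    table  = T.concat [T.pack "<table>", T.concat (replicate r row), T.pack "</table>"]
    row    = T.concat [T.pack "<tr>", T.concat (map cell [1 .. c]), T.pack "</tr>"]
    cell n = T.concat [T.pack "<td>", T.pack (show n), T.pack "</td>"]

-- | Build the same table directly as UTF-8 bytes with a Builder.
tableViaBuilder :: Int -> Int -> BL.ByteString
tableViaBuilder r c = BB.toLazyByteString table
  where
    table  = BB.string7 "<table>" <> mconcat (replicate r row) <> BB.string7 "</table>"
    row    = BB.string7 "<tr>" <> foldMap cell [1 .. c] <> BB.string7 "</tr>"
    cell n = BB.string7 "<td>" <> BB.intDec n <> BB.string7 "</td>"

-- | Force the whole output by taking its length and report the CPU time used.
-- getCPUTime returns picoseconds, so dividing by 10^6 gives microseconds.
timed :: String -> BL.ByteString -> IO ()
timed label bytes = do
  start <- getCPUTime
  let n = BL.length bytes
  end <- n `seq` getCPUTime
  putStrLn (label ++ ": " ++ show n ++ " bytes, "
            ++ show ((end - start) `div` 1000000) ++ " us CPU time")

main :: IO ()
main = do
  timed "text + encodeUtf8" (tableViaText rows cols)
  timed "builder"           (tableViaBuilder rows cols)
```

A single timed run like this is very crude; a benchmarking library such as
criterion would give far more trustworthy numbers. The sketch also says nothing
about UTF-8 versus UTF-16 as an internal representation, which is exactly the
part that is hard to isolate.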

>> * Since the prevailing attitude has been such a disregard to any facts shown
>> thus far, it seems that the effort required to learn the internals of the
>> text package and attempt a patch would be wasted. In the meanwhile, Jasper
>> has released blaze-builder which does an amazing job at producing UTF-8
>> encoded data, which for the moment is my main need. As much as I'll be
>> chastised by the community, I'll stick with this approach for the moment.
>>
>
> I'm not sure this discussion has surfaced that many facts. What we do have
> is plenty of theories. I can easily add some more:
>
>     * GHC is not doing a good job laying out the branches in the validation
> code that does arithmetic on the input byte sequence, to validate the input
> and compute the Unicode code point that should be streamed using fusion.
>
>     * The differences between text's and bytestring's fusion frameworks get
> in the way of some optimizations in GHC (text uses a more sophisticated fusion
> framework that handles some cases bytestring can't, according to Bryan).
>
>     * Lingering space leaks are hurting performance (Bryan plugged one
> already).
>
>     * The use of a polymorphic loop state in the fusion framework gets in
> the way of unboxing.
>
>     * Extraneous copying in the Handle implementation slows down I/O.
>
> All these are plausible reasons why Text might perform worse than
> ByteString. We need to find out which ones are true by benchmarking and looking
> at the generated Core.
>
>
>> Now if you tell me that text would consider applying a UTF-8 patch, that
>> would be a different story. But I don't have the time to maintain a separate
>> UTF-8 version of text. For me, the whole point of this discussion was to
>> determine whether we should attempt porting to UTF-8, which as I understand
>> it would be a rather large undertaking.
>>
>
> I don't see any reason why Bryan wouldn't accept a UTF-8 patch if it was
> faster on some set of benchmarks (starting with the ones already in the
> library) that we agree on.
>
I think that's the main issue, and one that Duncan hit on the head: we
have to think about what are the important benchmarks. For Hamlet, I need
fast UTF-8 bytestring generation. I don't care at all about the algorithmic
speed of splitting text, for example. My (probably uneducated) guess is that
UTF-16 tends to perform many operations in memory faster since almost all
characters are represented as 16 bits, while the big benefit for UTF-8 is in
reading UTF-8 data, rendering UTF-8 data and decreased memory usage. But as
I said, that's an (uneducated) guess.
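
The memory-usage part of that guess is easy to check on small inputs. Here is a
minimal sketch, assuming the encodeUtf8 and encodeUtf16LE functions from
Data.Text.Encoding, that counts how many bytes each encoding needs for a few
sample strings. The samples and the names utf8Bytes and utf16Bytes are made up
for illustration, and the sketch measures encoded size only, not speed.

```haskell
module Main where

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Bytes needed to store the text as UTF-8.
utf8Bytes :: T.Text -> Int
utf8Bytes = B.length . TE.encodeUtf8

-- | Bytes needed to store the text as UTF-16 (little endian, no byte order mark).
utf16Bytes :: T.Text -> Int
utf16Bytes = B.length . TE.encodeUtf16LE

-- Illustrative samples; non-ASCII characters are written as numeric escapes.
samples :: [(String, T.Text)]
samples =
  [ ("ASCII markup", T.pack "<td>42</td>")         -- 11 bytes UTF-8, 22 bytes UTF-16
  , ("Greek",        T.pack "\945\946\947\948")    --  8 bytes UTF-8,  8 bytes UTF-16
  , ("Japanese",     T.pack "\26085\26412\35486")  --  9 bytes UTF-8,  6 bytes UTF-16
  ]

main :: IO ()
main = mapM_ report samples
  where
    report (label, t) =
      putStrLn (label ++ ": UTF-8 = " ++ show (utf8Bytes t)
                ++ " bytes, UTF-16 = " ++ show (utf16Bytes t) ++ " bytes")
```

For markup-heavy output like Hamlet's, UTF-8 comes out at half the size, while
text dominated by CJK characters is smaller in UTF-16, so which encoding is
more compact depends on the workload.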

Some people have been floating the idea of multiple text packages. I
personally would *not* want to go down that road, but it might be the only
approach that allows top performance for all use cases. As is, I'm quite
happy using blaze-builder for Hamlet.

Michael