[Haskell-cafe] Re: String vs ByteString

Johan Tibell johan.tibell at gmail.com
Wed Aug 18 16:58:58 EDT 2010

On Wed, Aug 18, 2010 at 7:12 PM, Michael Snoyman <michael at snoyman.com> wrote:

> On Wed, Aug 18, 2010 at 6:24 PM, Johan Tibell <johan.tibell at gmail.com> wrote:
> Sorry, I thought I'd sent these out. While working on optimizing Hamlet I
> started playing around with the BigTable benchmark. I wrote two blog posts
> on the topic:
> http://www.snoyman.com/blog/entry/bigtable-benchmarks/
> http://www.snoyman.com/blog/entry/optimizing-hamlet/
> Originally, Hamlet had been based on the text package; the huge slow-down
> introduced by text convinced me to migrate to bytestrings, and ultimately
> blaze-html/blaze-builder. It could be that these were flaws in text that are
> correctable and have nothing to do with UTF-16; however, it will be
> difficult to produce a benchmark purely on the UTF-8/UTF-16 divide. Using
> UTF-16 bytestrings would probably overstate the impact since it wouldn't be
> using Bryan's fusion logic.

Those are great. As Bryan mentioned, we've already improved performance, and I
think I know how to improve it further.

I appreciate that it's difficult to show the UTF-8/UTF-16 divide. I think
the approach we're trying at the moment is looking at benchmarks, improving
performance, and repeating until we can't improve anymore. It could be the
case that we get a benchmark where the performance difference between
bytestring and text cannot be explained/fixed by factors other than changing
the internal encoding. That would be strong evidence that we should try to
switch the internal encoding. We haven't seen any such benchmarks yet.

As for blaze, I'm not sure exactly how it deals with UTF-8 input. I tried to
browse through the repo but could not find anywhere that input ByteStrings are
actually validated. If they're not, it's a bit generous to say that it deals
with UTF-8 data, as it would really just be concatenating byte sequences
without validating them. We should ask Jasper about the current state.

>> I don't see any reason why Bryan wouldn't accept a UTF-8 patch if it was
>> faster on some set of benchmarks (starting with the ones already in the
>> library) that we agree on.
> I think that's the main issue, and one that Duncan nailed on the head: we
> have to think about what are the important benchmarks. For Hamlet, I need
> fast UTF-8 bytestring generation. I don't care at all about algorithmic
> speed for split texts, as an example. My (probably uneducated) guess is that
> UTF-16 tends to perform many operations in memory faster since almost all
> characters are represented as 16 bits, while the big benefit for UTF-8 is in
> reading UTF-8 data, rendering UTF-8 data and decreased memory usage. But as
> I said, that's an (uneducated) guess.

I agree. Let's create some more benchmarks.
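As a concrete illustration of the memory point in your guess above, here is a small sketch comparing the encoded sizes of the same Text in UTF-8 and UTF-16LE, using encodeUtf8 and encodeUtf16LE from the text package (encodedSizes is my name for it). For ASCII-heavy text, UTF-8 is half the size; for scripts outside Latin-1 the gap narrows or reverses:

```haskell
-- Sketch: byte counts of one Text value under both encodings.
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8, encodeUtf16LE)

-- (UTF-8 byte count, UTF-16LE byte count) for the given text.
encodedSizes :: T.Text -> (Int, Int)
encodedSizes t = (B.length (encodeUtf8 t), B.length (encodeUtf16LE t))
```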

For example, lately I've been working on a benchmark, inspired by a real-world
problem, where I iterate over the lines in a ~500 MB file encoded in UTF-8,
inserting each line into a Data.Map and doing a bunch of further processing on
it (such as splitting the strings into words). This tests text I/O throughput,
memory overhead, performance of string comparison, etc.
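In case it helps to make that concrete, the core of such a benchmark might look like the following pure sketch (the real version streams a ~500 MB file; countLines and totalWords are names I made up):

```haskell
-- Sketch of the benchmark's inner work: insert each line into a Map
-- and split lines into words.
import qualified Data.Map as M
import qualified Data.Text as T

-- Insert every line into a Map, counting duplicates.
countLines :: T.Text -> M.Map T.Text Int
countLines = foldr (\l m -> M.insertWith (+) l 1 m) M.empty . T.lines

-- Further processing: total number of words across all lines.
totalWords :: T.Text -> Int
totalWords = length . concatMap T.words . T.lines
```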

We already have benchmarks for reading files (in UTF-8) in several different
ways (lazy I/O and iteratee style folds).
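For anyone wanting to reproduce the lazy-I/O variant, a minimal sketch is to read the bytes lazily and decode as UTF-8 explicitly (Data.Text.Lazy.IO.readFile would use the locale encoding instead); countFileLines is a hypothetical name, not one of our benchmark functions:

```haskell
-- Sketch: lazy-I/O line count over a UTF-8 file, decoding explicitly.
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import Data.Text.Lazy.Encoding (decodeUtf8)

countFileLines :: FilePath -> IO Int
countFileLines fp = do
  bytes <- BL.readFile fp          -- lazy read: chunks on demand
  return (length (TL.lines (decodeUtf8 bytes)))
```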

Boil down the things you care about into a self-contained benchmark and send
it to this list or put it somewhere where we can retrieve it.

