[Haskell-cafe] Re: String vs ByteString

Tue Aug 17 07:07:35 EDT 2010

Hi Ketil,

On Tue, Aug 17, 2010 at 12:09 PM, Ketil Malde <ketil at malde.org> wrote:

> Johan Tibell <johan.tibell at gmail.com> writes:
>
> > It's not clear to me that using UTF-16 internally does make Data.Text
> > noticeably slower.
>
> I haven't benchmarked it, but I'm fairly sure that, if you try to fit a
> 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of
> RAM, UTF-16 will be slower than UTF-8.  Many applications will get away
> with streaming over data, retaining only a small part, but some won't.
>

I'm not sure this is a great example, as genome data is probably much
better stored in a vector (using a few bits per "letter"). I agree that
whenever one data structure fits in the available RAM and another doesn't,
the smaller will win. I just don't know if this case is worth spending
weeks of work optimizing for. That's why I'd like to see benchmarks for
more idiomatic use cases.
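
To make the "few bits per letter" idea concrete, here is a minimal sketch that
packs four bases into one byte using 2 bits each. It assumes the input contains
only A, C, G and T; real genome files also contain other symbols (such as N),
which this sketch simply rejects. The names (encodeBase, packBases,
unpackBases) are made up for illustration, and the plain list of Word8 is used
only to keep the example dependency-free; a real program would keep the bytes
in an unboxed vector or a ByteString.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word8)

-- Map a base to a 2-bit code; any other character is rejected.
encodeBase :: Char -> Maybe Word8
encodeBase 'A' = Just 0
encodeBase 'C' = Just 1
encodeBase 'G' = Just 2
encodeBase 'T' = Just 3
encodeBase _   = Nothing

-- Inverse of encodeBase; only the low two bits of the argument are used.
decodeBase :: Word8 -> Char
decodeBase w = "ACGT" !! fromIntegral (w .&. 3)

-- Pack up to four 2-bit codes into one byte, first base in the low bits.
packByte :: [Word8] -> Word8
packByte = foldr (\c acc -> (acc `shiftL` 2) .|. c) 0

-- Pack a whole sequence; Nothing if any character is not A, C, G or T.
packBases :: String -> Maybe [Word8]
packBases s = map packByte . chunksOf4 <$> traverse encodeBase s
  where
    chunksOf4 [] = []
    chunksOf4 xs = let (h, t) = splitAt 4 xs in h : chunksOf4 t

-- Recover the first n bases from the packed bytes.
unpackBases :: Int -> [Word8] -> String
unpackBases n = take n . concatMap bases
  where
    bases b = [decodeBase (b `shiftR` k) | k <- [0, 2, 4, 6]]

main :: IO ()
main = do
  let s = "GATTACA"
  print (packBases s)
  -- Round trip: unpacking the packed bytes gives back the input.
  print (unpackBases (length s) <$> packBases s)
```

At 2 bits per base the packed form is a quarter the size of one-byte-per-letter
text, and an eighth the size of a UTF-16 representation of the same letters.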

> In other cases (e.g. processing CJK text, and perhaps also
> non-Latin1 text), I'm sure it'll be faster - but my (still
> unsubstantiated) guess is that the difference will be much smaller, and
> it'll be a case of winning some and losing some - and I'd also
> conjecture that having 3Gb "real" text (i.e. natural language, as
> opposed to text-formatted data) is rare.
>

I would like to verify this guess. In my personal experience it's really
hard to guess which changes will lead to a noticeable performance
improvement. I'm probably wrong more often than I'm right.
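
As a starting point for checking the size half of this argument, the sketch
below counts how many bytes a string needs in each encoding. It only measures
encoded size and says nothing about speed, which would still need real
benchmarks. The function names (utf8Bytes, utf16Bytes, encodedSizes) are made
up for illustration and are not part of Data.Text.

```haskell
import Data.Char (ord)

-- Bytes needed to encode one code point in UTF-8.
utf8Bytes :: Char -> Int
utf8Bytes c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where
    n = ord c

-- Bytes needed to encode one code point in UTF-16
-- (one 16-bit unit, or a surrogate pair outside the BMP).
utf16Bytes :: Char -> Int
utf16Bytes c
  | ord c < 0x10000 = 2
  | otherwise       = 4

-- (UTF-8 size, UTF-16 size) of a whole string, in bytes.
encodedSizes :: String -> (Int, Int)
encodedSizes s = (sum (map utf8Bytes s), sum (map utf16Bytes s))

main :: IO ()
main = do
  print (encodedSizes "GATTACA")  -- ASCII: (7,14)
  print (encodedSizes "日本語")    -- CJK:   (9,6)
```

For pure ASCII data UTF-16 doubles the size, while for CJK text UTF-8 is 1.5
times larger, which matches the "winning some and losing some" intuition above.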

Cheers,
Johan