[Haskell-cafe] Re: String vs ByteString

Tue Aug 17 06:09:10 EDT 2010

Johan Tibell <johan.tibell at gmail.com> writes:

> It's not clear to me that using UTF-16 internally does make Data.Text
> noticeably slower. 

I haven't benchmarked it, but I'm fairly sure that, if you try to fit a
3Gbyte file (the human genome, say¹) into a computer with 4Gbytes of
RAM, UTF-16 will be slower than UTF-8: ASCII data doubles in size under
UTF-16 and no longer fits in memory.  Many applications will get away
with streaming over data, retaining only a small part, but some won't.

In other cases (e.g. processing CJK text, and perhaps also other
non-Latin1 text), I'm sure UTF-16 will be faster - but my (still
unsubstantiated) guess is that the difference will be much smaller, and
it'll be a case of winning some and losing some - and I'd also
conjecture that having 3Gbytes of "real" text (i.e. natural language, as
opposed to text-formatted data) is rare.
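
To make the size argument concrete, here is a rough sketch (not a
benchmark) that counts how many bytes a string needs in each encoding.
The names utf8Bytes, utf16Bytes and encodedSizes are made up for this
illustration and are not part of Data.Text or any other library; the
sketch also ignores unpaired surrogate code points.

```haskell
import Data.Char (ord)

-- Bytes needed to encode one code point in UTF-8.
utf8Bytes :: Char -> Int
utf8Bytes c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where n = ord c

-- Bytes needed to encode one code point in UTF-16:
-- one 16-bit unit inside the BMP, a surrogate pair outside it.
utf16Bytes :: Char -> Int
utf16Bytes c
  | ord c < 0x10000 = 2
  | otherwise       = 4

-- (UTF-8 size, UTF-16 size) of a whole string, in bytes.
encodedSizes :: String -> (Int, Int)
encodedSizes s = (sum (map utf8Bytes s), sum (map utf16Bytes s))

main :: IO ()
main = do
  print (encodedSizes "GATTACA")  -- ASCII:  (7,14)
  print (encodedSizes "日本語")    -- CJK:    (9,6)
```

Scaled up, 3Gbytes of ASCII becomes 6Gbytes in UTF-16, which is what
pushes it past 4Gbytes of RAM, whereas CJK text only shrinks from three
bytes per character to two.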

I think that *IF* we are aiming for a single, grand, unified text
library to Rule Them All, it needs to use UTF-8.  Alternatively, we
can have different libraries with different representations for
different purposes, where you'll get another few percent of juice by
switching to the most appropriate.

Currently the latter approach looks to be in favor, so if we can't have
one single library, let us at least aim for a set of libraries with
consistent interfaces and optimal performance.  Data.Text is great for
UTF-16, and I'd like to have something similar for UTF-8.  That is all
I'm trying to say.

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants