[Haskell-cafe] Re: String vs ByteString

Yitzchak Gale gale at sefer.org
Tue Aug 17 06:50:58 EDT 2010


Ketil Malde wrote:
> I haven't benchmarked it, but I'm fairly sure that, if you try to fit a
> 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of
> RAM, UTF-16 will be slower than UTF-8...

I don't think the genome is typical text. And I doubt the
claim holds if the text is in a CJK language, where most
characters take three bytes in UTF-8 but only two in UTF-16.
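
To make the size argument concrete, here is a minimal sketch that
counts encoded bytes per code point using only base. The helper
names (utf8Bytes, utf16Bytes, encodedSize) and the sample strings
are made up for illustration; this is not how Data.Text measures
or stores anything.

```haskell
import Data.Char (ord)

-- Bytes needed to encode one code point in UTF-8.
utf8Bytes :: Char -> Int
utf8Bytes c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where n = ord c

-- Bytes needed to encode one code point in UTF-16
-- (code points above the BMP need a surrogate pair).
utf16Bytes :: Char -> Int
utf16Bytes c
  | ord c < 0x10000 = 2
  | otherwise       = 4

-- Total encoded size of a string under a per-character cost function.
encodedSize :: (Char -> Int) -> String -> Int
encodedSize cost = sum . map cost

main :: IO ()
main = mapM_ report [("ASCII", "ACGTACGT"), ("CJK", "日本語")]
  where
    report (label, s) =
      putStrLn $ label ++ ": UTF-8 = " ++ show (encodedSize utf8Bytes s)
              ++ " bytes, UTF-16 = " ++ show (encodedSize utf16Bytes s)
              ++ " bytes"
```

For the ASCII sample this gives 8 bytes in UTF-8 against 16 in
UTF-16; for the three-character CJK sample it gives 9 against 6.
Text that mixes in ASCII markup would shift those totals.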

> I think that *IF* we are aiming for a single, grand, unified text
> library to Rule Them All, it needs to use UTF-8.

Given the growth rate of China's economy, if CJK isn't
already the majority of text being processed in the world,
it will be soon. I have seen media reports claiming CJK is
now a majority of text data going over the wire on the web,
though I haven't seen anything scientific backing up those claims.
The claim certainly seems reasonable. I believe Google's
measurements showing wide adoption of UTF-8, which are based on
their own web index, are very badly skewed by a strong Western bias.

In that case, if we have to pick one encoding for Data.Text,
UTF-16 is likely to be a better choice than UTF-8, especially
if the cost is fairly low even for the special case of Western
languages. Also, UTF-16 has become by far the dominant internal
text format for most software and for most user platforms.
The exception is desktop Linux, and whether we like it or not,
Linux desktops will remain a tiny minority for the foreseeable future.

> Alternatively, we
> can have different libraries with different representations for
> different purposes, where you'll get another few percent of juice by
> switching to the most appropriate.
>
> Currently the latter approach looks to be in favor, so if we can't have
> one single library, let us at least aim for a set of libraries with
> consistent interfaces and optimal performance.  Data.Text is great for
> UTF-16, and I'd like to have something similar for UTF-8.  Is all I'm
> trying to say.

I agree.

Thanks,
Yitz

