[Haskell-cafe] Re: String vs ByteString

Tue Aug 17 07:00:56 EDT 2010

On Tue, Aug 17, 2010 at 1:50 PM, Yitzchak Gale <gale at sefer.org> wrote:

> Ketil Malde wrote:
> > I haven't benchmarked it, but I'm fairly sure that, if you try to fit a
> > 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of
> > RAM, UTF-16 will be slower than UTF-8...
>
> I don't think the genome is typical text. And
> I doubt that is true if that text is in a CJK language.
>
> > I think that *IF* we are aiming for a single, grand, unified text
> > library to Rule Them All, it needs to use UTF-8.
>
> Given the growth rate of China's economy, if CJK isn't
> already the majority of text being processed in the world,
> it will be soon. I have seen media reports claiming CJK is
> now a majority of text data going over the wire on the web,
> though I haven't seen anything scientific backing up those claims.
> It certainly seems reasonable. I believe Google's measurements
> based on their own web index showing wide adoption of UTF-8
> are very badly skewed due to a strong Western bias.
>
> In that case, if we have to pick one encoding for Data.Text,
> UTF-16 is likely to be a better choice than UTF-8, especially
> if the cost is fairly low even for the special case of Western
> languages. Also, UTF-16 has become by far the dominant internal
> text format for most software and for most user platforms.
> Except on desktop Linux - and whether we like it or not, Linux
> desktops will remain a tiny minority for the foreseeable future.
>

I think you are conflating two points here, and ignoring some important
data. Regarding the data: you haven't actually quoted any statistics about
the prevalence of CJK data, but even if the majority of web pages served are
in those three languages, a fairly high percentage of the content will
*still* be ASCII, due simply to the HTML, CSS and JavaScript overhead. I'd
hate to make up statistics on the spot, especially when I don't have any
numbers from you to compare them with.

As for the conflation, there are two questions with regard to the
encoding choice: encoding/decoding time and space usage. I don't think
*anyone* is asserting that UTF-16 is a common encoding for files anywhere,
so by using UTF-16 we simply incur a transcoding overhead in every case. We
can't consider a CJK-specific encoding for Data.Text, so the prevalence of
such encodings is irrelevant to this topic. What *is* relevant is that a
very large percentage of web pages *are*, in fact, standardizing on UTF-8,
and that every 7-bit ASCII text file is already valid UTF-8.

As for space usage, you are correct that CJK text will take up more
memory in UTF-8 than in UTF-16. The question still remains whether the
overall document size will be larger: I'd be interested in taking a random
sample of CJK-language pages and comparing their UTF-8 and UTF-16 file
sizes. I think talking about this in the absence of data is pointless. If
anyone can recommend a CJK website (or a few) which would be considered
representative, I'll do the test myself.
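
For concreteness, the measurement could look roughly like the sketch below.
It assumes the encoders in Data.Text.Encoding from the text package; the
names encodedSizes and fileSizes are hypothetical, and fileSizes assumes the
page has already been saved to disk as UTF-8.

```haskell
-- Sketch: compare the encoded size of the same text in UTF-8 and UTF-16.
-- Assumes the `text` and `bytestring` packages; the function names are
-- hypothetical and chosen only for this illustration.
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | (bytes as UTF-8, bytes as UTF-16) for a given text, with no BOM.
encodedSizes :: T.Text -> (Int, Int)
encodedSizes t =
  ( B.length (TE.encodeUtf8 t)
  , B.length (TE.encodeUtf16LE t)
  )

-- | The same measurement for a file on disk, assumed to be saved as UTF-8.
-- 'TE.decodeUtf8' throws an exception if the bytes are not valid UTF-8.
fileSizes :: FilePath -> IO (Int, Int)
fileSizes path = encodedSizes . TE.decodeUtf8 <$> B.readFile path
```

On made-up sample strings (not measurements of real pages),
encodedSizes (T.pack "日本語") gives (9,6), so bare CJK text is smaller in
UTF-16, while encodedSizes (T.pack "<p>日本語</p>") gives (16,20), so even a
little ASCII markup tips the total toward UTF-8. Whether real pages behave
this way is exactly what the sampling would have to show.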

Michael