[Haskell-cafe] Re: String vs ByteString

Yitzchak Gale gale at sefer.org
Tue Aug 17 08:23:34 EDT 2010


Michael Snoyman wrote:
> Regarding the data: you haven't actually quoted any
> statistics about the prevalence of CJK data

True, I haven't seen any - except for Google's, which
I don't believe are accurate. I would like to see some
good, unbiased data.

Right now we just have our intuitions based on anecdotal
evidence and whatever years of experience we have in IT.

For the anecdotal evidence, I really wish that people from
CJK countries were better represented in this discussion.
Unfortunately, Haskell is less prevalent in CJK countries,
and there is somewhat of a language barrier.

> I'd hate to make up statistics on the spot, especially when
> I don't have any numbers from you to compare them with.

I agree, I wish we had better numbers.

> even if the majority of web pages served are
> in those three languages, a fairly high percentage
> of the content will *still* be ASCII, due simply to the HTML,
> CSS and JavaScript overhead...
> As far as space usage, you are correct that CJK data will take up more
> memory in UTF-8 than UTF-16. The question still remains whether the overall
> document size will be larger: I'd be interested in taking a random sampling
> of CJK-encoded pages and comparing their UTF-8 and UTF-16 file sizes. I
> think simply talking about this in a data vacuum is pointless. If
> anyone can recommend a CJK website which would be considered representative
> (or a few), I'll do the test myself.

Again, I agree that some real data would be great.

The problem is, I'm not sure there is anyone in this discussion
who is qualified to come up with anything even close to a fair
random sampling, or to name a representative CJK website.
As far as I can tell, most of us participating in this discussion
have absolutely zero perspective on what computing is like
in CJK countries.
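
That said, the mechanical part of the comparison is easy; the hard
part is choosing the sample. Here is a minimal sketch using the text
package (encodedSizes and the sample strings are purely illustrative,
not real page data):

import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8, encodeUtf16LE)

-- Byte counts for the same text under UTF-8 and UTF-16LE.
encodedSizes :: T.Text -> (Int, Int)
encodedSizes t = (B.length (encodeUtf8 t), B.length (encodeUtf16LE t))

main :: IO ()
main = do
  let markup = T.pack "<html><body>hello</body></html>"
      cjk    = T.pack "日本語のテキスト"
  print (encodedSizes markup)  -- ASCII: 1 byte/char vs 2
  print (encodedSizes cjk)     -- BMP CJK: 3 bytes/char vs 2

The interesting question is how those two effects balance out over
the markup-to-content ratio of real CJK pages, which is exactly
where we lack data.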

> As far as the conflation, there are two questions
> with regard to the encoding choice: encoding/decoding time
> and space usage.

No, there is a third: using an API that results in robust, readable
and maintainable code even in the face of changing encoding
requirements. Unless you have proof that the performance difference
between that API and one with a hard-wired encoding is what causes
your particular application to fail to meet its requirements, the
hard-wired approach is guilty of aggravated premature optimization.

So, for example, UTF-8 is an important option to have in a
web toolkit. But if that's the only option, then in my opinion
that web toolkit shouldn't be considered a general-purpose one.
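
To illustrate that third question, here is a hedged sketch (the
names Encoder, respond, respondUtf8 and respondUtf16 are made up
for this example, not from any existing toolkit): if the application
logic works in Text and the encoding is a single parameter applied
at the I/O boundary, a change in encoding requirements touches one
line, not the whole code base.

import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8, encodeUtf16LE)

-- The encoding is a value chosen at the boundary, not baked
-- into every function signature.
type Encoder = T.Text -> B.ByteString

-- All application logic stays in Text; only here do we commit
-- to bytes. (Headers, content-type, etc. elided.)
respond :: Encoder -> T.Text -> B.ByteString
respond encode body = encode body

respondUtf8, respondUtf16 :: T.Text -> B.ByteString
respondUtf8  = respond encodeUtf8
respondUtf16 = respond encodeUtf16LE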

> I don't think *anyone* is asserting that
> UTF-16 is a common encoding for files anywhere,
> so by using UTF-16 we are simply incurring an overhead
> in every case.

Well, to start with, all MS Word documents are in UTF-16.
There are a few of those around, I think. Most applications -
in some sense of "most" - store text in UTF-16.

Again, without any data, my intuition tells me that
most of the text data stored in the world's files is in
UTF-16. There is currently not much Haskell code
that reads those formats directly, but I think that will
change as usage of Haskell in the real world picks up.
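
Reading such data directly is already straightforward with the text
package. A minimal sketch (readUtf16LEFile is a name I made up here),
assuming a UTF-16LE file with no byte-order mark - a real reader
would need to detect the BOM, and note that decodeUtf16LE throws
an exception on malformed input:

import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf16LE)

-- Read a whole file as raw bytes and decode them as UTF-16LE.
readUtf16LEFile :: FilePath -> IO T.Text
readUtf16LEFile path = fmap decodeUtf16LE (B.readFile path)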

> We can't consider a CJK encoding for text,

Not as a default, and certainly not as the only option.
But it would be nice to have as a choice.

> What *is* relevant is that a very large percentage of web pages
> *are*, in fact, standardizing on UTF-8,

In Western countries.

Regards,
Yitz
