[Haskell-cafe] Re: String vs ByteString

Michael Snoyman michael at snoyman.com
Tue Aug 17 09:12:38 EDT 2010


On Tue, Aug 17, 2010 at 3:23 PM, Yitzchak Gale <gale at sefer.org> wrote:

> Michael Snoyman wrote:
> > Regarding the data: you haven't actually quoted any
> > statistics about the prevalence of CJK data
>
> True, I haven't seen any - except for Google, which
> I don't believe is accurate. I would like to see some
> good unbiased data.
>
> Right now we just have our intuitions based on anecdotal
> evidence and whatever years of experience we have in IT.
>
> For the anecdotal evidence, I really wish that people from
> CJK countries were better represented in this discussion.
> Unfortunately, Haskell is less prevalent in CJK countries,
> and there is somewhat of a language barrier.
>
> > I'd hate to make up statistics on the spot, especially when
> > I don't have any numbers from you to compare them with.
>
> I agree, I wish we had better numbers.
>
> > even if the majority of web pages served are
> > in those three languages, a fairly high percentage
> > of the content will *still* be ASCII, due simply to the HTML,
> > CSS and Javascript overhead...
> > As far as space usage, you are correct that CJK data will take up more
> > memory in UTF-8 than UTF-16. The question still remains whether the
> overall
> > document size will be larger: I'd be interested in taking a random
> sampling
> > of CJK-encoded pages and comparing their UTF-8 and UTF-16 file sizes. I
> > think simply talking about this in the vacuum of data is pointless. If
> > anyone can recommend a CJK website which would be considered
> representative
> > (or a few), I'll do the test myself.
>
> Again, I agree that some real data would be great.
>
> The problem is, I'm not sure if there is anyone in this discussion
> who is qualified to come up with anything even close to a fair
> random sampling or a CJK website that is representative.
> As far as I can tell, most of us participating in this discussion
> have absolutely zero perspective of what computing is like
> in CJK countries.
>
I won't call this a scientific study by any stretch of the imagination, but I
did a quick test on the www.qq.com homepage. The original file encoding was
GB2312; here are the file sizes (in bytes):

GB2312: 193014
UTF8: 200044
UTF16: 371938
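
For anyone who wants to reproduce this kind of comparison, here is a rough
sketch using the text package. It assumes the page has already been saved
locally and converted to UTF-8 (the file name is made up):

    import qualified Data.ByteString as B
    import qualified Data.Text.Encoding as TE

    main :: IO ()
    main = do
      -- read the UTF-8 version of the page saved earlier
      bytes <- B.readFile "qq-home.utf8.html"
      let t     = TE.decodeUtf8 bytes
          utf8  = TE.encodeUtf8 t
          utf16 = TE.encodeUtf16LE t
      putStrLn $ "UTF-8:  " ++ show (B.length utf8) ++ " bytes"
      putStrLn $ "UTF-16: " ++ show (B.length utf16) ++ " bytes"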


> > As far as the conflation, there are two questions
> > with regard to the encoding choice: encoding/decoding time
> > and space usage.
>
> No, there is a third: using an API that results in robust, readable
> and maintainable code even in the face of changing encoding
> requirements. Unless you have proof that the difference in
> performance between that API and an API with a hard-wired
> encoding is the factor that is causing your particular application
> to fail to meet its requirements, the hard-wired approach
> is guilty of aggravated premature optimization.
>
> So for example, UTF-8 is an important option
> to have in a web toolkit. But if that's the only option, that
> web toolkit shouldn't be considered a general-purpose one
> in my opinion.
>
I'm not talking about API changes here; the topic at hand is the internal
representation of the stream of characters used by the text package. That is
currently UTF-16; I would argue for switching to UTF-8.
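
To make the distinction concrete, here is a minimal sketch of the kind of code
I mean (the toUpper step is just a stand-in for real processing): everything
goes through Data.Text and Data.Text.Encoding, so nothing in it would need to
change if the internal representation switched from UTF-16 to UTF-8.

    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE

    -- Decode at the boundary, work on Text, encode at the boundary.
    -- None of this depends on how Text stores characters internally.
    process :: B.ByteString -> B.ByteString
    process = TE.encodeUtf8 . T.toUpper . TE.decodeUtf8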


> > I don't think *anyone* is asserting that
> > UTF-16 is a common encoding for files anywhere,
> > so by using UTF-16 we are simply incurring an overhead
> > in every case.
>
> Well, to start with, all MS Word documents are in UTF-16.
> There are a few of those around, I think. Most applications -
> in some sense of "most" - store text in UTF-16.
>
> Again, without any data, my intuition tells me that
> most of the text data stored in the world's files are in
> UTF-16. There is currently not much Haskell code
> that reads those formats directly, but I think that will
> be changing as usage of Haskell in the real world
> picks up.
>
I was referring to text files, not binary files with text embedded within
them. While we might use the text package to deal with the data from a Word
doc once in memory, we would almost certainly need to use ByteString (or
perhaps the binary package) to actually parse the file. But at the end of the
day, you're right: there would be an encoding penalty at a certain point, just
not on the entire file.
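
As a sketch of what I mean (the offsets are invented; real Word parsing is of
course far more involved): the container is walked as raw bytes, and only the
embedded text runs pay the decoding cost.

    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE

    -- Hypothetical helper: pull a UTF-16LE text run out of a binary blob.
    -- The offset and length would come from parsing the surrounding
    -- structure with ByteString (or the binary package).
    extractRun :: Int -> Int -> B.ByteString -> T.Text
    extractRun off len = TE.decodeUtf16LE . B.take len . B.drop off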

> > We can't consider a CJK encoding for text,
>
> Not as a default, certainly not as the only option. But
> nice to have as a choice.
>
I think you're missing the point at hand: I don't think *anyone* is opposed to
offering encoders/decoders for the multitude of encodings out there. In fact,
I believe the text-icu package already supports every encoding under
discussion. The question is the internal representation for text, for which a
language-specific encoding is *not* a choice, since it does not cover all
Unicode code points.
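
For example, here is a sketch of reading a GB2312 file into Text via
text-icu's converter interface (this assumes Data.Text.ICU.Convert's
open/toUnicode; check the package docs for the exact signatures). The point is
that the decoding happens at the boundary, and once in memory the text is full
Unicode regardless of the source encoding:

    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import qualified Data.Text.ICU.Convert as ICU

    readGB2312 :: FilePath -> IO T.Text
    readGB2312 path = do
      conv  <- ICU.open "GB2312" Nothing  -- converter name as ICU knows it
      bytes <- B.readFile path
      return (ICU.toUnicode conv bytes)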

Michael

