[Haskell-cafe] Re: String vs ByteString
fireman119 at gmail.com
Wed Aug 18 04:01:14 EDT 2010
More typical Chinese web sites:
www.ifeng.com (web site like nytimes)
dzh.mop.com (community for fun)
www.csdn.net (web site for IT)
www.sohu.com (web site like yahoo)
www.sina.com (web site like yahoo)
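For anyone who wants to repeat the UTF-8 versus UTF-16 size comparison quoted below on these pages, here is a minimal sketch. It assumes the page text has already been fetched and decoded into a string. The function name encoded_sizes and the sample string are made up for illustration, and Python is used only for brevity; this is not necessarily how the quoted numbers were produced.

```python
import gzip


def encoded_sizes(text: str) -> dict:
    """Return byte counts for text in UTF-8 and UTF-16, raw and gzipped."""
    utf8 = text.encode("utf-8")
    # "utf-16-le" is used so that no byte-order mark is added to the count.
    utf16 = text.encode("utf-16-le")
    return {
        "utf8": len(utf8),
        "utf16": len(utf16),
        "utf8_gzip": len(gzip.compress(utf8)),
        "utf16_gzip": len(gzip.compress(utf16)),
    }


# Hypothetical sample: ASCII markup wrapped around CJK text.
sample = "<p>" + "中文" * 10 + "</p>"
sizes = encoded_sizes(sample)
for name, size in sizes.items():
    print(name, size)
```

The more ASCII markup a page carries relative to its CJK text, the more the raw UTF-8 count should gain on the UTF-16 count, which is the effect the quoted discussion is about.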
On Wed, Aug 18, 2010 at 11:40 AM, Michael Snoyman <michael at snoyman.com> wrote:
> Well, I'm not certain if it counts as a typical Chinese website, but here
> are the stats;
> UTF8: 64,198
> UTF16: 113,160
> And just for fun, after gzipping:
> UTF8: 17,708
> UTF16: 19,367
> On Wed, Aug 18, 2010 at 2:59 AM, anderson leo <fireman119 at gmail.com> wrote:
>> Hi Michael, here is a web site: http://zh.wikipedia.org/zh-cn/. It is the
>> Chinese Wikipedia.
>> On Tue, Aug 17, 2010 at 7:00 PM, Michael Snoyman <michael at snoyman.com> wrote:
>>> On Tue, Aug 17, 2010 at 1:50 PM, Yitzchak Gale <gale at sefer.org> wrote:
>>>> Ketil Malde wrote:
>>>> > I haven't benchmarked it, but I'm fairly sure that, if you try to fit
>>>> > a 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of
>>>> > RAM, UTF-16 will be slower than UTF-8...
>>>> I don't think the genome is typical text. And
>>>> I doubt that is true if that text is in a CJK language.
>>>> > I think that *IF* we are aiming for a single, grand, unified text
>>>> > library to Rule Them All, it needs to use UTF-8.
>>>> Given the growth rate of China's economy, if CJK isn't
>>>> already the majority of text being processed in the world,
>>>> it will be soon. I have seen media reports claiming CJK is
>>>> now a majority of text data going over the wire on the web,
>>>> though I haven't seen anything scientific backing up those claims.
>>>> It certainly seems reasonable. I believe Google's measurements
>>>> based on their own web index showing wide adoption of UTF-8
>>>> are very badly skewed due to a strong Western bias.
>>>> In that case, if we have to pick one encoding for Data.Text,
>>>> UTF-16 is likely to be a better choice than UTF-8, especially
>>>> if the cost is fairly low even for the special case of Western
>>>> languages. Also, UTF-16 has become by far the dominant internal
>>>> text format for most software and for most user platforms.
>>>> Except on desktop Linux - and whether we like it or not, Linux
>>>> desktops will remain a tiny minority for the foreseeable future.
>>> I think you are conflating two points here, and ignoring some important
>>> data. Regarding the data: you haven't actually quoted any statistics about
>>> the prevalence of CJK data, but even if the majority of web pages served are
>>> in those three languages, a fairly high percentage of the content will
>>> hate to make up statistics on the spot, especially when I don't have any
>>> numbers from you to compare them with.
>>> As far as the conflation, there are two questions with regard to the
>>> encoding choice: encoding/decoding time and space usage. I don't think
>>> *anyone* is asserting that UTF-16 is a common encoding for files anywhere,
>>> so by using UTF-16 we are simply incurring an overhead in every case. We
>>> can't consider a CJK encoding for text, so its prevalence is irrelevant to
>>> this topic. What *is* relevant is that a very large percentage of web pages
>>> *are*, in fact, standardizing on UTF-8, and that all 7-bit text files are by
>>> default UTF-8.
>>> As far as space usage, you are correct that CJK data will take up more
>>> memory in UTF-8 than UTF-16. The question still remains whether the overall
>>> document size will be larger: I'd be interested in taking a random sampling
>>> of CJK-encoded pages and comparing their UTF-8 and UTF-16 file sizes. I
>>> think simply talking about this in a vacuum, without data, is pointless. If
>>> anyone can recommend a CJK website which would be considered representative
>>> (or a few), I'll do the test myself.