[Haskell-cafe] Re: String vs ByteString

Michael Snoyman michael at snoyman.com
Wed Aug 18 05:32:48 EDT 2010


Alright, here are the results for the first three in the list (please forgive
me for being lazy; I am a Haskell programmer after all):

ifeng.com:
UTF8: 299949
UTF16: 566610

dzh.mop.com:
GBK: 1866
UTF8: 1891
UTF16: 3684

www.csdn.net:
UTF8: 122870
UTF16: 217420

Seems like UTF8 is a consistent winner versus UTF16, and not much of a loser
to the native formats.
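
For anyone who wants to repeat the comparison on other pages, here is a
minimal sketch of the measurement. It is written in Python for brevity and is
not the script that produced the numbers above; the function name
encoded_sizes and the sample string are made up for illustration. It assumes
the page has already been decoded to Unicode text, and it measures UTF-16
without a byte-order mark.

```python
import gzip


def encoded_sizes(text: str) -> dict:
    """Return the byte length of `text` in UTF-8 and UTF-16, raw and gzipped."""
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")  # little-endian, no byte-order mark
    return {
        "UTF8": len(utf8),
        "UTF16": len(utf16),
        "UTF8 gzip": len(gzip.compress(utf8)),
        "UTF16 gzip": len(gzip.compress(utf16)),
    }


# Hypothetical sample: ASCII markup wrapped around a short Chinese phrase.
sample = '<p class="title">中文维基百科</p>'
for name, size in encoded_sizes(sample).items():
    print(f"{name}: {size}")
```

Each ASCII character costs one byte in UTF-8 and two in UTF-16, while each of
these Chinese characters costs three bytes in UTF-8 and two in UTF-16, so the
outcome for a whole page depends on how much of it is markup.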

Michael

On Wed, Aug 18, 2010 at 11:01 AM, anderson leo <fireman119 at gmail.com> wrote:

> More typical Chinese web sites:
>     www.ifeng.com         (web site like nytimes)
>     dzh.mop.com           (community for fun)
>     www.csdn.net          (web site for IT)
>     www.sohu.com        (web site like yahoo)
>     www.sina.com         (web site like yahoo)
>
> -- Andrew
>
>
> On Wed, Aug 18, 2010 at 11:40 AM, Michael Snoyman <michael at snoyman.com> wrote:
>
>> Well, I'm not certain if it counts as a typical Chinese website, but here
>> are the stats:
>>
>> UTF8: 64,198
>> UTF16: 113,160
>>
>> And just for fun, after gzipping:
>>
>> UTF8: 17,708
>> UTF16: 19,367
>>
>>
>> On Wed, Aug 18, 2010 at 2:59 AM, anderson leo <fireman119 at gmail.com> wrote:
>>
>>> Hi Michael, here is a web site: http://zh.wikipedia.org/zh-cn/. It is the
>>> Chinese Wikipedia.
>>>
>>> -Andrew
>>>
>>> On Tue, Aug 17, 2010 at 7:00 PM, Michael Snoyman <michael at snoyman.com> wrote:
>>>
>>>>
>>>>
>>>> On Tue, Aug 17, 2010 at 1:50 PM, Yitzchak Gale <gale at sefer.org> wrote:
>>>>
>>>>> Ketil Malde wrote:
>>>>> > I haven't benchmarked it, but I'm fairly sure that, if you try to fit a
>>>>> > 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of
>>>>> > RAM, UTF-16 will be slower than UTF-8...
>>>>>
>>>>> I don't think the genome is typical text. And
>>>>> I doubt that is true if that text is in a CJK language.
>>>>>
>>>>> > I think that *IF* we are aiming for a single, grand, unified text
>>>>> > library to Rule Them All, it needs to use UTF-8.
>>>>>
>>>>> Given the growth rate of China's economy, if CJK isn't
>>>>> already the majority of text being processed in the world,
>>>>> it will be soon. I have seen media reports claiming CJK is
>>>>> now a majority of text data going over the wire on the web,
>>>>> though I haven't seen anything scientific backing up those claims.
>>>>> It certainly seems reasonable. I believe Google's measurements
>>>>> based on their own web index showing wide adoption of UTF-8
>>>>> are very badly skewed due to a strong Western bias.
>>>>>
>>>>> In that case, if we have to pick one encoding for Data.Text,
>>>>> UTF-16 is likely to be a better choice than UTF-8, especially
>>>>> if the cost is fairly low even for the special case of Western
>>>>> languages. Also, UTF-16 has become by far the dominant internal
>>>>> text format for most software and for most user platforms.
>>>>> Except on desktop Linux - and whether we like it or not, Linux
>>>>> desktops will remain a tiny minority for the foreseeable future.
>>>>>
>>>> I think you are conflating two points here, and ignoring some
>>>> important data. Regarding the data: you haven't actually quoted any
>>>> statistics about the prevalence of CJK data, but even if the majority of web
>>>> pages served are in those three languages, a fairly high percentage of the
>>>> content will *still* be ASCII, due simply to the HTML, CSS and Javascript
>>>> overhead. I'd hate to make up statistics on the spot, especially when I
>>>> don't have any numbers from you to compare them with.
>>>>
>>>> As for the conflation, there are two questions with regard to the
>>>> encoding choice: encoding/decoding time and space usage. I don't think
>>>> *anyone* is asserting that UTF-16 is a common encoding for files anywhere,
>>>> so by using UTF-16 we are simply incurring an overhead in every case. We
>>>> can't consider a CJK encoding for text, so its prevalence is irrelevant to
>>>> this topic. What *is* relevant is that a very large percentage of web pages
>>>> *are*, in fact, standardizing on UTF-8, and that all 7-bit ASCII text files
>>>> are already valid UTF-8.
>>>>
>>>> As for space usage, you are correct that CJK data will take up more
>>>> memory in UTF-8 than UTF-16. The question still remains whether the overall
>>>> document size will be larger: I'd be interested in taking a random sampling
>>>> of CJK-encoded pages and comparing their UTF-8 and UTF-16 file sizes. I
>>>> think simply talking about this in a vacuum, without data, is pointless. If
>>>> anyone can recommend a CJK website which would be considered representative
>>>> (or a few), I'll do the test myself.
>>>>
>>>> Michael
>>>>