[Haskell-cafe] Re: String vs ByteString

anderson leo fireman119 at gmail.com
Wed Aug 18 04:01:14 EDT 2010


More typical Chinese web sites:
    www.ifeng.com         (web site like nytimes)
    dzh.mop.com           (community for fun)
    www.csdn.net          (web site for IT)
    www.sohu.com        (web site like yahoo)
    www.sina.com         (web site like yahoo)

-- Andrew

On Wed, Aug 18, 2010 at 11:40 AM, Michael Snoyman <michael at snoyman.com> wrote:

> Well, I'm not certain if it counts as a typical Chinese website, but here
> are the stats (sizes in bytes):
>
> UTF8: 64,198
> UTF16: 113,160
>
> And just for fun, after gzipping:
>
> UTF8: 17,708
> UTF16: 19,367
>
>
> On Wed, Aug 18, 2010 at 2:59 AM, anderson leo <fireman119 at gmail.com> wrote:
>
>> Hi Michael, here is a web site: http://zh.wikipedia.org/zh-cn/. It is the
>> Chinese-language Wikipedia.
>>
>> -Andrew
>>
>> On Tue, Aug 17, 2010 at 7:00 PM, Michael Snoyman <michael at snoyman.com> wrote:
>>
>>>
>>>
>>> On Tue, Aug 17, 2010 at 1:50 PM, Yitzchak Gale <gale at sefer.org> wrote:
>>>
>>>> Ketil Malde wrote:
>>>> > I haven't benchmarked it, but I'm fairly sure that, if you try to fit a
>>>> > 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of
>>>> > RAM, UTF-16 will be slower than UTF-8...
>>>>
>>>> I don't think the genome is typical text. And
>>>> I doubt that is true if that text is in a CJK language.
>>>>
>>>> > I think that *IF* we are aiming for a single, grand, unified text
>>>> > library to Rule Them All, it needs to use UTF-8.
>>>>
>>>> Given the growth rate of China's economy, if CJK isn't
>>>> already the majority of text being processed in the world,
>>>> it will be soon. I have seen media reports claiming CJK is
>>>> now a majority of text data going over the wire on the web,
>>>> though I haven't seen anything scientific backing up those claims.
>>>> It certainly seems reasonable. I believe Google's measurements
>>>> based on their own web index showing wide adoption of UTF-8
>>>> are very badly skewed due to a strong Western bias.
>>>>
>>>> In that case, if we have to pick one encoding for Data.Text,
>>>> UTF-16 is likely to be a better choice than UTF-8, especially
>>>> if the cost is fairly low even for the special case of Western
>>>> languages. Also, UTF-16 has become by far the dominant internal
>>>> text format for most software and for most user platforms.
>>>> Except on desktop Linux - and whether we like it or not, Linux
>>>> desktops will remain a tiny minority for the foreseeable future.
>>>>
>>> I think you are conflating two points here, and ignoring some important
>>> data. Regarding the data: you haven't actually quoted any statistics about
>>> the prevalence of CJK data, but even if the majority of web pages served are
>>> in those three languages, a fairly high percentage of the content will
>>> *still* be ASCII, due simply to the HTML, CSS and Javascript overhead. I'd
>>> hate to make up statistics on the spot, especially when I don't have any
>>> numbers from you to compare them with.
>>>
>>> As far as the conflation, there are two questions with regard to the
>>> encoding choice: encoding/decoding time and space usage. I don't think
>>> *anyone* is asserting that UTF-16 is a common encoding for files anywhere,
>>> so by using UTF-16 we are simply incurring an overhead in every case. We
>>> can't consider a CJK encoding for text, so its prevalence is irrelevant to
>>> this topic. What *is* relevant is that a very large percentage of web pages
>>> *are*, in fact, standardizing on UTF-8, and that all 7-bit text files are by
>>> default UTF-8.
>>>
>>> As far as space usage, you are correct that CJK data will take up more
>>> memory in UTF-8 than UTF-16. The question still remains whether the overall
>>> document size will be larger: I'd be interested in taking a random sampling
>>> of CJK-encoded pages and comparing their UTF-8 and UTF-16 file sizes. I
>>> think simply talking about this in the absence of data is pointless. If
>>> anyone can recommend a CJK website which would be considered representative
>>> (or a few), I'll do the test myself.
>>>
>>> Michael
>>>
>>> _______________________________________________
>>> Haskell-Cafe mailing list
>>> Haskell-Cafe at haskell.org
>>> http://www.haskell.org/mailman/listinfo/haskell-cafe
>>>
>>>
>>
>
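
For anyone who wants to repeat the size comparison quoted above on their
own data, here is a minimal sketch in Haskell. It assumes the text and
bytestring packages; the helper names utf8Size and utf16Size and the sample
strings are made up for illustration, and this is not necessarily how the
numbers in the thread were produced. It only counts encoded bytes and does
no gzipping.

```haskell
module Main where

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Number of bytes the text occupies when encoded as UTF-8.
--   (Hypothetical helper name, for illustration only.)
utf8Size :: T.Text -> Int
utf8Size = B.length . TE.encodeUtf8

-- | Number of bytes the text occupies when encoded as UTF-16
--   (little-endian, no byte order mark).
utf16Size :: T.Text -> Int
utf16Size = B.length . TE.encodeUtf16LE

main :: IO ()
main = do
  -- Made-up sample data: a scrap of ASCII markup and a short CJK string.
  let markup = T.pack "<div class=\"title\"></div>"
      cjk    = T.pack "中文维基百科"
      page   = T.concat [markup, cjk]
  mapM_ report [("markup", markup), ("cjk", cjk), ("page", page)]
  where
    report (label, t) =
      putStrLn (label ++ ": UTF-8 " ++ show (utf8Size t)
                      ++ " bytes, UTF-16 " ++ show (utf16Size t) ++ " bytes")
```

As a rule of thumb, an ASCII character costs 1 byte in UTF-8 and 2 in
UTF-16, while a typical CJK character (one in the Basic Multilingual Plane)
costs 3 bytes in UTF-8 and 2 in UTF-16. So a page with a ASCII characters
and c such CJK characters takes a + 3c bytes in UTF-8 and 2a + 2c bytes in
UTF-16, and UTF-8 comes out smaller whenever a > c.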