[Haskell-cafe] Re: String vs ByteString
pumpkingod at gmail.com
Tue Aug 17 09:21:32 EDT 2010
Sounds to me like we need a lazy Data.Text variation that allows UTF-8 and
UTF-16 "segments" in its list of strict text elements :) Then big chunks of
western text will be encoded efficiently, and same with CJK! Not sure what
to do about strict Data.Text though :)
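
To make the idea concrete, here is a minimal sketch of what such a type might look like. Everything named here (Segment, MixedText, encodeSegment and friends) is hypothetical and not part of Data.Text; it only assumes the existing encode/decode functions in Data.Text.Encoding, and it picks an encoding per chunk simply by comparing byte lengths.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical: one strict chunk, tagged with the encoding of its bytes.
data Segment
  = Utf8Seg  B.ByteString
  | Utf16Seg B.ByteString

-- Hypothetical lazy text: an ordinary (lazy) list of strict segments.
type MixedText = [Segment]

-- Encode a chunk both ways and keep whichever is smaller.
-- (Wasteful, but enough to illustrate the idea.)
encodeSegment :: T.Text -> Segment
encodeSegment t
  | B.length u8 <= B.length u16 = Utf8Seg u8
  | otherwise                   = Utf16Seg u16
  where
    u8  = TE.encodeUtf8 t
    u16 = TE.encodeUtf16LE t

decodeSegment :: Segment -> T.Text
decodeSegment (Utf8Seg b)  = TE.decodeUtf8 b
decodeSegment (Utf16Seg b) = TE.decodeUtf16LE b

-- Name of the encoding chosen for a segment, for inspection.
segmentTag :: Segment -> String
segmentTag (Utf8Seg _)  = "UTF-8"
segmentTag (Utf16Seg _) = "UTF-16"

fromChunks :: [T.Text] -> MixedText
fromChunks = map encodeSegment

toText :: MixedText -> T.Text
toText = T.concat . map decodeSegment

main :: IO ()
main = do
  let mixed = fromChunks [T.pack "plain western text", T.pack "\26085\26412\35486"]
  -- Show which encoding each chunk ended up in.
  mapM_ (putStrLn . segmentTag) mixed
```

The second chunk is three CJK characters written as escapes, so it should land in a UTF-16 segment while the ASCII chunk stays in UTF-8. Indexing and slicing across segments of different encodings is the hard part and is not addressed here.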
On Tue, Aug 17, 2010 at 1:40 PM, Ketil Malde <ketil at malde.org> wrote:
> Michael Snoyman <michael at snoyman.com> writes:
> > As far as space usage, you are correct that CJK data will take up more
> > memory in UTF-8 than UTF-16.
> With the danger of sounding ... alphabetist? as well as belaboring a
> point I agree is irrelevant (the storage format):
> I'd point out that it seems at least as unfair to optimize for CJK at
> the cost of Western languages. UTF-16 uses two bytes for (most) CJK
> ideograms, and (all, I think) characters in Western and other phonetic
> scripts. UTF-8 uses one to two bytes for a lot of Western alphabets,
> but three for CJK ideograms.
> Now, CJK has about 20K ideograms, which is almost 15 bits per ideogram,
> while an ASCII letter is about six bits. Thus, the information density
> of CJK and ASCII is about equal for UTF-8, 5/8 vs 6/8 - compared to
> 15/16 vs 6/16 for UTF-16. In other words a given document translated
> between Chinese and English should occupy roughly the same space in
> UTF-8, but be 2.5 times longer in English for UTF-16.
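
The estimate quoted above can be checked with a few lines of Haskell. This is only a rough sketch under the quoted message's own assumptions (about 20,000 ideograms, about six bits per ASCII letter, three UTF-8 bytes and two UTF-16 bytes per ideogram); the function names are made up for illustration.

```haskell
-- Bits needed to pick one symbol out of an alphabet of the given size.
bitsPerSymbol :: Double -> Double
bitsPerSymbol alphabetSize = logBase 2 alphabetSize

-- Information bits carried per storage bit.
density :: Double -> Double -> Double
density infoBits storageBytes = infoBits / (8 * storageBytes)

main :: IO ()
main = do
  let cjkBits   = bitsPerSymbol 20000  -- a little over 14, "almost 15"
      asciiBits = 6                    -- assumption taken from the quoted text
  putStrLn $ "CJK   bits/symbol:  " ++ show cjkBits
  putStrLn $ "CJK   in UTF-8:     " ++ show (density cjkBits 3)
  putStrLn $ "ASCII in UTF-8:     " ++ show (density asciiBits 1)
  putStrLn $ "CJK   in UTF-16:    " ++ show (density cjkBits 2)
  putStrLn $ "ASCII in UTF-16:    " ++ show (density asciiBits 2)
  -- How much longer the English version is than the Chinese one in UTF-16.
  putStrLn $ "UTF-16 size ratio:  "
          ++ show (density cjkBits 2 / density asciiBits 2)
```

With the rounded figure of 15 bits the UTF-16 ratio is exactly 15/6 = 2.5; using the unrounded logarithm it comes out slightly lower, which does not change the conclusion.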
> If I haven't seen further, it is by standing in the footprints of giants