[Haskell-cafe] Re: String vs ByteString

Daniel Peebles pumpkingod at gmail.com
Tue Aug 17 09:21:32 EDT 2010


Sounds to me like we need a lazy Data.Text variation that allows both UTF-8 and
UTF-16 "segments" in its list of strict text elements :) Then big chunks of
Western text would be encoded efficiently, and the same goes for CJK! Not sure
what to do about strict Data.Text though :)
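
To make the idea concrete, here is a rough sketch of what I have in mind. It
is only an illustration: Chunk, MixedText, encodeChunk and friends are names
I made up, nothing like them exists in the text package, and a real version
would need a proper chunking policy instead of taking the pieces as given.
It only uses the encode/decode functions from Data.Text.Encoding.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | One strict segment, tagged with the encoding of its bytes.
-- Hypothetical type, not part of the text package.
data Chunk
  = Utf8Chunk B.ByteString
  | Utf16Chunk B.ByteString
  deriving (Eq, Show)

-- | The "lazy" text: an ordinary lazy list of strict chunks.
type MixedText = [Chunk]

-- | Encode one piece of text with whichever encoding is smaller
-- (ties go to UTF-8).
encodeChunk :: T.Text -> Chunk
encodeChunk t
  | B.length u8 <= B.length u16 = Utf8Chunk u8
  | otherwise                   = Utf16Chunk u16
  where
    u8  = TE.encodeUtf8 t
    u16 = TE.encodeUtf16LE t

-- | Decode a chunk back to Text. The decoders throw on invalid input,
-- which cannot happen for chunks built by encodeChunk.
decodeChunk :: Chunk -> T.Text
decodeChunk (Utf8Chunk b)  = TE.decodeUtf8 b
decodeChunk (Utf16Chunk b) = TE.decodeUtf16LE b

-- | Build a MixedText from pieces that have already been split up.
fromPieces :: [T.Text] -> MixedText
fromPieces = map encodeChunk

-- | Flatten back to an ordinary strict Text.
toText :: MixedText -> T.Text
toText = T.concat . map decodeChunk

-- | Storage used by a chunk, in bytes.
chunkBytes :: Chunk -> Int
chunkBytes (Utf8Chunk b)  = B.length b
chunkBytes (Utf16Chunk b) = B.length b

main :: IO ()
main = do
  -- "\28450\23383" is two CJK ideographs (U+6F22, U+5B57).
  let pieces = [T.pack "plain ASCII text, ", T.pack "\28450\23383"]
      mixed  = fromPieces pieces
  print (map chunkBytes mixed)
  print (toText mixed == T.concat pieces)
```

The ASCII piece ends up as a UTF-8 chunk and the CJK piece as a UTF-16 chunk,
since each piece just takes the shorter of its two encodings.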

On Tue, Aug 17, 2010 at 1:40 PM, Ketil Malde <ketil at malde.org> wrote:

> Michael Snoyman <michael at snoyman.com> writes:
>
> > As far as space usage, you are correct that CJK data will take up more
> > memory in UTF-8 than UTF-16.
>
> At the risk of sounding ... alphabetist? as well as belaboring a
> point I agree is irrelevant (the storage format):
>
> I'd point out that it seems at least as unfair to optimize for CJK at
> the cost of Western languages.  UTF-16 uses two bytes for (most) CJK
> ideograms, and (all, I think) characters in Western and other phonetic
> scripts.  UTF-8 uses one to two bytes for a lot of Western alphabets,
> but three for CJK ideograms.
>
> Now, CJK has about 20K ideograms, which is almost 15 bits per ideogram,
> while an ASCII letter is about six bits.  Thus, the information density
> of CJK and ASCII is about equal for UTF-8, 15/24 = 5/8 vs 6/8, compared
> to 15/16 vs 6/16 for UTF-16.  In other words, a given document translated
> between Chinese and English should occupy roughly the same space in
> UTF-8, but be 2.5 times longer in English for UTF-16.
>
> -k
> --
> If I haven't seen further, it is by standing in the footprints of giants
>
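
For what it's worth, Ketil's arithmetic above is easy to check. A small
sketch, using his rounded figures (15 bits per ideogram, 6 bits per ASCII
letter) and assuming 3 bytes per ideogram in UTF-8 and 2 in UTF-16; the
names density, cjkBits and so on are mine, purely for illustration:

```haskell
-- | Information carried per symbol (bits) divided by the storage
-- used per symbol (bits).
density :: Double -> Double -> Double
density infoBits storageBits = infoBits / storageBits

-- Rounded figures from the quoted mail; logBase 2 20000 is a bit
-- over 14, so 15 is a generous rounding.
cjkBits, asciiBits :: Double
cjkBits   = 15
asciiBits = 6

-- UTF-8: 3 bytes per ideogram, 1 byte per ASCII letter.
utf8Cjk, utf8Ascii :: Double
utf8Cjk   = density cjkBits 24
utf8Ascii = density asciiBits 8

-- UTF-16: 2 bytes for either.
utf16Cjk, utf16Ascii :: Double
utf16Cjk   = density cjkBits 16
utf16Ascii = density asciiBits 16

main :: IO ()
main = do
  print (logBase 2 20000 :: Double)  -- bits needed to pick 1 of 20K ideograms
  print (utf8Cjk, utf8Ascii)         -- densities under UTF-8
  print (utf16Cjk, utf16Ascii)       -- densities under UTF-16
  print (utf16Cjk / utf16Ascii)      -- how much longer English is in UTF-16
```

With those inputs the UTF-8 densities come out as 5/8 and 6/8, and the UTF-16
ratio as 2.5, matching the figures in the quoted mail.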